bonsai
Advanced · ~12 min · requires API key

Agent trajectory critique

Grade the agent's path, not just its answer.

This lab uses pre-canned agent traces with seeded failure modes: redundant tool calls, wrong-tool selection, and hallucinated observations. A trajectory judge scores each trace on tool selection, step efficiency, and grounding-in-trace, citing specific step indices as evidence.
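One way to picture the judge's output is a small record per dimension: a score plus the step indices that justify it. The sketch below is an assumption about the shape of such a record, not the lab's actual schema. The names `DimensionScore` and `TrajectoryCritique` and the 1 to 5 scale are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DimensionScore:
    """One judged dimension (hypothetical schema, not the lab's own)."""
    score: int                 # assumed 1-5 scale, higher is better
    evidence_steps: List[int]  # 1-based step indices that justify the score
    rationale: str = ""


@dataclass
class TrajectoryCritique:
    """The three dimensions named in the lab description."""
    tool_selection: DimensionScore
    step_efficiency: DimensionScore
    grounding_in_trace: DimensionScore

    def flagged_steps(self) -> List[int]:
        """All step indices cited as evidence, sorted and de-duplicated."""
        steps = set()
        for dim in (self.tool_selection, self.step_efficiency, self.grounding_in_trace):
            steps.update(dim.evidence_steps)
        return sorted(steps)
```

Keeping evidence as step indices rather than free text makes a critique checkable: a reader can jump to the cited step and confirm or reject the judge's claim.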

Learning objectives
  • Evaluate agents at the trajectory level, not only the final answer.
  • Distinguish failures of tool selection, efficiency, and grounding.
  • See how often a correct answer hides a broken trajectory (and vice versa).

1. Pick a scenario

Each scenario is a fixed trace with a known issue. The seeded failure is hidden from the judge; your job is to see whether the judge catches it.
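Hiding the seeded failure amounts to stripping the answer key before the trace reaches the judge. The following is a minimal sketch under assumptions: the scenario is a plain dictionary, and the `seeded_failure` key and `build_judge_input` helper are hypothetical names, not the lab's actual data model.

```python
import json


def build_judge_input(scenario: dict) -> str:
    """Serialize a scenario for the judge, dropping the hidden answer key."""
    visible = {k: v for k, v in scenario.items() if k != "seeded_failure"}
    return json.dumps(visible, indent=2)


scenario = {
    "task": "Refund order #A-4471",
    "steps": [{"tool": "lookup_order", "args": {"order_id": "A-4471"}}],
    "seeded_failure": "<hidden label>",  # hypothetical field, never shown to the judge
}

judge_input = build_judge_input(scenario)
```

Because the judge sees only the task and the steps, any issue it reports has to come from the trace itself, which is what makes comparing its critique against the seeded label meaningful.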

2. Inspect the trace

Task
A customer asks: 'I want a refund for order #A-4471, it arrived broken.' Look up the order, check the refund eligibility, and confirm the refund.
step 1
lookup_order
args: {"order_id": "A-4471"}
obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}
step 2
lookup_order
args: {"order_id": "A-4471"}
obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}
step 3
check_refund_eligibility
args: {"order_id": "A-4471", "reason": "damaged"}
obs: {"eligible":true,"window_days_remaining":24,"policy":"damaged_on_arrival"}
step 4
lookup_order
args: {"order_id": "A-4471"}
obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}
step 5
issue_refund
args: {"order_id": "A-4471", "amount": 48.0, "reason": "damaged_on_arrival"}
obs: {"refund_id":"R-90213","status":"processed","amount":48.0}
Final answer
I've issued a full $48.00 refund for order A-4471 (refund ID R-90213). You should see it back on your original payment method within 5–7 business days. Sorry it arrived damaged!
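Before reading a judge's critique, it can help to check the trace mechanically. The sketch below encodes the five tool calls above and flags any call that repeats an earlier one with the same tool and identical arguments. It is a simple heuristic written for illustration, not the lab's judge; `find_redundant_steps` and the efficiency ratio are assumptions.

```python
import json

# The five tool calls from the trace above, as (tool, args) pairs.
TRACE = [
    ("lookup_order", {"order_id": "A-4471"}),
    ("lookup_order", {"order_id": "A-4471"}),
    ("check_refund_eligibility", {"order_id": "A-4471", "reason": "damaged"}),
    ("lookup_order", {"order_id": "A-4471"}),
    ("issue_refund", {"order_id": "A-4471", "amount": 48.0, "reason": "damaged_on_arrival"}),
]


def find_redundant_steps(trace):
    """Map each repeated call to the step it duplicates (1-based indices).

    Two calls count as duplicates when the tool name and the arguments match exactly.
    """
    first_seen = {}
    redundant = {}
    for i, (tool, args) in enumerate(trace, start=1):
        key = (tool, json.dumps(args, sort_keys=True))
        if key in first_seen:
            redundant[i] = first_seen[key]
        else:
            first_seen[key] = i
    return redundant


redundant = find_redundant_steps(TRACE)
# Assumed metric: share of calls that were not exact repeats.
efficiency = (len(TRACE) - len(redundant)) / len(TRACE)

print(redundant)   # {2: 1, 4: 1}
print(efficiency)  # 0.6
```

Steps 2 and 4 repeat step 1 and return the same observation, so only three of the five calls do distinct work. Exact-match detection is deliberately narrow: a repeated call can be legitimate when state may have changed in between, for example re-reading an order after a write, so a flag from this check is a prompt to look at the step, not a verdict. Note also that the final answer is correct here, which is the situation the lab targets: a right answer reached by an inefficient path.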

3. Critique the trajectory