Agent trajectory critique

Grade the agent's path, not just its answer.

This lab uses pre-canned agent traces with seeded failure modes: redundant tool calls, wrong-tool selection, and hallucinated observations. A trajectory judge scores tool selection, step efficiency, and grounding-in-trace, citing specific step indices as evidence.
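One way to picture the judge's output is as a small record: a score per dimension plus findings that cite step indices. The sketch below is illustrative only. The names Finding, Verdict, and evidence_in_range, and the idea of a numeric score per dimension, are assumptions and not part of this lab's actual judge.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    dimension: str           # "tool_selection", "step_efficiency", or "grounding"
    step_indices: list[int]  # 1-based step numbers in the trace that support the finding
    note: str                # short free-text explanation


@dataclass
class Verdict:
    scores: dict[str, int]   # hypothetical numeric score per dimension
    findings: list[Finding] = field(default_factory=list)

    def cited_steps(self) -> set[int]:
        """All step indices the judge pointed at, across every finding."""
        return {i for f in self.findings for i in f.step_indices}


def evidence_in_range(verdict: Verdict, n_steps: int) -> bool:
    """True if every cited step index actually exists in the trace."""
    return all(1 <= i <= n_steps for i in verdict.cited_steps())
```

Checking that cited indices fall inside the trace is a cheap guard against a judge that invents evidence of its own.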

Learning objectives

· Evaluate agents at the trajectory level, not only the final answer.
· Distinguish failures of tool selection, efficiency, and grounding.
· See how often a correct answer hides a broken trajectory (and vice versa).

1. Pick a scenario

Each scenario is a fixed trace with a known, seeded failure. That failure is hidden from the judge; your job is to see whether the judge catches it.
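The separation between what the scenario knows and what the judge sees could be sketched as follows. Scenario, judge_view, judge_caught, and the label strings are hypothetical names chosen for illustration; the lab's real data model is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    task: str
    steps: tuple          # the fixed trace: one entry per tool call
    final_answer: str
    seeded_failure: str   # hypothetical label, e.g. "redundant_tool_calls"


def judge_view(scenario: Scenario) -> dict:
    """Everything the judge is allowed to see; the seeded label is left out."""
    return {
        "task": scenario.task,
        "steps": list(scenario.steps),
        "final_answer": scenario.final_answer,
    }


def judge_caught(scenario: Scenario, flagged: set) -> bool:
    """Compare the judge's flagged failure labels with the hidden seeded one."""
    return scenario.seeded_failure in flagged
```

Keeping the label out of judge_view means a match reflects what the judge found in the trace, not what it was told.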

2. Inspect the trace

Task

A customer asks: 'I want a refund for order #A-4471, it arrived broken.' Look up the order, check the refund eligibility, and confirm the refund.

step 1

lookup_order

args: {"order_id": "A-4471"}

obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}

step 2

lookup_order

args: {"order_id": "A-4471"}

obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}

step 3

check_refund_eligibility

args: {"order_id": "A-4471", "reason": "damaged"}

obs: {"eligible":true,"window_days_remaining":24,"policy":"damaged_on_arrival"}

step 4

lookup_order

args: {"order_id": "A-4471"}

obs: {"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}

step 5

issue_refund

args: {"order_id": "A-4471", "amount": 48.0, "reason": "damaged_on_arrival"}

obs: {"refund_id":"R-90213","status":"processed","amount":48.0}
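A step-efficiency check on the five steps above can start with something as simple as looking for repeated (tool, args) pairs. This is a minimal sketch, not the lab's judge: the function name redundant_steps and the TRACE encoding are assumptions, and a repeated call is only a candidate for redundancy, since state could legitimately change between calls (here the repeated observations are identical).

```python
import json

# Tool calls copied from the trace above, in order (step 1 is index 1).
TRACE = [
    ("lookup_order", {"order_id": "A-4471"}),
    ("lookup_order", {"order_id": "A-4471"}),
    ("check_refund_eligibility", {"order_id": "A-4471", "reason": "damaged"}),
    ("lookup_order", {"order_id": "A-4471"}),
    ("issue_refund", {"order_id": "A-4471", "amount": 48.0, "reason": "damaged_on_arrival"}),
]


def redundant_steps(trace):
    """Return (step, first_seen_step) pairs for calls that repeat an earlier tool+args."""
    first_seen = {}
    repeats = []
    for idx, (tool, args) in enumerate(trace, start=1):
        # Serialize args with sorted keys so key order does not affect matching.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in first_seen:
            repeats.append((idx, first_seen[key]))
        else:
            first_seen[key] = idx
    return repeats


print(redundant_steps(TRACE))  # [(2, 1), (4, 1)]
```

The pairs returned are exactly the kind of step-index evidence a trajectory judge is expected to cite.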

Final answer

I've issued a full $48.00 refund for order A-4471 (refund ID R-90213). You should see it back on your original payment method within 5–7 business days. Sorry it arrived damaged!
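Grounding-in-trace asks whether the concrete claims in the final answer are backed by some observation. The sketch below is a deliberately naive version that checks only ID-like tokens and dollar amounts. The function names and the regular expressions are assumptions made for illustration, not the lab's method.

```python
import json
import re

# Distinct observations from the trace above (steps 2 and 4 repeat step 1).
OBSERVATIONS = [
    '{"order_id":"A-4471","customer":"jess@example.com","total":48.0,"status":"delivered","delivered_at":"2026-05-04"}',
    '{"eligible":true,"window_days_remaining":24,"policy":"damaged_on_arrival"}',
    '{"refund_id":"R-90213","status":"processed","amount":48.0}',
]

FINAL_ANSWER = (
    "I've issued a full $48.00 refund for order A-4471 (refund ID R-90213). "
    "You should see it back on your original payment method within 5–7 business days. "
    "Sorry it arrived damaged!"
)


def observed_values(observations):
    """Collect every top-level value that appears in any observation."""
    values = set()
    for raw in observations:
        values.update(json.loads(raw).values())
    return values


def ungrounded_claims(answer, observations):
    """Return IDs and dollar amounts in the answer that no observation contains."""
    seen = observed_values(observations)
    missing = []
    for ident in re.findall(r"\b[A-Z]-\d+\b", answer):           # e.g. A-4471, R-90213
        if ident not in seen:
            missing.append(ident)
    for amount in re.findall(r"\$(\d+(?:\.\d+)?)", answer):      # e.g. $48.00
        if float(amount) not in seen:
            missing.append("$" + amount)
    return missing


print(ungrounded_claims(FINAL_ANSWER, OBSERVATIONS))  # []
```

The IDs and the amount are all backed by observations, so the list comes back empty. The sketch does not cover free-text claims such as the "5–7 business days" timeline, which does not appear in any observation above; whether that counts as a grounding failure is a call for the judge's rubric.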

3. Critique the trajectory