Agent Evaluation Harness
Reproducible trajectories in containerized environments.
Problem
Tool-using agents take long, branching trajectories that touch real APIs. We need an environment where every eval run is deterministic enough to attribute regressions to model/prompt changes, and rich enough to surface tool-specific failures.
Goals
- ✓ Snapshot-restore environments per case for full determinism.
- ✓ Trace every tool call with inputs/outputs/timings.
- ✓ Score plan quality, tool-selection accuracy, step efficiency, and final-task success.
- ✓ Support both mocked-tool and live-tool runs.
Non-goals
- Multi-agent orchestration eval — covered separately.
Architecture
Walkthrough
1. Snapshot the environment
Each task starts from a sealed container snapshot: filesystem in a known state, mock APIs primed with deterministic responses, browser at a known URL. Restoring between runs eliminates 'works on my run' flakiness.
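The snapshot-restore discipline can be sketched in miniature. This is an illustrative in-memory model, not the harness's actual implementation: the `EnvSnapshot` class is a hypothetical name, and a plain dict stands in for the container filesystem and primed mock-API fixtures.

```python
from copy import deepcopy

class EnvSnapshot:
    """Sealed starting state for one eval case (in-memory sketch).

    In a real harness this would wrap a container image plus primed mock
    APIs; here a dict stands in for filesystem + API fixtures.
    """

    def __init__(self, state):
        self._sealed = deepcopy(state)  # capture once; never mutated afterwards

    def restore(self):
        # Every run starts from an identical copy of the sealed state.
        return deepcopy(self._sealed)

snap = EnvSnapshot({"fs": {"/app/config.json": "{}"},
                    "mock_api": {"GET /user": {"id": 1}}})
run1 = snap.restore()
run1["fs"]["/tmp/scratch"] = "agent wrote this"  # a run mutates only its copy
run2 = snap.restore()
assert "/tmp/scratch" not in run2["fs"]  # the next run sees the sealed state
```

The key property is that runs can never observe each other's mutations, which is what makes a regression attributable to the model or prompt rather than to leftover state.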
2. Run the agent with tracing
The agent's thought, tool selection, args, and observations are logged at every step with monotonic timestamps. The trace is the unit of evaluation — outcomes alone don't tell you why something worked or failed.
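A minimal trace record might look like the following sketch. `TraceStep` and `Trace` are hypothetical names; the point is that each step captures thought, tool, args, observation, and monotonic start/end timestamps in one unit.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    thought: str
    tool: str
    args: dict
    observation: str = ""
    t_start: float = field(default_factory=time.monotonic)
    t_end: float = 0.0

class Trace:
    """Append-only log of steps; the trace, not the outcome, is what we score."""

    def __init__(self):
        self.steps = []

    def record(self, thought, tool, args, run_tool):
        step = TraceStep(thought, tool, args)
        step.observation = run_tool(**args)   # execute the tool call
        step.t_end = time.monotonic()          # timing bracketed per step
        self.steps.append(step)
        return step.observation
```

Monotonic timestamps (rather than wall-clock) keep step durations meaningful even if the system clock is adjusted mid-run.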
3. Mock vs. live
Mocked tools (recorded I/O) make runs cheap and deterministic — use these on every PR. Live runs (real APIs, rate-limited) catch integration drift — schedule nightly. When mocked and live diverge, the underlying tool changed; auto-file an issue.
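A recorded-I/O mock can be as simple as a lookup keyed on canonicalized arguments. This is a sketch under assumed names (`RecordedTool` is hypothetical); a real harness would record live responses to build the table and raise loudly on any unrecorded call.

```python
import hashlib
import json

class RecordedTool:
    """Replays recorded responses keyed by a hash of canonicalized args."""

    def __init__(self, name, recordings):
        self.name = name
        self._recordings = recordings  # key -> recorded response

    @staticmethod
    def _key(args):
        # sort_keys makes the key stable regardless of arg ordering
        return hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()

    def __call__(self, **args):
        key = self._key(args)
        if key not in self._recordings:
            raise KeyError(f"{self.name}: no recording for args {args!r}")
        return self._recordings[key]

# Prime from a live capture, then replay deterministically on every PR run.
recs = {RecordedTool._key({"city": "Oslo"}): {"temp_c": 4}}
weather = RecordedTool("get_weather", recs)
assert weather(city="Oslo") == {"temp_c": 4}
```

Failing hard on unrecorded args is deliberate: a missing recording on a PR run is itself a signal that the agent's call pattern changed.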
4. Trajectory scoring
The judge scores: tool-selection accuracy (per step, vs. allowed tools), parameter validity, step efficiency (minimum steps needed / steps used), recovery (handled tool errors gracefully?), and termination (stopped at success vs. looped). Final-task success is graded with a task-specific verifier (file diff, DOM check, API state check).
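The per-trajectory scores above can be computed mechanically from the trace. A sketch, assuming each step has been reduced to a small dict (the field names here are illustrative, not the harness's schema):

```python
def score_trajectory(steps, allowed_tools, optimal_steps):
    """steps: list of dicts like
    {"tool": str, "args_valid": bool, "error": bool, "recovered": bool}."""
    n = len(steps)
    tool_acc = sum(s["tool"] in allowed_tools for s in steps) / n
    param_validity = sum(s["args_valid"] for s in steps) / n
    efficiency = min(1.0, optimal_steps / n)  # capped: can't beat the optimum
    errors = [s for s in steps if s["error"]]
    recovery = (sum(s["recovered"] for s in errors) / len(errors)) if errors else 1.0
    return {
        "tool_accuracy": tool_acc,
        "param_validity": param_validity,
        "efficiency": efficiency,
        "recovery": recovery,
    }
```

Recovery defaults to 1.0 when no tool errors occurred, so error-free trajectories aren't penalized for having nothing to recover from.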
5. Failure-mode taxonomy
Failures are auto-classified: loop trap, goal drift, tool hallucination, premature commitment, cascading error. Track counts per release; a spike in 'goal drift' after a prompt change is a precise, actionable signal.
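Three of these classes are cheap to detect heuristically from the trace alone. A sketch with illustrative thresholds and a hypothetical step schema (a real classifier would tune these and fall back to an LLM judge for the ambiguous classes):

```python
from collections import Counter

def classify_failure(steps, goal_tools):
    """Heuristic first-pass classifier; thresholds are illustrative.

    goal_tools: {"registry": set of tools that exist,
                 "relevant": set of tools that advance this task's goal}
    """
    calls = [(s["tool"], s["args_key"]) for s in steps]
    # Loop trap: the same tool+args pair repeated several times.
    if any(count >= 3 for count in Counter(calls).values()):
        return "loop_trap"
    # Tool hallucination: calling a tool that isn't in the registry.
    if any(s["tool"] not in goal_tools["registry"] for s in steps):
        return "tool_hallucination"
    # Goal drift: the back half of the run stops touching goal-relevant tools.
    tail = steps[len(steps) // 2:]
    if tail and not any(s["tool"] in goal_tools["relevant"] for s in tail):
        return "goal_drift"
    return "unclassified"
```

Checking hallucination before drift matters: a run that both hallucinates and drifts is more usefully filed under the tool bug.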
Tradeoffs
Metrics to track
- Task success rate per task category
- Average steps per task; efficiency = optimal_steps / actual_steps
- Tool-selection accuracy
- Failure-mode distribution per release
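The metrics above roll up per release from individual run results. A minimal aggregation sketch, assuming one result dict per completed task (field names are illustrative):

```python
from collections import Counter, defaultdict

def release_metrics(results):
    """results: list of {"category": str, "success": bool,
    "steps": int, "optimal_steps": int, "failure_mode": str | None}."""
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r)
    # Success rate per task category.
    success = {c: sum(r["success"] for r in rs) / len(rs) for c, rs in by_cat.items()}
    # Efficiency = optimal_steps / actual_steps, averaged across all tasks.
    efficiency = sum(r["optimal_steps"] / r["steps"] for r in results) / len(results)
    # Failure-mode distribution over unsuccessful runs only.
    failures = Counter(r["failure_mode"] for r in results if not r["success"])
    return {"success_rate": success,
            "mean_efficiency": efficiency,
            "failure_dist": dict(failures)}
```

Diffing two releases' outputs from this function is what turns a failure-mode spike into the per-release signal described above.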