
Agent Evaluation Harness

Reproducible trajectories in containerized environments.

Problem

Tool-using agents take long, branching trajectories that touch real APIs. We need an environment where every eval run is deterministic enough to attribute regressions to model/prompt changes, and rich enough to surface tool-specific failures.

Goals

  • Snapshot-restore environments per case for full determinism.
  • Trace every tool call with inputs/outputs/timings.
  • Score plan quality, tool-selection accuracy, step efficiency, and final-task success.
  • Support both mocked-tool and live-tool runs.

Non-goals

  • Multi-agent orchestration eval — covered separately.

Architecture

[Diagram: Input → Process → Model → Store → Judge → Output]
Components

  • Task Suite: versioned tasks with goal, success criteria, and allowed tools.
  • Sandbox Env: container with snapshot-restore (filesystem, mock APIs, browser, OS).
  • Agent SUT: planner + tool-use loop; the candidate stack under test.
  • Tool Layer: mocked tools (recorded responses) or live (rate-limited).
  • Trajectory Logger: captures every step (thought, tool, args, observation).
  • Trajectory Judge: per-step + overall scoring against a rubric.
  • Report: success rate, efficiency, failure-mode taxonomy.
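
A task record might look like the following sketch; the field names and the TaskSuite wrapper are illustrative assumptions, not the harness's actual schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    task_id: str                    # stable identifier, versioned with the suite
    goal: str                       # natural-language goal handed to the agent
    allowed_tools: tuple[str, ...]  # tools the agent may call for this task
    verifier: str                   # task-specific success check (file diff, DOM check, ...)
    max_steps: int = 30             # budget used for step-efficiency scoring

@dataclass(frozen=True)
class TaskSuite:
    version: str                    # suite version pinned in the eval report
    tasks: tuple[Task, ...] = field(default_factory=tuple)
```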

Walkthrough

1. Snapshot the environment

Each task starts from a sealed container snapshot: filesystem in a known state, mock APIs primed with deterministic responses, browser at a known URL. Restoring between runs eliminates 'works on my run' flakiness.
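
A minimal sketch of the restore step, assuming one pre-built Docker image per task; the eval-snapshots/<task_id> naming convention is an assumption.

```python
import subprocess

def start_from_snapshot(task_id: str) -> str:
    """Start a fresh container from the task's sealed snapshot image."""
    image = f"eval-snapshots/{task_id}"   # assumed image naming convention
    result = subprocess.run(
        ["docker", "run", "--rm", "-d", image],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()          # container id for this run

def teardown(container_id: str) -> None:
    """Destroy the container; the next run restores from the same image."""
    subprocess.run(["docker", "rm", "-f", container_id], check=True)
```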

2. Run the agent with tracing

The agent's thought, tool selection, args, and observations are logged at every step with monotonic timestamps. The trace is the unit of evaluation — outcomes alone don't tell you why something worked or failed.
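
A minimal logger sketch; the step record mirrors the fields above (thought, tool, args, observation, monotonic timestamp), and everything else about the shape is an assumption.

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class Step:
    index: int          # position in the trajectory
    thought: str        # agent's reasoning text for this step
    tool: str           # tool the agent selected
    args: dict          # arguments passed to the tool
    observation: str    # tool output fed back to the agent
    t_monotonic: float  # monotonic timestamp, immune to wall-clock jumps

class TrajectoryLogger:
    def __init__(self, path: str):
        self.path = path
        self.steps: list[Step] = []

    def log(self, thought: str, tool: str, args: dict, observation: str) -> None:
        self.steps.append(
            Step(len(self.steps), thought, tool, args, observation, time.monotonic())
        )

    def flush(self) -> None:
        # Persist the full trace as JSON; the trace is the unit of evaluation.
        with open(self.path, "w") as f:
            json.dump([asdict(s) for s in self.steps], f, indent=2)
```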

3. Mock vs. live

Mocked tools (recorded I/O) make runs cheap and deterministic — use these on every PR. Live runs (real APIs, rate-limited) catch integration drift — schedule nightly. When mocked and live diverge, the underlying tool changed; auto-file an issue.
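
One way to sketch the tool layer, assuming recorded responses are keyed by a hash of tool name plus canonicalized args; the key scheme and class names are illustrative, and divergences are just collected here for the harness to turn into issues.

```python
import hashlib
import json

class ToolLayer:
    def __init__(self, recordings: dict[str, str], live_client=None):
        self.recordings = recordings      # recording key -> recorded response
        self.live_client = live_client    # None => fully mocked run
        self.divergences: list[tuple[str, dict]] = []

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def call(self, tool: str, args: dict) -> str:
        key = self._key(tool, args)
        if self.live_client is None:
            return self.recordings[key]                # deterministic replay
        response = self.live_client.call(tool, args)   # rate-limited live call
        if key in self.recordings and self.recordings[key] != response:
            self.divergences.append((tool, args))      # candidate for an auto-filed issue
        return response
```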

4. Trajectory scoring

The judge scores: tool-selection accuracy (per step, vs. allowed tools), parameter validity, step efficiency (minimum steps needed / steps used), recovery (handled tool errors gracefully?), termination (stopped at success vs. looped). Final-task success is graded with a task-specific verifier (file diff, DOM check, API state check).
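
A rough scoring pass over the logged Step records from the logger sketch above; the ERROR-prefix convention for failed observations and the recovery heuristic are assumptions.

```python
def score_trajectory(steps, allowed_tools: set[str],
                     optimal_steps: int, task_succeeded: bool) -> dict:
    if not steps:
        return {"tool_selection_accuracy": 0.0, "step_efficiency": 0.0,
                "recovered_from_errors": True, "task_success": task_succeeded}

    in_allowed = sum(1 for s in steps if s.tool in allowed_tools)
    error_idx = [i for i, s in enumerate(steps) if s.observation.startswith("ERROR")]
    # Recovery heuristic (assumption): every tool error is followed by at least
    # one more step, i.e. the agent did not terminate on the error itself.
    recovered = all(i < len(steps) - 1 for i in error_idx)

    return {
        "tool_selection_accuracy": in_allowed / len(steps),
        "step_efficiency": min(1.0, optimal_steps / len(steps)),  # optimal / actual
        "recovered_from_errors": recovered,
        "task_success": task_succeeded,
    }
```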

5. Failure-mode taxonomy

Failures are auto-classified: loop trap, goal drift, tool hallucination, premature commitment, cascading error. Track counts per release; a spike in 'goal drift' after a prompt change is a precise, actionable signal.
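
A heuristic classifier sketch covering a subset of the taxonomy; the specific rules are illustrative stand-ins, not the real classification logic, and steps are assumed to be the logged Step records.

```python
import json
from collections import Counter

def classify_failure(steps, allowed_tools: set[str]) -> str:
    calls = Counter((s.tool, json.dumps(s.args, sort_keys=True)) for s in steps)
    if any(n >= 3 for n in calls.values()):
        return "loop_trap"            # same tool + args repeated without progress
    if any(s.tool not in allowed_tools for s in steps):
        return "tool_hallucination"   # called a tool outside the allowed set
    if steps and steps[-1].observation.startswith("ERROR"):
        return "cascading_error"      # ended on an unhandled tool error
    return "goal_drift"               # default bucket for off-goal trajectories
```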

Tradeoffs

  • Snapshot-restore containers per case
    Chose: Heavy isolation
    Over: Shared state with reset scripts
    Because: Reset scripts always miss something; snapshots are the only way to hit determinism at scale.

  • Mocked tools by default; live nightly
    Chose: Hybrid
    Over: All-live
    Because: Live APIs are slow, flaky, and expensive; mocks let us run on every commit. Nightly live catches integration drift.

Metrics to track

  • Task success rate per task category
  • Average steps per task; efficiency = optimal_steps / actual_steps
  • Tool-selection accuracy
  • Failure-mode distribution per release
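
Per-release aggregation might look like this sketch; the record shape (category, task_success, optimal_steps, actual_steps, failure_mode) is an assumption.

```python
from collections import Counter, defaultdict

def aggregate(records: list[dict]) -> dict:
    by_category = defaultdict(list)
    for r in records:
        by_category[r["category"]].append(r)

    report = {}
    for category, rs in by_category.items():
        report[category] = {
            "success_rate": sum(r["task_success"] for r in rs) / len(rs),
            "avg_steps": sum(r["actual_steps"] for r in rs) / len(rs),
            # efficiency = optimal_steps / actual_steps, averaged over the category
            "efficiency": sum(r["optimal_steps"] / r["actual_steps"] for r in rs) / len(rs),
            "failure_modes": Counter(
                r["failure_mode"] for r in rs if not r["task_success"]
            ),
        }
    return report
```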