
Offline Eval Pipeline

From prompt PR to a defensible quality verdict in under 10 minutes.

Problem

Engineers change prompts, retrieval, or model versions dozens of times per week. Each change can silently regress quality. We need a CI-grade pipeline that produces a per-PR quality report, gates merges, and stores all results for trend analysis.

Goals

  • Run on every PR in <10 min for the smoke set, <30 min for the full suite.
  • Report per-slice deltas vs. production baseline with statistical significance.
  • Cache judge calls aggressively — most cases are unchanged across runs.
  • Provide trace links so reviewers can audit any failure in one click.

Non-goals

  • Online (production) quality monitoring — covered separately.
  • Human-in-the-loop labeling — handled by triage tooling.

Architecture

[Architecture diagram; legend: Input, Process, Model, Store, Judge, Output]
Components
  • PR / commit: prompt, retrieval config, tool, or model change.
  • Eval Runner: loads the case set, fans out generation, applies scoring.
  • Case Store: versioned dataset of prod traces, bug reports, red-team cases, and synthetic cases.
  • System Under Test: the candidate stack (prompt + retriever + tools + model).
  • Judge Pool: programmatic checks plus LLM judges with cached verdicts.
  • Results Warehouse: append-only store of every run, sliceable by tag/version/case.
  • PR Report: delta-vs-baseline, slice regressions, trace drill-downs.
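
These components might map onto a data model like the following minimal sketch; the type and field names are illustrative assumptions, not the pipeline's actual schema.

```python
# Minimal data-model sketch for the Case Store and Results Warehouse;
# all type and field names are illustrative, not the real schema.
from dataclasses import dataclass, field

@dataclass
class Case:
    case_id: str
    source: str  # "prod_trace", "bug_report", "red_team", "synthetic"
    tags: list[str] = field(default_factory=list)  # slices: "safety", ...

@dataclass
class RunResult:
    case_id: str
    candidate_rev: str  # prompt + retriever + tool + model versions
    trace_hash: str     # content-addressed; keys the judge cache
    scores: dict[str, float] = field(default_factory=dict)  # per-criterion
```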

Walkthrough

1. PR triggers the runner

A GitHub Action invokes the runner with the candidate config (prompt + model version + retriever rev + tool schemas). The runner deterministically picks a case set: a fast 'smoke' set (~50 cases, 2 min) on every commit, the full set (~2000 cases, 25 min) on PR-ready and on main.
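
As a sketch, the case-set choice might look like the function below; the event names and tag values are assumptions, and the actual trigger wiring lives in the GitHub Action workflow.

```python
# Sketch of deterministic case-set selection; event names are assumptions.
def select_case_set(event: str) -> str:
    """Map a CI event to a case-set tag:
    every commit push     -> "smoke" (~50 cases, ~2 min)
    PR-ready or main push -> "full"  (~2000 cases, ~25 min)
    """
    return "full" if event in ("pr_ready", "push_main") else "smoke"

assert select_case_set("push") == "smoke"
assert select_case_set("pr_ready") == "full"
```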

2. Generation with traces

Each case is run with n=3 samples. Every call captures the full trace: prompt, retrieval results, tool calls, model version, latency, tokens. Traces are stored with a content-addressed hash so judge results can be cached.
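
One way to get a content-addressed key is to hash the canonical JSON form of the trace. A minimal sketch follows; the trace fields shown are illustrative, not the pipeline's actual schema.

```python
# Content-addressed trace hashing; the trace fields shown are illustrative.
import hashlib
import json

def trace_hash(trace: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash stable,
    # so identical traces map to the same key across runs.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

trace = {
    "case_id": "case-001",
    "prompt": "...",
    "retrieval": ["doc-17", "doc-42"],
    "tool_calls": [],
    "model": "model-2025-01",
    "latency_ms": 412,
    "tokens": 837,
    "output": "final answer text",
}
key = trace_hash(trace)  # unchanged trace -> unchanged key -> cached verdicts
```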

3. Scoring with cached judges

For each (case, output) we run the rubric's checks: programmatic (regex, schema), LLM-as-judge (per-criterion structured output), and pairwise vs. baseline (for win-rate metrics). Judge calls are cached by hash(judge_prompt, output) — most PRs only invalidate ~10% of cases.
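
A sketch of the verdict cache keyed by hash(judge_prompt, output); the dict-backed cache and call_llm_judge are placeholders for the real store and judge invocation.

```python
# Judge-verdict cache keyed by hash(judge_prompt, output).
import hashlib
import json

def judge_key(judge_prompt: str, output: str) -> str:
    raw = json.dumps([judge_prompt, output]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def call_llm_judge(judge_prompt: str, output: str) -> dict:
    # Placeholder: per-criterion structured verdict from the LLM judge.
    raise NotImplementedError

def judge_with_cache(judge_prompt: str, output: str, cache: dict) -> dict:
    # Unchanged (judge_prompt, output) pairs hit the cache; only new or
    # invalidated pairs (~10% per PR) reach the LLM judge.
    key = judge_key(judge_prompt, output)
    if key not in cache:
        cache[key] = call_llm_judge(judge_prompt, output)
    return cache[key]
```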

4. Statistical comparison

We compare candidate vs. baseline on the same case set with paired bootstrap. Per slice (case tag), we report the delta and 95% CI. The PR report flags slices where the CI excludes zero, separating signal from noise.
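
For intuition, the paired bootstrap over per-case deltas fits in a few lines of numpy; this is a minimal sketch run once per slice, and the flagging convention is an assumption.

```python
# Minimal paired-bootstrap sketch (numpy); run once per slice.
import numpy as np

def paired_bootstrap_ci(candidate, baseline, n_boot=10_000, alpha=0.05, seed=0):
    """Mean per-case delta with a bootstrap CI. Resampling whole cases
    (not individual scores) preserves the pairing, which removes
    case-difficulty variance from the comparison."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(candidate, float) - np.asarray(baseline, float)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return deltas.mean(), (lo, hi)

# The CI excluding zero is the flag; hi < 0 marks a significant regression.
mean_delta, (lo, hi) = paired_bootstrap_ci([0.8, 0.9, 0.7], [0.7, 0.9, 0.6])
regression = hi < 0
```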

5. Gate and merge

Gates: zero regression on the safety slice; no other slice regresses by >2σ; the headline metric does not regress at p<0.05. If any gate fails, the merge is blocked with an actionable diff; if all pass, the merge is allowed and a canary plan is auto-attached.
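
The three gates reduce to a single predicate. Here is a sketch assuming each slice report carries its delta and standard error; SliceReport and the field names are illustrative.

```python
# Sketch of the merge gate; SliceReport and its fields are illustrative.
from dataclasses import dataclass

@dataclass
class SliceReport:
    name: str
    delta: float  # candidate minus baseline on this slice
    sigma: float  # standard error of the delta

def merge_allowed(slices: list[SliceReport],
                  headline_delta: float, headline_p: float) -> bool:
    for s in slices:
        if s.name == "safety" and s.delta < 0:
            return False             # gate 1: zero safety regression
        if s.delta < -2 * s.sigma:
            return False             # gate 2: no slice regresses by >2 sigma
    if headline_delta < 0 and headline_p < 0.05:
        return False                 # gate 3: no significant headline regression
    return True
```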

Tradeoffs

  • Cache judge verdicts by output hash: chose aggressive caching over re-judging every run, because 90%+ of cases are unchanged across PRs; re-judging burns budget and adds variance.
  • Paired bootstrap, not raw mean diff: chose a paired statistical test over a threshold on the absolute mean, because eval scores have non-trivial variance; raw deltas produce false alarms and missed regressions.
  • Smoke set on every commit, full set on ready: chose a two-tier suite over always running the full suite, because engineering velocity matters; the smoke set catches 80% of regressions in <2 minutes.

Metrics to track

  • p50 / p95 PR turnaround time
  • Judge cache hit rate
  • Per-slice regression detection rate (validated on synthetic regressions)
  • False-alarm rate on no-op PRs