
Offline Eval Pipeline

From prompt PR to a defensible quality verdict in under 10 minutes.

Problem

Engineers change prompts, retrieval, or model versions dozens of times per week. Each change can silently regress quality. We need a CI-grade pipeline that produces a per-PR quality report, gates merges, and stores all results for trend analysis.

Goals

  • Run on every PR in <10 min for the smoke set, <30 min for the full suite.
  • Report per-slice deltas vs. production baseline with statistical significance.
  • Cache judge calls aggressively — most cases are unchanged across runs.
  • Provide trace links so reviewers can audit any failure in one click.

Non-goals

  • Online (production) quality monitoring — covered separately.
  • Human-in-the-loop labeling — handled by triage tooling.

Architecture

[Architecture diagram; legend: Input, Process, Model, Store, Judge, Output]
Components
  • PR / commit: prompt, retrieval config, tool, or model change.
  • Eval Runner: loads the case set, fans out generation, applies scoring.
  • Case Store: versioned dataset of prod traces, bug reports, red-team cases, and synthetic cases.
  • System Under Test: the candidate stack (prompt + retriever + tools + model).
  • Judge Pool: programmatic checks plus LLM judges with cached verdicts.
  • Results Warehouse: append-only store of every run, sliceable by tag/version/case.
  • PR Report: delta-vs-baseline, slice regressions, trace drill-downs.
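
These components might map onto a data model like the following minimal sketch; the type and field names are illustrative assumptions, not the pipeline's actual schema.

```python
# Minimal data-model sketch for the Case Store and Results Warehouse;
# all type and field names are illustrative, not the real schema.
from dataclasses import dataclass, field

@dataclass
class Case:
    case_id: str
    source: str  # "prod_trace", "bug_report", "red_team", "synthetic"
    tags: list[str] = field(default_factory=list)  # slices: "safety", ...

@dataclass
class RunResult:
    case_id: str
    candidate_rev: str  # prompt + retriever + tool + model versions
    trace_hash: str     # content-addressed; keys the judge cache
    scores: dict[str, float] = field(default_factory=dict)  # per-criterion
```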

Walkthrough

1. PR triggers the runner

A GitHub Action invokes the runner with the candidate config (prompt + model version + retriever rev + tool schemas). The runner deterministically picks a case set: a fast 'smoke' set (~50 cases, 2 min) on every commit, the full set (~2000 cases, 25 min) on PR-ready and on main.
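
As a sketch, the case-set choice might look like the function below; the event names and tag values are assumptions, and the actual trigger wiring lives in the GitHub Action workflow.

```python
# Sketch of deterministic case-set selection; event names are assumptions.
def select_case_set(event: str) -> str:
    """Map a CI event to a case-set tag:
    every commit push     -> "smoke" (~50 cases, ~2 min)
    PR-ready or main push -> "full"  (~2000 cases, ~25 min)
    """
    return "full" if event in ("pr_ready", "push_main") else "smoke"

assert select_case_set("push") == "smoke"
assert select_case_set("pr_ready") == "full"
```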

2. Generation with traces

Each case is run with n=3 samples. Every call captures the full trace: prompt, retrieval results, tool calls, model version, latency, tokens. Traces are stored with a content-addressed hash so judge results can be cached.
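
One way to get a content-addressed key is to hash the canonical JSON form of the trace. A minimal sketch follows; the trace fields shown are illustrative, not the pipeline's actual schema.

```python
# Content-addressed trace hashing; the trace fields shown are illustrative.
import hashlib
import json

def trace_hash(trace: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash stable,
    # so identical traces map to the same key across runs.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

trace = {
    "case_id": "case-001",
    "prompt": "...",
    "retrieval": ["doc-17", "doc-42"],
    "tool_calls": [],
    "model": "model-2025-01",
    "latency_ms": 412,
    "tokens": 837,
    "output": "final answer text",
}
key = trace_hash(trace)  # unchanged trace -> unchanged key -> cached verdicts
```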

3. Scoring with cached judges

For each (case, output) we run the rubric's checks: programmatic (regex, schema), LLM-as-judge (per-criterion structured output), and pairwise vs. baseline (for win-rate metrics). Judge calls are cached by hash(judge_prompt, output) — most PRs only invalidate ~10% of cases.
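
A sketch of the verdict cache keyed by hash(judge_prompt, output); the dict-backed cache and call_llm_judge are placeholders for the real store and judge invocation.

```python
# Judge-verdict cache keyed by hash(judge_prompt, output).
import hashlib
import json

def judge_key(judge_prompt: str, output: str) -> str:
    raw = json.dumps([judge_prompt, output]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def call_llm_judge(judge_prompt: str, output: str) -> dict:
    # Placeholder: per-criterion structured verdict from the LLM judge.
    raise NotImplementedError

def judge_with_cache(judge_prompt: str, output: str, cache: dict) -> dict:
    # Unchanged (judge_prompt, output) pairs hit the cache; only new or
    # invalidated pairs (~10% per PR) reach the LLM judge.
    key = judge_key(judge_prompt, output)
    if key not in cache:
        cache[key] = call_llm_judge(judge_prompt, output)
    return cache[key]
```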

4. Statistical comparison

We compare candidate vs. baseline on the same case set with paired bootstrap. Per slice (case tag), we report the delta and 95% CI. The PR report flags slices where the CI excludes zero, separating signal from noise.
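
For intuition, the paired bootstrap over per-case deltas fits in a few lines of numpy; this is a minimal sketch run once per slice, and the flagging convention is an assumption.

```python
# Minimal paired-bootstrap sketch (numpy); run once per slice.
import numpy as np

def paired_bootstrap_ci(candidate, baseline, n_boot=10_000, alpha=0.05, seed=0):
    """Mean per-case delta with a bootstrap CI. Resampling whole cases
    (not individual scores) preserves the pairing, which removes
    case-difficulty variance from the comparison."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(candidate, float) - np.asarray(baseline, float)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return deltas.mean(), (lo, hi)

# The CI excluding zero is the flag; hi < 0 marks a significant regression.
mean_delta, (lo, hi) = paired_bootstrap_ci([0.8, 0.9, 0.7], [0.7, 0.9, 0.6])
regression = hi < 0
```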

5. Gate and merge

Gates: zero regression on the safety slice; no other slice regresses by >2σ; the headline metric does not regress at p<0.05. If any gate fails, the merge is blocked with an actionable diff; if all pass, the merge is allowed and a canary plan is auto-attached.
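
The three gates reduce to a single predicate. Here is a sketch assuming each slice report carries its delta and standard error; SliceReport and the field names are illustrative.

```python
# Sketch of the merge gate; SliceReport and its fields are illustrative.
from dataclasses import dataclass

@dataclass
class SliceReport:
    name: str
    delta: float  # candidate minus baseline on this slice
    sigma: float  # standard error of the delta

def merge_allowed(slices: list[SliceReport],
                  headline_delta: float, headline_p: float) -> bool:
    for s in slices:
        if s.name == "safety" and s.delta < 0:
            return False             # gate 1: zero safety regression
        if s.delta < -2 * s.sigma:
            return False             # gate 2: no slice regresses by >2 sigma
    if headline_delta < 0 and headline_p < 0.05:
        return False                 # gate 3: no significant headline regression
    return True
```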

Tradeoffs

  • Cache judge verdicts by output hash: chose aggressive caching over re-judging every run, because 90%+ of cases are unchanged across PRs; re-judging burns budget and adds variance.
  • Paired bootstrap, not raw mean diff: chose a paired statistical test over a threshold on the absolute mean, because eval scores have non-trivial variance; raw deltas produce false alarms and missed regressions.
  • Smoke set on every commit, full set on ready: chose a two-tier suite over always running the full suite, because engineering velocity matters; the smoke set catches 80% of regressions in <2 minutes.

Metrics to track

  • p50 / p95 PR turnaround time
  • Judge cache hit rate
  • Per-slice regression detection rate (validated on synthetic regressions)
  • False-alarm rate on no-op PRs