Offline Eval Pipeline
From prompt PR to a defensible quality verdict in under 10 minutes.
Problem
Engineers change prompts, retrieval, or model versions dozens of times per week. Each change can silently regress quality. We need a CI-grade pipeline that produces a per-PR quality report, gates merges, and stores all results for trend analysis.
Goals
- Run on every PR in <10 minutes for the smoke set, <30 minutes for the full suite.
- Report per-slice deltas vs. the production baseline with statistical significance.
- Cache judge calls aggressively — most cases are unchanged across runs.
- Provide trace links so reviewers can audit any failure in one click.
Non-goals
- Online (production) quality monitoring — covered separately.
- Human-in-the-loop labeling — handled by triage tooling.
Architecture
Walkthrough
1. PR triggers the runner
A GitHub Action invokes the runner with the candidate config (prompt + model version + retriever rev + tool schemas). The runner deterministically picks a case set: a fast 'smoke' set (~50 cases, 2 min) on every commit, the full set (~2000 cases, 25 min) on PR-ready and on main.
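The deterministic pick can be as simple as ranking case ids by a content hash and taking a fixed prefix, so every commit sees the same smoke set until the case pool itself changes. A minimal sketch — `select_cases` and `SMOKE_SIZE` are illustrative names, not part of the runner's actual API:

```python
import hashlib

SMOKE_SIZE = 50  # assumption: matches the ~50-case smoke set above


def select_cases(all_case_ids: list[str], mode: str) -> list[str]:
    """Deterministically pick the case set for a run.

    "full" returns every case; "smoke" takes the SMOKE_SIZE cases whose
    ids hash lowest, which is stable across machines and runs.
    """
    if mode == "full":
        return sorted(all_case_ids)
    ranked = sorted(
        all_case_ids,
        key=lambda cid: hashlib.sha256(cid.encode()).hexdigest(),
    )
    return ranked[:SMOKE_SIZE]
```

Because the ranking depends only on the case ids, adding a new case perturbs the smoke set minimally instead of reshuffling it wholesale.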
2. Generation with traces
Each case is run with n=3 samples. Every call captures the full trace: prompt, retrieval results, tool calls, model version, latency, tokens. Traces are stored with a content-addressed hash so judge results can be cached.
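Content addressing can be done by hashing a canonical JSON serialization of the trace, so the same trace always maps to the same id regardless of key insertion order. A sketch, assuming traces are JSON-serializable dicts (`trace_hash` is a hypothetical helper):

```python
import hashlib
import json


def trace_hash(trace: dict) -> str:
    """Content-addressed id for a trace.

    Canonical JSON (sorted keys, no whitespace) makes the hash
    deterministic; any change to prompt, retrieval, tool calls, or
    model version yields a new id and invalidates cached judge results.
    """
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```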
3. Scoring with cached judges
For each (case, output) we run the rubric's checks: programmatic (regex, schema), LLM-as-judge (per-criterion structured output), and pairwise vs. baseline (for win-rate metrics). Judge calls are cached by hash(judge_prompt, output) — most PRs only invalidate ~10% of cases.
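The cache key described above can be sketched as follows — an in-memory store for illustration, where a real pipeline would back this with a shared key-value store (`JudgeCache` is a hypothetical class, not the pipeline's actual implementation):

```python
import hashlib


class JudgeCache:
    """Cache judge verdicts keyed by hash(judge_prompt, output)."""

    def __init__(self):
        self._store = {}  # assumption: real pipeline uses a shared KV store

    @staticmethod
    def key(judge_prompt: str, output: str) -> str:
        h = hashlib.sha256()
        h.update(judge_prompt.encode())
        h.update(b"\x00")  # separator so concatenation is unambiguous
        h.update(output.encode())
        return h.hexdigest()

    def get_or_judge(self, judge_prompt: str, output: str, judge_fn):
        """Return the cached verdict, calling judge_fn only on a miss."""
        k = self.key(judge_prompt, output)
        if k not in self._store:
            self._store[k] = judge_fn(judge_prompt, output)
        return self._store[k]
```

Keying on both the judge prompt and the output means editing a rubric criterion invalidates exactly the verdicts that criterion produced, while unchanged outputs keep their cached scores.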
4. Statistical comparison
We compare candidate vs. baseline on the same case set with paired bootstrap. Per slice (case tag), we report the delta and 95% CI. The PR report flags slices where the CI excludes zero, separating signal from noise.
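A paired bootstrap on per-case deltas can be sketched as below (a minimal stdlib version, assuming index-aligned per-case scores; production code would vectorize this with NumPy):

```python
import random


def paired_bootstrap_ci(cand, base, n_boot=2000, alpha=0.05, seed=0):
    """Mean per-case delta and its (1 - alpha) bootstrap CI.

    cand and base are per-case scores on the SAME case set, index-aligned,
    so each resample preserves the pairing between candidate and baseline.
    """
    rng = random.Random(seed)  # fixed seed: reports are reproducible
    deltas = [c - b for c, b in zip(cand, base)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(deltas) / n, (lo, hi)
```

A slice is flagged when the returned interval excludes zero; a no-op PR should yield an interval straddling zero on every slice.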
5. Gate and merge
Gates: zero regression on the safety slice; no other slice regresses by >2σ; the headline metric does not regress at p<0.05. If any gate fails, the merge is blocked with an actionable diff; if all pass, the merge is allowed and a canary plan is auto-attached.
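The three gates can be sketched as a pure function over the per-slice report. The report shape here is an assumption for illustration (`{slice: {"delta", "sigma", "p"}}`, with "headline" as a reserved slice name), not the pipeline's actual schema:

```python
def gate(report: dict) -> tuple[bool, list[str]]:
    """Apply the merge gates; return (allowed, list of failure reasons)."""
    failures = []

    # Gate 1: zero regression on the safety slice.
    safety = report.get("safety")
    if safety is not None and safety["delta"] < 0:
        failures.append("safety slice regressed")

    # Gate 2: no other slice regresses by more than 2 sigma.
    for name, s in report.items():
        if name != "safety" and s["delta"] < -2 * s["sigma"]:
            failures.append(f"{name} regressed by >2 sigma")

    # Gate 3: headline metric must not regress at p < 0.05.
    head = report.get("headline")
    if head is not None and head["delta"] < 0 and head["p"] < 0.05:
        failures.append("headline metric regressed at p<0.05")

    return (not failures), failures
```

Keeping the gate a pure function of the report makes it trivially unit-testable against synthetic regressions, which directly supports the "regression detection rate" metric below.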
Tradeoffs
Metrics to track
- p50 / p95 PR turnaround time
- Judge cache hit rate
- Per-slice regression detection rate (validated on synthetic regressions)
- False-alarm rate on no-op PRs