
CI/CD for Prompt and Model Releases

Shadow → canary → full, with statistical gates the whole way.

Problem

Prompts, model versions, retrieval indices, and tool schemas are all release artifacts. We need a release pipeline that gates on offline evals, then validates online via shadow and canary, with automatic rollback.

Goals

  • Block merges that regress safety or headline metrics with statistical significance.
  • Run shadow comparisons before any user traffic shifts.
  • Canary at 1% → 5% → 25% → 100% with quality and error gates at each step.
  • Automatic rollback and full audit trail for every release.

Non-goals

  • Index ingestion pipeline.

Architecture

[Diagram: Input → Process → Model → Store → Judge → Output]
Components

  • PR: prompt / model / tool change with author + reviewers.
  • Offline Evals: smoke + full suite; statistical gates per slice.
  • Artifact Registry: versioned prompts, model pins, tool schemas, eval results.
  • Shadow Run: mirrors production traffic; compares old vs. new offline.
  • Canary Controller: stepped rollout 1→5→25→100% with per-step gates.
  • Online Monitors: quality, error rate, latency, refusal rate by version.
  • Rollback: repins the prior version on threshold breach.

Walkthrough

1. Offline gates

The PR runs the full eval suite with a statistical comparison against the production baseline. Gates: zero safety regressions, no slice down more than 2σ, and the headline metric not significantly down at p < 0.05. The PR is blocked otherwise; gates can be force-overridden with a recorded justification (counted in monthly release-quality KPIs).
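
A minimal sketch of these gates, assuming per-slice score arrays for the candidate and the production baseline; `SliceScores` and `offline_gate` are illustrative names, and the 2σ test is taken against the baseline's standard deviation:

```python
from dataclasses import dataclass
from statistics import mean, stdev
from scipy.stats import ttest_ind

@dataclass
class SliceScores:
    name: str
    baseline: list[float]    # production scores on this slice
    candidate: list[float]   # PR artifact scores on this slice

def offline_gate(headline: SliceScores, slices: list[SliceScores],
                 safety_regressions: int) -> list[str]:
    """Return gate failures; an empty list means the PR may merge."""
    failures = []
    if safety_regressions > 0:
        failures.append(f"{safety_regressions} safety regression(s); zero allowed")
    for s in slices:
        # Per-slice gate: the mean may not drop more than 2 sigma of baseline.
        if mean(s.baseline) - mean(s.candidate) > 2 * stdev(s.baseline):
            failures.append(f"slice '{s.name}' down more than 2 sigma")
    # Headline gate: one-sided t-test; p < 0.05 means the candidate is
    # significantly worse than the baseline, which blocks the merge.
    _, p = ttest_ind(headline.candidate, headline.baseline, alternative="less")
    if p < 0.05:
        failures.append(f"headline metric down at p={p:.3f}")
    return failures
```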

2. Shadow validation

After the offline pass, the artifact deploys in shadow: a copy of production traffic is sent to both the old and new versions, both responses are logged and judged, but only the old response is returned to users. Shadow runs for at least 1 hour or 10k requests. The online judge mean must not regress at p < 0.01 to proceed.
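
A sketch of the shadow gate under those thresholds, assuming judge scores are paired per mirrored request; `shadow_gate` and the constants are illustrative:

```python
from scipy.stats import ttest_rel

MIN_REQUESTS = 10_000    # shadow for at least 10k requests...
MIN_SECONDS = 3_600      # ...or at least one hour

def shadow_gate(old_scores: list[float], new_scores: list[float],
                elapsed_seconds: float) -> bool:
    """True once the new artifact may proceed to canary."""
    if len(old_scores) < MIN_REQUESTS and elapsed_seconds < MIN_SECONDS:
        return False     # keep shadowing; not enough evidence yet
    # Paired one-sided test over mirrored requests: block if the new
    # version's judge mean is lower with p < 0.01.
    _, p = ttest_rel(new_scores, old_scores, alternative="less")
    return p >= 0.01
```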

3. Canary

The canary controller routes 1% → 5% → 25% → 100% of traffic, with each step held for ≥30 minutes. At every step we check error rate, p95 latency, online quality score, refusal rate, and a curated 'safety canary' set sampled live. Any threshold breach repins the prior version and pages the on-call.
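
An illustrative controller loop; the step sizes and 30-minute hold come from the text, while `route`, `read_metrics`, `repin_prior`, and `page` are hypothetical hooks into the router and online monitors:

```python
import time

STEPS = (1, 5, 25, 100)    # percent of traffic on the new version
HOLD_SECONDS = 30 * 60     # each step held for at least 30 minutes

def run_canary(route, read_metrics, repin_prior, page) -> bool:
    """Step traffic up, gating at each step; True means full rollout."""
    for pct in STEPS:
        route(pct)         # shift pct% of traffic to the new version
        deadline = time.monotonic() + HOLD_SECONDS
        while time.monotonic() < deadline:
            # read_metrics() returns {gate_name: (observed, limit)}; each
            # gate is expressed so that exceeding its limit is a breach
            # (quality and refusal as regression deltas vs. the old version).
            breaches = [k for k, (obs, limit) in read_metrics().items()
                        if obs > limit]
            if breaches:
                repin_prior()    # automatic rollback to the prior version
                page(f"canary breach at {pct}%: {breaches}")
                return False
            time.sleep(60)       # poll the monitors once a minute
    return True                  # held at 100%; release complete
```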

4. Audit and registry

Every artifact (prompt SHA, model version, tool schema hash) lands in the registry with offline eval report, shadow scores, canary decisions, and final state. Compliance and post-mortems work directly off this registry; nothing is reconstructed from logs.
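
One possible shape for a registry record; the field names are assumptions derived from the artifacts listed above, not a real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseRecord:
    prompt_sha: str
    model_version: str
    tool_schema_hash: str
    offline_eval_report: str         # URI of the eval report artifact
    shadow_scores: dict[str, float]  # judge means, p-value, request count
    canary_decisions: list[dict]     # per-step pass/fail + metric snapshots
    final_state: str                 # "released" | "rolled_back" | "overridden"
    override_justification: str | None = None  # required when overridden
```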

Tradeoffs

Shadow before any canary
  • Chose: mandatory shadow.
  • Over: skip-shadow for low-risk changes.
  • Because: offline evals miss the real-traffic distribution; shadow is the cheapest way to surface those gaps without users seeing them.

Force-override is allowed but counted
  • Chose: audited override.
  • Over: hard block.
  • Because: hard blocks invite branch tricks; audited overrides preserve velocity while making misuse visible at the org level.

Metrics to track

  • Time from PR merge to 100% rollout (p50/p95)
  • Override rate per team per month
  • Auto-rollback fire rate (and time-to-mitigate)
  • Shadow-detected regressions per quarter