
CI/CD for Prompt and Model Releases

Shadow → canary → full, with statistical gates the whole way.

Problem

Prompts, model versions, retrieval indices, and tool schemas are all release artifacts. We need a release pipeline that gates on offline evals, then validates online via shadow and canary, with automatic rollback.

Goals

  • Block merges that regress safety or headline metrics with statistical significance.
  • Run shadow comparisons before any user traffic shifts.
  • Canary at 1% → 5% → 25% → 100% with quality and error gates at each step.
  • Automatic rollback and full audit trail for every release.

Non-goals

  • Index ingestion pipeline.

Architecture

[Diagram: Input → Process → Model → Store → Judge → Output]
Components

  • PR: prompt / model / tool change with author + reviewers.
  • Offline Evals: smoke + full suite; statistical gates per slice.
  • Artifact Registry: versioned prompts, model pins, tool schemas, eval results.
  • Shadow Run: mirrors production traffic; compares old vs. new offline.
  • Canary Controller: stepped rollout 1→5→25→100% with per-step gates.
  • Online Monitors: quality, error rate, latency, refusal rate by version.
  • Rollback: repins the prior version on threshold breach.

Walkthrough

1. Offline gates

The PR runs the full eval suite with a statistical comparison against the production baseline. Gates: zero safety regressions, no slice down more than 2σ, and the headline metric not significantly down at p < 0.05. The PR is blocked otherwise; gates can be force-overridden with a recorded justification (counted in monthly release-quality KPIs).
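
A minimal sketch of these gates, assuming per-slice score arrays for the candidate and the production baseline; `SliceScores` and `offline_gate` are illustrative names, and the 2σ test is taken against the baseline's standard deviation:

```python
from dataclasses import dataclass
from statistics import mean, stdev
from scipy.stats import ttest_ind

@dataclass
class SliceScores:
    name: str
    baseline: list[float]    # production scores on this slice
    candidate: list[float]   # PR artifact scores on this slice

def offline_gate(headline: SliceScores, slices: list[SliceScores],
                 safety_regressions: int) -> list[str]:
    """Return gate failures; an empty list means the PR may merge."""
    failures = []
    if safety_regressions > 0:
        failures.append(f"{safety_regressions} safety regression(s); zero allowed")
    for s in slices:
        # Per-slice gate: the mean may not drop more than 2 sigma of baseline.
        if mean(s.baseline) - mean(s.candidate) > 2 * stdev(s.baseline):
            failures.append(f"slice '{s.name}' down more than 2 sigma")
    # Headline gate: one-sided t-test; p < 0.05 means the candidate is
    # significantly worse than the baseline, which blocks the merge.
    _, p = ttest_ind(headline.candidate, headline.baseline, alternative="less")
    if p < 0.05:
        failures.append(f"headline metric down at p={p:.3f}")
    return failures
```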

2. Shadow validation

After the offline pass, the artifact deploys in shadow: a copy of production traffic is sent to both the old and new versions, both responses are logged and judged, but only the old response is returned to users. Shadow runs for at least 1 hour or 10k requests. The online judge mean must not regress at p < 0.01 to proceed.
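
A sketch of the shadow gate under those thresholds, assuming judge scores are paired per mirrored request; `shadow_gate` and the constants are illustrative:

```python
from scipy.stats import ttest_rel

MIN_REQUESTS = 10_000    # shadow for at least 10k requests...
MIN_SECONDS = 3_600      # ...or at least one hour

def shadow_gate(old_scores: list[float], new_scores: list[float],
                elapsed_seconds: float) -> bool:
    """True once the new artifact may proceed to canary."""
    if len(old_scores) < MIN_REQUESTS and elapsed_seconds < MIN_SECONDS:
        return False     # keep shadowing; not enough evidence yet
    # Paired one-sided test over mirrored requests: block if the new
    # version's judge mean is lower with p < 0.01.
    _, p = ttest_rel(new_scores, old_scores, alternative="less")
    return p >= 0.01
```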

3. Canary

The canary controller routes 1% → 5% → 25% → 100% of traffic, with each step held for ≥30 minutes. At every step we check error rate, p95 latency, online quality score, refusal rate, and a curated 'safety canary' set sampled live. Any threshold breach repins the prior version and pages the on-call.
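
An illustrative controller loop; the step sizes and 30-minute hold come from the text, while `route`, `read_metrics`, `repin_prior`, and `page` are hypothetical hooks into the router and online monitors:

```python
import time

STEPS = (1, 5, 25, 100)    # percent of traffic on the new version
HOLD_SECONDS = 30 * 60     # each step held for at least 30 minutes

def run_canary(route, read_metrics, repin_prior, page) -> bool:
    """Step traffic up, gating at each step; True means full rollout."""
    for pct in STEPS:
        route(pct)         # shift pct% of traffic to the new version
        deadline = time.monotonic() + HOLD_SECONDS
        while time.monotonic() < deadline:
            # read_metrics() returns {gate_name: (observed, limit)}; each
            # gate is expressed so that exceeding its limit is a breach
            # (quality and refusal as regression deltas vs. the old version).
            breaches = [k for k, (obs, limit) in read_metrics().items()
                        if obs > limit]
            if breaches:
                repin_prior()    # automatic rollback to the prior version
                page(f"canary breach at {pct}%: {breaches}")
                return False
            time.sleep(60)       # poll the monitors once a minute
    return True                  # held at 100%; release complete
```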

4. Audit and registry

Every artifact (prompt SHA, model version, tool schema hash) lands in the registry with offline eval report, shadow scores, canary decisions, and final state. Compliance and post-mortems work directly off this registry; nothing is reconstructed from logs.
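
One possible shape for a registry record; the field names are assumptions derived from the artifacts listed above, not a real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseRecord:
    prompt_sha: str
    model_version: str
    tool_schema_hash: str
    offline_eval_report: str         # URI of the eval report artifact
    shadow_scores: dict[str, float]  # judge means, p-value, request count
    canary_decisions: list[dict]     # per-step pass/fail + metric snapshots
    final_state: str                 # "released" | "rolled_back" | "overridden"
    override_justification: str | None = None  # required when overridden
```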

Tradeoffs

Shadow before any canary
  • Chose: mandatory shadow.
  • Over: skip-shadow for low-risk changes.
  • Because: offline evals miss the real-traffic distribution; shadow is the cheapest way to surface those gaps without users seeing them.

Force-override is allowed but counted
  • Chose: audited override.
  • Over: hard block.
  • Because: hard blocks invite branch tricks; audited overrides preserve velocity while making misuse visible at the org level.

Metrics to track

  • Time from PR merge to 100% rollout (p50/p95)
  • Override rate per team per month
  • Auto-rollback fire rate (and time-to-mitigate)
  • Shadow-detected regressions per quarter