
Production Observability for AI Systems

Every request scored, every drift detected, every failure looped back.

Problem

Offline evals are a snapshot. Production is a stream. We need an observability stack that scores live traffic, detects drift, surfaces regressions in hours, and feeds curated cases back into the offline eval set.

Goals

  • Per-request quality score from a cheap online judge.
  • Distribution monitoring for inputs, outputs, and quality.
  • Triage queue that turns user-reported failures into eval cases.
  • Auto-rollback on online quality threshold breach.

Non-goals

  • Cost optimization — an adjacent concern.

Architecture

[Architecture diagram — node types: Input, Process, Model, Store, Judge, Output]
Components
  • User Traffic: Production requests with user feedback signals.
  • API Gateway: Routes traffic, applies version, attaches request ID.
  • Production Stack: Prompt + retriever + tools + model — versioned.
  • Request Logger: Full request, response, trace, version, latency, tokens, feedback.
  • Online Judge: Cheap LLM judge scores a sample (~5%) of requests live.
  • Drift Monitors: PSI / KL on input embeddings, output stats, quality scores.
  • Triage Queue: User-reported + auto-flagged cases for human review.
  • Offline Eval Set: Curated regression cases — the canonical truth.
  • Auto-Rollback: Trips on quality / error rate thresholds; pins prior version.

Walkthrough

1. Log everything, version everything

Every production request stores prompt, retrieval results, tool calls, response, trace, model version, prompt revision, latency, tokens, and any user feedback signal (thumbs, edits, abandonment). Version tags make rollback and post-hoc slicing possible.

2. Score a sample live

A cheap LLM judge runs on ~5% of requests within seconds. The judge prompt is small and structured (rubric → JSON). The result is a per-request quality score that powers dashboards and alerting without grading every request.
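A sketch of the sampling and scoring plumbing, with an illustrative rubric and score-collapsing rule (the weights and rubric fields are assumptions, not a prescribed scheme). Hashing the request ID keeps sampling deterministic, so a request's judged/unjudged status is reproducible:

```python
import hashlib
import json

SAMPLE_RATE = 0.05  # judge ~5% of traffic

def should_judge(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1)."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < rate

# Small, structured judge prompt: rubric in, JSON out.
JUDGE_PROMPT = """Score the response against the rubric. Reply with JSON only:
{"relevance": 1-5, "groundedness": 1-5, "format_ok": true|false}"""

def parse_judge_output(raw: str) -> float:
    """Collapse the judge's JSON into a single 0-1 quality score."""
    s = json.loads(raw)
    base = (s["relevance"] + s["groundedness"]) / 10
    return base if s["format_ok"] else base * 0.5  # format breaks halve the score
```

The scalar score per judged request is what feeds the dashboards and the 2σ alerting described below; the raw JSON is kept for triage.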

3. Detect drift on three axes

Input drift (embedding-distribution PSI vs. last week), output drift (length, refusal rate, format distribution), quality drift (online judge mean by version). Alerts fire when PSI > 0.2 on any axis or when judge quality drops by 2σ over a 1h window.
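PSI can be computed with nothing but a binned histogram comparison. A stdlib-only sketch for a scalar metric (bin count and the 1e-6 floor are conventional choices, not mandated by the design):

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 alert."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def hist(xs: list) -> list:
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

The same function applies to all three axes: per-dimension embedding stats for input drift, length or refusal-rate series for output drift, and the online judge scores for quality drift.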

4. Triage and back to evals

Drift slices and user-reported failures hit a triage queue. A human labels root cause (retrieval miss, hallucination, format break, safety) and the case is added to the offline eval set. Next PR runs against this updated set; the bug never returns.
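The promotion step can be a small, mechanical transform once a human has labeled the root cause. A sketch, assuming the same four labels and a dict-based eval set (field names are illustrative):

```python
ROOT_CAUSES = {"retrieval_miss", "hallucination", "format_break", "safety"}

def promote_to_eval(case: dict, root_cause: str, eval_set: list) -> dict:
    """Convert a labeled triage case into a regression eval case (sketch)."""
    if root_cause not in ROOT_CAUSES:
        raise ValueError(f"unknown root cause: {root_cause}")
    eval_case = {
        "input": case["prompt"],
        "bad_output": case["response"],  # the failure the next PR must not repeat
        "root_cause": root_cause,
        "source": "triage",
        "model_version": case["model_version"],
    }
    eval_set.append(eval_case)
    return eval_case
```

Restricting labels to a closed set keeps the eval set sliceable by failure mode, which is what makes the "next PR runs against this updated set" loop trustworthy.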

5. Auto-rollback

If online quality drops below a per-version SLO or error rate spikes, traffic is pinned to the previous version automatically. The team is paged with the diff and the affected slices. Manual override re-enables the new version.
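The trip condition itself is simple; the work is in choosing per-version SLOs. A sketch with illustrative thresholds and a minimum-sample guard against tripping on noise:

```python
def check_rollback(quality_scores: list, error_count: int, request_count: int,
                   quality_slo: float = 0.8, error_rate_slo: float = 0.02,
                   min_samples: int = 50) -> bool:
    """Decide whether to pin traffic back to the prior version.
    Thresholds are illustrative; set per-version SLOs from baselines."""
    if len(quality_scores) < min_samples:
        return False  # too few judged requests to act on
    mean_quality = sum(quality_scores) / len(quality_scores)
    error_rate = error_count / max(request_count, 1)
    return mean_quality < quality_slo or error_rate > error_rate_slo
```

Biasing the guard toward action is deliberate: as the tradeoffs note, a false rollback is reversible, while a continued bad rollout is not.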

Tradeoffs

  • Sample 5% for online judging: chose sampling over 100% live judging because of cost. 5% is enough to detect drift in <1h; 100% would double inference cost.
  • Auto-rollback on threshold: chose automated over pager-then-manual because median time-to-mitigate beats median time-to-page; a false rollback is reversible, a continued bad rollout is not.

Metrics to track

  • Online judge quality score by version
  • Input/output/quality PSI by slice
  • Triage queue → eval-case conversion rate
  • MTTR after rollback fires