Production Observability for AI Systems
Every request scored, every drift detected, every failure looped back.
Problem
Offline evals are a snapshot. Production is a stream. We need an observability stack that scores live traffic, detects drift, surfaces regressions in hours, and feeds curated cases back into the offline eval set.
Goals
- Per-request quality score from a cheap online judge.
- Distribution monitoring for inputs, outputs, and quality.
- Triage queue that turns user-reported failures into eval cases.
- Auto-rollback on online quality threshold breach.
Non-goals
- Cost optimization (adjacent concern).
Architecture
Walkthrough
1. Log everything, version everything
Every production request stores prompt, retrieval results, tool calls, response, trace, model version, prompt revision, latency, tokens, and any user feedback signal (thumbs, edits, abandonment). Version tags make rollback and post-hoc slicing possible.
2. Score a sample live
A cheap LLM judge runs on ~5% of requests within seconds. The judge prompt is small and structured (rubric → JSON). The result is a per-request quality score that powers dashboards and alerting without grading every request.
3. Detect drift on three axes
We monitor three axes: input drift (PSI of the embedding distribution vs. last week), output drift (length, refusal rate, format distribution), and quality drift (online-judge mean by version). Alerts fire when PSI > 0.2 on any axis or when judge quality drops by 2σ over a 1h window.
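The PSI check above can be sketched as below; binning scheme and smoothing are implementation choices, shown here with equal-width bins over the reference range:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample (e.g. last week)
    and the current window. PSI > 0.2 is the alert threshold used here."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        # Smooth empty bins so the log term stays finite.
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]

    p, q = hist(expected), hist(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

The same function covers all three axes: feed it embedding-projection values for input drift, response lengths for output drift, or per-request judge scores for quality drift.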
4. Triage and back to evals
Drift slices and user-reported failures hit a triage queue. A human labels root cause (retrieval miss, hallucination, format break, safety) and the case is added to the offline eval set. Next PR runs against this updated set; the bug never returns.
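A sketch of the triage-to-eval conversion, under the assumption that an eval case is a flat dict with `input`/`expected`/`tags` fields; the exact schema, field names, and `TriageItem` shape are illustrative:

```python
from dataclasses import dataclass

# Root causes assigned by the human triager, as listed in the walkthrough.
ROOT_CAUSES = {"retrieval_miss", "hallucination", "format_break", "safety"}

@dataclass
class TriageItem:
    trace_id: str
    prompt: str
    bad_response: str
    root_cause: str  # one of ROOT_CAUSES, labeled during triage

def to_eval_case(item: TriageItem, expected_behavior: str) -> dict:
    """Convert a labeled production failure into an offline eval case.
    The root-cause label becomes a tag so future regressions can be sliced."""
    if item.root_cause not in ROOT_CAUSES:
        raise ValueError(f"unknown root cause: {item.root_cause}")
    return {
        "id": f"prod-{item.trace_id}",
        "input": item.prompt,
        "must_not": item.bad_response,   # the observed failure, as a negative check
        "expected": expected_behavior,   # written by the triager
        "tags": ["from_production", item.root_cause],
    }
```

Keying the case id on the trace id is deliberate: the same production failure re-reported twice deduplicates into one eval case.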
5. Auto-rollback
If online quality drops below a per-version SLO or error rate spikes, traffic is pinned to the previous version automatically. The team is paged with the diff and the affected slices. Manual override re-enables the new version.
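The rollback decision can be sketched as a pure function over recent judge scores; the minimum-sample guard and the 2σ baseline comparison are assumptions consistent with the alerting rules in step 3:

```python
from statistics import mean, stdev

def should_rollback(recent_scores: list[float], slo: float,
                    baseline_scores: list[float], sigma: float = 2.0,
                    min_n: int = 30) -> bool:
    """Pin traffic to the previous version if the online judge mean breaches
    the per-version SLO, or drops more than `sigma` standard deviations below
    the previous version's baseline. `min_n` guards against paging the team
    on a handful of early requests."""
    if len(recent_scores) < min_n:
        return False
    m = mean(recent_scores)
    if m < slo:
        return True
    if len(baseline_scores) >= 2:
        return m < mean(baseline_scores) - sigma * stdev(baseline_scores)
    return False
```

Keeping the decision pure (scores in, boolean out) makes it trivial to replay against historical logs when tuning `slo`, `sigma`, and `min_n`.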
Tradeoffs
Metrics to track
- Online judge quality score by version
- Input/output/quality PSI by slice
- Triage queue → eval-case conversion rate
- MTTR after rollback fires