March 29, 2026 · ~8 min read
operations
evals
production

Triage is the eval loop

Every AI team draws the diagram: production failures flow back into the eval set, the eval set drives the next release. Almost no team actually runs that loop. Here is why it rots, and the three rituals that keep it alive.

James Kip
Writes about evaluating AI systems in production. Builder of Bonsai.

The first time I saw the loop drawn on a whiteboard, I thought it was obvious and unremarkable. Production runs. Some outputs are bad. Triage looks at the bad ones. The interesting ones become eval cases. The eval set drives the next release. Repeat. Of course. What else would you do?

Four years later, I have audited that loop at maybe forty companies. Two of them are running it.

The diagram is everywhere. The loop almost never closes. And the gap between 'we have the diagram' and 'the loop actually closes' is, in my experience, the single largest determinant of whether an AI team's quality improves over time or plateaus.

Why triage rots first

Of all the work in the loop, triage is the most fragile. It is reactive, not roadmap-driven. It does not have a launch date. It does not have a metric anyone presents at all-hands. It looks like operations, in a culture that rewards features.

So it gets squeezed. The first quarter, triage is a Friday afternoon ritual with the whole team. By the second quarter, it is a single owner. By the third, that owner is on parental leave and nobody covered. By the fourth, the triage queue has 4,000 unreviewed traces and the team has tacitly agreed to pretend it does not exist.

Meanwhile, production drifts. New failure modes appear. Old ones return. The eval set, frozen at whatever the team's vibe was twelve months ago, no longer reflects the product. The dashboard numbers still go up, because the eval set is now measuring a problem the product no longer has.

The diagram everyone draws and nobody runs

Here is what the loop is supposed to do: every shipped failure becomes a permanent regression test, so that the second time it happens it is caught before users see it. This is the only mechanism by which an eval set gets better rather than just bigger. Without it, the eval set is a snapshot of what the team thought to test on day one — useful, but it cannot keep up with reality.

The loop is the only way the eval set tracks the world. If triage is broken, the loop is broken. If the loop is broken, your evals are stale. If your evals are stale, your green dashboard is fiction.

Most teams' instinct, when they realize the loop is broken, is to tool their way out of it. They buy an observability platform. They wire up tracing. They build a Slack bot. None of that fixes triage. Triage rots from lack of attention, not lack of tooling. You can have all the traces in the world; if no human is reading them with the question 'should this become a permanent eval case,' the loop is still open.

Ritual one: the weekly traffic-light review

Forty-five minutes, same time every week, calendar-blocked, on the engineering team's calendar (not the PM's, not the data scientist's). Sample fifty production traces — stratified by feature surface, weighted toward low-confidence outputs and user-flagged ones. Three buckets: fine, interesting, regression. Anything in the regression bucket goes into the eval set this week, with an expected-behavior note attached. Anything interesting gets a follow-up ticket.
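To make the sampling step concrete, here is a minimal Python sketch of the stratified, weighted draw. The trace fields (feature, confidence, user_flagged) and the specific boost weights are assumptions about what a trace store records, not a prescription; swap in whatever yours actually logs.

import random
from collections import defaultdict

def sample_for_review(traces, n=50):
    # Stratify by feature surface so every surface gets looked at.
    by_feature = defaultdict(list)
    for t in traces:
        by_feature[t["feature"]].append(t)
    quota = max(1, n // len(by_feature))
    picked = []
    for group in by_feature.values():
        # Boost low-confidence and user-flagged traces within each stratum.
        weights = [
            1.0
            + (2.0 if t["confidence"] < 0.5 else 0.0)
            + (3.0 if t.get("user_flagged") else 0.0)
            for t in group
        ]
        # random.choices samples with replacement; dedupe in practice.
        picked.extend(random.choices(group, weights=weights, k=min(quota, len(group))))
    return picked[:n]

The script only decides what lands in front of the reviewers. The three buckets stay human.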

The ritual is short, scoped, and scheduled. The schedule is the load-bearing part. Every team I have seen succeed has regularly scheduled triage. Every team I have seen fail has 'we triage when it comes up.'

Ritual two: the postmortem-to-eval pipeline

Every AI-related incident generates one or more concrete eval cases. Not 'we should test for this' in the action items. The actual cases, written and merged, before the postmortem is closed.

This is the cheapest, highest-leverage policy I know in this space. It costs the team thirty minutes per incident and gives them a permanent, named regression case that prevents the same class of failure from recurring silently. Without the policy, postmortems generate good intentions and nothing else.
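For what 'written and merged' can look like, here is a minimal pytest-style sketch. The incident number, the prompt, and the run_model fixture are hypothetical stand-ins for whatever harness the team already runs its evals through.

# evals/regressions/test_incident_2142_gift_card_refund.py
# Hypothetical regression case, merged before the postmortem closes.

EXPECTED_BEHAVIOR = (
    "Must not invent a refund guarantee for gift cards; "
    "cite the policy document or escalate to support."
)

def test_incident_2142_gift_card_refund(run_model):
    # run_model: the team's existing harness for calling the system.
    output = run_model("Can I get a refund on a digital gift card?")
    assert "guaranteed refund" not in output.lower()

The value is in the named, permanent artifact: a year from now, a red test_incident_2142 tells you exactly which shipped failure came back.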

Ritual three: the staleness audit

Once a quarter, sample twenty cases from the eval set at random and ask: is this case still representative of how users use the product today? If the answer is no for more than a small fraction, the eval set has drifted out of sync with reality, and the team needs to refresh — not by deleting old cases, which loses regression coverage, but by adding new cases that reflect the current product surface.

This is the ritual most teams skip. They will add cases forever and never audit them. The result is an eval set that is technically large and substantively narrow — heavy on the failure modes of 2024 and silent on the failure modes of today.
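If it helps to mechanize the draw, here is a minimal sketch, assuming each case is a dict with a name and that 'a small fraction' means 25 percent. Both are placeholders; the representativeness judgment inside the loop is the actual work.

import random

def staleness_audit(eval_cases, k=20, max_stale=0.25):
    # Sample k cases; a human answers the representativeness question.
    sample = random.sample(eval_cases, min(k, len(eval_cases)))
    stale = sum(
        1
        for case in sample
        if input(f"Still representative? [y/N] {case['name']} ").strip().lower() != "y"
    )
    # True means a refresh is due: add current-product cases, don't delete old ones.
    return (stale / len(sample)) > max_stale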

Permanent regression cases are the only test set that earns its keep

An eval set composed of cases the team thought up in advance ages out fast. An eval set composed of cases drawn from real production failures cannot age out, because the world keeps generating new ones. The triage loop is the engine that converts the world's failures into your team's permanent immunity.

The teams that win at AI quality are not the teams with the most clever evals or the fanciest tools. They are the teams whose triage loop closes — every week, every postmortem, every quarter. Boring rituals. Outsized compounding.

If you only fix one thing about your AI quality program this year, fix the loop.

The diagram is correct. The work is in making the diagram describe a thing your team actually does on Tuesday afternoons.