May 4, 2026 · ~7 min read
evals · strategy · thought-leadership

The eval set is the product

Models swap. Prompts get rewritten. Harnesses get rebuilt. The eval set is the only artifact that compounds. Most teams treat it like test infrastructure — and pay for it twice.

James Kip
Writes about evaluating AI systems in production. Builder of Bonsai.

Walk into any AI team and ask what their most important asset is. You will hear: the model. Or the prompt. Or, on a self-aware day, the data. You will almost never hear: the eval set. Every one of those answers is wrong, and the cost of being wrong about it is the single biggest reason AI products plateau six months after launch.

What actually persists

Look at what survives a year on an AI team. The model you picked? Replaced — usually twice. The prompts? Rewritten when the model changed, then again when product scope shifted. The harness? Rebuilt the moment you outgrew the notebook. The retrieval index? Re-embedded on the new model. Even the team rotates.

The eval set is the only artifact that survives all of that. It encodes the question your product is actually trying to answer: what does good look like, in our domain, for our users, on the cases that matter. Everything else is implementation.

This is the inversion most teams miss. The model is the cheap, swappable layer. The eval set is the expensive, irreplaceable one. We have it backwards in our heads because the model is what we pay for and the eval set is what we build. Cost is not the same as value.

The eval set predicts the next model

The clearest test of whether a team treats evals as product or as infrastructure: how long does it take you to evaluate a new frontier model on your domain?

For a team that treats the eval set as the product, the answer is a Tuesday afternoon. They have stable scoring, calibrated judges, a dataset that reflects their real workload, and a baseline they trust. New model lands? Run it through, read the deltas, decide.
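What does that Tuesday afternoon look like? Something like the sketch below. Every name in it is an assumption, not a prescription: an evalset.jsonl of {id, input, expected} records, a call_model client you already have, and an exact-match grader standing in for whatever calibrated judge you actually use.

```python
import json

def load_cases(path="evalset.jsonl"):
    # One JSON object per line: {"id": ..., "input": ..., "expected": ...}
    with open(path) as f:
        return [json.loads(line) for line in f]

def grade(case, output):
    # Stand-in grader: exact match. Swap in your calibrated judge.
    return output.strip() == case["expected"].strip()

def run_model(call_model, cases):
    # Per-case pass/fail, keyed by case ID so flips stay traceable later.
    return {c["id"]: grade(c, call_model(c["input"])) for c in cases}

def compare(baseline, candidate):
    # Aggregate delta between two per-case result dicts.
    b = sum(baseline.values()) / len(baseline)
    c = sum(candidate.values()) / len(candidate)
    print(f"baseline {b:.1%} -> candidate {c:.1%} ({c - b:+.1%})")
```

The aggregate number is the headline; the per-case dicts are what make the decision defensible.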

For a team that treats evals as infrastructure, the answer is a quarter. They have to remember which prompts worked, hand-run a few examples, argue about whether vibes-based regressions count, and eventually ship the new model because the CEO read a tweet about it. They are not evaluating models. They are betting on them.

The difference is not engineering talent. It is whether the team treats the eval set as a thing they own and improve, or a thing that sometimes gets touched.

Three signs you've inverted it

1) Your eval set is shorter than your prompt. This is more common than you would think. Teams will iterate on a 2,000-token prompt for weeks while their 'eval suite' is twelve hand-picked examples a PM wrote in a Google Doc.

2) You can't tell me which cases regressed. When a number moves, can you point to the specific examples that flipped? If the answer is 'we'll have to look,' you are flying on aggregate metrics — which is fine for dashboards and useless for shipping decisions. (A sketch of the flip report follows this list.)

3) Triage doesn't feed back. Production failures get a Slack thread, maybe a postmortem. They do not get added to the eval set as permanent regression cases. The eval set is frozen in time at whatever vibe the team had when they wrote it, while the product moves on.
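The flip report from sign 2 fits in a dozen lines once results are stored per case. A minimal sketch, assuming per-case pass/fail dicts keyed by case ID; the toy data is invented for illustration:

```python
def flipped_cases(baseline, candidate):
    # Case IDs whose pass/fail outcome changed between two runs.
    regressed = sorted(cid for cid, ok in candidate.items()
                       if not ok and baseline.get(cid))
    fixed = sorted(cid for cid, ok in candidate.items()
                   if ok and baseline.get(cid) is False)
    return regressed, fixed

# Toy per-case results: {case_id: passed}
baseline = {"refund-loop": True, "empty-cart": True, "pii-leak": False}
candidate = {"refund-loop": False, "empty-cart": True, "pii-leak": True}
regressed, fixed = flipped_cases(baseline, candidate)
print("regressed:", regressed)  # ['refund-loop']
print("fixed:", fixed)          # ['pii-leak']
```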

What changes when you flip it

When the eval set is the product, the priorities reshuffle. You spend engineering cycles on it the way you spend them on user-facing features. You staff it. You version it. You write deprecation notices when you remove cases. You argue about coverage the way other teams argue about test coverage — because that is what it is.

The payoff: every model release becomes free leverage. Every prompt rewrite becomes a measurable, defensible decision. Every regression caught is a permanent inoculation. The team's velocity stops being bottlenecked by 'does this feel better?' meetings, because the eval set answers that question in minutes instead of weeks.

This is not a tooling argument. You can do it with a CSV and a script. It is a seriousness argument. You either treat the artifact that compounds as the most important thing you own, or you do not.
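For the avoidance of doubt, here is roughly what 'a CSV and a script' means. A sketch under assumed names: a cases.csv with id, input, expected columns, a call_model stub you would replace with your real client, and a deliberately crude contains-check grader.

```python
import csv

def call_model(prompt: str) -> str:
    # Stub: replace with your actual model client.
    raise NotImplementedError

def main(path="cases.csv"):
    passed, failed = 0, []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # columns: id, input, expected
            output = call_model(row["input"])
            if row["expected"].strip().lower() in output.lower():
                passed += 1
            else:
                failed.append(row["id"])
    total = passed + len(failed)
    print(f"{passed}/{total} passed; failed: {failed}")

if __name__ == "__main__":
    main()
```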

How to start, on a Monday

Pick the last ten production incidents where the AI behaved badly. Add them to your eval set as named cases, with the expected behavior written down. That is your starting kernel.
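One workable shape for that kernel: named cases that carry their expected behavior and point back to the incident that produced them. A sketch; every field name here is an assumption, and a YAML or JSONL file serves just as well as Python:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    id: str        # stable, human-readable name
    input: str     # the prompt or context that triggered the incident
    expected: str  # the behavior, written down, not just "better"
    source: str    # link back to the incident so the case stays explainable

KERNEL = [
    EvalCase(
        id="refund-loop-2026-04",
        input="I was charged twice, refund both please",
        expected="Acknowledges the duplicate charge; offers one refund, not two.",
        source="INC-1042",  # hypothetical incident-tracker ID
    ),
    # ...nine more, one per incident
]
```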

Now set a rule: no AI feature ships without adding at least three new cases to the eval set. Most teams refuse to do this because it slows them down. That is the point. You are not slowing down shipping; you are slowing down forgetting. The team that remembers more wins more.
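If you would rather enforce the rule than remember it, a pre-merge check can do it. A sketch assuming the eval set lives one case per line in evalset.jsonl and the check runs in CI inside a git checkout; the file name, base branch, and threshold are all yours to change:

```python
import subprocess, sys

def added_cases(base="origin/main", path="evalset.jsonl") -> int:
    # Count lines added to the eval set file on this branch.
    diff = subprocess.run(
        ["git", "diff", base, "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in diff.splitlines()
               if line.startswith("+") and not line.startswith("+++"))

if __name__ == "__main__":
    n = added_cases()
    if n < 3:
        sys.exit(f"Eval gate: {n} new case(s) added, need at least 3.")
    print(f"Eval gate passed: {n} new case(s).")
```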

In six months, the eval set will be the artifact people fight to maintain access to. In a year, it will be the thing recruiters mention when a senior IC interviews. In two years, it will be the moat.

The model is rented. The prompts are temporary. The eval set is the only thing your team is actually building. Treat it that way.