Field notes on AI quality
Opinions, observations, and arguments from the front line of evaluating AI systems in production. Written for the engineers and leaders who have to ship.
The eval set is the product
Models get swapped out. Prompts get rewritten. Harnesses get rebuilt. The eval set is the only artifact that compounds. Most teams treat it like disposable test infrastructure, and pay for it twice.
More posts
Your LLM judge is lying to you (quietly, on a schedule)
LLM-as-judge is the most cost-effective scoring tool in AI quality work. It is also the most common source of false confidence. Here are the four ways it deceives you, and the calibration cadence almost no team runs.
Triage is the eval loop
Every AI team draws the diagram: production failures flow back into the eval set, the eval set drives the next release. Almost no team actually runs that loop. Here is why it rots, and the three rituals that keep it alive.