Field notes on AI quality
Opinions, observations, and arguments from the front line of evaluating AI systems in production. Written for the engineers and leaders who have to ship.
The eval set is the product
Models get swapped out. Prompts get rewritten. Harnesses get rebuilt. The eval set is the only artifact that compounds. Most teams treat it like disposable test infrastructure, and pay for it twice.
More posts
Your LLM judge is lying to you (quietly, on a schedule)
LLM-as-judge is the most cost-effective scoring tool in AI quality work. It is also the most common source of false confidence. Here are the four ways it deceives you, and the calibration cadence almost no team runs.
Triage is the eval loop
Every AI team draws the diagram: production failures flow back into the eval set, the eval set drives the next release. Almost no team actually runs that loop. Here is why it rots, and the three rituals that keep it alive.