April 18, 2026 · ~9 min read
llm-as-judge · calibration · evals

Your LLM judge is lying to you (quietly, on a schedule)

LLM-as-judge is the most cost-effective scoring tool in AI quality work. It is also the most common source of false confidence. Here are the four ways it deceives you, and the calibration cadence almost no team runs.

James Kip
Writes about evaluating AI systems in production. Builder of Bonsai.

Every team I talk to that has 'gone serious' on AI quality has, somewhere, a dashboard with a green number on it. The number is from an LLM judge. The judge is grading the team's outputs against a rubric. The number is going up over time. The team is celebrating.

Most of those numbers are decorative.

This is not a critique of LLM-as-judge as a technique — it is the most cost-effective scoring tool we have, and there is no replacement on the horizon. It is a critique of how teams deploy it: as a fire-and-forget oracle rather than a measurement instrument that needs calibration the same way a thermometer does.

Bias one: position

Show a judge model two responses, A and B, and ask which is better. Many models — including current frontier ones — pick A more often than chance. The bias is small, often single-digit percentage points, but it is real and it does not go away with prompting alone.

The practical consequence: any pairwise eval that always puts the new variant in the same slot will systematically over- or under-rate it. Teams discover this when they swap A and B and the result flips. By then the team has already shipped on the original ordering.

Fix: randomize position per example, and check that you are randomizing. Audit a sample of your judge calls. I have seen at least three teams whose 'random' position was deterministic in a way nobody noticed until I asked.
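
A minimal sketch of both the randomization and the audit. `judge_fn` is a stand-in for your own judge call, not a real API; the 50/50 band is a starting point, not a standard:

```python
import random

def judge_pair(baseline: str, candidate: str, judge_fn, flip_log: list) -> str:
    """Randomize slot assignment per example, then undo the flip.

    judge_fn is a placeholder: (slot_a, slot_b) -> "A" or "B".
    """
    flipped = random.random() < 0.5
    flip_log.append(flipped)
    slot_a, slot_b = (candidate, baseline) if flipped else (baseline, candidate)
    verdict = judge_fn(slot_a, slot_b)
    # "A" won and candidate was in A, or "B" won and candidate was in B.
    candidate_won = (verdict == "A") == flipped
    return "candidate" if candidate_won else "baseline"

def audit_flips(flip_log: list) -> None:
    """The audit: 'random' assignment should sit near 50/50 over a few
    hundred calls. Widen the band for small samples."""
    rate = sum(flip_log) / len(flip_log)
    assert 0.45 < rate < 0.55, f"position assignment looks deterministic ({rate:.2f})"
```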

Bias two: length

Judges reward longer outputs. Not always, not on every rubric, but often enough to matter. The mechanism is plausible: longer outputs feel more thorough, more careful, more 'tried.' The mechanism is also wrong — many of the best outputs in real product work are shorter than the alternatives, because brevity is a feature.

If you are not measuring the correlation between output length and judge score, you are not measuring quality. You are measuring length, plus noise.

Fix: log output length alongside the score. Compute the correlation per rubric dimension. If the correlation is high and length is not actually a quality dimension you care about, your judge is broken. The fastest mitigation is to include a length-controlled comparison in the rubric or to grade pairs of outputs that have been length-matched.
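
A sketch of that check. The record shape is an assumption about your logging, and the choice of Pearson correlation is a default, not a requirement:

```python
from statistics import correlation  # Python 3.10+

def length_score_correlation(records: list[dict]) -> dict[str, float]:
    """Pearson r between output length and judge score, per rubric dimension.

    Assumed record shape: {"output": str, "scores": {"helpfulness": 4, ...}}.
    """
    lengths = [len(r["output"].split()) for r in records]
    dims = records[0]["scores"].keys()
    return {
        dim: correlation(lengths, [r["scores"][dim] for r in records])
        for dim in dims
    }
```

A strong positive correlation on a dimension where length is not a stated criterion is the smoking gun.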

Bias three: verbosity in the input

Different from output length: this is the bias judges show toward the style of the text they are grading, not its size. Long, hedge-filled, qualifier-heavy responses get rated as more careful. Confident, terse responses get rated as 'underdeveloped.'

This one is dangerous because it points at the wrong work. Teams notice their judge prefers verbose outputs and start prompting their main model to be more verbose, which makes the product worse and the judge happier. The dashboard goes up. Users complain more. Nobody connects the two for two quarters.
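
One cheap way to catch this before it warps your prompting: measure qualifier density in the judged text and correlate it with the judge's score. The hedge list and the flat record shape below are illustrative, not a canon:

```python
import re
from statistics import correlation  # Python 3.10+

# Illustrative qualifier list; tune it to your domain.
HEDGES = re.compile(
    r"\b(may|might|could|perhaps|arguably|generally|typically|somewhat|likely)\b",
    re.IGNORECASE,
)

def hedge_density(text: str) -> float:
    words = text.split()
    return len(HEDGES.findall(text)) / max(len(words), 1)

def hedge_score_correlation(records: list[dict]) -> float:
    """Assumed record shape: {"output": str, "score": float}. A high
    correlation, with no 'carefulness' dimension in the rubric, means
    the judge is rewarding hedging, not quality."""
    return correlation(
        [hedge_density(r["output"]) for r in records],
        [r["score"] for r in records],
    )
```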

Bias four: model-affinity

Judges score outputs from their own model family higher. A Claude judge mildly prefers Claude outputs. A GPT judge mildly prefers GPT outputs. The bias is small per-example and devastating in aggregate when you are evaluating a model swap.

This is the bias that ends careers. A team running a judge from family X decides whether to switch their main model from family X to family Y. The judge says no. The team stays on X. Six months later a competitor on Y has shipped past them, and nobody can explain it because the eval said the swap would be neutral.

Fix: never use a judge from the same family as either model in a swap evaluation. If you must, use two judges from different families and require agreement.
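
A sketch of the agreement rule. `judge_a` and `judge_b` are hypothetical wrappers around judges from two different model families:

```python
def swap_verdict(output_x: str, output_y: str, judge_a, judge_b) -> str:
    """Count a swap verdict only when judges from two different model
    families agree. Each judge is assumed to return "X", "Y", or "tie".
    """
    v1 = judge_a(output_x, output_y)
    v2 = judge_b(output_x, output_y)
    if v1 == v2:
        return v1
    return "disagreement"  # route to human review; do not average it away
```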

Why human agreement is not calibration

Most teams' calibration story is: 'we hand-graded a hundred examples, the judge agreed with us 78% of the time, ship it.' That is a snapshot of agreement at one point in time on one slice of inputs. It is not calibration.

Real calibration looks like: agreement at each score level, broken out by relevant slices, recomputed on a cadence, with thresholds for when to recalibrate the rubric. A judge that agrees with humans 95% of the time on the easy cases and 40% on the cases that matter can still average 78% overall (if roughly seven in ten of your examples are easy: 0.7 × 95 + 0.3 × 40 ≈ 78), and it is useless exactly where you need it.
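
What that looks like in code: agreement computed per (slice, human score level) cell rather than as one overall number. The field names are assumptions about your logging:

```python
from collections import defaultdict

def agreement_breakdown(pairs: list[dict]) -> dict[tuple, float]:
    """Agreement rate per (slice, human score level).

    Assumed record shape: {"slice": "refunds", "human": 2, "judge": 4}.
    """
    cells = defaultdict(lambda: [0, 0])  # key -> [agreements, total]
    for p in pairs:
        key = (p["slice"], p["human"])
        cells[key][0] += int(p["judge"] == p["human"])
        cells[key][1] += 1
    return {key: agree / total for key, (agree, total) in cells.items()}
```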

The minimum cadence I would defend in front of a skeptic: every model version, every rubric change, every quarter, and any time the judge's score distribution shifts more than a few points without a clear cause.
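
The last trigger, a shift in the score distribution, is cheap to automate. A deliberately crude sketch; a population-stability or KS test is the obvious upgrade:

```python
from statistics import mean

def score_shifted(last_window: list[float], this_window: list[float],
                  threshold: float = 3.0) -> bool:
    """Flag when the mean judge score moves more than `threshold` points
    between two time windows (e.g., last month vs. this month).
    The threshold is an assumption; set it from your own score scale."""
    return abs(mean(this_window) - mean(last_window)) > threshold
```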

The thing nobody tells you

Judge models drift independently of your product. The judge provider releases a new model snapshot. The judge's defaults change. The judge's calibration on your rubric quietly moves. Your dashboard moves a few points with no change in your product, your prompt, your data, or your model. That movement is not signal; it is contamination from the measurement instrument.

If your judge is pinned to a model version, this is mostly a non-issue — until you have to upgrade the judge for cost or capability reasons, and you are now comparing scores across two different judges as if they were the same number. They are not.

The defense is the same as for any scientific instrument: pin the version, recalibrate when you change it, and never compare scores across judge versions without a translation table built from a shared calibration set.
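
A minimal version of that translation table, assuming both judge versions have graded the same calibration set and the old judge scores in integer levels:

```python
from collections import defaultdict
from statistics import mean

def build_translation_table(calibration_set: list[dict]) -> dict[int, float]:
    """Map old-judge score levels to expected new-judge scores.

    Assumed record shape: {"old": 4, "new": 3.5}, one per calibration
    example graded by both judge versions. With sparse levels, fit a
    monotone regression instead of per-level means.
    """
    by_old = defaultdict(list)
    for r in calibration_set:
        by_old[r["old"]].append(r["new"])
    return {old: mean(news) for old, news in sorted(by_old.items())}
```

Old scores then pass through the table before they ever share an axis with new-judge scores.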

LLM-as-judge is not the problem. Treating it as an oracle instead of an instrument is the problem. The teams that win at AI quality are the ones who put the judge under the same scrutiny as the thing being judged.