bonsai
Intermediate
~10 min
requires API key

Build an LLM-as-judge

Author a rubric, judge a real generation, see the bias.

Author a structured rubric, generate a candidate response with Claude, then run a judge call that scores the response per-criterion with evidence quotes. Toggle the 'verbose response' option to see verbosity bias in action.

Learning objectives
  • Write a per-criterion rubric instead of a holistic score.
  • Force structured JSON output from the judge.
  • Observe verbosity bias by comparing scores on short vs. long responses.

1. Generate a candidate response

Tip: try generating once with 'concise', score it, then regenerate with 'verbose' and score the same content again. Verbose responses often score higher even when the underlying claims are the same; that's verbosity bias.
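One way to sketch this step in code: build the generation request with a verbosity toggle, so the same question can be asked in a 'concise' or 'verbose' style. The model id, prompt, and style hints below are illustrative assumptions, not values from the lab.

```python
# Sketch of step 1: build a candidate-generation request with a
# verbosity toggle. Model id and prompt text are assumptions.

CANDIDATE_PROMPT = "Explain why the sky is blue."

def build_generation_request(style: str) -> dict:
    """Return kwargs suitable for an Anthropic messages.create call."""
    style_hint = {
        "concise": "Answer in 2-3 sentences.",
        "verbose": "Answer thoroughly, with background, caveats, and examples.",
    }[style]
    return {
        "model": "claude-sonnet-4-5",  # assumed model id
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": f"{CANDIDATE_PROMPT}\n\n{style_hint}"}
        ],
    }

# With an API key configured, you would hand these kwargs to the SDK:
#   client = anthropic.Anthropic()
#   response = client.messages.create(**build_generation_request("concise"))

print(build_generation_request("verbose")["messages"][0]["content"])
```

Keeping the question identical across both styles is what makes the later score comparison a clean test for verbosity bias.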

2. Author the rubric
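A per-criterion rubric can be expressed as structured data rather than free text, which makes it easy to render into the judge prompt and to score criterion by criterion. The criterion names, descriptions, and scale below are illustrative, not the lab's own rubric.

```python
# Sketch of step 2: a structured rubric. Criteria are assumptions.

RUBRIC = [
    {
        "id": "accuracy",
        "description": "Claims are factually correct and verifiable.",
        "scale": "1-5",
    },
    {
        "id": "completeness",
        "description": "Covers the key points the question requires.",
        "scale": "1-5",
    },
    {
        "id": "clarity",
        "description": "Well organized and easy to follow.",
        "scale": "1-5",
    },
]

def rubric_as_text(rubric: list[dict]) -> str:
    """Render the rubric for inclusion in the judge prompt."""
    return "\n".join(
        f"- {c['id']} ({c['scale']}): {c['description']}" for c in rubric
    )

print(rubric_as_text(RUBRIC))
```

Scoring each criterion separately, instead of asking for one holistic number, is the core move: it forces the judge to commit to evidence per dimension.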

3. Judge
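The judge step can be sketched as two pieces: a prompt that demands strict JSON with a per-criterion score and a verbatim evidence quote, and a parser for the judge's reply. The JSON shape and the canned reply below are assumptions for illustration; no API call is made here.

```python
import json

def build_judge_prompt(rubric_text: str, candidate: str) -> str:
    """Ask the judge to score per-criterion and reply with JSON only."""
    return (
        "Score the response against each criterion below.\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Response:\n{candidate}\n\n"
        "Reply with JSON only, shaped like:\n"
        '{"scores": [{"id": "...", "score": 1, "evidence": "verbatim quote"}]}'
    )

def parse_judge_output(raw: str) -> list[dict]:
    """Parse the judge's JSON, tolerating an accidental fenced code block."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)["scores"]

# Canned judge reply, standing in for a real model call:
canned = (
    '{"scores": [{"id": "accuracy", "score": 4,'
    ' "evidence": "Rayleigh scattering"}]}'
)
for item in parse_judge_output(canned):
    print(item["id"], item["score"], repr(item["evidence"]))
```

Requiring a verbatim evidence quote per score makes the judge's reasoning auditable, and running this same call on the concise and verbose candidates is what surfaces the bias.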