Evaluating structured outputs

Parse rate is not correctness — they're two different evals.

Structured outputs (JSON responses, tool calls, function arguments) are table stakes for any AI system that does more than chat. They look easy to evaluate (just call JSON.parse), and that is the trap. Parse rate measures syntax: did the output parse? Semantic accuracy measures intent: do the parsed values say what they should? Conflating the two is how teams ship 99% 'success' with 30% real correctness.
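To make the distinction concrete, here is a minimal sketch that scores the same outputs two ways. It uses Python's json.loads as the counterpart of JSON.parse. The test cases, the helper names (try_parse, evaluate), and the choice of exact match against an expected value as the semantic check are all illustrative assumptions; a real eval would likely need a looser or task-specific comparison.

```python
import json

# Hypothetical cases: (raw model output, expected arguments).
CASES = [
    # Parses and matches the expected value.
    ('{"city": "Paris", "unit": "celsius"}', {"city": "Paris", "unit": "celsius"}),
    # Parses, but the value is wrong.
    ('{"city": "Paris", "unit": "kelvin"}', {"city": "Paris", "unit": "celsius"}),
    # Truncated output: does not parse at all.
    ('{"city": "Paris", "unit": "celsius"', {"city": "Paris", "unit": "celsius"}),
]


def try_parse(raw):
    """Return (ok, value); ok is False when raw is not valid JSON."""
    try:
        return True, json.loads(raw)
    except json.JSONDecodeError:
        return False, None


def evaluate(cases):
    """Report parse rate and semantic accuracy as separate metrics."""
    parsed = 0
    correct = 0
    for raw, expected in cases:
        ok, value = try_parse(raw)
        if ok:
            parsed += 1
            # Assumption: exact match stands in for semantic correctness.
            if value == expected:
                correct += 1
    n = len(cases)
    return {"parse_rate": parsed / n, "semantic_accuracy": correct / n}


scores = evaluate(CASES)
print(scores)
```

Two of the three outputs parse, but only one matches what was expected, so the two numbers diverge even on this tiny set. Reporting them separately keeps a high parse rate from masking wrong values.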
