Foundations
~12 min
labeling
inter-rater-agreement
gold-sets

Human labeling and calibration

Your eval set is only as honest as the humans who labeled it.

Every automated eval eventually traces back to a human judgment. Gold labels train your judges, calibrate your rubrics, and arbitrate every dispute about whether the model is improving. Carelessly written labels contaminate every downstream signal, often for months, without anyone noticing. Labeling is not a clerical step; it is the foundation the rest of the program stands on.