bonsai
Advanced
3 questions · ~6 min

Production, drift, and frontier topics

Online quality, drift detection, and what's hard in 2026.

Q1

You ship a new prompt, and offline evals look good. Four hours later, the quality score from the online judge has dropped 6%, while error rates are flat. What is the most likely cause?

Q2

Which is the BEST input to a self-improving eval pipeline?

Q3

For a computer-use agent eval, what's the most important infrastructure investment?