A look at the evaluation harness behind the judge and the scoring rubric we use.
How we surface and correct for systematic bias in model evaluations.