eval harness
The test infrastructure used to run model or agent evaluations repeatedly against examples, expected behavior, tools, and scoring logic. FDEs add real failure cases from production.
The FDE added the customer's top failure cases to the eval harness.