offline eval

Evaluation run on a saved dataset, outside live user traffic, before a change reaches production. Used to catch regressions before users see them.
The offline eval caught a regression in citation quality.
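
As a rough illustration of the definition above, the sketch below scores a current system and a proposed change on the same saved dataset and flags a regression if the change scores lower. Everything in it is an assumption made for the example: the dataset contents, the two stand-in model functions, the exact-match metric, and the names `score` and `has_regressed` are hypothetical, and a real offline eval would use its own data and quality metric.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical saved dataset: (input, expected output) pairs recorded ahead of time.
SAVED_DATASET = [
    ("2+2", "4"),
    ("3+5", "8"),
    ("10-7", "3"),
    ("6*7", "42"),
]


def baseline_model(prompt: str) -> str:
    """Stand-in for the system currently in production (hypothetical)."""
    answers = {"2+2": "4", "3+5": "8", "10-7": "3", "6*7": "42"}
    return answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    """Stand-in for the proposed change; it gets one answer wrong (hypothetical)."""
    answers = {"2+2": "4", "3+5": "8", "10-7": "3", "6*7": "41"}
    return answers.get(prompt, "")


def score(model: Callable[[str], str],
          dataset: Sequence[Tuple[str, str]]) -> float:
    """Fraction of saved examples the model answers exactly right."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for prompt, expected in dataset if model(prompt) == expected)
    return correct / len(dataset)


def has_regressed(baseline: Callable[[str], str],
                  candidate: Callable[[str], str],
                  dataset: Sequence[Tuple[str, str]],
                  tolerance: float = 0.0) -> bool:
    """True if the candidate scores worse than the baseline by more than tolerance."""
    return score(candidate, dataset) < score(baseline, dataset) - tolerance


if __name__ == "__main__":
    print("baseline:", score(baseline_model, SAVED_DATASET))
    print("candidate:", score(candidate_model, SAVED_DATASET))
    print("regression:", has_regressed(baseline_model, candidate_model, SAVED_DATASET))
```

Because both systems are scored on the same fixed data, a drop in the candidate's score can be attributed to the change itself and not to shifts in live traffic.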