online eval
Evaluation using live or near-live traffic, feedback, or production outcomes — measures what actually happens when users interact with the deployed system.
The online eval measured whether users accepted the agent's drafts.
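As a rough illustration of the definition above, an online eval can be as simple as computing an acceptance rate over logged production interactions. The sketch below is not taken from any particular system; the `DraftEvent` record, its fields, and the `acceptance_rate` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class DraftEvent:
    """One logged production interaction (hypothetical schema)."""
    draft_id: str
    accepted: bool  # True if the user accepted the agent's draft


def acceptance_rate(events: list[DraftEvent]) -> float:
    """Return the fraction of logged drafts that users accepted."""
    if not events:
        raise ValueError("no events logged yet")
    return sum(e.accepted for e in events) / len(events)


# Assumed sample of live feedback: three of four drafts were accepted.
events = [
    DraftEvent("d1", True),
    DraftEvent("d2", False),
    DraftEvent("d3", True),
    DraftEvent("d4", True),
]
print(acceptance_rate(events))  # 0.75
```

In practice the events would come from production logs rather than a hard-coded list, and the rate would usually be tracked per release so it can be compared against offline eval results.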
The offline eval caught a regression in citation quality.
The citation requirement came from legal, not engineering.
The FDE required a grounded answer for every compliance recommendation.
The prompt pack changed with the workflow, so the FDE versioned it with the release.
The eval rubric penalized answers without cited policy sections.
The evaluation dataset came from real tickets, stripped of sensitive fields.
The golden dataset included the weird edge cases operators cared about.
The FDE added the customer's top failure cases to the eval harness.
The FDE shipped a bespoke agentic solution, then identified which parts belonged in product.
The agent operating model assigned support ownership before rollout.
The agent guardrail blocked write-back unless the user approved the change.
The agent handoff sent low-confidence cases to a supervisor with the evidence attached.
The FDE reduced agent orchestration to one router and three tools.
The sub-agent handled policy lookup while the main agent drafted the answer.
The FDE added an agent skill for drafting renewal summaries.
The agent rollout started with recommendations before write-back.
The FDE simplified the agent workflow after observing three unnecessary tool calls.
The agentic enterprise needed standard tool permissions before teams built dozens of agents.
The agentic application opened the right claim view and prepared the adjustment recommendation.