AI Agent Evals: How to Test Agents Before They Break in Production
Agents fail in ways that traditional software doesn’t. A unit test can verify that a function returns the right value. An agent can return a plausible-looking answer arrived at through completely wrong reasoning, and no assert statement will catch it.
This makes testing AI agents genuinely harder than testing regular code. The output is non-deterministic. The path to the answer matters, not just the answer itself. And the agent may behave well on your test inputs while failing on inputs it has never seen. You need a different mental model for evals.
Two Approaches to Agent Testing
Offline Evals: Golden Datasets
The foundation of agent evaluation is a golden dataset: a collection of (input, expected behavior) pairs that you run your agent against before shipping changes.
“Expected behavior” is doing a lot of work in that sentence. For a deterministic function, you’d write the expected output. For an agent, you usually specify one of:
- Task completion: Did the agent accomplish the goal? (Binary or scored on a rubric.)
- Output quality: Does the agent’s response meet a set of criteria? This is where LLM-as-judge comes in. You use a second LLM to evaluate the first one’s output against a rubric like “Was the answer factually accurate?” or “Did it follow the user’s instructions?”
- Behavioral traces: Did the agent take the right steps to get there? (More on this below.)
Build your golden dataset from real user interactions. Log your production traffic, find representative examples, and annotate which outcomes were correct. Fifty good examples beat a thousand synthetic ones.
The key metric here is task completion rate: what percentage of golden-set tasks does the agent finish correctly? Track this number over time. A drop tells you something regressed. A rise tells you something improved.
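To make that concrete, here is a minimal harness in Python. The `run_agent` entry point and the per-example `check` grader are placeholders for whatever your agent and grading logic actually look like; nothing here assumes a particular framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    input: str
    # Grader returns True if the agent's output satisfies "expected behavior"
    # for this input: an exact match, a rubric check, or an LLM judge.
    check: Callable[[str], bool]

def run_golden_set(run_agent: Callable[[str], str],
                   examples: list[GoldenExample]) -> float:
    """Run the agent over the golden set and return task completion rate."""
    passed = 0
    for ex in examples:
        output = run_agent(ex.input)  # your agent's entry point
        if ex.check(output):
            passed += 1
    return passed / len(examples)

# Example usage with a trivial string-containment grader.
golden = [
    GoldenExample(
        input="What is the capital of France?",
        check=lambda out: "Paris" in out,
    ),
]
# completion_rate = run_golden_set(my_agent, golden)
```

Run this before every change and log the number; the trend matters more than any single run.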
Online Monitoring: Production Sampling
Offline evals catch regressions before deployment. Online monitoring catches failures after deployment.
Sample a fraction of your production traces and review them. You’re looking for:
- Anomalies: The agent spent 10 tool calls doing something that usually takes 2.
- Failure patterns: A class of inputs that consistently produces bad outputs.
- Error rates: Tool calls that return errors at an elevated rate.
Track task success rates over time. If your agent successfully completes 90% of requests on Monday and 75% on Friday, something changed. It might be the model, a tool that started returning different data, or a shift in the distribution of user inputs.
Tools like Braintrust, Arize Phoenix, and Promptfoo specialize in this kind of agent observability. They make it easier to store traces, run evaluations, and track metrics over time. They’re worth knowing about.
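If you want to roll the sampling and anomaly checks yourself, a minimal sketch looks like the following. It assumes each trace is a dict with a `tool_calls` list whose entries may carry an `error` field; that schema is an assumption for illustration, not any particular tool’s format.

```python
import random

def sample_traces(traces: list[dict], fraction: float = 0.05) -> list[dict]:
    """Sample a fraction of production traces for manual review."""
    if not traces:
        return []
    k = max(1, int(len(traces) * fraction))
    return random.sample(traces, k)

def flag_anomalies(traces: list[dict],
                   max_tool_calls: int = 5,
                   max_error_rate: float = 0.1) -> list[dict]:
    """Flag traces with unusually many tool calls or elevated tool errors."""
    flagged = []
    for trace in traces:
        calls = trace.get("tool_calls", [])
        errors = [c for c in calls if c.get("error")]
        too_many_calls = len(calls) > max_tool_calls
        too_many_errors = bool(calls) and len(errors) / len(calls) > max_error_rate
        if too_many_calls or too_many_errors:
            flagged.append(trace)
    return flagged
```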
What to Measure
For any agent eval, you want data on three things:
- Did the agent complete the task? The top-level success metric.
- Did it hallucinate? Cross-reference the agent’s claims against source material where possible. LLM-as-judge can help here; a sketch follows this list.
- Did it take the right path? For agents that use tools, this is where evaluation gets interesting.
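Here is one way to sketch the hallucination check as an LLM-as-judge grader. The `call_judge_model` function is a placeholder for whichever model client you use, and the SUPPORTED/UNSUPPORTED protocol is illustrative, not a standard.

```python
JUDGE_PROMPT = """You are grading an AI agent's answer.

Source material:
{source}

Agent answer:
{answer}

Is every factual claim in the answer supported by the source material?
Reply with exactly one word: SUPPORTED or UNSUPPORTED."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: swap in your LLM client of choice.
    raise NotImplementedError

def is_grounded(answer: str, source: str) -> bool:
    """LLM-as-judge check: does the answer stay within the source material?"""
    verdict = call_judge_model(JUDGE_PROMPT.format(source=source, answer=answer))
    return verdict.strip().upper().startswith("SUPPORTED")
```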
Evaluating Tool-Using Agents
Agents that call external tools have an extra evaluation layer: the tool call itself.
An agent can produce a correct final answer while calling the wrong tools, calling tools unnecessarily, or misinterpreting what a tool returned. These failures matter in production because they add latency, cost money, and mean the right answer was reached by accident. An agent that happens to get the right answer through the wrong tool calls is a liability, not an asset.
The metrics that matter for tool-using agents:
- Tool call precision: When the agent called a tool, was it the right tool for the step? An agent that calls a web scraper to answer a question about stock prices when a stock quote tool is available has low precision.
- Tool call recall: Did the agent call a tool when it should have? Skipping a necessary tool call and guessing instead is a recall failure.
- Input quality: When the agent called the right tool, did it construct sensible inputs? An agent that calls a search tool with a query like “the thing I need to know” picked the right tool but gave it an input that can’t do the job.
- Response interpretation: Did the agent correctly interpret the tool’s output? A tool returning a list of results is an opportunity for the agent to misread the schema, ignore relevant results, or confuse fields.
To evaluate these, you need traces. Each trace should record which tools were called, with what parameters, and what the tool returned. Then you can compare the agent’s actual tool call sequence against the expected sequence for that input.
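Given traces in that shape, precision and recall reduce to a set comparison over tool names. The sketch below makes the simplifying assumption that ordering doesn’t matter and that parameters are checked separately.

```python
def tool_call_precision_recall(actual: list[str],
                               expected: list[str]) -> tuple[float, float]:
    """Compare the tools an agent actually called against the expected tools.

    Precision: fraction of actual calls that were expected.
    Recall:    fraction of expected calls that actually happened.
    Compares tool names as sets; parameters and ordering are a separate check.
    """
    actual_set, expected_set = set(actual), set(expected)
    hits = actual_set & expected_set
    precision = len(hits) / len(actual_set) if actual_set else 1.0
    recall = len(hits) / len(expected_set) if expected_set else 1.0
    return precision, recall

# e.g. the agent used a web scraper instead of the stock quote tool:
# tool_call_precision_recall(["web_scraper"], ["stock_quote"])  -> (0.0, 0.0)
```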
Anthropic published a guide in 2026 on demystifying evals for agents, and their framing holds: treat each tool call as a testable unit. If you can assert “for this input, the agent should have called tool X with parameters approximately like Y,” you have a meaningful test.
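One way to phrase that assertion in a test, with deliberately loose parameter matching. The trace fields `name` and `params` are assumed for illustration, not a fixed schema.

```python
def assert_called_tool(trace: dict, tool: str, approx_params: dict) -> None:
    """Assert the trace contains a call to `tool` whose params roughly match.

    "Roughly" here means: every expected key is present and string values
    match case-insensitively. Loosen or tighten to taste.
    """
    for call in trace.get("tool_calls", []):
        if call["name"] != tool:
            continue
        params = call.get("params", {})
        if all(
            str(params.get(key, "")).lower() == str(value).lower()
            for key, value in approx_params.items()
        ):
            return
    raise AssertionError(f"no call to {tool} with params like {approx_params}")

# In a test: assert_called_tool(trace, "stock_quote", {"ticker": "AAPL"})
```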
Using AgentPatch as a Monitoring Layer
When you use AgentPatch as your tool layer, every tool call routes through a single API endpoint. This gives you a natural logging point. Every call is recorded.
You can check your tool call history through the AgentPatch API to audit what your agents are actually doing. Credit deductions give you a concrete signal: if an agent is burning more credits than expected on a given task type, it’s calling tools it shouldn’t be, or failing and retrying. Both of those are bugs worth catching.
This kind of monitoring complements your offline evals. Offline evals tell you whether the agent behaves correctly on your test set. Production monitoring tells you whether it behaves correctly on real traffic. You need both.
If your agent calls a search tool 12 times for a task that should take 2 searches, that shows up in the credit history before it shows up in user complaints.
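A budget check over that history is a few lines. Everything below is hypothetical: the field names, the task types, and the budget numbers are placeholders, so adapt them to however your AgentPatch call history is actually shaped.

```python
from collections import defaultdict

# Expected credit budget per task type; task types and numbers are illustrative.
CREDIT_BUDGET = {"search_summary": 4, "price_lookup": 2}

def over_budget_tasks(history: list[dict]) -> dict[str, float]:
    """Average credit spend per task type, filtered to types over budget.

    Assumes each record has 'task_type' and 'credits' fields (hypothetical
    names); adapt to the real shape of your tool call history.
    """
    spend, counts = defaultdict(float), defaultdict(int)
    for record in history:
        spend[record["task_type"]] += record["credits"]
        counts[record["task_type"]] += 1
    averages = {task: spend[task] / counts[task] for task in spend}
    return {
        task: avg for task, avg in averages.items()
        if avg > CREDIT_BUDGET.get(task, float("inf"))
    }
```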
Wrapping Up
AI agent evals are not a solved problem. The field is moving fast, and the tooling is maturing. But the core principles are stable: build a golden dataset, track task completion rate, monitor production traces, and pay close attention to tool call behavior.
For agents that use external tools, tool call precision and recall are the metrics that matter most. Get good at capturing traces and evaluating them.
If you want a tool layer built for monitoring, take a look at agentpatch.ai.