An AI agent that calls external tools will encounter failures. APIs go down. Rate limits get hit. Input validation rejects a request. Timeouts happen. The difference between a production-grade agent and a demo is how it handles these failures.
Good error handling for AI agents involves three layers: classifying the error, deciding on a recovery strategy, and managing the cost impact. Here’s how to think about each one.
Types of Errors
Server Errors (5xx)
The tool’s backend failed. Maybe a database query timed out, a dependency went down, or there’s a bug in the tool’s code. The agent’s request was valid, but the server couldn’t process it.
What the agent sees: HTTP 500, 502, 503, or similar. Sometimes a generic error message, sometimes nothing useful.
Recovery: Retry after a short delay. Server errors are often transient. A request that fails on the first attempt may succeed on the second. Use exponential backoff: wait 1 second, then 2, then 4. Cap at 3 retries.
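A minimal sketch of that retry loop, assuming a hypothetical `ToolError` exception that carries the HTTP status (your HTTP client will surface errors differently):

```python
import time

TRANSIENT = {500, 502, 503, 504}  # server errors worth retrying

class ToolError(Exception):
    """Hypothetical error type carrying the tool's HTTP status."""
    def __init__(self, status):
        super().__init__(f"tool returned HTTP {status}")
        self.status = status

def call_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry transient server errors with exponential backoff: 1s, 2s, 4s."""
    attempt = 0
    while True:
        try:
            return call()
        except ToolError as err:
            if err.status not in TRANSIENT or attempt >= max_retries:
                raise  # non-transient, or out of retries: surface the error
            sleep(base_delay * (2 ** attempt))  # 1s, then 2s, then 4s
            attempt += 1
```

The `sleep` parameter is injected so the loop can be tested without real delays; in production you'd leave it as `time.sleep`.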
Timeouts
The tool took too long to respond. For synchronous calls, this means the HTTP request timed out. For asynchronous calls (where the agent polls for results), the job may have been running for too long and been killed by the platform.
What the agent sees: A timeout error from the HTTP client, or a job status of “timeout” when polling.
Recovery: Retry once for synchronous timeouts, since the tool may have been momentarily slow. For async timeouts, check whether the tool supports longer job durations. If timeouts are persistent, the tool may be overloaded or the input may be too complex. Consider simplifying the request (smaller query, fewer results requested).
Rate Limits (429)
The agent has made too many requests in a given time window. The tool is telling you to slow down.
What the agent sees: HTTP 429, usually with a Retry-After header indicating how long to wait.
Recovery: Wait for the duration specified in Retry-After, then retry. Don’t retry immediately. If your agent is consistently hitting rate limits, you need to either reduce call frequency or get a higher rate limit from the provider.
For agents that make many tool calls in a single task, implement a client-side rate limiter. Know the limits for each tool and throttle proactively rather than waiting for 429 responses.
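One way to throttle proactively is a sliding-window limiter: block until a call slot frees up rather than firing the request and eating the 429. This is a sketch, not tied to any particular platform's limits:

```python
import time
from collections import deque

class ClientRateLimiter:
    """Allow at most `max_calls` calls per `per_seconds` sliding window.
    `clock` and `sleep` are injectable for testing."""
    def __init__(self, max_calls, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.clock = clock
        self.sleep = sleep
        self.timestamps = deque()  # times of recent calls

    def acquire(self):
        """Block until a call is allowed, then record it."""
        now = self.clock()
        # Drop call records that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_calls:
            # Wait until the oldest call leaves the window, then re-check.
            self.sleep(self.per_seconds - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(now)
```

Call `limiter.acquire()` immediately before each tool call. Keep one limiter per tool, configured from that tool's documented limits.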
Bad Input (400)
The agent sent a request that doesn’t match the tool’s expected input. A required field is missing, a value is the wrong type, or a parameter is out of range.
What the agent sees: HTTP 400 with an error message describing what’s wrong. Good tools return specific validation errors (“field ‘query’ is required”, “num_results must be between 1 and 100”).
Recovery: This is where AI agents have an advantage over traditional software. The agent can read the error message, understand what went wrong, and construct a corrected request. Let the LLM see the error and try once more with fixed parameters.
Don’t retry bad input blindly. If the agent sends the same malformed request three times, it will fail three times. Self-correction or giving up is the only productive option.
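The self-correction loop can be sketched like this. The `correct` callable stands in for the LLM step that reads the validation error and rewrites the parameters, and `BadInputError` is a hypothetical wrapper for an HTTP 400:

```python
class BadInputError(Exception):
    """Hypothetical wrapper for an HTTP 400 with a validation message."""
    def __init__(self, message):
        super().__init__(message)
        self.message = message

def call_with_correction(call, params, correct):
    """Try once; on a 400, let the model rewrite the params from the
    error message and retry a single time. Never resend unchanged input."""
    try:
        return call(params)
    except BadInputError as err:
        fixed = correct(params, err.message)
        if fixed == params:
            raise  # the same request would fail the same way: give up
        return call(fixed)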
Authentication Errors (401/403)
The API key is invalid, expired, or lacks permission for the requested operation.
What the agent sees: HTTP 401 (unauthorized) or 403 (forbidden).
Recovery: Don’t retry. Authentication errors don’t resolve on their own. Log the error and alert the operator. Something is misconfigured, and a human needs to fix it.
Retry Strategies
Not all retries are the same. Choose the right strategy for the error type.
Exponential backoff. For transient errors (5xx, timeouts). Wait 1s, 2s, 4s between retries. This gives the failing service time to recover without hammering it.
Fixed delay. For rate limits. Wait exactly as long as the Retry-After header says.
Immediate with correction. For bad input (400). No delay needed, but the request must change. The agent reads the error, fixes the input, and retries immediately.
No retry. For auth errors (401/403) and persistent failures (3+ retries with no success). Stop and report the failure.
Set a maximum total time per tool call, including retries. For most tasks, 30 seconds is a reasonable ceiling. Beyond that, the agent should give up and either try a fallback or report failure to the user.
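A deadline wrapper can enforce that ceiling across retries. This sketch assumes a hypothetical `TransientError` for retryable failures and checks, before each backoff sleep, whether waiting would blow the budget:

```python
import time

class TransientError(Exception):
    """Hypothetical marker for a retryable failure (5xx, timeout)."""

class ToolCallTimeout(Exception):
    """Raised when a tool call plus retries exceeds its time budget."""

def with_deadline(call, budget_s=30.0, backoff=(1, 2, 4),
                  clock=time.monotonic, sleep=time.sleep):
    """Retry with backoff, but cap total wall-clock time at `budget_s`."""
    deadline = clock() + budget_s
    for delay in (0, *backoff):
        if delay:
            if clock() + delay > deadline:
                break  # sleeping would exceed the budget: stop early
            sleep(delay)
        try:
            return call()
        except TransientError:
            continue  # retryable: fall through to the next backoff step
    raise ToolCallTimeout(f"gave up within {budget_s}s budget")
```

On `ToolCallTimeout`, the agent moves to a fallback tool or reports the failure, rather than letting one stuck call stall the whole task.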
Fallback Tools
When a tool fails persistently, the agent should try an alternative if one exists.
For example, if a Google Search tool returns errors, the agent might fall back to a Bing Search tool. The results won’t be identical, but for most tasks, any search engine is better than no search engine.
To make fallbacks work:
- Group tools by capability. Know which tools can substitute for each other (web search, news search, email).
- Try the preferred tool first. Fallbacks are for failures, not for load balancing.
- Limit fallback depth. One fallback attempt is usually enough. If both the primary and fallback fail, the underlying problem is probably not tool-specific.
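The three rules above can be sketched as a capability table plus a short loop. The tool names and the `ToolUnavailable` exception are hypothetical placeholders:

```python
class ToolUnavailable(Exception):
    """Hypothetical marker for a tool that failed persistently."""

# Hypothetical capability groups: preferred tool first, substitute after.
CAPABILITY_GROUPS = {
    "web_search": ["google_search", "bing_search"],
}

def call_with_fallback(capability, groups, invoke, max_fallbacks=1):
    """Try the preferred tool for a capability; on failure, try at most
    `max_fallbacks` substitutes from the same group."""
    errors = []
    for name in groups[capability][: max_fallbacks + 1]:
        try:
            return name, invoke(name)
        except ToolUnavailable as err:
            errors.append((name, str(err)))
    raise ToolUnavailable(f"all {capability} tools failed: {errors}")
```

Slicing the group to `max_fallbacks + 1` entries is what enforces the depth limit: if both tools fail, the aggregated errors go back to the caller, since the problem is probably not tool-specific.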
Cost Impact of Errors
Tool calls cost money. Failed calls that charge credits without delivering results drain the agent’s budget.
Good platforms handle this with refund policies. On AgentPatch, the policy works like this:
- Server errors (5xx) and timeouts: Full refund. The failure wasn’t the agent’s fault.
- Bad input (4xx): Progressive refund. The first failure is fully refunded. Subsequent consecutive failures to the same tool get decreasing refunds (90%, 80%, 60%, 20%, then 0%). This gives agents room to learn while preventing infinite retry loops from draining credits.
- Successful calls: No refund (obviously).
The progressive penalty model is worth studying. It acknowledges that agents make mistakes, especially when encountering a new tool for the first time, while discouraging repeated bad calls. The penalty streak resets after a successful call or 24 hours of inactivity.
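Modeling the refund schedule described above makes the incentive curve concrete. This is a sketch of the stated percentages, not AgentPatch's actual implementation:

```python
# Refund fraction by consecutive bad-input failures to the same tool:
# 1st failure fully refunded, then 90%, 80%, 60%, 20%, and 0% after that.
REFUND_SCHEDULE = [1.0, 0.9, 0.8, 0.6, 0.2]

def refund_for_streak(streak):
    """Return the refund fraction for the Nth consecutive 4xx failure
    (1-based). The streak resets on success or after 24h of inactivity."""
    if streak < 1:
        raise ValueError("streak is 1-based")
    if streak <= len(REFUND_SCHEDULE):
        return REFUND_SCHEDULE[streak - 1]
    return 0.0  # past the schedule, failures cost full price
```

Two consecutive mistakes cost the agent only 10% of one call; six cost it a full call's credits, so a runaway retry loop gets expensive fast.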
When building your own agent, track the cost of errors separately from the cost of successful calls. If a significant percentage of your tool budget is going to failed calls, something is wrong: bad schemas, incorrect parameter construction, or unreliable tools.
Designing for Failure
A few patterns that help:
Validate before calling. Check the tool’s input schema client-side before making the call. Catch type mismatches, missing required fields, and out-of-range values before they become 400 errors.
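A real agent would use a proper JSON Schema validator; this simplified stand-in shows the shape of the check, using a hypothetical dict-based schema:

```python
def validate_input(schema, params):
    """Return a list of validation errors (empty list means valid).
    `schema` maps field name to a spec dict with optional keys:
    'type' (a Python type), 'required' (bool), 'range' ((lo, hi))."""
    errors = []
    for field, spec in schema.items():
        if field not in params:
            if spec.get("required"):
                errors.append(f"field '{field}' is required")
            continue
        value = params[field]
        if not isinstance(value, spec["type"]):
            errors.append(f"'{field}' must be {spec['type'].__name__}")
        elif "range" in spec:
            lo, hi = spec["range"]
            if not (lo <= value <= hi):
                errors.append(f"'{field}' must be between {lo} and {hi}")
    return errors
```

If `validate_input` returns errors, feed them straight back to the LLM, exactly as you would a server-side 400, except you saved the call's latency and cost.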
Log every failure. Record the tool, input parameters (sanitized), error code, error message, and retry count. This data helps you identify patterns: which tools fail most often, which error types are most common, and whether specific parameter patterns cause problems.
Degrade gracefully. If a tool fails and there’s no fallback, the agent should tell the user what happened and what it couldn’t do, rather than silently dropping the step. “I wasn’t able to search for recent news because the search tool is currently unavailable” is better than pretending the search never needed to happen.
Set circuit breakers. If a tool fails repeatedly over a short period, stop calling it. A circuit breaker prevents your agent from wasting time and credits on a tool that’s clearly down. Check again after a cooldown period.
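A minimal per-tool breaker might look like this; the threshold and cooldown values are illustrative, and `clock` is injectable for testing:

```python
import time

class CircuitBreaker:
    """Open (block calls) after `threshold` consecutive failures;
    allow a fresh probe once `cooldown_s` has elapsed."""
    def __init__(self, threshold=5, cooldown_s=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Should the agent call this tool right now?"""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # cooldown over: probe the tool again
            self.failures = 0
            return True
        return False

    def record(self, success):
        """Report the outcome of a call so the breaker can update state."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

Wrap each tool call in `if breaker.allow(): ... breaker.record(ok)`, with one breaker per tool. When `allow()` returns False, skip straight to the fallback or the graceful-degradation message.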
Error handling isn’t glamorous, but it’s what separates agents that work in production from agents that work in demos. Get the classification right, choose the right recovery strategy, and manage the cost impact. Your agent will be more reliable for it.