I was doing a live demo. The agent called a weather API. The API returned a 503. The agent replied with [object Object].
The demo died. The client was confused. And the worst part — this failure was completely avoidable. I just hadn’t thought about what happens when things go wrong.
Error handling in agent skills is different from error handling in regular code. Understanding why is what makes the difference between a skill that works and one that silently breaks your agent.
Why tool failures are different
In normal code, an unhandled error throws an exception, crashes the process, and you see a stack trace. You know something broke.
In an agent loop, the model doesn’t crash — it reads your error as data. If your tool throws an unhandled exception and the agent loop catches it poorly, the model might receive undefined, null, or [object Object] and attempt to respond as if that were real information.
Even worse: it might retry the same broken tool call in an infinite loop because nothing told it the tool failed.
The model is only as good as the information it receives. If your error handling is bad, the model will produce confidently wrong answers.
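That garbling has a one-line cause: JavaScript's default string conversion of any plain object is the literal text [object Object]. A quick sketch of the difference between naive interpolation and explicit serialization:

```javascript
// A tool result as a plain object
const result = { error: "Weather API returned 503" };

// Naive string interpolation — what a sloppy agent loop sends the model
const naive = `Tool result: ${result}`;
console.log(naive); // "Tool result: [object Object]"

// Explicit serialization — what the model should actually receive
const serialized = `Tool result: ${JSON.stringify(result)}`;
console.log(serialized); // 'Tool result: {"error":"Weather API returned 503"}'
```

Every tool result should pass through JSON.stringify (or equivalent) before it reaches the model.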
The three failure categories
Before writing any error handling code, it helps to know what kind of failure you’re dealing with.
Category 1 — Network errors
The API is unreachable, times out, or returns an HTTP error (4xx, 5xx). These are transient — retrying often succeeds.
// Network error examples
// - fetch() throws: "TypeError: fetch failed"
// - Response: 503 Service Unavailable
// - Response: 429 Too Many Requests (rate limited)
// - Response: timeout after 30s
Category 2 — Bad data
The API responds successfully but the data is wrong, empty, or malformed.
// Bad data examples
// - geo.results is undefined (city not found)
// - response.json() throws (non-JSON body)
// - fields are null when code expects strings
// - array is empty when code expects at least one item
Category 3 — Logic errors
Your code has a bug, or the inputs the model provided are invalid.
// Logic error examples
// - model passed city: null (required field missing)
// - model passed city: 12345 (wrong type)
// - division by zero in a calculation
// - accessing property on undefined
Each category needs a different response. Let’s build patterns for all three.
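As a sketch, the routing can live in one small classifier. classifyFailure is a hypothetical helper (not part of any SDK) that assumes you have already captured whether the call threw and what HTTP status came back:

```javascript
// Hypothetical helper: route a failure to one of the three categories,
// so each can get its own strategy (retry, re-prompt the user, fail fast).
function classifyFailure(failure) {
  // Category 1 — network: thrown fetch errors, 429s, and 5xx responses are transient
  if (failure.thrown || failure.status === 429 || failure.status >= 500) {
    return "network";
  }
  // Category 3 — logic: other 4xx responses usually mean our request was malformed
  if (failure.status >= 400) return "logic";
  // Category 2 — bad data: the call "succeeded" but the payload is unusable
  return "bad-data";
}

console.log(classifyFailure({ status: 503 }));  // "network"
console.log(classifyFailure({ thrown: true })); // "network"
console.log(classifyFailure({ status: 400 }));  // "logic"
console.log(classifyFailure({ status: 200 }));  // "bad-data"
```

This only classifies calls that already failed; a clean 200 with good data never reaches it.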
Pattern 1 — Defensive return objects
The most important rule: never throw from a tool function. Always return an object, even when something goes wrong.
// ❌ Bad — throws an exception
async function get_weather({ city }) {
const response = await fetch(`https://api.example.com/weather?city=${city}`);
const data = await response.json(); // throws if response is not JSON
return data.current; // throws if data.current is undefined
}
// ✅ Good — always returns an object
async function get_weather({ city }) {
try {
const response = await fetch(`https://api.example.com/weather?city=${encodeURIComponent(city)}`);
if (!response.ok) {
return { error: `Weather API returned ${response.status}. Try again later.` };
}
const data = await response.json();
if (!data.current) {
return { error: `No weather data found for "${city}".` };
}
return {
city: data.location.name,
temperature: `${data.current.temp_c}°C`,
condition: data.current.condition.text
};
} catch (err) {
return { error: `Could not reach weather service: ${err.message}` };
}
}
When the model receives { error: "Weather API returned 503. Try again later." }, it can respond meaningfully: “I couldn’t get the weather right now — the service seems to be temporarily unavailable. Want me to try again?”
When it receives [object Object], it has nothing to work with.
Pattern 2 — Retry with exponential backoff
Transient network errors usually resolve on their own. Retrying after a short delay succeeds most of the time. Here’s a reusable wrapper:
async function withRetry(fn, maxAttempts = 3, baseDelayMs = 500) {
let lastError;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
const result = await fn();
// If the function returned an error object on a retryable condition, retry
if (result?.error && result?.retryable) {
throw new Error(result.error);
}
return result;
} catch (err) {
lastError = err;
if (attempt < maxAttempts) {
// Exponential backoff: 500ms, 1000ms, 2000ms...
const delay = baseDelayMs * Math.pow(2, attempt - 1);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
return { error: `Failed after ${maxAttempts} attempts: ${lastError.message}` };
}
Wrap your tool call:
async function get_weather({ city }) {
return withRetry(async () => {
const response = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
);
if (response.status === 429) {
return { error: "Rate limited", retryable: true };
}
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const data = await response.json();
// ... rest of the logic
return { city: data.results[0].name, temperature: "..." };
});
}
The retryable: true flag tells the wrapper to retry that specific error. A thrown exception also triggers a retry, while an error object without retryable is returned to the caller immediately.
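Here is the wrapper in action against a simulated flaky call. The flakyWeather function is invented for the demo, the base delay is shortened so it runs quickly, and withRetry is repeated so the snippet runs standalone:

```javascript
// Same withRetry wrapper as above, reproduced so this snippet is self-contained.
async function withRetry(fn, maxAttempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await fn();
      if (result?.error && result?.retryable) throw new Error(result.error);
      return result;
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise(r => setTimeout(r, baseDelayMs * Math.pow(2, attempt - 1)));
      }
    }
  }
  return { error: `Failed after ${maxAttempts} attempts: ${lastError.message}` };
}

// Simulated flaky call: fails twice with a network-style error, then succeeds.
let calls = 0;
async function flakyWeather() {
  calls++;
  if (calls < 3) throw new Error("ECONNRESET");
  return { temperature: "22°C" };
}

const result = await withRetry(flakyWeather, 3, 10); // short base delay for the demo
console.log(calls);  // 3 — two failures, one success
console.log(result); // { temperature: "22°C" }
```

Two transient failures never reach the model; only a third consecutive failure would produce an error object.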
Pattern 3 — Fallbacks
When the primary data source fails, try a secondary one. When that fails too, return a graceful degraded response.
async function get_weather({ city }) {
// Primary source: Open-Meteo (free, no key)
const primary = await tryOpenMeteo(city);
if (!primary.error) return primary;
// Secondary source: wttr.in (also free, different format)
const secondary = await tryWttr(city);
if (!secondary.error) return secondary;
// Both failed — return a useful degraded response
return {
city,
temperature: "unavailable",
condition: "Weather data is temporarily unavailable. Please check a weather app directly.",
degraded: true
};
}
async function tryOpenMeteo(city) {
try {
const geo = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
).then(r => r.json());
if (!geo.results?.length) return { error: "City not found" };
const { latitude, longitude, name } = geo.results[0];
const weather = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
).then(r => r.json());
return { city: name, temperature: `${weather.current_weather.temperature}°C`, source: "open-meteo" };
} catch (err) {
return { error: err.message };
}
}
async function tryWttr(city) {
try {
const data = await fetch(
`https://wttr.in/${encodeURIComponent(city)}?format=j1`
).then(r => r.json());
const current = data.current_condition[0];
return {
city,
temperature: `${current.temp_C}°C`,
condition: current.weatherDesc[0].value,
source: "wttr.in"
};
} catch (err) {
return { error: err.message };
}
}
The model receives real, useful data even when the primary source is down. And if both sources fail, it gets an honest degraded: true result it can explain clearly to the user.
When to surface the error to the model
Not every error deserves a detailed explanation. Here’s how to decide:
| Situation | What to return | Why |
|---|---|---|
| Transient failure (rate limit, 503) | { error: "Service temporarily unavailable. Try again in a moment." } | Model can relay this and suggest retry |
| City / resource not found | { error: "No results found for 'Atlantis'. Try a different city name." } | Model can ask the user to clarify |
| Auth failure (expired API key) | { error: "Weather service authentication failed. Contact support." } | Useful for debugging, not user-actionable |
| Degraded fallback worked | Include degraded: true field | Model can note data may be approximate |
| Critical logic error | { error: "Internal error. Please try a different request." } | Don’t leak stack traces to the model |
The model will use whatever you give it. Write error messages for the person who will read the model’s response — not for a developer reading logs.
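The last row of the table deserves a sketch: log the real error where developers can see it, and hand the model only a generic, user-safe message. toToolResult is a hypothetical helper name:

```javascript
// Hypothetical helper: convert an unexpected internal error into a safe tool result.
function toToolResult(err) {
  // Full details (message, stack trace) go to your logs, never to the model
  console.error("get_weather failed:", err.stack ?? err);
  return { error: "Internal error. Please try a different request." };
}

const toolResult = toToolResult(
  new TypeError("Cannot read properties of undefined (reading 'temperature')")
);
console.log(toolResult.error); // generic message — no stack trace, no internals
```

The stack trace stays in your logs for debugging; the model only ever sees a message it can safely repeat to the user.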
What the model sees: throw vs error object
Here’s a direct comparison. Same broken API call, different error handling:
With an unhandled throw:
[Agent loop error: TypeError: Cannot read properties of undefined (reading 'temperature')]
The model receives nothing useful. It might hallucinate a weather report. It might say “I apologize, I encountered an error” repeatedly.
With defensive return:
{
"error": "Weather service temporarily unavailable. Both primary and backup sources failed.",
"degraded": true
}
The model responds: “I wasn’t able to get current weather data for Mumbai — both the services I use seem to be down right now. You can check weather.com directly, or I can try again in a few minutes if you’d like.”
That’s a good agent response. The user knows what happened and what to do.
Full working example
Here’s get_weather with all three patterns applied:
// weather-tool-robust.js
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
// --- Error handling utilities ---
async function withRetry(fn, maxAttempts = 3, baseDelayMs = 500) {
let lastError;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
const result = await fn();
if (result?.error && result?.retryable) throw new Error(result.error);
return result;
} catch (err) {
lastError = err;
if (attempt < maxAttempts) {
await new Promise(r => setTimeout(r, baseDelayMs * Math.pow(2, attempt - 1)));
}
}
}
return { error: `Failed after ${maxAttempts} attempts: ${lastError.message}` };
}
// --- Tool implementation ---
async function get_weather({ city }) {
if (!city || typeof city !== "string") {
return { error: "City name is required and must be a string." };
}
return withRetry(async () => {
const geo = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
);
if (geo.status === 429) return { error: "Rate limited", retryable: true };
if (!geo.ok) throw new Error(`Geocoding API: HTTP ${geo.status}`);
const geoData = await geo.json();
if (!geoData.results?.length) {
return { error: `No location found for "${city}". Check the spelling and try again.` };
}
const { latitude, longitude, name, country } = geoData.results[0];
const weather = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true&hourly=relativehumidity_2m`
);
if (!weather.ok) throw new Error(`Weather API: HTTP ${weather.status}`);
const wData = await weather.json();
const current = wData.current_weather;
const codes = {
0: "Clear sky", 1: "Mainly clear", 2: "Partly cloudy", 3: "Overcast",
61: "Light rain", 63: "Moderate rain", 65: "Heavy rain", 95: "Thunderstorm"
};
return {
city: `${name}, ${country}`,
temperature: `${current.temperature}°C`,
condition: codes[current.weathercode] ?? "Unknown",
humidity: `${wData.hourly.relativehumidity_2m[0]}%`
};
});
}
// --- Agent loop ---
const tools = [{
name: "get_weather",
description: "Get current weather for a city. Use when the user asks about weather, temperature, or rain.",
input_schema: {
type: "object",
properties: { city: { type: "string", description: "City name" } },
required: ["city"]
}
}];
async function chat(userMessage) {
const messages = [{ role: "user", content: userMessage }];
let response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
while (response.stop_reason === "tool_use") {
const toolBlock = response.content.find(b => b.type === "tool_use");
const result = await get_weather(toolBlock.input);
messages.push(
{ role: "assistant", content: response.content },
{ role: "user", content: [{ type: "tool_result", tool_use_id: toolBlock.id, content: JSON.stringify(result) }] }
);
response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
}
return response.content.find(b => b.type === "text")?.text ?? "";
}
console.log(await chat("What's the weather in Mumbai?"));
What’s next
Test your skills before deploying: Testing and Debugging Agent Skills Before You Deploy
Chain multiple skills together: Chaining Agent Skills: Research, Summarize, and Save
Back to fundamentals: What Are Agent Skills? AI Tools Explained Simply
Related Reading
Testing and Debugging Agent Skills Before You Deploy
Skills that work alone fail differently inside an agent loop. Unit test your tools, mock AI calls, and debug the full tool_use cycle in Node.js.
Agent Skills with Google Gemini: Function Calling Guide
Complete guide to Gemini function calling — define tools, handle function_call responses, return results, and compare syntax with Claude and OpenAI. Node.js.
Vercel AI SDK Tools: One API for Claude and OpenAI Skills
Vercel AI SDK's unified tool interface works with Claude, OpenAI, and Gemini. Write your skill once and switch AI providers without rewriting the agent loop.