MeshWorld.

Testing and Debugging Agent Skills Before You Deploy

By Vishnu Damwala

I built a GitHub issue creator skill. Called it directly with mock inputs — it worked perfectly every time. Deployed it to the agent, gave it a natural language request. The model kept passing null for the repo field, which my code didn’t handle.

The bug wasn’t in the skill code. It was in the description field — the model didn’t understand it needed to provide a repo. I’d never tested how the model would interpret the tool definition.

That distinction matters: testing the function is not the same as testing the skill.


Three levels of testing

Agent skills have three distinct layers, and bugs can exist independently at each one.

Level 1 — The function
Does get_weather({ city: "Mumbai" }) return the right data?

Level 2 — The tool definition
Does the model understand when to call get_weather and what to pass?

Level 3 — The agent loop
Does the full dispatch cycle work end-to-end without spending tokens every time?

Most developers only test Level 1 and assume the rest works. That’s how you end up debugging in production.


Level 1 — Unit testing the function

This is standard — just test the JavaScript function directly. No AI involved.

// weather.js — the tool function
export async function get_weather({ city }) {
  if (!city) return { error: "City is required" };

  try {
    const geo = await fetch(
      `https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
    ).then(r => r.json());

    if (!geo.results?.length) return { error: `City not found: ${city}` };

    const { latitude, longitude, name, country } = geo.results[0];
    const weather = await fetch(
      `https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
    ).then(r => r.json());

    const { temperature, weathercode } = weather.current_weather;
    return {
      city: `${name}, ${country}`,
      temperature: `${temperature}°C`,
      // Open-Meteo reports a numeric WMO weather code; 0 means clear sky
      condition: weathercode === 0 ? "Clear" : `WMO weather code ${weathercode}`
    };
  } catch (err) {
    return { error: err.message };
  }
}
// weather.test.js — unit tests with Node's built-in assert
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";

// Test 1: valid city
const result = await get_weather({ city: "Mumbai" });
assert.ok(!result.error, `Should not error: ${result.error}`);
assert.ok(result.temperature, "Should return temperature");
assert.ok(result.city.includes("Mumbai"), "Should return city name");
console.log("✅ Valid city:", result);

// Test 2: missing input
const missing = await get_weather({});
assert.equal(missing.error, "City is required");
console.log("✅ Missing input handled:", missing);

// Test 3: city that doesn't exist
const notFound = await get_weather({ city: "Atlantis12345" });
assert.ok(notFound.error, "Should return error for unknown city");
console.log("✅ Unknown city handled:", notFound);

Run with:

node weather.test.js

No test framework needed. Add vitest if you want watch mode and better output:

npm install -D vitest
npx vitest run weather.test.js

Level 2 — Testing the tool definition

This is the level most people skip. You’re testing whether the model understands your tool’s description and input_schema well enough to call it correctly.

The test: send a carefully controlled prompt to the real API and assert what the model decided to call and with what arguments.

// test-definition.js
import Anthropic from "@anthropic-ai/sdk";
import assert from "node:assert/strict";

const client = new Anthropic();

const weatherTool = {
  name: "get_weather",
  description:
    "Get current weather conditions for a city. " +
    "Use this when the user asks about weather, temperature, rain, " +
    "or what to wear outdoors.",
  input_schema: {
    type: "object",
    properties: {
      city: { type: "string", description: "The city name, e.g. 'Mumbai'" }
    },
    required: ["city"]
  }
};

async function assertToolCall(prompt, expectedTool, expectedInputKeys) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 256,
    tools: [weatherTool],
    messages: [{ role: "user", content: prompt }]
  });

  const toolCall = response.content.find(b => b.type === "tool_use");

  // Assert the model called the right tool
  assert.ok(toolCall, `Expected a tool call for: "${prompt}"`);
  assert.equal(toolCall.name, expectedTool, `Wrong tool called`);

  // Assert required input keys are present and not null
  for (const key of expectedInputKeys) {
    assert.ok(toolCall.input[key] != null, `Missing input: ${key} for prompt: "${prompt}"`);
  }

  console.log(`✅ "${prompt}"`);
  console.log(`   → ${toolCall.name}(${JSON.stringify(toolCall.input)})`);
  return toolCall;
}

// Test cases
await assertToolCall("What's the weather in Delhi?", "get_weather", ["city"]);
await assertToolCall("Will it rain in Mumbai today?", "get_weather", ["city"]);
await assertToolCall("Should I bring a jacket in Chennai?", "get_weather", ["city"]);

// Negative test: the model should NOT call weather for this
const noToolResponse = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 256,
  tools: [weatherTool],
  messages: [{ role: "user", content: "What is the capital of France?" }]
});
const noToolCall = noToolResponse.content.find(b => b.type === "tool_use");
assert.ok(!noToolCall, "Should NOT call weather for a geography question");
console.log("✅ No false trigger for unrelated questions");

This costs a few API calls but tells you exactly how your description reads from the model’s perspective. If the model passes null for city on a particular phrasing, you know your description needs to be more specific.


Level 3 — Mocking the AI to test the dispatch loop

You can test the full dispatch loop without spending tokens every time: intercept the API client and return synthetic responses.

// mock-loop.test.js
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";

// A synthetic tool_use response — exactly what Claude returns when it wants to call a tool
function makeSyntheticToolUse(toolName, input) {
  return {
    id: "msg_mock_001",
    type: "message",
    role: "assistant",
    stop_reason: "tool_use",
    content: [
      { type: "text", text: "Let me check the weather for you." },
      {
        type: "tool_use",
        id: "toolu_mock_001",
        name: toolName,
        input
      }
    ]
  };
}

// A synthetic final response — what Claude returns after receiving the tool result
function makeSyntheticFinalResponse(text) {
  return {
    id: "msg_mock_002",
    type: "message",
    role: "assistant",
    stop_reason: "end_turn",
    content: [{ type: "text", text }]
  };
}

// Minimal mock client
function makeMockClient(toolName, toolInput, finalText) {
  let callCount = 0;
  return {
    messages: {
      create: async () => {
        callCount++;
        if (callCount === 1) return makeSyntheticToolUse(toolName, toolInput);
        return makeSyntheticFinalResponse(finalText);
      }
    }
  };
}

// The agent loop (same logic as production)
async function runLoop(client, tools, toolFunctions, userMessage) {
  const messages = [{ role: "user", content: userMessage }];
  let response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });

  while (response.stop_reason === "tool_use") {
    const toolBlock = response.content.find(b => b.type === "tool_use");
    const fn = toolFunctions[toolBlock.name];
    const result = fn ? await fn(toolBlock.input) : { error: "Unknown tool" };

    messages.push(
      { role: "assistant", content: response.content },
      { role: "user", content: [{ type: "tool_result", tool_use_id: toolBlock.id, content: JSON.stringify(result) }] }
    );
    response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
  }

  // Find the text block rather than assuming it's first in the content array
  return response.content.find(b => b.type === "text")?.text;
}

// Test: verify the loop calls get_weather correctly and uses the result
const mockClient = makeMockClient(
  "get_weather",
  { city: "Mumbai" },
  "Mumbai is currently 31°C and partly cloudy."
);

const tools = [{
  name: "get_weather",
  input_schema: { type: "object", properties: { city: { type: "string" } }, required: ["city"] }
}];

const result = await runLoop(mockClient, tools, { get_weather }, "What's the weather?");
assert.ok(result.includes("31°C") || result.includes("Mumbai"), "Loop should use tool result in response");
console.log("✅ Agent loop works correctly:", result);

No API calls, no tokens spent. You can run this in CI on every commit.


Debugging bad tool descriptions

When the model calls your tool with wrong or missing arguments, the issue is almost always the description. Here’s how to diagnose it:

Step 1 — Log what the model actually sent

Add this immediately before executing the tool:

console.log("[Tool call]", toolBlock.name, JSON.stringify(toolBlock.input, null, 2));

Run your agent with a real prompt. See exactly what the model passed. If you see { city: null } when you expected a city name, your description didn’t tell the model where to find it.

Step 2 — Compare your description to the test prompt

If the prompt says “What will it be like outside in Ahmedabad tomorrow?” and your description only mentions “current weather” — the model might not call the tool because “tomorrow” doesn’t match “current.”

Update the description to be explicit:

"Get current or forecasted weather for a city. Use this when the user asks about
weather, temperature, rain, sunshine, what to wear, or whether to bring an umbrella
— for any time frame (today, tomorrow, this week)."

Step 3 — Run the Level 2 definition test with the exact failing prompt

Add the failing prompt as a new test case. If it fails, you have a reproducible bug you can fix iteratively.


Common debug scenarios

Symptom: Model never calls the tool
Likely cause: Description too narrow or vague
Fix: Add more trigger phrases to the description

Symptom: Model calls the wrong tool
Likely cause: Descriptions overlap
Fix: Add "Do NOT use this for X" to each

Symptom: Model passes null for a required field
Likely cause: Description doesn't explain where to get it
Fix: Specify: "Use the city the user mentioned"

Symptom: Model passes the wrong type (e.g. a number instead of a string)
Likely cause: Schema description unclear
Fix: Add "type": "string" and an example in the description

Symptom: Model calls the tool in an infinite loop
Likely cause: Tool result is empty or ambiguous
Fix: Return more specific success/error messages

The testSkill harness

Here’s a reusable harness you can drop into any project:

// test-harness.js
export async function testSkill(fn, testCases) {
  let passed = 0;
  let failed = 0;

  for (const { label, input, assert: check } of testCases) {
    try {
      const result = await fn(input);
      check(result);
      console.log(`  ✅ ${label}`);
      passed++;
    } catch (err) {
      console.log(`  ❌ ${label}: ${err.message}`);
      failed++;
    }
  }

  console.log(`\n${passed} passed, ${failed} failed`);
  if (failed > 0) process.exit(1);
}

Use it:

import assert from "node:assert/strict";
import { testSkill } from "./test-harness.js";
import { get_weather } from "./weather.js";

await testSkill(get_weather, [
  {
    label: "returns temperature for valid city",
    input: { city: "London" },
    assert: r => assert.ok(r.temperature, "Missing temperature")
  },
  {
    label: "handles missing city",
    input: {},
    assert: r => assert.ok(r.error, "Should have error")
  },
  {
    label: "handles nonexistent city",
    input: { city: "Nonexistentville99" },
    assert: r => assert.ok(r.error, "Should have error")
  }
]);

What’s next

Handle errors gracefully in your skills: Handling Errors in Agent Skills: Retries and Fallbacks

Give your agent persistent memory: Agent Skills with Memory: Persisting State Between Chats

Back to fundamentals: What Are Agent Skills? AI Tools Explained Simply