Evaluating Tool-Call Quality: Beyond "It Worked Once"

A specific moment, repeated across teams: the developer tweaks a tool description, runs the agent on a couple of test prompts, sees the agent call the tool, and ships. A week later, customer support starts forwarding tickets. The tool works — every call returns 200. The agent uses it correctly most of the time. But the success rate has dropped, and nobody can pinpoint when, why, or by how much.

This happens because tool quality and tool-call quality are two different things, and almost all the testing effort goes into the first. A tool can be working perfectly while the agent's use of it is degrading: picking it for the wrong situations, calling it with subtly wrong arguments, ignoring it when it should be the first choice, calling it twice when once would do, or recovering from its own malformed calls in ways that make the conversation longer than it needed to be.

This post is about the discipline of measuring the second thing. The MCP server is the boundary where this measurement is most clearly visible: every tool call travels through your code, every result you produce shapes the next turn, and every metric you instrument here reflects how the agent is actually using your tools.

This is the deeper companion to the observability post (which covered the telemetry plumbing) and the tool-design post (which covered what makes a tool well-described). It assumes you have those in place; the question this post tackles is: given the data, what do you measure, and what do you do when the metric drops?

What "tool-call quality" actually means

Sharper definitions before any methodology.

Tool-call selection quality. Given a user's intent, did the agent pick the right tool? A binary classification per turn. The hard part is defining "right tool": it has to be defined relative to a ground-truth label, not the agent's confidence.

Tool-call argument quality. Given that the right tool was picked, were the arguments correct? Closer to a structured comparison — exact match for some fields, semantic match for others (a query parameter is "correct" if it captures the user's intent, not if it is verbatim what the labeler would have written).
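A field-by-field comparison along these lines might look like the sketch below. The helper names (`args_match`, `normalized_text_match`) are hypothetical, and the word-bag comparison is only a placeholder for a real semantic match, which would need something stronger.

```python
def normalized_text_match(expected: str, actual: str) -> bool:
    """Crude stand-in for a semantic match: same words, ignoring case and order."""
    return sorted(expected.lower().split()) == sorted(actual.lower().split())


def args_match(expected: dict, actual: dict, comparators: dict) -> bool:
    """Compare arguments field by field.

    Fields listed in `comparators` use the given function; all others need
    exact equality. Extra or missing fields count as a mismatch.
    """
    if expected.keys() != actual.keys():
        return False
    for name, want in expected.items():
        compare = comparators.get(name, lambda a, b: a == b)
        if not compare(want, actual[name]):
            return False
    return True


# Hypothetical arguments: `query` is matched loosely, `limit` exactly.
comparators = {"query": normalized_text_match}
loose_ok = args_match(
    {"query": "orders last March", "limit": 10},
    {"query": "last march orders", "limit": 10},
    comparators,
)
```

Keeping the comparator per field means a strict field such as an ID never gets the benefit of a fuzzy match intended for free-text fields.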

Tool-call sequencing quality. Given that this is a multi-step task, did the agent's sequence of calls reach the goal efficiently? Counts the redundant calls, the abandoned attempts, the dead-end branches.

Tool-call recovery quality. When the agent makes a malformed or wrong call, does it recover gracefully? This one is sneaky — agents that recover too gracefully hide problems that should have been visible.

End-to-end task quality. Did the agent complete the user's task, regardless of how it got there? The customer-facing metric, the downstream consequence of all the others.

Each of these is its own measurement problem. Most teams collapse them into "did Claude do the thing" and lose the ability to diagnose where the breakdown is happening.

The eval dataset

The foundation of every tool-quality program. A curated set of inputs, with ground-truth labels, run regularly against your real tools.

What goes in. A few hundred test cases at minimum, representative of the user intents your server actually sees. Each case has:

A user message (or sequence of messages).
The conversation history, if any, that came before it.
Ground-truth labels: which tool should the agent call, with what arguments, in what order if multiple.
Expected backend state at the end, if applicable.
A category tag (so you can break down quality by intent type, by user segment, by edge case).
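One way to represent a case with the fields listed above is a pair of small dataclasses. This is a minimal sketch: the field names (`user_messages`, `expected_calls`, `expected_state`, and so on) and the example argument are illustrative assumptions, not a schema defined by this post or by MCP.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExpectedCall:
    """One ground-truth tool call: the tool name plus its expected arguments."""
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalCase:
    """A single eval case. All field names here are hypothetical."""
    case_id: str
    user_messages: list[str]              # the message(s) under test
    history: list[dict[str, str]]         # prior turns, possibly empty
    expected_calls: list[ExpectedCall]    # ground truth, in order
    category: str                         # used for per-category breakdowns
    expected_state: Optional[dict[str, Any]] = None  # backend state, if applicable


case = EvalCase(
    case_id="orders-001",
    user_messages=["show me all orders from last March"],
    history=[],
    # The argument below is made up for illustration.
    expected_calls=[ExpectedCall("searchByMetadata", {"entity": "order"})],
    category="order-lookup",
)
```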

Where the cases come from. Three sources, ranked by value:

Real conversations, anonymized, labeled by humans. The most expensive, the most representative, the most valuable. A tier of internal users who consent to having their conversations used for evals is gold.
Synthetic cases generated to cover specific scenarios — edge cases, error cases, ambiguous prompts. Useful for filling gaps in real data.
Adversarial cases designed to probe for prompt injection, ambiguity, malicious phrasing. The smallest set but the most informative for security regressions.

What gets labeled. Be honest about ambiguity. Some cases have a single right answer (searchByMetadata is clearly the right tool for "show me all orders from last March"). Others have multiple acceptable answers (a question about a customer might reasonably hit getCustomer or searchOrders first). Label the cases accordingly — single answer, set of acceptable answers, ranked preferences. Pretending every case has one right answer leads to evals that punish reasonable behavior.
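The three label shapes can be scored with one function. The sketch below assumes a hypothetical `SelectionLabel` type and an arbitrary partial-credit rule (1, 1/2, 1/3, ...) for ranked preferences; choose whatever decay fits your product.

```python
from dataclasses import dataclass


@dataclass
class SelectionLabel:
    """Ground truth for tool selection.

    A single-answer case has one entry in `acceptable`. A set of equally
    acceptable answers keeps `ordered=False`. Ranked preferences list the
    best tool first and set `ordered=True`.
    """
    acceptable: list[str]
    ordered: bool = False


def selection_score(label: SelectionLabel, called_tool: str) -> float:
    """Return 1.0 for a fully correct pick, partial credit for lower ranks, else 0.0."""
    if called_tool not in label.acceptable:
        return 0.0
    if not label.ordered:
        return 1.0  # any member of the acceptable set is fully correct
    rank = label.acceptable.index(called_tool)
    return 1.0 / (rank + 1)  # assumed decay: 1, 1/2, 1/3, ...


single = SelectionLabel(["searchByMetadata"])
either = SelectionLabel(["getCustomer", "searchOrders"])
ranked = SelectionLabel(["getCustomer", "searchOrders"], ordered=True)
```

With these labels, an agent that calls `searchOrders` first gets full credit under `either` and half credit under `ranked`, so reasonable behavior is no longer scored as a failure.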

Versioning the dataset. As your tool catalog evolves, the dataset evolves with it. Tag each case with the tool version it was labeled against. Before any major change, replay the dataset against the old version to establish a baseline; after the change, replay against the new version and compare. Most regressions show up here as a numerical drop on a specific category, hours before they would have shown up in production.
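The baseline-versus-candidate comparison reduces to computing pass rates per category for both runs and flagging the drops. A minimal sketch, assuming each run is recorded as `(category, passed)` pairs and that a 5-point absolute drop is the alert threshold (both are assumptions):

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs. Returns {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, min_drop=0.05):
    """Categories whose pass rate fell by at least `min_drop` (absolute).

    Returns {category: (baseline_rate, candidate_rate)}.
    """
    base = pass_rate_by_category(baseline)
    cand = pass_rate_by_category(candidate)
    return {
        c: (base[c], cand[c])
        for c in base
        if c in cand and base[c] - cand[c] >= min_drop
    }


# Made-up results: the "refund" category regresses, "lookup" does not.
baseline = [("lookup", True)] * 4 + [("refund", True)] * 4
candidate = [("lookup", True)] * 4 + [("refund", True)] * 2 + [("refund", False)] * 2
dropped = regressions(baseline, candidate)
```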

The eval harness

The infrastructure that turns the dataset into measurements. The shape we use:

text

[ Test runner ]
    -> spawns/connects to MCP server (or test instance)
    -> for each case in dataset:
        -> instantiates a small evaluator agent
            (a real LLM, with the dataset's history loaded)
        -> records every tool call the agent makes
        -> records every tool result the server returns
        -> compares calls against ground-truth labels
        -> scores per-category, per-tool, per-quality-dimension
    -> writes a summary report
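The flow above could be sketched in Python roughly as follows. This is not a real harness: `run_agent` is a hypothetical callable standing in for "start the evaluator agent against the MCP server and record its tool calls", and the comparison is a strict equality check for brevity.

```python
from collections import defaultdict
from typing import Callable

# A recorded tool call: (tool_name, arguments).
ToolCall = tuple[str, dict]


def run_eval(cases: list[dict], run_agent: Callable[[dict], list[ToolCall]]) -> dict:
    """Run every case through `run_agent` and summarize pass rates per category."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        calls = run_agent(case)                # every tool call the agent made
        ok = calls == case["expected_calls"]   # strict comparison, for brevity
        total[case["category"]] += 1
        passed[case["category"]] += int(ok)
    return {category: passed[category] / total[category] for category in total}


def fake_agent(case: dict) -> list[ToolCall]:
    """Stub agent for demonstration: always calls getCustomer with no arguments."""
    return [("getCustomer", {})]


cases = [
    {"category": "customer", "expected_calls": [("getCustomer", {})]},
    {"category": "orders", "expected_calls": [("searchOrders", {})]},
]
report = run_eval(cases, fake_agent)
```

Injecting the agent as a callable keeps the scoring logic testable without a model in the loop.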

Three implementation details worth being deliberate about.

The evaluator agent uses a small, fast model. The frontier models are too expensive and too slow for daily eval runs. A 4B-class local model or a cheaper cloud model (Haiku-class) is fine for routine eval; a frontier model gets used for tricky cases or when the cheaper model's score is suspect.

Determinism. Set temperature = 0 (or whatever your model's deterministic mode is). The evals are statistical anyway, but pinning temperature reduces noise and makes regressions easier to attribute. Run each case multiple times and treat majority-correct as pass; this absorbs the residual non-determinism without pretending it isn't there.
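The majority-correct rule is a few lines of code. In this sketch `run_case` is a hypothetical zero-argument callable that runs one attempt and returns whether it passed; the default of five attempts is an assumption.

```python
def majority_pass(run_case, attempts=5):
    """Run `run_case()` `attempts` times; pass if more than half succeed.

    Returns (passed, success_count) so flaky cases stay visible in reports.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    successes = sum(1 for _ in range(attempts) if run_case())
    return successes * 2 > attempts, successes


# Simulated attempt outcomes for one case: three of five succeed.
outcomes = iter([True, False, True, True, False])
result = majority_pass(lambda: next(outcomes), attempts=5)
```

Returning the raw success count alongside the verdict matters: a case that passes 3 of 5 and a case that passes 5 of 5 both count as passes, but only one of them is stable.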

The comparison logic is configurable per case. Some cases want exact tool name match. Some want semantic argument match. Some want sequence-of-calls match in any order; others want a strict order. The harness exposes a per-case match function, and the dataset specifies which one applies. Don't write one comparison function and force every case through it. You will be wrong about half the cases.
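A per-case match function can be as simple as a registry keyed by a name stored on the case. The sketch below is one possible shape; the matcher names and the `match` field are assumptions, and it assumes argument values are hashable (flat strings and numbers).

```python
from collections import Counter


def _key(call):
    """Hashable form of a (tool_name, arguments) call; assumes hashable values."""
    name, args = call
    return name, tuple(sorted(args.items()))


def match_tool_names(expected, actual):
    """Pass if the same tools were called in the same order, ignoring arguments."""
    return [name for name, _ in expected] == [name for name, _ in actual]


def match_any_order(expected, actual):
    """Pass if the same calls were made, in any order (multiset equality)."""
    return Counter(map(_key, expected)) == Counter(map(_key, actual))


def match_strict_order(expected, actual):
    """Pass only if calls and arguments match exactly, in order."""
    return list(expected) == list(actual)


MATCHERS = {
    "tool_names": match_tool_names,
    "any_order": match_any_order,
    "strict_order": match_strict_order,
}


def score_case(case, actual_calls):
    """Look up the matcher named by the case and apply it."""
    return MATCHERS[case["match"]](case["expected_calls"], actual_calls)


expected = [("getCustomer", {"id": 1}), ("searchOrders", {"customer_id": 1})]
swapped = list(reversed(expected))
```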

Golden runs

A specific kind of eval worth naming separately. Golden runs are end-to-end traces of a complete task — user starts here, agent does these things, ends there — captured at a known-good moment and replayed forever after.

The pattern: pick the five or ten most important workflows your product supports. For each, capture a high-quality run of an agent completing the workflow. Save the user prompts, the model's responses, the tool calls, the tool results, the final outcome. Tag this as the golden version.

Every release, replay each golden run with the current version of the server and tools. Compare the new run against the golden:

Did the agent make the same tool calls?
Did it produce a comparable final result?
Did the path length stay the same? (More calls than the golden = regression.)
Did the latency stay the same? (Longer than the golden = degradation.)

A failing golden run is a stop-the-line event. The shape of the workflow has changed. Investigate before shipping.
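The replay comparison can be sketched as a function that returns a list of findings, where an empty list means the run passes. The `Trace` type, the zero-extra-calls default, and the 1.25 latency ratio are all assumptions; comparing the final result is omitted because it depends on what the workflow produces.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """A recorded run: the ordered tool names called and total latency in seconds."""
    tools: list[str]
    latency_s: float


def compare_to_golden(golden: Trace, current: Trace,
                      max_extra_calls: int = 0,
                      max_latency_ratio: float = 1.25) -> list[str]:
    """Return human-readable findings; an empty list means the run passes."""
    findings = []
    if current.tools != golden.tools:
        findings.append(f"tool sequence changed: {golden.tools} -> {current.tools}")
    extra = len(current.tools) - len(golden.tools)
    if extra > max_extra_calls:
        findings.append(f"path length grew by {extra} call(s)")
    if current.latency_s > golden.latency_s * max_latency_ratio:
        findings.append("latency exceeded the allowed ratio over the golden")
    return findings


golden = Trace(["getCustomer", "searchOrders"], latency_s=2.0)
same = compare_to_golden(golden, Trace(["getCustomer", "searchOrders"], 2.1))
worse = compare_to_golden(
    golden, Trace(["getCustomer", "getCustomer", "searchOrders", "searchOrders"], 4.0)
)
```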

The discipline: golden runs are small in number (five to ten per server) and high in confidence (each one approved as "this is what good looks like"). They are not a substitute for the eval dataset; they are the floor below which the system should never go. The dataset gives you breadth; the goldens give you the contract.

Quiet failures: the dangerous category

The kind of regression that traditional metrics miss.

A quiet failure is when the agent makes a malformed or suboptimal tool call, the call fails or returns an unhelpful result, and the agent recovers by reformulating and trying again. From an end-to-end perspective, the task succeeds. From a backend perspective, you have wasted a call. From a model-behavior perspective, you have learned that the description, the schema, or the tool itself was confusing enough that the agent stumbled.

These failures are dangerous because:

The end-to-end metric does not catch them (the task completed).
The error logs do not catch them (the recovery was clean).
The agent itself does not "remember" the stumble in a way the user can see.
They cost real money — extra LLM tokens, extra backend calls, extra latency.

How to catch them:

Per-turn call counts. A user prompt that should be answerable in one tool call but takes three is a quiet failure. Instrument the host (or the agent harness) to log the number of tool calls per user turn, and watch the distribution.
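Given one log event per tool call, the per-turn distribution is a two-step count. A minimal sketch, assuming each event carries a `turn_id` field (the field name is an assumption about your telemetry):

```python
from collections import Counter


def calls_per_turn(events):
    """events: iterable of dicts with a `turn_id` key, one per tool call.

    Returns a Counter mapping "number of calls in a turn" to "number of turns".
    Turns with zero calls emit no events, so they do not appear here.
    """
    per_turn = Counter(event["turn_id"] for event in events)
    return Counter(per_turn.values())


def share_over(distribution, threshold):
    """Fraction of turns that used more than `threshold` tool calls."""
    total = sum(distribution.values())
    over = sum(turns for calls, turns in distribution.items() if calls > threshold)
    return over / total if total else 0.0


# Four turns: three used one call, one used three.
events = [{"turn_id": t} for t in ["a", "b", "b", "b", "c", "d"]]
distribution = calls_per_turn(events)
multi_call_share = share_over(distribution, 1)
```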

Validation-error rates. From the observability post, result_kind includes validation_error. A non-zero rate means the agent is sending arguments that fail your schema. Even if the agent recovers, the rate is a signal that the description-schema-handler triple is leaving room for confusion.

Re-call patterns. A tool called with one set of arguments, then called again within the same turn with slightly different arguments, is a recovery pattern. Log the second call distinguishably from the first; aggregate the counts; investigate the tools with high recovery rates.
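Re-call detection can be approximated by flagging any call whose tool was already called in the same turn with different arguments. This is a heuristic sketch with assumed field names (`turn_id`, `tool`, `arguments`); it will also flag legitimate batches, such as fetching three different customers in one turn, so treat high rates as a prompt to investigate rather than as proof.

```python
from collections import defaultdict


def flag_recalls(calls):
    """calls: list of dicts with `turn_id`, `tool`, and `arguments`, in call order.

    Returns a copy of each call with `is_recall` set: True when the same tool
    was already called in the same turn with different arguments.
    """
    seen = defaultdict(list)  # (turn_id, tool) -> argument dicts seen so far
    flagged = []
    for call in calls:
        earlier = seen[(call["turn_id"], call["tool"])]
        is_recall = bool(earlier) and call["arguments"] not in earlier
        flagged.append({**call, "is_recall": is_recall})
        earlier.append(call["arguments"])
    return flagged


def recall_rate_by_tool(flagged):
    """Fraction of calls to each tool that were re-calls."""
    total, recalls = defaultdict(int), defaultdict(int)
    for call in flagged:
        total[call["tool"]] += 1
        recalls[call["tool"]] += int(call["is_recall"])
    return {tool: recalls[tool] / total[tool] for tool in total}


# Made-up log: the agent retries searchOrders with a reformatted month.
calls = [
    {"turn_id": 1, "tool": "searchOrders", "arguments": {"month": "March"}},
    {"turn_id": 1, "tool": "searchOrders", "arguments": {"month": "2024-03"}},
    {"turn_id": 2, "tool": "searchOrders", "arguments": {"month": "2024-03"}},
    {"turn_id": 2, "tool": "getCustomer", "arguments": {"id": 7}},
]
flagged = flag_recalls(calls)
rates = recall_rate_by_tool(flagged)
```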

Comparison to the golden. If a golden run takes three calls and the current run takes five, the two extra calls are likely quiet failures. The golden tells you what good looks like; every call beyond it is worth investigating.

The instinct: quiet failures are the regressions that compound. A tool that has a 5% quiet-failure rate is one that, over thousands of calls, has cost real money and added real latency, all without showing up as a problem on any chart you were watching. Hunt them deliberately.

Per-tool quality scoring

A scoreboard that updates over time, per tool. The single chart that catches description regressions before users do.

The score for tool T over a window:

Selection: of all turns where the ground-truth says T should have been called, what fraction did call T?
Args: of all calls to T in that window, what fraction had correct arguments?
Recovery: of all calls to T, what fraction were re-calls — second tries within the same turn?
Latency: P50/P95 of T's handler duration.
End-to-end: what fraction of conversations involving T reached the user's goal?

Roll these into a per-tool composite at whatever weighting feels right for your product. Track over time. Bisect against deploys.
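A composite could be a plain weighted average. The sketch below is one possible roll-up, not a recommended formula: the weights are placeholders, the re-call rate is inverted so that higher is always better, and latency is left out because it is not a rate in [0, 1] and would need its own normalization.

```python
from dataclasses import dataclass


@dataclass
class ToolScore:
    """Per-tool metrics over one window; every rate is in [0, 1]."""
    selection: float    # fraction of turns needing T that called T
    args: float         # fraction of calls to T with correct arguments
    recall_rate: float  # fraction of calls to T that were re-calls (lower is better)
    end_to_end: float   # fraction of conversations involving T that reached the goal


# Assumed weights; tune these for your product.
WEIGHTS = {"selection": 0.3, "args": 0.3, "recovery": 0.15, "end_to_end": 0.25}


def composite(score: ToolScore, weights: dict = WEIGHTS) -> float:
    """Weighted average of the four rates, normalized by the total weight."""
    parts = {
        "selection": score.selection,
        "args": score.args,
        "recovery": 1.0 - score.recall_rate,
        "end_to_end": score.end_to_end,
    }
    weighted = sum(weights[name] * value for name, value in parts.items())
    return weighted / sum(weights.values())


before = composite(ToolScore(selection=0.9, args=0.95, recall_rate=0.02, end_to_end=0.8))
after = composite(ToolScore(selection=0.8, args=0.9, recall_rate=0.1, end_to_end=0.7))
```

Storing the four components next to the composite keeps the chart diagnosable: when the composite drops, the component that moved tells you where to look.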

A tool whose composite drops 5% after a description tweak is a tool whose tweak hurt. That is the metric the rest of this post has been building toward. Without it, description tweaks are vibes-driven; with it, they are measured.

What "good" looks like, numerically

For calibration, rough numbers that have served us well as floors:

Selection accuracy on a real-traffic-derived eval set: 80% to 90% per tool. Lower than 80% suggests the tool description or the catalog organization needs work. Higher than 90% may mean the dataset is too easy.
Argument correctness given correct selection: 90% to 95%. The tail is where the interesting failures live.
Quiet-failure rate: under 5% per tool. Above that is a description-or-schema problem.
Golden run path length stability: ±1 call from the captured golden. Bigger drift is a workflow regression.
End-to-end task success on a representative dataset: depends entirely on task complexity, but a stable number across releases matters more than the absolute level. Watch the trend, not the snapshot.

These are not benchmarks to chase. They are floors for "things are healthy" detection. A team that sustains them is shipping a system that is broadly working; a team whose numbers drop on a release has a regression to investigate before shipping.
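The numeric thresholds from the list above can be encoded as a release check. A minimal sketch: the metric names are assumptions, the limits are the ones quoted above, and end-to-end success is left out because the text treats it as a trend rather than a fixed threshold.

```python
# (kind, limit): "min" means the value must not fall below the limit,
# "max" means it must not exceed it. Metric names are hypothetical.
THRESHOLDS = {
    "selection_accuracy": ("min", 0.80),
    "argument_correctness": ("min", 0.90),
    "quiet_failure_rate": ("max", 0.05),
    "golden_path_drift": ("max", 1),  # absolute difference in calls vs. the golden
}


def health_check(metrics: dict) -> list[str]:
    """Return the names of metrics that violate their threshold."""
    failing = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        if kind == "min" and value < limit:
            failing.append(name)
        elif kind == "max" and value > limit:
            failing.append(name)
    return failing


healthy = health_check({
    "selection_accuracy": 0.86,
    "argument_correctness": 0.93,
    "quiet_failure_rate": 0.03,
    "golden_path_drift": 1,
})
unhealthy = health_check({
    "selection_accuracy": 0.74,
    "argument_correctness": 0.93,
    "quiet_failure_rate": 0.08,
    "golden_path_drift": 0,
})
```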

Closing the loop with description tweaks

A pattern we use whenever a quality drop is traced to a description change.

Confirm the drop on the eval dataset, not just on a single test case.
Compare the old description to the new. What changed? Word choice, parameter framing, example usage?
Form a hypothesis about why the model behaves differently with the new description. Often the explanation is that example usage was clearer in one version.
Edit the description to address the hypothesis. Run the eval again. Compare.

The loop closes when the eval recovers. The cost of one round of this is a few minutes of compute and a few minutes of human attention; the alternative — shipping a description that quietly degrades the agent's performance — is days or weeks of customer support pain.

This is the tool-design post's discipline made measurable. That post argues description quality is the dominant lever; this loop is how you actually pull it.

When to evolve the dataset

Eval datasets rot. New user behaviors emerge; new edge cases appear; the tool catalog changes. A dataset that was representative six months ago may now be measuring the wrong thing.

A useful cadence: monthly review of a sample of recent production traffic, comparing it to the eval dataset's distribution. If users are now asking questions the dataset does not represent, add new cases. If a category has stopped occurring (a feature was removed, a workflow was deprecated), prune those cases.
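The monthly comparison can be partly automated by comparing category shares. This sketch assumes that sampled production turns have already been tagged with the same category labels as the dataset, and the 10-point gap threshold is an arbitrary starting value.

```python
from collections import Counter


def category_shares(categories):
    """Turn a list of category tags into {category: share of total}."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def coverage_gaps(production, dataset, min_gap=0.10):
    """Compare category shares in sampled production traffic against the dataset.

    Returns (underrepresented, stale): categories that production sees at least
    `min_gap` more often than the dataset covers, and dataset categories that
    no longer appear in the production sample at all.
    """
    prod, data = category_shares(production), category_shares(dataset)
    under = sorted(c for c in prod if prod[c] - data.get(c, 0.0) >= min_gap)
    stale = sorted(c for c in data if c not in prod)
    return under, stale


# Made-up tags: refunds have grown in production, "legacy" has disappeared.
production = ["lookup"] * 5 + ["refund"] * 5
dataset = ["lookup"] * 5 + ["refund"] * 1 + ["legacy"] * 4
under, stale = coverage_gaps(production, dataset)
```

Categories in `under` are candidates for new cases; categories in `stale` are candidates for pruning, after a human confirms the feature really is gone.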

The dataset is software. Treat it like software — version control, code review on additions and changes, tests that the dataset itself is internally consistent.
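An internal-consistency test might check for duplicate IDs, missing category tags, and labels that name tools no longer in the catalog. A minimal sketch, assuming cases are stored as dicts with `case_id`, `category`, and `expected_calls` fields (hypothetical names):

```python
def validate_dataset(cases, known_tools):
    """Return a list of problems found; an empty list means the dataset is consistent.

    Each case is a dict with `case_id`, `category`, and `expected_calls`
    (a list of {"tool": ..., "arguments": ...} dicts).
    """
    problems = []
    seen_ids = set()
    for case in cases:
        cid = case.get("case_id")
        if cid in seen_ids:
            problems.append(f"{cid}: duplicate case_id")
        seen_ids.add(cid)
        if not case.get("category"):
            problems.append(f"{cid}: missing category tag")
        for call in case.get("expected_calls", []):
            if call["tool"] not in known_tools:
                problems.append(f"{cid}: unknown tool {call['tool']!r}")
    return problems


tools = {"getCustomer", "searchOrders"}
good = [{
    "case_id": "c1",
    "category": "customer",
    "expected_calls": [{"tool": "getCustomer", "arguments": {}}],
}]
# Duplicate ID, empty category, and a tool that is not in the catalog.
bad = good + [{
    "case_id": "c1",
    "category": "",
    "expected_calls": [{"tool": "removedTool", "arguments": {}}],
}]
```

Running a check like this in CI on every dataset change catches labels that went stale when a tool was renamed or removed.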

Where this fits

The observability post is the data-collection prerequisite — without per-call structured telemetry, none of these metrics are computable. The testing post covers the lower layers of the test pyramid; eval datasets and golden runs are the top layer of that pyramid, and they share infrastructure with it.

The tool-design post is the closest neighbor — quality evaluation is the feedback loop that makes deliberate tool design tractable. The versioning post interacts: a major version bump is exactly the moment to re-baseline the eval dataset against the new tool surface.

For multi-tenant deployments, the multi-tenant checklist implies per-tenant quality scoring — the same tools may be performing differently for different customers, and the rolled-up average can hide a single tenant whose experience has tanked.
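The way a rolled-up average hides one tenant is easy to demonstrate. In this sketch the outcome records are `(tenant_id, succeeded)` pairs and the 15-point gap is an assumed alert threshold:

```python
from collections import defaultdict


def success_by_tenant(outcomes):
    """outcomes: iterable of (tenant_id, succeeded) pairs. Returns {tenant: rate}."""
    total, ok = defaultdict(int), defaultdict(int)
    for tenant, succeeded in outcomes:
        total[tenant] += 1
        ok[tenant] += int(succeeded)
    return {tenant: ok[tenant] / total[tenant] for tenant in total}


def outlier_tenants(outcomes, max_gap=0.15):
    """Return (overall_rate, tenants more than `max_gap` below the overall rate)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0, []
    overall = sum(int(succeeded) for _, succeeded in outcomes) / len(outcomes)
    rates = success_by_tenant(outcomes)
    return overall, sorted(t for t, rate in rates.items() if overall - rate > max_gap)


# Made-up traffic: the overall rate looks fine, but tenant "c" is at 20%.
outcomes = (
    [("a", True)] * 45 + [("b", True)] * 45 + [("c", True)] * 2 + [("c", False)] * 8
)
overall, outliers = outlier_tenants(outcomes)
```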

Tool-call quality is the metric the field is quietly missing. Most teams ship MCP servers and trust that "the agent calls our tools" is the same as "the agent uses our tools well." It is not. The teams that pull ahead are the ones who instrument the difference, watch it over time, and treat description tweaks as measured experiments rather than stylistic edits.