Writing MCP Tool Descriptions LLMs Actually Use

There is a counterintuitive fact at the heart of MCP work: the part of an MCP server that the language model interacts with is not the code. It is the writing. The handler implementation could be a thousand lines or zero, the SDK could be the latest version or the previous one, the database could be Postgres or DuckDB or a JSON file — the model has no idea about any of it. The model sees three fields per tool, and decides on the basis of those three fields whether and how to call.

That is the entire decision surface, and it is the single most underestimated piece of MCP engineering. Most teams spend two days on tool code and three minutes on tool descriptions, and then wonder why the model keeps reaching for read_file instead of the tool they wrote. The descriptions are doing the navigation. The code is invisible. Get the descriptions wrong and a working tool sits unused; get them right and a model that ignored the tool yesterday picks it correctly today, with no other change.

This post is the working set of patterns we have converged on for writing those descriptions — and an evaluation procedure that will tell you whether yours are working before any user ever touches them.

What the model actually sees

For each tool you register, the model receives exactly three fields:

The name of the tool.
The description you wrote.
The input schema (with each field's description).

That is the entire surface. The handler is invisible. The repo, folder structure, supporting helpers, internal comments — all invisible. If those three pieces do not make the right call obvious, the model will not make it.

   YOUR SOURCE FILE                 WHAT THE MODEL EVER SEES
   ───────────────────              ───────────────────────────

   server.registerTool(             {
     "flutter_test",                  name:        "flutter_test",
     {                                description: "...",
       title: "...",                  input_schema: {
       description: "...",              filter:  "string, ..."
       inputSchema: {                   flavor:  "string, ..."
         filter: z.string()           }
           .describe("..."),        }
         flavor: z.string()
           .describe("..."),        ↑ this is everything.
       },                            the handler is invisible
     },                              to the LLM. always.
     async (args) => {              ─────────────────────────
       // 80 lines of               If the right call isn't
       // process management        obvious from these three
       // result parsing            fields, the model won't
       // edge cases                make it.
     }
   );
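
The same boundary can be sketched in a few lines of TypeScript. This is a minimal illustration using a hypothetical RegisteredTool shape, not the MCP SDK's own types, and the filter description is an assumed wording: projecting a tool down to what the model can read leaves exactly three fields, and the handler drops out.

```typescript
// Hypothetical shapes for illustration; these are not the MCP SDK's types.
interface FieldSchema {
  type: string;
  description?: string;
}

interface RegisteredTool {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: { [field: string]: FieldSchema } };
  // Server-side only. Simplified to a synchronous function for this sketch.
  handler: (args: { [key: string]: unknown }) => string;
}

type ModelView = Pick<RegisteredTool, "name" | "description" | "inputSchema">;

// Keep only the fields a model can read when deciding whether to call.
function toModelView(tool: RegisteredTool): ModelView {
  return {
    name: tool.name,
    description: tool.description,
    inputSchema: tool.inputSchema,
  };
}

const flutterTestTool: RegisteredTool = {
  name: "flutter_test",
  description:
    "Runs flutter test in the configured project. Returns a structured summary: " +
    "pass/fail counts, total duration, and full details for any failing tests.",
  inputSchema: {
    type: "object",
    properties: {
      // Assumed wording for the filter argument.
      filter: { type: "string", description: "Test-name substring. Omit to run every test." },
      flavor: {
        type: "string",
        description: "Build flavor (passed as --flavor). Omit if the project has no flavors.",
      },
    },
  },
  handler: () => "process management, result parsing, edge cases: all invisible",
};

console.log(JSON.stringify(toModelView(flutterTestTool), null, 2));
```

Printing the projected object is a quick way to read a tool the way the model does: if the right call is not obvious from that output alone, the description needs work.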

This sounds like documentation. It is not. Documentation is something a human reads to learn. A tool description is the interface contract the model consults every time it considers calling. Every conversation is a fresh interaction with a stranger who only sees what you wrote. There is no team Slack to ask "should I use this one or read_file?". The description is the whole answer.

The implication is that tool descriptions are the unwritten product surface of an MCP server. They are more important than the code. Fix the description and a model that ignored your tool will start reaching for it. Fix the code without fixing the description and the model still will not call it.

The three-word problem

A quick and tempting description might be simply:

Run flutter test.

Three words. Not a lie. Not useful, either.

It tells the model what the tool does. It does not tell the model when to use it. The model has eight other tools available; why would it pick this one? It does not say what the tool returns, so the model has to guess whether the result is useful. It does not reveal capabilities — the filter argument, the flavor argument, the streaming JSON parser, the structured summary that is a hundred times cheaper than dumping raw test output into the context. None of that is in the description. The model is making decisions in the dark.

Compare with what the description eventually became:

Runs flutter test in the configured project. Returns a structured summary: pass/fail counts, total duration, and full details for any failing tests (name, error message, stack trace). Optionally filter by test-name substring or run for a specific build flavor. Output is parsed from flutter test --machine so the agent sees only the summary, not raw test output — typically a 50–500x token reduction vs. piping raw test output.

Four sentences. Each one earns its keep.

Sentence one says what the tool does and establishes the boundary: it runs in the configured project, not arbitrary directories. Sentence two tells the model exactly what it is getting back, in concrete terms it can plan against: "if there are failures, I will have stack traces; I will not need a follow-up call to read the test output." Sentence three reveals the optional capabilities, the test-name filter and the build flavor. Sentence four sells the tool against the alternative (read_file, raw subprocess) by being honest about the cost difference.

After this rewrite, the same model that ignored the tool reached for it on every test-related prompt. Same code, same schema. Completely different behavior.

The patterns

These are not gospel. They are the patterns we converged on after a few projects and a lot of staring at why the model was not calling the tool we wanted.

Lead with when, not what

The model decides between tools by matching the user's intent against the descriptions. Front-load the trigger condition.

Bad: Lists files in the configured directory, returning their names and modification timestamps.

Better: Use this when the user asks what is in a folder, or as a first step before reading specific files. Returns a directory listing with [dir] prefixes for folders.

The "use this when" does not have to be those literal words; it just has to be the first thing the model encounters. Openers such as "Returns a structured summary", "Use to enumerate", and "Run when" all work, as long as the leading verb makes the trigger clear.

State the return shape in plain English

The model is going to plan a sequence of calls. To plan, it needs to know what each call gives it.

Returns a structured summary: pass/fail counts, total duration, and full details for any failing tests (name, error message, stack trace).

Not "returns a JSON object" — that is worthless to the model. The fields the JSON has, in plain English, in one breath. Now the model knows it does not need a follow-up call to inspect failures.

Differentiate from the obvious alternative

The model has read_file. It has bash. For most "do X with my code" prompts, those are the default fallbacks. If your tool is better than the fallback for a specific situation, say so.

Output is parsed from `flutter test --machine` so the agent sees only the summary, not raw test output — typically a 50–500x token reduction vs. piping raw test output.

The model genuinely cares about token cost. Telling it your tool is cheap is sometimes the deciding factor in a tight context.
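
To make the summary-versus-raw-output difference concrete, here is a minimal sketch of the reduction step. It assumes the line-delimited JSON events that flutter test --machine emits have the shapes declared below (testStart, error, testDone, done); those field names should be checked against the reporter documentation. The function and type names are hypothetical, not taken from the tool discussed above.

```typescript
// Assumed event shapes, modeled on the line-delimited JSON that
// `flutter test --machine` emits. Verify field names before relying on them.
type MachineEvent =
  | { type: "testStart"; test: { id: number; name: string } }
  | { type: "error"; testID: number; error: string; stackTrace: string }
  | { type: "testDone"; testID: number; result: string; hidden: boolean; skipped: boolean }
  | { type: "done"; time: number };

interface FailureDetail {
  name: string;
  error: string;
  stackTrace: string;
}

interface TestSummary {
  passed: number;
  failed: number;
  durationMs: number;
  failures: FailureDetail[];
}

// Returns null for anything that is not a parseable JSON object line.
function parseEvent(line: string): MachineEvent | null {
  if (line.trim().charAt(0) !== "{") return null;
  try {
    return JSON.parse(line) as MachineEvent;
  } catch {
    return null;
  }
}

function summarizeMachineOutput(raw: string): TestSummary {
  const names: { [id: number]: string | undefined } = {};
  const errors: { [id: number]: { error: string; stackTrace: string } | undefined } = {};
  const summary: TestSummary = { passed: 0, failed: 0, durationMs: 0, failures: [] };

  for (const line of raw.split("\n")) {
    const ev = parseEvent(line);
    if (ev === null) continue;
    if (ev.type === "testStart") {
      names[ev.test.id] = ev.test.name;
    } else if (ev.type === "error") {
      errors[ev.testID] = { error: ev.error, stackTrace: ev.stackTrace };
    } else if (ev.type === "testDone") {
      if (ev.hidden || ev.skipped) continue; // ignore loader entries and skipped tests
      if (ev.result === "success") {
        summary.passed += 1;
      } else {
        const detail = errors[ev.testID];
        summary.failed += 1;
        summary.failures.push({
          name: names[ev.testID] ?? "(unknown test)",
          error: detail ? detail.error : "",
          stackTrace: detail ? detail.stackTrace : "",
        });
      }
    } else if (ev.type === "done") {
      summary.durationMs = ev.time;
    }
  }
  return summary;
}

// Made-up sample events for illustration.
const sampleOutput = [
  '{"type":"testStart","test":{"id":0,"name":"loading test/title_test.dart"},"time":1}',
  '{"type":"testDone","testID":0,"result":"success","hidden":true,"skipped":false,"time":5}',
  '{"type":"testStart","test":{"id":1,"name":"adds numbers"},"time":10}',
  '{"type":"testDone","testID":1,"result":"success","hidden":false,"skipped":false,"time":20}',
  '{"type":"testStart","test":{"id":2,"name":"renders title"},"time":21}',
  '{"type":"error","testID":2,"error":"Expected: one widget","stackTrace":"title_test.dart 12:5","isFailure":true,"time":30}',
  '{"type":"testDone","testID":2,"result":"failure","hidden":false,"skipped":false,"time":31}',
  '{"type":"done","success":false,"time":40}',
].join("\n");

console.log(JSON.stringify(summarizeMachineOutput(sampleOutput), null, 2));
```

Passing tests collapse into a single count while failures keep their full detail, which is where the saving comes from. The ratio on a real suite depends on how verbose the raw output is.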

Mark the dangerous tools

Some tools should not be called speculatively. If your tool writes, deletes, sends, or mutates anything outside the user's expectation, the description is where you make that obvious.

WARNING: this overwrites existing goldens. Only set after manually confirming the visual changes are intended.

Frontier models pattern-match against words like WARNING, MUST, NEVER, "confirm with the user." Use them when they are earned. Do not sprinkle them on every tool — the model learns to ignore them if they are everywhere. Earned warnings work; performative ones decay quickly.
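
One way to keep warnings earned is to count them. The sketch below is a hypothetical check, not part of any SDK: it lists which tools in a server carry an alarm word, so that a creeping "WARNING on everything" habit shows up in review. The tool names and the update_goldens description are made up for illustration.

```typescript
// Hypothetical tool texts; names and wording are illustrative only.
interface ToolText {
  name: string;
  description: string;
}

// Upper-case alarm words, matched case-sensitively on purpose.
const ALARM_WORDS = ["WARNING", "MUST", "NEVER"];

function hasAlarmWord(description: string): boolean {
  return ALARM_WORDS.some((word) => description.indexOf(word) !== -1);
}

// Names of the tools whose descriptions raise an alarm.
function alarmedTools(tools: ToolText[]): string[] {
  return tools.filter((t) => hasAlarmWord(t.description)).map((t) => t.name);
}

const toolTexts: ToolText[] = [
  {
    name: "list_files",
    description:
      "Use this when the user asks what is in a folder, or as a first step before reading specific files.",
  },
  {
    name: "flutter_test",
    description: "Runs flutter test in the configured project. Returns a structured summary.",
  },
  {
    name: "update_goldens",
    description:
      "Regenerates golden image files. WARNING: this overwrites existing goldens. " +
      "Only call after the user has confirmed the visual changes are intended.",
  },
];

const alarmed = alarmedTools(toolTexts);
console.log(`${alarmed.length} of ${toolTexts.length} tools carry a warning:`, alarmed);
```

If that count approaches the total number of tools, the warnings have probably stopped carrying information.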

Argument descriptions are not optional

Every Zod field in your input schema gets a .describe(). Every one. Skipping it because "the field name is obvious" is how the model passes the wrong values.

Bad: flavor: z.string().optional() — the model has to infer what flavor even refers to.

Better: flavor: z.string().optional().describe("Build flavor (passed as --flavor). Omit if the project has no flavors.") — now the model knows what to put there and knows when to leave it empty.

Argument descriptions are doubly important for optional fields. The model has to decide whether to include them at all. "Optional, omit if X" is information it can act on. "Optional" alone is information it cannot.
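
This rule is easy to check mechanically. The following is a rough sketch of a lint over the JSON Schema form of an input schema. The function name, the schema shapes, and the keyword heuristic (looking for "omit", "default", or "leave") are all assumptions for illustration, and the filter description is an assumed wording.

```typescript
// Hypothetical, simplified JSON Schema shapes for a tool's input.
interface FieldSchema {
  type: string;
  description?: string;
}

interface InputSchema {
  type: "object";
  properties: { [field: string]: FieldSchema };
  required?: string[];
}

// Flags fields with no description, and optional fields whose
// description gives no hint about when to leave them out.
function lintArguments(schema: InputSchema): string[] {
  const problems: string[] = [];
  const required = schema.required ?? [];
  for (const field of Object.keys(schema.properties)) {
    const description = (schema.properties[field].description ?? "").trim();
    const optional = required.indexOf(field) === -1;
    if (description === "") {
      problems.push(`${field}: missing description`);
    } else if (optional && !/omit|default|leave/i.test(description)) {
      problems.push(`${field}: optional, but the description does not say when to omit it`);
    }
  }
  return problems;
}

const vagueSchema: InputSchema = {
  type: "object",
  properties: {
    filter: { type: "string" },
    flavor: { type: "string", description: "Build flavor." },
  },
};

const clearSchema: InputSchema = {
  type: "object",
  properties: {
    filter: { type: "string", description: "Test-name substring. Omit to run every test." },
    flavor: {
      type: "string",
      description: "Build flavor (passed as --flavor). Omit if the project has no flavors.",
    },
  },
};

console.log(lintArguments(vagueSchema));
console.log(lintArguments(clearSchema));
```

A keyword match is crude and will miss good descriptions phrased differently, so treat its output as a prompt for review rather than a verdict.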

Treat descriptions as part of the prompt budget

A subtler pattern, and one that becomes visible only at scale: every tool description is sent to the model as part of its context on every turn. With ten tools and rich descriptions, you can easily have spent a thousand tokens on tool definitions before the user has typed anything. That is a real cost, and it is a hidden one.

The implication is not "write shorter descriptions." Short descriptions cost too — in failed tool calls, in wrong tool selection, in roundtrips. The implication is to write descriptions that are dense. Each sentence carries information the model cannot derive from anywhere else. Boilerplate ("this is a tool you can call with the following arguments") is pure overhead. Cut it.

A useful test: read a description aloud. If a sentence does not change the model's likely behavior (when it reaches for the tool, what arguments it picks, how it interprets the result), that sentence is rent: tokens paid on every turn for nothing in return.
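
A rough way to see that budget is to estimate it. The sketch below uses a crude heuristic of about four characters per token, which is an assumption. Real tokenizers differ, so treat the numbers as relative rather than exact. The names are hypothetical.

```typescript
// Hypothetical tool definition shape: the fields sent to the model.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: object;
}

// Crude heuristic: roughly four characters per token. Real tokenizers differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface BudgetReport {
  perTool: { name: string; tokens: number }[];
  total: number;
}

// Estimates what each serialized tool definition costs on every turn.
function budgetReport(tools: ToolDefinition[]): BudgetReport {
  const perTool = tools.map((tool) => ({
    name: tool.name,
    tokens: estimateTokens(JSON.stringify(tool)),
  }));
  const total = perTool.reduce((sum, entry) => sum + entry.tokens, 0);
  return { perTool, total };
}

const definitions: ToolDefinition[] = [
  {
    name: "flutter_test",
    description:
      "Runs flutter test in the configured project. Returns a structured summary: " +
      "pass/fail counts, total duration, and full details for any failing tests.",
    inputSchema: { type: "object", properties: { filter: { type: "string" } } },
  },
  {
    name: "list_files",
    description: "Use this when the user asks what is in a folder.",
    inputSchema: { type: "object", properties: {} },
  },
];

console.log(budgetReport(definitions));
```

The per-tool breakdown is the useful part: it shows which descriptions are expensive, so those are the ones to reread sentence by sentence.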

The evaluation procedure

Here is the unglamorous but reliable test we run before considering any tool description done. Take the tool description and schema, paste them into a fresh chat with whatever model the deployment will use, and ask:

"Here is a tool you have access to. Without showing me anything else, describe in your own words: when should you call this, what arguments does it take, what do you get back, and what should you be careful about?"

If the model cannot answer those four questions clearly, the description is broken. Edit until it can. This takes about ten minutes per tool. It is the most leveraged ten minutes of the entire project.
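
Assembling that prompt by hand for every tool gets tedious, so it helps to generate it. A minimal sketch, assuming a hypothetical ToolDefinition shape. The question text is the one quoted above, and sending the prompt to a model is left out.

```typescript
// Hypothetical tool definition shape: the fields the model sees.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: object;
}

// Builds the review prompt: the tool as the model sees it, plus the four questions.
function buildReviewPrompt(tool: ToolDefinition): string {
  return [
    "Here is a tool you have access to.",
    "",
    JSON.stringify(tool, null, 2),
    "",
    "Without showing me anything else, describe in your own words: " +
      "when should you call this, what arguments does it take, " +
      "what do you get back, and what should you be careful about?",
  ].join("\n");
}

const toolUnderReview: ToolDefinition = {
  name: "flutter_test",
  description:
    "Runs flutter test in the configured project. Returns a structured summary: " +
    "pass/fail counts, total duration, and full details for any failing tests.",
  inputSchema: {
    type: "object",
    properties: {
      flavor: {
        type: "string",
        description: "Build flavor (passed as --flavor). Omit if the project has no flavors.",
      },
    },
  },
};

console.log(buildReviewPrompt(toolUnderReview));
```

Paste the output into a fresh chat with the deployment model and compare its four answers against what you intended.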

A more rigorous variant — useful for production work and for any client engagement — is what we sometimes call the stub test. Replace every handler in the server with a stub that returns canned data. Ship the stubbed server through a real host with a real LLM at the wheel. Try ten realistic prompts. Watch which tools the model picks, in what order, with what arguments.

If the model still routes correctly with stubbed handlers, the descriptions are doing their job. The handlers are not the bottleneck. If the model picks the wrong tool, or the wrong arguments, or fails to use a tool that obviously applies, the code was never the problem. The writing was. This separation of concerns is impossible to get any other way, and it costs nothing once you have the SDK loaded. Almost every team that runs the stub test once runs it permanently.
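
The stubbing step itself is small. The sketch below is one way to do it, using hypothetical types rather than the MCP SDK's registration API, and with handlers simplified to synchronous functions: every handler is swapped for one that records the call and returns canned data, while names and descriptions stay untouched.

```typescript
// Hypothetical shapes; handlers are simplified to synchronous functions.
type Args = { [key: string]: unknown };

interface ServerTool {
  name: string;
  description: string;
  handler: (args: Args) => unknown;
}

interface RecordedCall {
  tool: string;
  args: Args;
}

// Replaces every handler with a stub that logs the call and returns canned data.
function stubHandlers(
  tools: ServerTool[],
  canned: { [tool: string]: unknown },
  calls: RecordedCall[],
): ServerTool[] {
  return tools.map((tool): ServerTool => ({
    name: tool.name,
    description: tool.description, // the writing under test stays exactly as is
    handler: (args: Args) => {
      calls.push({ tool: tool.name, args });
      return canned[tool.name];
    },
  }));
}

const realTools: ServerTool[] = [
  {
    name: "flutter_test",
    description: "Runs flutter test in the configured project. Returns a structured summary.",
    handler: () => {
      throw new Error("real handler: would spawn a process");
    },
  },
];

const callLog: RecordedCall[] = [];
const stubbedTools = stubHandlers(
  realTools,
  { flutter_test: { passed: 12, failed: 0, durationMs: 3400, failures: [] } },
  callLog,
);

// Stand-in for the host: in a real run, the model picks the tool and arguments.
const stubResult = stubbedTools[0].handler({ filter: "login" });
console.log(callLog, stubResult);
```

After a session of realistic prompts, the call log is the artifact to read: which tools were picked, in what order, with what arguments.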

Writing for the second-best model

A frame worth carrying: write descriptions for the second-best model on the market today.

Whatever model your client is using right now, there is a non-trivial chance that in six months half of the agent traffic will route through something cheaper. Haiku for the cost-sensitive path, a self-hosted Qwen for the on-prem one, a smaller model for the high-volume background work. Token economics push hard in this direction, and the AI ecosystem keeps producing capable models that fit on smaller hardware.

If your descriptions only work because today's frontier model is impressive enough to figure them out, your tools will degrade silently when the agent gets downgraded. There will be no error message. The tools will still exist, still validate, still execute correctly. The model just will not pick them as often, or will pick them with worse arguments, and the agent will feel a little less reliable for reasons no log file will explain.

Writing descriptions a smaller model could still get right is the cheapest hedge against that future. It costs nothing extra at write time and pays back every time the deployment topology changes underneath you.

Why this is the actual hard part

The MCP SDK is small. The transport is solved. The schema validation is a one-line Zod call. None of these are where the hard work is.

The hard work is figuring out what tools the domain actually needs, how narrow each one should be, and what to write so the model picks the right one without help. Three problems, all of them in the description text. None of them in the code.

The seasoned MCP developer's secret is that they have gotten good at writing those descriptions. Not because they read a guide. Because they have watched a model misuse a tool, traced the misuse back to a vague description, fixed the description, and watched the misuse stop. Hundreds of times. That feedback loop is the skill — write the description, hand it to a model, watch what happens, edit. The code mostly takes care of itself once the descriptions are right.

This is also why MCP work resists the standard tutorial format. There is nothing visual to demo. The artifact is a paragraph of text that, after editing, makes a different paragraph happen later. Most agencies and most freelancers gloss over this and bill for the SDK plumbing instead. We mention it because, when scoping client engagements at Amazing Resources, the tool-description work is the line on the invoice that produces the most behavior change per hour billed, and the line that most clients are most surprised to learn matters.

Where this fits

If you came here from the pillar, you now have one more lens for what good MCP work looks like. If your next concern is security — and tool design is half of MCP security — the security post is the natural next read; it picks up the narrow-funnel principle that quietly underlies most of what is written above. If you have not yet written a no-SDK MCP server, that exercise pairs surprisingly well with this one — once you have seen the JSON Schema on the wire, the description-as-product-surface framing becomes a lot less abstract.
