There is a recurring failure mode in young MCP projects. The team writes a server, runs it against Claude Desktop a few times, sees the agent call its tools correctly, and ships. Two weeks later something breaks in subtle ways — a description tweak that drifted the agent's behavior, a schema change that broke a path the team had not noticed was being used, a transport bug that only manifests on Streamable HTTP because all manual testing happened on stdio.
The cause is the same in every case: the only test the team had was "we tried it once with a real LLM and it worked." That is not a test strategy. It is a smoke test, run once, against a stochastic system that may not behave the same way on the next run. Real testing for an MCP server has four layers, each with a clear purpose, and only the top layer needs an LLM in the loop.
This post is the test strategy we have settled on, anchored on Vitest because it is the test runner of choice for the Node.js Clean Architecture template we recommend as a backend foundation. The patterns translate to Jest, ava, or whatever else with minor cosmetic changes; the discipline is what matters.
The pyramid for MCP
The shape:
┌───────────────────┐
│  Live LLM tests   │  expensive, flaky, few
└───────────────────┘
┌─────────────────────────┐
│  Transport integration  │  realistic but no model
└─────────────────────────┘
┌───────────────────────────────┐
│      Handler unit tests       │  pure, fast, many
└───────────────────────────────┘
┌─────────────────────────────────────┐
│    Description / schema snapshot    │  cheapest, run on every commit
└─────────────────────────────────────┘
Bottom layer is the largest and the cheapest. Top layer is the smallest and the most expensive. Most teams start at the top, run a few flaky LLM tests, get frustrated, and stop. The right move is to push as much testing as possible toward the bottom of the pyramid — most regressions can be caught without ever calling a model — and reserve the top layer for the things only a model can validate.
Layer 1: Description and schema snapshots
The cheapest, most under-used test in the pyramid. It catches the most subtle regression.
The setup: every tool's metadata — name, description, input schema, output schema — is captured in a snapshot file. Every test run, the current metadata is compared to the snapshot. If anything has changed, the test fails until a human reviews the change and explicitly updates the snapshot.
import { test, expect } from "vitest";
import { server } from "../src/server";
test("tool descriptions are stable", () => {
const tools = server.listToolsForTest();
expect(tools).toMatchSnapshot();
});
This looks trivial. The value is more psychological than technical: the snapshot file is in version control, every change to a tool description shows up in the PR diff, and a reviewer who is not the one who made the change has to look at it and approve it.
The reason this matters: from the versioning post, tool descriptions are behavior changes for the system. A description tweak that nobody looked at is a behavior change that nobody approved. The snapshot test forces every description tweak to enter human awareness.
What to put in the snapshot:
- Tool name.
- Tool description (full text).
- Input schema (the canonical form, not the Zod object — call .shape and serialize, or use whatever the SDK exposes; a serialization sketch follows after this list).
- Output schema if your tool advertises one.
- The same for resources and prompts.
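Getting the canonical schema form takes one extra step. A sketch, assuming a hypothetical toolRegistry module that exposes each tool's name, description, and Zod input schema, and using the zod-to-json-schema package to produce the serializable form:
import { test, expect } from "vitest";
import { zodToJsonSchema } from "zod-to-json-schema";
import { toolRegistry } from "../src/tools"; // hypothetical: however your project exposes its tool definitions
test("tool metadata snapshot", () => {
  // Build a stable, serializable view of every tool: name, description,
  // and the JSON Schema form of its Zod input schema.
  const catalog = toolRegistry.map((tool) => ({
    name: tool.name,
    description: tool.description,
    inputSchema: zodToJsonSchema(tool.inputSchema),
  }));
  expect(catalog).toMatchSnapshot();
});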
What to leave out: timestamps, version strings that change every release, anything else that would cause spurious diffs. The point of the snapshot is to catch meaningful changes; if every commit produces a snapshot diff, the team will start ignoring the test, which is the test failing in slow motion.
Layer 2: Handler unit tests
The bulk of the pyramid. Every tool handler is, at heart, a pure-ish function: input args go in, a side effect happens (calling a backend, reading a file), a result comes out. Most of those handlers have non-trivial logic — argument validation, path-traversal guards, error handling, result-shape construction — that is fully testable without anything MCP-related in the loop.
The pattern:
import { test, expect, vi } from "vitest";
import { publishArticleHandler } from "../src/handlers/publishArticle";
test("publishArticle rejects path traversal", async () => {
const backend = { publishDraft: vi.fn() };
const handler = publishArticleHandler({ backend });
await expect(
handler({ name: "../../etc/passwd" }),
).rejects.toThrow("path escape");
expect(backend.publishDraft).not.toHaveBeenCalled();
});
test("publishArticle calls backend with sanitized name", async () => {
const backend = { publishDraft: vi.fn().mockResolvedValue({ url: "/blog/x" }) };
const handler = publishArticleHandler({ backend });
const result = await handler({ name: "post-1.md" });
expect(backend.publishDraft).toHaveBeenCalledWith("post-1.md");
expect(result).toEqual({ url: "/blog/x" });
});
The handler is a function. The test calls the function. No transport, no JSON-RPC, no SDK. The mock is a stand-in for the backend the handler delegates to.
To make this work, write your handlers as factories that take their dependencies as arguments. publishArticleHandler({ backend, draftsRoot, logger }) is testable. A publishArticleHandler that imports the backend at module level is not, or is testable only with module-mocking witchcraft. The cost of factoring this way is one constructor; the benefit is that every handler becomes a unit-testable function.
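A minimal sketch of that factory shape for the publishArticle example; the Backend interface, the draftsRoot default, and the structured-error shape on backend failure are assumptions, not the one true layout:
import path from "node:path";
// Hypothetical backend port; substitute whatever your server actually delegates to.
interface Backend {
  publishDraft(name: string): Promise<{ url: string }>;
}
interface Deps {
  backend: Backend;
  draftsRoot?: string;
}
// Factory: dependencies arrive as arguments, so tests can pass in mocks.
export function publishArticleHandler({ backend, draftsRoot = "./drafts" }: Deps) {
  return async (args: { name: string }) => {
    // Path-traversal guard: the resolved path must stay inside draftsRoot.
    const root = path.resolve(draftsRoot);
    const resolved = path.resolve(root, args.name);
    if (!resolved.startsWith(root + path.sep)) {
      throw new Error("path escape rejected");
    }
    try {
      return await backend.publishDraft(args.name);
    } catch (err) {
      // Backend failures become a structured error, never an unhandled throw.
      return { isError: true, message: `publish failed: ${(err as Error).message}` };
    }
  };
}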
The cases worth testing for every tool handler:
- The happy path. A typical input produces a typical output. One assertion per output field.
- Validation failures. Each parameter's invalid forms produce the expected rejection. Empty strings, oversized inputs, wrong types if your handler accepts looser types than its schema.
- The security-relevant edge cases. Path traversal for any file-touching tool, scope checks for any tool with permissions, parameter escaping for any tool that builds shell commands or queries.
- Backend failure modes. The backend rejects, the backend times out, the backend returns malformed data. Each one should produce a result the caller can still consume — usually a structured error, never an unhandled throw.
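A backend-failure test in the same style as the earlier ones; the isError shape matches the factory sketch above, but any structured-error convention your handlers follow will do:
import { test, expect, vi } from "vitest";
import { publishArticleHandler } from "../src/handlers/publishArticle";
test("publishArticle surfaces backend failure as a structured error", async () => {
  const backend = { publishDraft: vi.fn().mockRejectedValue(new Error("upstream 503")) };
  const handler = publishArticleHandler({ backend });
  // The handler should not leak an unhandled rejection; it should return
  // a structured error the caller (and ultimately the model) can act on.
  const result = await handler({ name: "post-1.md" });
  expect(result).toMatchObject({ isError: true });
  expect(backend.publishDraft).toHaveBeenCalledOnce();
});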
A useful target: every tool handler should have a minimum of one happy-path test, one validation-failure test, and one backend-failure test. More if the tool has surface area that earns more.
Layer 3: Transport integration tests
The layer most teams skip and the one that catches "it worked on stdio, broke on HTTP."
The point: run your real server, including its real transport, against a test client that speaks JSON-RPC but is not an LLM. The client is just code, exchanging well-formed messages with the server, asserting on the responses.
For stdio:
import { test, expect } from "vitest";
import { spawn } from "node:child_process";
import { JsonRpcClient } from "./test-helpers/jsonrpc-client";
test("stdio: full lifecycle and tool call", async () => {
const proc = spawn("node", ["./dist/server.js"], { stdio: "pipe" });
const client = new JsonRpcClient(proc.stdin, proc.stdout);
await client.send("initialize", { protocolVersion: "2024-11-05", capabilities: {}, clientInfo: { name: "test", version: "0" } });
await client.notify("notifications/initialized");
const tools = await client.send("tools/list", {});
expect(tools.tools.map((t: any) => t.name)).toContain("publishArticle");
const result = await client.send("tools/call", {
name: "publishArticle",
arguments: { name: "test-fixture.md" },
});
expect(result.content[0].text).toMatch(/published/);
proc.kill();
});
For Streamable HTTP, replace the spawn with starting the server as an Express app and the JsonRpcClient with one that wraps fetch. Otherwise the structure is identical.
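A sketch of that HTTP variant, assuming the server exports its Express app from src/http-app, serves the MCP endpoint at /mcp, and that HttpJsonRpcClient is a small fetch-based helper with the same send/notify interface; all three are assumptions about project layout:
import { test, expect } from "vitest";
import { createServer } from "node:http";
import { app } from "../src/http-app"; // hypothetical Express app export
import { HttpJsonRpcClient } from "./test-helpers/http-jsonrpc-client"; // fetch-based helper, same interface
test("http: lifecycle and tool listing", async () => {
  const httpServer = createServer(app);
  await new Promise<void>((resolve) => httpServer.listen(0, resolve));
  const { port } = httpServer.address() as { port: number };
  const client = new HttpJsonRpcClient(`http://127.0.0.1:${port}/mcp`);
  await client.send("initialize", { protocolVersion: "2024-11-05", capabilities: {}, clientInfo: { name: "test", version: "0" } });
  await client.notify("notifications/initialized");
  const tools = await client.send("tools/list", {});
  expect(tools.tools.map((t: any) => t.name)).toContain("publishArticle");
  httpServer.close();
});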
What this layer catches that the unit tests do not:
- Stdio framing bugs. A console.log somewhere in the codebase that corrupts the stdout stream. The unit tests pass; the integration test fails on the first response.
- JSON-RPC framing errors. The handler returned an object the SDK could not serialize, or used a status code that breaks the spec.
- Lifecycle bugs. The server fails to handle initialize correctly, or does not advertise the capabilities it actually supports, or rejects requests that should succeed because the lifecycle state machine is wrong.
- Transport-specific issues. The HTTP transport handles session IDs differently than the stdio one. Both have to work. Both need integration tests.
Writing the test helper — the JsonRpcClient that speaks the protocol — is a one-time investment of a couple of hours. Most projects can lift one from the no-SDK post and adapt it.
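For reference, a minimal version of that helper, assuming newline-delimited JSON-RPC over the child process's stdio streams (which is what the stdio transport speaks); no batching, no server-initiated requests:
import type { Readable, Writable } from "node:stream";
export class JsonRpcClient {
  private nextId = 1;
  private pending = new Map<number, { resolve: (v: any) => void; reject: (e: Error) => void }>();
  private buffer = "";
  constructor(private input: Writable, output: Readable) {
    // Re-assemble newline-delimited JSON messages from the server's stdout.
    output.on("data", (chunk) => {
      this.buffer += chunk.toString();
      let nl;
      while ((nl = this.buffer.indexOf("\n")) !== -1) {
        const line = this.buffer.slice(0, nl);
        this.buffer = this.buffer.slice(nl + 1);
        if (!line.trim()) continue;
        const msg = JSON.parse(line);
        const entry = this.pending.get(msg.id);
        if (!entry) continue; // server notification or unknown id; ignore in tests
        this.pending.delete(msg.id);
        msg.error ? entry.reject(new Error(msg.error.message)) : entry.resolve(msg.result);
      }
    });
  }
  send(method: string, params: unknown): Promise<any> {
    const id = this.nextId++;
    this.input.write(JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n");
    return new Promise((resolve, reject) => this.pending.set(id, { resolve, reject }));
  }
  notify(method: string, params?: unknown): void {
    this.input.write(JSON.stringify({ jsonrpc: "2.0", method, params }) + "\n");
  }
}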
A useful pattern: run the same suite of integration tests against both transports. The handlers should be transport-agnostic; the test suite should prove it.
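One way to structure the shared suite: write it as a function that takes a connect factory and call it once per transport. The setup helpers here are hypothetical; each is assumed to return a client that has already completed the initialize handshake:
import { describe, test, expect } from "vitest";
import { startStdioClient, startHttpClient } from "./test-helpers/clients"; // hypothetical per-transport setup helpers
type ConnectedClient = {
  send: (method: string, params: unknown) => Promise<any>;
  close: () => Promise<void>;
};
// The shared suite only sees "something that speaks JSON-RPC"; it never
// knows which transport is underneath.
function behavesLikeAnMcpServer(name: string, connect: () => Promise<ConnectedClient>) {
  describe(`${name}: core behavior`, () => {
    test("lists the expected tools", async () => {
      const client = await connect();
      const tools = await client.send("tools/list", {});
      expect(tools.tools.map((t: any) => t.name)).toContain("publishArticle");
      await client.close();
    });
  });
}
behavesLikeAnMcpServer("stdio", startStdioClient);
behavesLikeAnMcpServer("http", startHttpClient);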
Layer 4: Live LLM tests
The top of the pyramid. Expensive, slow, flaky in the way LLMs are flaky, but uniquely valuable for a small set of cases.
The point: catch the regressions where the agent's behavior changes even though the server is technically correct.
Two kinds of test belong here.
Tool-selection tests. Given a user message and the current tool catalog, does the model pick the right tool with reasonable arguments? Run the user message through a small LLM (a 4B-class local model is fine for routine cases; a larger cloud model is needed for nuance), assert on the tool name and key arguments. Run it ten times; treat a majority of correct runs as a pass and a single failure as a flake. These are inherently statistical tests; do not pretend they are deterministic.
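A sketch of what such a test can look like against a cloud model via the Anthropic SDK; the model name, the prompt, and the seven-of-ten threshold are choices to adapt, and toolCatalogAsAnthropicTools is a hypothetical helper that exposes your catalog in Anthropic's tool format:
import { test, expect } from "vitest";
import Anthropic from "@anthropic-ai/sdk";
import { toolCatalogAsAnthropicTools } from "./test-helpers/catalog"; // hypothetical adapter
const client = new Anthropic();
test("model picks publishArticle for a publish request", async () => {
  let correct = 0;
  for (let i = 0; i < 10; i++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5", // pin the exact version your CI standardizes on
      max_tokens: 1024,
      tools: toolCatalogAsAnthropicTools(),
      messages: [{ role: "user", content: "Publish my draft post-1.md to the blog." }],
    });
    const toolUse = response.content.find((block) => block.type === "tool_use");
    if (toolUse?.type === "tool_use" && toolUse.name === "publishArticle" && (toolUse.input as { name?: string }).name === "post-1.md") {
      correct++;
    }
  }
  // Statistical assertion: a clear majority passes, a single miss is a flake.
  expect(correct).toBeGreaterThanOrEqual(7);
}, 300_000);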
Workflow tests. Given a multi-step user goal, does the agent's sequence of tool calls reach the expected end state? Harder to assert on, more valuable when they run. Use a deterministic mock backend so the side effects are recordable, run a real model against the real server, assert on the resulting backend state. Even at one or two such tests per major workflow, the regressions they catch are the ones nothing else will.
The infrastructure: an evaluation harness that loads the server, connects a mock host, runs a model with prompts, scores results. We will go deeper on this in the tool-call evaluation post; for the testing pyramid, the relevant point is these tests exist, they are slow and flaky, you run them on a daily cadence rather than per-commit, and they catch the regressions the lower layers cannot.
What does not belong in tests
A few patterns we have seen that look like good ideas and are not.
"Test that the tool description is what we wrote." That is the snapshot test. Do not also have a literal expect(description).toBe("...") test in the unit suite — the snapshot is already doing that job, and a duplicate test means descriptions get changed in two places.
"Test that the LLM picks the right tool 100% of the time." It will not. Tests that demand determinism from a stochastic system are flake factories. Statistical thresholds are the only honest assertion shape for live-model tests.
"Test the SDK." If you find yourself writing tests that exercise the @modelcontextprotocol/sdk rather than your handlers, stop. The SDK has its own tests. Your tests should treat the SDK as a trusted dependency and exercise the things you wrote.
"Test by running the real Claude Desktop and clicking around." Useful as a smoke check before release. Not a test. There is no assertion, no record, no rerun.
CI pipeline shape
Putting it together, the pipeline we run on the real projects:
- On every commit: snapshot tests, unit tests, transport integration tests. Should be under two minutes total. Block the merge on red.
- On every PR to main: the above plus a small subset of live-LLM tests against a fixed model version. Five to ten test cases. Treat 80% pass rate as green; investigate when it dips below.
- Daily, on a scheduled job: the full live-LLM test suite, including workflow tests. Results posted to a slow-moving channel; nobody is on-call for them, but trends matter.
- Pre-release: a manual smoke pass against the actual host the customers use, plus a fresh review of the snapshot diff for any tool description changes since last release.
The cadence point matters as much as the test layout. Fast tests on every commit means developers actually run them. Slow tests on a daily job means nobody waits for them but the trends still get caught. Mixing the two cadences in the same pipeline is the most common mistake — slow tests on every commit get skipped; fast tests once a day miss the regressions they would have caught.
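One way to encode the two cadences in Vitest itself is to keep the live suite in its own directory and gate it behind an environment variable; the directory name and the RUN_LIVE variable are assumptions about layout, not a convention the tool imposes:
// vitest.config.ts: the default run (every commit) covers snapshot, unit, and
// transport tests; the slow live-LLM suite under test/live/ only runs when the
// PR job or the daily scheduled job sets RUN_LIVE=1.
import { defineConfig, configDefaults } from "vitest/config";
export default defineConfig({
  test: {
    include: ["test/**/*.test.ts"],
    exclude: process.env.RUN_LIVE
      ? [...configDefaults.exclude]
      : [...configDefaults.exclude, "test/live/**"],
    testTimeout: process.env.RUN_LIVE ? 300_000 : 10_000,
  },
});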
What testing buys you, beyond catching bugs
The test strategy above does the obvious thing — catches regressions before they ship — but it also buys two less-obvious things that compound over time.
A stable baseline for description changes. When a developer wants to tweak a description, the workflow is "edit, run snapshot, see the diff, update the snapshot if it looks right, push." That workflow is brief enough that the description stays mutable, and structured enough that the change does not happen invisibly. Without the snapshot, descriptions either ossify (nobody dares change them) or churn invisibly (every PR ships a description change nobody noticed).
Confidence in the transport-swappability story. The transport post makes the case that swapping stdio for Streamable HTTP is "a few hours of work" if your handlers are clean. The integration-test layer is what proves that. A test suite that runs against both transports tells you, on every commit, that the abstraction has not leaked.
Both of these compound over a year of a server's life into a system you can change without fear. The test strategy is the thing that buys you the right to keep tweaking the tool catalog instead of freezing it.
Where this fits
The closest neighbor is the versioning post — version churn without test coverage is the cause of most "the agent broke" incidents. The observability post is the production half; tests catch what they catch, telemetry catches the rest. The tool-design post is what the snapshot tests are protecting; the tool-call evaluation post is the deeper companion to layer 4.
For teams with multiple MCP servers, the bounded-context post implies a per-server test suite — each server is a unit of testability, and combining several into one mega-suite is a smell.
A test strategy for an MCP server is one of those things that looks like overhead in week one and pays back in week ten. The teams that have it can ship description tweaks on Tuesday afternoon. The teams that do not are the ones whose Tuesday afternoon tweaks become Wednesday morning incidents.