Most production incidents involving an MCP server, in our experience, follow a pattern that is unusually hard to debug without the right telemetry in place beforehand.
The shape: a tool's output quality has been slowly degrading. The agent's behavior is almost, but not quite, what it was last week. Users start reporting "Claude keeps suggesting the wrong thing" or "my agent skipped a step it used to do." There is no exception in the logs. The HTTP 200s are still HTTP 200s. The CPU graphs are flat. The question is not "what broke" — nothing broke. The question is "what changed," and without per-call structured telemetry, the answer is unknowable.
This is the post about the telemetry we wish we had at the beginning. It is not a generic "add Prometheus" walkthrough — there are plenty of those. It is the MCP-specific concerns: what tool calls look like as observable events, the role of stdout discipline in stdio servers, the patterns that point at prompt-injection in your input args, and the per-tool quality metric that catches degradation before users do.
The stack we use, and the one this post anchors on, is pino for structured logs and OpenTelemetry for traces. It maps cleanly to the Node.js Clean Architecture template we recommend as a backend foundation, which means an MCP server wrapping that backend can share its observability surface end-to-end.
---
The MCP server's place in the picture
Before any specific tooling, it helps to be clear about where the MCP server sits in the request flow:
```
[ Host (Claude Desktop, Cursor, custom) ]
        |
        | JSON-RPC over stdio or HTTP
        v
[ MCP server (your code) ]
        |
        | HTTP / DB / shell / ...
        v
[ Backend (existing API, DB, third-party service) ]
```

Three logical hops. Each one is its own observability surface. The MCP server is the middle layer — the place that sees both what the agent asked for and what the backend was told to do. That dual visibility is the most valuable thing about logging at the MCP layer. The host sees what the agent said. The backend sees what got executed. Only the MCP server sees the translation, and the translation is where most of the interesting failures live.
A tool call should produce one log line per layer it touched, all linked by a trace ID. By the time the call has returned, you should be able to ask "what happened on this call" and get the host's view, the MCP layer's view, and the backend's view as one coherent record.
---
Logging: structured, stderr, fields not strings
Two non-negotiables for an MCP server's logs.
Structured. JSON, one event per line, parseable by every log aggregator that exists. pino is the right Node.js choice. The reason this matters is that you will want to query "every tool call where tenantId == X returned a result containing error: ..." and string-grep is not a query. Structured logs let you treat the log stream as a queryable event store, which is what it actually is.
Stderr only, for stdio servers. This is the stdout-is-the-wire gotcha, repeated because it bites every team once. console.log and console.info go to stdout. Stdout is the JSON-RPC channel. A console.log in your tool handler will corrupt the protocol, the host will get a JSON parse error, and the session will fall over. Configure pino to write to stderr (pino writes to stdout by default; pass pino.destination(2) to target stderr) and verify with a test that runs the server, sends a tool call, and asserts no garbage on stdout.
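A minimal sketch of that configuration, assuming pino v8+ and the service name from the schema below:

```typescript
import pino from "pino";

// Write structured logs to stderr (fd 2) so stdout stays reserved for JSON-RPC.
// pino's default destination is stdout, which would corrupt the protocol.
export const logger = pino(
  {
    name: "mcp-blog-publisher",
    level: process.env.LOG_LEVEL ?? "info",
  },
  pino.destination(2)
);
```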
For Streamable HTTP servers, this concern goes away — stdout is just stdout, and you can log wherever — but the discipline is still good practice. Strict separation of "protocol channel" and "operations channel" means you can run the same server in either transport without surprises.
A useful field schema we have settled on per tool-call event:
```json
{
  "level": "info",
  "ts": "2026-05-06T14:23:11.421Z",
  "service": "mcp-blog-publisher",
  "version": "0.4.2",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "session_id": "sess_a3...",
  "caller_subject": "mihai@amgres.com",
  "tool_name": "publishArticle",
  "tool_args": { "name": "post-1.md" },
  "result_kind": "success",
  "duration_ms": 184,
  "result_summary": "published 4.2KB to /blog/post-1"
}
```

A few fields in that schema are worth flagging specifically.
trace_id and span_id are W3C Trace Context. Same trace IDs flow through to backend calls, which is what makes the cross-layer story work.
tool_args is logged but with a deliberate caveat — see the next section. Some args are sensitive and should not appear in plain logs.
result_kind is a small enum: success, validation_error, handler_error, denied. Not the raw HTTP-style code; the kind of thing that happened from the agent's perspective. This is the field you will spend most of your time querying.
result_summary is a short textual summary, not the full result payload. The full result might be hundreds of KB and contain user data; the summary is "what kind of thing happened" in a sentence.
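One way to keep those fields consistent across tools is a single helper that every handler calls on completion. A sketch, using the field names above; the module path and helper name are illustrative, not part of any SDK:

```typescript
// The stderr pino instance from earlier; path is illustrative.
import { logger } from "./logger";

type ResultKind = "success" | "validation_error" | "handler_error" | "denied";

interface ToolCallEvent {
  tool_name: string;
  tool_args: Record<string, unknown>;
  result_kind: ResultKind;
  duration_ms: number;
  result_summary: string;
  session_id?: string;
  caller_subject?: string;
}

// Every handler funnels its completion through this, so the per-call
// event schema stays identical across tools and servers.
export function logToolCall(event: ToolCallEvent): void {
  // trace_id / span_id are added by the pino mixin shown in the tracing section.
  logger.info(event, "tool_call");
}
```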
---
What not to log
The unsexy half of structured logging.
Sensitive args. A tool that takes a query like searchKnowledgeBase({ query: "patient health record for John Smith" }) is logging protected information in plain text by default. The query is not the tool's "metadata" — it is the user's intent, and the user's intent is sometimes the most sensitive part of the call.
The pattern that works: a per-tool log redaction policy. Each tool declares which arg fields are loggable verbatim, which are hashed, and which are replaced with [REDACTED]. The redaction happens before the log call, never after. For a tool wrapping anything healthcare, financial, or legal, the default is "redact everything," with explicit opt-in for the fields you have decided are safe.
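A sketch of what such a policy can look like. The policy shape, tool names, and field names are illustrative, not a library API:

```typescript
import { createHash } from "node:crypto";

// Per-field handling: log verbatim, log a short hash (joinable but not readable),
// or replace with [REDACTED]. Anything not listed defaults to redaction.
type FieldPolicy = "verbatim" | "hash" | "redact";

const redactionPolicies: Record<string, Record<string, FieldPolicy>> = {
  publishArticle: { name: "verbatim" },
  searchKnowledgeBase: { query: "hash" },
};

export function redactArgs(
  toolName: string,
  args: Record<string, unknown>
): Record<string, unknown> {
  const policy = redactionPolicies[toolName] ?? {};
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    switch (policy[key] ?? "redact") {
      case "verbatim":
        out[key] = value;
        break;
      case "hash":
        out[key] = createHash("sha256").update(String(value)).digest("hex").slice(0, 16);
        break;
      default:
        out[key] = "[REDACTED]";
    }
  }
  return out;
}
```

The redaction runs before the log call, so the raw args never enter a log object in the first place.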
Full result payloads. Same logic. A tool returning 500 KB of retrieved chunks does not need to log all 500 KB. Log a summary, log a hash, log the count. If you need the full payload for debugging, store it elsewhere — a debug-mode-only object store, accessible to humans with audit trails, never the default log stream.
Long-lived secrets. Passwords, API keys, OAuth tokens. The pino-redaction patterns are a good baseline; the safer move is to never construct log objects that contain secrets in the first place. If a secret never enters the log object, it cannot be accidentally logged.
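As a second line of defense, pino's built-in redact option can censor known secret paths even when a log object slips through review. A minimal sketch; the paths are examples to adapt, not a complete list:

```typescript
import pino from "pino";

// Belt-and-braces: censor known secret locations in any object that gets logged.
const logger = pino(
  {
    redact: {
      paths: ["req.headers.authorization", "req.headers.cookie", "apiKey", "password", "token"],
      censor: "[REDACTED]",
    },
  },
  pino.destination(2)
);
```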
The single most common log-side leak we have seen: a developer adds a debug log during a tense incident, the log captures the full request including auth headers, the leak persists into production logs forever. The discipline is to not add ad-hoc logs in production code, and to have a structured-log review as part of code review.
---
Tracing: linking the three hops
Logs answer "what happened." Traces answer "how does this call relate to that call." For MCP, the trace is the chain host → MCP server → backend, and the value of having one is that incident triage stops being a join across log indexes and becomes a single timeline.
OpenTelemetry is the standard. The Node SDK is mature. The integration with pino is one line — use pino-otel or set the trace context as a mixin so every log line includes the active trace ID.
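The mixin approach looks roughly like this: it reads the active span on every log call and stamps trace_id and span_id into the event.

```typescript
import pino from "pino";
import { trace } from "@opentelemetry/api";

// Every log line picks up the IDs of whatever span is active when it is written,
// which is what links the pino stream to the OTel trace view.
const logger = pino(
  {
    mixin() {
      const span = trace.getActiveSpan();
      if (!span) return {};
      const { traceId, spanId } = span.spanContext();
      return { trace_id: traceId, span_id: spanId };
    },
  },
  pino.destination(2)
);
```

In practice the mixin, the stderr destination, and the redact paths all live on the same single logger instance; they are split across sketches here for readability.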
A tool handler instrumented:
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("mcp-blog-publisher");

async function publishArticleHandler(args: { name: string }) {
  // One span per tool call; the span name doubles as the query key in the trace UI.
  return tracer.startActiveSpan("tool.publishArticle", async (span) => {
    try {
      span.setAttribute("mcp.tool.name", "publishArticle");
      span.setAttribute("mcp.draft.name", args.name);
      const result = await backendClient.publishDraft(args.name);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // Always end the span, even when the handler throws.
      span.end();
    }
  });
}
```

Not heavy. The backendClient.publishDraft call propagates the trace context through to the backend automatically (the OTel HTTP instrumentation handles this), so the backend's logs and spans show up in the same trace.
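For that propagation to work, the OTel SDK and HTTP instrumentation have to be loaded before the rest of the server (for example via node --require), so outgoing requests get the W3C traceparent header attached. A minimal setup sketch, assuming the standard @opentelemetry/sdk-node, exporter, and instrumentation packages; the service name and exporter URL are placeholders:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";

// Load this module before anything that makes HTTP calls, so the http module
// is patched and backend requests carry the trace context automatically.
const sdk = new NodeSDK({
  serviceName: "mcp-blog-publisher",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  instrumentations: [new HttpInstrumentation()],
});

sdk.start();
```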
Two attribute conventions worth standardizing across all your MCP servers:
- mcp.tool.name for the tool name. Always set.
- mcp.session.id for the host session ID, when available. The host sends this in HTTP headers; for stdio, it is the process lifetime.
When you query traces by mcp.tool.name == "publishArticle", you get every call to that tool across servers, with the full backend chain attached to each. That is the view that turns a debugging session from forty minutes into four.
---
Metrics: the dashboards that earn their keep
Five charts cover most of what you actually want to watch on an MCP server. Anything beyond these tends to be noise until you have a specific incident pattern to instrument for.
Per-tool call rate. Calls per minute, broken out by tool. Spikes and drops are the thing you notice first. A tool whose call rate dropped to zero at 2pm usually means the model stopped preferring that tool, which is a tool-description regression in disguise.
Per-tool error rate. Errors per minute, broken out by tool and by result_kind. Validation errors look very different from handler errors look very different from denial errors. Treating them as the same metric hides which one is rising.
Per-tool latency P50/P95/P99. Latency distribution per tool. P99 going up while P50 stays flat is a tail-latency problem; both going up together is a backend problem. Both matter; the distinction lets you triage quickly.
Per-tenant call rate (if multi-tenant). Calls per minute, broken out by tenant. This is your "is one customer abusing the API" early warning, and it is the input to per-tenant rate limiting from the security post.
Tool-call quality over time. This is the MCP-specific one and the most important. A daily score per tool: what fraction of calls returned result_kind == success and were not immediately followed by a retry of the same tool? Tracked per release. A drop in this score after a description tweak is the single clearest signal that the tweak made the tool worse for the model, even if the backend is fine.
That fifth chart is the one most teams do not have, and it is the one that catches the regressions that the other four miss. We will go deeper into it in the tool-call evaluation post; the short version here is "instrument it from day one even if you do not yet know what good looks like."
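The first four charts fall out of two OTel instruments, recorded in the same place the tool-call log event is written; a tenant attribute covers the per-tenant view. A sketch using the @opentelemetry/api metrics API; the instrument and attribute names are illustrative:

```typescript
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("mcp-blog-publisher");

// A counter and a histogram, dimensioned by tool, result kind, and tenant,
// cover call rate, error rate, latency distribution, and per-tenant rate.
const toolCalls = meter.createCounter("mcp.tool.calls", {
  description: "Tool calls, by tool, result kind, and tenant",
});
const toolDuration = meter.createHistogram("mcp.tool.duration", {
  description: "Tool call duration",
  unit: "ms",
});

export function recordToolCall(
  toolName: string,
  resultKind: string,
  durationMs: number,
  tenantId?: string
) {
  const attributes = {
    "mcp.tool.name": toolName,
    "mcp.result.kind": resultKind,
    ...(tenantId ? { "mcp.tenant.id": tenantId } : {}),
  };
  toolCalls.add(1, attributes);
  toolDuration.record(durationMs, attributes);
}
```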
---
Detecting prompt-injection patterns at runtime
A category of incident specific to MCP servers, and one that general-purpose observability does not catch.
The threat: an attacker has placed a payload in some part of the data the agent is processing — a document the agent retrieved, an email the agent is summarizing, a comment in a code file. The payload tries to convince the agent to call a tool with malicious args. The agent is the attacker now, and the MCP server's job is to notice.
A few patterns worth alerting on in the input args of every tool call:
Markers and instruction-like text in fields where they should not appear. A filename parameter containing "; ignore previous instructions and" is not a filename. The simplest detection: regex against a small set of known injection markers in fields that are supposed to be structured (filenames, IDs, enum values).
Encoding tricks. Base64-looking strings in a field that should be a name. Unusual unicode (zero-width spaces, RTL marks, fullwidth Latin) in a field that should be ASCII. Heavy use of these is rare in legitimate calls and common in injection attempts.
Sudden change in argument shape. A tenant whose searchKnowledgeBase calls have always been five-word queries suddenly making 800-word queries with embedded URLs is worth a flag. The agent's behavior shifts when something injects into its context.
Cross-tool argument correlation. A getDocument call returning content, immediately followed by a tool call whose arguments contain strings that appeared in the document, is the signature of a successful injection. The agent has read malicious instructions and acted on them.
None of these are perfect. False positives are real. The right move is to log them as injection_suspected events (separate stream, alerting threshold tuned per tool), not to block the call by default. The point of the telemetry is to catch the patterns that humans should review, not to add a brittle WAF in front of the model.
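A minimal sketch of the marker check feeding that separate event stream. The marker list is an illustrative starting point, not a complete detector, and the logger is the stderr pino instance from earlier:

```typescript
import { logger } from "./logger";

// Illustrative markers; tune per tool and expect false positives.
const INJECTION_MARKERS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /you are now/i,
  /system prompt/i,
  /[\u200b\u200c\u200d\u200f\u202e]/, // zero-width and RTL-override characters
];

export function checkForInjection(toolName: string, args: Record<string, unknown>): void {
  for (const [field, value] of Object.entries(args)) {
    if (typeof value !== "string") continue;
    const hits = INJECTION_MARKERS.filter((marker) => marker.test(value));
    if (hits.length > 0) {
      // Log to a dedicated stream with its own alert threshold; do not block the call by default.
      logger.warn(
        { event: "injection_suspected", tool_name: toolName, field, marker_count: hits.length },
        "suspicious content in tool arguments"
      );
    }
  }
}
```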
---
What incident response looks like with this in place
The composition of all of the above pays off in a specific moment: when a user complains "Claude is acting weird with our tool."
Without telemetry, the conversation is "what does the user mean by weird, can you reproduce it, can you check the logs, the logs do not say anything useful, can you try again."
With the stack above:
- Pull the trace for the user's most recent session by their subject claim. One query.
- See every tool call in that session, in order, with timing and result kind.
- See the backend calls each tool made, with their own timing and result.
- Compare to the same user a week ago, or the same tool across users.
- Identify the tool whose result_kind shifted from success to validation_error, or whose latency tripled, or whose call rate is suddenly zero.
- Cross-reference to the most recent deploy. The tool description was tweaked yesterday.
That sequence — six steps, each of which is a query against the data already in the system — is the difference between an incident that took three hours and one that took twenty minutes. It is also the difference between knowing what changed and guessing. Most incident-resolution time is the guessing.
---
Where this fits
This post is the operational counterpart to the security post and the versioning post. Security tells you what to bound; versioning tells you how the bounds move; observability tells you whether the system is staying inside them.
The tool-design post and the tool-call evaluation post are the closest neighbors when "is the tool actually being used well" is the question. Telemetry is the input; evaluation is what you do with it.
For multi-tenant servers, the multi-tenant checklist extends this. Per-tenant observability is non-negotiable the moment more than one customer is on the same server.
---
Observability for MCP looks like observability for any backend, with three additions — stdout discipline, prompt-injection patterns, and tool-call quality scoring. Each addition pays back in the moment of the first weird incident. Teams that wait to add them until after that moment regret it; teams that build them in from the start treat the first weird incident as a five-minute query instead of an evening of debugging.
Setting up the telemetry stack for an MCP server in production? [We do this professionally at Amazing Resources →](/services/mcp-development)