Flutter as an MCP Host with an On-Device LLM

Every public conversation about MCP assumes the host is one of three things: Claude Desktop, Cursor, or "an agent platform." All three run on a laptop or in a data center, talk to Anthropic or OpenAI over HTTPS, and treat the user's machine as a thin terminal for a model that lives elsewhere.

That assumption is doing damage. It makes MCP look like a desktop-developer feature. It hides the fact that the host role in the protocol — the thing that owns the LLM, manages the conversation, decides which servers to connect to, and routes tool calls — is a piece of software anyone can write. Nothing in the spec privileges Anthropic's apps. The interesting consequence is that an MCP host can run on a phone, with a model that runs on the phone too, talking to MCP servers that may or may not need a network at all.

This is the capstone post in this series. It is the one where everything underneath — JSON-RPC framing, tool design, transports, security — gets pulled together into a single shipped artifact: a Flutter application that hosts an on-device LLM and speaks MCP to one or more servers (flutter-pipeline-mcp being the obvious one we have already built).

It is also the post where we are most honest that the rough edges have not been sanded. On-device LLM tool calling is good but not great as of mid-2026. Some of the things you will want to do you cannot do yet. Some of them you can, but the SDK story is thinner than for the desktop hosts. The post is structured around what works, what is on a knife's edge, and what is genuinely open.

Why this is even worth doing

Three reasons keep coming up:

Offline capability. A user on a plane, on a train through a tunnel, in a building with no signal, in a country where data is rationed. An MCP host that owns its model and talks to local stdio servers does not care. The user gets the same experience whether the network is there or not. For a class of products — field service, journalism, defense, healthcare in low-connectivity settings — this is a feature the cloud LLMs cannot offer, full stop.

Privacy in a way that is real, not marketing. When the model and the data and the tools are all on the device, there is no API call to a third party. No transcripts in someone else's logs. No "we use your conversations to improve the service" small-print. For users who have actually been burned — therapists, lawyers, journalists with sources, anyone in a regulated industry — that is the difference between "I cannot use AI for this" and "I can."

Latency floor. The round trip from a phone to a cloud LLM and back is rarely dominated by the model's compute time; it is dominated by the network and the queueing. For any UX where the first streamed token marks the start of the perceived response, an on-device 4-billion-parameter model that returns its first token in 100 ms feels faster than a 500-billion-parameter cloud model that returns its first token in 800 ms, regardless of how much smarter the larger model is.

None of this means on-device wins on capability. The largest cloud models are vastly more capable than anything that fits on a phone, and there is no immediate end to that gap. What it means is that some products are better served by a smaller local model with the right tools than by a larger remote model with no tools at all. MCP is the protocol that lets a small local model punch above its weight, because the smartest thing a small model can do is reach for an external capability when it sees one offered.

The host's job, from scratch

A useful first thing to do, before writing any Flutter code, is to enumerate everything an MCP host actually does. The desktop hosts make this look like one feature. It is more like seven.

A host:

1. Manages the LLM. Loads the model weights, runs inference, streams tokens, handles the context window, manages the system prompt.
2. Maintains the conversation. Tracks user and assistant turns, persists them, handles multi-turn state, manages summarization or context-trimming when the window fills.
3. Spawns and connects to MCP servers. Either as child processes over stdio or HTTP clients with auth. Performs the lifecycle handshake. Surfaces failures.
4. Aggregates capabilities. Asks every connected server for its tools, resources, and prompts. Combines them into a single capability set that the LLM sees.
5. Routes tool calls. When the LLM emits a tool_use block, the host figures out which server it belongs to, calls it, gets the result, threads it back into the conversation as a tool_result.
6. Surfaces resources and prompts to the user. UI affordances (pickers, slash commands, attachment menus) that map onto the protocol's primitives.
7. Renders the conversation. Markdown, code blocks, streaming tokens, tool-call indicators, attached resources. The chat UI.

Job 1 (the LLM) and jobs 3-5 (MCP) are the only ones the desktop hosts have wrapped well. Jobs 2, 6, and 7 are product work that they ship and you have to build yourself. So a Flutter MCP host is roughly two-thirds product (chat UI, conversation persistence, attachment UX) and one-third protocol (MCP client, model orchestration). The protocol third is what this post is about.

Picking the on-device model

The cleanest mental model: pick the inference engine first, the model second, the format third. The format constrains both.

Inference engines. Three serious options on Flutter today:

`llama.cpp` via FFI. The mature path. Compile llama.cpp for iOS and Android, expose its C API through Dart FFI, run GGUF-quantized models. Painful to set up the first time, rock solid once it is up.
MLC LLM. A whole-stack solution targeting mobile, with prebuilt binaries and a Flutter plugin in development. Faster to start with, fewer dials.
Apple's MLX (iOS only). Native, Apple-blessed, fastest on Apple Silicon. Useless on Android.

For a cross-platform Flutter app, llama.cpp via FFI is the choice that ages best. The other two are easier today, but may not be as easy in six months. The trade-off is a real engineering investment in the FFI bridge; see the FFI memory map for what that involves at the boundary level.

Models. As of mid-2026, the realistic candidates for a phone-class device are:

Qwen2.5-3B-Instruct (Q4_K_M quant, ~2 GB on disk). Decent multilingual, decent tool calling.
Llama-3.2-3B-Instruct. The default English-first option.
Phi-3.5-mini-instruct (3.8B parameters). Slightly larger than the 3B options, surprisingly capable on reasoning.
Kimi K2 (mobile variants, where available). Strong tool-calling specialization but more setup work.

For tool calling specifically, the model needs to have been fine-tuned on tool use. Most of the above advertise this; not all of them do it well. We have had the best results with Qwen2.5 and the Llama 3.2 3B series when the prompt format matches what the model was trained on exactly — drift on the special tokens by one character and tool calling silently degrades to "the model emits something that looks like a tool call but isn't parseable."

Formats. GGUF is the format the open ecosystem has standardized on. Quantization down to Q4_K_M is the sweet spot: Q3 noticeably degrades, and Q5 gives back too little for the size cost. Expect roughly 2 to 2.5 GB on disk for a model in the 3B to 4B range at Q4_K_M.
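The disk figure is simple arithmetic: parameter count times average bits per weight, divided by eight. Here is a rough sketch of that estimate. The 4.85 bits-per-weight average for Q4_K_M and the 3.09 billion parameter count used in the usage note are approximations we are assuming, not numbers from any one model card, and the function name is ours.

```dart
/// Rough GGUF file size: parameters x average bits per weight / 8.
///
/// This ignores file metadata and the fact that some tensors are stored at
/// higher precision than the nominal quantization, so real files for small
/// models tend to come out somewhat larger than this estimate.
double estimateGgufGigabytes(double parameterCount, double bitsPerWeight) {
  final bytes = parameterCount * bitsPerWeight / 8;
  return bytes / 1e9; // decimal gigabytes
}
```

Calling `estimateGgufGigabytes(3.09e9, 4.85)` lands a little under 2 GB, which is in line with the Qwen2.5-3B figure above. The same function is handy when deciding whether a 7B-class file fits the storage budget of a target device.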

Wiring inference into Flutter

The shape of the integration, leaving a lot of detail on the floor:

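```dart
// LlamaEngine is a stand-in for your own wrapper around the llama.cpp FFI
// bindings; the names here show the shape, not a published package API.
final engine = LlamaEngine.load(
  modelPath: '/data/.../qwen2.5-3b-instruct-q4_k_m.gguf',
  contextSize: 8192, // tokens shared by the prompt, tool results, and reply
);

// Stop at the model's end-of-turn token so generation ends with the turn.
final stream = engine.completeStreaming(
  prompt: formatPromptForQwen(systemPrompt, conversation, availableTools),
  stop: ['<|im_end|>'],
);

await for (final token in stream) {
  // append to UI, parse for tool calls as we go
}
```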

The hard part is formatPromptForQwen. Each model family has a chat template — the exact sequence of <|im_start|>system, <|im_start|>user, <|im_start|>assistant, the tool-use special tokens, the JSON shape the model expects for tool definitions. Get it wrong and the model behaves almost right, in a way that takes hours to diagnose. Reference the model's tokenizer config from Hugging Face and keep the format in one well-tested function.
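To make that concrete, here is a minimal sketch of what such a function might look like for a ChatML-style model. Treat every literal in it as an assumption: the `Turn` class is ours, the tools section is abbreviated, and the `<tools>` and `<tool_response>` wrappers follow our reading of the Qwen2.5 template. Copy the exact text from the model's own tokenizer config before relying on it.

```dart
import 'dart:convert';

/// One conversation turn. `role` is 'user', 'assistant', or 'tool'.
/// (Hypothetical type for this sketch.)
class Turn {
  const Turn(this.role, this.content);
  final String role;
  final String content;
}

/// Builds a ChatML-style prompt that ends with an open assistant turn.
String formatPromptForQwen(
  String systemPrompt,
  List<Turn> conversation,
  List<Map<String, Object?>> availableTools,
) {
  final buffer = StringBuffer()
    ..write('<|im_start|>system\n')
    ..write(systemPrompt);

  if (availableTools.isNotEmpty) {
    // Abbreviated: the real template also carries instructions telling the
    // model how to format a call. One JSON tool definition per line.
    buffer.write('\n\n# Tools\n\n<tools>\n');
    for (final tool in availableTools) {
      buffer.writeln(jsonEncode(tool));
    }
    buffer.write('</tools>');
  }
  buffer.write('<|im_end|>\n');

  for (final turn in conversation) {
    if (turn.role == 'tool') {
      // Assumption: tool results go back as a user turn in a wrapper tag.
      buffer.write('<|im_start|>user\n<tool_response>\n'
          '${turn.content}\n</tool_response><|im_end|>\n');
    } else {
      buffer.write('<|im_start|>${turn.role}\n${turn.content}<|im_end|>\n');
    }
  }

  // Leave the assistant turn open so the model writes the next message.
  buffer.write('<|im_start|>assistant\n');
  return buffer.toString();
}
```

Because the output is a plain string, the function is easy to cover with golden tests: render a fixed conversation, compare against a checked-in expected prompt, and fail the build when a template change slips in.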

Inference itself runs in a Dart isolate. The main isolate is for UI; the inference isolate owns the model handle, takes prompts over a port, streams tokens back. This pattern matches how Flutter handles any heavy synchronous work — see the event loop article for the underlying machinery — and it is non-negotiable for inference, because a 3B model running on a phone produces tokens at 10-30 per second and blocking the UI thread on that loop will jank everything.
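A minimal sketch of that isolate split, using only `dart:isolate`. Everything named here is hypothetical: `InferenceWorker` is our own class, and `fakeGenerate` stands in for the real engine call so the example runs without native code. The sketch handles one request at a time and does not deal with cancellation.

```dart
import 'dart:async';
import 'dart:isolate';

/// Placeholder for the real engine: echoes the prompt one word at a time.
Iterable<String> fakeGenerate(String prompt) sync* {
  for (final word in prompt.split(' ')) {
    yield word;
  }
}

/// Runs inside the inference isolate. In a real app the model handle would
/// be loaded once here and kept for the lifetime of the isolate.
void _workerMain(SendPort toMain) {
  final inbox = ReceivePort();
  toMain.send(inbox.sendPort); // first message: where to send prompts
  inbox.listen((message) {
    for (final token in fakeGenerate(message as String)) {
      toMain.send(token);
    }
    toMain.send(null); // end-of-response sentinel
  });
}

class InferenceWorker {
  InferenceWorker._(this._isolate, this._fromWorker, this._toWorker, this._events);

  final Isolate _isolate;
  final ReceivePort _fromWorker;
  final SendPort _toWorker;
  final StreamIterator<dynamic> _events;

  static Future<InferenceWorker> spawn() async {
    final fromWorker = ReceivePort();
    final events = StreamIterator<dynamic>(fromWorker);
    final isolate = await Isolate.spawn(_workerMain, fromWorker.sendPort);
    await events.moveNext(); // wait for the worker's SendPort
    final toWorker = events.current as SendPort;
    return InferenceWorker._(isolate, fromWorker, toWorker, events);
  }

  /// Streams tokens for one prompt until the worker sends the sentinel.
  Stream<String> complete(String prompt) async* {
    _toWorker.send(prompt);
    while (await _events.moveNext()) {
      final message = _events.current;
      if (message == null) return;
      yield message as String;
    }
  }

  void dispose() {
    _fromWorker.close();
    _isolate.kill();
  }
}
```

On the UI side this reduces to `await for (final token in worker.complete(prompt))`, with the main isolate free to paint frames between tokens.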

The MCP client side

Flutter does not have a first-party MCP SDK as of mid-2026. There is community work in flight, and the protocol is small enough that writing the client part directly is reasonable. The shape:

Stdio client. This runs a server like flutter-pipeline-mcp as a child process. That is viable on desktop Flutter, effectively unavailable on iOS (sandboxed apps cannot spawn child processes, so Process.start does not work), and workable on Android (with caveats around bundling the Node runtime). For mobile, expect either to bundle an embedded server as a library rather than a process, or to keep MCP servers remote.

Streamable HTTP client. The realistic path for mobile. The host opens an HTTPS connection to a Streamable HTTP MCP server, runs the OAuth 2.1 + PKCE flow (see the PKCE post), and starts sending JSON-RPC requests. This is just an HTTP client with dio or http, plus session-id management, plus the JSON-RPC framing.
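As a rough illustration of how little is involved, here is a sketch of one request/response exchange using only `dart:io`. `McpHttpClient` and `encodeJsonRpcRequest` are names we made up for this example, not a published SDK. The sketch assumes the server replies with a plain JSON body; a real client also has to handle `text/event-stream` replies, bodiless responses to notifications, and token refresh.

```dart
import 'dart:convert';
import 'dart:io';

/// Encodes one JSON-RPC 2.0 request as the POST body.
String encodeJsonRpcRequest(int id, String method,
    [Map<String, Object?> params = const {}]) {
  return jsonEncode({
    'jsonrpc': '2.0',
    'id': id,
    'method': method,
    'params': params,
  });
}

/// Hypothetical minimal Streamable HTTP client for this sketch.
class McpHttpClient {
  McpHttpClient(this.endpoint, {this.accessToken});

  final Uri endpoint;
  final String? accessToken; // from the OAuth flow, out of scope here
  final HttpClient _http = HttpClient();
  String? _sessionId;

  Future<Map<String, Object?>> send(int id, String method,
      [Map<String, Object?> params = const {}]) async {
    final request = await _http.postUrl(endpoint);
    request.headers
      ..set(HttpHeaders.contentTypeHeader, 'application/json')
      // The server may answer with plain JSON or with an SSE stream.
      ..set(HttpHeaders.acceptHeader, 'application/json, text/event-stream');
    final token = accessToken;
    if (token != null) {
      request.headers.set(HttpHeaders.authorizationHeader, 'Bearer $token');
    }
    final session = _sessionId;
    if (session != null) request.headers.set('Mcp-Session-Id', session);
    request.add(utf8.encode(encodeJsonRpcRequest(id, method, params)));

    final response = await request.close();
    // Remember the session id the server assigns, if it assigns one.
    _sessionId = response.headers.value('mcp-session-id') ?? _sessionId;
    final body = await response.transform(utf8.decoder).join();
    if (response.statusCode != HttpStatus.ok) {
      throw HttpException('MCP server returned ${response.statusCode}: $body',
          uri: endpoint);
    }
    return jsonDecode(body) as Map<String, Object?>;
  }

  void close() => _http.close();
}
```

Keeping the framing in a pure function like `encodeJsonRpcRequest` means the wire format can be unit-tested without a server, which is where most of the early mistakes show up.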

The lifecycle handshake on initial connection:

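```dart
// Step 1: initialize. The client proposes a protocol revision; the result
// carries the revision and capabilities the server agreed to.
final response = await client.send(JsonRpcRequest(
  id: 1,
  method: 'initialize',
  params: {
    // Streamable HTTP is defined from the 2025-03-26 revision onward.
    'protocolVersion': '2025-06-18',
    'capabilities': {},
    'clientInfo': {'name': 'flutter-host', 'version': '0.1.0'},
  },
));

// Step 2: confirm. A notification has no id and expects no response.
await client.notify('notifications/initialized');

// Step 3: discover what the server offers.
final tools = await client.send(JsonRpcRequest(
  id: 2,
  method: 'tools/list',
  params: {},
));
```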

That gives the host the full set of tools the server exposes. Aggregate across servers, deduplicate, hand the union to the model in its prompt-format-specific tool-definition section.
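One way to do the aggregation is sketched below. The collision policy is our assumption, not something the protocol mandates: when two servers expose a tool with the same name, each copy is exposed to the model under a `server__tool` name so the host can still route the call. All type and function names are hypothetical.

```dart
/// A tool as returned by a server's tools/list (fields trimmed for the sketch).
class ToolDefinition {
  const ToolDefinition(this.name, this.description);
  final String name;
  final String description;
}

/// A tool plus the server that owns it, so calls can be routed back.
class RoutedTool {
  const RoutedTool(this.serverId, this.tool);
  final String serverId;
  final ToolDefinition tool;
}

/// Merges per-server tool lists into one registry keyed by the name the
/// model will see. Names that collide across servers get a server prefix.
Map<String, RoutedTool> aggregateTools(
  Map<String, List<ToolDefinition>> toolsByServer,
) {
  // First pass: count how many servers expose each tool name.
  final counts = <String, int>{};
  for (final tools in toolsByServer.values) {
    for (final tool in tools) {
      counts.update(tool.name, (n) => n + 1, ifAbsent: () => 1);
    }
  }

  // Second pass: build the registry, prefixing only where needed.
  final registry = <String, RoutedTool>{};
  toolsByServer.forEach((serverId, tools) {
    for (final tool in tools) {
      final exposedName =
          counts[tool.name]! > 1 ? '${serverId}__${tool.name}' : tool.name;
      registry[exposedName] = RoutedTool(serverId, tool);
    }
  });
  return registry;
}
```

When the model later emits a call, the host looks up the exposed name in the registry, sends the original `tool.name` to `serverId`, and never has to guess which server was meant.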

The route through a tool call

The single most important diagram in any host implementation. When the model decides to call a tool:

```text
User → host UI → assembled prompt → inference engine
                                          ↓
                                   tokens streaming
                                          ↓
                                   parser detects
                                   tool_use block
                                          ↓
                              host pauses streaming
                                          ↓
                          host routes call to MCP server
                                          ↓
                            server runs, returns result
                                          ↓
                       host injects tool_result into context
                                          ↓
                              inference resumes from there
                                          ↓
                                final assistant text
                                          ↓
                                       UI
```

Three places this loop tends to fail in practice:

The parser. Detecting a complete tool_use block in a token stream is fiddly. The model emits special tokens, then JSON, then closing tokens. You do not know it is a tool call until you have seen the opening marker, and you do not know it is complete until the closing marker has streamed. Build the parser as a state machine, not a regex. Test it against malformed cases; the model will eventually produce all of them.
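A sketch of such a state machine follows. It assumes the model wraps calls in `<tool_call>` and `</tool_call>` markers around a JSON object with `name` and `arguments` keys; that matches our understanding of Qwen-style templates, but your model's markers may differ. The class and event names are ours.

```dart
import 'dart:convert';

abstract class ParseEvent {}

class TextEvent extends ParseEvent {
  TextEvent(this.text);
  final String text;
}

class ToolCallEvent extends ParseEvent {
  ToolCallEvent(this.name, this.arguments);
  final String name;
  final Map<String, dynamic> arguments;
}

class MalformedEvent extends ParseEvent {
  MalformedEvent(this.raw);
  final String raw;
}

enum _State { text, toolCall }

/// Incremental parser: feed it chunks as they stream, get events back.
class ToolCallParser {
  static const _open = '<tool_call>';
  static const _close = '</tool_call>';
  var _state = _State.text;
  var _buffer = '';

  List<ParseEvent> feed(String chunk) {
    _buffer += chunk;
    final events = <ParseEvent>[];
    while (true) {
      if (_state == _State.text) {
        final start = _buffer.indexOf(_open);
        if (start >= 0) {
          if (start > 0) events.add(TextEvent(_buffer.substring(0, start)));
          _buffer = _buffer.substring(start + _open.length);
          _state = _State.toolCall;
          continue;
        }
        // Hold back a tail that might be the start of an opening marker.
        final emit = _buffer.length - _partialMarkerLength(_buffer);
        if (emit > 0) events.add(TextEvent(_buffer.substring(0, emit)));
        _buffer = _buffer.substring(emit);
        return events;
      }
      final end = _buffer.indexOf(_close);
      if (end < 0) return events; // call not complete yet
      events.add(_parseCall(_buffer.substring(0, end)));
      _buffer = _buffer.substring(end + _close.length);
      _state = _State.text;
    }
  }

  /// Call at end of stream. An unterminated call is reported as malformed.
  List<ParseEvent> close() {
    final rest = _buffer;
    final wasText = _state == _State.text;
    _buffer = '';
    _state = _State.text;
    if (rest.isEmpty) return [];
    return [wasText ? TextEvent(rest) : MalformedEvent(rest)];
  }

  static int _partialMarkerLength(String text) {
    final limit = text.length < _open.length - 1 ? text.length : _open.length - 1;
    for (var n = limit; n > 0; n--) {
      if (_open.startsWith(text.substring(text.length - n))) return n;
    }
    return 0;
  }

  static ParseEvent _parseCall(String raw) {
    try {
      final decoded = jsonDecode(raw.trim());
      if (decoded is Map<String, dynamic> &&
          decoded['name'] is String &&
          decoded['arguments'] is Map<String, dynamic>) {
        return ToolCallEvent(decoded['name'] as String,
            decoded['arguments'] as Map<String, dynamic>);
      }
    } on FormatException {
      // Not valid JSON: fall through and report it as malformed.
    }
    return MalformedEvent(raw);
  }
}
```

The held-back tail is the part a regex tends to get wrong: a marker split across two chunks must not leak into the UI as visible text, and must still be recognized once the rest arrives.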

The "resume from there" step. After the tool result is injected, inference resumes — but resuming requires the right prompt format. Most engines support a continue mode that takes the prior context plus the new turn and produces the next assistant message. Using it correctly is model-specific. Read the chat template carefully.

Concurrent tool calls. Some models emit multiple tool calls in one turn. The host has to run them in parallel where the protocol allows, sequentially where it does not (some servers expect strict ordering). The desktop hosts solve this by parallel-dispatching all tool calls in a single assistant turn and waiting for all results before resuming. Mirror that.
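A sketch of that dispatch policy, under a few assumptions of ours: the host knows which servers are order-sensitive (`sequentialServers`), each call carries an id the model can match against its result, and a failed or timed-out call becomes an error result rather than aborting the whole turn. The types and the `dispatch` callback are hypothetical.

```dart
import 'dart:async';

class ToolCall {
  const ToolCall(this.id, this.serverId, this.name, this.arguments);
  final String id;
  final String serverId;
  final String name;
  final Map<String, Object?> arguments;
}

class ToolResult {
  const ToolResult(this.callId, this.content, {required this.isError});
  final String callId;
  final String content;
  final bool isError;
}

typedef Dispatch = Future<String> Function(ToolCall call);

/// Runs every call from one assistant turn and returns results in the
/// same order as [calls]. Calls to servers listed in [sequentialServers]
/// run one after another; everything else runs in parallel.
Future<List<ToolResult>> runToolCalls(
  List<ToolCall> calls,
  Dispatch dispatch, {
  Set<String> sequentialServers = const {},
  Duration timeout = const Duration(seconds: 30),
}) {
  // Never throws: failures are turned into error results for the model.
  Future<ToolResult> runOne(ToolCall call) async {
    try {
      final content = await dispatch(call).timeout(timeout);
      return ToolResult(call.id, content, isError: false);
    } catch (error) {
      return ToolResult(call.id, 'Tool failed: $error', isError: true);
    }
  }

  final tails = <String, Future<void>>{}; // last queued call per server
  final pending = <Future<ToolResult>>[];
  for (final call in calls) {
    if (sequentialServers.contains(call.serverId)) {
      final previous = tails[call.serverId] ?? Future<void>.value();
      final next = previous.then((_) => runOne(call));
      tails[call.serverId] = next;
      pending.add(next);
    } else {
      pending.add(runOne(call));
    }
  }
  return Future.wait(pending); // Future.wait keeps input order
}
```

Returning an error result instead of throwing matters for small models in particular: the model gets a chance to read the failure and try something else, and one slow server cannot stall the turn past the timeout.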

A worked use case: debugging a Flutter test on a phone

Concrete is better than abstract. Here is the use case the Flutter MCP host opens up most clearly:

A developer is on a train, with no laptop, but their flutter-pipeline-mcp server is reachable over a Tailscale-style overlay network. They open the Flutter MCP host on their phone. They ask: "Why is `auth_test.dart` failing on main?"

The on-device model:

1. Picks flutter_test from the available tools, calls it with { projectPath: "...", testPattern: "auth_test" }.
2. Receives back the parsed --machine JSON output: failures and stack traces.
3. Picks pubspec_read to verify the dependency versions.
4. Picks a read_resource for the failing source file.
5. Synthesizes a hypothesis: "The mock for AuthService was changed in commit X to return a Future<Either> instead of a Future<User?>, and the test still expects the old shape."
6. Suggests a fix.

None of this involves the cloud. None of it leaves the developer's network. The model is small and the answer is correct because the information the model is reasoning over is exact — the test output, the source file, the pubspec — rather than guessed at from training data.

This is the shape of product the on-device + MCP combination unlocks. Not a chatbot. A grounded small assistant that owns its tools and reaches for them honestly because it does not have a better option.

What is genuinely hard right now

For a working developer evaluating this seriously, the unsolved parts:

Model size vs. quality. A 3B model is good at routine tool calling, mediocre at multi-step planning, and bad at any kind of subtle reasoning. The 7B class is meaningfully better, sometimes runs on a high-end phone, and is on the edge of viable. The 13B class is broadly out of reach for phones and on a knife's edge for a tablet. Pick the smallest model that gets your task done; do not optimize for "the smartest possible local model."

Prompt template drift. Engines, models, and tool formats keep moving. A configuration that worked in May does not work in October because the model maintainer pushed a quantization with a slightly different tokenizer config. Pin everything. Test on every update. Treat your formatPromptFor... functions as load-bearing infrastructure.

FFI memory model. Inference engines do nontrivial pointer work — large mmapped weights, KV-cache buffers, sometimes pinned memory for the GPU. The Dart side has to manage that without leaking and without holding onto pointers across reload cycles. The FFI memory map is the right reference; the rules in there apply to llama.cpp the same way they apply to any other native lib.

iOS background restrictions. A long-running inference job on iOS will get killed by the OS if the app backgrounds. There are workarounds (audio session tricks, BackgroundTasks framework) but none of them is honest. Plan the UX around foreground-only inference until iOS gives a real "AI workload" lifecycle, which it has not yet.

Tool calling on small models is brittle. Even a model that benchmarks well on tool calling will produce malformed calls regularly under real-world prompts. Validate every call against the tool's schema before dispatching; reject and ask the model to retry on validation failures; budget retries so a confused model cannot loop forever. This is the layer that makes the difference between "demo" and "shippable."
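A sketch of that validation layer is below. It checks only a small subset of JSON Schema (required keys and primitive `type` values), and it rejects arguments the schema does not declare, which is stricter than JSON Schema's default and is our own choice for catching hallucinated parameters. The `propose` callback, which stands in for one round of asking the model, is hypothetical.

```dart
/// Returns a list of problems; an empty list means the call is dispatchable.
List<String> validateArguments(
  Map<String, Object?> schema,
  Map<String, Object?> arguments,
) {
  final errors = <String>[];
  final properties =
      (schema['properties'] as Map<String, Object?>?) ?? const {};
  final required = (schema['required'] as List<Object?>?) ?? const [];

  for (final name in required) {
    if (!arguments.containsKey(name)) {
      errors.add('missing required argument "$name"');
    }
  }

  arguments.forEach((name, value) {
    final property = properties[name];
    if (property is! Map<String, Object?>) {
      errors.add('unknown argument "$name"');
      return;
    }
    final type = property['type'];
    final matches = switch (type) {
      'string' => value is String,
      'integer' => value is int,
      'number' => value is num,
      'boolean' => value is bool,
      'array' => value is List,
      'object' => value is Map,
      _ => true, // types outside this subset are not checked
    };
    if (!matches) errors.add('argument "$name" should be of type $type');
  });
  return errors;
}

/// Asks the model for arguments; `feedback` is null on the first attempt.
typedef ProposeCall = Future<Map<String, Object?>> Function(String? feedback);

/// Retries until the arguments validate or the budget runs out.
/// Returns null when the budget is exhausted so the caller can give up.
Future<Map<String, Object?>?> proposeValidCall(
  Map<String, Object?> schema,
  ProposeCall propose, {
  int maxAttempts = 3,
}) async {
  String? feedback;
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    final arguments = await propose(feedback);
    final errors = validateArguments(schema, arguments);
    if (errors.isEmpty) return arguments;
    feedback = 'Invalid tool call: ${errors.join('; ')}. Try again.';
  }
  return null;
}
```

Feeding the specific validation errors back to the model, rather than a generic "try again", is what tends to make the second attempt succeed; the hard cap on attempts is what keeps a confused model from burning battery in a loop.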

Architecture sketch for the actual app

Pulling the pieces together, the layering for a production-ready Flutter MCP host looks like this:

`lib/llm/` — inference engine bindings, prompt formatting per model family, isolate orchestration.
`lib/mcp/` — JSON-RPC client, transport implementations (stdio, streamable HTTP), capability aggregation, tool-call routing.
`lib/conversation/` — turn management, persistence, summarization, context-window accounting.
`lib/tool_loop/` — the inference-loop-with-tool-calls state machine. The most important file in the project.
`lib/ui/` — chat, attachments, slash commands, settings.

Five folders. Each one is a domain with its own seam. Each one tested independently. The same Clean Architecture instincts that apply to a Node.js modular monolith (bounded contexts, port/adapter naming, dependency rule pointing inward) apply here too. An MCP host is a small app pretending to be a big one. It looks small, then you start shipping it, and the seams matter.

Where this fits in the bigger picture

The reason this post is the capstone is that everything else in the series feeds into it.

The pillar gave you the server shape.
The no-SDK post gave you the JSON-RPC vocabulary you now need on the client side.
The tool-design post is sharper here, because a 3B local model has less room to recover from a bad description than a frontier cloud model does.
The transport post is what you reach for when deciding whether your phone host talks to local or remote servers.
The PKCE post is required reading the moment you connect to a remote server, because the phone is a public client by definition.
The security post is doubly relevant on mobile, where the user's data is more sensitive and the recovery story is harder.

Building a Flutter MCP host is not a weekend project. It is a real product with real interlocking parts. But none of the parts are mysterious any more, and the combination — small local model, MCP servers, Flutter UX — is the most interesting thing in the AI mobile space in 2026.

The host role in MCP is the part of the protocol that has been gatekept by default. There is no real reason for that. Anyone can build a host, on any platform, with any model. The protocol does not care. The first wave of mobile MCP hosts is going to look weird and underbaked; the second wave is going to ship the experiences cloud LLMs cannot.
