local-mcp-toolbelt

Let any MCP-compatible AI assistant — Claude Desktop, Cursor, Cline, Zed, and others — delegate lightweight tasks to a local oMLX inference server. Save tokens. Stay private. Run offline-capable grunt work on your own Apple Silicon machine.

Six tools: summarize, summarize-long, summarize-long-chunked, classify, extract, transform. Single backend: MlxHttpBackend → oMLX serving Qwen3-4B/8B/14B (MLX 4-bit) on Apple Silicon. KV-cache persistent across requests; OpenAI Structured Outputs strict mode for grammar enforcement.

Layout

This is a monorepo with two packages:

packages/core — local-mcp-toolbelt, the MCP server and companion CLI. Works with any MCP client. Installable via npm and usable standalone.
packages/claude-desktop (coming soon) — a .mcpb one-click installer that wraps the core for Claude Desktop users who do not want to edit JSON.

The split exists because the Model Context Protocol is client-neutral: other clients (Cursor, Cline, Zed, …) also consume MCP servers, so the bridging logic lives in a framework-agnostic package.

Tools

All six tools are available over any MCP-compatible client (Claude Desktop, Cursor, Cline, Zed, …). They all share the same security pipeline and emit _meta telemetry on every response.

`summarize`

summarize(text?: string, source_uri?: string, style?: string) → prose summary

Delegates to Tier B (Qwen3-4B-Instruct-2507-4bit, non-thinking variant). Best for documents up to ~4 K tokens. Exactly one of text or source_uri must be provided; the two are mutually exclusive.
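The text / source_uri rule can be sketched as a small argument check. This is a minimal illustration, not the package's actual validation code: the names SummarizeInput, SourceChoice, and resolveSource are hypothetical, and treating an empty string as "not provided" is an assumption.

```typescript
// Hypothetical input shape mirroring the summarize signature above.
interface SummarizeInput {
  text?: string;
  source_uri?: string;
  style?: string;
}

type SourceChoice =
  | { kind: "text"; value: string }
  | { kind: "source_uri"; value: string };

// Returns which source was supplied; throws if neither or both were given.
// Assumption: an empty string counts as "not provided".
function resolveSource(input: SummarizeInput): SourceChoice {
  const hasText = typeof input.text === "string" && input.text.length > 0;
  const hasUri =
    typeof input.source_uri === "string" && input.source_uri.length > 0;

  // hasText === hasUri is true only for "neither" and "both".
  if (hasText === hasUri) {
    throw new Error("Provide exactly one of `text` or `source_uri`.");
  }
  return hasText
    ? { kind: "text", value: input.text as string }
    : { kind: "source_uri", value: input.source_uri as string };
}
```

Checking this before any model call means a malformed request fails fast and never reaches the local inference server.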

`summarize-long`

summarize-long(text?: string, source_uri?: string, style?: string) → structured summary

Routes to Tier C (Qwen3-8B-4bit MLX, numCtx=32768) for long-context documents and returns a structured summary: a 1–2 sentence lead followed by 3–6 bullets. Either text or source_uri must be provided.

Qwen3-8B is a thinking model; the bridge auto-injects \n/no_think into user content so the model emits the summary directly without first burning the per-tool cap on a <think>...</think> reasoning trace.
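A rough sketch of what that injection could look like on a chat-style message list. The ChatMessage type and injectNoThink function are hypothetical names, and appending the marker to the end of every user message is an assumption; the bridge may place it differently.

```typescript
// Hypothetical chat message shape (OpenAI-style roles).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// The marker described above: a newline followed by /no_think.
const NO_THINK_SUFFIX = "\n/no_think";

// Returns a copy of the messages with the marker appended to user content.
// Assumption: the marker goes at the end of each user message, and a message
// that already ends with it is left alone so repeated calls are harmless.
function injectNoThink(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) =>
    m.role === "user" && !m.content.endsWith(NO_THINK_SUFFIX)
      ? { ...m, content: m.content + NO_THINK_SUFFIX }
      : m,
  );
}
```

Because the function returns new objects instead of mutating its input, the caller's original messages stay untouched for logging or retries.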

numCtx=32768 admits ~25 K words of source in a single call. Server-side residency is ~5 GB of weights plus ~3 GB of KV cache at full context, ~8 GB in total, which fits on a 16 GB Mac as long as the Tier D 14B model is not also loaded. Documents longer than ~25 K words exceed the model's context; use summarize-long-chunked for those.

`summarize-long-chunked`

summarize-long-chunked(
  text?:        string,
  source_uri?:  string,
  style?:       string,
  max_chunks?:  number = 100,
) → coherent final summary

Map-reduce chunked summarization for documents that exceed Tier C's single-call ceiling (~25 K words). Splits the source into overlapping chunks (default 2 000 tokens, configurable), summarizes each in parallel via p-limit, then recursively combines chunk summaries until one bucket fits a single REDUCE call.

Same Tier C model as summarize-long.
Per-call soft timeout 50 s; chained AbortSignal per chunk so client disconnects propagate cleanly without leaking work.
Recursion depth ≤ 3 covers inputs up to ~200 K tokens; beyond that, the tool returns partial: true with the first bucket reduced.
Fast-path: if the source fits Tier C in one call, the tool runs as a single equivalent call (no chunking tax). Strict superset of summarize-long.
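The map-reduce flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: all names are hypothetical, words stand in for tokens, the 200-word overlap is invented, a small inline limiter replaces p-limit to keep the sketch dependency-free, and the per-call timeout and AbortSignal chaining are left out.

```typescript
// Stand-in for one local-model call (the real tool would call Tier C).
type Summarizer = (text: string) => Promise<string>;

// Split into overlapping windows of `size` words, stepping by size - overlap.
function splitOverlapping(words: string[], size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Run fn over items with at most `limit` calls in flight (p-limit stand-in).
async function mapLimited<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

interface ChunkedResult {
  summary: string;
  partial: boolean;
}

const toWords = (s: string): string[] => s.split(/\s+/).filter(Boolean);

async function summarizeChunked(
  text: string,
  summarize: Summarizer,
  opts = { chunkWords: 2000, overlapWords: 200, fitWords: 2000, concurrency: 2, maxDepth: 3 },
): Promise<ChunkedResult> {
  let words = toWords(text);
  // Fast path: the source already fits one call, so no chunking happens.
  if (words.length <= opts.fitWords) {
    return { summary: await summarize(words.join(" ")), partial: false };
  }
  for (let depth = 0; depth < opts.maxDepth; depth++) {
    // MAP: summarize every chunk with bounded concurrency.
    const chunks = splitOverlapping(words, opts.chunkWords, opts.overlapWords);
    const summaries = await mapLimited(chunks, opts.concurrency, summarize);
    words = toWords(summaries.join("\n"));
    // REDUCE: once the combined summaries fit, finish with one call.
    if (words.length <= opts.fitWords) {
      return { summary: await summarize(words.join(" ")), partial: false };
    }
  }
  // Depth exhausted: reduce only the first bucket and flag the result.
  const firstBucket = words.slice(0, opts.fitWords).join(" ");
  return { summary: await summarize(firstBucket), partial: true };
}
```

Each pass shrinks the text by replacing chunks with their summaries, so the loop normally ends well before the depth limit; the partial flag only appears when even the summaries of summaries do not fit.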

Client-timeout reality. Claude Code's MCP request timeout is a hard ~60 s that cannot be extended via settings.json or any documented env var (MCP_TIMEOUT controls only server startup — see anthropics/claude-code #5221, #22542). The chunking work itself takes minutes on a 16 GB Mac, so from Claude Code this tool is useful in fast-path mode only — for documents up to ~12-15 KB Chinese / ~25 KB English. Larger documents force the full chunking path, exceeding 60 s of total wall time and timing out the MCP request even though each individual local-model call stays under the per-call 50 s budget.
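A back-of-the-envelope estimate shows why the total can blow past a client limit even when every call is fast. The function below is a hypothetical sketch: the overlap, concurrency, and seconds-per-call inputs are illustrative assumptions, not measurements from this project, and it ignores deeper recursion levels.

```typescript
interface Estimate {
  chunks: number;       // number of MAP calls
  waves: number;        // rounds of parallel MAP calls
  totalSeconds: number; // rough wall time including one REDUCE call
}

// Rough wall-time estimate for one MAP pass plus one REDUCE call.
function estimateChunkedWallTime(
  sourceTokens: number,
  chunkTokens: number,
  overlapTokens: number,
  concurrency: number,
  secondsPerCall: number,
): Estimate {
  const stride = chunkTokens - overlapTokens;
  if (stride <= 0) throw new Error("overlap must be smaller than chunk size");

  // Fast path: the source fits in a single call.
  if (sourceTokens <= chunkTokens) {
    return { chunks: 1, waves: 1, totalSeconds: secondsPerCall };
  }
  // Overlapping windows: each chunk after the first adds `stride` new tokens.
  const chunks = Math.ceil((sourceTokens - overlapTokens) / stride);
  const waves = Math.ceil(chunks / concurrency);
  return { chunks, waves, totalSeconds: (waves + 1) * secondsPerCall };
}

// Illustrative inputs only: 40 000 tokens, 2 000-token chunks, 200 overlap,
// 2 calls in flight, 15 s per call.
const example = estimateChunkedWallTime(40_000, 2_000, 200, 2, 15);
console.log(example); // { chunks: 23, waves: 12, totalSeconds: 195 }
```

With these made-up inputs every call takes 15 s, well inside the 50 s per-call budget, yet the total is about 195 s: far past a 60 s client limit and inside a 240 s one.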

The chunked path is reachable from clients with longer timeouts (Claude Desktop: 240 s default; custom integrations: configurable). The smoke suite (tests/smoke-bridge.mjs) exercises it end-to-end with a 600 s harness timeout.

`classify`

classify(
  text:           string,
  categories:     string[],
  allow_multiple: boolean = false,
  explain:        boolean = false,
) → { labels: string[], reason?: string }
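On the caller side, a result with this shape can be sanity-checked against the request. The sketch below is hypothetical: checkClassifyResult is not part of the package, and reading allow_multiple = false as "at most one label" is an assumption drawn from the parameter name.

```typescript
// Result shape taken from the classify signature above.
interface ClassifyResult {
  labels: string[];
  reason?: string;
}

// Returns a list of problems; an empty list means the result looks consistent.
function checkClassifyResult(
  result: ClassifyResult,
  categories: string[],
  allowMultiple = false,
): string[] {
  const problems: string[] = [];
  const allowed = new Set(categories);

  // Every returned label should be one of the requested categories.
  for (const label of result.labels) {
    if (!allowed.has(label)) problems.push(`unknown label: ${label}`);
  }
  if (result.labels.length === 0) {
    problems.push("no label returned");
  }
  // Assumption: allow_multiple = false means at most one label.
  if (!allowMultiple && result.labels.length > 1) {
    problems.push("multiple labels returned but allow_multiple is false");
  }
  return problems;
}
```

Returning a list of problems instead of throwing lets a caller log every mismatch at once, which is handy when comparing model tiers.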
