local-mcp-toolbelt

Let any MCP-compatible AI assistant — Claude Desktop, Cursor, Cline, Zed, and
others — delegate lightweight tasks to a local oMLX inference server.
Save tokens. Stay private. Run offline-capable grunt work on your own
Apple Silicon machine.
Six tools: summarize, summarize-long, summarize-long-chunked, classify,
extract, transform.
Single backend: MlxHttpBackend → oMLX serving Qwen3-4B/8B/14B (MLX 4-bit) on
Apple Silicon. The KV cache persists across requests, and OpenAI Structured
Outputs strict mode provides grammar enforcement.
Layout
This is a monorepo with two packages:
- packages/core: local-mcp-toolbelt, the MCP server and companion CLI. Works
  with any MCP client. Installable via npm and usable standalone.
- packages/claude-desktop (coming soon): a .mcpb one-click installer that
  wraps the core for Claude Desktop users who do not want to edit JSON.
The split exists because the Model Context Protocol is client-neutral: other
clients (Cursor, Cline, Zed, …) also consume MCP servers, so the bridging
logic lives in a framework-agnostic package.
Tools
All six tools are available over any MCP-compatible client (Claude Desktop,
Cursor, Cline, Zed, …). They all share the same security pipeline and emit
_meta telemetry on every response.
summarize
summarize(text?: string, source_uri?: string, style?: string) → prose summary
Delegates to Tier B (Qwen3-4B-Instruct-2507-4bit, the non-thinking
variant). Best for documents up to ~4 K tokens. Exactly one of text or
source_uri must be provided; the two are mutually exclusive.
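The mutual-exclusion rule for text and source_uri can be sketched as a small guard. This is a minimal illustration, not the package's actual validation: the `SummarizeInput` interface and the `pickSource` helper are hypothetical names, and treating an empty string as "not provided" is an assumption.

```typescript
// Hypothetical input shape mirroring the summarize signature above.
interface SummarizeInput {
  text?: string;
  source_uri?: string;
  style?: string;
}

// Returns which field carries the source, or throws when neither or both
// of the mutually exclusive fields are present.
// Assumption: an empty string counts as "not provided".
function pickSource(input: SummarizeInput): "text" | "source_uri" {
  const hasText = typeof input.text === "string" && input.text.length > 0;
  const hasUri =
    typeof input.source_uri === "string" && input.source_uri.length > 0;
  if (hasText === hasUri) {
    throw new Error("Provide exactly one of `text` or `source_uri`.");
  }
  return hasText ? "text" : "source_uri";
}
```

A guard like this would run before any model call, so a malformed request fails without spending inference time.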
summarize-long
summarize-long(text?: string, source_uri?: string, style?: string) → structured summary
Routes to Tier C (Qwen3-8B-4bit MLX, numCtx=32768) for long-context
documents and returns a 1–2 sentence lead followed by 3–6 bullets. Exactly
one of text or source_uri must be provided.
Qwen3-8B is a thinking model; the bridge auto-injects \n/no_think into
user content so the model emits the summary directly without first
burning the per-tool cap on a <think>...</think> reasoning trace.
numCtx=32768 admits ~25 K words of source in a single call.
Server-side residency is ~5 GB of weights plus ~3 GB of KV cache at full
context, about 8 GB in total, which fits on a 16 GB Mac as long as Tier D
(14B) is not loaded at the same time.
Documents longer than ~25 K words exceed the model's context; use
summarize-long-chunked for those.
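The /no_think injection described above could look roughly like the following. This is a sketch under assumptions: the `ChatMessage` shape and the `injectNoThink` helper are hypothetical, and appending the suffix to every user message (rather than only the last one) is a guess at the bridge's behaviour.

```typescript
// Hypothetical chat message shape in the OpenAI-compatible style.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const NO_THINK_SUFFIX = "\n/no_think";

// Append the Qwen3 soft switch to user messages so the model skips its
// <think>...</think> trace. Idempotent: an already-tagged message is
// returned unchanged.
function injectNoThink(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) =>
    m.role === "user" && !m.content.endsWith(NO_THINK_SUFFIX)
      ? { ...m, content: m.content + NO_THINK_SUFFIX }
      : m,
  );
}
```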
summarize-long-chunked
summarize-long-chunked(
  text?: string,
  source_uri?: string,
  style?: string,
  max_chunks?: number = 100,
) → coherent final summary
Map-reduce chunked summarization for documents that exceed Tier C's single-call
ceiling (~25 K words). Splits the source into overlapping chunks (default
2 000 tokens, configurable), summarizes each in parallel via p-limit, then
recursively combines chunk summaries until one bucket fits a single REDUCE call.
- Same Tier C model as summarize-long.
- Per-call soft timeout of 50 s, with a chained AbortSignal per chunk so
  client disconnects propagate cleanly without leaking work.
- Recursion depth ≤ 3 covers up to ~200 K-token inputs; beyond that, the tool
  returns partial: true with only the first bucket reduced.
- Fast-path: if the source fits Tier C in one call, the tool runs as a
single equivalent call (no chunking tax). Strict superset of
summarize-long.
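The map-reduce flow in the list above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `Summarize` is a hypothetical stand-in for one Tier C call, `mapWithLimit` is a hand-rolled substitute for p-limit so the block stays self-contained, and bucketing by summary count (rather than by token budget) is a simplification.

```typescript
// Hypothetical stand-in for one Tier C call; the real bridge talks to oMLX.
type Summarize = (text: string) => Promise<string>;

// Split a token array into windows of `size` that share `overlap` tokens.
function splitWithOverlap<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += size - overlap) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

// Run `fn` over `items` with at most `limit` calls in flight.
async function mapWithLimit<A, B>(
  items: A[],
  limit: number,
  fn: (item: A) => Promise<B>,
): Promise<B[]> {
  const results = new Array<B>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  const workers = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}

// Combine summaries bucket by bucket until a single bucket remains,
// for at most `maxDepth` rounds of bucketed combining.
async function reduceSummaries(
  summaries: string[],
  bucketSize: number,
  summarize: Summarize,
  maxDepth = 3,
  depth = 0,
): Promise<{ summary: string; partial: boolean }> {
  if (bucketSize < 2) throw new RangeError("bucketSize must be at least 2");
  if (summaries.length <= bucketSize) {
    return { summary: await summarize(summaries.join("\n\n")), partial: false };
  }
  if (depth >= maxDepth) {
    // Recursion budget exhausted: reduce only the first bucket and flag it.
    const first = summaries.slice(0, bucketSize).join("\n\n");
    return { summary: await summarize(first), partial: true };
  }
  const buckets: string[][] = [];
  for (let i = 0; i < summaries.length; i += bucketSize) {
    buckets.push(summaries.slice(i, i + bucketSize));
  }
  const combined = await mapWithLimit(buckets, 2, (b) => summarize(b.join("\n\n")));
  return reduceSummaries(combined, bucketSize, summarize, maxDepth, depth + 1);
}
```

The MAP step would be `mapWithLimit(chunks, n, summarize)` over the output of `splitWithOverlap`; the overlap keeps sentences that straddle a chunk boundary visible to both neighbouring chunks.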
Client-timeout reality. Claude Code's MCP request timeout is a hard ~60 s
that cannot be extended via settings.json or any documented env var
(MCP_TIMEOUT controls only server startup; see anthropics/claude-code #5221
and #22542). The chunking work itself takes minutes on a 16 GB Mac, so from
Claude Code this tool is useful in fast-path mode only, that is, for
documents up to ~12–15 KB of Chinese or ~25 KB of English. Larger documents
force the full chunking path, which exceeds 60 s of total wall time and times
out the MCP request even though each individual local-model call stays under
the per-call 50 s budget.
The chunked path is reachable from clients with longer timeouts (Claude
Desktop: 240 s default; custom integrations: configurable). The smoke suite
(tests/smoke-bridge.mjs) exercises it end-to-end with a 600 s harness
timeout.
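The per-call 50 s soft timeout and the chained AbortSignal mentioned above can be combined in one derived signal. The following is a minimal sketch, assuming a hypothetical `chainSignal` helper; it uses only the standard AbortController and timer APIs and is not taken from the package source.

```typescript
// Derive a per-call signal that aborts when the client disconnects or the
// soft timeout elapses, whichever comes first. Call dispose() once the call
// settles so neither the timer nor the listener outlives it.
function chainSignal(
  parent: AbortSignal | undefined,
  timeoutMs: number,
): { signal: AbortSignal; dispose: () => void } {
  const controller = new AbortController();
  const onParentAbort = (): void => controller.abort();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  if (parent?.aborted) {
    // The client is already gone: abort immediately.
    controller.abort();
  } else {
    parent?.addEventListener("abort", onParentAbort, { once: true });
  }

  return {
    signal: controller.signal,
    dispose: () => {
      clearTimeout(timer);
      parent?.removeEventListener("abort", onParentAbort);
    },
  };
}
```

The derived signal can be passed as the `signal` option of `fetch`, so an in-flight request to the local server is cancelled as soon as either condition fires.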
classify
classify(
  text: string,
  categories: string[],
  allow_multiple: boolean = false,
  explain: boolean = false,
) → { labels: string[], reason?: string }