From qe-framework
Guides local LLM infrastructure setup and optimization including Ollama, vLLM, llama.cpp, quantization, API integration, and MCP bridging for Claude Code.
How this skill is triggered — by the user, by Claude, or both
Slash command
/qe-framework:Qlocal-llm-setupThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
cc---
This skill provides end-to-end guidance for setting up and optimizing local LLM infrastructure. It covers installation, model management, quantization strategies, API integration, and MCP bridging for Claude Code workflows.
brew install ollama
# Start daemon
ollama serve
curl -fsSL https://ollama.com/install.sh | sh
# Start daemon (systemd)
sudo systemctl start ollama
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull mistral
| Command | Purpose |
|---|---|
ollama pull <model> | Download model from registry |
ollama run <model> | Run model interactively (blocking) |
ollama list | Show downloaded models and size |
ollama rm <model> | Delete model |
ollama show <model> | Display model metadata |
ollama cp <src> <dst> | Copy/rename model |
Create a Modelfile to customize models:
FROM mistral
PARAMETER temperature 0.7
PARAMETER num_predict 256
PARAMETER top_k 40
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
Common parameters:
temperature: Randomness (0.0–2.0); 0 = deterministictop_k: Sample from top K tokenstop_p: Nucleus sampling thresholdnum_predict: Max output lengthnum_ctx: Context window sizenum_gpu: GPU layers to offloadBuild and use:
ollama create my-mistral -f Modelfile
ollama run my-mistral "Explain recursion"
Ollama exposes multiple endpoint formats:
/api/chat)curl http://localhost:11434/api/chat -d '{
"model": "mistral",
"messages": [{"role": "user", "content": "Hello"}],
"stream": false
}'
/api/generate)curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Tell me about Rust",
"stream": false
}'
/api/embeddings)curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "machine learning"
}'
/v1/chat/completions)curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [{"role": "user", "content": "Hi"}]
}'
Set environment variables before starting Ollama:
# Use N GPU layers
export OLLAMA_NUM_GPU=35
# Limit concurrent loaded models
export OLLAMA_MAX_LOADED_MODELS=2
# Force CPU-only
export OLLAMA_NUM_GPU=0
# Start server with config
OLLAMA_NUM_GPU=35 ollama serve
For NVIDIA GPUs:
OLLAMA_NUM_GPU: Number of layers to offload (higher = faster inference, more VRAM)For Metal (macOS):
OLLAMA_NUM_GPU to control layer offloadingpip install vllm
# With CUDA support
pip install vllm[cuda12]
vllm serve mistral-7b-instruct-v0.2 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 2048
| Quantization | Size Reduction | Speed | VRAM | Use Case |
|---|---|---|---|---|
| AWQ | 3–4× | Very fast | Low | Production inference |
| GPTQ | 3–4× | Fast | Low | Balanced quality/speed |
| FP8 | 2× | Fast | Medium | RTX 4000+ (Ampere+) |
| FP16 (baseline) | 1× | Medium | High | Multi-GPU, high accuracy |
AWQ quantization example:
# Use pre-quantized model
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--port 8000
GPTQ:
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
--quantization gptq \
--port 8000
FP8 (auto-quantized):
vllm serve mistral-7b-instruct-v0.2 \
--quantization fp8 \
--port 8000
vllm serve mistral-7b-instruct-v0.2 \
--port 8000 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--kv-cache-dtype fp8_e5m2 \
--batch-size 32
Key parameters:
--tensor-parallel-size N: Shard model across N GPUs--kv-cache-dtype fp8_e5m2: KV cache quantization (saves 50% VRAM)--gpu-memory-utilization: Target GPU memory (0.9 = aggressive)--max-model-len: Max context length--batch-size: Batch size for inferencefrom vllm import LLM, SamplingParams
# Load model
llm = LLM(
model="mistral-7b-instruct-v0.2",
tensor_parallel_size=1,
gpu_memory_utilization=0.8,
quantization="awq", # optional
)
# Configure sampling
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256
)
# Generate
outputs = llm.generate(
["Explain machine learning in 100 words"],
sampling_params
)
for output in outputs:
print(output.outputs[0].text)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# macOS with Metal acceleration
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
# Linux with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
| Format | Bits | Size | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2 | 3.3GB (7B) | Poor | Fastest | Edge devices, demo |
| Q3_K | 3 | 5.2GB (7B) | Fair | Very fast | 4GB VRAM devices |
| Q4_K | 4 | 6.6GB (7B) | Good | Fast | 8GB VRAM, general use |
| Q5_K | 5 | 8.7GB (7B) | Very good | Medium | 12GB VRAM, balanced |
| Q6_K | 6 | 10.7GB (7B) | Excellent | Medium | 16GB VRAM, quality focus |
| Q8_0 | 8 | 14.6GB (7B) | Near-original | Slower | Reference, high accuracy |
Download quantized models:
# TheBloke repository (extensive GGUF collection)
ollama pull neural-chat # Or use llama.cpp directly:
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
./build/bin/llama-server \
--model mistral-7b-instruct-v0.2.Q4_K_M.gguf \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 33 \
--batch-size 512 \
--parallel 4
Parameters:
--ctx-size: Context window (max tokens)--n-gpu-layers: GPU layer offloading (higher = faster)--batch-size: Batch size for parallel processing--parallel: Run N independent sequencesOpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [{"role": "user", "content": "Hi"}],
"temperature": 0.7
}'
For a 7B-parameter model:
Base model memory = (7 × 10^9 params) × (bits per param) / 8
Q4_K_M (4.5 bits): ~3.5 GB
Q5_K (5 bits): ~4.2 GB
Q8_0 (8 bits): ~6.7 GB
+ KV cache overhead (~context_length × 2 layers × 0.1 GB per 4K ctx)
Example: Q4_K_M + 8K context ≈ 4 GB + 0.5 GB = 4.5 GB
| Task | Recommended Models | Min Quantization | VRAM |
|---|---|---|---|
| Code Generation | Qwen2.5-Coder-32B, DeepSeek-Coder-V2-16B, Llama 3.1-8B | Q4_K | 8–16GB |
| Chat | Mistral-7B, Llama 3-8B, Qwen2.5-14B | Q4_K | 6–12GB |
| Reasoning | DeepSeek-R1-14B, Mistral-Large, Llama 3.1-70B (quantized) | Q5_K | 12GB+ |
| Embeddings | nomic-embed-text (768-dim), BGE-small (384-dim) | F16 | 2–4GB |
| Lightweight Chat | Phi-3-mini (3.8B), Qwen2.5-3B | Q4_K | 2–4GB |
| Model | Params | Best For | License |
|---|---|---|---|
| Qwen2.5-Coder-32B | 32B | Code + CJK support | Apache 2.0 |
| DeepSeek-Coder-V2 | 16B (MoE) | Code, reasoning | MIT |
| Llama 3.1 | 8B/70B/405B | Chat, instruction-following | Llama License |
| Mistral-7B-Instruct | 7B | Fast chat | Apache 2.0 |
| Phi-3-mini | 3.8B | Edge inference | MIT |
| nomic-embed-text | 137M | Semantic search | OpenRAIL |
| BGE-small-en-v1.5 | 33M | Dense embedding | MIT |
Small Language Models (SLM): <7B parameters, designed for consumer hardware without offloading.
| Benefit | Impact |
|---|---|
| Low latency | 20–50ms first token on CPU |
| Privacy | Full on-device, no external calls |
| Cost | $0 inference cost |
| Battery | Suitable for mobile/edge |
| Offline | Works without internet |
| Model | Params | Context | Strengths | VRAM |
|---|---|---|---|---|
| Phi-3-mini | 3.8B | 128K | Fast, broad knowledge | 2GB |
| Qwen2.5-3B | 3B | 32K | CJK support, efficient | 2GB |
| Gemma 2B | 2B | 8K | Well-tuned, instruction-following | 1.5GB |
| TinyLlama-1.1B | 1.1B | 2K | Ultra-lightweight | 1GB |
# Ollama
ollama run phi
# vLLM
vllm serve microsoft/phi-3-mini --quantization awq --port 8000
# llama.cpp
./llama-server --model phi-3-mini-q4.gguf --ctx-size 2048
Model Context Protocol (MCP): A standardized bridge for Claude to access external LLM services, tools, and resources. Perfect for integrating local Ollama/vLLM/llama.cpp with Claude Code.
Install:
claude mcp add ollama-mcp -- npx ollama-mcp --model mistral
claude_desktop_config.json:
{
"mcpServers": {
"ollama": {
"command": "npx",
"args": ["ollama-mcp", "--model", "mistral"],
"env": {
"OLLAMA_BASE_URL": "http://localhost:11434"
}
}
}
}
Usage in Claude Code:
// MCP automatically exposes ollama tools
// Claude can call: generate_text, chat, embeddings
Install:
git clone https://github.com/sammcj/OllamaClaude
cd OllamaClaude
npm install
Configure:
const OllamaClaude = require('./src/index.js');
const client = new OllamaClaude({
ollamaUrl: 'http://localhost:11434',
model: 'mistral',
});
const response = await client.chat('Explain quantum computing');
Install:
pip install mcp-local-llm
Configure for multiple backends:
{
"mcpServers": {
"local-ollama": {
"command": "python",
"args": ["-m", "mcp_local_llm", "--backend", "ollama", "--port", "11434"]
},
"local-vllm": {
"command": "python",
"args": ["-m", "mcp_local_llm", "--backend", "vllm", "--port", "8000"]
}
}
}
Build with MCP support:
cmake -B build -DGGML_MCP=ON
cmake --build build --config Release
Run server with MCP:
./llama-server --model model.gguf --mcp-port 3000
// workflows/my-workflow.ts
import { MCPClient } from './mcp-client';
const llm = new MCPClient('ollama');
async function analyzeCode(code: string) {
// Delegate to local Mistral
const review = await llm.generate({
prompt: `Review this code:\n${code}`,
model: 'mistral',
temperature: 0.3,
});
return review;
}
| Model | Params | Strengths | Notes |
|---|---|---|---|
| Qwen2.5 | 3B–72B | Excellent CJK support, multilingual | Recommended for Korean |
| EEVE-Korean-10.8B | 10.8B | Fine-tuned for Korean | Based on Mistral |
| KULLM3 | 13B | Korean-specific instruction tuning | Samsung research |
| Llama 3-8B | 8B | Good CJK, multilingual base | Fine-tune for Korean |
# Ollama: Pull Qwen with Korean support
ollama pull qwen2.5:14b
# Modelfile for Korean instruction-following
cat > Modelfile.ko <<EOF
FROM qwen2.5:14b
PARAMETER temperature 0.7
SYSTEM "당신은 한국어를 유창하게 이해하고 응답하는 AI 어시스턴트입니다."
EOF
ollama create ko-qwen -f Modelfile.ko
ollama run ko-qwen "한국 국회의 역할을 설명해주세요"
# Using OpenCLIP for Korean text recognition
pip install open-clip-torch pillow
from PIL import Image
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-B-32',
pretrained='openai'
)
image = preprocess(Image.open('korean_text.png')).unsqueeze(0)
# Then use with OCR pipeline (Tesseract or EasyOCR)
Problem: RuntimeError: CUDA out of memory
Solutions (in order):
--gpu-memory-utilization (0.9 → 0.7)--max-model-len 2048 (from 4096)--kv-cache-dtype fp8_e5m2--batch-size 16 (from 32)Problem: First token takes 5–10 seconds
Causes & fixes:
OLLAMA_NUM_GPU > 0sysctl hw.thermal and coolingOptimal Mac setup:
# Macbook Pro 16" M3 Max
ollama pull mistral:7b-instruct-q4_K_M
OLLAMA_NUM_GPU=12 ollama serve
# Expected: ~150ms first token, ~50ms/token streaming
Problem: Error: failed to load model
Debug steps:
shasum -a 256 model.gguffile model.gguf (should be GGUF 3)du -sh model.ggufls -la ~/.ollama/models/ or ~/.cache/huggingface/rm && re-download# Verify model
llama-cpp -m model.gguf --test
Problem: 1–2 seconds before first output
Causes:
Solutions:
# Pre-load model into GPU
ollama run mistral "hi" && echo "Warmed up"
# For vLLM: ensure model is loaded
python -c "from vllm import LLM; llm = LLM('mistral')" # Blocks until loaded
# Batch requests to amortize cost
# Instead of 10 requests × 2s TTFT = 20s total
# Queue 10 requests → 2s TTFT + 0.1s/token = 3s total
Formula:
KV cache size = 2 × (seq_length × num_heads × head_dim × dtype_bytes)
Example: seq_len=4096, heads=32, head_dim=128, fp16:
KV cache ≈ 2 × (4096 × 32 × 128 × 2) / (1024^3) ≈ 2 GB
Context length strategies:
Recommendation: Start at 4K; increase only if needed.
Trade: Throughput vs Latency
# Low batch size (1): Fast TTFT, low throughput
vllm serve model --max-num-seqs 1
# Medium batch size (8): Balanced
vllm serve model --max-num-seqs 8
# High batch size (32+): High throughput, high latency (APIs)
vllm serve model --max-num-seqs 32 --disable-log-requests
Default (FP16): Each token cached at full precision
With FP8 quantization:
vllm serve model --kv-cache-dtype fp8_e5m2
# Reduces to ~2 GB (50% savings)
With sliding window:
# Only keep recent N tokens in KV cache
vllm serve model --kv-cache-dtype fp8_e5m2 --max-model-len 4096
For NVIDIA (CUDA):
# Estimate layers to offload = (available_vram_gb - 1) / (model_size_gb / num_layers)
# Example: 12 GB VRAM, 7B model (~4 GB, 32 layers)
# Layers = (12 - 1) / (4 / 32) ≈ 88 (offload all)
vllm serve mistral \
--gpu-memory-utilization 0.85 \
--tensor-parallel-size 2 # Use 2 GPUs
For Metal (macOS):
# Metal has automatic GPU detection
# Adjust OLLAMA_NUM_GPU (0–100+ depending on chip)
OLLAMA_NUM_GPU=30 ollama serve # M3 Max: ~30 layers
# Measure TTFT and tokens/second
time curl -X POST http://localhost:11434/api/generate \
-d '{"model": "mistral", "prompt": "Explain AI", "stream": false}' \
| jq '.eval_count, .eval_duration'
# Profile with vLLM metrics
vllm serve model --enable-prometheus --port 8000
# Visit http://localhost:8000/metrics
# Ollama
brew install ollama && ollama serve
# vLLM
pip install vllm[cuda12] && vllm serve mistral --port 8000
# llama.cpp
git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && cmake -B build && cmake --build build
# MCP bridge
claude mcp add ollama-mcp -- npx ollama-mcp
# Ollama
ollama pull mistral
ollama run mistral "Hi"
ollama list
ollama rm mistral
# Check running models
curl http://localhost:11434/api/tags | jq .
# Kill hung process
lsof -i :11434 | grep LISTEN | awk '{print $2}' | xargs kill -9
# Ollama
export OLLAMA_NUM_GPU=35
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_MODELS=~/.ollama/models
# vLLM
export VLLM_GPU_MEMORY_UTILIZATION=0.9
export VLLM_ATTENTION_BACKEND=flashinfer
# llama.cpp
export LLAMA_CUDA_FORCE_DMMV=1
export LLAMA_METAL_MAX_BATCH_SIZE=512
npx claudepluginhub inho-team/qe-framework --plugin qe-frameworkCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.