From dgx-spark
Manage AI models on the DGX Spark — list, pull, serve, stop, and recommend models across Ollama and vLLM backends. Use when deploying models, checking what's running, pulling new models, or getting recommendations for a use case. Triggers on: model names (Qwen, Llama, DeepSeek, Gemma), "serve model", "pull model", "what models are running", "deploy model on Spark".
Slash command: /dgx-spark:spark-models
Manage AI models across Ollama (quick interactive use) and vLLM (production serving with tool-calling support for Claude Code).
| Backend | Best For | Performance | Tool Calling |
|---|---|---|---|
| Ollama | Quick experiments, trying models, chat | Good for < 30B | Limited |
| vLLM | Claude Code integration, production serving, batched inference | Optimized with NVFP4 | Full support |
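The trade-offs in the table can be condensed into a small decision helper. This is only a sketch: the function name `pick_backend`, its arguments, and the use of 30B as a hard cutoff are assumptions made for illustration and are not part of the skill.

```shell
# Hypothetical helper: suggest a backend based on the comparison table.
# Usage: pick_backend <size_in_billions> <needs_tool_calling: yes|no>
pick_backend() {
  local size_b="$1" needs_tools="$2"
  if [ "$needs_tools" = "yes" ]; then
    # Full tool-calling support (e.g. for Claude Code) is listed only for vLLM.
    echo "vllm"
  elif [ "$size_b" -lt 30 ]; then
    # The table rates Ollama as good for models under 30B.
    echo "ollama"
  else
    echo "vllm"
  fi
}
```

For example, `pick_backend 8 no` suggests Ollama, while `pick_backend 8 yes` suggests vLLM because tool calling takes priority over model size.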
Always check what's running before deploying:
# Use MCP tool
spark_list_models
# Or directly
ollama list
docker ps --filter "ancestor=nvcr.io/nvidia/vllm:latest"
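When running the direct commands from a script, it can help to skip a backend whose CLI is not installed rather than fail. The wrapper below is a minimal sketch; the function name `list_spark_models` and the `--format` columns are illustrative choices, not something the skill defines.

```shell
# Sketch: list models from both backends, skipping any CLI that is missing.
list_spark_models() {
  if command -v ollama >/dev/null 2>&1; then
    echo "== Ollama models =="
    ollama list
  else
    echo "ollama not found, skipping"
  fi
  if command -v docker >/dev/null 2>&1; then
    echo "== vLLM containers =="
    # Same filter as above; the format string only trims the output columns.
    docker ps --filter "ancestor=nvcr.io/nvidia/vllm:latest" \
      --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
  else
    echo "docker not found, skipping"
  fi
}

# Call it with: list_spark_models
```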
Pull a model:
# Use MCP tool
spark_pull_model { "model": "qwen3.5:32b" }
# Pulls are asynchronous; check progress with spark_list_models
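Because the pull returns before the download finishes, a script may want to wait until the model is listed. The sketch below assumes the model is pulled through Ollama (the tag format suggests this, but the document does not say so) and that `ollama list` prints a header row followed by one model per line with the name in the first column. The function names and the 10-second interval are hypothetical.

```shell
# Sketch: succeed if a model tag appears in `ollama list` output.
# Reads the listing from stdin so it can be tried without Ollama installed.
model_is_pulled() {
  local model="$1"
  # Skip the header row, then compare the first column (NAME) exactly.
  awk -v m="$model" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Poll until the tag appears (the interval is an arbitrary choice).
wait_for_model() {
  local model="$1"
  until ollama list | model_is_pulled "$model"; do
    sleep 10
  done
}

# Call it with: wait_for_model qwen3.5:32b
```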
Serve a model with vLLM:
# Use MCP tool
spark_start_model {
"model": "Qwen/Qwen3-Coder-Next",
"port": 8000,
"extraArgs": ["--max-model-len", "32768"]
}
Required flags for Claude Code compatibility:
- `--enable-auto-tool-choice`: enables tool calling
- `--tool-call-parser hermes`: use Hermes format (verify per model)

These are added automatically by the MCP tool.
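To see how these flags combine with the `extraArgs` from the call above, the following sketch prints a `vllm serve` command line without running it. How the MCP tool actually launches the container is not shown in this document, so the helper `build_vllm_cmd` and its argument order are assumptions. The flags themselves are standard `vllm serve` options.

```shell
# Sketch: print (not run) a vllm serve command with the required flags.
# Usage: build_vllm_cmd <model> <port> [extra args...]
build_vllm_cmd() {
  local model="$1" port="$2"
  shift 2
  # hermes is the parser named above; verify it per model before relying on it.
  local parser="hermes"
  # Required tool-calling flags first, then any caller-supplied extra arguments.
  echo vllm serve "$model" --port "$port" \
    --enable-auto-tool-choice --tool-call-parser "$parser" "$@"
}

build_vllm_cmd Qwen/Qwen3-Coder-Next 8000 --max-model-len 32768
```

Printing the command first makes it easy to review the final flag set before starting a long-running server.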
Stop a model:
spark_stop_model { "containerName": "vllm-qwen3-coder" }

Models for vLLM (production serving):

| Model | Size | Quantization | tok/s (approx) | Notes |
|---|---|---|---|---|
| Qwen3-Coder-Next | ~32B | NVFP4 | ~15 | Purpose-built for agentic coding |
| Qwen3.5 | 32B | NVFP4 | ~15 | Strong general + coding |
| GLM-4.7-Flash | 35B MoE | NVFP4 | ~18 | Fast MoE architecture |
| Llama 3.1 8B | 8B | FP16 | ~20 | Fast baseline, limited reasoning |

Models for Ollama (quick interactive use):

| Model | Size | tok/s (approx) | Notes |
|---|---|---|---|
| llama3.1:8b | 8B | ~20 | Good starter model |
| qwen3.5:32b | 32B | ~10 | Strong all-rounder |
| deepseek-r1:14b | 14B | ~15 | Reasoning focus |
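The Notes columns in the two tables can be read as a simple mapping from use case to model. The sketch below encodes that mapping; the function name `recommend_model` and the use-case keywords are hypothetical, and the choices simply mirror the table notes rather than any measured comparison.

```shell
# Hypothetical helper: map a use case to a model from the tables above.
# Prints "<backend> <model>"; returns 1 for an unrecognized use case.
recommend_model() {
  case "$1" in
    coding)    echo "vllm Qwen/Qwen3-Coder-Next" ;;  # purpose-built for agentic coding
    reasoning) echo "ollama deepseek-r1:14b" ;;      # reasoning focus
    starter)   echo "ollama llama3.1:8b" ;;          # good starter model
    general)   echo "ollama qwen3.5:32b" ;;          # strong all-rounder
    *)         echo "unknown use case: $1" >&2; return 1 ;;
  esac
}
```

For example, `recommend_model coding` prints `vllm Qwen/Qwen3-Coder-Next`.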
- … (VLLM_GPU_MEMORY_UTILIZATION=0.7)
- … (nvcr.io/nvidia/vllm), NOT vllm/vllm-openai

Install the plugin:
npx claudepluginhub jeremyeder/dgx-agentskills --plugin dgx-spark