Skill

nvidia-delegate

Use whenever the user wants a second opinion across LLMs, an ensemble comparison ("ask three models", "compare answers", "what do other LLMs think"), a cheap or fast alternative to Claude for a self-contained subtask, deep chain-of-thought reasoning from a reasoning-tuned model, or any explicit mention of NVIDIA models, NIM endpoints, Llama 3.x or Llama 4 Maverick, Kimi K2.6, DeepSeek V4 (Pro or Flash), Nemotron Super/Mini, Mixtral, Qwen3-Next, or "multi-agent". Routes via the nvidia-models MCP server (tools: list_models, nvidia_chat, nvidia_compare). Trigger phrases: "second opinion", "another model", "compare answers", "fan out", "ensemble", "ask three models", "delegate to llama", "use deepseek", "kimi", "qwen", "nemotron", "nvidia compare", "what would model X say".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/nvidia-models:nvidia-delegate

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Three MCP tools wired by this plugin:

SKILL.md

71 lines · ~974 tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitMay 16, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Delegate to NVIDIA-hosted LLMs

Three MCP tools wired by this plugin:

mcp__nvidia-models__list_models — short keys → full NIM paths.
mcp__nvidia-models__nvidia_chat(model, prompt, system?, max_tokens?, temperature?) — single-model call.
mcp__nvidia-models__nvidia_compare(prompt, models?, system?, max_tokens?) — fan the same prompt across N models in one call. Default trio: llama-4-maverick, deepseek-v4-pro, qwen3-next-80b (three different vendors / architectures).

When to invoke

User intent	Tool	Suggested model
"Second opinion" / "what would another LLM say"	`nvidia_compare`	default 3
"Compare answers from X, Y, Z"	`nvidia_compare`	user-specified list
"Quick draft" / "cheap inference"	`nvidia_chat`	`llama-3.1-70b` or `mixtral-8x7b`
Tiny / very fast	`nvidia_chat`	`nemotron-mini-4b`
Reasoning-heavy / chain-of-thought	`nvidia_chat`	`deepseek-v4-pro` or `deepseek-v4-flash`
Long-context generalist	`nvidia_chat`	`llama-4-maverick`
NVIDIA-tuned generalist	`nvidia_chat`	`nemotron-super-49b`
Qwen-family preference	`nvidia_chat`	`qwen3-next-80b`
"Ensemble" / "voting" / "multi-agent"	`nvidia_compare`	3–5 generalists
Explicit "use "	`nvidia_chat`	the one named

Defaults to pick when the user is vague

General Q&A → llama-4-maverick (or llama-3.3-70b for cheaper)
Reasoning → deepseek-v4-pro (premium) or deepseek-v4-flash (fast)
Cheap throwaway draft → nemotron-mini-4b
Comparison → nvidia_compare (default trio: llama-4-maverick, deepseek-v4-pro, qwen3-next-80b)

After the call

Don't dump raw model output unless the user asked for it.
For nvidia_compare: summarize agreements + disagreements, then surface the most useful answer.
For nvidia_chat: integrate the response into Claude's own answer; cite which model said what when it matters.

When NOT to invoke

The user is in a tight feedback loop debugging the current Claude conversation — don't hand off mid-flow.
A trivial factual lookup that doesn't benefit from a second model.
The user explicitly said "you answer this" or "don't delegate".
The prompt contains secrets or PII the user wouldn't want sent to a third party — surface that risk first.

Failure modes

NVIDIA_API_KEY not set → tell the user: setx NVIDIA_API_KEY "nvapi-..." on Windows, or export NVIDIA_API_KEY=... on macOS/Linux, then restart Claude Code.
Unknown model key → the server passes it through verbatim; surface the upstream error and suggest mcp__nvidia-models__list_models to find the right short key.
Rate / quota errors → suggest waiting or routing to a cheaper model (e.g., llama-3.1-70b instead of llama-3.1-405b).

Bounds enforced server-side

The MCP server rejects requests that exceed: prompt 32,000 chars, system 8,000 chars, max_tokens 1–8192, temperature 0–2, models list 1–12 entries. Don't try to bypass — if the user asks for a longer prompt, chunk it first.

nvidia-delegate

Invocation

Context Preview

SKILL.md

nvidia-delegate

Invocation

Context Preview

SKILL.md

Delegate to NVIDIA-hosted LLMs

When to invoke

Defaults to pick when the user is vague

After the call

When NOT to invoke

Failure modes

Bounds enforced server-side

Similar Skills

Delegate to NVIDIA-hosted LLMs

When to invoke

Defaults to pick when the user is vague

After the call

When NOT to invoke

Failure modes

Bounds enforced server-side

Similar Skills