From llm-gateway
Runs a task through multiple LLMs (Claude, Codex, Gemini, Grok, Mistral) independently, requiring unanimous agreement before proceeding. Use for high-stakes generation, conflict resolution, or final quality gates.
Slash command: `/llm-gateway:multi-llm-consensus`
When correctness matters more than speed, send the same task to Claude, Codex, Gemini, and (optionally) Grok and Mistral Vibe independently, then compare results. All agents must agree before proceeding. Adding Grok gives an independent fourth model from a different vendor (xAI), and Mistral Vibe gives a fifth — useful when consensus needs diversity to defend against shared-blind-spot failures across the OpenAI/Google/Anthropic/xAI family.
Mistral note: the gateway always emits `--agent <mode>` (default `auto-approve` for programmatic callers); set `permissionMode` explicitly when needed. Continuity-bearing consensus loops also need `[session_logging] enabled = true` in `~/.vibe/config.toml`.
Apply these on every dispatch unless the caller has explicitly overridden a rule in the current turn:
- `model`: let the gateway use its configured default per CLI. Nominating a model risks deprecated IDs (`o3`, `o3-pro`, `gpt-4o`, …) and capability mismatches.
- `approvalStrategy:"mcp_managed"` is the skill dispatch default (the gateway schema default is `"legacy"`). For Codex, also pass `fullAuto:true` when it needs file/shell access.
- `idleTimeoutMs` is a separate no-output safeguard.
- NOT APPROVED or conditional approval from any reviewer triggers the fix-and-re-review loop to all reviewers. Escalate after 3 rounds. This rule does not apply to pure implementation or non-review analysis dispatches.

Each LLM generates a solution independently. Compare outputs structurally.
claude_request_async({
prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
approvalStrategy: "mcp_managed",
correlationId: "gen-claude"
})
codex_request_async({
prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
fullAuto: true,
approvalStrategy: "mcp_managed",
correlationId: "gen-codex"
})
gemini_request_async({
prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
approvalStrategy: "mcp_managed",
correlationId: "gen-gemini"
})
grok_request_async({
prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
approvalStrategy: "mcp_managed",
correlationId: "gen-grok"
})
Poll all four. When complete, compare the outputs structurally.
All reviewers must approve the same artifact. Used for final review gates.
claude_request_async({
prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
approvalStrategy: "mcp_managed",
correlationId: "review-claude"
})
codex_request_async({
prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
fullAuto: true,
approvalStrategy: "mcp_managed",
correlationId: "review-codex"
})
gemini_request_async({
prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
approvalStrategy: "mcp_managed",
correlationId: "review-gemini"
})
grok_request_async({
prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
approvalStrategy: "mcp_managed",
correlationId: "review-grok"
})
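Verdict rules: every reviewer must return APPROVED. NOT APPROVED or conditional approval from any reviewer triggers the fix-and-re-review loop to all reviewers; escalate after 3 rounds.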
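The unanimity requirement, the fix-and-re-review loop, and the 3-round escalation can be sketched as a small decision helper. This is a minimal illustration, not gateway code: `parseVerdict` and `decideNextStep` are hypothetical names, and treating any mention of "conditional" or a missing marker as blocking is an assumption. Note that `NOT APPROVED` contains `APPROVED`, so a naive substring check would misread rejections.

```typescript
// Hypothetical helpers for illustration; these names are not part of the gateway.
type Verdict = "approved" | "not_approved";
type NextStep = "proceed" | "fix_and_rereview" | "escalate";

// Find the last APPROVED / NOT APPROVED marker in a review.
// Conditional approval or a missing marker counts as blocking (assumption).
function parseVerdict(reviewText: string): Verdict {
  const marker = /\b(NOT\s+)?APPROVED\b/g;
  let last: RegExpExecArray | null = null;
  let match: RegExpExecArray | null = marker.exec(reviewText);
  while (match !== null) {
    last = match;
    match = marker.exec(reviewText);
  }
  if (last === null) return "not_approved";
  // The optional group is set only when the marker reads "NOT APPROVED".
  if (last[1]) return "not_approved";
  if (/conditional/i.test(reviewText)) return "not_approved";
  return "approved";
}

// Unanimous approval proceeds; otherwise loop until the round limit, then escalate.
function decideNextStep(
  reviews: string[],
  round: number,
  maxRounds: number = 3
): NextStep {
  const unanimous =
    reviews.length > 0 &&
    reviews.every((review) => parseVerdict(review) === "approved");
  if (unanimous) return "proceed";
  return round >= maxRounds ? "escalate" : "fix_and_rereview";
}
```

A single dissenting reviewer is enough to return `"fix_and_rereview"`, which matches the rule that the re-review goes back to all reviewers rather than only the one that objected.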
Multiple valid approaches exist. Each LLM proposes independently, then a designated LLM synthesizes.
claude_request({
prompt: "Three LLMs proposed solutions for [problem]:\n\nClaude's proposal: [proposal]\nCodex's proposal: [proposal]\nGemini's proposal: [proposal]\n\nSynthesize the best approach, explaining why. If they agree, confirm. If they conflict, choose the strongest with justification.",
approvalStrategy: "mcp_managed"
})
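When the number of proposers varies (three, four, or five agents), the synthesis prompt can be assembled from the collected proposals instead of being written by hand. The sketch below is illustrative only: `Proposal` and `buildSynthesisPrompt` are hypothetical names, and the wording simply mirrors the prompt above.

```typescript
// Hypothetical shape for a collected proposal; not a gateway type.
interface Proposal {
  agent: string; // e.g. "Claude", "Codex", "Gemini"
  text: string;  // the proposal body returned by that agent
}

// Assemble the synthesis prompt from any number of proposals.
function buildSynthesisPrompt(problem: string, proposals: Proposal[]): string {
  const sections = proposals.map(
    (p) => p.agent + "'s proposal: " + p.text
  );
  return (
    proposals.length + " LLMs proposed solutions for " + problem + ":\n\n" +
    sections.join("\n") +
    "\n\nSynthesize the best approach, explaining why. " +
    "If they agree, confirm. If they conflict, choose the strongest with justification."
  );
}

const synthesisPrompt = buildSynthesisPrompt("[problem]", [
  { agent: "Claude", text: "[proposal]" },
  { agent: "Codex", text: "[proposal]" },
  { agent: "Gemini", text: "[proposal]" },
]);
```

The resulting string is what would be passed as `prompt` to the designated synthesizer.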
Always use async tools for consensus — you need all results before deciding:
// Fire all four (each with approvalStrategy:"mcp_managed"; Codex also fullAuto:true)
job1 = claude_request_async({...})
job2 = codex_request_async({...})
job3 = gemini_request_async({...})
job4 = grok_request_async({...})
// Poll every 60 seconds (no wallclock timeout; cancel only on explicit instruction or hard failure)
llm_job_status({jobId: job1.job.id})
llm_job_status({jobId: job2.job.id})
llm_job_status({jobId: job3.job.id})
llm_job_status({jobId: job4.job.id})
// Collect results when all complete
result1 = llm_job_result({jobId: job1.job.id})
result2 = llm_job_result({jobId: job2.job.id})
result3 = llm_job_result({jobId: job3.job.id})
result4 = llm_job_result({jobId: job4.job.id})
Results are durable (default 30 days). If your polling wrapper dies, re-issue the same *_request_async calls — auto-dedup snaps each new call back onto the live job. Use forceRefresh:true only if you've genuinely changed the inputs.
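The poll-then-collect loop reduces to one decision per polling tick: keep waiting, collect everything, or stop on a hard failure. The sketch below shows that decision as a pure function. The status values `"running"`, `"completed"`, and `"failed"` are assumptions for illustration; the actual shape returned by `llm_job_status` may differ.

```typescript
// Hypothetical status values; check the real llm_job_status output before relying on them.
type JobState = "running" | "completed" | "failed";
type PollAction = "wait" | "collect" | "abort";

interface JobSnapshot {
  jobId: string;
  state: JobState;
}

// Decide what to do after one polling pass over all consensus jobs.
// There is deliberately no elapsed-time check: no wallclock timeout applies.
function nextPollAction(snapshots: JobSnapshot[]): PollAction {
  if (snapshots.length === 0) {
    throw new Error("no jobs to poll");
  }
  // A hard failure is the only automatic reason to stop.
  if (snapshots.some((s) => s.state === "failed")) return "abort";
  // Collect only when every job has finished: consensus needs all results.
  if (snapshots.every((s) => s.state === "completed")) return "collect";
  return "wait";
}
```

On `"wait"`, the orchestrator would sleep for the polling interval (60 seconds in the example above) and poll again; on `"collect"`, it calls `llm_job_result` for each job.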
Compare results structurally, not textually. Two implementations may look different but be functionally equivalent.
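One way to approximate a structural comparison for generated code is to run each implementation against the same inputs and compare outputs, rather than diffing source text. This is an assumed technique for illustration, not a procedure the skill prescribes; `behaveIdentically` is a hypothetical helper, and the `JSON.stringify` comparison only suits plain JSON-like return values.

```typescript
// Compare two implementations by behaviour on shared inputs, not by source text.
function behaveIdentically<I, O>(
  implA: (input: I) => O,
  implB: (input: I) => O,
  inputs: I[]
): boolean {
  return inputs.every(
    (input) => JSON.stringify(implA(input)) === JSON.stringify(implB(input))
  );
}

// Two textually different implementations of the same specification.
const sumWithLoop = (xs: number[]): number => {
  let total = 0;
  for (const x of xs) total += x;
  return total;
};
const sumWithReduce = (xs: number[]): number =>
  xs.reduce((acc, x) => acc + x, 0);

const sharedInputs: number[][] = [[], [1], [1, 2, 3], [-5, 5]];
const equivalent = behaveIdentically(sumWithLoop, sumWithReduce, sharedInputs);
```

Agreement on a finite input set is evidence of equivalence, not proof, so the inputs should include the edge cases named in the specification.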
For code generation:
For reviews:
| LLM | Generation Strength | Review Strength |
|---|---|---|
| Claude | Architecture, patterns, documentation | Design quality, maintainability |
| Codex | Implementation correctness, algorithms | Logic bugs, edge cases, test gaps |
| Gemini | Security-aware generation, edge cases | Security audit, OWASP, crash cases |
| Grok (xAI) | Independent perspective from a different vendor family | Tie-breaker / diversity reviewer when the other three converge on a blind spot |
Dispatch default: omit model on every call. The gateway's configured default per CLI is the right choice in the vast majority of cases. Only nominate a model when the caller explicitly named a specific variant in the current turn.
Avoid stale hardcoded model IDs such as o3, o3-pro, and gpt-4o; omit model or call list_models instead.
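The dispatch defaults can be applied in one place so that no call site hardcodes a model. The sketch below is a hypothetical payload builder, not part of the gateway: `buildDispatch`, `DispatchOptions`, and `DispatchPayload` are illustrative names, and only the fields shown in this document (`prompt`, `approvalStrategy`, `correlationId`, `fullAuto`, `model`) are used.

```typescript
type AgentName = "claude" | "codex" | "gemini" | "grok";

// Hypothetical options for illustration.
interface DispatchOptions {
  needsFileOrShellAccess?: boolean; // only affects Codex
  callerNamedModel?: string;        // set only when the caller named a variant this turn
}

interface DispatchPayload {
  prompt: string;
  approvalStrategy: "mcp_managed";
  correlationId: string;
  fullAuto?: boolean;
  model?: string;
}

// Build one request payload with the skill's dispatch defaults applied.
function buildDispatch(
  agent: AgentName,
  prompt: string,
  roundTag: string,
  options: DispatchOptions = {}
): DispatchPayload {
  const payload: DispatchPayload = {
    prompt: prompt,
    approvalStrategy: "mcp_managed",
    correlationId: roundTag + "-" + agent,
  };
  // Codex gets fullAuto only when it needs file/shell access.
  if (agent === "codex" && options.needsFileOrShellAccess) {
    payload.fullAuto = true;
  }
  // model stays absent unless the caller explicitly named one.
  if (options.callerNamedModel !== undefined) {
    payload.model = options.callerNamedModel;
  }
  return payload;
}

const consensusAgents: AgentName[] = ["claude", "codex", "gemini", "grok"];
const reviewRound = consensusAgents.map((agent) =>
  buildDispatch(agent, "Review [path].", "consensus-r1", {
    needsFileOrShellAccess: true,
  })
);
```

Each element of `reviewRound` would be passed to the matching `*_request_async` tool; none of them carries a `model` key.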
Consensus dispatch is the textbook use case for the structured `promptParts` field. Every reviewer sees the same spec / context block; only the task (or even nothing) varies. Switch from `prompt` to `promptParts` so each CLI receives a byte-identical stable prefix:
claude_request_async({
promptParts: {
system: "<long stable review brief>",
context: "<spec + file dump under review>",
task: "Review for: grammar, extraction, tests, perf, security. End with APPROVED or NOT APPROVED with findings."
},
approvalStrategy: "mcp_managed",
correlationId: "consensus-r1-claude"
})
codex_request_async({
promptParts: { /* same as above */ },
fullAuto: true,
approvalStrategy: "mcp_managed",
correlationId: "consensus-r1-codex"
})
// …same promptParts to gemini / grok / mistral
`prompt` and `promptParts` are mutually exclusive: the runtime returns "provide exactly one of `prompt` or `promptParts`" if both are supplied. The gateway concatenates in canonical order `system → tools → context → task` and hashes the stable prefix into the flight recorder. After the round, you can read `cache-state://prefix/{hash}` to confirm that every reviewer hit the same prefix and to see the CLI × model hit-rate breakdown, which is useful for spotting "Claude cached, Gemini didn't" anomalies in a consensus round.
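The exclusivity check and canonical ordering can be mirrored locally to validate a payload before dispatch. This is a sketch of the described behaviour, not the gateway's implementation: `assemblePrompt` is a hypothetical name, the blank-line separator is an assumption, and rejecting the "neither supplied" case is inferred from the wording of the error message.

```typescript
interface PromptParts {
  system?: string;
  tools?: string;
  context?: string;
  task?: string;
}

interface PromptInput {
  prompt?: string;
  promptParts?: PromptParts;
}

// Local mirror of the documented rules; separator and "neither" handling are assumptions.
function assemblePrompt(input: PromptInput): string {
  const hasPrompt = input.prompt !== undefined;
  const hasParts = input.promptParts !== undefined;
  if (hasPrompt === hasParts) {
    throw new Error("provide exactly one of `prompt` or `promptParts`");
  }
  if (input.prompt !== undefined) return input.prompt;

  const parts = input.promptParts as PromptParts;
  // Canonical order: system → tools → context → task, regardless of key order.
  const ordered = [parts.system, parts.tools, parts.context, parts.task];
  return ordered
    .filter((part): part is string => part !== undefined)
    .join("\n\n");
}
```

Because the order is fixed by the function and not by the caller, two reviewers given the same parts in different key order still produce the same assembled text.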
For re-review rounds (round 2+), keep system + context identical and mutate only task (or append a "previous findings" block to context). Holding the prefix stable across rounds is what makes the consensus loop affordable at scale.
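The two re-review options differ in how much of the prefix survives. The sketch below models both, assuming for illustration that the stable prefix is `system` followed by `context` joined by a blank line; `RoundParts`, `stablePrefix`, `withNewTask`, and `withFindings` are hypothetical names.

```typescript
interface RoundParts {
  system: string;
  context: string;
  task: string;
}

// Assumed prefix layout: system, blank line, context.
function stablePrefix(parts: RoundParts): string {
  return parts.system + "\n\n" + parts.context;
}

// Option 1: change only the task, so the prefix is byte-identical.
function withNewTask(previous: RoundParts, task: string): RoundParts {
  return { system: previous.system, context: previous.context, task: task };
}

// Option 2: append previous findings to context; the earlier prefix
// remains a leading substring of the new one.
function withFindings(
  previous: RoundParts,
  findings: string,
  task: string
): RoundParts {
  return {
    system: previous.system,
    context: previous.context + "\n\nPrevious findings:\n" + findings,
    task: task,
  };
}

const roundOne: RoundParts = {
  system: "<long stable review brief>",
  context: "<spec + file dump under review>",
  task: "Review and end with APPROVED or NOT APPROVED with findings.",
};
const roundTwoSameContext = withNewTask(roundOne, "Re-review after fixes.");
const roundTwoWithFindings = withFindings(
  roundOne,
  "- missing tests",
  "Re-review after fixes."
);
```

Option 1 keeps the whole prefix unchanged; option 2 changes its tail, so only the leading portion stays shared with round 1.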
- Use `correlationId` to group consensus rounds: `"consensus-r1-claude"`, `"consensus-r1-codex"`, `"consensus-r1-gemini"`, `"consensus-r1-grok"`
- Codex: `resumeLatest:true` or `sessionId:<UUID>` (the UUID from `~/.codex/sessions/`); otherwise re-state context inline
- `sessionId` for resumable follow-up rounds
- Results are durable (default 30 days; `LLM_GATEWAY_JOB_RETENTION_DAYS`). If the orchestrator dies mid-round, re-issue the same calls: auto-dedup reattaches to the running jobs and you don't restart the consensus round.

`npx claudepluginhub verivus-oss/llm-cli-gateway --plugin llm-gateway`