Skill

multi-llm-consensus

Runs a task through multiple LLMs (Claude, Codex, Gemini, Grok, Mistral) independently, requiring unanimous agreement before proceeding. Use for high-stakes generation, conflict resolution, or final quality gates.

developer-tools

automation

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/llm-gateway:multi-llm-consensus

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

When correctness matters more than speed, send the same task to Claude, Codex, Gemini, and (optionally) Grok and Mistral Vibe independently, then compare results. All agents must agree before proceeding. Adding Grok gives an independent fourth model from a different vendor (xAI), and Mistral Vibe gives a fifth — useful when consensus needs diversity to defend against shared-blind-spot failures ...

SKILL.md

205 lines · ~2.6k tokens

Stats

Language: TypeScript

Stars: 9

Maintenance: Excellent

Last Commit: Jun 18, 2026

Multi-LLM Consensus

Mistral note: the gateway always emits --agent <mode> (default auto-approve for programmatic callers); set permissionMode explicitly when needed. Continuity-bearing consensus loops also need [session_logging] enabled = true in ~/.vibe/config.toml.

Dispatch Defaults

Apply these on every dispatch unless the caller has explicitly overridden a rule in the current turn:

Omit model — let the gateway use its configured default per CLI. Nominating a model risks deprecated IDs (o3, o3-pro, gpt-4o, …) and capability mismatches.
approvalStrategy:"mcp_managed" is the skill dispatch default (the gateway schema default is "legacy"). For Codex, also pass fullAuto:true when it needs file/shell access.
No wallclock timeout; poll every 60 s — idleTimeoutMs is a separate no-output safeguard.
Iterate until unconditional APPROVED (review dispatches only) — every review prompt must end with "End with APPROVED or NOT APPROVED with findings." Consensus requires all reviewers to return unconditional APPROVED; any NOT APPROVED or conditional approval from any reviewer triggers the fix-and-re-review loop to all reviewers. Escalate after 3 rounds. This rule does not apply to pure implementation or non-review analysis dispatches.
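
The review loop described above can be sketched roughly as follows. This is a minimal sketch, not the gateway's implementation: `parseVerdict`, `runReviewLoop`, and the `ReviewRound` callback are hypothetical names, and treating only a final line that reads exactly APPROVED as unconditional approval is an assumption made here to keep the check conservative.

```typescript
// Hypothetical helpers; none of these names come from the gateway itself.
type Verdict = "APPROVED" | "NOT_APPROVED";

// Assumption: a reply is an unconditional approval only if its last
// non-empty line is exactly "APPROVED". Anything else, including
// "APPROVED if ...", is treated as NOT APPROVED.
function parseVerdict(reply: string): Verdict {
  const lines = reply
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const last = lines.length > 0 ? lines[lines.length - 1] : "";
  return last === "APPROVED" ? "APPROVED" : "NOT_APPROVED";
}

// One round = dispatch to every reviewer and collect their replies.
type ReviewRound = (round: number) => Promise<string[]>;

// Repeats rounds until every reviewer approves; escalates after maxRounds.
async function runReviewLoop(
  reviewRound: ReviewRound,
  maxRounds = 3,
): Promise<{ outcome: "approved" | "escalate"; rounds: number }> {
  for (let round = 1; round <= maxRounds; round++) {
    const replies = await reviewRound(round);
    const unanimous =
      replies.length > 0 && replies.every((r) => parseVerdict(r) === "APPROVED");
    if (unanimous) {
      return { outcome: "approved", rounds: round };
    }
    // Otherwise the caller applies fixes, and the next round goes to all reviewers.
  }
  return { outcome: "escalate", rounds: maxRounds };
}
```

In practice the `reviewRound` callback would wrap the async dispatch and polling calls; it is left abstract here so the loop logic stays visible.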

When to Use

Generating code that will run in production without human review
Resolving ambiguous specification decisions
Final quality gates requiring unanimous sign-off
Any task where a single LLM's blind spot could cause harm

Consensus Patterns

Pattern 1: Independent Generation + Comparison

Each LLM generates a solution independently. Compare outputs structurally.

claude_request_async({
  prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
  approvalStrategy: "mcp_managed",
  correlationId: "gen-claude"
})
codex_request_async({
  prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
  fullAuto: true,
  approvalStrategy: "mcp_managed",
  correlationId: "gen-codex"
})
gemini_request_async({
  prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
  approvalStrategy: "mcp_managed",
  correlationId: "gen-gemini"
})
grok_request_async({
  prompt: "Given this specification:\n[spec]\n\nGenerate the implementation. Output only code.",
  approvalStrategy: "mcp_managed",
  correlationId: "gen-grok"
})

Poll all four. When complete, compare:

If all four agree → very high confidence, proceed
If three agree, one differs → investigate the outlier (often a real edge case the majority missed)
If split 2-2 → escalate; the spec is ambiguous, clarify before proceeding
If all four differ → spec is fundamentally underspecified
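
The decision rules above can be expressed as a small classifier. This is a sketch under assumptions: it presumes each model's output has already been reduced to an equivalence label (for example, a key derived from its behavior on shared tests), and `classifyAgreement` and the `Agreement` type are hypothetical names.

```typescript
// Hypothetical sketch: one label per model; equal labels mean "same answer".
type Agreement = "unanimous" | "outlier" | "split" | "all-differ" | "other";

function classifyAgreement(labels: string[]): Agreement {
  if (labels.length === 0) {
    throw new Error("at least one result is required");
  }
  const counts = new Map<string, number>();
  for (const label of labels) {
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }
  // Group sizes, largest first, e.g. [3, 1] or [2, 2].
  const sizes = Array.from(counts.values()).sort((a, b) => b - a);
  const total = labels.length;

  if (sizes.length === 1) return "unanimous"; // proceed
  if (sizes.length === total) return "all-differ"; // spec underspecified
  if (sizes[0] === total - 1) return "outlier"; // investigate the one that differs
  if (sizes.length === 2 && sizes[0] === sizes[1]) return "split"; // escalate, clarify the spec
  return "other"; // e.g. 2-1-1, which the rules above do not cover
}
```

The "other" case is a design choice of this sketch; the rules above do not say what to do with a 2-1-1 result, so it is surfaced rather than guessed.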

Pattern 2: Unanimous Approval Gate

All reviewers must approve the same artifact. Used for final review gates.

claude_request_async({
  prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
  approvalStrategy: "mcp_managed",
  correlationId: "review-claude"
})
codex_request_async({
  prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
  fullAuto: true,
  approvalStrategy: "mcp_managed",
  correlationId: "review-codex"
})
gemini_request_async({
  prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
  approvalStrategy: "mcp_managed",
  correlationId: "review-gemini"
})
grok_request_async({
  prompt: "Review [path] for: grammar correctness, extraction completeness, test coverage, performance, security. End with APPROVED or NOT APPROVED with findings.",
  approvalStrategy: "mcp_managed",
  correlationId: "review-grok"
})

Verdict rules:

All reviewers APPROVED → gate passes
Any NOT APPROVED → fix issues, re-submit to all reviewers (not just the one that rejected)
Any unable to review → provide evidence, re-submit to that reviewer
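
A minimal sketch of the verdict rules above, assuming each reviewer's reply has already been reduced to one of three states. The names `decideGate`, `ReviewerVerdict`, and `GateAction` are hypothetical, and giving a rejection priority over an "unable to review" reply is an assumption of this sketch.

```typescript
type ReviewerVerdict = "APPROVED" | "NOT_APPROVED" | "UNABLE_TO_REVIEW";

type GateAction =
  | { action: "pass" }
  | { action: "fix-and-resubmit"; to: string[] }
  | { action: "provide-evidence"; to: string[] };

function decideGate(verdicts: Record<string, ReviewerVerdict>): GateAction {
  const reviewers = Object.keys(verdicts);
  if (reviewers.length === 0) {
    throw new Error("no reviewers: the gate cannot pass vacuously");
  }
  // Any rejection: fix the issues, then every reviewer re-reviews.
  if (reviewers.some((r) => verdicts[r] === "NOT_APPROVED")) {
    return { action: "fix-and-resubmit", to: reviewers };
  }
  // No rejection, but some reviewer could not review: only that one is re-asked.
  const unable = reviewers.filter((r) => verdicts[r] === "UNABLE_TO_REVIEW");
  if (unable.length > 0) {
    return { action: "provide-evidence", to: unable };
  }
  return { action: "pass" };
}
```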

Pattern 3: Conflict Resolution

Multiple valid approaches exist. Each LLM proposes independently, then a designated LLM synthesizes.

All three propose solutions (async, parallel)
Collect all proposals
Send all proposals to Claude for synthesis:

claude_request({
  prompt: "Three LLMs proposed solutions for [problem]:\n\nClaude's proposal: [proposal]\nCodex's proposal: [proposal]\nGemini's proposal: [proposal]\n\nSynthesize the best approach, explaining why. If they agree, confirm. If they conflict, choose the strongest with justification.",
  approvalStrategy: "mcp_managed"
})
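
The synthesis prompt above can be assembled from the collected proposals rather than written by hand. The helper below is a hypothetical sketch (`buildSynthesisPrompt` is not part of the gateway); it reuses the wording of the prompt shown above and only fills in the proposer names and texts.

```typescript
// Hypothetical helper: keys are proposer names, values are their proposals.
function buildSynthesisPrompt(
  problem: string,
  proposals: Record<string, string>,
): string {
  const proposers = Object.keys(proposals);
  if (proposers.length < 2) {
    throw new Error("synthesis needs at least two proposals");
  }
  const body = proposers
    .map((who) => `${who}'s proposal: ${proposals[who]}`)
    .join("\n");
  return (
    `${proposers.length} LLMs proposed solutions for ${problem}:\n\n` +
    `${body}\n\n` +
    "Synthesize the best approach, explaining why. If they agree, confirm. " +
    "If they conflict, choose the strongest with justification."
  );
}
```

The returned string would be passed as the `prompt` of the synchronous `claude_request` call.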

Execution Flow

Parallel Dispatch

Always use async tools for consensus — you need all results before deciding:

// Fire all four (each with approvalStrategy:"mcp_managed"; Codex also fullAuto:true)
job1 = claude_request_async({...})
job2 = codex_request_async({...})
job3 = gemini_request_async({...})
job4 = grok_request_async({...})

// Poll every 60 seconds (no wallclock timeout; cancel only on explicit instruction or hard failure)
llm_job_status({jobId: job1.job.id})
llm_job_status({jobId: job2.job.id})
llm_job_status({jobId: job3.job.id})
llm_job_status({jobId: job4.job.id})

// Collect results when all complete
result1 = llm_job_result({jobId: job1.job.id})
result2 = llm_job_result({jobId: job2.job.id})
result3 = llm_job_result({jobId: job3.job.id})
result4 = llm_job_result({jobId: job4.job.id})

Results are durable (default 30 days). If your polling wrapper dies, re-issue the same *_request_async calls — auto-dedup snaps each new call back onto the live job. Use forceRefresh:true only if you've genuinely changed the inputs.
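
For an orchestrator written in code, the polling step might look like the sketch below. The shapes are assumptions: `JobState`, `StatusFn`, and `waitForAll` are hypothetical, and the real `llm_job_status` response may carry different fields. The status lookup and the sleep function are injected so the loop can be exercised without a live gateway.

```typescript
// Hypothetical shapes; the real llm_job_status response may differ.
type JobState = "running" | "completed" | "failed";
type StatusFn = (jobId: string) => Promise<JobState>;
type SleepFn = (ms: number) => Promise<void>;

const defaultSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until no job is still running. There is deliberately no wallclock
// timeout; failed jobs are reported to the caller rather than retried.
async function waitForAll(
  jobIds: string[],
  getStatus: StatusFn,
  intervalMs = 60000,
  sleep: SleepFn = defaultSleep,
): Promise<Record<string, JobState>> {
  const states: Record<string, JobState> = {};
  for (;;) {
    for (const id of jobIds) {
      // Jobs that already finished are not polled again.
      if (states[id] !== "completed" && states[id] !== "failed") {
        states[id] = await getStatus(id);
      }
    }
    const pending = jobIds.filter((id) => states[id] === "running");
    if (pending.length === 0) return states;
    await sleep(intervalMs);
  }
}
```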

Comparison

Compare results structurally, not textually. Two implementations may look different but be functionally equivalent.

For code generation:

Same algorithm/approach → agreement
Different approach, same output → agreement (note the alternatives)
Different output → disagreement (investigate)
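
One way to make "same output" concrete is to run each candidate on shared probe inputs and compare the results. The sketch below is an assumption about how that could be done, not something the skill prescribes; `behaviorLabels` is a hypothetical name, and `JSON.stringify` is used as a simple equality key, which presumes JSON-serializable outputs with a stable key order.

```typescript
// Hypothetical sketch: compare implementations by behavior, not by source text.
type Impl<I, O> = (input: I) => O;

// Returns one label per implementation. Two implementations get the same
// label exactly when they produce equal serialized output on every probe.
function behaviorLabels<I, O>(impls: Impl<I, O>[], probes: I[]): string[] {
  return impls.map((impl) => JSON.stringify(probes.map((p) => impl(p))));
}

// Example candidates: two different approaches and one with a bug.
const sumWithLoop: Impl<number[], number> = (xs) => {
  let total = 0;
  for (const x of xs) total += x;
  return total;
};
const sumWithReduce: Impl<number[], number> = (xs) =>
  xs.reduce((acc, x) => acc + x, 0);
const sumSkippingFirst: Impl<number[], number> = (xs) =>
  xs.slice(1).reduce((acc, x) => acc + x, 0);

const sumLabels = behaviorLabels(
  [sumWithLoop, sumWithReduce, sumSkippingFirst],
  [[], [1], [1, 2, 3]],
);
// sumLabels[0] equals sumLabels[1]; sumLabels[2] differs.
```

Agreement established this way is only as strong as the probe set, so edge-case inputs matter more than volume.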

For reviews:

All APPROVED → pass
Same issues found → agreement on problems
Different issues found → union of all issues (review each)
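
Taking the union of issues can be sketched as below, assuming reviewer output has already been parsed into structured findings. The `Finding` shape and `unionOfFindings` are hypothetical, and matching findings by file plus normalized issue text is a simplification; real reviewers rarely phrase the same problem identically.

```typescript
// Hypothetical shape; real reviewer output is free text and needs parsing first.
interface Finding {
  file: string;
  issue: string;
}

interface MergedFinding {
  finding: Finding;
  reportedBy: string[];
}

function unionOfFindings(
  byReviewer: Record<string, Finding[]>,
): MergedFinding[] {
  const index: Record<string, MergedFinding> = {};
  const merged: MergedFinding[] = [];
  for (const reviewer of Object.keys(byReviewer)) {
    for (const finding of byReviewer[reviewer] ?? []) {
      const key = `${finding.file}::${finding.issue.trim().toLowerCase()}`;
      let entry = index[key];
      if (entry === undefined) {
        entry = { finding, reportedBy: [] };
        index[key] = entry;
        merged.push(entry); // preserves first-seen order
      }
      if (entry.reportedBy.indexOf(reviewer) === -1) {
        entry.reportedBy.push(reviewer);
      }
    }
  }
  return merged;
}
```

Findings reported by several reviewers are agreement on a problem; findings reported by one reviewer still need individual review.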

LLM Strengths in Consensus

LLM	Generation Strength	Review Strength
Claude	Architecture, patterns, documentation	Design quality, maintainability
Codex	Implementation correctness, algorithms	Logic bugs, edge cases, test gaps
Gemini	Security-aware generation, edge cases	Security audit, OWASP, crash cases
Grok (xAI)	Independent perspective from a different vendor family	Tie-breaker / diversity reviewer when the other three converge on a blind spot

Model Selection

Dispatch default: omit model on every call. The gateway's configured default per CLI is the right choice in the vast majority of cases. Only nominate a model when the caller explicitly named a specific variant in the current turn.

Avoid stale hardcoded model IDs such as o3, o3-pro, and gpt-4o; omit model or call list_models instead.
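
The dispatch defaults can be centralized in one builder so that `model` is never set by accident. This is a sketch under assumptions: `buildDispatchArgs` and the `DispatchArgs` interface are hypothetical, and only the field names (`prompt`, `approvalStrategy`, `correlationId`, `fullAuto`, `model`) are taken from the examples in this document.

```typescript
type Cli = "claude" | "codex" | "gemini" | "grok" | "mistral";

// Hypothetical argument shape, limited to fields used in this document.
interface DispatchArgs {
  prompt: string;
  approvalStrategy: "mcp_managed" | "legacy";
  correlationId: string;
  fullAuto?: boolean;
  model?: string;
}

function buildDispatchArgs(
  cli: Cli,
  prompt: string,
  correlationId: string,
  callerModel?: string,
): DispatchArgs {
  const args: DispatchArgs = {
    prompt,
    approvalStrategy: "mcp_managed", // skill default; the schema default is "legacy"
    correlationId,
  };
  // Codex needs fullAuto for file/shell access; the examples here always set it.
  if (cli === "codex") args.fullAuto = true;
  // model stays absent unless the caller explicitly named a variant this turn.
  if (callerModel !== undefined) args.model = callerModel;
  return args;
}
```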

Prefix Sharing with `promptParts`

Consensus dispatch is the textbook use case for the structured promptParts field. Every reviewer sees the same spec / context block; only the task (or even nothing) varies. Switch from prompt to promptParts so that each CLI receives a byte-identical stable prefix:

claude_request_async({
  promptParts: {
    system:  "<long stable review brief>",
    context: "<spec + file dump under review>",
    task:    "Review for: grammar, extraction, tests, perf, security. End with APPROVED or NOT APPROVED with findings."
  },
  approvalStrategy: "mcp_managed",
  correlationId: "consensus-r1-claude"
})
codex_request_async({
  promptParts: { /* same as above */ },
  fullAuto: true,
  approvalStrategy: "mcp_managed",
  correlationId: "consensus-r1-codex"
})
// …same promptParts to gemini / grok / mistral

prompt and promptParts are mutually exclusive: the runtime returns "provide exactly one of `prompt` or `promptParts`" if both are supplied. The gateway concatenates in the canonical order system → tools → context → task and hashes the stable prefix into the flight recorder. After the round, you can read cache-state://prefix/{hash} to confirm that every reviewer hit the same prefix and to see the CLI × model hit-rate breakdown, which is useful for spotting "Claude cached, Gemini didn't" anomalies in a consensus round.

For re-review rounds (round 2+), keep system + context identical and mutate only task (or append a "previous findings" block to context). Holding the prefix stable across rounds is what makes the consensus loop affordable at scale.
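
To see why holding system and context fixed keeps the prefix stable, consider the sketch below. It is illustrative only: the gateway's actual separator, hash function, and definition of the stable prefix are not stated here, so the blank-line separator, the FNV-1a-style hash, and the choice to exclude task from the prefix are all assumptions, as are the names `stablePrefix` and `prefixHash`.

```typescript
interface PromptParts {
  system?: string;
  tools?: string;
  context?: string;
  task?: string;
}

// Assumption: the stable prefix is every part except task, in canonical
// order (system, tools, context), joined by a blank line.
function stablePrefix(parts: PromptParts): string {
  return [parts.system, parts.tools, parts.context]
    .filter((p): p is string => p !== undefined)
    .join("\n\n");
}

// FNV-1a-style 32-bit hash over UTF-16 code units; a stand-in for whatever
// hash the gateway really uses.
function prefixHash(parts: PromptParts): string {
  const text = stablePrefix(parts);
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

const roundOne: PromptParts = {
  system: "review brief",
  context: "spec v1",
  task: "Review for security.",
};
// Round 2 mutates only the task, so the prefix and its hash are unchanged.
const roundTwo: PromptParts = { ...roundOne, task: "Re-review after fixes." };
// Editing the context changes the prefix and therefore the hash.
const editedContext: PromptParts = { ...roundOne, context: "spec v2" };
```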

Tips

Use correlationId to group consensus rounds: "consensus-r1-claude", "consensus-r1-codex", "consensus-r1-gemini", "consensus-r1-grok"
For Codex: real continuity is available via resumeLatest:true or sessionId:<UUID> (the UUID from ~/.codex/sessions/); otherwise re-state context inline
For Gemini and Grok reviews: pass sessionId for resumable follow-up rounds
If one LLM is unavailable, proceed with the rest but note the gap
Consensus is expensive (3–4x tokens, more with a fifth reviewer). Use it for high-stakes decisions, not routine tasks.
When re-submitting after fixes, re-submit to ALL reviewers (not just the one that rejected)
Durable results: deferred consensus jobs persist (default 30 days, LLM_GATEWAY_JOB_RETENTION_DAYS). If the orchestrator dies mid-round, re-issue the same calls — auto-dedup reattaches to the running jobs and you don't restart the consensus round.
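
The correlationId naming tip can be captured in a pair of small helpers. Both functions are hypothetical conveniences; only the "consensus-r1-claude" pattern comes from the tips above.

```typescript
// Builds IDs in the "consensus-r<round>-<cli>" pattern used in the tips above.
function correlationIdFor(round: number, cli: string): string {
  return `consensus-r${round}-${cli}`;
}

// Recovers the round number, or null if the ID does not follow the pattern.
function roundOf(correlationId: string): number | null {
  const match = /^consensus-r(\d+)-/.exec(correlationId);
  return match ? Number(match[1]) : null;
}
```

Grouping by round number makes it straightforward to check that a re-review round went to every reviewer and not just the one that rejected.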
