Skill

competing-hypotheses

Use when debugging with unclear root cause and multiple plausible explanations that need parallel adversarial testing to converge on the answer

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/team-orchestrator:competing-hypotheses

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

When the root cause is unclear, a single investigator anchors on the first plausible explanation. Multiple investigators testing competing theories — and actively trying to disprove each other — converge faster and more accurately.

SKILL.md

178 lines · ~1.6k tokens

Stats

Language: Shell

Stars: 0

Maintenance: Good

Last Commit: Feb 8, 2026

Competing Hypotheses: Adversarial Debugging Pattern

Overview

Core principle: Adversarial debate eliminates weak hypotheses. The theory that survives structured attack is most likely correct.

Management theory: Psychological Safety (agents MUST challenge each other), Tuckman's Storming (debate is the mechanism, not a problem), Belbin Monitor-Evaluator (devil-advocate role is mandatory).

When to Use

Bug with unclear root cause
Multiple plausible explanations exist
Single investigator would likely anchor on first theory
The bug can be reproduced, but its cause is ambiguous

Don't use when:

Root cause is obvious (just fix it)
Only one plausible explanation
Bug cannot be reproduced
Debugging requires sequential steps (A reveals B reveals C)

Team Composition

coordinator (lead, acts as judge)
├── debugger × 2-5     (each tests one hypothesis)
└── devil-advocate × 1 (challenges all hypotheses)

Belbin coverage:

Thinking: devil-advocate (Monitor-Evaluator)
Action: debuggers (Shaper — drives toward root cause)
People: coordinator (Coordinator — judges evidence)

Sizing by complexity:

Bug Complexity	Debuggers	Notes
2 plausible causes	2	Minimum viable debate
3-4 theories	3-4	Standard case
Systemic/unknown	4-5	Maximum investigation
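
The sizing table can be read as a simple rule: one debugger per hypothesis, between 2 and 5, plus the coordinator and a single devil-advocate. A minimal sketch of that rule follows. The function name `plan_team` and the choice to cap at 5 debuggers (rather than reject larger inputs) are assumptions for illustration, not part of the skill.

```python
def plan_team(num_hypotheses: int) -> dict[str, int]:
    """Return a head count per role for a given number of candidate hypotheses."""
    if num_hypotheses < 2:
        # With a single plausible explanation there is nothing to debate.
        raise ValueError("need at least 2 competing hypotheses")
    # One debugger per hypothesis, capped at 5 (assumed policy for larger lists).
    debuggers = min(num_hypotheses, 5)
    return {"coordinator": 1, "debugger": debuggers, "devil-advocate": 1}


print(plan_team(3))
```

If more than five hypotheses are on the table, the coordinator would have to merge or defer some of them before spawning the team; how to do that is left open here.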

The Process

Phase 1: Forming — Hypothesis Generation

Coordinator:

Describes the bug symptoms clearly
Lists all plausible hypotheses (or asks team to generate them)
Assigns one hypothesis per debugger
Spawns devil-advocate to challenge all of them

Spawn prompt template for debugger:

You are testing the hypothesis: [HYPOTHESIS]

Bug symptoms: [SYMPTOMS]
Reproduction steps: [STEPS]

Your job:
1. Find evidence FOR your hypothesis (prove it)
2. Find evidence AGAINST your hypothesis (disprove it)
3. Be honest — if your hypothesis is wrong, say so
4. Report: evidence for, evidence against, confidence (0-100%)

You MUST report disconfirming evidence. Hiding evidence that
disproves your hypothesis is a critical failure.
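
One way a coordinator script might fill the bracketed placeholders in the template above is sketched here. The helper `fill`, the constant names, and the sample reproduction steps are hypothetical, and the template is shortened to its first lines for brevity.

```python
import re

# Shortened copy of the debugger template; the full prompt continues past this.
DEBUGGER_PROMPT = """\
You are testing the hypothesis: [HYPOTHESIS]

Bug symptoms: [SYMPTOMS]
Reproduction steps: [STEPS]
"""

PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill(template: str, **fields: str) -> str:
    """Substitute every [NAME] placeholder, refusing partial or surplus input."""
    values = {name.upper(): value for name, value in fields.items()}
    needed = set(PLACEHOLDER.findall(template))
    if needed != set(values):
        raise ValueError(f"placeholders {sorted(needed)} != fields {sorted(values)}")
    # Single pass, so brackets inside a substituted value are never re-expanded.
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


prompt = fill(
    DEBUGGER_PROMPT,
    hypothesis="Unhandled promise rejection causing exit",
    symptoms="App exits after one message",
    steps="Start the app, send one message",  # illustrative value only
)
print(prompt)
```

Failing on a missing field keeps a debugger from being spawned with a literal `[STEPS]` in its instructions.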

Devil-advocate spawn prompt:

You will review each debugger's findings.

For EACH hypothesis:
1. What evidence would definitively prove it? Did they find it?
2. What evidence would definitively disprove it? Did they look?
3. Are there alternative explanations for their evidence?
4. What tests would distinguish this hypothesis from others?

Your success metric is finding flaws. Approving a weak hypothesis
without challenge is a failure.

Phase 2: Storming — Parallel Investigation

Each debugger independently:

Investigates their assigned hypothesis
Collects evidence (code, logs, test results)
Reports both confirming AND disconfirming evidence
Assigns confidence score

Critical: Debuggers must report disconfirming evidence. The prompt explicitly requires this to prevent confirmation bias.
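
The report each debugger returns could be given a fixed shape so the coordinator can reject one-sided reports mechanically. The sketch below is an assumption about how that might look: `DebuggerReport`, `problems`, and especially the `disproof_attempts` field are not defined by the skill.

```python
from dataclasses import dataclass, field


@dataclass
class DebuggerReport:
    hypothesis: str
    confidence: int  # 0-100, as requested in the spawn prompt
    evidence_for: list[str] = field(default_factory=list)
    evidence_against: list[str] = field(default_factory=list)
    # Assumed field: what the debugger tried in order to disprove the hypothesis.
    disproof_attempts: list[str] = field(default_factory=list)


def problems(report: DebuggerReport) -> list[str]:
    """Return the reasons a coordinator should send the report back."""
    found = []
    if not 0 <= report.confidence <= 100:
        found.append("confidence must be between 0 and 100")
    if not report.evidence_against and not report.disproof_attempts:
        found.append("no disconfirming evidence and no record of looking for any")
    return found


report = DebuggerReport(
    hypothesis="Unhandled promise rejection causing exit",
    confidence=80,
    evidence_for=["unhandled rejection found in auth middleware"],
    disproof_attempts=["checked whether the rejection fires on every message"],
)
print(problems(report))
```

A report with no evidence against is accepted only if it lists what was tried, since an honest search can come up empty.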

Phase 3: Norming — Adversarial Debate

For each hypothesis:
  1. Debugger presents: evidence for + against + confidence
  2. Devil-advocate challenges: gaps, alternative explanations
  3. Other debuggers challenge: "my evidence contradicts yours because..."
  4. Coordinator scores: STRONG / WEAK / DISPROVEN

Elimination round:
  - DISPROVEN hypotheses are discarded
  - WEAK hypotheses get one more investigation round
  - STRONG hypotheses proceed to verification

The debate IS the value. Sequential investigation suffers from anchoring. Parallel adversarial investigation eliminates weak theories faster.
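
The elimination round above amounts to sorting hypotheses into three buckets by the coordinator's verdict. A minimal sketch, with hypothetical names (`Verdict`, `elimination_round`), is shown below. Discarding a hypothesis that is still WEAK after its extra round is an assumption, since the text only grants "one more investigation round".

```python
from enum import Enum


class Verdict(Enum):
    STRONG = "STRONG"
    WEAK = "WEAK"
    DISPROVEN = "DISPROVEN"


def elimination_round(verdicts, already_retried=frozenset()):
    """Sort hypotheses into next steps based on the coordinator's verdicts."""
    outcome = {"verify": [], "reinvestigate": [], "discard": []}
    for hypothesis, verdict in verdicts.items():
        if verdict is Verdict.STRONG:
            outcome["verify"].append(hypothesis)
        elif verdict is Verdict.WEAK and hypothesis not in already_retried:
            # WEAK hypotheses get exactly one more investigation round.
            outcome["reinvestigate"].append(hypothesis)
        else:
            # DISPROVEN, or still WEAK after the extra round (assumed policy).
            outcome["discard"].append(hypothesis)
    return outcome


first = elimination_round(
    {"H1": Verdict.DISPROVEN, "H3": Verdict.STRONG, "H5": Verdict.WEAK}
)
print(first)
```

Passing the names of hypotheses that already had their extra round as `already_retried` keeps a WEAK theory from cycling indefinitely.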

Phase 4: Performing — Verification

For the surviving hypothesis:

Debugger implements the fix
Verify: does the fix resolve the symptoms?
Regression: does the fix break anything else?
Devil-advocate: "Is this really the root cause, or are we masking the symptom?"

Phase 5: Adjourning — Record & Reflect

Document: the winning hypothesis, eliminated hypotheses, and why
Record to .claude.md: what theories were wrong and why (prevents future anchoring)
Trigger team-orchestrator:session-reflection
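
The record itself might be a short text entry listing the winner and each eliminated theory with the reason it was ruled out. The layout below and the function name `format_record` are assumptions; the skill does not prescribe a format, and writing the entry to a file is left to the caller.

```python
def format_record(bug: str, winner: str, eliminated: dict[str, str]) -> str:
    """Build a plain-text entry naming the root cause and each rejected theory."""
    lines = [f"Bug: {bug}", f"Root cause: {winner}", "Eliminated hypotheses:"]
    for hypothesis, reason in eliminated.items():
        lines.append(f"- {hypothesis}: {reason}")
    return "\n".join(lines) + "\n"


entry = format_record(
    bug="App exits after one message",
    winner="Token initialization order",
    eliminated={
        "WebSocket connection closing prematurely": "WebSocket stays open",
        "Session timeout misconfigured": "Timeout is 30min, app exits in 1s",
    },
)
print(entry)
```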

Example: App Exits After One Message

Coordinator identifies 5 hypotheses:
  H1: WebSocket connection closing prematurely
  H2: Event loop draining with no listeners
  H3: Unhandled promise rejection causing exit
  H4: Session timeout misconfigured
  H5: Message handler throwing uncaught error

Team investigates in parallel:

  Debugger 1 (H1): "WebSocket stays open — DISPROVEN"
  Debugger 2 (H2): "Event loop has active listeners — DISPROVEN"
  Debugger 3 (H3): "Found unhandled rejection in auth middleware — STRONG (80%)"
  Debugger 4 (H4): "Timeout is 30min, app exits in 1s — DISPROVEN"
  Debugger 5 (H5): "Message handler has try-catch — WEAK (30%)"

  Devil-advocate: "H3 is strong but — does the rejection happen on EVERY
    message or just the first? If first-only, the auth token refresh
    might be the real cause, not the handler."

  → Follow-up reveals: auth token refresh throws on first use because
    token is not yet set. H3 was close but the real root cause is
    token initialization order.

Without adversarial debate, the team would have patched the rejection
handler without fixing the token initialization order.

Common Mistakes

Mistake	Fix
Debugger hides disconfirming evidence	Prompt explicitly requires both-sides reporting
All hypotheses are variations of one idea	Ensure hypotheses are truly independent
Skipping debate — just picking highest confidence	Debate reveals flaws that confidence scores don't
Devil-advocate too soft	Prompt: "approving a weak hypothesis is a failure"
Not recording eliminated hypotheses	They prevent future anchoring — record them
Fixing symptom, not root cause	Devil-advocate's final question prevents this

Integration

Pre-requisite: team-orchestrator:orchestrating-work routes here
Post-requisite: team-orchestrator:session-reflection records learnings
Related: superpowers:systematic-debugging for single-agent debugging
