From cortex-core
Systematic 4-phase debugging for skills, hooks, lifecycle, and overnight runner issues. Use when: "debug this", "fix this bug", "why is this failing", "investigate this error", "make this test pass", "why isn't this triggering", "skill not working", "hook not running", "lifecycle bug", "overnight runner stall", "diagnose this", "diagnose", or any unexpected behavior in the agentic layer components. Finds root cause, fixes, and verifies with a structured loop.
How this skill is triggered — by the user, by Claude, or both
Slash command
/cortex-core:diagnoseThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
ALWAYS find root cause before attempting fixes. No fixes without completing Phase 1.
ALWAYS find root cause before attempting fixes. No fixes without completing Phase 1.
BEFORE attempting ANY fix:
Read Error Messages Carefully
Reproduce Consistently
Check Recent Changes
Gather Evidence at Component Boundaries
When the system has multiple components (e.g., overnight runner → task agent → skill → hook):
Add diagnostic instrumentation before proposing fixes:
For EACH component boundary:
- Log what input enters the component (echo to stderr, set -x in scripts)
- Log what output exits the component
- Verify file paths, permissions, env variables at each layer
Run once to gather evidence showing WHERE it breaks.
THEN analyze evidence to identify the failing component.
THEN investigate that specific component.
Common boundaries to check:
Trace Backward to Root Cause
See Backward Tracing technique below.
Quick version: where does the bad value or wrong behavior originate? Trace backward through callers until you find the source. Fix at the source, not the symptom.
Optional: Competing-Hypotheses Team (Phase 1 Early Trigger)
If root cause is genuinely unclear and 2+ distinct plausible theories have already emerged from the error output and initial investigation, consider spawning a competing-hypotheses team to investigate in parallel rather than testing theories sequentially in Phase 3.
Skip this offer entirely when running autonomously (overnight/no human available).
Availability check:
Run: printenv CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS
1: Agent Teams is available, proceed with the offerIf available and 2+ theories exist, present an explicit offer:
"Multiple competing theories are present. Spawn a competing-hypotheses team now to investigate in parallel, or continue with sequential hypothesis testing?"
Wait for confirmation before spawning. If declined, continue to Phase 2 normally.
If confirmed — spawn the team (3–5 teammates):
Choose team size based on the number of distinct plausible theories (minimum 3). Each teammate receives:
Note: there is no fix attempt history at this stage — teammates work from error output and initial investigation only.
Convergence check: After the team completes, review each teammate's structured conclusion (root cause assertion, supporting evidence, rebuttal of competing theories).
After this phase completes, write or update the debug session artifact. See Debug Session Artifact section.
Find the pattern before fixing:
Find Working Examples
Compare Against the Reference
Identify Differences
Understand Dependencies
After this phase completes, write or update the debug session artifact. See Debug Session Artifact section.
Scientific method:
Form a Single Hypothesis
sandbox.filesystem.allowWrite uses ~/Workspaces/myrepo/lifecycle/sessions/
but the sandbox allowlist expects an absolute path (e.g.
/Users/me/Workspaces/myrepo/lifecycle/sessions/)"Test Minimally
Verify Before Continuing
When You Don't Know
After this phase completes, write or update the debug session artifact. See Debug Session Artifact section.
Fix the root cause, not the symptom:
Confirm the Root Cause
Implement a Single Fix
Verify the Fix
If the Fix Doesn't Work
If 3+ Fixes Failed: Team Investigation Before Escalation
Step 1 — Agent Teams availability check:
Run: printenv CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS
1: Agent Teams is available — proceed to Step 2Step 2 — Spawn 3–5 teammates. Choose team size based on the number of distinct plausible theories at this point (minimum 3). If fewer than 3 distinct theories can be identified, assign the third teammate to "investigate novel angles not covered by the other theories."
Step 3 — Provide teammate context. Each teammate receives:
Step 4 — Enforce structured output. Each teammate must produce:
Format example:
Root cause: [assertion] / Evidence: [supporting detail] / Rebuttal: [strongest objection to this hypothesis]
Enforce via TeammateIdle hook (exit code 2 sends feedback, keeps the teammate working)
or by the lead sending direct messages challenging shallow or incomplete outputs.
Step 5 — Convergence check. After the team completes, review each teammate's structured conclusion:
Step 6 — On convergence: attempt one more targeted fix using the surviving theory. This is a fresh attempt, not counted toward the original 3-attempt limit.
Step 7 — On non-convergence: proceed directly to Architecture Discussion below, including a summary of the competing theories and evidence gathered by the team.
Note on overnight/autonomous contexts: If running autonomously (no human available): skip team investigation and fail the current task directly. The overnight runner's failure gate will surface it to morning review.
Architecture Discussion (escalation destination for non-convergence or post-team fix failure)
Patterns indicating an architectural problem:
STOP and question fundamentals:
Discuss with the user before attempting more fixes.
After this phase completes, write or update the debug session artifact. See Debug Session Artifact section.
Determine where to write the artifact using this priority:
/cortex-core:diagnose my-feature): write to
lifecycle/{feature}/debug-session.md if the directory exists. If not, warn verbally
and fall back to step 3.lifecycle/*/ for a .session file whose content matches
$LIFECYCLE_SESSION_ID. If found, write to lifecycle/{feature}/debug-session.md.debug/{date}-{slug}.md where {date} is ISO date (YYYY-MM-DD)
and {slug} is a short kebab-case description of what is being debugged (use diagnose
if no slug is available). Create the debug/ directory if absent.Note:
$LIFECYCLE_SESSION_IDpropagation into overnight sub-agent sessions is unverified. In autonomous/overnight context, pass the feature name explicitly (e.g.,/cortex-core:diagnose my-feature) for reliable lifecycle-coupled artifact placement.
# Debug Session: {context}
Date: YYYY-MM-DD
Status: In progress | Resolved | Escalated — investigation incomplete
## Phase N Findings
- **Observed behavior**: ...
- **Evidence gathered**: ...
- **Tests performed**: ...
- **Outcomes**: ...
- **Dead-ends**: ... (call out explicitly)
## Current State
Root cause identified: X. Fix applied: Y.
— or —
Best current theory: X. Not yet tried: Y.
## Prior Attempts
(Move prior content here if the file previously existed; current investigation stays on top.)
In progress.In progress.Resolved.Escalated — investigation incomplete before failing the task. This write is
mandatory — do not exit without it.Debug skill escalation and lifecycle escalation are different mechanisms covering different concerns:
| Debug escalation | Lifecycle escalation | |
|---|---|---|
| When | During implementation, after 3 failed fix attempts | At phase transitions (Research→Spec, Spec→Plan) |
| Signal | A bug resists fixes and shows architectural patterns | Feature scope/complexity exceeds original estimate |
| Action | Team investigation first (§5); architecture discussion if team doesn't converge | User is prompted to escalate to Complex tier |
| Phase | Implement (post-build debugging) | Research/Specify (pre-build design) |
Invoking /cortex-core:diagnose when a lifecycle task fails is a structured pre-retry step — it does not
replace or short-circuit the lifecycle's own phase gates.
If you catch yourself thinking any of these, stop and return to Phase 1:
If 3+ fixes failed: team investigation first (Phase 4 §5), then architecture discussion if the team doesn't converge.
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Phase 1 is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is faster than guess-and-check. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = run team investigation (Phase 4 §5) if Agent Teams available, then architecture discussion. Do not add a 4th fix without completing §5. |
Bugs often manifest far from their source. Your instinct is to fix where the error appears — that treats a symptom.
Core principle: Trace backward through the execution path until you find the original trigger, then fix at the source.
The tracing process:
git commit exits non-zero; GNUPGHOME not set)Adding diagnostic instrumentation when you can't trace manually:
# In a hook script: trace execution
set -x # prints every command as it runs
# At a key decision point: log the state
echo "DEBUG: GNUPGHOME=$GNUPGHOME, socket exists=$(test -S $GNUPGHOME/S.gpg-agent && echo yes || echo no)" >&2
# For skill invocation: add an explicit trace to the description trigger
# (temporarily add the invocation phrase you're testing to verify it triggers)
Never fix just where the error appears. Trace back to find the original trigger.
When you fix a bug caused by bad state, adding validation at one place feels sufficient — but that check can be bypassed by different code paths or edge cases.
Core principle: Validate at every layer data passes through. Make the bug structurally impossible.
The four layers:
Entry point: reject invalid input at the boundary
# In a hook script: verify required env vars are set before using them
if [ -z "$GNUPGHOME" ]; then
echo "ERROR: GNUPGHOME not set" >&2; exit 1
fi
Business logic: ensure data makes sense for this operation
# In a lifecycle script: verify events.log is valid before appending
if ! python3 -c "import json; [json.loads(l) for l in open('events.log')]" 2>/dev/null; then
echo "ERROR: events.log is malformed" >&2; exit 1
fi
Environment guards: prevent dangerous operations in specific contexts
# Before destructive operations: verify you're not in the wrong directory
if [ "$(pwd)" != "$EXPECTED_ROOT" ]; then
echo "ERROR: wrong working directory" >&2; exit 1
fi
Debug instrumentation: capture context for forensics
# Log state at each boundary for post-mortem if something still goes wrong
echo "DEBUG: entering hook, session=$LIFECYCLE_SESSION_ID, feature=$1" >&2
Don't stop at one validation point. Add checks at every layer.
Arbitrary sleeps guess at timing. This creates races where scripts pass on fast machines but fail under load or when the system is busy.
Core principle: wait for the actual condition you care about, not a guess about how long it takes.
Shell-native pattern:
wait_for() {
local description="$1"
local condition="$2" # a bash test expression
local timeout="${3:-30}" # seconds
local elapsed=0
while ! eval "$condition" 2>/dev/null; do
if [ "$elapsed" -ge "$timeout" ]; then
echo "Timeout after ${timeout}s waiting for: $description" >&2
return 1
fi
sleep 1
elapsed=$((elapsed + 1))
done
}
# Usage examples:
wait_for "events.log to appear" '[ -s lifecycle/my-feature/events.log ]'
wait_for "task-done flag" '[ -f /tmp/task-done.flag ]' 30
wait_for "GPG agent socket" '[ -S "$GNUPGHOME/S.gpg-agent" ]' 10
wait_for "overnight runner to finish" 'grep -q feature_complete lifecycle/my-feature/events.log' 120
Don't use when:
Common mistakes:
npx claudepluginhub charleshall888/cortex-command --plugin cortex-coreGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.