Applies a 4-phase debugging protocol: root cause investigation, pattern analysis, hypothesis testing, confident fix. For bugs, test failures, and unexpected behavior in any codebase.
Slash command: /sdlc:debugging-protocol
**Version:** 1.0.0
**Portability:** Universal
Defines a systematic 4-phase investigation process for debugging any bug, test failure, or unexpected behavior. Enforces root cause analysis before attempting fixes.
Purpose: Prevent symptom fixes (which hide bugs) and ensure deep understanding of problems before implementing solutions.
Scope:
The Iron Law: Never attempt a fix until you complete root cause investigation.
Why this matters: Symptom fixes hide bugs rather than solving them. They create technical debt, mask deeper issues, and often cause new bugs elsewhere.
How to apply:
Example:
❌ Bad: "Error says null pointer. Let me add a null check."
✓ Good: "Error says null pointer. Why is this null? (investigate)"
Result:
- Bad approach: Symptom fixed, root cause remains, bug appears elsewhere
- Good approach: Found initialization bug, fixed at source, entire class prevented
The Principle: Follow a structured investigation: Root Cause → Pattern Analysis → Hypothesis Testing → Implementation.
Why this matters: Structured investigation prevents random debugging and ensures you understand the problem completely before attempting solutions.
The Four Phases: (1) Root Cause Investigation, (2) Pattern Analysis, (3) Hypothesis Testing, (4) Implementation.
How to apply: Complete each phase before moving to the next. Document your findings at each phase.
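The phase ordering can be made explicit in code. The following is a minimal sketch, not part of the protocol itself: `Phase`, `Investigation`, and their methods are hypothetical names, and the rule "findings must be non-empty" is an assumption standing in for "document your findings at each phase".

```rust
/// The four phases, in the order the protocol requires.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    RootCause,
    PatternAnalysis,
    HypothesisTesting,
    Implementation,
}

impl Phase {
    /// Returns the phase that follows this one, or `None` after the last.
    pub fn next(self) -> Option<Phase> {
        match self {
            Phase::RootCause => Some(Phase::PatternAnalysis),
            Phase::PatternAnalysis => Some(Phase::HypothesisTesting),
            Phase::HypothesisTesting => Some(Phase::Implementation),
            Phase::Implementation => None,
        }
    }
}

/// Hypothetical investigation log: one findings entry per completed phase.
#[derive(Debug)]
pub struct Investigation {
    current: Option<Phase>,
    findings: Vec<(Phase, String)>,
}

impl Investigation {
    pub fn new() -> Self {
        Investigation { current: Some(Phase::RootCause), findings: Vec::new() }
    }

    /// The phase still to be completed, or `None` once all four are done.
    pub fn current(&self) -> Option<Phase> {
        self.current
    }

    /// Findings recorded so far, in phase order.
    pub fn findings(&self) -> &[(Phase, String)] {
        &self.findings
    }

    /// Documents the current phase and advances; rejects empty findings.
    pub fn complete(&mut self, findings: &str) -> Result<(), String> {
        let phase = self
            .current
            .ok_or_else(|| "all phases already completed".to_string())?;
        if findings.trim().is_empty() {
            return Err(format!("{:?} needs documented findings before moving on", phase));
        }
        self.findings.push((phase, findings.to_string()));
        self.current = phase.next();
        Ok(())
    }

    /// A fix may only be implemented once the first three phases are documented.
    pub fn may_implement_fix(&self) -> bool {
        self.current == Some(Phase::Implementation)
    }
}
```

With this shape, `may_implement_fix()` stays `false` until `complete` has been called with real findings for the first three phases, which is the ordering the principle asks for.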
The Principle: Test a single hypothesis with minimal changes. If it fails, undo and try a different theory.
Why this matters: Changing multiple things simultaneously makes it impossible to know which change had which effect. This wastes time and compounds confusion.
How to apply:
Example:
❌ Bad: "Let me change the import, add a null check, and update the type signature"
✓ Good:
- Hypothesis 1: "The import is wrong" → Test → Refuted → Undo
- Hypothesis 2: "The type is incorrect" → Test → Confirmed → Fix
Result: Clear understanding of what actually solved the problem
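The "test, then undo if refuted" loop can be sketched as a small helper. This is an illustration under stated assumptions: `test_hypothesis` and `Verdict` are hypothetical names, the system under test is modeled as any cloneable value, and "undo" is modeled as restoring a snapshot taken before the change.

```rust
/// Outcome of testing a single hypothesis.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Confirmed,
    Refuted,
}

/// Applies exactly one change, runs the check, and restores the previous
/// state if the check fails, so a refuted hypothesis leaves nothing behind.
pub fn test_hypothesis<S: Clone>(
    state: &mut S,
    change: impl FnOnce(&mut S),
    check: impl Fn(&S) -> bool,
) -> Verdict {
    let snapshot = state.clone(); // taken before the change so it can be undone
    change(state);
    if check(state) {
        Verdict::Confirmed
    } else {
        *state = snapshot; // undo: the refuted change is discarded
        Verdict::Refuted
    }
}
```

Because each call carries one change and one check, a refuted hypothesis never leaks into the next experiment. In a real codebase the snapshot would more likely be a clean working tree restored with version control.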
The Principle: If three fix attempts fail, stop trying fixes. The problem is deeper than you think.
Why this matters: Repeated failures signal architectural problems or domain modeling issues, not simple bugs. Continuing to try fixes wastes time.
How to apply:
Example:
Attempt 1: Add validation → Still fails
Attempt 2: Change order → Still fails
Attempt 3: Different algorithm → Still fails
STOP. Question the architecture:
- Are we solving the wrong problem?
- Is the domain model incorrect?
- Is this a fundamental design issue?
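One way to keep the three-attempt limit honest is to count failures explicitly. The sketch below assumes a hypothetical `FixLog` type; only the threshold of three comes from the rule above.

```rust
/// Threshold from the rule above: three failed fixes means stop.
pub const MAX_FIX_ATTEMPTS: usize = 3;

/// What to do after a failed fix attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    TryNextHypothesis,
    Escalate,
}

/// Hypothetical log of failed fix attempts for one bug.
#[derive(Debug, Default)]
pub struct FixLog {
    failed: Vec<String>,
}

impl FixLog {
    /// Records a failed attempt and reports whether another fix is allowed.
    pub fn record_failure(&mut self, attempt: &str) -> NextStep {
        self.failed.push(attempt.to_string());
        if self.failed.len() >= MAX_FIX_ATTEMPTS {
            NextStep::Escalate
        } else {
            NextStep::TryNextHypothesis
        }
    }

    /// The attempts that have failed so far, oldest first.
    pub fn failures(&self) -> &[String] {
        &self.failed
    }
}
```

Once `record_failure` returns `Escalate`, the recorded attempts are the input to the architectural questions rather than a prompt for a fourth fix.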
Rationale: Disciplined investigation finds root causes. Random debugging wastes time and hides problems.
Scenario: Test that passed before now fails after code changes.
Approach:
Phase 1: Root Cause Investigation
git diff HEAD~5 # What changed?
git log --oneline -10 # Recent commits
Phase 2: Pattern Analysis
Phase 3: Hypothesis Testing
Phase 4: Implementation
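For a regression like this, the search for the offending change can be narrowed mechanically. The sketch below illustrates the binary search that `git bisect` automates. It assumes the commits are ordered oldest to newest and that once the test starts failing it keeps failing. `first_bad` and the `is_bad` predicate are hypothetical names.

```rust
/// Returns the index of the first commit for which `is_bad` is true,
/// or `None` if no commit is bad.
///
/// Assumes `commits` is ordered oldest to newest and that every commit
/// after the first bad one is also bad.
pub fn first_bad<T>(commits: &[T], is_bad: impl Fn(&T) -> bool) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, commits.len());
    // Invariant: every commit before `lo` is good, every commit from `hi` on is bad.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_bad(&commits[mid]) {
            hi = mid; // the first bad commit is at `mid` or earlier
        } else {
            lo = mid + 1; // the first bad commit is after `mid`
        }
    }
    if lo < commits.len() {
        Some(lo)
    } else {
        None
    }
}
```

In practice `is_bad` would check out the commit and run the failing test. The point of the sketch is that only about log2(n) test runs are needed to identify the change that introduced the regression.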
Scenario: Error occurs in distributed system (frontend → API → database).
Approach:
Phase 1: Root Cause Investigation
Phase 2: Pattern Analysis
Phase 3: Hypothesis Testing
Phase 4: Implementation
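In a layered system, an error seen at the frontend is often only a relayed symptom. As a rough sketch, and assuming each layer can report whether it observed an error, the investigation can start at the deepest layer that failed. `LayerReport` and `deepest_failure` are hypothetical names, and "deepest failing layer first" is a heuristic, not a guarantee.

```rust
/// What one layer observed, e.g. collected from its logs.
#[derive(Debug, Clone, PartialEq)]
pub struct LayerReport {
    pub layer: &'static str,
    pub error: Option<String>,
}

/// Given reports ordered from the outermost layer (frontend) to the
/// innermost (database), returns the innermost layer that saw an error.
pub fn deepest_failure(reports: &[LayerReport]) -> Option<&LayerReport> {
    reports.iter().rev().find(|r| r.error.is_some())
}
```

If the database layer reports an error, the frontend and API errors are probably downstream effects. If only the API reports one, the investigation starts there instead.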
Scenario: Three fix attempts have failed.
Approach:
After 3rd failure:
Example:
Problem: "User authentication fails intermittently"
Attempt 1: Add retry logic → Still fails
Attempt 2: Increase timeout → Still fails
Attempt 3: Better error handling → Still fails
STOP. Architectural questions:
- Is session management the right approach?
- Should this be stateless with tokens instead?
- Is the database schema correct?
Result: Discovered fundamental session model flaw, redesigned auth flow
Works well with:
Prerequisites:
Problem: "I know what this is, let me just fix it" (skipping investigation)
Solution: Resist the urge. Do Phase 1 investigation FIRST, even if you think you know the answer. You're often wrong.
Problem: Changing multiple things without hypothesis, hoping something works
Solution: Form explicit hypothesis. Test ONE thing. Observe result. Learn from it.
Problem: "Let me add this check to prevent the error" without understanding why it occurs
Solution: Ask "why is this happening?" not "how do I hide this?" Fix the source, not the symptom.
Problem: Skipping working examples, trying to fix in isolation
Solution: Always find working code. Understanding why something works is as important as understanding why something fails.
Problem: "Fourth time's the charm" (continuing after 3+ failed fixes)
Solution: 3 failures = architectural problem signal. Stop fixing, start redesigning.
Phase 1: Root Cause Investigation
Error: "called `Option::unwrap()` on a `None` value" (panic at user_service.rs:42)
File: user_service.rs
Line: 42
Code: let email = user.email.unwrap();
Reproduction: Always fails for user_id = 123, never fails for user_id = 456
Recent changes: Added email validation to registration (3 days ago)
Data flow: Database → UserService.load() → user.email → unwrap()
Phase 2: Pattern Analysis
Working example: User 456 has email in database
Failing example: User 123 has NULL email in database
Difference: User 123 was created BEFORE email validation was added
(email field nullable in DB for backward compatibility)
Dependencies: Database migration didn't backfill existing users
Phase 3: Hypothesis Testing
Hypothesis: "User 123 has NULL email because created before validation"
Test: Check database directly
SELECT id, email FROM users WHERE id = 123;
Result: email = NULL
Result: CONFIRMED - old users have NULL emails
Phase 4: Implementation
1. Create failing test:
#[test]
fn test_user_service_handles_missing_email() {
    // `service` is the UserService instance under test
    let user = User { id: 123, email: None };
    let result = service.load(user);
    assert!(result.is_ok()); // Should handle gracefully
}
2. Fix (two parts):
a. Root cause: Backfill database (migration)
UPDATE users SET email = '[email protected]' WHERE email IS NULL;
b. Defense: Handle None case in code
let email = user.email.unwrap_or_default();
3. Verify:
- Test passes
- All users now have emails
- No more panic from unwrap() on a missing email
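The defensive half of this fix can be sketched as a self-contained unit. The shapes of `User`, `LoadedUser`, and `UserService::load` below are assumptions made for illustration, since the original only shows the single `unwrap()` line. The fallback to an empty string mirrors `unwrap_or_default()` from step 2b.

```rust
/// Hypothetical row as read from the database; `email` is nullable.
#[derive(Debug, Clone)]
pub struct User {
    pub id: u64,
    pub email: Option<String>,
}

/// Hypothetical result of loading a user.
#[derive(Debug, PartialEq)]
pub struct LoadedUser {
    pub id: u64,
    pub email: String,
}

pub struct UserService;

impl UserService {
    /// Loads a user without panicking when the email is missing.
    /// A missing email becomes an empty string via `unwrap_or_default()`.
    pub fn load(&self, user: User) -> Result<LoadedUser, String> {
        let email = user.email.unwrap_or_default();
        Ok(LoadedUser { id: user.id, email })
    }
}
```

Calling `UserService.load(User { id: 123, email: None })` returns `Ok` instead of panicking. Whether an empty string is an acceptable stand-in for a missing email is a design decision for the calling code. Returning an explicit error for that case would be a reasonable alternative.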
Phase 1: Root Cause Investigation
Error: "Expected 200 OK, got 500 Internal Server Error"
Test: test_user_registration
Reproduction: Fails consistently in CI, passes locally
Recent changes: Updated authentication library (yesterday)
Environment difference:
- Local: SQLite in-memory database
- CI: PostgreSQL 14
Phase 2: Pattern Analysis
Working example: Local test with SQLite
Failing example: CI test with PostgreSQL
Difference investigation:
- Read auth library changelog
- Found: Library 2.0 uses PostgreSQL-specific JSON operators
- SQLite lacks these operators, but the library's SQLite code path never uses them
Dependencies:
- Auth library assumes PostgreSQL JSON support
- Library works with SQLite by accident (doesn't exercise JSON paths)
Phase 3: Hypothesis Testing
Hypothesis: "The failure comes from auth library 2.0's PostgreSQL-specific JSON code path, which the SQLite-backed local tests never exercise"
Test: Run local tests with PostgreSQL instead of SQLite
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=test postgres:14
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/test cargo test
Result: CONFIRMED - local tests now fail with same error as CI
Phase 4: Implementation
1. Failing test already exists (test_user_registration)
2. Fix options:
a. Pin auth library to 1.x (workaround)
b. Migrate to PostgreSQL everywhere (align environments)
c. Use database-agnostic JSON library (portable)
Choice: (b) - Align local and CI environments
3. Implementation:
- Update local dev setup to use PostgreSQL
- Document in README
- Update .env.example
4. Verify:
- Local tests pass with PostgreSQL
- CI tests pass
- No environment discrepancies remain
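To keep the two environments from drifting apart again, a test setup could compare database backends before running anything. The following is a sketch under assumptions: `Backend`, `backend_of`, and `matches_ci` are hypothetical helpers, and the backend is inferred only from the URL scheme.

```rust
/// Database backend inferred from a connection URL.
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Postgres,
    Sqlite,
    Other,
}

/// Classifies a connection URL by its scheme (the text before the first ':').
pub fn backend_of(database_url: &str) -> Backend {
    let scheme = database_url.split(':').next().unwrap_or("");
    match scheme {
        "postgres" | "postgresql" => Backend::Postgres,
        "sqlite" => Backend::Sqlite,
        _ => Backend::Other,
    }
}

/// True when local and CI tests would run against the same kind of database.
pub fn matches_ci(local_url: &str, ci_url: &str) -> bool {
    backend_of(local_url) == backend_of(ci_url)
}
```

A test harness could call `matches_ci` at startup and refuse to run on a mismatch, which turns "passes locally, fails in CI" into an immediate and explicit error.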
Scenario: Performance bug (API response time > 5 seconds)
Attempt 1: Add caching
Hypothesis: "Database queries are slow, need caching"
Implementation: Add Redis cache for user queries
Result: FAILED - Still slow (5.2 seconds)
Attempt 2: Index database
Hypothesis: "Missing database indexes"
Implementation: Add index on users.email
Result: FAILED - Still slow (5.1 seconds, marginal improvement)
Attempt 3: Optimize query
Hypothesis: "N+1 query problem"
Implementation: Add eager loading for relationships
Result: FAILED - Still slow (4.8 seconds, nowhere near an acceptable response time)
After 3rd failure - STOP AND ESCALATE:
Question: "Why do performance fixes keep failing?"
Deeper investigation:
- Profile API with flamegraph
- Found: 90% of time spent in external service call (not database!)
- Root cause: Synchronous call to email validation API (3rd party)
Architectural problem:
- Wrong assumption: Database was the bottleneck
- Actual problem: Blocking I/O to external service
- Solution: Move email validation to async background job
Result: Response time < 200ms after architectural change
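The lesson here is to measure before forming a hypothesis. The example uses a flamegraph. The sketch below is a much coarser stand-in that times named steps with `std::time::Instant`. `timed` and `slowest` are hypothetical helpers, and the step labels in the usage note are invented for illustration.

```rust
use std::time::{Duration, Instant};

/// Runs `f`, records how long it took under `label`, and returns its result.
pub fn timed<T>(
    label: &'static str,
    timings: &mut Vec<(&'static str, Duration)>,
    f: impl FnOnce() -> T,
) -> T {
    let start = Instant::now();
    let out = f();
    timings.push((label, start.elapsed()));
    out
}

/// Returns the label of the step that took the longest, if any were recorded.
pub fn slowest(timings: &[(&'static str, Duration)]) -> Option<&'static str> {
    timings
        .iter()
        .max_by_key(|(_, elapsed)| *elapsed)
        .map(|(label, _)| *label)
}
```

Wrapping each stage of the request handler in `timed` (for instance `"db_query"`, `"email_validation_api"`, `"render"`) and calling `slowest` afterwards would have pointed at the external call before any of the three database fixes were attempted.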
Use this checklist to verify you're following the debugging protocol:
If you can't check all boxes, you're not following the protocol.
Watch for these thoughts - they indicate you're about to skip the protocol:
| Thought | Reality | Correct Action |
|---|---|---|
| "I know what this is, let me just fix it" | You're skipping investigation | Do Phase 1 first |
| "Quick fix, then investigate if needed" | You'll never investigate after | Do Phase 1 FIRST |
| "Let me try a few things" | Random debugging hides bugs | ONE hypothesis at a time |
| "This worked before, must be environment" | Assumptions without evidence | Verify with evidence |
| "I'll add a check to prevent the error" | Symptom fix, not root cause | Find WHY it happens |
| "Fourth time's the charm" | 3+ failures = architecture problem | STOP. Escalate. |
When you catch yourself thinking these things, STOP and return to the protocol.
Extraction Source: sdlc/commands/shared/debugging-protocol.md
Extraction Date: 2026-02-04
Last Updated: 2026-02-04
Compatibility: Universal (all languages and frameworks)
License: MIT
npx claudepluginhub jwilger/claude-code-plugins --plugin sdlc