Root Cause Analysis
Install with: npx machina-cli add skill rohitg00/skillkit/root-cause-analysis --openclaw
You are performing systematic root cause analysis to find the true source of a bug. Do not apply fixes until you understand WHY the bug exists.
Core Principle
Never fix a symptom. Always find and fix the root cause.
A root cause is the earliest point in the causal chain where intervention could prevent the defect.
The Five Whys Method
Ask "Why?" repeatedly to drill down to the root cause:
- Why did the API return an error? → The database query failed
- Why did the database query fail? → The connection timed out
- Why did the connection time out? → The connection pool was exhausted
- Why was the connection pool exhausted? → Connections weren't being released
- Why weren't connections being released? → ROOT CAUSE: Missing finally block to close connections
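The root cause at the end of this chain can be sketched in code. The `Pool` class below is a hypothetical in-memory stand-in, not a real database driver, and `runLeaky` / `runSafe` are illustrative names. It only shows why a missing finally block leaks a connection whenever the work throws.

```typescript
// Hypothetical in-memory pool used only to illustrate the leak; not a real driver API.
class Pool {
  private inUse = 0;
  private readonly size: number;

  constructor(size: number) {
    this.size = size;
  }

  acquire(): { id: number } {
    if (this.inUse >= this.size) {
      throw new Error("connection pool exhausted");
    }
    this.inUse += 1;
    return { id: this.inUse };
  }

  release(): void {
    this.inUse = Math.max(0, this.inUse - 1);
  }

  activeCount(): number {
    return this.inUse;
  }
}

// Leaky version: if `work` throws, release() is never reached.
function runLeaky<T>(pool: Pool, work: (conn: { id: number }) => T): T {
  const conn = pool.acquire();
  const result = work(conn);
  pool.release();
  return result;
}

// Fixed version: `finally` releases the connection on success and on failure.
function runSafe<T>(pool: Pool, work: (conn: { id: number }) => T): T {
  const conn = pool.acquire();
  try {
    return work(conn);
  } finally {
    pool.release();
  }
}
```

With a pool of size 2, two failing calls through `runLeaky` are enough to make the third call fail with "connection pool exhausted", which is the symptom at the top of the chain. The same calls through `runSafe` leave the pool empty.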
Investigation Phases
Phase 1: Reproduce the Bug
Before investigating:
- Reproduce consistently - If you can't reproduce it, you can't verify a fix
- Document reproduction steps - Exact sequence of actions
- Note environment details - OS, versions, configuration
- Identify minimal reproduction - Smallest case that shows the bug
Questions to answer:
- Does it happen every time or intermittently?
- Does it happen in all environments?
- When did it start happening? (recent changes)
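One way to answer the "every time or intermittently?" question is to run the reproduction many times and count failures. The sketch below assumes the reproduction is wrapped in a callback that throws when the bug appears. `measureRepro` and `ReproResult` are hypothetical names chosen for illustration.

```typescript
// Hypothetical helper: `repro` is assumed to throw when the bug manifests.
interface ReproResult {
  runs: number;
  failures: number;
  kind: "always" | "intermittent" | "never";
}

function measureRepro(repro: () => void, runs: number): ReproResult {
  if (runs <= 0) {
    throw new Error("runs must be a positive number");
  }
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      repro();
    } catch (err) {
      failures += 1; // count the failure and keep going
    }
  }
  const kind =
    failures === runs ? "always" : failures === 0 ? "never" : "intermittent";
  return { runs, failures, kind };
}
```

A result of "never" after many runs usually means the reproduction steps or the environment differ from where the bug was seen, which is itself evidence worth recording.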
Phase 2: Gather Evidence
Collect information before forming theories:
Evidence Types:
├── Error messages and stack traces
├── Log files (application, system, database)
├── Recent code changes (git log, blame)
├── User reports and reproduction steps
├── Monitoring data (metrics, APM)
└── Related issues (search issue tracker)
Do NOT:
- Make changes while gathering evidence
- Assume you know the cause without evidence
- Ignore related symptoms
Phase 3: Form Hypotheses
Based on evidence, create ranked hypotheses:
| Priority | Hypothesis | Evidence | Test Plan |
|---|---|---|---|
| 1 | Connection leak in UserService | Stack trace points to the connection pool | Add logging, check connection usage |
| 2 | Query timeout too short | Occurs under load | Test with longer timeout |
| 3 | Database server overload | Correlates with peak hours | Check DB metrics |
For each hypothesis:
- What evidence supports it?
- What evidence contradicts it?
- How can we test it?
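The ranked table and the three questions can be kept as a small data structure so the next hypothesis to test is always explicit. This is a minimal sketch: the field names mirror the table columns, while `status`, `evidenceAgainst`, and `nextToTest` are assumptions added for illustration.

```typescript
// `status` is an added, hypothetical field for tracking progress.
type Status = "untested" | "confirmed" | "rejected";

interface Hypothesis {
  priority: number; // 1 = most likely
  statement: string;
  evidenceFor: string[];
  evidenceAgainst: string[];
  testPlan: string;
  status: Status;
}

// Returns the most likely hypothesis that has not been tested yet, if any.
function nextToTest(list: Hypothesis[]): Hypothesis | undefined {
  return list
    .filter((h) => h.status === "untested") // filter() copies, so sort() leaves `list` untouched
    .sort((a, b) => a.priority - b.priority)[0];
}

// The rows from the table above; no contradicting evidence has been recorded yet.
const hypotheses: Hypothesis[] = [
  {
    priority: 2,
    statement: "Query timeout too short",
    evidenceFor: ["Occurs under load"],
    evidenceAgainst: [],
    testPlan: "Test with longer timeout",
    status: "untested",
  },
  {
    priority: 1,
    statement: "Connection leak in UserService",
    evidenceFor: ["Stack trace shows connection pool"],
    evidenceAgainst: [],
    testPlan: "Add logging, check usage",
    status: "untested",
  },
  {
    priority: 3,
    statement: "Database server overload",
    evidenceFor: ["Correlates with peak hours"],
    evidenceAgainst: [],
    testPlan: "Check DB metrics",
    status: "untested",
  },
];
```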
Phase 4: Test Hypotheses
Test each hypothesis systematically:
- Start with the highest-probability hypothesis
- Design a definitive test - Should clearly confirm or reject
- Make ONE change at a time
- Document results
If hypothesis is rejected:
- Cross it off the list
- Re-evaluate remaining hypotheses
- Consider if new evidence suggests new hypotheses
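"Make ONE change at a time" can be enforced mechanically by applying each candidate change to a fresh copy of the baseline. The sketch below assumes a flat numeric configuration and a caller-supplied check that reports whether the bug still reproduces. `Config`, `Change`, and `tryChangesInIsolation` are hypothetical names.

```typescript
// Hypothetical shapes; none of these come from a real library.
type Config = Record<string, number>;

interface Change {
  key: string;
  value: number;
}

interface Trial {
  change: string;
  bugReproduces: boolean;
}

// Every trial differs from the baseline in exactly one setting,
// so a changed outcome can be attributed to that setting alone.
function tryChangesInIsolation(
  baseline: Config,
  candidates: Change[],
  bugReproduces: (config: Config) => boolean
): Trial[] {
  return candidates.map((c) => {
    const trialConfig: Config = { ...baseline, [c.key]: c.value };
    return {
      change: `${c.key}=${c.value}`,
      bugReproduces: bugReproduces(trialConfig),
    };
  });
}
```

If the bug still reproduces with a longer timeout but disappears with a different pool setting, the timeout hypothesis can be crossed off without ambiguity.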
Phase 5: Verify Root Cause
Before declaring root cause found:
- Can you explain the full causal chain?
- Does fixing it consistently prevent the bug?
- Does it explain ALL observed symptoms?
- Is there nothing earlier in the chain that could be fixed?
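The four questions above can be treated as a gate that must pass before a root cause is declared. This is a minimal sketch; `RootCauseChecks` and `unmetChecks` are illustrative names, and the answers still have to come from the investigation itself.

```typescript
// One boolean per verification question from Phase 5.
interface RootCauseChecks {
  explainsFullCausalChain: boolean;
  fixConsistentlyPreventsBug: boolean;
  explainsAllSymptoms: boolean;
  nothingEarlierToFix: boolean;
}

// Returns the questions that are still unanswered or answered "no".
function unmetChecks(checks: RootCauseChecks): string[] {
  const labels: [keyof RootCauseChecks, string][] = [
    ["explainsFullCausalChain", "Can you explain the full causal chain?"],
    ["fixConsistentlyPreventsBug", "Does fixing it consistently prevent the bug?"],
    ["explainsAllSymptoms", "Does it explain ALL observed symptoms?"],
    ["nothingEarlierToFix", "Is there nothing earlier in the chain that could be fixed?"],
  ];
  return labels.filter(([key]) => !checks[key]).map(([, label]) => label);
}
```

An empty result means all four checks pass. Anything else lists what still needs to be investigated.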
Common Root Cause Categories
Code Defects
- Logic errors
- Boundary conditions
- Race conditions
- Resource leaks
- Null/undefined handling
Design Issues
- Missing error handling
- Inadequate validation
- Poor state management
- Coupling issues
Environment
- Configuration errors
- Resource constraints
- Version mismatches
- Network issues
Data Issues
- Invalid input data
- Data corruption
- Schema mismatches
- Encoding problems
Evidence Collection Commands
# Recent changes to relevant files
git log --oneline -20 -- path/to/file
# Who changed this line
git blame path/to/file
# Changes since last working version
git diff v1.2.3..HEAD -- src/
# Search for related error handling
grep -r "catch\|error\|throw" --include="*.ts" src/
Red Flags - You Haven't Found Root Cause
- "I'm not sure why, but this fix works"
- "The bug went away after I restarted"
- "I added a check to prevent this case"
- "It's probably a race condition somewhere"
These suggest symptom treatment, not root cause resolution.
Documentation Template
When root cause is found, document:
## Bug: [Description]
### Root Cause
[Clear explanation of why the bug occurred]
### Evidence
- [Evidence 1]
- [Evidence 2]
### Causal Chain
1. [Initial trigger]
2. [Intermediate cause]
3. [Root cause]
4. [Observed symptom]
### Fix
[Description of the fix and why it addresses root cause]
### Prevention
[How to prevent similar issues in the future]
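If reports are generated by tooling, the template can be filled from a typed record. The sketch below assumes the section order shown above. `BugReport` and `renderBugReport` are hypothetical names, and the output is plain text in the template's own heading style.

```typescript
// Fields correspond to the sections of the documentation template.
interface BugReport {
  description: string;
  rootCause: string;
  evidence: string[];
  causalChain: string[];
  fix: string;
  prevention: string;
}

function renderBugReport(r: BugReport): string {
  return [
    `## Bug: ${r.description}`,
    "### Root Cause",
    r.rootCause,
    "### Evidence",
    ...r.evidence.map((e) => `- ${e}`),
    "### Causal Chain",
    ...r.causalChain.map((step, i) => `${i + 1}. ${step}`),
    "### Fix",
    r.fix,
    "### Prevention",
    r.prevention,
  ].join("\n");
}
```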
Integration with Other Skills
After finding root cause:
- Use testing/red-green-refactor to write a test that exposes the bug
- Use planning/verification-gates to validate the fix
- Consider collaboration/structured-review for complex fixes
Source
https://github.com/rohitg00/skillkit/blob/main/packages/core/src/methodology/packs/debugging/root-cause-analysis/SKILL.md
Overview
Root Cause Analysis is a systematic approach to identifying the earliest point in the causal chain where intervention could prevent a defect. It emphasizes never fixing symptoms, and uses structured phases like reproducing the bug, gathering evidence, forming hypotheses, testing them, and verifying the true root cause.
How This Skill Works
Begin with the Five Whys to drill down to the root cause, then follow the investigation phases: reproduce the bug, gather evidence, form ranked hypotheses, test them one change at a time, and verify the full causal chain. This disciplined flow helps separate symptoms from causes and yields durable fixes.
When to Use It
- When a defect shows multiple symptoms and the true source isn’t obvious.
- When debugging across environments and you need consistent evidence.
- When the bug is intermittent or flaky and requires evidence-based reasoning.
- When a fix must prevent regression by addressing the actual cause, not a workaround.
- When previous patches failed to fully resolve the issue and you need to map the full causal chain.
Quick Start
- Step 1: Reproduce the bug and document environment details.
- Step 2: Gather evidence (logs, traces, recent changes) before making changes.
- Step 3: Form hypotheses, test them one change at a time, and verify the root cause.
Best Practices
- Never fix a symptom; focus on the root cause first.
- Reproduce the bug consistently and document exact steps.
- Gather diverse evidence (logs, traces, changes, user reports) before hypothesizing.
- Form hypotheses in order of probability and outline a clear test plan.
- Test one change at a time and verify the root cause end-to-end.
Example Use Cases
- Root cause: Missing finally block to close DB connections, causing connection pool exhaustion.
- Root cause: Race condition in a shared cache, resolved by introducing proper synchronization.
- Root cause: Data encoding mismatch at the API gateway, leading to malformed responses.
- Root cause: Configuration drift in a feature flag, resulting in inconsistent service behavior.
- Root cause: Missing index on a slow query, revealed after collecting metrics and testing with a larger dataset.