debugging
npx machina-cli add skill Roberdan/MyConvergio/debugging --openclaw
Debugging Skill
Reusable workflow extracted from dario-debugger expertise.
Purpose
Systematically investigate and resolve bugs through scientific methodology, root cause analysis, and evidence-based diagnosis across all technology stacks.
When to Use
- Production incidents and outages
- Intermittent or hard-to-reproduce bugs
- Performance degradation investigation
- Memory leaks and resource exhaustion
- Concurrency issues (race conditions, deadlocks)
- Crash analysis and stack trace interpretation
- Test failures and CI/CD pipeline issues
Workflow Steps
1. Reproduce
   - Confirm issue can be consistently reproduced
   - Document exact reproduction steps
   - Identify required environment/conditions
   - Create minimal reproduction case
2. Isolate
   - Narrow down problem space (component, input, timing)
   - Use binary search to eliminate possibilities
   - Identify affected versions (git bisect)
   - Determine scope of impact
3. Gather Evidence
   - Collect logs from all relevant systems
   - Capture stack traces and error messages
   - Record metrics and performance data
   - Preserve system state before changes
   - Use distributed tracing for microservices
4. Hypothesize
   - Form testable hypotheses about root cause
   - List potential causes ranked by probability
   - Consider symptoms vs actual cause
   - Apply 5 Whys technique
5. Test Hypotheses
   - Design experiments to prove/disprove each hypothesis
   - Use debuggers and profilers to validate
   - Check logs for evidence supporting/refuting
   - Eliminate possibilities systematically
6. Identify Root Cause
   - Determine fundamental issue (not just symptom)
   - Verify with >95% confidence
   - Document evidence trail
   - Distinguish correlation from causation
7. Fix & Verify
   - Implement targeted fix for root cause
   - Verify fix resolves issue
   - Test for regressions
   - Measure impact of fix
8. Prevent Recurrence
   - Add regression tests
   - Implement monitoring/alerting
   - Document findings for team
   - Update runbooks if applicable
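The Hypothesize and Test Hypotheses steps can be tracked with a very small data structure. The following is only a sketch: the `Hypothesis` class, its fields, and the sample statements and probabilities are illustrative assumptions, not part of the skill itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    """One candidate root cause plus the evidence gathered for it (hypothetical shape)."""
    statement: str
    probability: float          # rough estimate in [0, 1], used only for ranking
    status: str = "untested"    # "untested", "confirmed", or "refuted"
    evidence: List[str] = field(default_factory=list)


def next_to_test(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Return the most probable hypothesis that has not been tested yet."""
    untested = [h for h in hypotheses if h.status == "untested"]
    return max(untested, key=lambda h: h.probability, default=None)


def record_result(hypothesis: Hypothesis, supported: bool, note: str) -> None:
    """Mark a hypothesis confirmed or refuted and keep the evidence trail."""
    hypothesis.status = "confirmed" if supported else "refuted"
    hypothesis.evidence.append(note)


# Sample data is made up for illustration.
hypotheses = [
    Hypothesis("Load balancer timeout misconfigured", 0.1),
    Hypothesis("Query performance degraded with new schema", 0.6),
    Hypothesis("Connection leak in error path", 0.3),
]

first = next_to_test(hypotheses)
record_result(first, supported=False, note="query plan unchanged in staging")
second = next_to_test(hypotheses)
```

Testing in probability order and recording why each hypothesis was refuted keeps the evidence trail that step 6 asks for.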
Inputs Required
- Bug description: Expected vs actual behavior
- Environment: OS, versions, configurations, recent changes
- Reproduction: Steps to reproduce (if known)
- Evidence: Logs, error messages, screenshots, metrics
- Scope: When did it start? How many affected?
Outputs Produced
- Root Cause Report: Detailed analysis with evidence
- Reproduction Steps: Minimal, reliable reproduction case
- Fix Recommendations: Prioritized solutions with trade-offs
- Prevention Strategy: How to prevent similar issues
- Regression Tests: Tests to verify fix and prevent recurrence
Bug Classification
Priority Levels
- 🔴 P0 - Critical: System down, data loss, security breach - immediate response
- 🟠 P1 - High: Major feature broken, significant user impact
- 🟡 P2 - Medium: Feature degraded, workaround exists
- 🟢 P3 - Low: Minor issue, cosmetic, edge case
Debugging Techniques
Scientific Method
- Observe the problem
- Form hypothesis about cause
- Design experiment to test hypothesis
- Execute test and collect data
- Analyze results
- Refine hypothesis or conclude
Binary Search Debugging
- Divide problem space in half repeatedly
- Test midpoint to eliminate half of possibilities
- Efficient for narrowing down cause
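As a rough illustration of the halving idea (the same idea behind git bisect), the sketch below finds the first failing item in an ordered list. The function name `first_bad`, the revision labels, and the `is_bad` check are assumptions made for the example; it also assumes that once an item is bad, every later item is bad too.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first candidate for which is_bad is true.

    Assumes candidates are ordered, the last one is known to be bad,
    and every candidate after the first bad one is also bad.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    lo, hi = 0, len(candidates) - 1      # invariant: candidates[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid                     # first bad item is at mid or earlier
        else:
            lo = mid + 1                 # first bad item is after mid
    return candidates[lo]


# Hypothetical history of 16 revisions where the bug first appears in r11.
revisions = [f"r{i}" for i in range(1, 17)]
checks = []


def is_bad(revision: str) -> bool:
    checks.append(revision)              # record how many revisions were tested
    return int(revision[1:]) >= 11


culprit = first_bad(revisions, is_bad)
```

With 16 candidates only four revisions need to be tested, compared with up to 16 for a linear scan.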
5 Whys Technique
Problem: API endpoint returns 500 error
Why? Database connection failed
Why? Connection pool exhausted
Why? Connections not being released
Why? Missing finally block in error path
Why? Error handling added without proper resource cleanup
Root Cause: Incomplete error handling refactor
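The last two "whys" in this chain describe a common pattern: an error path that skips resource cleanup. The sketch below reproduces it with a toy pool. `ToyPool`, `handle_leaky`, `handle_fixed`, and `survives_failures` are hypothetical names invented for the illustration and do not correspond to any real database driver.

```python
class ToyPool:
    """Hypothetical fixed-size pool that only counts checked-out connections."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> str:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return "conn"

    def release(self, conn: str) -> None:
        self.in_use -= 1


def handle_leaky(pool: ToyPool, fail: bool) -> None:
    conn = pool.acquire()
    if fail:
        raise ValueError("query failed")   # error path never reaches release()
    pool.release(conn)


def handle_fixed(pool: ToyPool, fail: bool) -> None:
    conn = pool.acquire()
    try:
        if fail:
            raise ValueError("query failed")
    finally:
        pool.release(conn)                 # runs on success and on error


def survives_failures(handler, pool_size: int = 3, requests: int = 10) -> bool:
    """Send failing requests and report whether the pool stays healthy."""
    pool = ToyPool(pool_size)
    for _ in range(requests):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass                           # the request itself failed, as expected
        except RuntimeError:
            return False                   # pool ran dry: connections leaked
    return pool.in_use == 0


leaky_ok = survives_failures(handle_leaky)
fixed_ok = survives_failures(handle_fixed)
```

A check like `survives_failures` can double as a regression test: it fails against the leaky handler and passes once the `finally` block is in place.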
Time-Travel Debugging
- Use tools like rr, UndoDB for execution replay
- Step backwards through execution
- Examine state at any point in time
Example Usage
Input: Production API returning 500 errors intermittently
Workflow Execution:
1. Reproduce: 500 errors occur under load (>100 req/sec)
2. Isolate: Only affects /api/users endpoint, started after v2.3 deploy
3. Evidence: Connection pool at max, slow query log shows 30s timeouts
4. Hypothesis: Query performance degraded with new schema
5. Test: EXPLAIN ANALYZE shows missing index after migration
6. Root Cause: Migration script failed to create user_email_idx index
7. Fix: CREATE INDEX user_email_idx; query time drops to 50ms
8. Prevent: Add index existence check to health endpoint
Output:
ROOT CAUSE: Missing database index after incomplete migration
EVIDENCE: Query plan shows seq scan, migration log shows index creation failed
FIX: Manual index creation, update migration with IF NOT EXISTS
PREVENTION: Added database index monitoring, migration dry-run validation
CONFIDENCE: 99%
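The "index existence check" from step 8 might look something like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in, so it relies on SQLite's `EXPLAIN QUERY PLAN` and `sqlite_master` rather than the `EXPLAIN ANALYZE` mentioned above. The `users` table, its columns, and the helper names are assumptions; only the index name `user_email_idx` comes from the example.

```python
import sqlite3


def index_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Health-check style probe: is the named index present in the schema?"""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan description for a query (fourth column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# In-memory database with an assumed schema, standing in for production.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

lookup = "SELECT * FROM users WHERE email = ?"
params = ("someone@example.com",)

existed_before = index_exists(conn, "user_email_idx")
plan_before = query_plan(conn, lookup, params)   # full table scan expected

# IF NOT EXISTS makes the statement safe to re-run after a partial migration.
conn.execute("CREATE INDEX IF NOT EXISTS user_email_idx ON users(email)")

exists_after = index_exists(conn, "user_email_idx")
plan_after = query_plan(conn, lookup, params)    # plan should now name the index
```

Comparing the plan before and after the fix gives the same kind of evidence as the query plan cited in the output above.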
Debugging Tools by Platform
Language-Specific
- Python: pdb, ipdb, py-spy, memory_profiler
- JavaScript/Node: Chrome DevTools, node --inspect, ndb
- C/C++/Objective-C: LLDB, Instruments, AddressSanitizer, Valgrind
- Java/Kotlin: JDB, VisualVM, async-profiler
- Go: Delve, pprof, race detector
System-Level
- Linux: strace, ltrace, perf, eBPF/bpftrace
- macOS: dtrace, Instruments, sample, spindump
- Network: Wireshark, tcpdump, mtr, curl -v
- Container: docker logs, kubectl logs, container-diff
Observability
- Logging: ELK Stack, Splunk, Datadog
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Metrics: Prometheus, Grafana, New Relic
- APM: Datadog APM, New Relic, Dynatrace
Log Analysis Patterns
Error Pattern Recognition
- Stack trace analysis and grouping
- Error rate anomaly detection
- Correlation of errors across services
- Timeline reconstruction
Distributed Tracing
- Follow request ID across microservices
- Identify latency contributors
- Find error propagation paths
- Visualize service dependencies
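When no tracing backend is available, the first three points can be approximated directly from structured logs. The sketch below assumes every record carries `request_id`, `service`, `ts`, `self_ms` (time spent in that service, excluding downstream calls), and `level`. These field names and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def timelines(records: List[dict]) -> Dict[str, List[dict]]:
    """Group records by request ID and order each group by timestamp."""
    by_request = defaultdict(list)
    for record in records:
        by_request[record["request_id"]].append(record)
    return {
        rid: sorted(events, key=lambda e: e["ts"])
        for rid, events in by_request.items()
    }


def slowest_hop(events: List[dict]) -> str:
    """Service that contributed the most latency to this request."""
    return max(events, key=lambda e: e["self_ms"])["service"]


def error_path(events: List[dict]) -> List[str]:
    """Services that logged an error, in the order the errors appeared."""
    return [e["service"] for e in events if e["level"] == "ERROR"]


# Made-up records from three services, deliberately out of order.
records = [
    {"request_id": "req-1", "service": "gateway", "ts": 3.0, "self_ms": 5, "level": "ERROR"},
    {"request_id": "req-2", "service": "gateway", "ts": 1.5, "self_ms": 4, "level": "INFO"},
    {"request_id": "req-1", "service": "users-db", "ts": 1.0, "self_ms": 30000, "level": "ERROR"},
    {"request_id": "req-1", "service": "users-api", "ts": 2.0, "self_ms": 12, "level": "ERROR"},
]

by_request = timelines(records)
slow = slowest_hop(by_request["req-1"])
path = error_path(by_request["req-1"])
```

Reading `path` from left to right shows where the failure originated and which services merely propagated it.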
Related Agents
- dario-debugger - Full agent with reasoning and tool expertise
- rex-code-reviewer - Identifies bug-prone patterns
- otto-performance-optimizer - Performance-related debugging
- thor-quality-assurance-guardian - Test gap identification
- luca-security-expert - Security vulnerability investigation
ISE Engineering Fundamentals Alignment
- Build applications test-ready with comprehensive logging
- Use correlation IDs for distributed tracing
- Include contextual metadata in all logs
- Log to external systems for analysis
- Blameless post-mortems for systemic improvements
- Code without tests is incomplete - add regression tests
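Correlation IDs and contextual metadata can be attached to every log line without changing each call site. The sketch below does this with Python's standard `logging` and `contextvars` modules. The logger name, the `cid=` format, and `handle_request` are assumptions made for illustration.

```python
import contextvars
import io
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled ("-" outside a request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("debugging_skill.example")  # hypothetical name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID if present, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        correlation_id.reset(token)      # do not leak the ID into the next request
    return cid


buffer = io.StringIO()                   # stands in for an external log sink
log = build_logger(buffer)
handle_request(log, incoming_id="abc123")
output = buffer.getvalue()
```

Passing the same ID on to downstream services is what makes it possible to follow one request across microservices.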
Source
https://github.com/Roberdan/MyConvergio/blob/master/.claude/skills/debugging/SKILL.md
Overview
This debugging skill applies the scientific method to bug investigation, emphasizing root cause analysis and evidence-based diagnosis across technology stacks. It guides you from reproduction through prevention, reducing outages and recurrence.
How This Skill Works
It follows a repeatable workflow: reproduce, isolate, gather evidence, hypothesize, test hypotheses, identify the root cause, fix and verify, then prevent recurrence. Findings are validated with logs, stack traces, metrics, and distributed traces, using structured techniques such as binary search and the 5 Whys.
Quick Start
- Step 1: Reproduce the issue with exact steps, environment details, and a minimal case
- Step 2: Isolate by narrowing components and performing targeted tests (binary search, version checks)
- Step 3: Gather evidence, hypothesize with 5 Whys, test hypotheses, and implement a root-cause fix plus regression tests
Best Practices
- Reproduce with a minimal, exact reproduction case and document environment details
- Isolate the problem space using binary search and version control (git bisect) to narrow scope
- Gather comprehensive evidence: logs, stack traces, metrics, and distributed traces
- Form testable hypotheses with ranked likelihoods and apply 5 Whys to uncover root causes
- Verify the fix, check for regressions, and implement prevention with tests and monitoring
Example Use Cases
- API endpoint intermittently returns 500 under load; investigation identifies connection pool exhaustion as the root cause
- Slow page response due to an N+1 query pattern; binary search across components localizes it to a data loader
- Memory leak in a long-running service detected via growing RSS and profiling; the fix eliminates the leak and adds cleanup
- CI/CD flaky build traced to environment mismatch; pinning dependencies and aligning runtime versions resolves it
- Race condition in multi-threaded code revealed by inconsistent logs; tracing confirms a timing edge case, and synchronizing access fixes it