What is the debugging workflow?

A scientific, methodical process: reproduce, isolate, collect evidence, hypothesize, test, identify root cause, fix, verify, and prevent recurrence.

How do you ensure you found the root cause with confidence?

Gather multiple evidence sources, validate through controlled experiments, achieve >95% confidence, and confirm no regressions with regression tests and monitoring.

What tools are commonly used in this workflow?

The skill is permitted to use the Read, Glob, Grep, Bash, and Edit tools to gather logs, traces, and metrics and to run reproducible experiments.

debugging

npx machina-cli add skill Roberdan/MyConvergio/debugging --openclaw

Debugging Skill

Reusable workflow extracted from dario-debugger expertise.

Purpose

Systematically investigate and resolve bugs through scientific methodology, root cause analysis, and evidence-based diagnosis across all technology stacks.

When to Use

Production incidents and outages
Intermittent or hard-to-reproduce bugs
Performance degradation investigation
Memory leaks and resource exhaustion
Concurrency issues (race conditions, deadlocks)
Crash analysis and stack trace interpretation
Test failures and CI/CD pipeline issues

Workflow Steps

1. Reproduce
- Confirm issue can be consistently reproduced
- Document exact reproduction steps
- Identify required environment/conditions
- Create minimal reproduction case
2. Isolate
- Narrow down problem space (component, input, timing)
- Use binary search to eliminate possibilities
- Identify affected versions (git bisect)
- Determine scope of impact
3. Gather Evidence
- Collect logs from all relevant systems
- Capture stack traces and error messages
- Record metrics and performance data
- Preserve system state before changes
- Use distributed tracing for microservices
4. Hypothesize
- Form testable hypotheses about root cause
- List potential causes ranked by probability
- Consider symptoms vs actual cause
- Apply 5 Whys technique
5. Test Hypotheses
- Design experiments to prove/disprove each hypothesis
- Use debuggers and profilers to validate
- Check logs for evidence supporting/refuting
- Eliminate possibilities systematically
6. Identify Root Cause
- Determine fundamental issue (not just symptom)
- Verify with >95% confidence
- Document evidence trail
- Distinguish correlation from causation
7. Fix & Verify
- Implement targeted fix for root cause
- Verify fix resolves issue
- Test for regressions
- Measure impact of fix
8. Prevent Recurrence
- Add regression tests
- Implement monitoring/alerting
- Document findings for team
- Update runbooks if applicable

Inputs Required

Bug description: Expected vs actual behavior
Environment: OS, versions, configurations, recent changes
Reproduction: Steps to reproduce (if known)
Evidence: Logs, error messages, screenshots, metrics
Scope: When did it start? How many affected?

Outputs Produced

Root Cause Report: Detailed analysis with evidence
Reproduction Steps: Minimal, reliable reproduction case
Fix Recommendations: Prioritized solutions with trade-offs
Prevention Strategy: How to prevent similar issues
Regression Tests: Tests to verify fix and prevent recurrence

Bug Classification

Priority Levels

🔴 P0 - Critical: System down, data loss, security breach - immediate response
🟠 P1 - High: Major feature broken, significant user impact
🟡 P2 - Medium: Feature degraded, workaround exists
🟢 P3 - Low: Minor issue, cosmetic, edge case

Debugging Techniques

Scientific Method

Observe the problem
Form hypothesis about cause
Design experiment to test hypothesis
Execute test and collect data
Analyze results
Refine hypothesis or conclude

Binary Search Debugging

Divide problem space in half repeatedly
Test midpoint to eliminate half of possibilities
Efficient for narrowing down cause
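
The halving idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the version list, the `first_bad` helper, and the `is_bad` check are hypothetical, and the sketch assumes the first version is good, the last is bad, and every version after the first bad one is also bad (the same assumption git bisect relies on).

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is True.

    Assumes versions are ordered oldest to newest, versions[0] is good,
    versions[-1] is bad, and badness never disappears once introduced.
    """
    lo, hi = 0, len(versions) - 1  # versions[lo] known good, versions[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the bug was introduced at or before mid
        else:
            lo = mid  # the bug was introduced after mid
    return versions[hi]


# Hypothetical history in which the bug first appears in "v2.3".
history = ["v2.0", "v2.1", "v2.2", "v2.3", "v2.4", "v2.5"]
tested = []


def is_bad(version: str) -> bool:
    """Stand-in for running the reproduction case against one version."""
    tested.append(version)
    return history.index(version) >= history.index("v2.3")


culprit = first_bad(history, is_bad)
print(culprit, tested)
```

With six versions, only two of them need to be tested here; in general the number of checks grows with the logarithm of the history length rather than linearly.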

5 Whys Technique

Problem: API endpoint returns 500 error
Why? Database connection failed
Why? Connection pool exhausted
Why? Connections not being released
Why? Missing finally block in error path
Why? Error handling added without proper resource cleanup
Root Cause: Incomplete error handling refactor
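
To make the fourth "Why" concrete, the sketch below contrasts an error path that leaks a connection with one that releases it in a finally block. The `ConnectionPool` class, the handlers, and `run_query` are hypothetical stand-ins written for illustration; they do not come from any real driver or from the skill's source.

```python
class ConnectionPool:
    """Toy pool that only counts how many connections are checked out."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()

    def release(self, conn: object) -> None:
        self.in_use -= 1


def run_query(conn: object, fail: bool) -> str:
    """Stand-in for a database call that may raise."""
    if fail:
        raise ValueError("query failed")
    return "ok"


def handler_leaky(pool: ConnectionPool, fail: bool) -> str:
    conn = pool.acquire()
    result = run_query(conn, fail)  # an exception here skips the release below
    pool.release(conn)
    return result


def handler_fixed(pool: ConnectionPool, fail: bool) -> str:
    conn = pool.acquire()
    try:
        return run_query(conn, fail)
    finally:
        pool.release(conn)  # runs on both the success path and the error path


def leaked_after_errors(handler, attempts: int = 3) -> int:
    """Call a handler that always fails and report connections still held."""
    pool = ConnectionPool(size=5)
    for _ in range(attempts):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass
    return pool.in_use


leaky = leaked_after_errors(handler_leaky)
fixed = leaked_after_errors(handler_fixed)
print(leaky, fixed)
```

Each failed request in the leaky handler keeps one connection checked out, so enough errors eventually exhaust the pool, which matches the chain of answers above.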

Time-Travel Debugging

Use tools such as rr or UndoDB for execution replay
Step backwards through execution
Examine state at any point in time

Example Usage

Input: Production API returning 500 errors intermittently

Workflow Execution:
1. Reproduce: 500 errors occur under load (>100 req/sec)
2. Isolate: Only affects /api/users endpoint, started after v2.3 deploy
3. Evidence: Connection pool at max, slow query log shows 30s timeouts
4. Hypothesis: Query performance degraded with new schema
5. Test: EXPLAIN ANALYZE shows missing index after migration
6. Root Cause: Migration script failed to create user_email_idx index
7. Fix: CREATE INDEX user_email_idx; query time drops to 50ms
8. Prevent: Add index existence check to health endpoint

Output:
ROOT CAUSE: Missing database index after incomplete migration
EVIDENCE: Query plan shows seq scan, migration log shows index creation failed
FIX: Manual index creation, update migration with IF NOT EXISTS
PREVENTION: Added database index monitoring, migration dry-run validation
CONFIDENCE: 99%
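
The test and prevention steps in this example can be approximated with a small, self-contained script. It is a sketch under stated assumptions: it uses SQLite's EXPLAIN QUERY PLAN from the Python standard library as a stand-in for the EXPLAIN ANALYZE mentioned above, and the `users` table, its `email` column, and the helper names are hypothetical. Only the index name `user_email_idx` comes from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

QUERY = "SELECT name FROM users WHERE email = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's textual plan for QUERY (last column of each plan row)."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("user42@example.com",)
    ).fetchall()
    return " | ".join(row[-1] for row in rows)


def index_exists(connection: sqlite3.Connection, name: str) -> bool:
    """Check the schema catalog for an index, as a health check might."""
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?", (name,)
    ).fetchone()
    return row is not None


plan_before = query_plan(conn)  # full table scan, no index available
exists_before = index_exists(conn, "user_email_idx")

# Idempotent fix: safe to re-run if the migration is applied twice.
conn.execute("CREATE INDEX IF NOT EXISTS user_email_idx ON users (email)")

plan_after = query_plan(conn)  # plan now references user_email_idx
exists_after = index_exists(conn, "user_email_idx")

print(plan_before)
print(plan_after)
```

The exact plan wording varies between SQLite versions, so the useful signal is whether the index name appears in the plan, not the literal text.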

Debugging Tools by Platform

Language-Specific

Python: pdb, ipdb, py-spy, memory_profiler
JavaScript/Node: Chrome DevTools, node --inspect, ndb
C/C++/Objective-C: LLDB, Instruments, AddressSanitizer, Valgrind
Java/Kotlin: JDB, VisualVM, async-profiler
Go: Delve, pprof, race detector

System-Level

Linux: strace, ltrace, perf, eBPF/bpftrace
macOS: dtrace, Instruments, sample, spindump
Network: Wireshark, tcpdump, mtr, curl -v
Container: docker logs, kubectl logs, container-diff

Observability

Logging: ELK Stack, Splunk, Datadog
Tracing: Jaeger, Zipkin, OpenTelemetry
Metrics: Prometheus, Grafana, New Relic
APM: Datadog APM, New Relic, Dynatrace

Log Analysis Patterns

Error Pattern Recognition

Stack trace analysis and grouping
Error rate anomaly detection
Correlation of errors across services
Timeline reconstruction
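
Grouping errors usually starts with normalizing away the parts of a message that change from one occurrence to the next. The sketch below shows one simple way to do that; the log lines and the `signature` helper are made up for illustration, and real log formats will likely need different patterns.

```python
import re
from collections import Counter


def signature(message: str) -> str:
    """Collapse variable parts of an error message into placeholders."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)  # memory addresses
    message = re.sub(r"\d+", "<n>", message)  # ids, durations, counters
    return message


# Hypothetical error lines pulled from an application log.
logs = [
    "TimeoutError: query exceeded 30s on connection 17",
    "TimeoutError: query exceeded 30s on connection 4",
    "KeyError: user 1093 not found",
    "TimeoutError: query exceeded 31s on connection 22",
    "KeyError: user 88 not found",
]

groups = Counter(signature(line) for line in logs)
top = groups.most_common(1)[0]
print(top)
```

Counting by signature turns five distinct lines into two groups, which makes it easier to see which error dominates and to track its rate over time.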

Distributed Tracing

Follow request ID across microservices
Identify latency contributors
Find error propagation paths
Visualize service dependencies
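
Without a tracing backend, following a request ID across services and finding the main latency contributor can be approximated from structured logs. The records, service names, and helper functions below are hypothetical; a tracing system such as those listed earlier would normally do this with spans.

```python
# Hypothetical log records: (timestamp_ms, service, request_id, event)
records = [
    (1005, "users-service", "req-1", "start"),
    (1000, "gateway", "req-1", "start"),
    (1002, "gateway", "req-2", "start"),
    (1040, "db", "req-1", "query start"),
    (1900, "db", "req-1", "query end"),
    (1910, "users-service", "req-1", "end"),
    (1030, "gateway", "req-2", "end"),
    (1915, "gateway", "req-1", "end"),
]


def timeline(all_records, request_id):
    """Return one request's events across all services, ordered by time."""
    return sorted(r for r in all_records if r[2] == request_id)


def largest_gap(events):
    """Return (gap_ms, earlier_event, later_event) for the biggest time gap."""
    pairs = zip(events, events[1:])
    return max(((b[0] - a[0], a, b) for a, b in pairs), key=lambda item: item[0])


events = timeline(records, "req-1")
gap_ms, before, after = largest_gap(events)

for ts, service, _, event in events:
    print(ts, service, event)
print("largest gap:", gap_ms, "ms between", before[1], before[3], "and", after[1], after[3])
```

In this made-up data the 860 ms gap sits between the database's "query start" and "query end" events, which points the investigation at the query, not at the gateway or the service code.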

Related Agents

dario-debugger - Full agent with reasoning and tool expertise
rex-code-reviewer - Identifies bug-prone patterns
otto-performance-optimizer - Performance-related debugging
thor-quality-assurance-guardian - Test gap identification
luca-security-expert - Security vulnerability investigation

ISE Engineering Fundamentals Alignment

Build applications test-ready with comprehensive logging
Use correlation IDs for distributed tracing
Include contextual metadata in all logs
Log to external systems for analysis
Blameless post-mortems for systemic improvements
Code without tests is incomplete - add regression tests
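
One way to attach a correlation ID to every log line, as the list above recommends, is a logging filter that reads the ID from a context variable. This is a minimal sketch using only the Python standard library; the logger name, the `handle_request` function, the messages, and the ID format are assumptions made for illustration.

```python
import contextvars
import io
import logging

# Holds the ID of the request currently being handled ("-" when there is none).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # stands in for an external log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
)
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to our handler


def handle_request(request_id: str) -> None:
    """Hypothetical request handler that logs with the request's ID attached."""
    token = correlation_id.set(request_id)
    try:
        logger.info("request received")
        logger.error("payment lookup failed")
    finally:
        correlation_id.reset(token)  # avoid leaking the ID into later work


handle_request("req-42")
output = stream.getvalue()
print(output, end="")
```

Because the ID lives in a context variable, code deeper in the call stack can log without having the ID passed to it explicitly.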

Source

View on GitHub: https://github.com/Roberdan/MyConvergio/blob/master/.claude/skills/debugging/SKILL.md

Overview

This debugging skill applies the scientific method to bug investigation, emphasizing root cause analysis and evidence-based diagnosis across technology stacks. It guides you from reproduction through prevention, with the aim of reducing outages and recurrence.

How This Skill Works

It follows a repeatable workflow: reproduce, isolate, gather evidence, hypothesize, test hypotheses, identify the root cause, fix and verify, then implement prevention. Rigor comes from logs, stack traces, metrics, distributed tracing, and controlled experiments, supported by techniques such as binary search and the 5 Whys to narrow down and validate findings.

When to Use It

Production incidents and outages
Intermittent or hard-to-reproduce bugs
Performance degradation investigation
Memory leaks and resource exhaustion
Concurrency issues (race conditions, deadlocks)

Quick Start

Step 1: Reproduce issue with exact steps, environment, and minimal case
Step 2: Isolate by narrowing components and performing targeted tests (binary search, version checks)
Step 3: Gather evidence, hypothesize with 5 Whys, test hypotheses, and implement a root-cause fix plus regression tests

Best Practices

Reproduce with a minimal, exact reproduction case and document environment details
Isolate the problem space using binary search and version control (git bisect) to narrow scope
Gather comprehensive evidence: logs, stack traces, metrics, and distributed traces
Form testable hypotheses with ranked likelihoods and apply 5 Whys to uncover root causes
Verify the fix, check for regressions, and implement prevention with tests and monitoring

Example Use Cases

API endpoint intermittently returns 500 under load; investigation identifies connection pool exhaustion as the root cause
Slow page response due to an N+1 query pattern; binary search across components localizes it to a data loader
Memory leak in a long-running service detected via growing RSS and profiling; the fix eliminates the leak and adds cleanup
CI/CD flaky build traced to environment mismatch; pinning dependencies and aligning runtime versions resolves it
Race condition in multi-threaded code revealed by inconsistent logs; tracing confirms timing edge case and synchronized access fixes it
