Dist Debug
npx machina-cli add skill Langerrr/distributed-architect/dist-debug --openclaw

dist-debug — Debug-Time Root Cause Analysis
Trace distributed system failures backward from symptom to root cause. In distributed systems, the symptom component and the cause component are often different. Trace backward, don't just fix where it hurts.
Arguments
$ARGUMENTS may contain a symptom description, error message, or log snippet.
Step 1: Gather Symptoms
From $ARGUMENTS or by asking the user:
- What's the observable behavior? (error message, unexpected state, performance issue)
- Which component reported the error? (this is the SYMPTOM component)
- When does it happen? (always, intermittently, under load, after a specific event)
- Any recent changes? (new code, config change, scaling event)
Step 2: Load Project Topology
Load the topology file to understand the communication graph. The root cause is almost always at a boundary — knowing the boundaries narrows the search.
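As a minimal sketch of this step, the topology can be loaded and indexed by edge target so that backward lookups ("who feeds this component?") are cheap. The JSON shape, component names, and channel labels below are assumptions for illustration, not the project's actual topology format:

```python
import json

# Hypothetical topology format -- adapt to your project's actual file.
# "components" are nodes; "edges" are directed communication paths.
TOPOLOGY_JSON = """
{
  "components": ["cart", "payment", "orders"],
  "edges": [
    {"from": "cart", "to": "payment", "channel": "http"},
    {"from": "payment", "to": "orders", "channel": "queue"}
  ]
}
"""

def load_topology(raw: str) -> dict:
    """Parse the topology and index edges by target for backward lookups."""
    topo = json.loads(raw)
    inbound = {c: [] for c in topo["components"]}
    for edge in topo["edges"]:
        inbound[edge["to"]].append(edge)
    topo["inbound"] = inbound
    return topo

topo = load_topology(TOPOLOGY_JSON)
# Components feeding "payment" -- these are the boundaries to inspect first.
print([e["from"] for e in topo["inbound"]["payment"]])
```

Indexing by target (rather than source) matches the skill's direction of travel: the trace moves from symptom back toward cause.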
Step 3: Identify the Symptom Boundary
From the topology:
- What components feed into the symptom component?
- What data/state does the symptom component depend on from other components?
- What communication paths lead to the symptom component?
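The three questions above reduce to a transitive reachability query over the topology. A sketch, with an illustrative edge list (component names are assumptions):

```python
from collections import deque

# Directed edges: (upstream, downstream). Hypothetical example graph.
EDGES = [("auth", "api"), ("db", "api"), ("cache", "auth")]

def upstream_of(component: str, edges) -> set:
    """All components whose state or data can reach `component`, transitively."""
    feeds = {}
    for src, dst in edges:
        feeds.setdefault(dst, []).append(src)
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        for src in feeds.get(node, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

# Everything that can influence the symptom component "api":
print(sorted(upstream_of("api", EDGES)))
```

Anything outside this set cannot be the root cause via the topology's communication paths, which is what makes the boundary search tractable.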
Step 4: Backward Trace
Starting from the symptom, trace backward through the topology:
Symptom: [what's observed, in which component]
<- Depends on: [state/data from component X]
<- Which depends on: [operation in component Y]
<- Which could fail if: [condition in component Z]
At each hop, ask:
- Could this component be in an unexpected state?
- Could the data arriving here be stale, missing, or corrupted?
- Could timing/ordering explain the symptom?
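The backward trace and per-hop questions above can be sketched as a small chain walker. The chain entries and component names are illustrative, not derived from a real incident:

```python
# One dependency chain, symptom first, each later entry one hop upstream.
CHAIN = [
    ("orders", "order stuck in PENDING"),          # symptom
    ("payment", "emits payment-confirmed event"),  # depends on
    ("queue", "delivers event exactly once"),      # which depends on
]

# The three questions to ask at every hop of the backward trace.
HOP_QUESTIONS = (
    "unexpected state?",
    "stale/missing/corrupted data arriving?",
    "timing or ordering issue?",
)

def backward_trace(chain):
    lines = [f"Symptom: {chain[0][1]} (in {chain[0][0]})"]
    for component, assumption in chain[1:]:
        lines.append(f"<- Depends on: {assumption} (in {component})")
        for q in HOP_QUESTIONS:
            lines.append(f"   check: {q}")
    return lines

for line in backward_trace(CHAIN):
    print(line)
```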
Step 5: Anti-Pattern Match
Read the catalog entries from the plugin's catalog/ directory and check the symptom shape:
| Symptom shape | Likely anti-pattern |
|---|---|
| Rapid repeated failures from same component | AP-1: Tight Retry Loop |
| Data missing that "should be there" | AP-2: Lost Data at ACK |
| Operation marked failed but actually succeeded | AP-3: Channel Conflation |
| Component making wrong decisions about another's state | AP-4: Boundary State Leak |
| Far more retries than configured | AP-5: Compounding Retry |
| Failures immediately after recovery | AP-6: Premature State Transition |
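A rough keyword matcher over the table above can pre-rank catalog entries before reading them in full. The keyword sets here are illustrative assumptions, not the catalog's own matching logic:

```python
# Symptom-shape keywords per catalog entry (AP-1..AP-6 from the table).
ANTI_PATTERNS = {
    "AP-1": ({"rapid", "repeated", "retry"}, "Tight Retry Loop"),
    "AP-2": ({"missing", "data"}, "Lost Data at ACK"),
    "AP-3": ({"failed", "succeeded"}, "Channel Conflation"),
    "AP-4": ({"wrong", "state"}, "Boundary State Leak"),
    "AP-5": ({"more", "retries", "configured"}, "Compounding Retry"),
    "AP-6": ({"after", "recovery"}, "Premature State Transition"),
}

def match_anti_patterns(symptom: str):
    """Return (id, name) candidates, strongest keyword overlap first."""
    words = set(symptom.lower().split())
    hits = []
    for ap_id, (keywords, name) in ANTI_PATTERNS.items():
        overlap = len(keywords & words)
        if overlap:
            hits.append((overlap, ap_id, name))
    ranked = [(ap, name) for _, ap, name in sorted(hits, reverse=True)]
    return ranked or [(None, "no known pattern match")]

print(match_anti_patterns("rapid repeated failures with retry storm"))
```

A keyword hit is only a prompt to read that catalog entry; the backward trace, not the match, decides whether the pattern actually applies.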
Step 6: Hypothesis Formation
Form 1-3 hypotheses based on the backward trace and anti-pattern match:
Hypothesis 1 (highest confidence):
Root cause: [what's actually wrong]
Boundary: [which boundary the issue crosses]
Mechanism: [how the cause produces the symptom]
Verification: [how to confirm — specific log, query, or test]
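The hypothesis template above maps directly to a small record type, useful when tracking 1-3 hypotheses at once. Field values below are invented examples:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    root_cause: str      # what's actually wrong
    boundary: str        # which boundary the issue crosses
    mechanism: str       # how the cause produces the symptom
    verification: str    # specific log, query, or test to confirm
    confidence: float    # 0.0-1.0, used to rank the hypotheses

hypotheses = [
    Hypothesis("queue drops messages on redeploy", "payment -> orders",
               "confirmed events lost, so orders stay PENDING",
               "grep broker logs for unacked deliveries", 0.8),
    Hypothesis("clock skew on payment host", "cart -> payment",
               "timestamps order events incorrectly",
               "compare NTP offsets across hosts", 0.3),
]

# Highest-confidence hypothesis gets verified first.
hypotheses.sort(key=lambda h: h.confidence, reverse=True)
print(hypotheses[0].root_cause)
```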
Step 7: Forward Trace the Fix
For each hypothesis, take the proposed fix and trace it forward through the same path the backward trace established. The backward trace gave you the chain; now run the fix through it in the opposite direction:
Fix applied at: [root cause component]
-> Does it change: [state/data at component Y]?
-> Does that resolve: [the dependency at component X]?
-> Does the symptom disappear?
At each hop, ask: does this resolve for all data, or only for new data?
If any hop answers "only going forward" — the fix leaves accumulated damage from the bug's lifetime. Flag what cleanup is needed (purge, migration, backfill) and whether it can run at startup, as a one-off script, or requires a migration.
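The "all data vs. only new data" check can be made explicit per hop, so accumulated damage is flagged rather than forgotten. Hop descriptions and answers below are illustrative:

```python
# Forward-trace the fix along the chain from Step 4. Each hop records
# whether the fix resolves the issue for "all" data or "only going forward".
FORWARD_HOPS = [
    ("apply fix at payment: re-emit event on NACK", "all"),
    ("orders consumes the re-emitted events", "all"),
    ("orders table already holds stuck PENDING rows", "only going forward"),
]

def forward_check(hops):
    """Summarize the forward trace and list hops needing cleanup."""
    cleanup_needed = [step for step, scope in hops
                      if scope == "only going forward"]
    return {
        "symptom_resolved_for_new_data": all(scope in ("all", "only going forward")
                                             for _, scope in hops),
        "cleanup_needed": cleanup_needed,  # purge / migration / backfill
    }

result = forward_check(FORWARD_HOPS)
print(result["cleanup_needed"])
```

Any entry in `cleanup_needed` means the fix alone is incomplete: decide whether the cleanup can run at startup, as a one-off script, or requires a migration.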
Step 8: Output
## Distributed Debug Analysis
**Symptom**: [description]
**Symptom component**: [where it manifests]
**Probable cause component**: [where it originates]
**Boundary involved**: [which inter-component boundary]
### Backward Trace
[The chain from symptom to probable cause]
### Anti-Pattern Match
[Matching pattern, or "no known pattern match"]
### Hypotheses
[Ranked by confidence with verification steps]
### Forward Trace
[For top hypothesis: trace the fix forward through the same path. Flag any hop where the fix only addresses future data, not existing state.]
### Immediate Investigation Steps
1. [Most targeted check to confirm/deny top hypothesis]
2. [Next check]
3. [Fallback investigation if above don't confirm]
Source
https://github.com/Langerrr/distributed-architect/blob/main/skills/dist-debug/SKILL.md
Overview
dist-debug provides backward root-cause tracing from observed symptoms to their origin across distributed components. It emphasizes inter-component boundaries to diagnose cross-service incidents rather than just patching the symptom.
How This Skill Works
Start with symptom details, load the project topology to identify boundaries, then walk the trace backward through dependencies to locate the boundary likely responsible. Use anti-pattern checks to form 1-3 hypotheses and verify them with targeted forward tracing before applying fixes.
When to Use It
- Investigating a production incident with cross-component failures.
- Symptom observed in one component but root cause lies in a dependent boundary.
- Intermittent failures or latency that correlate with load or events.
- Data missing or corrupted as it travels between services.
- Failures after deployment or config change where inter-component boundaries may be affected.
Quick Start
- Step 1: Gather Symptoms from the $ARGUMENTS input (observable behavior, error messages, logs).
- Step 2: Load the topology to locate the symptom boundary and dependencies.
- Step 3: Perform a backward trace from the symptom, form hypotheses, then verify with a forward trace.
Best Practices
- Gather clear symptom details: observable behavior, exact component, timing, and recent changes.
- Load and study the topology to identify the boundary around the symptom.
- Perform a backward trace from the symptom through the topology and note dependencies.
- Check anti-patterns (AP-1 to AP-6) to guide hypotheses and avoid common pitfalls.
- Form 1-3 hypotheses and verify them with a forward-trace of the proposed fix.
Example Use Cases
- Checkout failure traced to a boundary mismatch between cart and payment services.
- Intermittent 500 errors caused by stale session data flowing from an auth service to the main API.
- Latency spike after a feature-flag change due to timing/order issues across services.
- Data missing during a CDC event leading to downstream inconsistency.
- Rapid repeated failures in a component caused by a tight retry loop (AP-1).