Dist Debug
npx machina-cli add skill Langerrr/distributed-architect/dist-debug --openclaw

dist-debug — Debug-Time Root Cause Analysis
Trace distributed system failures backward from symptom to root cause. In distributed systems, the symptom component and the cause component are often different. Trace backward, don't just fix where it hurts.
Arguments
$ARGUMENTS may contain a symptom description, error message, or log snippet.
Step 1: Gather Symptoms
From $ARGUMENTS or by asking the user:
- What's the observable behavior? (error message, unexpected state, performance issue)
- Which component reported the error? (this is the SYMPTOM component)
- When does it happen? (always, intermittently, under load, after a specific event)
- Any recent changes? (new code, config change, scaling event)
Step 2: Load Project Topology
Load the topology file to understand the communication graph. The root cause is almost always at a boundary — knowing the boundaries narrows the search.
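As a minimal sketch of this step, the topology can be loaded and indexed by edge target so that backward lookups ("who feeds this component?") are cheap. The JSON shape, component names, and channel labels below are assumptions for illustration, not the project's actual topology format:

```python
import json

# Hypothetical topology format -- adapt to your project's actual file.
# "components" are nodes; "edges" are directed communication paths.
TOPOLOGY_JSON = """
{
  "components": ["cart", "payment", "orders"],
  "edges": [
    {"from": "cart", "to": "payment", "channel": "http"},
    {"from": "payment", "to": "orders", "channel": "queue"}
  ]
}
"""

def load_topology(raw: str) -> dict:
    """Parse the topology and index edges by target for backward lookups."""
    topo = json.loads(raw)
    inbound = {c: [] for c in topo["components"]}
    for edge in topo["edges"]:
        inbound[edge["to"]].append(edge)
    topo["inbound"] = inbound
    return topo

topo = load_topology(TOPOLOGY_JSON)
# Components feeding "payment" -- these are the boundaries to inspect first.
print([e["from"] for e in topo["inbound"]["payment"]])
```

Indexing by target (rather than source) matches the skill's direction of travel: the trace moves from symptom back toward cause.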
Step 3: Identify the Symptom Boundary
From the topology:
- What components feed into the symptom component?
- What data/state does the symptom component depend on from other components?
- What communication paths lead to the symptom component?
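The three questions above reduce to a transitive reachability query over the topology. A sketch, with an illustrative edge list (component names are assumptions):

```python
from collections import deque

# Directed edges: (upstream, downstream). Hypothetical example graph.
EDGES = [("auth", "api"), ("db", "api"), ("cache", "auth")]

def upstream_of(component: str, edges) -> set:
    """All components whose state or data can reach `component`, transitively."""
    feeds = {}
    for src, dst in edges:
        feeds.setdefault(dst, []).append(src)
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        for src in feeds.get(node, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

# Everything that can influence the symptom component "api":
print(sorted(upstream_of("api", EDGES)))
```

Anything outside this set cannot be the root cause via the topology's communication paths, which is what makes the boundary search tractable.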
Step 4: Backward Trace
Starting from the symptom, trace backward through the topology:
Symptom: [what's observed, in which component]
<- Depends on: [state/data from component X]
<- Which depends on: [operation in component Y]
<- Which could fail if: [condition in component Z]
At each hop, ask:
- Could this component be in an unexpected state?
- Could the data arriving here be stale, missing, or corrupted?
- Could timing/ordering explain the symptom?
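The backward trace and per-hop questions above can be sketched as a small chain walker. The chain entries and component names are illustrative, not derived from a real incident:

```python
# One dependency chain, symptom first, each later entry one hop upstream.
CHAIN = [
    ("orders", "order stuck in PENDING"),          # symptom
    ("payment", "emits payment-confirmed event"),  # depends on
    ("queue", "delivers event exactly once"),      # which depends on
]

# The three questions to ask at every hop of the backward trace.
HOP_QUESTIONS = (
    "unexpected state?",
    "stale/missing/corrupted data arriving?",
    "timing or ordering issue?",
)

def backward_trace(chain):
    lines = [f"Symptom: {chain[0][1]} (in {chain[0][0]})"]
    for component, assumption in chain[1:]:
        lines.append(f"<- Depends on: {assumption} (in {component})")
        for q in HOP_QUESTIONS:
            lines.append(f"   check: {q}")
    return lines

for line in backward_trace(CHAIN):
    print(line)
```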
Step 5: Anti-Pattern Match
Read the catalog entries from the plugin's catalog/ directory and check the symptom shape:
| Symptom shape | Likely anti-pattern |
|---|---|
| Rapid repeated failures from same component | AP-1: Tight Retry Loop |
| Data missing that "should be there" | AP-2: Lost Data at ACK |
| Operation marked failed but actually succeeded | AP-3: Channel Conflation |
| Component making wrong decisions about another's state | AP-4: Boundary State Leak |
| Far more retries than configured | AP-5: Compounding Retry |
| Failures immediately after recovery | AP-6: Premature State Transition |
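A rough keyword matcher over the table above can pre-rank catalog entries before reading them in full. The keyword sets here are illustrative assumptions, not the catalog's own matching logic:

```python
# Symptom-shape keywords per catalog entry (AP-1..AP-6 from the table).
ANTI_PATTERNS = {
    "AP-1": ({"rapid", "repeated", "retry"}, "Tight Retry Loop"),
    "AP-2": ({"missing", "data"}, "Lost Data at ACK"),
    "AP-3": ({"failed", "succeeded"}, "Channel Conflation"),
    "AP-4": ({"wrong", "state"}, "Boundary State Leak"),
    "AP-5": ({"more", "retries", "configured"}, "Compounding Retry"),
    "AP-6": ({"after", "recovery"}, "Premature State Transition"),
}

def match_anti_patterns(symptom: str):
    """Return (id, name) candidates, strongest keyword overlap first."""
    words = set(symptom.lower().split())
    hits = []
    for ap_id, (keywords, name) in ANTI_PATTERNS.items():
        overlap = len(keywords & words)
        if overlap:
            hits.append((overlap, ap_id, name))
    ranked = [(ap, name) for _, ap, name in sorted(hits, reverse=True)]
    return ranked or [(None, "no known pattern match")]

print(match_anti_patterns("rapid repeated failures with retry storm"))
```

A keyword hit is only a prompt to read that catalog entry; the backward trace, not the match, decides whether the pattern actually applies.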
Step 6: Hypothesis Formation
Form 1-3 hypotheses based on the backward trace and anti-pattern match:
Hypothesis 1 (highest confidence):
Root cause: [what's actually wrong]
Boundary: [which boundary the issue crosses]
Mechanism: [how the cause produces the symptom]
Verification: [how to confirm — specific log, query, or test]
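The hypothesis template above maps directly to a small record type, useful when tracking 1-3 hypotheses at once. Field values below are invented examples:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    root_cause: str      # what's actually wrong
    boundary: str        # which boundary the issue crosses
    mechanism: str       # how the cause produces the symptom
    verification: str    # specific log, query, or test to confirm
    confidence: float    # 0.0-1.0, used to rank the hypotheses

hypotheses = [
    Hypothesis("queue drops messages on redeploy", "payment -> orders",
               "confirmed events lost, so orders stay PENDING",
               "grep broker logs for unacked deliveries", 0.8),
    Hypothesis("clock skew on payment host", "cart -> payment",
               "timestamps order events incorrectly",
               "compare NTP offsets across hosts", 0.3),
]

# Highest-confidence hypothesis gets verified first.
hypotheses.sort(key=lambda h: h.confidence, reverse=True)
print(hypotheses[0].root_cause)
```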
Step 7: Forward Trace the Fix
For each hypothesis, take the proposed fix and trace it forward through the same path the backward trace established. The backward trace gave you the chain; now run the fix through it in the opposite direction:
Fix applied at: [root cause component]
-> Does it change: [state/data at component Y]?
-> Does that resolve: [the dependency at component X]?
-> Does the symptom disappear?
At each hop, ask: does this resolve for all data, or only for new data?
If any hop answers "only going forward" — the fix leaves accumulated damage from the bug's lifetime. Flag what cleanup is needed (purge, migration, backfill) and whether it can run at startup, as a one-off script, or requires a migration.
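The "all data vs. only new data" check can be made explicit per hop, so accumulated damage is flagged rather than forgotten. Hop descriptions and answers below are illustrative:

```python
# Forward-trace the fix along the chain from Step 4. Each hop records
# whether the fix resolves the issue for "all" data or "only going forward".
FORWARD_HOPS = [
    ("apply fix at payment: re-emit event on NACK", "all"),
    ("orders consumes the re-emitted events", "all"),
    ("orders table already holds stuck PENDING rows", "only going forward"),
]

def forward_check(hops):
    """Summarize the forward trace and list hops needing cleanup."""
    cleanup_needed = [step for step, scope in hops
                      if scope == "only going forward"]
    return {
        "symptom_resolved_for_new_data": all(scope in ("all", "only going forward")
                                             for _, scope in hops),
        "cleanup_needed": cleanup_needed,  # purge / migration / backfill
    }

result = forward_check(FORWARD_HOPS)
print(result["cleanup_needed"])
```

Any entry in `cleanup_needed` means the fix alone is incomplete: decide whether the cleanup can run at startup, as a one-off script, or requires a migration.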
Step 8: Output
## Distributed Debug Analysis
**Symptom**: [description]
**Symptom component**: [where it manifests]
**Probable cause component**: [where it originates]
**Boundary involved**: [which inter-component boundary]
### Backward Trace
[The chain from symptom to probable cause]
### Anti-Pattern Match
[Matching pattern, or "no known pattern match"]
### Hypotheses
[Ranked by confidence with verification steps]
### Forward Trace
[For top hypothesis: trace the fix forward through the same path. Flag any hop where the fix only addresses future data, not existing state.]
### Immediate Investigation Steps
1. [Most targeted check to confirm/deny top hypothesis]
2. [Next check]
3. [Fallback investigation if above don't confirm]
Source
https://github.com/Langerrr/distributed-architect/blob/main/skills/dist-debug/SKILL.md
Overview
dist-debug provides backward root-cause tracing from observed symptoms to their origin across distributed components. It emphasizes inter-component boundaries to diagnose cross-service incidents rather than just patching the symptom.
How This Skill Works
Start with symptom details, load the project topology to identify boundaries, then walk the trace backward through dependencies to locate the boundary likely responsible. Use anti-pattern checks to form 1-3 hypotheses and verify them with targeted forward tracing before applying fixes.
When to Use It
- Investigating a production incident with cross-component failures.
- Symptom observed in one component but root cause lies in a dependent boundary.
- Intermittent failures or latency that correlate with load or events.
- Data missing or corrupted as it travels between services.
- Failures after deployment or config change where inter-component boundaries may be affected.
Quick Start
- Step 1: Gather Symptoms from the $ARGUMENTS input (observable behavior, error messages, logs).
- Step 2: Load the topology to locate the symptom boundary and dependencies.
- Step 3: Perform a backward trace from the symptom, form hypotheses, then verify with a forward trace.
Best Practices
- Gather clear symptom details: observable behavior, exact component, timing, and recent changes.
- Load and study the topology to identify the boundary around the symptom.
- Perform a backward trace from the symptom through the topology and note dependencies.
- Check anti-patterns (AP-1 to AP-6) to guide hypotheses and avoid common pitfalls.
- Form 1-3 hypotheses and verify them with a forward-trace of the proposed fix.
Example Use Cases
- Checkout failure traced to a boundary mismatch between cart and payment services.
- Intermittent 500 errors caused by stale session data flowing from an auth service to the main API.
- Latency spike after a feature-flag change due to timing/order issues across services.
- Data missing during a CDC event leading to downstream inconsistency.
- Rapid repeated failures in a component caused by a tight retry loop (AP-1).