
error-handling

npx machina-cli add skill zircote/claude-team-orchestration/error-handling --openclaw

Error Handling

Debug, recover from, and prevent common agent team errors. Includes hooks for quality enforcement and known limitations.


Common Errors

| Error | Cause | Solution |
|---|---|---|
| "Cannot cleanup with active members" | Teammates still running | Shut down all teammates first, wait for approval |
| "Already leading a team" | Team already exists | TeamDelete() first, or use a different team name |
| "Agent not found" | Wrong teammate name | Read config.json for actual names |
| "Team does not exist" | No team created | Call TeamCreate() first |
| "team_name is required" | Missing team context | Provide team_name parameter |
| "Agent type not found" | Invalid subagent_type | Check available agents with proper prefix |
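
The "Agent not found" fix can be scripted. The following is a minimal sketch, assuming config.json holds a top-level `members` array whose entries carry a `name` field (the shape implied by the jq query under Debugging Commands). `member_names` and `resolve_teammate` are hypothetical helper names, not part of any team API.

```python
import json
from pathlib import Path


def member_names(config_path):
    """Return the teammate names listed in a team config.json.

    Assumes a top-level "members" array of objects with a "name" key.
    """
    config = json.loads(Path(config_path).read_text())
    return [m["name"] for m in config.get("members", []) if "name" in m]


def resolve_teammate(requested, names):
    """Return the exact name if present, else a case-insensitive match, else None."""
    if requested in names:
        return requested
    lowered = {n.lower(): n for n in names}
    return lowered.get(requested.lower())
```

For a real team, pass `Path.home() / ".claude" / "teams" / team / "config.json"` to `member_names` and use the resolved name as the message recipient.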

Quality Gate Hooks

Use hooks to enforce rules when teammates finish work or tasks complete.

TeammateIdle Hook

Runs when a teammate is about to go idle. Exit with code 2 to send feedback and keep the teammate working.

{
  "hooks": {
    "TeammateIdle": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "python3 check_teammate_quality.py"
          }
        ]
      }
    ]
  }
}

Use cases:

  • Verify teammate completed all assigned tasks before going idle
  • Run linting or tests on teammate's changes
  • Enforce documentation requirements

Exit codes:

  • 0 - Allow teammate to go idle normally
  • 2 - Send feedback to teammate, keep them working
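
The source does not show what `check_teammate_quality.py` contains. The following is a rough sketch of one possibility, assuming the hook runner relays the script's stderr to the teammate as feedback when the exit code is 2. The command in `CHECKS` is a placeholder, and `run_checks` and `main` are illustrative names.

```python
import subprocess
import sys

# Placeholder check (byte-compiles the tree); replace with your project's linters or tests.
CHECKS = [
    [sys.executable, "-m", "compileall", "-q", "."],
]

ALLOW_IDLE = 0    # teammate goes idle normally
KEEP_WORKING = 2  # feedback is sent and the teammate keeps working


def run_checks(checks):
    """Run each command and return a list of failure descriptions."""
    failures = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            output = (result.stdout + result.stderr).strip()
            failures.append(f"{' '.join(cmd)} failed:\n{output}")
    return failures


def main(checks=CHECKS):
    """Return the exit code the hook should use."""
    failures = run_checks(checks)
    if failures:
        # Assumption: stderr is what gets relayed to the teammate.
        print("\n\n".join(failures), file=sys.stderr)
        return KEEP_WORKING
    return ALLOW_IDLE
```

To use it as the hook command, end the script with `sys.exit(main())` so the return value becomes the process exit code.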

TaskCompleted Hook

Runs when a task is being marked complete. Exit with code 2 to prevent completion and send feedback.

{
  "hooks": {
    "TaskCompleted": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "python3 validate_task_completion.py"
          }
        ]
      }
    ]
  }
}

Use cases:

  • Verify tests pass before marking a task complete
  • Ensure code quality standards are met
  • Validate documentation was updated

Exit codes:

  • 0 - Allow task completion
  • 2 - Prevent completion, send feedback to teammate
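
A test gate for `validate_task_completion.py` might look like the sketch below. It assumes the project's tests run under pytest and that stderr is what reaches the teammate as feedback; neither is confirmed by the source. `TEST_CMD`, `validate`, and `main` are illustrative names.

```python
import subprocess
import sys

# Assumption: tests run with pytest; substitute your own command.
TEST_CMD = [sys.executable, "-m", "pytest", "-q"]
MAX_FEEDBACK_CHARS = 2000  # keep the feedback short enough to read


def validate(test_cmd=TEST_CMD, timeout=600):
    """Return (exit_code, feedback): 0 allows completion, 2 prevents it."""
    try:
        result = subprocess.run(
            test_cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 2, f"Tests did not finish within {timeout}s; task not completed."
    if result.returncode == 0:
        return 0, ""
    # Only the tail of the output is kept, where test summaries usually appear.
    tail = (result.stdout + result.stderr)[-MAX_FEEDBACK_CHARS:]
    return 2, f"Tests failed; fix them before completing the task.\n{tail}"


def main():
    """Return the exit code the hook should use."""
    code, feedback = validate()
    if feedback:
        print(feedback, file=sys.stderr)
    return code
```

As with the idle hook, finishing the script with `sys.exit(main())` turns the result into the exit code. The timeout guards against a hung test run blocking task completion indefinitely.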

Known Limitations

Agent teams are experimental. Current limitations:

  1. No session resumption with in-process teammates: /resume and /rewind do not restore in-process teammates. After resuming, the lead may try to message teammates that no longer exist. Tell the lead to spawn new teammates.

  2. Task status can lag: Teammates sometimes fail to mark tasks as completed, which blocks dependent tasks. Check whether work is done and update status manually, or tell the lead to nudge the teammate.

  3. Shutdown can be slow: Teammates finish their current request or tool call before shutting down.

  4. One team per session: A lead can only manage one team at a time. Clean up the current team before starting a new one.

  5. No nested teams: Teammates cannot spawn their own teams or teammates. Only the lead can manage the team.

  6. Lead is fixed: The session that creates the team is the lead for its lifetime. You cannot promote a teammate or transfer leadership.

  7. Permissions set at spawn: All teammates start with the lead's permission mode. You can change individual modes after spawning, but cannot set per-teammate modes at spawn time.

  8. Split panes require tmux or iTerm2: Default in-process mode works in any terminal. Split-pane mode isn't supported in VS Code's integrated terminal, Windows Terminal, or Ghostty.


Graceful Shutdown Sequence

See Team Management for the full shutdown procedure. In summary:

// 1. Request shutdown for all teammates
SendMessage({ type: "shutdown_request", recipient: "worker-1", content: "Done" })
SendMessage({ type: "shutdown_request", recipient: "worker-2", content: "Done" })

// 2. Wait for shutdown approvals

// 3. Verify no active members

// 4. Only then cleanup
TeamDelete()

Handling Crashed Teammates

Teammates have a 5-minute heartbeat timeout. If a teammate crashes:

  1. They are automatically marked as inactive after timeout
  2. Their tasks remain in the task list
  3. Another teammate can claim their tasks
  4. Cleanup will work after timeout expires
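
The timeout rule can be modelled in a few lines. This is an illustrative sketch only: the source gives the 5-minute figure but not how heartbeats are recorded or whether the boundary is inclusive, so the function names and the `>=` comparison are assumptions.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # figure taken from the text above


def is_inactive(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """True once no heartbeat has arrived for at least `timeout`."""
    return now - last_heartbeat >= timeout


def seconds_until_cleanup(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Seconds to wait before cleanup can be expected to succeed (0.0 if already due)."""
    remaining = (last_heartbeat + timeout) - now
    return max(0.0, remaining.total_seconds())
```

In practice this means a failed cleanup right after a crash is expected; retrying after the remaining wait has elapsed is usually enough.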

Recovery Strategies

Teammate Stops on Error

Teammates may stop after encountering errors instead of recovering.

Recovery:

  1. Check their output: use Shift+Up/Down in in-process mode, or click the teammate's pane in split-pane mode
  2. Give them additional instructions directly
  3. Or spawn a replacement teammate to continue the work

Lead Starts Implementing Instead of Delegating

The lead sometimes starts doing work itself instead of waiting for teammates.

Recovery: Tell it to wait:

Wait for your teammates to complete their tasks before proceeding

Or enable delegate mode to restrict the lead to coordination-only tools.

Lead Shuts Down Prematurely

The lead may decide the team is finished before all tasks are complete.

Recovery: Tell it to keep going. You can also tell the lead to wait for teammates to finish before proceeding.

Task Appears Stuck

A task stays in pending even though its dependencies are done.

Recovery:

  1. Check if the blocking task was actually marked completed
  2. If work is done but status wasn't updated, update it manually
  3. Tell the lead to nudge the teammate
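
The first check can be automated. The sketch below assumes each file in the task directory is a single JSON object with `id`, `status`, and `blockedBy` fields (the fields named in the jq query under Debugging Commands) and that the status strings are `pending` and `completed`. `load_tasks` and `diagnose` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_tasks(task_dir):
    """Load every *.json task file in a directory into a dict keyed by task id."""
    tasks = {}
    for path in sorted(Path(task_dir).glob("*.json")):
        task = json.loads(path.read_text())
        tasks[str(task["id"])] = task
    return tasks


def diagnose(tasks):
    """Classify pending tasks.

    Returns (unblocked, waiting):
      unblocked - pending task ids whose blockers are all completed
      waiting   - {pending task id: [blocker ids not yet marked completed]}
    """
    unblocked, waiting = [], {}
    for task_id, task in tasks.items():
        if task.get("status") != "pending":
            continue
        open_blockers = [
            str(b)
            for b in (task.get("blockedBy") or [])
            if tasks.get(str(b), {}).get("status") != "completed"
        ]
        if open_blockers:
            waiting[task_id] = open_blockers
        else:
            unblocked.append(task_id)
    return unblocked, waiting
```

Point `load_tasks` at `~/.claude/tasks/{team}/`. Entries in `waiting` name the blocker whose status to inspect and, if the work is done, update manually. Entries in `unblocked` are free to start, so a nudge to the lead is the likelier fix.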

Too Many Permission Prompts

Teammate permission requests bubble up to the lead.

Recovery: Pre-approve common operations in your permission settings before spawning teammates.

Orphaned tmux Sessions

A tmux session persists after the team ends.

Recovery:

tmux ls
tmux kill-session -t <session-name>

Debugging Commands

# Check team config
cat ~/.claude/teams/{team}/config.json | jq '.members[] | {name, agentType, backendType}'

# Check teammate inboxes
cat ~/.claude/teams/{team}/inboxes/{agent}.json | jq '.'

# List all teams
ls ~/.claude/teams/

# Check task states
cat ~/.claude/tasks/{team}/*.json | jq '{id, subject, status, owner, blockedBy}'

# Watch for new messages
tail -f ~/.claude/teams/{team}/inboxes/team-lead.json

Best Practices for Error Prevention

Handle Worker Failures

  • Workers have a 5-minute heartbeat timeout
  • Tasks of crashed workers can be reclaimed
  • Build retry logic into worker prompts

Avoid File Conflicts

Two teammates editing the same file leads to overwrites. Break work so each teammate owns a different set of files.
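
Before spawning teammates, a planned split can be checked for overlap. This is a small sketch; the plan format (teammate name mapped to a list of file paths) and the `find_conflicts` helper are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(assignments):
    """Return {file: [teammates]} for every file assigned to more than one teammate."""
    owners = defaultdict(list)
    for teammate, files in assignments.items():
        for path in set(files):  # ignore duplicates within one teammate's list
            owners[path].append(teammate)
    return {path: sorted(names) for path, names in owners.items() if len(names) > 1}


plan = {
    "worker-1": ["src/api.py", "src/models.py"],
    "worker-2": ["src/models.py", "tests/test_api.py"],
}
print(find_conflicts(plan))  # {'src/models.py': ['worker-1', 'worker-2']}
```

An empty result means every file has a single owner. Any entry in the result is a file to reassign or to sequence behind a task dependency.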

Monitor and Steer

Check in on teammate progress, redirect approaches that aren't working, and synthesize findings as they come in. Letting a team run unattended for too long increases the risk of wasted effort.

Source

https://github.com/zircote/claude-team-orchestration/blob/main/skills/error-handling/SKILL.md

Overview

Error-handling provides practical guidance to debug, recover from, and prevent common agent team errors. It documents typical causes, includes quality-gate hooks, known limitations, and recovery strategies to keep teams productive.

How This Skill Works

It centers on three pillars: tracking common errors with concrete solutions, applying quality-gate hooks at critical workflow points, and documenting known limitations to manage expectations. When an error occurs, you reference the error table, enable relevant hooks such as TeammateIdle or TaskCompleted, and follow the recovery steps described in the Known Limitations.

When to Use It

  • When you encounter common team errors such as 'Agent not found' or 'Team does not exist' during team orchestration.
  • When you need to enforce quality gates before progress using hooks like TeammateIdle or TaskCompleted.
  • When debugging messaging, task-status, or spawn-backend issues that block workflows.
  • When known limitations affect workflow (e.g., no session resumption, slow shutdown) and you need guided recovery.
  • When setting up or modifying teams and you require clean recovery and cleanup guidance (TeamCreate/TeamDelete).

Quick Start

  1. Review recent error messages and consult the Common Errors table for likely causes.
  2. Configure and enable the relevant quality-gate hooks (TeammateIdle, TaskCompleted), starting from the example JSON.
  3. Apply the recovery steps from Known Limitations (shut down the current team, spawn new teammates) and verify operation.

Best Practices

  • Map each error to a concrete fix using the Common Errors table as a reference.
  • Enable and test quality-gate hooks (TeammateIdle, TaskCompleted) before critical handoffs.
  • Always verify the correct team context and names in config.json and use appropriate TeamCreate/TeamDelete flows.
  • Be mindful of listed limitations and plan recovery steps (e.g., spawn new teammates) before proceeding.
  • Test recovery procedures in a safe or staging environment before applying in production.

Example Use Cases

  • Encountering 'Agent not found': read actual teammate names from config.json and retry with the correct names.
  • Resolve 'Already leading a team' by using TeamDelete() or creating a new team name.
  • Prevent task completion when tests fail by applying a TaskCompleted hook that returns code 2.
  • Guard quality before idle by configuring a TeammateIdle hook with a validation script.
  • Diagnose slow shutdown or task-status lag and manually refresh statuses or nudge teammates.
