
cluster-diagnostics

npx machina-cli add skill FrankChen021/datastoria/cluster-diagnostics --openclaw
Files (1): SKILL.md (2.1 KB)

Tools

  • collect_cluster_status: modes snapshot (default) | windowed
    • Collection only — diagnosis and recommendations are produced by this skill.
    • For charts, delegate to the visualization skill; never emit chart specs directly.

Workflow

  1. Always call collect_cluster_status before any health assessment.
  2. Use status_analysis_mode="windowed" for bounded-time questions (e.g., "past 3 hours"); keep the same window in follow-up calls.
  3. Base all findings solely on collect_cluster_status output.
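The workflow above can be sketched in Python. This is illustrative only: the client object, method name, and keyword arguments mirror the tool description, not a real SDK, and the stub return value is a stand-in for actual tool output.

```python
class StubClient:
    """Stand-in for the real tool interface, for illustration only."""

    def collect_cluster_status(self, status_analysis_mode="snapshot", window=None):
        # The real tool returns cluster health data; this stub just echoes
        # the mode and window so the call pattern can be demonstrated.
        return {"mode": status_analysis_mode, "window": window}


def diagnose(client, window=None):
    """Always collect status before any health assessment (workflow step 1)."""
    if window is not None:
        # Bounded-time question: use windowed mode, and reuse the same
        # window in follow-up calls (workflow step 2).
        return client.collect_cluster_status(
            status_analysis_mode="windowed", window=window
        )
    return client.collect_cluster_status()  # snapshot is the default


snapshot = diagnose(StubClient())                # default snapshot mode
windowed = diagnose(StubClient(), window="3h")   # e.g., "past 3 hours"
```

All findings must then be derived solely from the returned status (workflow step 3).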

Severity Thresholds

| Level | Replication Lag | Disk Usage |
| --- | --- | --- |
| CRITICAL | > 300s | > 90% |
| WARNING | > 60s | > 80% |
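The thresholds can be expressed as a small classifier. This is a sketch: the table only defines WARNING and CRITICAL, so treating everything below 60s lag and 80% disk as OK is an assumption.

```python
def severity(replication_lag_s, disk_usage_pct):
    """Map replication lag (seconds) and disk usage (%) to a severity level.

    A node is rated by its worst signal: either metric crossing a
    threshold sets the level.
    """
    if replication_lag_s > 300 or disk_usage_pct > 90:
        return "CRITICAL"
    if replication_lag_s > 60 or disk_usage_pct > 80:
        return "WARNING"
    return "OK"  # assumption: below both WARNING thresholds
```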

Output Format

  1. ### Summary — print this heading, then:

     | Status | Nodes with Issues | Checks Run | Timestamp |
     | --- | --- | --- | --- |
     | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | comma-separated category names (e.g., parts, replication, disk) | ISO8601 |
  2. ### Findings by Category — print this heading, then a markdown table with one row per category returned by the tool, in stable order:

     | Category | Status | Key Metrics | Top Outlier / Scope | Notes |
     | --- | --- | --- | --- | --- |

    Rules:

    • Status: emoji + text (e.g., 🟠 WARNING), never emoji-only.
    • Key Metrics: 1–2 values, single-line, semicolon-separated (e.g., max_parts_per_partition=533 (>500)).
    • Notes: remaining metrics as compact key=value items, single-line.
    • Wrap identifiers in backticks (e.g., `db.table`).
    • No outlier → set Top Outlier / Scope to `-`.
    • No multi-line content inside cells.
  3. ### Recommendations — print this heading, then at most 3 items: title + reason + SQL/command if applicable.
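The cell rules above can be illustrated with a hypothetical row formatter; the function name and signature are assumptions for this sketch, not part of the skill.

```python
STATUS_EMOJI = {"OK": "🟢", "WARNING": "🟠", "CRITICAL": "🔴"}


def findings_row(category, status, key_metrics, outlier=None, notes=""):
    """Render one Findings by Category row following the stated rules."""
    cells = [
        category,
        f"{STATUS_EMOJI[status]} {status}",  # emoji + text, never emoji-only
        "; ".join(key_metrics),              # 1-2 values, single line, ;-separated
        f"`{outlier}`" if outlier else "-",  # backtick identifiers; "-" if none
        notes,                               # remaining key=value items, single line
    ]
    return "| " + " | ".join(cells) + " |"


row = findings_row(
    "parts", "WARNING",
    ["max_parts_per_partition=533 (>500)"],
    outlier="db.table",
)
```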

Rules

  • Never give a health opinion without calling collect_cluster_status first.
  • Never assume schema or table names; use only tool output.
  • Never write custom health-check SQL; the tool is the source of truth.

Source

git clone https://github.com/FrankChen021/datastoria.git

The skill definition lives at src/lib/ai/skills/cluster-diagnostics/SKILL.md in the repository.

Overview

Diagnoses the health of a ClickHouse cluster using collect_cluster_status and delivers concrete remediation steps. It focuses on key health signals like replication lag and disk usage, applying defined severity thresholds to guide incident response and capacity planning.

How This Skill Works

The skill first calls collect_cluster_status (default mode: snapshot). For bounded-time questions, use status_analysis_mode='windowed' and keep the same window for follow-ups. It then analyzes the returned data to rank health by category (e.g., replication, disk) and provides actionable remediation recommendations.

When to Use It

  • Post-deploy checks to verify cluster health after changes.
  • During a live incident to pinpoint lag, disk pressure, or parts issues.
  • Capacity planning and trend monitoring using bounded time windows.
  • Pre-scaling or shard rebalancing to ensure safe operations.
  • Post-incident review to validate remediation effectiveness.

Quick Start

  1. Step 1: Call collect_cluster_status (default mode: snapshot).
  2. Step 2: If analyzing a bounded period, set status_analysis_mode='windowed' and pick a window; keep it consistent for follow-ups.
  3. Step 3: Review Findings by Category and apply the recommended remediation.

Best Practices

  • Always fetch status via collect_cluster_status before diagnosing health.
  • Use status_analysis_mode='windowed' for bounded-time questions and keep the window consistent across follow-ups.
  • Interpret both replication lag and disk usage together; a high lag with normal disk may indicate config or network issues.
  • Keep remediation steps idempotent and actionable.
  • Document findings and recommended actions in runbooks; avoid guessing.
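The practice of reading replication lag and disk usage together can be sketched as a triage helper. The thresholds follow the severity table; the routing labels are illustrative assumptions.

```python
def triage(replication_lag_s, disk_usage_pct):
    """Combine both signals to suggest a next step (illustrative labels)."""
    if replication_lag_s > 60 and disk_usage_pct > 80:
        # Both signals elevated: disk pressure is likely driving the lag.
        return "rebalance-or-free-disk"
    if replication_lag_s > 60:
        # High lag with normal disk may indicate config or network issues.
        return "check-config-or-network"
    if disk_usage_pct > 80:
        return "capacity-planning"
    return "healthy"
```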

Example Use Cases

  • CRITICAL replication lag (>300s) on shard-01 with disk usage >90% on node-2.
  • WARNING replication lag (>60s) on replica-3 with disk usage at 82%, prompting rebalancing.
  • Backlog in merges indicated by high parts count during a heavy data load.
  • Post-deploy health check shows lag reduced after enabling more aggressive merges.
  • Windowed analysis over last 2 hours reveals lag trending down after remediation steps.
