cluster-diagnostics
npx machina-cli add skill FrankChen021/datastoria/cluster-diagnostics --openclaw
Tools
collect_cluster_status: modes snapshot (default) | windowed. Collection only — diagnosis and recommendations are produced by this skill.
- For charts, delegate to the visualization skill; never emit chart specs directly.
Workflow
- Always call collect_cluster_status before any health assessment.
- Use status_analysis_mode="windowed" for bounded-time questions (e.g., "past 3 hours"); keep the same window in follow-up calls.
- Base all findings solely on collect_cluster_status output.
Severity Thresholds
| Level | Replication Lag | Disk Usage |
|---|---|---|
| CRITICAL | > 300s | > 90% |
| WARNING | > 60s | > 80% |
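The thresholds above can be sketched as a small classifier; the function and parameter names (`classify`, `lag_seconds`, `disk_pct`) are illustrative, not part of the skill or tool API:

```python
def classify(lag_seconds: float, disk_pct: float) -> str:
    """Return the highest severity triggered by replication lag or disk usage,
    mirroring the table: CRITICAL at >300s lag or >90% disk,
    WARNING at >60s lag or >80% disk."""
    if lag_seconds > 300 or disk_pct > 90:
        return "CRITICAL"
    if lag_seconds > 60 or disk_pct > 80:
        return "WARNING"
    return "OK"

# A node is rated at the worst of its two signals:
classify(120, 75)  # "WARNING": lag exceeds 60s even though disk is healthy
classify(45, 95)   # "CRITICAL": disk exceeds 90% even though lag is healthy
```

Note that either signal alone is enough to escalate, which matches the best practice below of interpreting lag and disk together.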
Output Format
- ### Summary — print this heading, then a table:

| Status | Nodes with Issues | Checks Run | Timestamp |
|---|---|---|---|
| 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | comma-separated category names (e.g., parts, replication, disk) | ISO8601 |

- ### Findings by Category — print this heading, then a markdown table with one row per category returned by the tool, in stable order:

| Category | Status | Key Metrics | Top Outlier / Scope | Notes |
|---|---|---|---|---|

Rules:
- Status: emoji + text (e.g., 🟠 WARNING), never emoji-only.
- Key Metrics: 1–2 values, single-line, semicolon-separated (e.g., max_parts_per_partition=533 (>500)).
- Notes: remaining metrics as compact key=value items, single-line.
- Wrap identifiers in backticks (e.g., `db.table`).
- No outlier → set Top Outlier / Scope to -.
- No multi-line content inside cells.
- Recommendations — max 3 items: title + reason + SQL/command if applicable.
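The cell rules above can be captured in a small row formatter; this is a sketch, and the helper name `findings_row` and its signature are assumptions, not part of the skill:

```python
def findings_row(category, status, key_metrics, outlier=None, notes=""):
    """Render one 'Findings by Category' markdown row per the cell rules:
    emoji + text status (never emoji-only), single-line semicolon-separated
    metrics, backticked identifiers, and '-' when there is no outlier."""
    emoji = {"OK": "🟢", "WARNING": "🟠", "CRITICAL": "🔴"}[status]
    scope = f"`{outlier}`" if outlier else "-"
    cells = [category, f"{emoji} {status}", "; ".join(key_metrics), scope, notes]
    return "| " + " | ".join(cells) + " |"

row = findings_row("parts", "WARNING",
                   ["max_parts_per_partition=533 (>500)"],
                   outlier="db.table", notes="active_parts=1200")
# row: "| parts | 🟠 WARNING | max_parts_per_partition=533 (>500) | `db.table` | active_parts=1200 |"
```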
Rules
- Never give a health opinion without calling collect_cluster_status first.
- Never assume schema or table names; use only tool output.
- Never write custom health-check SQL; the tool is the source of truth.
Source
git clone https://github.com/FrankChen021/datastoria
Skill definition: src/lib/ai/skills/cluster-diagnostics/SKILL.md
Overview
Diagnoses the health of a ClickHouse cluster using collect_cluster_status and delivers concrete remediation steps. It focuses on key health signals like replication lag and disk usage, applying defined severity thresholds to guide incident response and capacity planning.
How This Skill Works
The skill first calls collect_cluster_status (default mode: snapshot). For bounded-time questions, use status_analysis_mode='windowed' and keep the same window for follow-ups. It then analyzes the returned data to rank health by category (e.g., replication, disk) and provides actionable remediation recommendations.
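The two modes can be sketched as call payloads; the dictionary shape and the `window` field are illustrative assumptions, not the real tool schema:

```python
# Default health check: snapshot mode needs no extra parameters.
snapshot_req = {"tool": "collect_cluster_status"}

# Bounded-time question ("past 3 hours"): switch to windowed mode and
# reuse the exact same window in follow-up calls so results stay comparable.
windowed_req = {
    "tool": "collect_cluster_status",
    "status_analysis_mode": "windowed",
    "window": "3h",  # hypothetical window encoding, not from the skill doc
}
```

Keeping the window fixed across follow-ups is what makes before/after comparisons (e.g., validating a remediation) meaningful.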
When to Use It
- Post-deploy checks to verify cluster health after changes.
- During a live incident to pinpoint lag, disk pressure, or parts issues.
- Capacity planning and trend monitoring using bounded time windows.
- Pre-scaling or shard rebalancing to ensure safe operations.
- Post-incident review to validate remediation effectiveness.
Quick Start
- Step 1: Call collect_cluster_status (default mode: snapshot).
- Step 2: If analyzing a bounded period, set status_analysis_mode='windowed' and pick a window; keep it consistent for follow-ups.
- Step 3: Review Findings by Category and apply the recommended remediation.
Best Practices
- Always fetch status via collect_cluster_status before diagnosing health.
- Use status_analysis_mode='windowed' for bounded-time questions and keep the window consistent across follow-ups.
- Interpret both replication lag and disk usage together; a high lag with normal disk may indicate config or network issues.
- Keep remediation steps idempotent and actionable.
- Document findings and recommended actions in runbooks; avoid guessing.
Example Use Cases
- CRITICAL replication lag (>300s) on shard-01 with disk usage >90% on node-2.
- WARNING replication lag (>60s) on replica-3 with disk usage at 82%, prompting rebalancing.
- Backlog in merges indicated by high parts count during a heavy data load.
- Post-deploy health check shows lag reduced after enabling more aggressive merges.
- Windowed analysis over last 2 hours reveals lag trending down after remediation steps.