cluster-diagnostics
npx machina-cli add skill FrankChen021/datastoria/cluster-diagnostics --openclaw
Tools
collect_cluster_status: modes snapshot (default) | windowed. Collection only — diagnosis and recommendations are produced by this skill.
- For charts, delegate to the visualization skill; never emit chart specs directly.
Workflow
- Always call collect_cluster_status before any health assessment.
- Use status_analysis_mode="windowed" for bounded-time questions (e.g., "past 3 hours"); keep the same window in follow-up calls.
- Base all findings solely on collect_cluster_status output.
Severity Thresholds
| Level | Replication Lag | Disk Usage |
|---|---|---|
| CRITICAL | > 300s | > 90% |
| WARNING | > 60s | > 80% |
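The thresholds above can be sketched as a small classifier; the function and parameter names (`classify`, `lag_seconds`, `disk_pct`) are illustrative, not part of the skill or tool API:

```python
def classify(lag_seconds: float, disk_pct: float) -> str:
    """Return the highest severity triggered by replication lag or disk usage,
    mirroring the table: CRITICAL at >300s lag or >90% disk,
    WARNING at >60s lag or >80% disk."""
    if lag_seconds > 300 or disk_pct > 90:
        return "CRITICAL"
    if lag_seconds > 60 or disk_pct > 80:
        return "WARNING"
    return "OK"

# A node is rated at the worst of its two signals:
classify(120, 75)  # "WARNING": lag exceeds 60s even though disk is healthy
classify(45, 95)   # "CRITICAL": disk exceeds 90% even though lag is healthy
```

Note that either signal alone is enough to escalate, which matches the best practice below of interpreting lag and disk together.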
Output Format
- ### Summary — print this heading, then a table:

| Status | Nodes with Issues | Checks Run | Timestamp |
|---|---|---|---|
| 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | comma-separated category names (e.g., parts, replication, disk) | ISO8601 |

- ### Findings by Category — print this heading, then a markdown table with one row per category returned by the tool, in stable order:

| Category | Status | Key Metrics | Top Outlier / Scope | Notes |
|---|---|---|---|---|

Rules:
- Status: emoji + text (e.g., 🟠 WARNING), never emoji-only.
- Key Metrics: 1–2 values, single-line, semicolon-separated (e.g., max_parts_per_partition=533 (>500)).
- Notes: remaining metrics as compact key=value items, single-line.
- Wrap identifiers in backticks (e.g., `db.table`).
- No outlier → set Top Outlier / Scope to -.
- No multi-line content inside cells.
- Recommendations — max 3 items: title + reason + SQL/command if applicable.
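The cell rules above can be captured in a small row formatter; this is a sketch, and the helper name `findings_row` and its signature are assumptions, not part of the skill:

```python
def findings_row(category, status, key_metrics, outlier=None, notes=""):
    """Render one 'Findings by Category' markdown row per the cell rules:
    emoji + text status (never emoji-only), single-line semicolon-separated
    metrics, backticked identifiers, and '-' when there is no outlier."""
    emoji = {"OK": "🟢", "WARNING": "🟠", "CRITICAL": "🔴"}[status]
    scope = f"`{outlier}`" if outlier else "-"
    cells = [category, f"{emoji} {status}", "; ".join(key_metrics), scope, notes]
    return "| " + " | ".join(cells) + " |"

row = findings_row("parts", "WARNING",
                   ["max_parts_per_partition=533 (>500)"],
                   outlier="db.table", notes="active_parts=1200")
# row: "| parts | 🟠 WARNING | max_parts_per_partition=533 (>500) | `db.table` | active_parts=1200 |"
```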
Rules
- Never give a health opinion without calling collect_cluster_status first.
- Never assume schema or table names; use only tool output.
- Never write custom health-check SQL; the tool is the source of truth.
Source
git clone https://github.com/FrankChen021/datastoria
Skill definition: src/lib/ai/skills/cluster-diagnostics/SKILL.md
Overview
Diagnoses the health of a ClickHouse cluster using collect_cluster_status and delivers concrete remediation steps. It focuses on key health signals like replication lag and disk usage, applying defined severity thresholds to guide incident response and capacity planning.
How This Skill Works
The skill first calls collect_cluster_status (default mode: snapshot). For bounded-time questions, use status_analysis_mode='windowed' and keep the same window for follow-ups. It then analyzes the returned data to rank health by category (e.g., replication, disk) and provides actionable remediation recommendations.
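The two modes can be sketched as call payloads; the dictionary shape and the `window` field are illustrative assumptions, not the real tool schema:

```python
# Default health check: snapshot mode needs no extra parameters.
snapshot_req = {"tool": "collect_cluster_status"}

# Bounded-time question ("past 3 hours"): switch to windowed mode and
# reuse the exact same window in follow-up calls so results stay comparable.
windowed_req = {
    "tool": "collect_cluster_status",
    "status_analysis_mode": "windowed",
    "window": "3h",  # hypothetical window encoding, not from the skill doc
}
```

Keeping the window fixed across follow-ups is what makes before/after comparisons (e.g., validating a remediation) meaningful.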
When to Use It
- Post-deploy checks to verify cluster health after changes.
- During a live incident to pinpoint lag, disk pressure, or parts issues.
- Capacity planning and trend monitoring using bounded time windows.
- Pre-scaling or shard rebalancing to ensure safe operations.
- Post-incident review to validate remediation effectiveness.
Quick Start
- Step 1: Call collect_cluster_status (default mode: snapshot).
- Step 2: If analyzing a bounded period, set status_analysis_mode='windowed' and pick a window; keep it consistent for follow-ups.
- Step 3: Review Findings by Category and apply the recommended remediation.
Best Practices
- Always fetch status via collect_cluster_status before diagnosing health.
- Use status_analysis_mode='windowed' for bounded-time questions and keep the window consistent across follow-ups.
- Interpret both replication lag and disk usage together; a high lag with normal disk may indicate config or network issues.
- Keep remediation steps idempotent and actionable.
- Document findings and recommended actions in runbooks; avoid guessing.
Example Use Cases
- CRITICAL replication lag (>300s) on shard-01 with disk usage >90% on node-2.
- WARNING replication lag (>60s) on replica-3 with disk usage at 82%, prompting rebalancing.
- Backlog in merges indicated by high parts count during a heavy data load.
- Post-deploy health check shows lag reduced after enabling more aggressive merges.
- Windowed analysis over last 2 hours reveals lag trending down after remediation steps.