sql-expert
Install: `npx machina-cli add skill FrankChen021/datastoria/sql-expert --openclaw`

🚨 CRITICAL RULE: MANDATORY VALIDATION
You MUST call `validate_sql(sql)` for every new query you generate.

Context Note: Historical validation steps are pruned to save tokens, but this does NOT excuse you from validating new queries in the current turn. Always validate before executing.
1. Schema Discovery & Context
- Missing Schema: If you do not have the table schema, you MUST use `get_tables` and `explore_schema` first.
- Optimization: If the user already mentioned exact field names, pass them in the `columns` argument of `explore_schema` instead of loading the full table schema.
- Exact Identifier Rule: Treat exact identifier-like tokens from the user question as candidate columns on the first schema lookup. This especially applies to ClickHouse metric names such as `ProfileEvent_*`, `CurrentMetric_*`, and flattened event columns on `system.*` tables.
- Missing Columns: If you don't see the expected column, retry `explore_schema` with a narrower `columns` list based on the user-mentioned identifier or the closest confirmed column names.
- Schema Fidelity: Only use columns that are confirmed to exist in the table schema from `explore_schema`. Do not assume standard columns exist if they are not in the tool output.
- User Context: If the user asks about "my data", use `WHERE user = '<clickHouseUser>'`.
- System Tables: For queries on `system.*` tables (e.g., `system.query_log`, `system.parts`, `system.merges`), defer to the `clickhouse-system-queries` skill - it contains table-specific patterns, predicates, and resource metrics that this skill does not cover.
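The discovery order above (user-named columns first, full schema as a fallback) can be sketched as a small helper. This is a hypothetical Python-style wrapper; the real `explore_schema` tool signature is an assumption here, not part of this skill's actual API:

```python
# Hypothetical sketch of the schema-discovery order described above.
# `explore_schema` is passed in as a callable; its signature (table, columns)
# is an assumption for illustration only.

def discover_columns(explore_schema, table, candidate_columns):
    """Try user-mentioned identifiers first, then fall back to the full schema."""
    if candidate_columns:
        # Optimization: ask only for the columns the user actually named.
        found = explore_schema(table, columns=candidate_columns)
        if found:
            return found
    # Missing columns (or no candidates): load the full table schema.
    return explore_schema(table, columns=None)
```

The falsy check on `found` implements the "Missing Columns" retry: an empty result from the narrow lookup triggers a broader one.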
2. Syntax Rules (The Grammar)
- Tables: ALWAYS use fully qualified names (e.g., `database.table`).
- Semicolons: NEVER include a trailing semicolon (`;`).
- Enums: Use exact string literals for Enum columns.
- Safety: ALWAYS use `LIMIT` for data exploration queries.
3. Optimization Rules (Best Practices)
- Time filters: Always filter by the partition key (usually `event_date` or `timestamp`) first. Use bounded time windows (e.g., last 24h, 7 days) unless the user asks for all history.
- Primary Keys (CRITICAL): ClickHouse indexes are sparse. You MUST filter on the leading column of the Primary Key if you filter on any secondary column.
  - Bad: `WHERE event_time > now() - 1h` (if the PK is `event_date, event_time`, this scans everything).
  - Good: `WHERE event_date >= toDate(now() - 1h) AND event_time > now() - 1h` (uses the index, handles midnight crossover).
- Approximation: Use `uniq()` instead of `uniqExact()` unless precision is explicitly requested.
- Joins: Put the smaller table on the RIGHT. Use `GLOBAL IN` only for distributed queries.
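Putting the syntax and optimization rules together, a compliant exploration query might look like the following. The table and column names (`analytics.events`, `user_id`) are hypothetical stand-ins, not tables this skill defines:

```sql
-- Hypothetical table. Note: fully qualified name, partition-key filter first,
-- bounded time window, approximate uniq(), LIMIT, and no trailing semicolon.
SELECT
    event_date,
    uniq(user_id) AS unique_users
FROM analytics.events
WHERE event_date >= toDate(now() - INTERVAL 24 HOUR)
  AND event_time >= now() - INTERVAL 24 HOUR
GROUP BY event_date
ORDER BY event_date
LIMIT 100
```

The redundant-looking `event_date` predicate is what lets the sparse primary-key index prune parts before the finer `event_time` filter is applied.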
4. Execution Workflow
- Generate: Create the SQL following the rules above.
- Validate (MANDATORY): Call `validate_sql(sql)`.
  - If invalid: Read the error, fix the SQL, and retry (max 3 attempts).
- Decide Action:
  - Visualization: IF the user wants a chart, DO NOT execute. Pass the SQL to the visualization skill logic.
  - Data: IF the user wants answers (lists, counts), call `execute_sql(sql)`.
  - Code Only: IF the user asks to "write SQL", just output the code block.
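The workflow above can be sketched as a small driver loop. This is a hypothetical illustration: `validate_sql` and `execute_sql` stand in for the real tools, and the `(ok, error)` return shape is an assumption:

```python
# Hypothetical sketch of the generate -> validate -> execute workflow.
# The tool signatures (validator returning an (ok, error) tuple) are
# assumptions for illustration, not the skill's real tool API.

def run_workflow(generate_sql, validate_sql, execute_sql, max_attempts=3):
    """Validate generated SQL, retrying on validator errors, then execute it."""
    sql = generate_sql(error=None)
    for _ in range(max_attempts):
        ok, error = validate_sql(sql)
        if ok:
            # "Data" path; a chart request would route to visualization instead.
            return execute_sql(sql)
        # Read the error and regenerate a fixed query.
        sql = generate_sql(error=error)
    raise RuntimeError(f"SQL still invalid after {max_attempts} attempts")
```

The cap of three attempts mirrors the "max 3 attempts" rule, so a persistently broken query fails loudly instead of executing unvalidated.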
Source
https://github.com/FrankChen021/datastoria/blob/master/src/lib/ai/skills/sql-expert/SKILL.md

Overview
sql-expert is an expert system that generates ClickHouse SQL, validates every new query with validate_sql, and applies optimization tactics for speed and accuracy. It guides you when you need data, queries, or analysis, and enforces strict schema and safety rules.
How This Skill Works
When you ask for data, it discovers the schema using get_tables and explore_schema if needed, treating exact user tokens as candidate columns. It builds fully qualified, semicolon-free SQL with safe exploration defaults (LIMIT) and time-bound filters, then validates it with validate_sql (this step is mandatory). Depending on your request, it either executes the query or routes the SQL to the visualization path for charts.
When to Use It
- You need data or analysis from ClickHouse and the table schema is not fully known.
- You want optimized queries that respect partition keys, time windows, and primary key order.
- You plan to create visualizations or charts and should not execute until the data path is determined.
- You are exploring data and require safe defaults such as LIMIT and bounded time ranges.
- You ask for code only ("write SQL") and want a standalone SQL block returned without extra text.
Quick Start
- Step 1: If the schema is unknown, run get_tables and explore_schema to establish valid columns.
- Step 2: Build a ClickHouse SQL statement using fully qualified database.table names, add bounded time filters, and include LIMIT for exploration.
- Step 3: Run validate_sql(sql); if it fails, fix the SQL and retry up to 3 times; then execute_sql(sql) or route to visualization as requested.
Best Practices
- Always use fully qualified names in queries (database.table).
- Filter by the partition key first (event_date or timestamp) with bounded time windows.
- Use uniq() for approximate counts unless exact precision is requested.
- Place the smaller table on the RIGHT in joins and avoid GLOBAL IN for non-distributed setups.
- Always call validate_sql(sql) before executing and retry up to 3 times if invalid.
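For example, the join rule means a large fact table should drive the join while the smaller dimension table sits on the right-hand side. The table and column names below are hypothetical:

```sql
-- Hypothetical tables: the large fact table drives the join,
-- the smaller dimension table sits on the RIGHT.
SELECT
    p.name,
    count() AS orders
FROM analytics.orders AS o
INNER JOIN analytics.products AS p ON o.product_id = p.id
WHERE o.event_date >= today() - 30
GROUP BY p.name
ORDER BY orders DESC
LIMIT 20
```

ClickHouse builds an in-memory hash table from the right-hand table, so keeping the smaller table on the right keeps that hash table small.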
Example Use Cases
- Generate a last 7 days daily active users report from analytics.events with event_date filtering.
- Compute top 100 products by revenue in the last 30 days using a bounded time window and LIMIT 100.
- Validate a complex event query against the discovered schema before running on large data sources.
- Explore a large table with a simple LIMIT to sample rows and inspect column types and values.
- User asks to write SQL; provide a code-only SQL block without extra narrative.
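The first use case above might be expressed as follows, assuming `analytics.events` is partitioned by `event_date` and has a `user_id` column (both assumptions for illustration):

```sql
-- Daily active users over the last 7 days; table and column names are
-- hypothetical. event_date leads the filter, uniq() keeps counts cheap.
SELECT
    event_date,
    uniq(user_id) AS daily_active_users
FROM analytics.events
WHERE event_date >= today() - 7
GROUP BY event_date
ORDER BY event_date
LIMIT 10
```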