sql-expert
Install: `npx machina-cli add skill FrankChen021/datastoria/sql-expert --openclaw`

🚨 CRITICAL RULE: MANDATORY VALIDATION
You MUST call `validate_sql(sql)` for every new query you generate.

Context Note: Historical validation steps are pruned to save tokens, but this does NOT excuse you from validating new queries in the current turn. Always validate before executing.
1. Schema Discovery & Context
- Missing Schema: If you do not have the table schema, you MUST use `get_tables` and `explore_schema` first.
- Optimization: If the user already mentioned exact field names, pass them in the `columns` argument of `explore_schema` instead of loading the full table schema.
- Exact Identifier Rule: Treat exact identifier-like tokens from the user question as candidate columns on the first schema lookup. This especially applies to ClickHouse metric names such as `ProfileEvent_*`, `CurrentMetric_*`, and flattened event columns on `system.*` tables.
- Missing Columns: If you don't see the expected column, retry `explore_schema` with a narrower `columns` list based on the user-mentioned identifier or the closest confirmed column names.
- Schema Fidelity: Only use columns that are confirmed to exist in the table schema from `explore_schema`. Do not assume standard columns exist if they are not in the tool output.
- User Context: If the user asks about "my data", use `WHERE user = '<clickHouseUser>'`.
- System Tables: For queries on `system.*` tables (e.g., `system.query_log`, `system.parts`, `system.merges`), defer to the `clickhouse-system-queries` skill - it contains table-specific patterns, predicates, and resource metrics that this skill does not cover.
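The discovery order above (user-named columns first, full schema as a fallback) can be sketched as a small helper. This is a hypothetical Python-style wrapper; the real `explore_schema` tool signature is an assumption here, not part of this skill's actual API:

```python
# Hypothetical sketch of the schema-discovery order described above.
# `explore_schema` is passed in as a callable; its signature (table, columns)
# is an assumption for illustration only.

def discover_columns(explore_schema, table, candidate_columns):
    """Try user-mentioned identifiers first, then fall back to the full schema."""
    if candidate_columns:
        # Optimization: ask only for the columns the user actually named.
        found = explore_schema(table, columns=candidate_columns)
        if found:
            return found
    # Missing columns (or no candidates): load the full table schema.
    return explore_schema(table, columns=None)
```

The falsy check on `found` implements the "Missing Columns" retry: an empty result from the narrow lookup triggers a broader one.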
2. Syntax Rules (The Grammar)
- Tables: ALWAYS use fully qualified names (e.g., `database.table`).
- Semicolons: NEVER include a trailing semicolon (`;`).
- Enums: Use exact string literals for Enum columns.
- Safety: ALWAYS use `LIMIT` for data exploration queries.
3. Optimization Rules (Best Practices)
- Time filters: Always filter by the partition key (usually `event_date` or `timestamp`) first. Use bounded time windows (e.g., last 24h, 7 days) unless the user asks for all history.
- Primary Keys (CRITICAL): ClickHouse indexes are sparse. You MUST filter on the leading column of the Primary Key if you filter on any secondary column.
  - Bad: `WHERE event_time > now() - 1h` (if the PK is `event_date, event_time`, this scans everything).
  - Good: `WHERE event_date >= toDate(now() - 1h) AND event_time > now() - 1h` (uses the index, handles midnight crossover).
- Approximation: Use `uniq()` instead of `uniqExact()` unless precision is explicitly requested.
- Joins: Put the smaller table on the RIGHT. Use `GLOBAL IN` only for distributed queries.
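Putting the syntax and optimization rules together, a compliant exploration query might look like the following. The table and column names (`analytics.events`, `user_id`) are hypothetical stand-ins, not tables this skill defines:

```sql
-- Hypothetical table. Note: fully qualified name, partition-key filter first,
-- bounded time window, approximate uniq(), LIMIT, and no trailing semicolon.
SELECT
    event_date,
    uniq(user_id) AS unique_users
FROM analytics.events
WHERE event_date >= toDate(now() - INTERVAL 24 HOUR)
  AND event_time >= now() - INTERVAL 24 HOUR
GROUP BY event_date
ORDER BY event_date
LIMIT 100
```

The redundant-looking `event_date` predicate is what lets the sparse primary-key index prune parts before the finer `event_time` filter is applied.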
4. Execution Workflow
- Generate: Create the SQL following the rules above.
- Validate (MANDATORY): Call `validate_sql(sql)`.
  - If invalid: Read the error, fix the SQL, and retry (max 3 attempts).
- Decide Action:
  - Visualization: IF the user wants a chart, DO NOT execute. Pass the SQL to the visualization skill logic.
  - Data: IF the user wants answers (lists, counts), call `execute_sql(sql)`.
  - Code Only: IF the user asks to "write SQL", just output the code block.
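The workflow above can be sketched as a small driver loop. This is a hypothetical illustration: `validate_sql` and `execute_sql` stand in for the real tools, and the `(ok, error)` return shape is an assumption:

```python
# Hypothetical sketch of the generate -> validate -> execute workflow.
# The tool signatures (validator returning an (ok, error) tuple) are
# assumptions for illustration, not the skill's real tool API.

def run_workflow(generate_sql, validate_sql, execute_sql, max_attempts=3):
    """Validate generated SQL, retrying on validator errors, then execute it."""
    sql = generate_sql(error=None)
    for _ in range(max_attempts):
        ok, error = validate_sql(sql)
        if ok:
            # "Data" path; a chart request would route to visualization instead.
            return execute_sql(sql)
        # Read the error and regenerate a fixed query.
        sql = generate_sql(error=error)
    raise RuntimeError(f"SQL still invalid after {max_attempts} attempts")
```

The cap of three attempts mirrors the "max 3 attempts" rule, so a persistently broken query fails loudly instead of executing unvalidated.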
Source
https://github.com/FrankChen021/datastoria/blob/master/src/lib/ai/skills/sql-expert/SKILL.md

Overview
sql-expert is an expert system that generates ClickHouse SQL, validates every new query with validate_sql, and applies optimization tactics for speed and accuracy. It guides you when you need data, queries, or analysis, and enforces strict schema and safety rules.
How This Skill Works
When you ask for data, it discovers the schema using get_tables and explore_schema if needed, treating exact user tokens as candidate columns. It builds fully qualified, semicolon-free SQL with safe exploration defaults (LIMIT) and time-bound filters, then validates it with validate_sql (this step is mandatory). Depending on your request, it either executes the query or routes the SQL to the visualization path for charts.
When to Use It
- You need data or analysis from ClickHouse and the table schema is not fully known.
- You want optimized queries that respect partition keys, time windows, and primary key order.
- You plan to create visualizations or charts and should not execute until the data path is determined.
- You are exploring data and require safe defaults such as LIMIT and bounded time ranges.
- You ask for code only ("write SQL") and want a standalone SQL block returned without extra text.
Quick Start
- Step 1: If the schema is unknown, run get_tables and explore_schema to establish valid columns.
- Step 2: Build a ClickHouse SQL statement using fully qualified database.table names, add bounded time filters, and include LIMIT for exploration.
- Step 3: Run validate_sql(sql); if it fails, fix the SQL and retry up to 3 times; then execute_sql(sql) or route to visualization as requested.
Best Practices
- Always use fully qualified names in queries (database.table).
- Filter by the partition key first (event_date or timestamp) with bounded time windows.
- Use uniq() for approximate counts unless exact precision is requested.
- Place the smaller table on the RIGHT in joins and avoid GLOBAL IN for non-distributed setups.
- Always call validate_sql(sql) before executing and retry up to 3 times if invalid.
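For example, the join rule means a large fact table should drive the join while the smaller dimension table sits on the right-hand side. The table and column names below are hypothetical:

```sql
-- Hypothetical tables: the large fact table drives the join,
-- the smaller dimension table sits on the RIGHT.
SELECT
    p.name,
    count() AS orders
FROM analytics.orders AS o
INNER JOIN analytics.products AS p ON o.product_id = p.id
WHERE o.event_date >= today() - 30
GROUP BY p.name
ORDER BY orders DESC
LIMIT 20
```

ClickHouse builds an in-memory hash table from the right-hand table, so keeping the smaller table on the right keeps that hash table small.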
Example Use Cases
- Generate a last 7 days daily active users report from analytics.events with event_date filtering.
- Compute top 100 products by revenue in the last 30 days using a bounded time window and LIMIT 100.
- Validate a complex event query against the discovered schema before running on large data sources.
- Explore a large table with a simple LIMIT to sample rows and inspect column types and values.
- User asks to write SQL; provide a code-only SQL block without extra narrative.
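The first use case above might be expressed as follows, assuming `analytics.events` is partitioned by `event_date` and has a `user_id` column (both assumptions for illustration):

```sql
-- Daily active users over the last 7 days; table and column names are
-- hypothetical. event_date leads the filter, uniq() keeps counts cheap.
SELECT
    event_date,
    uniq(user_id) AS daily_active_users
FROM analytics.events
WHERE event_date >= today() - 7
GROUP BY event_date
ORDER BY event_date
LIMIT 10
```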