csv-data-summarizer
Use Cautionnpx machina-cli add skill besoeasy/open-skills/csv-data-summarizer --openclawCSV Data Summarizer
This skill analyzes any CSV file and delivers a complete statistical summary with visualizations in one shot. It adapts intelligently to the type of data it finds — sales, customer, financial, operational, survey, or generic tabular data.
When to Use This Skill
- User uploads or references a CSV file
- Asking to summarize, analyze, or visualize tabular data
- Requesting insights from a dataset
- Wanting to understand data structure and quality
Behavior Rule
Do not ask the user what they want. Immediately run the full analysis.
When a CSV is provided, skip questions like "What would you like me to do?" and go straight to the analysis.
Required Tools / Libraries
pip install pandas matplotlib seaborn
How It Works
The skill inspects the data first, then automatically determines which analyses are relevant:
| Data type | Focus areas |
|---|---|
| Sales / e-commerce | Time-series trends, revenue, product performance |
| Customer data | Distributions, segmentation, geographic patterns |
| Financial | Trend analysis, statistics, correlations |
| Operational | Time-series, performance metrics, distributions |
| Survey | Frequency analysis, cross-tabulations |
| Generic | Adapts based on column types found |
Visualizations are only created when they make sense:
- Time-series plots → only if date/timestamp columns exist
- Correlation heatmaps → only if multiple numeric columns exist
- Category distributions → only if categorical columns exist
- Histograms → for numeric distributions when relevant
Core Function
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def summarize_csv(file_path):
df = pd.read_csv(file_path)
summary = []
charts_created = []
# --- Overview ---
summary.append("=" * 60)
summary.append("DATA OVERVIEW")
summary.append("=" * 60)
summary.append(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
summary.append(f"\nColumns: {', '.join(df.columns.tolist())}")
summary.append("\nDATA TYPES:")
for col, dtype in df.dtypes.items():
summary.append(f" • {col}: {dtype}")
# --- Data quality ---
missing = df.isnull().sum().sum()
missing_pct = (missing / (df.shape[0] * df.shape[1])) * 100
summary.append("\nDATA QUALITY:")
if missing:
summary.append(f"Missing values: {missing:,} ({missing_pct:.2f}% of total data)")
for col in df.columns:
col_missing = df[col].isnull().sum()
if col_missing > 0:
summary.append(f" • {col}: {col_missing:,} ({(col_missing / len(df)) * 100:.1f}%)")
else:
summary.append("No missing values — dataset is complete.")
# --- Numeric analysis ---
numeric_cols = df.select_dtypes(include='number').columns.tolist()
if numeric_cols:
summary.append("\nNUMERICAL ANALYSIS:")
summary.append(str(df[numeric_cols].describe()))
if len(numeric_cols) > 1:
corr_matrix = df[numeric_cols].corr()
summary.append("\nCORRELATIONS:")
summary.append(str(corr_matrix))
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, linewidths=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.close()
charts_created.append('correlation_heatmap.png')
# --- Categorical analysis ---
categorical_cols = [c for c in df.select_dtypes(include='object').columns if 'id' not in c.lower()]
if categorical_cols:
summary.append("\nCATEGORICAL ANALYSIS:")
for col in categorical_cols[:5]:
value_counts = df[col].value_counts()
summary.append(f"\n{col}:")
for val, count in value_counts.head(10).items():
summary.append(f" • {val}: {count:,} ({(count / len(df)) * 100:.1f}%)")
# --- Time series analysis ---
date_cols = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
if date_cols:
date_col = date_cols[0]
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
date_range = df[date_col].max() - df[date_col].min()
summary.append(f"\nTIME SERIES ANALYSIS:")
summary.append(f"Date range: {df[date_col].min()} to {df[date_col].max()}")
summary.append(f"Span: {date_range.days} days")
if numeric_cols:
fig, axes = plt.subplots(min(3, len(numeric_cols)), 1, figsize=(12, 4 * min(3, len(numeric_cols))))
if len(numeric_cols) == 1:
axes = [axes]
for idx, num_col in enumerate(numeric_cols[:3]):
ax = axes[idx]
df.groupby(date_col)[num_col].mean().plot(ax=ax, linewidth=2)
ax.set_title(f'{num_col} Over Time')
ax.set_xlabel('Date')
ax.set_ylabel(num_col)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('time_series_analysis.png', dpi=150)
plt.close()
charts_created.append('time_series_analysis.png')
# --- Distribution plots ---
if numeric_cols:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for idx, col in enumerate(numeric_cols[:4]):
axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
axes[idx].set_title(f'Distribution of {col}')
axes[idx].set_xlabel(col)
axes[idx].set_ylabel('Frequency')
axes[idx].grid(True, alpha=0.3)
for idx in range(len(numeric_cols[:4]), 4):
axes[idx].set_visible(False)
plt.tight_layout()
plt.savefig('distributions.png', dpi=150)
plt.close()
charts_created.append('distributions.png')
# --- Categorical distribution plots ---
if categorical_cols:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for idx, col in enumerate(categorical_cols[:4]):
value_counts = df[col].value_counts().head(10)
axes[idx].barh(range(len(value_counts)), value_counts.values)
axes[idx].set_yticks(range(len(value_counts)))
axes[idx].set_yticklabels(value_counts.index)
axes[idx].set_title(f'Top Values in {col}')
axes[idx].set_xlabel('Count')
axes[idx].grid(True, alpha=0.3, axis='x')
for idx in range(len(categorical_cols[:4]), 4):
axes[idx].set_visible(False)
plt.tight_layout()
plt.savefig('categorical_distributions.png', dpi=150)
plt.close()
charts_created.append('categorical_distributions.png')
if charts_created:
summary.append("\nVISUALIZATIONS CREATED:")
for chart in charts_created:
summary.append(f" ✓ {chart}")
summary.append("\n" + "=" * 60)
summary.append("ANALYSIS COMPLETE")
summary.append("=" * 60)
return "\n".join(summary)
Usage
Here's sales_data.csv. Can you summarize this file?
Analyze this customer data CSV and show me trends.
What insights can you find in orders.csv?
Example Output
============================================================
DATA OVERVIEW
============================================================
Rows: 5,000 | Columns: 8
Columns: order_id, date, product, category, quantity, price, region, customer_id
DATA TYPES:
• order_id: int64
• date: object
• price: float64
...
DATA QUALITY:
Missing values: 100 (0.25% of total data)
• price: 100 (2.0%)
NUMERICAL ANALYSIS:
quantity price
count 5000.000 4900.000
mean 3.200 58.200
std 1.800 12.400
...
TIME SERIES ANALYSIS:
Date range: 2023-01-01 to 2023-12-31
Span: 364 days
VISUALIZATIONS CREATED:
✓ time_series_analysis.png
✓ distributions.png
✓ categorical_distributions.png
✓ correlation_heatmap.png
============================================================
ANALYSIS COMPLETE
============================================================
Notes
- Date columns are auto-detected if the column name contains
dateortime - Columns with
idin the name are excluded from categorical analysis - All charts are saved as PNG files in the working directory
- Missing data is handled gracefully throughout
Related Skills
json-and-csv-data-transformation— Clean and reshape CSV data before analysisdatabase-query-and-export— Export query results to CSV for analysisd3js-data-visualization— Build interactive browser-based charts from the same data
Source
git clone https://github.com/besoeasy/open-skills/blob/main/skills/csv-data-summarizer/SKILL.mdView on GitHub Overview
CSV Data Summarizer automatically analyzes any CSV, delivering a complete statistical summary, data quality checks, and visualizations in one shot. It adapts to sales, customer, financial, operational, or survey data and outputs insights without prompts.
How This Skill Works
It loads the CSV with pandas, detects data types, and selects relevant analyses per data category (e.g., time-series for sales, distributions for categorical data). It computes descriptive statistics, checks for missing values, and, when appropriate, generates visualizations like correlation heatmaps and distribution plots using matplotlib and seaborn.
When to Use It
- User uploads or references a CSV file for analysis.
- You want to summarize, analyze, or visualize tabular data.
- You need actionable insights from a dataset.
- You want to understand data structure and data quality at a glance.
- You require an automatic, prompt-free full analysis after upload.
Quick Start
- Step 1: Install dependencies if needed (pip install pandas matplotlib seaborn) and prepare your CSV.
- Step 2: Run summarize_csv('<path-to-csv>') or upload the file to trigger analysis.
- Step 3: Open the generated DATA OVERVIEW, DATA QUALITY report, numerical results, and any saved charts.
Best Practices
- Prepare a clean CSV (UTF-8, proper delimiter) to ensure smooth parsing.
- Keep date/time columns identifiable so time-series visuals render correctly.
- Review outputs for missing values and seasonality before making decisions.
- Run analyses with multiple numeric columns to enable correlation insights.
- Validate charts and summaries with domain knowledge; avoid over-interpretation.
Example Use Cases
- Finance: analyze a monthly sales CSV to reveal revenue trends and key correlations.
- E-commerce: assess product performance and customer demographics with distributions and cross-tabs.
- Operations: evaluate uptime, throughput, and performance metrics across time.
- Survey: perform frequency analysis and cross-tabulations to identify respondent patterns.
- Generic tabular: auto-generate a data-quality report and visuals for a mixed-column dataset.