Get the FREE Ultimate OpenClaw Setup Guide →

csv-data-summarizer

Use Caution
npx machina-cli add skill besoeasy/open-skills/csv-data-summarizer --openclaw
Files (1)
SKILL.md
8.9 KB

CSV Data Summarizer

This skill analyzes any CSV file and delivers a complete statistical summary with visualizations in one shot. It adapts intelligently to the type of data it finds — sales, customer, financial, operational, survey, or generic tabular data.

When to Use This Skill

  • User uploads or references a CSV file
  • Asking to summarize, analyze, or visualize tabular data
  • Requesting insights from a dataset
  • Wanting to understand data structure and quality

Behavior Rule

Do not ask the user what they want. Immediately run the full analysis.

When a CSV is provided, skip questions like "What would you like me to do?" and go straight to the analysis.

Required Tools / Libraries

pip install pandas matplotlib seaborn

How It Works

The skill inspects the data first, then automatically determines which analyses are relevant:

Data typeFocus areas
Sales / e-commerceTime-series trends, revenue, product performance
Customer dataDistributions, segmentation, geographic patterns
FinancialTrend analysis, statistics, correlations
OperationalTime-series, performance metrics, distributions
SurveyFrequency analysis, cross-tabulations
GenericAdapts based on column types found

Visualizations are only created when they make sense:

  • Time-series plots → only if date/timestamp columns exist
  • Correlation heatmaps → only if multiple numeric columns exist
  • Category distributions → only if categorical columns exist
  • Histograms → for numeric distributions when relevant

Core Function

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def summarize_csv(file_path):
    df = pd.read_csv(file_path)
    summary = []
    charts_created = []

    # --- Overview ---
    summary.append("=" * 60)
    summary.append("DATA OVERVIEW")
    summary.append("=" * 60)
    summary.append(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
    summary.append(f"\nColumns: {', '.join(df.columns.tolist())}")

    summary.append("\nDATA TYPES:")
    for col, dtype in df.dtypes.items():
        summary.append(f"  • {col}: {dtype}")

    # --- Data quality ---
    missing = df.isnull().sum().sum()
    missing_pct = (missing / (df.shape[0] * df.shape[1])) * 100
    summary.append("\nDATA QUALITY:")
    if missing:
        summary.append(f"Missing values: {missing:,} ({missing_pct:.2f}% of total data)")
        for col in df.columns:
            col_missing = df[col].isnull().sum()
            if col_missing > 0:
                summary.append(f"  • {col}: {col_missing:,} ({(col_missing / len(df)) * 100:.1f}%)")
    else:
        summary.append("No missing values — dataset is complete.")

    # --- Numeric analysis ---
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    if numeric_cols:
        summary.append("\nNUMERICAL ANALYSIS:")
        summary.append(str(df[numeric_cols].describe()))

        if len(numeric_cols) > 1:
            corr_matrix = df[numeric_cols].corr()
            summary.append("\nCORRELATIONS:")
            summary.append(str(corr_matrix))

            plt.figure(figsize=(10, 8))
            sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, linewidths=1)
            plt.title('Correlation Heatmap')
            plt.tight_layout()
            plt.savefig('correlation_heatmap.png', dpi=150)
            plt.close()
            charts_created.append('correlation_heatmap.png')

    # --- Categorical analysis ---
    categorical_cols = [c for c in df.select_dtypes(include='object').columns if 'id' not in c.lower()]
    if categorical_cols:
        summary.append("\nCATEGORICAL ANALYSIS:")
        for col in categorical_cols[:5]:
            value_counts = df[col].value_counts()
            summary.append(f"\n{col}:")
            for val, count in value_counts.head(10).items():
                summary.append(f"  • {val}: {count:,} ({(count / len(df)) * 100:.1f}%)")

    # --- Time series analysis ---
    date_cols = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
    if date_cols:
        date_col = date_cols[0]
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
        date_range = df[date_col].max() - df[date_col].min()
        summary.append(f"\nTIME SERIES ANALYSIS:")
        summary.append(f"Date range: {df[date_col].min()} to {df[date_col].max()}")
        summary.append(f"Span: {date_range.days} days")

        if numeric_cols:
            fig, axes = plt.subplots(min(3, len(numeric_cols)), 1, figsize=(12, 4 * min(3, len(numeric_cols))))
            if len(numeric_cols) == 1:
                axes = [axes]
            for idx, num_col in enumerate(numeric_cols[:3]):
                ax = axes[idx]
                df.groupby(date_col)[num_col].mean().plot(ax=ax, linewidth=2)
                ax.set_title(f'{num_col} Over Time')
                ax.set_xlabel('Date')
                ax.set_ylabel(num_col)
                ax.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.savefig('time_series_analysis.png', dpi=150)
            plt.close()
            charts_created.append('time_series_analysis.png')

    # --- Distribution plots ---
    if numeric_cols:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()
        for idx, col in enumerate(numeric_cols[:4]):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(True, alpha=0.3)
        for idx in range(len(numeric_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('distributions.png', dpi=150)
        plt.close()
        charts_created.append('distributions.png')

    # --- Categorical distribution plots ---
    if categorical_cols:
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()
        for idx, col in enumerate(categorical_cols[:4]):
            value_counts = df[col].value_counts().head(10)
            axes[idx].barh(range(len(value_counts)), value_counts.values)
            axes[idx].set_yticks(range(len(value_counts)))
            axes[idx].set_yticklabels(value_counts.index)
            axes[idx].set_title(f'Top Values in {col}')
            axes[idx].set_xlabel('Count')
            axes[idx].grid(True, alpha=0.3, axis='x')
        for idx in range(len(categorical_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('categorical_distributions.png', dpi=150)
        plt.close()
        charts_created.append('categorical_distributions.png')

    if charts_created:
        summary.append("\nVISUALIZATIONS CREATED:")
        for chart in charts_created:
            summary.append(f"  ✓ {chart}")

    summary.append("\n" + "=" * 60)
    summary.append("ANALYSIS COMPLETE")
    summary.append("=" * 60)

    return "\n".join(summary)

Usage

Here's sales_data.csv. Can you summarize this file?
Analyze this customer data CSV and show me trends.
What insights can you find in orders.csv?

Example Output

============================================================
DATA OVERVIEW
============================================================
Rows: 5,000 | Columns: 8
Columns: order_id, date, product, category, quantity, price, region, customer_id

DATA TYPES:
  • order_id: int64
  • date: object
  • price: float64
  ...

DATA QUALITY:
Missing values: 100 (0.25% of total data)
  • price: 100 (2.0%)

NUMERICAL ANALYSIS:
         quantity        price
count    5000.000    4900.000
mean        3.200      58.200
std         1.800      12.400
...

TIME SERIES ANALYSIS:
Date range: 2023-01-01 to 2023-12-31
Span: 364 days

VISUALIZATIONS CREATED:
  ✓ time_series_analysis.png
  ✓ distributions.png
  ✓ categorical_distributions.png
  ✓ correlation_heatmap.png
============================================================
ANALYSIS COMPLETE
============================================================

Notes

  • Date columns are auto-detected if the column name contains date or time
  • Columns with id in the name are excluded from categorical analysis
  • All charts are saved as PNG files in the working directory
  • Missing data is handled gracefully throughout

Related Skills

  • json-and-csv-data-transformation — Clean and reshape CSV data before analysis
  • database-query-and-export — Export query results to CSV for analysis
  • d3js-data-visualization — Build interactive browser-based charts from the same data

Source

git clone https://github.com/besoeasy/open-skills/blob/main/skills/csv-data-summarizer/SKILL.mdView on GitHub

Overview

CSV Data Summarizer automatically analyzes any CSV, delivering a complete statistical summary, data quality checks, and visualizations in one shot. It adapts to sales, customer, financial, operational, or survey data and outputs insights without prompts.

How This Skill Works

It loads the CSV with pandas, detects data types, and selects relevant analyses per data category (e.g., time-series for sales, distributions for categorical data). It computes descriptive statistics, checks for missing values, and, when appropriate, generates visualizations like correlation heatmaps and distribution plots using matplotlib and seaborn.

When to Use It

  • User uploads or references a CSV file for analysis.
  • You want to summarize, analyze, or visualize tabular data.
  • You need actionable insights from a dataset.
  • You want to understand data structure and data quality at a glance.
  • You require an automatic, prompt-free full analysis after upload.

Quick Start

  1. Step 1: Install dependencies if needed (pip install pandas matplotlib seaborn) and prepare your CSV.
  2. Step 2: Run summarize_csv('<path-to-csv>') or upload the file to trigger analysis.
  3. Step 3: Open the generated DATA OVERVIEW, DATA QUALITY report, numerical results, and any saved charts.

Best Practices

  • Prepare a clean CSV (UTF-8, proper delimiter) to ensure smooth parsing.
  • Keep date/time columns identifiable so time-series visuals render correctly.
  • Review outputs for missing values and seasonality before making decisions.
  • Run analyses with multiple numeric columns to enable correlation insights.
  • Validate charts and summaries with domain knowledge; avoid over-interpretation.

Example Use Cases

  • Finance: analyze a monthly sales CSV to reveal revenue trends and key correlations.
  • E-commerce: assess product performance and customer demographics with distributions and cross-tabs.
  • Operations: evaluate uptime, throughput, and performance metrics across time.
  • Survey: perform frequency analysis and cross-tabulations to identify respondent patterns.
  • Generic tabular: auto-generate a data-quality report and visuals for a mixed-column dataset.

Frequently Asked Questions

Add this skill to your agents
Sponsor this space

Reach thousands of developers