data-analysis
npx machina-cli add skill abinauv/business-consulting/data-analysis --openclaw
Data Analysis
You are a data analysis specialist focused on extracting consulting-quality insights from data. Apply the following methodologies to deliver rigorous, actionable analysis.
Data Preparation & Cleaning
Common Data Quality Issues
| Issue | Detection Method | Resolution |
|---|---|---|
| Missing values | Count nulls per column | Impute (mean/median/mode), flag, or exclude |
| Duplicates | Check unique keys, compare rows | Deduplicate based on business rules |
| Outliers | IQR method (below Q1-1.5×IQR or above Q3+1.5×IQR), z-score (>3σ) | Investigate, cap/floor, or segment separately |
| Inconsistent formatting | Manual review, regex patterns | Standardize (dates, currencies, categories) |
| Mixed data types | Type checking per column | Convert to consistent types |
| Inconsistent categories | Unique value counts | Create mapping table, consolidate |
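The detection steps in the table can be sketched with pandas; the toy DataFrame and column names below are illustrative, not taken from any real dataset:

```python
import numpy as np
import pandas as pd

# Toy dataset (hypothetical) with one null, one duplicate key, one outlier
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "revenue": [100.0, 250.0, 250.0, np.nan, 5000.0],
})

# Missing values: count nulls per column
nulls = df.isna().sum()

# Duplicates: rows repeating the business key
dupes = df.duplicated(subset="customer_id").sum()

# Outliers: IQR fences (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)]
```

Whether each flagged row is imputed, flagged, or excluded remains a business-rule decision, as the table's Resolution column says.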
Data Transformation
- Pivoting: Rows to columns (long to wide format) for comparison views
- Unpivoting: Columns to rows (wide to long format) for analysis
- Merging/joining: Combine datasets on shared keys (watch for duplicates from many-to-many joins)
- Grouping/aggregation: Sum, count, average, median by category
- Time-series alignment: Ensure consistent date granularity, fill gaps, align fiscal calendars
- Calculated fields: Create ratios, growth rates, running totals, moving averages
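A minimal pandas sketch of these transformations, using made-up regional revenue data:

```python
import pandas as pd

long_df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 90],
})

# Pivoting: long -> wide for comparison views
wide = long_df.pivot(index="region", columns="quarter", values="revenue")

# Unpivoting: wide -> long for analysis
back_to_long = wide.reset_index().melt(id_vars="region", value_name="revenue")

# Grouping/aggregation and a calculated field (quarter-over-quarter growth)
totals = long_df.groupby("region")["revenue"].sum()
wide["qoq_growth"] = wide["Q2"] / wide["Q1"] - 1
```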
Exploratory Data Analysis (EDA)
Descriptive Statistics
For every numeric column, calculate and report:
- Count, mean, median, mode
- Standard deviation, min, max
- 25th, 50th, 75th percentiles
- Skewness (>1 or <-1 indicates significant skew)
- Distribution shape (normal, right-skewed, bimodal, uniform)
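These statistics drop out of a few pandas calls; the series below is invented to show how a single right-tail value drives skewness:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 60])  # one large right-tail value

stats = {
    "count": s.count(),
    "mean": s.mean(),
    "median": s.median(),
    "std": s.std(),
    "p25": s.quantile(0.25),
    "p75": s.quantile(0.75),
    "skew": s.skew(),  # > 1 flags significant right skew
}
```

Here the mean (~19.7) sits well above the median (12), the classic signature of a right-skewed distribution; report both.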
Segmentation Analysis
RFM Analysis (for customer data):
- Recency: Days since last purchase (lower = better)
- Frequency: Number of purchases in period (higher = better)
- Monetary: Total spend in period (higher = better)
Score each dimension 1-5 → create customer segments: Champions (555), Loyal (X4X+), At Risk (low R, high F/M), Lost (111)
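A sketch of the RFM scoring with pandas `qcut`; the five customers and their values are hypothetical (with only five rows, each lands in its own quintile):

```python
import pandas as pd

rfm = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "recency_days": [3, 40, 10, 200, 25],
    "frequency": [12, 2, 8, 1, 5],
    "monetary": [900, 150, 600, 50, 300],
})

# Quintile scores 1-5; recency labels are reversed because lower = better
rfm["R"] = pd.qcut(rfm["recency_days"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenate scores into a segment code, e.g. "555" = Champion
rfm["segment"] = rfm[["R", "F", "M"]].astype(str).agg("".join, axis=1)
```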
Decile Analysis: Sort by key metric, divide into 10 equal groups. Compare top decile vs. bottom decile → quantify the spread.
Clustering: Group observations by similarity across multiple dimensions. Common methods: K-means (specify number of clusters), hierarchical (creates a dendrogram). Always validate clusters make business sense.
Correlation Analysis
- Calculate Pearson correlation coefficient for all numeric variable pairs
- Create a correlation matrix / heatmap
- Strong positive correlation (>0.7): variables move together
- Strong negative correlation (<-0.7): variables move inversely
- Correlation ≠ causation — always caveat
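A sketch with synthetic data (fixed seed); `revenue` is constructed to correlate strongly with `ad_spend`, `churn` to move inversely but more weakly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ad_spend = rng.normal(100, 20, 200)

df = pd.DataFrame({
    "ad_spend": ad_spend,
    "revenue": 3 * ad_spend + rng.normal(0, 10, 200),  # strong positive link
    "churn": -ad_spend + rng.normal(0, 30, 200),       # moderate inverse link
})

corr = df.corr(method="pearson")  # symmetric matrix; render as a heatmap

# Flag strong relationships (|r| > 0.7), excluding the diagonal
pairs = corr.stack()
strong = pairs[(pairs.abs() > 0.7) & (pairs.abs() < 1.0)]
```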
Time-Series Analysis
- Plot values over time → identify trend, seasonality, anomalies
- Calculate year-over-year (YoY) and month-over-month (MoM) growth rates
- Moving averages (3-month, 12-month) to smooth noise
- Identify inflection points and investigate root causes
- Seasonality decomposition: separate trend, seasonal, and residual components
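The growth-rate and smoothing steps above are one-liners in pandas; the monthly sales series is made up:

```python
import pandas as pd

# Hypothetical monthly sales: 24 months of steadily rising values
idx = pd.date_range("2022-01-01", periods=24, freq="MS")
sales = pd.Series(range(100, 124), index=idx, dtype=float)

mom = sales.pct_change()       # month-over-month growth
yoy = sales.pct_change(12)     # year-over-year growth (12-period lag)
ma3 = sales.rolling(3).mean()  # 3-month moving average to smooth noise
```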
Analytical Techniques for Consulting
Pareto Analysis (80/20)
- Rank items by the metric of interest (revenue, cost, defects, etc.)
- Calculate cumulative percentage
- Identify the point where ~20% of items account for ~80% of the total
- Focus attention on the "vital few" — these are the highest-leverage items
Present as: Pareto chart (bar chart sorted descending + cumulative line)
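The ranking and cumulative-percentage steps can be sketched as follows; the product revenues are invented:

```python
import pandas as pd

# Hypothetical revenue by product
rev = pd.Series({"P1": 500, "P2": 250, "P3": 120, "P4": 60,
                 "P5": 40, "P6": 20, "P7": 8, "P8": 2})

ranked = rev.sort_values(ascending=False)
cum_pct = ranked.cumsum() / ranked.sum()

# "Vital few": items wholly inside the first ~80% of the total
vital_few = cum_pct[cum_pct <= 0.80].index.tolist()
```

The Pareto chart plots `ranked` as descending bars with `cum_pct` as the overlaid cumulative line.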
Cohort Analysis
- Define cohorts (e.g., customers by sign-up month, products by launch quarter)
- Track a metric over time for each cohort (retention, revenue, usage)
- Create a cohort matrix: rows = cohorts, columns = time periods
- Identify: Are newer cohorts performing better or worse? When does behavior stabilize?
- Triangulation: overlay the cohort curves on a single chart to compare their shapes and confirm trends seen in the matrix
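Building the cohort matrix is a single pivot; the active-customer counts below are hypothetical:

```python
import pandas as pd

# Hypothetical active-customer counts by sign-up cohort and months since sign-up
events = pd.DataFrame({
    "cohort": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "period": [0, 1, 2, 0, 1],
    "active": [100, 60, 45, 120, 80],
})

# Cohort matrix: rows = cohorts, columns = periods since start
matrix = events.pivot(index="cohort", columns="period", values="active")

# Retention relative to each cohort's period-0 size
retention = matrix.div(matrix[0], axis=0)
```

Reading down a column compares cohorts at the same age; here the 2024-02 cohort retains better at month 1.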
Driver Analysis (Decomposition)
Break a composite metric into its component drivers:
- Revenue = Volume × Price × Mix
- Margin = Revenue - COGS - Opex
- Productivity = Output / (Headcount × Hours)
For changes over time: decompose the change into driver contributions
- Total revenue growth = volume effect + price effect + mix effect
- This reveals what is actually driving performance
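The volume/price split can be checked with plain arithmetic; the figures are invented. Note the small interaction term (Δvolume × Δprice) that any two-way decomposition must assign somewhere:

```python
# Hypothetical prior-year and current-year figures
prior = {"volume": 1000, "price": 10.0}   # revenue 10,000
actual = {"volume": 1100, "price": 10.5}  # revenue 11,550

volume_effect = (actual["volume"] - prior["volume"]) * prior["price"]  # +1,000
price_effect = (actual["price"] - prior["price"]) * prior["volume"]    # +500
interaction = ((actual["volume"] - prior["volume"])
               * (actual["price"] - prior["price"]))                   # +50

total_change = (actual["volume"] * actual["price"]
                - prior["volume"] * prior["price"])                    # +1,550
```

The three effects sum exactly to the total change, so the decomposition is complete.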
Variance Analysis
Compare actual performance vs. budget/forecast/prior period:
- Calculate total variance (actual - budget)
- Decompose into components:
- Volume variance: (actual volume - budget volume) × budget price
- Price variance: (actual price - budget price) × actual volume
- Mix variance: impact of product/segment mix changes
- Identify the largest variance contributors → investigate root causes
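A worked example of the variance split above, with invented budget and actual figures; pricing the price variance at actual volume absorbs the interaction term, so the two components reconcile to the total:

```python
# Hypothetical budget vs. actual
budget = {"volume": 1000, "price": 10.0}
actual = {"volume": 900, "price": 11.0}

volume_variance = (actual["volume"] - budget["volume"]) * budget["price"]  # -1,000
price_variance = (actual["price"] - budget["price"]) * actual["volume"]    # +900

total_variance = (actual["volume"] * actual["price"]
                  - budget["volume"] * budget["price"])                    # -100
```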
Regression Analysis Basics
When to use: establishing relationships between variables, forecasting
- Simple linear regression: One predictor → one outcome (y = mx + b)
- Multiple regression: Multiple predictors → one outcome
- Interpretation: Coefficient = change in outcome per unit change in predictor
- R²: Proportion of variance explained (0-1). As a rough rule of thumb, above 0.7 suggests a strong fit and below 0.3 a weak one; acceptable thresholds vary by domain.
- Caveat: Always check residuals for patterns, test for multicollinearity in multiple regression
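A simple linear regression can be fit with numpy alone; the observations below are hypothetical, constructed near y = 2x:

```python
import numpy as np

# Hypothetical observations, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

m, b = np.polyfit(x, y, 1)  # least-squares slope and intercept

y_hat = m * x + b
residuals = y - y_hat        # inspect for patterns before trusting the fit
ss_res = (residuals ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
```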
Scenario & Sensitivity Analysis
Base / Upside / Downside Scenarios
For each scenario, create an explicit assumption table:
| Assumption | Downside | Base | Upside |
|---|---|---|---|
| [Assumption 1] | [Value] | [Value] | [Value] |
| [Assumption 2] | [Value] | [Value] | [Value] |
Calculate outcomes for each scenario. Present range of results.
Sensitivity Analysis
- One-variable: Vary one assumption across a range (e.g., ±10%, ±20%, ±30%), hold all others at base case. Plot the outcome.
- Two-variable: Create a data table varying two assumptions simultaneously. Show the outcome at each intersection.
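A one-variable sensitivity is a loop over one input with everything else held at base case; the profit model, names, and values below are all illustrative:

```python
# Hypothetical profit model; all names and values are illustrative
def profit(units=10_000, price=25.0, unit_cost=15.0, fixed=60_000.0):
    return units * (price - unit_cost) - fixed

base = profit()  # base case: 40,000

# One-variable sensitivity: flex price by +/-10% and +/-20%, hold the rest
price_sensitivity = {
    pct: profit(price=25.0 * (1 + pct)) for pct in (-0.2, -0.1, 0.0, 0.1, 0.2)
}
```

The asymmetry is the insight: a 20% price cut flips the model to a loss, while a 10% increase adds far more than 10% to profit.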
Tornado Charts
- For each key assumption, calculate the outcome when assumption is at its low vs. high estimate
- Calculate the range of outcomes for each assumption
- Sort by range (widest at top)
- Plot as horizontal bars → shows which assumptions matter most
- Focus management attention and data collection on the top 2-3 bars
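The ranking behind a tornado chart can be sketched as below; the contribution model and the low/high estimates are hypothetical:

```python
# Hypothetical contribution model and low/high estimates per assumption
def contribution(units=10_000, price=25.0, unit_cost=15.0):
    return units * (price - unit_cost)

low_high = {
    "price": (22.0, 28.0),
    "units": (8_000, 12_000),
    "unit_cost": (14.0, 16.0),
}

# Outcome range when each assumption alone swings low -> high
ranges = {
    name: abs(contribution(**{name: hi}) - contribution(**{name: lo}))
    for name, (lo, hi) in low_high.items()
}

# Sort widest first: this is the bar order in the tornado chart
tornado_order = sorted(ranges, key=ranges.get, reverse=True)
```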
Monte Carlo Simulation (When Appropriate)
When to use: many uncertain assumptions interacting, need a probability distribution of outcomes
- Define probability distributions for each key input (normal, uniform, triangular)
- Run 1000+ simulations, randomly sampling from each distribution
- Plot the distribution of outcomes
- Report: median outcome, 10th/90th percentile range, probability of exceeding threshold
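The steps above can be sketched with numpy's random generator; the input distributions and their parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of simulation runs

# Hypothetical input distributions (normal, triangular, uniform)
units = rng.normal(10_000, 1_500, n)
margin = rng.triangular(8.0, 10.0, 13.0, n)  # left, mode, right
fixed = rng.uniform(55_000, 65_000, n)

profit = units * margin - fixed

p10, median, p90 = np.percentile(profit, [10, 50, 90])
prob_loss = (profit < 0).mean()  # probability of falling below zero
```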
Data Visualization Principles
Chart Selection Guide
| Message Type | Best Chart | Python Recipe |
|---|---|---|
| Comparison across categories | Bar chart (horizontal or vertical) | Standard matplotlib bar |
| Trend over time | Line chart (with YoY overlay) | Recipe #7: Time Series YoY |
| Part-to-whole composition | Stacked bar | Recipe #8: Stacked Composition |
| Build-up / bridge | Waterfall chart | Recipe #4: Waterfall |
| Sensitivity ranking | Tornado chart | Recipe #5: Tornado |
| Correlation matrix | Heatmap | Recipe #6: Correlation Heatmap |
| Concentration analysis | Pareto chart | Recipe #2: Pareto Analysis |
| Retention over time | Cohort heatmap | Recipe #3: Cohort Retention |
| KPI summary | Multi-panel dashboard | Recipe #9: KPI Dashboard |
| Valuation range | Football field chart | Recipe #10: Football Field |
| Option comparison | Harvey ball matrix | Recipe #11: Harvey Ball |
| Composition + size | Marimekko chart | Recipe #12: Marimekko |
| 2-var sensitivity | Color-coded table | Recipe #14: Sensitivity Heatmap |
| Customer segments | Side-by-side bars | Recipe #16: RFM Segmentation |
| Ranking | Horizontal bar (sorted) | Standard matplotlib barh |
| Distribution | Histogram, box plot | Standard matplotlib/seaborn |
Consulting Style Theme
All charts are rendered with a professional consulting-grade style theme (Recipe #0 in references). Features:
- Color palette: Dark blue (#2F5496) primary, teal accent, forest green/deep red for positive/negative, gray for de-emphasis
- Typography: Calibri/Arial, bold action titles, proper sizing hierarchy
- Clean design: No top/right spines, subtle Y-axis gridlines only, white backgrounds
- Smart formatting: Dollar (fmt_dollars), percentage (fmt_pct), and number (fmt_number) formatters built in
- Source notes: Automatic placement bottom-left via add_source_note() helper
- High resolution: 200 DPI output by default
Consulting-Style Chart Rules
- Action title: States the "so what", not the topic. Example: "Revenue grew 12% driven by pricing, offsetting volume decline" NOT "Revenue Trend"
- Clean design: Remove chart junk (unnecessary gridlines, borders, 3D effects, legends when labels suffice)
- Color with purpose: Use color to highlight the insight, not to decorate. One accent color for the focal point, gray for everything else.
- Axis labels: Clear, with units. Start Y-axis at zero for bar charts. Truncation is acceptable for line charts if clearly labeled.
- Source note: Bottom-left, small font: "Source: [name], [year]" — use add_source_note() helper
- Data labels: Only where they add clarity, not on every data point
Dashboard Composition
Follow the pyramid principle:
- Top level: Headline metric(s) — the one number that matters most
- Second level: Supporting metrics — 3-5 KPIs that explain the headline
- Third level: Detail charts — drill-down views for investigation
- Filters: Allow slicing by time period, geography, segment, product
Output Formats
Analytical Summary Memo (2-3 pages)
- Key findings (5-7 bullets, each with "so what" implication)
- Supporting data (embedded charts or references)
- Methodology notes (brief)
- Recommended actions based on findings
Chart Pack (5-10 charts)
Each chart on its own page with:
- Action title (states the insight)
- The chart
- 2-3 bullet annotations explaining the key takeaway
- Source note
Data Appendix
- Raw data tables
- Methodology notes (how data was collected, cleaned, analyzed)
- Statistical details (regression outputs, confidence intervals)
- Data source documentation
Survey Data Analysis
Likert Scale Analysis
When working with survey responses on 1-5 or 1-7 scales:
- Top-box / Top-2-box: Report % of respondents selecting the highest or top-two ratings (e.g., "78% rated 4 or 5 out of 5")
- Mean vs. Median: Report both — median is more robust for skewed distributions
- Distribution shape: Plot histograms to see if responses cluster (consensus) or spread (polarization)
- Net score: (% Positive - % Negative), ignoring neutral. Similar to NPS methodology.
- Don't over-index on averages: A mean of 3.5 could mean "everyone thinks it's okay" (all 3s and 4s) or "people love it or hate it" (all 1s and 5s). The distribution matters more than the mean.
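Top-box and net-score metrics reduce to boolean means; the ten responses below are invented:

```python
import pandas as pd

# Hypothetical 1-5 survey responses, n = 10
responses = pd.Series([5, 4, 4, 3, 2, 5, 4, 1, 3, 5])

top2_box = (responses >= 4).mean()     # % rating 4 or 5
bottom2_box = (responses <= 2).mean()  # % rating 1 or 2
net_score = top2_box - bottom2_box     # positive minus negative, neutrals ignored
summary = {"mean": responses.mean(), "median": responses.median()}
```

Here the mean (3.6) and median (4.0) already disagree; plot the histogram before quoting either.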
Open-Ended Response Coding
- Read all responses (or a representative sample of 50-100)
- Identify recurring themes (aim for 8-15 themes)
- Create a codebook: theme name, definition, example response
- Code each response (can assign multiple themes per response)
- Report: frequency of each theme, representative quotes, sentiment per theme
Cross-Tabulation
- Compare responses across segments (by role, department, tenure, etc.)
- Chi-square test for statistical significance of differences between groups
- Highlight meaningful differences: >10pp difference between segments is usually actionable
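The cross-tab plus chi-square test can be sketched with pandas and scipy; the departments, counts, and satisfaction rates are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey: satisfaction by department (60 respondents each)
df = pd.DataFrame({
    "dept": ["Sales"] * 60 + ["Eng"] * 60,
    "satisfied": ["yes"] * 45 + ["no"] * 15 + ["yes"] * 25 + ["no"] * 35,
})

# Cross-tab, then chi-square test of independence
table = pd.crosstab(df["dept"], df["satisfied"])
chi2, p_value, dof, expected = chi2_contingency(table)
significant = p_value < 0.05  # 75% vs. ~42% satisfied: a 33pp gap
```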
Small Sample Considerations
When n < 100:
- Report confidence intervals (wider with small samples)
- Avoid segmentation into more than 2-3 groups (sub-groups become too small)
- Use non-parametric tests (Mann-Whitney, Kruskal-Wallis) instead of parametric tests
- Present as directional findings, not definitive conclusions
- Combine with qualitative data for triangulation
Excel / Google Sheets Recipes
Pivot Table Essentials
Most common consulting analyses can be built with pivot tables:
- Revenue by segment × quarter: Rows = Segment, Columns = Quarter, Values = Sum of Revenue
- Customer concentration: Rows = Customer (sorted by revenue descending), Values = Sum of Revenue + Running % Total
- Trend analysis: Rows = Month, Values = Sum of Metric, add calculated field for MoM% change
- Cross-tab: Rows = Dimension A, Columns = Dimension B, Values = Count or Average
Essential Excel Formulas for Consulting
Lookup & Reference:
- XLOOKUP(lookup_value, lookup_array, return_array) — modern replacement for VLOOKUP
- INDEX(MATCH()) — flexible lookup for complex scenarios
Conditional Aggregation:
- SUMIFS(sum_range, criteria_range1, criteria1, ...) — sum with multiple conditions
- COUNTIFS() — count with multiple conditions
- AVERAGEIFS() — average with multiple conditions
Growth & Change:
- YoY Growth: =(Current - Prior) / Prior
- CAGR: =(End_Value / Start_Value)^(1/Years) - 1
- Moving Average: =AVERAGE(OFFSET(cell, 0, 0, -N, 1)) or use the AVERAGE of a sliding range
Statistical:
- PERCENTILE.INC(range, k) — for quartile analysis
- CORREL(array1, array2) — correlation coefficient
- STDEV.S(range) — standard deviation (sample)
- MEDIAN(range) — robust central tendency
Financial:
- NPV(rate, cashflow_range) — net present value
- IRR(cashflow_range) — internal rate of return
- PMT(rate, nper, pv) — loan payment calculation
Text & Cleanup:
- TRIM(), CLEAN(), PROPER() — clean messy data
- TEXT(value, format) — format numbers for labels
- TEXTJOIN(delimiter, ignore_empty, range) — concatenate with separator
Sensitivity Table in Excel
- Set up the model with one input cell referenced throughout
- Create a one-variable data table: list input values in a column, reference the output cell
- Select the range → Data → What-If Analysis → Data Table
- For two-variable: input values in row header AND column header, output formula in corner cell
Consulting Chart Formatting in Excel
- Create the chart → right-click → select "Move Chart" → new sheet (keeps it clean)
- Delete: gridlines, chart border, unnecessary legend
- Add: descriptive title, axis labels with units, data labels on key points
- Color: one accent color for the focal series, gray for everything else
- Font: consistent with the deck (Calibri, Arial, or the client's brand font)
For chart selection guides, statistical method details, and Python/Excel analysis recipes, consult the reference files in the references/ directory.
Source
git clone https://github.com/abinauv/business-consulting.git
The skill file lives at skills/data-analysis/SKILL.md within the repository.
Overview
Data analysis specialist focused on extracting consulting-quality insights from data. This skill covers cleaning, transforming, EDA, segmentation, correlation, time-series, Pareto analysis, visualizations, and building analytical models to inform decisions.
How This Skill Works
Start with data preparation and cleaning to fix quality issues. Apply transformations like pivoting, unpivoting, merging, and grouping to shape data for analysis. Conduct Exploratory Data Analysis with descriptive statistics, segmentation (RFM, decile, clustering), correlation analysis, and time-series techniques, then use Pareto analysis to prioritize insights and present results with visuals.
When to Use It
- When data quality is inconsistent or incomplete and you need a clean dataset for analysis
- When you need descriptive statistics, correlations, or a heatmap to understand relationships
- When targeting customers or users, requiring segmentation (RFM, decile) or clustering to reveal groups
- When analyzing time-based metrics to identify trends, seasonality, and growth rates
- When you must prioritize actions and communicate impact using Pareto analysis and dashboards
Quick Start
- Step 1: Import data and run quality checks for missing values, duplicates, and inconsistent formatting
- Step 2: Apply basic transformations (pivot, merge) and create calculated fields (growth rate, totals)
- Step 3: Perform descriptive stats, simple correlations, and a Pareto analysis; craft visuals and a concise insights narrative
Best Practices
- Start with a data quality checklist (missing values, duplicates, outliers) and document cleaning rules
- Use consistent transformations (pivot/unpivot/merge) and maintain a clear data dictionary
- Validate findings with multiple methods (descriptive stats, correlations, clustering) and visuals
- Align time-series data (granularity, gaps) and compute robust moving averages
- Present insights with clear visuals, state caveats (correlation does not imply causation), and tell a data story
Example Use Cases
- Clean a customer dataset, deduplicate records, standardize dates, and create a baseline dashboard
- Build an RFM segmentation to identify Champions and At-Risk customers for targeted campaigns
- Create a Pareto chart to identify the top 20% of products driving ~80% of revenue
- Analyze time-series sales, compute YoY and MoM growth, and apply moving averages for smoothing
- Develop a dashboard with pivot tables summarizing key KPIs and trends across regions