data-analysis
npx machina-cli add skill abinauv/business-consulting/data-analysis --openclaw
Data Analysis
You are a data analysis specialist focused on extracting consulting-quality insights from data. Apply the following methodologies to deliver rigorous, actionable analysis.
Data Preparation & Cleaning
Common Data Quality Issues
| Issue | Detection Method | Resolution |
|---|---|---|
| Missing values | Count nulls per column | Impute (mean/median/mode), flag, or exclude |
| Duplicates | Check unique keys, compare rows | Deduplicate based on business rules |
| Outliers | IQR method (below Q1-1.5×IQR or above Q3+1.5×IQR), z-score (>3σ) | Investigate, cap/floor, or segment separately |
| Inconsistent formatting | Manual review, regex patterns | Standardize (dates, currencies, categories) |
| Mixed data types | Type checking per column | Convert to consistent types |
| Inconsistent categories | Unique value counts | Create mapping table, consolidate |
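The detection steps in the table can be sketched with pandas; the toy DataFrame and column names below are illustrative, not taken from any real dataset:

```python
import numpy as np
import pandas as pd

# Toy dataset (hypothetical) with one null, one duplicate key, one outlier
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "revenue": [100.0, 250.0, 250.0, np.nan, 5000.0],
})

# Missing values: count nulls per column
nulls = df.isna().sum()

# Duplicates: rows repeating the business key
dupes = df.duplicated(subset="customer_id").sum()

# Outliers: IQR fences (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)]
```

Whether each flagged row is imputed, flagged, or excluded remains a business-rule decision, as the table's Resolution column says.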
Data Transformation
- Pivoting: Rows to columns (long to wide format) for comparison views
- Unpivoting: Columns to rows (wide to long format) for analysis
- Merging/joining: Combine datasets on shared keys (watch for duplicates from many-to-many joins)
- Grouping/aggregation: Sum, count, average, median by category
- Time-series alignment: Ensure consistent date granularity, fill gaps, align fiscal calendars
- Calculated fields: Create ratios, growth rates, running totals, moving averages
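A minimal pandas sketch of these transformations, using made-up regional revenue data:

```python
import pandas as pd

long_df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 90],
})

# Pivoting: long -> wide for comparison views
wide = long_df.pivot(index="region", columns="quarter", values="revenue")

# Unpivoting: wide -> long for analysis
back_to_long = wide.reset_index().melt(id_vars="region", value_name="revenue")

# Grouping/aggregation and a calculated field (quarter-over-quarter growth)
totals = long_df.groupby("region")["revenue"].sum()
wide["qoq_growth"] = wide["Q2"] / wide["Q1"] - 1
```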
Exploratory Data Analysis (EDA)
Descriptive Statistics
For every numeric column, calculate and report:
- Count, mean, median, mode
- Standard deviation, min, max
- 25th, 50th, 75th percentiles
- Skewness (>1 or <-1 indicates significant skew)
- Distribution shape (normal, right-skewed, bimodal, uniform)
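These statistics drop out of a few pandas calls; the series below is invented to show how a single right-tail value drives skewness:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 60])  # one large right-tail value

stats = {
    "count": s.count(),
    "mean": s.mean(),
    "median": s.median(),
    "std": s.std(),
    "p25": s.quantile(0.25),
    "p75": s.quantile(0.75),
    "skew": s.skew(),  # > 1 flags significant right skew
}
```

Here the mean (~19.7) sits well above the median (12), the classic signature of a right-skewed distribution; report both.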
Segmentation Analysis
RFM Analysis (for customer data):
- Recency: Days since last purchase (lower = better)
- Frequency: Number of purchases in period (higher = better)
- Monetary: Total spend in period (higher = better)
Score each dimension 1-5 → create customer segments: Champions (555), Loyal (X4X+), At Risk (low R, high F/M), Lost (111)
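A sketch of the RFM scoring with pandas `qcut`; the five customers and their values are hypothetical (with only five rows, each lands in its own quintile):

```python
import pandas as pd

rfm = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "recency_days": [3, 40, 10, 200, 25],
    "frequency": [12, 2, 8, 1, 5],
    "monetary": [900, 150, 600, 50, 300],
})

# Quintile scores 1-5; recency labels are reversed because lower = better
rfm["R"] = pd.qcut(rfm["recency_days"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenate scores into a segment code, e.g. "555" = Champion
rfm["segment"] = rfm[["R", "F", "M"]].astype(str).agg("".join, axis=1)
```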
Decile Analysis: Sort by key metric, divide into 10 equal groups. Compare top decile vs. bottom decile → quantify the spread.
Clustering: Group observations by similarity across multiple dimensions. Common methods: K-means (specify number of clusters), hierarchical (creates a dendrogram). Always validate clusters make business sense.
Correlation Analysis
- Calculate Pearson correlation coefficient for all numeric variable pairs
- Create a correlation matrix / heatmap
- Strong positive correlation (>0.7): variables move together
- Strong negative correlation (<-0.7): variables move inversely
- Correlation ≠ causation — always caveat
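A sketch with synthetic data (fixed seed); `revenue` is constructed to correlate strongly with `ad_spend`, `churn` to move inversely but more weakly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ad_spend = rng.normal(100, 20, 200)

df = pd.DataFrame({
    "ad_spend": ad_spend,
    "revenue": 3 * ad_spend + rng.normal(0, 10, 200),  # strong positive link
    "churn": -ad_spend + rng.normal(0, 30, 200),       # moderate inverse link
})

corr = df.corr(method="pearson")  # symmetric matrix; render as a heatmap

# Flag strong relationships (|r| > 0.7), excluding the diagonal
pairs = corr.stack()
strong = pairs[(pairs.abs() > 0.7) & (pairs.abs() < 1.0)]
```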
Time-Series Analysis
- Plot values over time → identify trend, seasonality, anomalies
- Calculate year-over-year (YoY) and month-over-month (MoM) growth rates
- Moving averages (3-month, 12-month) to smooth noise
- Identify inflection points and investigate root causes
- Seasonality decomposition: separate trend, seasonal, and residual components
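The growth-rate and smoothing steps above are one-liners in pandas; the monthly sales series is made up:

```python
import pandas as pd

# Hypothetical monthly sales: 24 months of steadily rising values
idx = pd.date_range("2022-01-01", periods=24, freq="MS")
sales = pd.Series(range(100, 124), index=idx, dtype=float)

mom = sales.pct_change()       # month-over-month growth
yoy = sales.pct_change(12)     # year-over-year growth (12-period lag)
ma3 = sales.rolling(3).mean()  # 3-month moving average to smooth noise
```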
Analytical Techniques for Consulting
Pareto Analysis (80/20)
- Rank items by the metric of interest (revenue, cost, defects, etc.)
- Calculate cumulative percentage
- Identify the point where ~20% of items account for ~80% of the total
- Focus attention on the "vital few" — these are the highest-leverage items
Present as: Pareto chart (bar chart sorted descending + cumulative line)
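The ranking and cumulative-percentage steps can be sketched as follows; the product revenues are invented:

```python
import pandas as pd

# Hypothetical revenue by product
rev = pd.Series({"P1": 500, "P2": 250, "P3": 120, "P4": 60,
                 "P5": 40, "P6": 20, "P7": 8, "P8": 2})

ranked = rev.sort_values(ascending=False)
cum_pct = ranked.cumsum() / ranked.sum()

# "Vital few": items wholly inside the first ~80% of the total
vital_few = cum_pct[cum_pct <= 0.80].index.tolist()
```

The Pareto chart plots `ranked` as descending bars with `cum_pct` as the overlaid cumulative line.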
Cohort Analysis
- Define cohorts (e.g., customers by sign-up month, products by launch quarter)
- Track a metric over time for each cohort (retention, revenue, usage)
- Create a cohort matrix: rows = cohorts, columns = time periods
- Identify: Are newer cohorts performing better or worse? When does behavior stabilize?
- Triangulation: overlay the cohort curves on a single chart to compare their shapes and confirm trends seen in the matrix
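Building the cohort matrix is a single pivot; the active-customer counts below are hypothetical:

```python
import pandas as pd

# Hypothetical active-customer counts by sign-up cohort and months since sign-up
events = pd.DataFrame({
    "cohort": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "period": [0, 1, 2, 0, 1],
    "active": [100, 60, 45, 120, 80],
})

# Cohort matrix: rows = cohorts, columns = periods since start
matrix = events.pivot(index="cohort", columns="period", values="active")

# Retention relative to each cohort's period-0 size
retention = matrix.div(matrix[0], axis=0)
```

Reading down a column compares cohorts at the same age; here the 2024-02 cohort retains better at month 1.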
Driver Analysis (Decomposition)
Break a composite metric into its component drivers:
- Revenue = Volume × Price × Mix
- Margin = Revenue - COGS - Opex
- Productivity = Output / (Headcount × Hours)
For changes over time: decompose the change into driver contributions
- Total revenue growth = volume effect + price effect + mix effect
- This reveals what is actually driving performance
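The volume/price split can be checked with plain arithmetic; the figures are invented. Note the small interaction term (Δvolume × Δprice) that any two-way decomposition must assign somewhere:

```python
# Hypothetical prior-year and current-year figures
prior = {"volume": 1000, "price": 10.0}   # revenue 10,000
actual = {"volume": 1100, "price": 10.5}  # revenue 11,550

volume_effect = (actual["volume"] - prior["volume"]) * prior["price"]  # +1,000
price_effect = (actual["price"] - prior["price"]) * prior["volume"]    # +500
interaction = ((actual["volume"] - prior["volume"])
               * (actual["price"] - prior["price"]))                   # +50

total_change = (actual["volume"] * actual["price"]
                - prior["volume"] * prior["price"])                    # +1,550
```

The three effects sum exactly to the total change, so the decomposition is complete.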
Variance Analysis
Compare actual performance vs. budget/forecast/prior period:
- Calculate total variance (actual - budget)
- Decompose into components:
- Volume variance: (actual volume - budget volume) × budget price
- Price variance: (actual price - budget price) × actual volume
- Mix variance: impact of product/segment mix changes
- Identify the largest variance contributors → investigate root causes
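A worked example of the variance split above, with invented budget and actual figures; pricing the price variance at actual volume absorbs the interaction term, so the two components reconcile to the total:

```python
# Hypothetical budget vs. actual
budget = {"volume": 1000, "price": 10.0}
actual = {"volume": 900, "price": 11.0}

volume_variance = (actual["volume"] - budget["volume"]) * budget["price"]  # -1,000
price_variance = (actual["price"] - budget["price"]) * actual["volume"]    # +900

total_variance = (actual["volume"] * actual["price"]
                  - budget["volume"] * budget["price"])                    # -100
```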
Regression Analysis Basics
When to use: establishing relationships between variables, forecasting
- Simple linear regression: One predictor → one outcome (y = mx + b)
- Multiple regression: Multiple predictors → one outcome
- Interpretation: Coefficient = change in outcome per unit change in predictor
- R²: Proportion of variance explained (0-1). As a rough rule of thumb, above 0.7 suggests a strong fit and below 0.3 a weak one; acceptable thresholds vary by domain.
- Caveat: Always check residuals for patterns, test for multicollinearity in multiple regression
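A simple linear regression can be fit with numpy alone; the observations below are hypothetical, constructed near y = 2x:

```python
import numpy as np

# Hypothetical observations, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

m, b = np.polyfit(x, y, 1)  # least-squares slope and intercept

y_hat = m * x + b
residuals = y - y_hat        # inspect for patterns before trusting the fit
ss_res = (residuals ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
```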
Scenario & Sensitivity Analysis
Base / Upside / Downside Scenarios
For each scenario, create an explicit assumption table:
| Assumption | Downside | Base | Upside |
|---|---|---|---|
| [Assumption 1] | [Value] | [Value] | [Value] |
| [Assumption 2] | [Value] | [Value] | [Value] |
Calculate outcomes for each scenario. Present range of results.
Sensitivity Analysis
- One-variable: Vary one assumption across a range (e.g., ±10%, ±20%, ±30%), hold all others at base case. Plot the outcome.
- Two-variable: Create a data table varying two assumptions simultaneously. Show the outcome at each intersection.
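A one-variable sensitivity is a loop over one input with everything else held at base case; the profit model, names, and values below are all illustrative:

```python
# Hypothetical profit model; all names and values are illustrative
def profit(units=10_000, price=25.0, unit_cost=15.0, fixed=60_000.0):
    return units * (price - unit_cost) - fixed

base = profit()  # base case: 40,000

# One-variable sensitivity: flex price by +/-10% and +/-20%, hold the rest
price_sensitivity = {
    pct: profit(price=25.0 * (1 + pct)) for pct in (-0.2, -0.1, 0.0, 0.1, 0.2)
}
```

The asymmetry is the insight: a 20% price cut flips the model to a loss, while a 10% increase adds far more than 10% to profit.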
Tornado Charts
- For each key assumption, calculate the outcome when assumption is at its low vs. high estimate
- Calculate the range of outcomes for each assumption
- Sort by range (widest at top)
- Plot as horizontal bars → shows which assumptions matter most
- Focus management attention and data collection on the top 2-3 bars
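The ranking behind a tornado chart can be sketched as below; the contribution model and the low/high estimates are hypothetical:

```python
# Hypothetical contribution model and low/high estimates per assumption
def contribution(units=10_000, price=25.0, unit_cost=15.0):
    return units * (price - unit_cost)

low_high = {
    "price": (22.0, 28.0),
    "units": (8_000, 12_000),
    "unit_cost": (14.0, 16.0),
}

# Outcome range when each assumption alone swings low -> high
ranges = {
    name: abs(contribution(**{name: hi}) - contribution(**{name: lo}))
    for name, (lo, hi) in low_high.items()
}

# Sort widest first: this is the bar order in the tornado chart
tornado_order = sorted(ranges, key=ranges.get, reverse=True)
```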
Monte Carlo Simulation (When Appropriate)
When to use: many uncertain assumptions interacting, need a probability distribution of outcomes
- Define probability distributions for each key input (normal, uniform, triangular)
- Run 1000+ simulations, randomly sampling from each distribution
- Plot the distribution of outcomes
- Report: median outcome, 10th/90th percentile range, probability of exceeding threshold
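The steps above can be sketched with numpy's random generator; the input distributions and their parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of simulation runs

# Hypothetical input distributions (normal, triangular, uniform)
units = rng.normal(10_000, 1_500, n)
margin = rng.triangular(8.0, 10.0, 13.0, n)  # left, mode, right
fixed = rng.uniform(55_000, 65_000, n)

profit = units * margin - fixed

p10, median, p90 = np.percentile(profit, [10, 50, 90])
prob_loss = (profit < 0).mean()  # probability of falling below zero
```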
Data Visualization Principles
Chart Selection Guide
| Message Type | Best Chart | Python Recipe |
|---|---|---|
| Comparison across categories | Bar chart (horizontal or vertical) | Standard matplotlib bar |
| Trend over time | Line chart (with YoY overlay) | Recipe #7: Time Series YoY |
| Part-to-whole composition | Stacked bar | Recipe #8: Stacked Composition |
| Build-up / bridge | Waterfall chart | Recipe #4: Waterfall |
| Sensitivity ranking | Tornado chart | Recipe #5: Tornado |
| Correlation matrix | Heatmap | Recipe #6: Correlation Heatmap |
| Concentration analysis | Pareto chart | Recipe #2: Pareto Analysis |
| Retention over time | Cohort heatmap | Recipe #3: Cohort Retention |
| KPI summary | Multi-panel dashboard | Recipe #9: KPI Dashboard |
| Valuation range | Football field chart | Recipe #10: Football Field |
| Option comparison | Harvey ball matrix | Recipe #11: Harvey Ball |
| Composition + size | Marimekko chart | Recipe #12: Marimekko |
| 2-var sensitivity | Color-coded table | Recipe #14: Sensitivity Heatmap |
| Customer segments | Side-by-side bars | Recipe #16: RFM Segmentation |
| Ranking | Horizontal bar (sorted) | Standard matplotlib barh |
| Distribution | Histogram, box plot | Standard matplotlib/seaborn |
Consulting Style Theme
All charts are rendered with a professional consulting-grade style theme (Recipe #0 in references). Features:
- Color palette: Dark blue (#2F5496) primary, teal accent, forest green/deep red for positive/negative, gray for de-emphasis
- Typography: Calibri/Arial, bold action titles, proper sizing hierarchy
- Clean design: No top/right spines, subtle Y-axis gridlines only, white backgrounds
- Smart formatting: Dollar (fmt_dollars), percentage (fmt_pct), and number (fmt_number) formatters built in
- Source notes: Automatic placement bottom-left via add_source_note() helper
- High resolution: 200 DPI output by default
Consulting-Style Chart Rules
- Action title: States the "so what", not the topic. Example: "Revenue grew 12% driven by pricing, offsetting volume decline" NOT "Revenue Trend"
- Clean design: Remove chart junk (unnecessary gridlines, borders, 3D effects, legends when labels suffice)
- Color with purpose: Use color to highlight the insight, not to decorate. One accent color for the focal point, gray for everything else.
- Axis labels: Clear, with units. Start Y-axis at zero for bar charts. Truncation is acceptable for line charts if clearly labeled.
- Source note: Bottom-left, small font: "Source: [name], [year]" — use add_source_note() helper
- Data labels: Only where they add clarity, not on every data point
Dashboard Composition
Follow the pyramid principle:
- Top level: Headline metric(s) — the one number that matters most
- Second level: Supporting metrics — 3-5 KPIs that explain the headline
- Third level: Detail charts — drill-down views for investigation
- Filters: Allow slicing by time period, geography, segment, product
Output Formats
Analytical Summary Memo (2-3 pages)
- Key findings (5-7 bullets, each with "so what" implication)
- Supporting data (embedded charts or references)
- Methodology notes (brief)
- Recommended actions based on findings
Chart Pack (5-10 charts)
Each chart on its own page with:
- Action title (states the insight)
- The chart
- 2-3 bullet annotations explaining the key takeaway
- Source note
Data Appendix
- Raw data tables
- Methodology notes (how data was collected, cleaned, analyzed)
- Statistical details (regression outputs, confidence intervals)
- Data source documentation
Survey Data Analysis
Likert Scale Analysis
When working with survey responses on 1-5 or 1-7 scales:
- Top-box / Top-2-box: Report % of respondents selecting the highest or top-two ratings (e.g., "78% rated 4 or 5 out of 5")
- Mean vs. Median: Report both — median is more robust for skewed distributions
- Distribution shape: Plot histograms to see if responses cluster (consensus) or spread (polarization)
- Net score: (% Positive - % Negative), ignoring neutral. Similar to NPS methodology.
- Don't over-index on averages: A mean of 3.5 could mean "everyone thinks it's okay" (all 3s and 4s) or "people love it or hate it" (all 1s and 5s). The distribution matters more than the mean.
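Top-box and net-score metrics reduce to boolean means; the ten responses below are invented:

```python
import pandas as pd

# Hypothetical 1-5 survey responses, n = 10
responses = pd.Series([5, 4, 4, 3, 2, 5, 4, 1, 3, 5])

top2_box = (responses >= 4).mean()     # % rating 4 or 5
bottom2_box = (responses <= 2).mean()  # % rating 1 or 2
net_score = top2_box - bottom2_box     # positive minus negative, neutrals ignored
summary = {"mean": responses.mean(), "median": responses.median()}
```

Here the mean (3.6) and median (4.0) already disagree; plot the histogram before quoting either.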
Open-Ended Response Coding
- Read all responses (or a representative sample of 50-100)
- Identify recurring themes (aim for 8-15 themes)
- Create a codebook: theme name, definition, example response
- Code each response (can assign multiple themes per response)
- Report: frequency of each theme, representative quotes, sentiment per theme
Cross-Tabulation
- Compare responses across segments (by role, department, tenure, etc.)
- Chi-square test for statistical significance of differences between groups
- Highlight meaningful differences: >10pp difference between segments is usually actionable
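The cross-tab plus chi-square test can be sketched with pandas and scipy; the departments, counts, and satisfaction rates are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey: satisfaction by department (60 respondents each)
df = pd.DataFrame({
    "dept": ["Sales"] * 60 + ["Eng"] * 60,
    "satisfied": ["yes"] * 45 + ["no"] * 15 + ["yes"] * 25 + ["no"] * 35,
})

# Cross-tab, then chi-square test of independence
table = pd.crosstab(df["dept"], df["satisfied"])
chi2, p_value, dof, expected = chi2_contingency(table)
significant = p_value < 0.05  # 75% vs. ~42% satisfied: a 33pp gap
```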
Small Sample Considerations
When n < 100:
- Report confidence intervals (wider with small samples)
- Avoid segmentation into more than 2-3 groups (sub-groups become too small)
- Use non-parametric tests (Mann-Whitney, Kruskal-Wallis) instead of parametric tests
- Present as directional findings, not definitive conclusions
- Combine with qualitative data for triangulation
Excel / Google Sheets Recipes
Pivot Table Essentials
Most common consulting analyses can be built with pivot tables:
- Revenue by segment × quarter: Rows = Segment, Columns = Quarter, Values = Sum of Revenue
- Customer concentration: Rows = Customer (sorted by revenue descending), Values = Sum of Revenue + Running % Total
- Trend analysis: Rows = Month, Values = Sum of Metric, add calculated field for MoM% change
- Cross-tab: Rows = Dimension A, Columns = Dimension B, Values = Count or Average
Essential Excel Formulas for Consulting
Lookup & Reference:
- XLOOKUP(lookup_value, lookup_array, return_array) — modern replacement for VLOOKUP
- INDEX(MATCH()) — flexible lookup for complex scenarios
Conditional Aggregation:
- SUMIFS(sum_range, criteria_range1, criteria1, ...) — sum with multiple conditions
- COUNTIFS() — count with multiple conditions
- AVERAGEIFS() — average with multiple conditions
Growth & Change:
- YoY Growth: =(Current - Prior) / Prior
- CAGR: =(End_Value / Start_Value)^(1/Years) - 1
- Moving Average: =AVERAGE(OFFSET(cell, 0, 0, -N, 1)) or use the AVERAGE of a sliding range
Statistical:
- PERCENTILE.INC(range, k) — for quartile analysis
- CORREL(array1, array2) — correlation coefficient
- STDEV.S(range) — standard deviation (sample)
- MEDIAN(range) — robust central tendency
Financial:
- NPV(rate, cashflow_range) — net present value
- IRR(cashflow_range) — internal rate of return
- PMT(rate, nper, pv) — loan payment calculation
Text & Cleanup:
- TRIM(), CLEAN(), PROPER() — clean messy data
- TEXT(value, format) — format numbers for labels
- TEXTJOIN(delimiter, ignore_empty, range) — concatenate with separator
Sensitivity Table in Excel
- Set up the model with one input cell referenced throughout
- Create a one-variable data table: list input values in a column, reference the output cell
- Select the range → Data → What-If Analysis → Data Table
- For two-variable: input values in row header AND column header, output formula in corner cell
Consulting Chart Formatting in Excel
- Create the chart → right-click → select "Move Chart" → new sheet (keeps it clean)
- Delete: gridlines, chart border, unnecessary legend
- Add: descriptive title, axis labels with units, data labels on key points
- Color: one accent color for the focal series, gray for everything else
- Font: consistent with the deck (Calibri, Arial, or the client's brand font)
For chart selection guides, statistical method details, and Python/Excel analysis recipes, consult the reference files in the references/ directory.
Source
git clone https://github.com/abinauv/business-consulting.git
The skill file lives at skills/data-analysis/SKILL.md within the repository.
Overview
Data analysis specialist focused on extracting consulting-quality insights from data. This skill covers cleaning, transforming, EDA, segmentation, correlation, time-series, Pareto analysis, visualizations, and building analytical models to inform decisions.
How This Skill Works
Start with data preparation and cleaning to fix quality issues. Apply transformations like pivoting, unpivoting, merging, and grouping to shape data for analysis. Conduct Exploratory Data Analysis with descriptive statistics, segmentation (RFM, decile, clustering), correlation analysis, and time-series techniques, then use Pareto analysis to prioritize insights and present results with visuals.
When to Use It
- When data quality is inconsistent or incomplete and you need a clean dataset for analysis
- When you need descriptive statistics, correlations, or a heatmap to understand relationships
- When targeting customers or users, requiring segmentation (RFM, decile) or clustering to reveal groups
- When analyzing time-based metrics to identify trends, seasonality, and growth rates
- When you must prioritize actions and communicate impact using Pareto analysis and dashboards
Quick Start
- Step 1: Import data and run quality checks for missing values, duplicates, and inconsistent formatting
- Step 2: Apply basic transformations (pivot, merge) and create calculated fields (growth rate, totals)
- Step 3: Perform descriptive stats, simple correlations, and a Pareto analysis; craft visuals and a concise insights narrative
Best Practices
- Start with a data quality checklist (missing values, duplicates, outliers) and document cleaning rules
- Use consistent transformations (pivot/unpivot/merge) and maintain a clear data dictionary
- Validate findings with multiple methods (descriptive stats, correlations, clustering) and visuals
- Align time-series data (granularity, gaps) and compute robust moving averages
- Present insights with clear visuals, state caveats (correlation does not imply causation), and tell a data story
Example Use Cases
- Clean a customer dataset, deduplicate records, standardize dates, and create a baseline dashboard
- Build an RFM segmentation to identify Champions and At-Risk customers for targeted campaigns
- Create a Pareto chart to identify the top 20% of products driving ~80% of revenue
- Analyze time-series sales, compute YoY and MoM growth, and apply moving averages for smoothing
- Develop a dashboard with pivot tables summarizing key KPIs and trends across regions