data-manipulation
Data Manipulation

npx machina-cli add skill pablodiegoo/Data-Pro-Skill/data-manipulation --openclaw
High-performance data manipulation and transformation suite using Pandas, Numpy, and DuckDB. This skill handles the "T-Layer" (Transformation) of the data pipeline, preparing raw data for statistical analysis or reporting.
1. Core Capabilities
A. Dictionary Mapping & Cleaning
Standardizes raw, cryptic variables into semantic labels using external mapping dictionaries.
- Script: scripts/dict_mapper.py
- Reference: pipeline.md
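The script itself isn't reproduced here, but the idea can be sketched with plain pandas; the mapping dictionaries below are hypothetical stand-ins for the external ones the skill loads:

```python
import pandas as pd

# Hypothetical mapping dictionaries: cryptic survey codes -> semantic labels.
COLUMN_MAP = {"v001": "age", "v002": "income"}
VALUE_MAP = {"income": {1: "low", 2: "medium", 3: "high"}}

def apply_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and recode values using external mapping dictionaries."""
    out = df.rename(columns=COLUMN_MAP)
    for col, mapping in VALUE_MAP.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out

raw = pd.DataFrame({"v001": [34, 51], "v002": [1, 3]})
tidy = apply_dictionary(raw)
print(tidy.columns.tolist())  # ['age', 'income']
```

In practice the dictionaries would be loaded from an external file rather than hard-coded, which keeps the mapping auditable and reusable across pipelines.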
B. High-Performance Ingestion & Aggregation (DuckDB Track)
Provides extremely fast ingestion and fuzzy cleaning for messy local files or files > 1GB.
- Scripts: scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py
- Reference: duckdb_analytics.md
C. Sample Weighting
Calculates and applies expansion weights for representative survey analysis.
- Script: scripts/weighting.py
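A minimal sketch of post-stratification expansion weights, assuming hypothetical stratum totals and column names (the real weighting.py may implement a different scheme):

```python
import pandas as pd

# Hypothetical known population totals per stratum.
POPULATION = {"urban": 8000, "rural": 2000}

def expansion_weights(sample: pd.DataFrame, stratum_col: str = "stratum") -> pd.Series:
    """Weight = stratum population total / stratum sample size."""
    counts = sample[stratum_col].value_counts()
    return sample[stratum_col].map(lambda s: POPULATION[s] / counts[s])

sample = pd.DataFrame({"stratum": ["urban"] * 4 + ["rural"] * 1})
sample["weight"] = expansion_weights(sample)
# The weighted sample size reproduces the population total.
print(sample["weight"].sum())  # 10000.0
```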
D. Data Organization
Utilities for managing project directory structures and data discovery.
- Script: scripts/data_directory_finder.py
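One plausible sketch of data discovery with pathlib; the real data_directory_finder.py may follow project-specific conventions:

```python
import tempfile
from pathlib import Path

def find_data_files(root: Path, patterns=("*.csv", "*.parquet")) -> list[Path]:
    """Recursively list data files under a project root, sorted for stable output."""
    return sorted(p for pat in patterns for p in root.rglob(pat))

# Demo on a throwaway directory tree.
root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
(root / "raw" / "survey.csv").touch()
print([p.name for p in find_data_files(root)])  # ['survey.csv']
```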
2. Technical Guidelines
- Efficiency: Use Pandas Patterns (vectorization, categorical dtypes) for datasets under 1GB.
- Robustness: Use NumPy Stats for low-level numerical transformations.
- Hierarchy: Always prefer Parquet for intermediate data storage to preserve data types.
3. Reference Decision Matrix
| Task | Recommended Tool | Pattern Reference |
|---|---|---|
| Join > 5M rows | DuckDB | Analysis Pattern |
| Wide-to-Long | Pandas melt | Tidy Pattern |
| Clean Outliers | NumPy/SciPy | Stats Pattern |
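Two rows of the matrix, sketched: `melt` for Wide-to-Long, and an IQR rule for outliers (one common NumPy approach, not necessarily the skill's exact "Stats Pattern"):

```python
import numpy as np
import pandas as pd

# Wide-to-Long ("Tidy Pattern"): one observation per row.
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
long = wide.melt(id_vars="id", var_name="question", value_name="score")
print(long.shape)  # (4, 3)

# Outlier cleaning via the 1.5*IQR rule with NumPy.
x = np.array([1.0, 2.0, 2.5, 100.0])
q1, q3 = np.percentile(x, [25, 75])
bound = 1.5 * (q3 - q1)
clean = x[(x >= q1 - bound) & (x <= q3 + bound)]
print(clean.tolist())  # [1.0, 2.0, 2.5] -- 100.0 removed
```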
> [!IMPORTANT]
> This skill focuses on preparation. For statistical inference, multivariate modeling, or causal analysis, defer to the @data-analysis-suite.
Source
git clone https://github.com/pablodiegoo/Data-Pro-Skill
(SKILL.md path: src/datapro/data/skills/data-manipulation/SKILL.md)
Overview
Data Manipulation is a high-performance suite for cleaning, transforming, and aggregating structured data (CSV, Parquet, JSON). It focuses on preparing raw data for analysis and reporting, including the Tidy Data (Wide-to-Long) strategy.
How This Skill Works
It combines Pandas for vectorized transformations and melt-based reshaping, NumPy for numerical operations, and DuckDB for fast ingestion and large-scale analytics. It standardizes variables with dictionary mappings, performs high-speed data processing, and stores intermediate results in Parquet to preserve data types and enable repeatable pipelines.
When to Use It
- Clean or transform structured data (CSV, Parquet, JSON) to prepare for analysis
- Perform large-scale aggregations or analytics on big datasets
- Optimize analysis for performance and memory usage on constrained hardware
- Implement the Tidy Data (Wide-to-Long) strategy for reporting
- Ingest and fuzzily clean messy local files larger than 1GB with DuckDB
Quick Start
- Step 1: Use the DuckDB fuzzy cleaner to ingest and clean large data (scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py)
- Step 2: Apply dictionary mappings to standardize variables (scripts/dict_mapper.py)
- Step 3: If needed, perform Wide-to-Long tidy data transformation with Pandas melt and store results as Parquet
Best Practices
- Use Parquet for intermediate data storage to preserve data types
- Apply Pandas patterns (vectorization, categorical dtypes) for datasets under 1GB
- Use NumPy Stats for low-level numerical transformations
- Prefer DuckDB for fast ingestion and aggregation of large files (>1GB)
- Standardize variables early with dictionary mapping before analytics
Example Use Cases
- Standardize cryptic variables into semantic labels using dict_mapper.py
- Ingest and fuzzy-clean large local files (>1GB) with duckdb_fuzzy_cleaner.py and run quantitative analytics with quant_analyzer_duckdb.py
- Calculate and apply expansion weights for representative survey analysis using weighting.py
- Organize project data directories and enable data discovery with data_directory_finder.py
- Convert wide data to long format for tidy reporting using Pandas melt and Parquet storage