data-manipulation
Data Manipulation

npx machina-cli add skill pablodiegoo/Data-Pro-Skill/data-manipulation --openclaw
High-performance data manipulation and transformation suite using Pandas, Numpy, and DuckDB. This skill handles the "T-Layer" (Transformation) of the data pipeline, preparing raw data for statistical analysis or reporting.
1. Core Capabilities
A. Dictionary Mapping & Cleaning
Standardizes raw, cryptic variables into semantic labels using external mapping dictionaries.
- Script: scripts/dict_mapper.py
- Reference: pipeline.md
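The script itself isn't reproduced here, but the idea can be sketched with plain pandas; the mapping dictionaries below are hypothetical stand-ins for the external ones the skill loads:

```python
import pandas as pd

# Hypothetical mapping dictionaries: cryptic survey codes -> semantic labels.
COLUMN_MAP = {"v001": "age", "v002": "income"}
VALUE_MAP = {"income": {1: "low", 2: "medium", 3: "high"}}

def apply_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and recode values using external mapping dictionaries."""
    out = df.rename(columns=COLUMN_MAP)
    for col, mapping in VALUE_MAP.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out

raw = pd.DataFrame({"v001": [34, 51], "v002": [1, 3]})
tidy = apply_dictionary(raw)
print(tidy.columns.tolist())  # ['age', 'income']
```

In practice the dictionaries would be loaded from an external file rather than hard-coded, which keeps the mapping auditable and reusable across pipelines.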
B. High-Performance Ingestion & Aggregation (DuckDB Track)
Provides extremely fast ingestion and fuzzy cleaning for messy local files or files > 1GB.
- Scripts: scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py
- Reference: duckdb_analytics.md
C. Sample Weighting
Calculates and applies expansion weights for representative survey analysis.
- Script: scripts/weighting.py
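A minimal sketch of post-stratification expansion weights, assuming hypothetical stratum totals and column names (the real weighting.py may implement a different scheme):

```python
import pandas as pd

# Hypothetical known population totals per stratum.
POPULATION = {"urban": 8000, "rural": 2000}

def expansion_weights(sample: pd.DataFrame, stratum_col: str = "stratum") -> pd.Series:
    """Weight = stratum population total / stratum sample size."""
    counts = sample[stratum_col].value_counts()
    return sample[stratum_col].map(lambda s: POPULATION[s] / counts[s])

sample = pd.DataFrame({"stratum": ["urban"] * 4 + ["rural"] * 1})
sample["weight"] = expansion_weights(sample)
# The weighted sample size reproduces the population total.
print(sample["weight"].sum())  # 10000.0
```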
D. Data Organization
Utilities for managing project directory structures and data discovery.
- Script: scripts/data_directory_finder.py
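One plausible sketch of data discovery with pathlib; the real data_directory_finder.py may follow project-specific conventions:

```python
import tempfile
from pathlib import Path

def find_data_files(root: Path, patterns=("*.csv", "*.parquet")) -> list[Path]:
    """Recursively list data files under a project root, sorted for stable output."""
    return sorted(p for pat in patterns for p in root.rglob(pat))

# Demo on a throwaway directory tree.
root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
(root / "raw" / "survey.csv").touch()
print([p.name for p in find_data_files(root)])  # ['survey.csv']
```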
2. Technical Guidelines
- Efficiency: Use Pandas Patterns (vectorization, categorical dtypes) for datasets under 1GB.
- Robustness: Use NumPy Stats for low-level numerical transformations.
- Hierarchy: Always prefer Parquet for intermediate data storage to preserve data types.
3. Reference Decision Matrix
| Task | Recommended Tool | Pattern Reference |
|---|---|---|
| Join > 5M rows | DuckDB | Analysis Pattern |
| Wide-to-Long | Pandas melt | Tidy Pattern |
| Clean Outliers | NumPy/SciPy | Stats Pattern |
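Two rows of the matrix, sketched: `melt` for Wide-to-Long, and an IQR rule for outliers (one common NumPy approach, not necessarily the skill's exact "Stats Pattern"):

```python
import numpy as np
import pandas as pd

# Wide-to-Long ("Tidy Pattern"): one observation per row.
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
long = wide.melt(id_vars="id", var_name="question", value_name="score")
print(long.shape)  # (4, 3)

# Outlier cleaning via the 1.5*IQR rule with NumPy.
x = np.array([1.0, 2.0, 2.5, 100.0])
q1, q3 = np.percentile(x, [25, 75])
bound = 1.5 * (q3 - q1)
clean = x[(x >= q1 - bound) & (x <= q3 + bound)]
print(clean.tolist())  # [1.0, 2.0, 2.5] -- 100.0 removed
```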
> [!IMPORTANT]
> This skill focuses on preparation. For statistical inference, multivariate modeling, or causal analysis, defer to the @data-analysis-suite.
Source
git clone https://github.com/pablodiegoo/Data-Pro-Skill
(SKILL.md path: src/datapro/data/skills/data-manipulation/SKILL.md)
Overview
Data Manipulation is a high-performance suite for cleaning, transforming, and aggregating structured data (CSV, Parquet, JSON). It focuses on preparing raw data for analysis and reporting, including the Tidy Data (Wide-to-Long) strategy.
How This Skill Works
It combines Pandas for vectorized transformations and melt-based reshaping, NumPy for numerical operations, and DuckDB for fast ingestion and large-scale analytics. It standardizes variables with dictionary mappings, performs high-speed data processing, and stores intermediate results in Parquet to preserve data types and enable repeatable pipelines.
When to Use It
- Clean or transform structured data (CSV, Parquet, JSON) to prepare for analysis
- Perform large-scale aggregations or analytics on big datasets
- Optimize analysis for performance and memory usage on constrained hardware
- Implement the Tidy Data (Wide-to-Long) strategy for reporting
- Ingest and fuzzily clean messy local files larger than 1GB with DuckDB
Quick Start
- Step 1: Use the DuckDB fuzzy cleaner to ingest and clean large data (scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py)
- Step 2: Apply dictionary mappings to standardize variables (scripts/dict_mapper.py)
- Step 3: If needed, perform Wide-to-Long tidy data transformation with Pandas melt and store results as Parquet
Best Practices
- Use Parquet for intermediate data storage to preserve data types
- Apply Pandas patterns (vectorization, categorical dtypes) for datasets under 1GB
- Use NumPy Stats for low-level numerical transformations
- Prefer DuckDB for fast ingestion and aggregation of large files (>1GB)
- Standardize variables early with dictionary mapping before analytics
Example Use Cases
- Standardize cryptic variables into semantic labels using dict_mapper.py
- Ingest and fuzzy-clean large local files (>1GB) with duckdb_fuzzy_cleaner.py and run quantitative analytics with quant_analyzer_duckdb.py
- Calculate and apply expansion weights for representative survey analysis using weighting.py
- Organize project data directories and enable data discovery with data_directory_finder.py
- Convert wide data to long format for tidy reporting using Pandas melt and Parquet storage