
data-manipulation

```shell
npx machina-cli add skill pablodiegoo/Data-Pro-Skill/data-manipulation --openclaw
```

Data Manipulation

High-performance data manipulation and transformation suite using Pandas, Numpy, and DuckDB. This skill handles the "T-Layer" (Transformation) of the data pipeline, preparing raw data for statistical analysis or reporting.

1. Core Capabilities

A. Dictionary Mapping & Cleaning

Standardizes raw, cryptic variables into semantic labels using external mapping dictionaries.

  • Script: scripts/dict_mapper.py
  • Reference: pipeline.md
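The bundled `dict_mapper.py` is not reproduced here, but the idea can be sketched with plain Pandas. The mapping dictionaries below (`COLUMN_MAP`, `VALUE_MAP`) are hypothetical stand-ins for the external mapping files the skill would load:

```python
import pandas as pd

# Hypothetical mapping dictionaries: raw survey codes -> semantic labels.
COLUMN_MAP = {"v001": "age", "v002": "income", "v003": "region"}
VALUE_MAP = {"region": {1: "north", 2: "south"}}

def apply_mapping(df: pd.DataFrame) -> pd.DataFrame:
    """Rename cryptic columns, then decode coded values per column."""
    out = df.rename(columns=COLUMN_MAP)
    for col, mapping in VALUE_MAP.items():
        out[col] = out[col].map(mapping)
    return out

raw = pd.DataFrame({"v001": [34, 51], "v002": [42000, 58000], "v003": [1, 2]})
tidy = apply_mapping(raw)
print(list(tidy.columns))  # ['age', 'income', 'region']
```

In the real pipeline the dictionaries would live in external files so analysts can update labels without touching code.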

B. High-Performance Ingestion & Aggregation (DuckDB Track)

Provides extremely fast ingestion and fuzzy cleaning for messy local files or files > 1GB.

  • Scripts: scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py
  • Reference: duckdb_analytics.md

C. Sample Weighting

Calculates and applies expansion weights for representative survey analysis.

  • Script: scripts/weighting.py

D. Data Organization

Utilities for managing project directory structures and data discovery.

  • Script: scripts/data_directory_finder.py
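How `data_directory_finder.py` works internally is not shown; a plausible minimal sketch of "data discovery" is a recursive walk that indexes files by data-bearing suffix (the function name and suffix list are assumptions):

```python
import tempfile
from pathlib import Path

def find_data_files(root, suffixes=(".csv", ".parquet", ".json")):
    """Recursively collect data files under a project root, sorted by path."""
    root = Path(root)
    return sorted(p for p in root.rglob("*") if p.suffix in suffixes)

# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "raw").mkdir()
    (Path(d) / "raw" / "survey.csv").touch()
    (Path(d) / "notes.txt").touch()  # not a data file; should be skipped
    found = find_data_files(d)
    print([p.name for p in found])  # ['survey.csv']
```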

2. Technical Guidelines

  1. Efficiency: Use Pandas Patterns (vectorization, categorical dtypes) for datasets under 1GB.
  2. Robustness: Use NumPy Stats for low-level numerical transformations.
  3. Hierarchy: Always prefer Parquet for intermediate data storage to preserve data types.
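Guideline 1 can be demonstrated concretely: converting a repetitive string column to the categorical dtype stores each label once plus compact integer codes, cutting memory well before any DuckDB handoff is needed.

```python
import pandas as pd

# A repetitive string column: prime candidate for the categorical dtype.
df = pd.DataFrame({"region": ["north", "south", "north", "south"] * 2500})

obj_bytes = df["region"].memory_usage(deep=True)   # object-dtype footprint
df["region"] = df["region"].astype("category")
cat_bytes = df["region"].memory_usage(deep=True)   # categorical footprint

# Two unique labels -> int8 codes; the categorical column is far smaller.
print(cat_bytes < obj_bytes)  # True
```

The same type-preservation logic motivates guideline 3: Parquet round-trips dtypes (including categoricals) exactly, whereas CSV flattens everything back to strings.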

3. Reference Decision Matrix

| Task | Recommended Tool | Pattern Reference |
| --- | --- | --- |
| Join > 5M rows | DuckDB | Analysis Pattern |
| Wide-to-Long | Pandas `melt` | Tidy Pattern |
| Clean Outliers | NumPy/SciPy | Stats Pattern |

> [!IMPORTANT]
> This skill focuses on preparation. For statistical inference, multivariate modeling, or causal analysis, defer to the @data-analysis-suite.

Source

git clone https://github.com/pablodiegoo/Data-Pro-Skill

Skill file: src/datapro/data/skills/data-manipulation/SKILL.md (View on GitHub)

Overview

Data Manipulation is a high-performance suite for cleaning, transforming, and aggregating structured data (CSV, Parquet, JSON). It focuses on preparing raw data for analysis and reporting, including the Tidy Data (Wide-to-Long) strategy.

How This Skill Works

It combines Pandas for vectorized transformations and melt-based reshaping, NumPy for numerical operations, and DuckDB for fast ingestion and large-scale analytics. It standardizes variables with dictionary mappings, performs high-speed data processing, and stores intermediate results in Parquet to preserve data types and enable repeatable pipelines.

When to Use It

  • Clean or transform structured data (CSV, Parquet, JSON) to prepare for analysis
  • Perform large-scale aggregations or analytics on big datasets
  • Optimize analysis for performance and memory usage on constrained hardware
  • Implement the Tidy Data (Wide-to-Long) strategy for reporting
  • Ingest and fuzzily clean messy local files larger than 1GB with DuckDB

Quick Start

  1. Ingest and clean large raw data with the DuckDB fuzzy cleaner (scripts/duckdb_fuzzy_cleaner.py, scripts/quant_analyzer_duckdb.py)
  2. Apply dictionary mappings to standardize variables (scripts/dict_mapper.py)
  3. If needed, reshape Wide-to-Long with Pandas melt and store the result as Parquet

Best Practices

  • Use Parquet for intermediate data storage to preserve data types
  • Apply Pandas patterns (vectorization, categorical dtypes) for datasets under 1GB
  • Use NumPy Stats for low-level numerical transformations
  • Prefer DuckDB for fast ingestion and aggregation of large files (>1GB)
  • Standardize variables early with dictionary mapping before analytics

Example Use Cases

  • Standardize cryptic variables into semantic labels using dict_mapper.py
  • Ingest and fuzzy-clean large local files (>1GB) with duckdb_fuzzy_cleaner.py and run quantitative analysis with quant_analyzer_duckdb.py
  • Calculate and apply expansion weights for representative survey analysis using weighting.py
  • Organize project data directories and enable data discovery with data_directory_finder.py
  • Convert wide data to long format for tidy reporting using Pandas melt and Parquet storage

