clustering-toolkit
Install: npx machina-cli add skill pablodiegoo/Data-Pro-Skill/clustering-toolkit --openclaw
Clustering Toolkit Skill
This skill provides a specialized pipeline for identifying homogeneous groups within high-dimensional datasets. It combines dimensionality reduction (PCA) with density-based clustering (DBSCAN) to find natural patterns while filtering noise.
Capabilities
1. PCA+DBSCAN Grouping (pca_dbscan_grouping)
A hybrid pipeline that uses Principal Component Analysis to extract features and DBSCAN to group entities.
- Supports hybrid features (numerical + categorical weights).
- Configurable walk-forward clustering for dynamic datasets.
- Automatic noise detection (outliers).
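The PCA-then-DBSCAN idea can be sketched with scikit-learn alone; this is a minimal illustration, not the skill's own PCA_DBSCAN_Pipeline (which adds hybrid categorical weights and walk-forward support):

```python
# Minimal PCA -> DBSCAN sketch: two dense blobs plus scattered noise in 20 dims.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, (40, 20)),   # dense group A
    rng.normal(3.0, 0.1, (40, 20)),   # dense group B
    rng.uniform(-5, 8, (10, 20)),     # scattered outliers
])

X_scaled = StandardScaler().fit_transform(X)        # scale before PCA
X_reduced = PCA(n_components=5).fit_transform(X_scaled)
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_reduced)

# DBSCAN marks outliers with the label -1.
print("clusters:", sorted(set(labels) - {-1}), "noise points:", (labels == -1).sum())
```

The eps value here is tuned to this toy geometry; real data needs its own tuning.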
2. Basic Segmentation (basic_clustering)
Standard K-Means clustering pipeline for rapid entity grouping.
- Automated feature scaling.
- Configurable cluster count (k).
- Centroid analysis for segment profiling.
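The scale-cluster-profile flow behind basic_clustering can be sketched directly with scikit-learn (the skill's module wraps these steps; the column names here are illustrative):

```python
# K-Means segmentation sketch: scale features, cluster, profile centroids.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "revenue": [10, 12, 11, 95, 99, 101],
    "churn":   [0.5, 0.6, 0.55, 0.1, 0.12, 0.08],
})

X = StandardScaler().fit_transform(df)              # automated feature scaling
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["segment"] = km.labels_

# Centroid analysis: mean of each original feature per segment.
print(df.groupby("segment").mean())
```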
3. Residual Segmentation (residual_segmentation)
Advanced behavioral segmentation using regression residuals (Actual vs. Predicted).
- Identifies "Delighted" vs "Disappointed" segments based on unmeasured variables.
- Automated distribution plotting and coefficient analysis.
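The residual-segmentation idea can be sketched as follows; this is a hypothetical example (the "price"/"satisfaction" columns and the one-standard-deviation cutoff are illustrative assumptions, not the skill's exact rule):

```python
# Residual segmentation sketch: fit a regression, then label entities by how
# far actual values deviate from predicted ones.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
df = pd.DataFrame({"price": rng.uniform(10, 100, 200)})
# Satisfaction driven by price plus an unmeasured factor (the residual signal).
df["satisfaction"] = 0.5 * df["price"] + rng.normal(0, 5, 200)

model = LinearRegression().fit(df[["price"]], df["satisfaction"])
df["residual"] = df["satisfaction"] - model.predict(df[["price"]])

# Entities doing much better than predicted are "Delighted"; much worse, "Disappointed".
cut = df["residual"].std()
df["segment"] = np.select(
    [df["residual"] > cut, df["residual"] < -cut],
    ["Delighted", "Disappointed"],
    default="Neutral",
)
print(df["segment"].value_counts())
```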
4. Gower Distance Matrix (gower_distance)
Similarity metric for mixed data types (numerical + categorical).
- Handles NaNs gracefully.
- Core component for distance-based clustering when one-hot encoding is undesirable.
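The Gower metric itself is simple to sketch: numeric columns contribute a range-normalized absolute difference, categorical columns a 0/1 mismatch, averaged over all features. This toy version omits the NaN handling and weighting the skill's gower_distance module provides:

```python
# Pairwise Gower-style distance for mixed numeric + categorical data.
import numpy as np
import pandas as pd

def gower_pair(df, i, j, num_cols, cat_cols):
    parts = []
    for c in num_cols:                               # range-normalized |diff|
        rng = df[c].max() - df[c].min()
        parts.append(abs(df[c].iloc[i] - df[c].iloc[j]) / rng if rng else 0.0)
    for c in cat_cols:                               # simple mismatch
        parts.append(0.0 if df[c].iloc[i] == df[c].iloc[j] else 1.0)
    return float(np.mean(parts))

df = pd.DataFrame({
    "age":    [25, 60, 27],
    "sector": ["tech", "energy", "tech"],
})
print(gower_pair(df, 0, 2, ["age"], ["sector"]))  # similar pair -> ~0.029
print(gower_pair(df, 0, 1, ["age"], ["sector"]))  # dissimilar pair -> 1.0
```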
5. Cluster Quality Diagnostics (dbscan_cluster_quality)
Utilities to detect common clustering pathologies.
- Giant Cluster Ratio: Detects if a single group dominates the universe (>50%).
- Stability Metrics: Measures how often entities change groups over time.
- Configuration Scoring: Scalar metric to rank different hyperparameter (EPS, MinSamples) setups.
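The Giant Cluster Ratio diagnostic can be sketched as the share of non-noise points falling into the single largest cluster; this is a hypothetical re-implementation, not the skill's calculate_cluster_metrics function:

```python
# Giant Cluster Ratio: largest cluster's share of all clustered (non-noise) points.
import numpy as np

def giant_cluster_ratio(labels):
    labels = np.asarray(labels)
    clustered = labels[labels != -1]          # drop DBSCAN noise (-1)
    if clustered.size == 0:
        return 0.0
    _, counts = np.unique(clustered, return_counts=True)
    return counts.max() / clustered.size

labels = [0, 0, 0, 0, 0, 0, 1, 1, -1, -1]
ratio = giant_cluster_ratio(labels)
print(f"Giant_Ratio = {ratio:.2f}")           # 6 of 8 clustered points -> 0.75
if ratio > 0.5:
    print("Warning: Pathological giant cluster detected. Reduce EPS.")
```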
Usage
from scripts.pca_dbscan_grouping import PCA_DBSCAN_Pipeline
from scripts.dbscan_cluster_quality import calculate_cluster_metrics
# 1. Run clustering pipeline
pipeline = PCA_DBSCAN_Pipeline(n_components=5, eps=0.015)
clusters = pipeline.fit_predict(df)
# 2. Diagnose quality
metrics = calculate_cluster_metrics(clusters)
if metrics['Giant_Ratio'] > 0.5:
    print("Warning: Pathological giant cluster detected. Reduce EPS.")
Best Practices
- Feature Scaling: Always normalize features before PCA.
- Categorical Weights: Use sector_weight (or equivalent) to balance statistical similarity with domain knowledge.
- EPS Tuning: Small changes in eps can have drastic effects. Use grid_search_checkpoint for tuning.
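A plain grid search over eps can be sketched with scikit-learn; this ranks candidate values by silhouette score over non-noise points and is only an illustration of the tuning idea, not the skill's grid_search_checkpoint helper:

```python
# Grid-search sketch for eps tuning, scored by silhouette over non-noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=0)

results = {}
for eps in (0.3, 0.6, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    mask = labels != -1
    if len(set(labels[mask])) > 1:            # silhouette needs >= 2 clusters
        results[eps] = silhouette_score(X[mask], labels[mask])

best_eps = max(results, key=results.get)
print("scores:", results, "best eps:", best_eps)
```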
Detailed References
- Methodology: See pca_dbscan_methodology.md for pipeline, parameters, and diagnostics.
Dependencies
scikit-learn, pandas, numpy.
Source
https://github.com/pablodiegoo/Data-Pro-Skill/blob/main/src/datapro/data/skills/clustering-toolkit/SKILL.md
Overview
Clustering Toolkit provides a specialized pipeline to identify homogeneous groups in high-dimensional data by combining PCA for dimensionality reduction with DBSCAN clustering and noise filtering. It also includes modules for residual segmentation, Gower distance for mixed data, and robust quality diagnostics to surface clustering pathologies such as giant clusters or unstable configurations. Built around scikit-learn, pandas, and numpy, it offers a practical end-to-end solution.
How This Skill Works
The workflow typically reduces dimensionality with PCA, then applies DBSCAN to discover dense clusters while filtering noise. It offers additional components such as basic clustering with K-Means, residual segmentation using regression residuals, and a Gower distance matrix for mixed data types, plus utilities for cluster quality diagnostics like Giant Cluster Ratio, Stability Metrics, and Configuration Scoring.
When to Use It
- When you need natural groupings of assets, products, or clients from multi-dimensional features
- When you want dimensionality reduction via PCA before clustering high-dimensional data
- When you require noise filtering and density-based clustering (DBSCAN) to avoid spurious clusters
- When you must diagnose clustering pathologies such as giant cluster ratio or unstable configurations
- When working with mixed data types and you want distance-based clustering using Gower distance
Quick Start
- Step 1: Initialize and run: pipeline = PCA_DBSCAN_Pipeline(n_components=5, eps=0.015); clusters = pipeline.fit_predict(df)
- Step 2: Diagnose quality: metrics = calculate_cluster_metrics(clusters)
- Step 3: If metrics['Giant_Ratio'] > 0.5, adjust EPS or parameters and re-run
Best Practices
- Normalize features before applying PCA to ensure balanced variance across dimensions
- Use domain-aware weights (e.g., sector_weight) to balance statistical similarity with business knowledge
- Tune EPS and MinSamples carefully; small changes can drastically affect results; use grid search/checkpoints
- Run cluster quality diagnostics (Giant_Ratio, Stability Metrics, Configuration Scoring) to catch pathological outcomes
- Prefer PCA+DBSCAN for high-dimensional data; switch to faster methods like Basic Clustering when rapid prototyping is needed
Example Use Cases
- Segment customers for targeted marketing by clustering multi-dimensional behavioral features
- Group products for catalog optimization based on feature similarity and usage patterns
- Cluster assets (e.g., financial or industrial) for risk profiling with noise filtering
- Monitor evolving datasets with walk-forward clustering to assess stability over time
- Apply Gower distance for mixed numeric and categorical data in a mixed-data customer profile