
clustering-toolkit

npx machina-cli add skill pablodiegoo/Data-Pro-Skill/clustering-toolkit --openclaw

Clustering Toolkit Skill

This skill provides a specialized pipeline for identifying homogeneous groups within high-dimensional datasets. It combines dimensionality reduction (PCA) with density-based clustering (DBSCAN) to find natural patterns while filtering noise.

Capabilities

1. PCA+DBSCAN Grouping (pca_dbscan_grouping)

A hybrid pipeline that uses Principal Component Analysis to extract features and DBSCAN to group entities.

  • Supports hybrid features (numerical + categorical weights).
  • Configurable walk-forward clustering for dynamic datasets.
  • Automatic noise detection (outliers).
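The hybrid pipeline can be sketched with plain scikit-learn. `PCA_DBSCAN_Pipeline` is this skill's class; the standalone `PcaDbscanSketch` below is an illustrative reimplementation, and its parameter values (eps especially) are data-dependent, not recommendations.

```python
# Illustrative PCA+DBSCAN pipeline: scale -> project -> density-cluster.
# DBSCAN labels noise points as -1, which is how outliers surface.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

class PcaDbscanSketch:
    def __init__(self, n_components=5, eps=0.5, min_samples=5):
        self.scaler = StandardScaler()
        self.pca = PCA(n_components=n_components)
        self.dbscan = DBSCAN(eps=eps, min_samples=min_samples)

    def fit_predict(self, X):
        X_scaled = self.scaler.fit_transform(X)
        X_reduced = self.pca.fit_transform(X_scaled)
        return self.dbscan.fit_predict(X_reduced)

rng = np.random.default_rng(0)
# Two well-separated blobs plus one far-away outlier.
X = np.vstack([rng.normal(0, 0.1, (20, 8)),
               rng.normal(5, 0.1, (20, 8)),
               np.full((1, 8), 50.0)])
labels = PcaDbscanSketch(n_components=2, eps=1.0, min_samples=5).fit_predict(X)
```

The outlier row ends up labeled -1 (noise) while the two dense blobs receive cluster labels, which is the behavior the skill's automatic noise detection relies on.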

2. Basic Segmentation (basic_clustering)

Standard K-Means clustering pipeline for rapid entity grouping.

  • Automated feature scaling.
  • Configurable cluster count (k).
  • Centroid analysis for segment profiling.
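A minimal version of this flow, with a hypothetical `basic_clustering_sketch` helper standing in for the skill's `basic_clustering` (the real API may differ):

```python
# Basic segmentation sketch: scale features, run K-Means, then recover
# centroids in original units for segment profiling.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def basic_clustering_sketch(df, k=3, random_state=0):
    scaler = StandardScaler()
    X = scaler.fit_transform(df)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)
    # Inverse-transform the centroids so segment profiles read in real units.
    centroids = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                             columns=df.columns)
    return labels, centroids

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "spend": np.concatenate([rng.normal(10, 1, 30), rng.normal(100, 5, 30)]),
    "visits": np.concatenate([rng.normal(2, 0.2, 30), rng.normal(20, 1, 30)]),
})
labels, centroids = basic_clustering_sketch(df, k=2)
```

Scaling before K-Means matters here: without it, the `spend` column would dominate the Euclidean distance and `visits` would barely influence the segments.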

3. Residual Segmentation (residual_segmentation)

Advanced behavioral segmentation using regression residuals (Actual vs. Predicted).

  • Identifies "Delighted" vs "Disappointed" segments based on unmeasured variables.
  • Automated distribution plotting and coefficient analysis.
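The idea can be sketched as follows; the one-standard-deviation threshold and the "Neutral" middle band are illustrative choices, not the skill's documented defaults:

```python
# Residual segmentation sketch: fit a regression, then label entities by
# how far actual values sit above or below the prediction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# "Delighted" = actual well above predicted; "Disappointed" = well below.
std = residuals.std()
segments = np.where(residuals > std, "Delighted",
            np.where(residuals < -std, "Disappointed", "Neutral"))
```

Large positive residuals flag entities outperforming what the measured features explain, i.e. behavior driven by unmeasured variables.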

4. Gower Distance Matrix (gower_distance)

Similarity metric for mixed data types (numerical + categorical).

  • Handles NaNs gracefully.
  • Core component for distance-based clustering when one-hot encoding is undesirable.
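A minimal Gower computation can be hand-rolled as below; this is an illustrative reimplementation, not the skill's `gower_distance` API. Numeric columns use range-normalized absolute difference, categoricals use 0/1 mismatch, and NaN pairs are skipped rather than propagated.

```python
# Minimal Gower distance sketch for mixed numeric + categorical data.
import numpy as np
import pandas as pd

def gower_distance_sketch(df):
    n = len(df)
    D = np.zeros((n, n))
    W = np.zeros((n, n))  # count of usable (non-NaN) feature comparisons
    for col in df.columns:
        s = df[col]
        valid = s.notna().to_numpy()
        mask = np.outer(valid, valid)          # pairs where both values exist
        if pd.api.types.is_numeric_dtype(s):
            rng_ = np.nanmax(s) - np.nanmin(s)
            v = s.to_numpy(dtype=float)
            d = np.abs(v[:, None] - v[None, :]) / (rng_ if rng_ else 1.0)
            d = np.nan_to_num(d)               # masked out below anyway
        else:
            v = s.to_numpy()
            d = (v[:, None] != v[None, :]).astype(float)
        D += np.where(mask, d, 0.0)
        W += mask
    return D / np.maximum(W, 1)                # average over usable features

df = pd.DataFrame({"age": [20.0, 40.0, np.nan],
                   "sector": ["tech", "tech", "bank"]})
D = gower_distance_sketch(df)
```

Rows 0 and 1 differ by the full `age` range but share a sector, giving distance 0.5; for row 2 the missing `age` is skipped, so its distance to row 0 rests entirely on the sector mismatch.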

5. Cluster Quality Diagnostics (dbscan_cluster_quality)

Utilities to detect common clustering pathologies.

  • Giant Cluster Ratio: Detects if a single group dominates the universe (>50%).
  • Stability Metrics: Measures how often entities change groups over time.
  • Configuration Scoring: Scalar metric to rank different hyperparameter (EPS, MinSamples) setups.
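The Giant Cluster Ratio, for instance, reduces to a few lines; `calculate_cluster_metrics` is the skill's utility, so the function below is an illustrative stand-in:

```python
# Giant Cluster Ratio sketch: the share of non-noise points that fall
# in the single largest cluster. Values above 0.5 suggest one group
# is swallowing the universe.
import numpy as np

def giant_cluster_ratio(labels):
    labels = np.asarray(labels)
    clustered = labels[labels != -1]   # DBSCAN marks noise as -1
    if clustered.size == 0:
        return 0.0
    _, counts = np.unique(clustered, return_counts=True)
    return counts.max() / clustered.size

labels = [0] * 80 + [1] * 15 + [-1] * 5
ratio = giant_cluster_ratio(labels)
```

Here 80 of the 95 clustered points sit in cluster 0, so the ratio is about 0.84 and the configuration would be flagged as pathological.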

Usage

from scripts.pca_dbscan_grouping import PCA_DBSCAN_Pipeline
from scripts.dbscan_cluster_quality import calculate_cluster_metrics

# 1. Run clustering pipeline
pipeline = PCA_DBSCAN_Pipeline(n_components=5, eps=0.015)
clusters = pipeline.fit_predict(df)

# 2. Diagnose quality
metrics = calculate_cluster_metrics(clusters)
if metrics['Giant_Ratio'] > 0.5:
    print("Warning: Pathological giant cluster detected. Reduce EPS.")

Best Practices

  • Feature Scaling: Always normalize features before PCA.
  • Categorical Weights: Use sector_weight (or equivalent) to balance statistical similarity with domain knowledge.
  • EPS Tuning: Small changes in eps can have drastic effects. Use grid_search_checkpoint for tuning.
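As a stand-in for the skill's `grid_search_checkpoint` helper, the sketch below scores a small eps grid with the silhouette coefficient on non-noise points; the grid values and scoring metric are illustrative assumptions.

```python
# Eps grid-search sketch: try several eps values, keep the one whose
# non-noise clustering has the best silhouette score.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(3, 0.1, (30, 2))])

best = (None, -1.0)
for eps in (0.05, 0.1, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    kept = labels != -1
    if len(set(labels[kept])) < 2:     # silhouette needs >= 2 clusters
        continue
    score = silhouette_score(X[kept], labels[kept])
    if score > best[1]:
        best = (eps, score)
```

Excluding the noise label (-1) before scoring matters: silhouette treats every label as a cluster, so leaving the noise points in would reward configurations that shove disagreements into one fake "cluster".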

Detailed References

Dependencies

scikit-learn, pandas, numpy.

Source

git clone https://github.com/pablodiegoo/Data-Pro-Skill.git
# Skill file: src/datapro/data/skills/clustering-toolkit/SKILL.md

Overview

Clustering Toolkit provides a specialized pipeline to identify homogeneous groups in high-dimensional data by combining PCA for dimensionality reduction with DBSCAN clustering and noise filtering. It also includes modules for residual segmentation, Gower distance for mixed data types, and quality diagnostics that surface clustering pathologies such as giant clusters or unstable configurations. Built on scikit-learn, pandas, and numpy, it covers the workflow end to end, from feature reduction to cluster diagnostics.

How This Skill Works

The workflow typically reduces dimensionality with PCA, then applies DBSCAN to discover dense clusters while filtering noise. It offers additional components such as basic clustering with K-Means, residual segmentation using regression residuals, and a Gower distance matrix for mixed data types, plus utilities for cluster quality diagnostics like Giant Cluster Ratio, Stability Metrics, and Configuration Scoring.

When to Use It

  • When you need natural groupings of assets, products, or clients from multi-dimensional features
  • When you want dimensionality reduction via PCA before clustering high-dimensional data
  • When you require noise filtering and density-based clustering (DBSCAN) to avoid spurious clusters
  • When you must diagnose clustering pathologies such as giant cluster ratio or unstable configurations
  • When working with mixed data types and you want distance-based clustering using Gower distance

Quick Start

  1. Initialize and run: pipeline = PCA_DBSCAN_Pipeline(n_components=5, eps=0.015); clusters = pipeline.fit_predict(df)
  2. Diagnose quality: metrics = calculate_cluster_metrics(clusters)
  3. If metrics['Giant_Ratio'] > 0.5, adjust eps or other parameters and re-run

Best Practices

  • Normalize features before applying PCA to ensure balanced variance across dimensions
  • Use domain-aware weights (e.g., sector_weight) to balance statistical similarity with business knowledge
  • Tune EPS and MinSamples carefully; small changes can drastically affect results; use grid search/checkpoints
  • Run cluster quality diagnostics (Giant_Ratio, Stability Metrics, Configuration Scoring) to catch pathological outcomes
  • Prefer PCA+DBSCAN for high-dimensional data; switch to faster methods like Basic Clustering when rapid prototyping is needed

Example Use Cases

  • Segment customers for targeted marketing by clustering multi-dimensional behavioral features
  • Group products for catalog optimization based on feature similarity and usage patterns
  • Cluster assets (e.g., financial or industrial) for risk profiling with noise filtering
  • Monitor evolving datasets with walk-forward clustering to assess stability over time
  • Apply Gower distance for mixed numeric and categorical data in a mixed-data customer profile

