dask
npx machina-cli add skill G1Joshi/Agent-Skills/dask --openclaw
Dask scales Python: it offers Pandas/NumPy-like APIs but runs computations on clusters. The 2025 updates focus on High Performance Shuffle and GPU integration.
When to Use
- Big Data: When data > RAM but < BigQuery scale.
- Cluster Computing: Utilizing a Kubernetes cluster for Python functions.
- Xarray: Backend for geospatial data.
Core Concepts
Collections
dask.dataframe, dask.array, dask.bag.
Scheduler
Decides where to run tasks (Local Threads, Processes, or Distributed Cluster).
Dashboard
Real-time visualization of task progress (port 8787).
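The collections-plus-scheduler split is easy to illustrate without Dask itself: Dask's internal graph format is a plain dict mapping keys to either values or tasks (tuples whose first element is a callable). Below is a dependency-free sketch of that idea with a toy single-threaded scheduler; `get` and `resolve` are illustrative names, not Dask's API.

```python
# Toy version of Dask's task-graph format: a dict mapping keys to either
# literal values or tasks (tuples whose first element is a callable).
from operator import add, mul

graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

def resolve(dsk, arg):
    """Treat strings that are graph keys as references; pass literals through."""
    return get(dsk, arg) if isinstance(arg, str) and arg in dsk else arg

def get(dsk, key):
    """Recursively evaluate `key` in the graph (a toy scheduler)."""
    node = dsk[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *args = node
        return func(*(resolve(dsk, a) for a in args))
    return node

print(get(graph, "w"))  # 30
```

A real scheduler does the same dependency walk, but topologically orders the graph and dispatches independent tasks to threads, processes, or distributed workers.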
Best Practices (2025)
Do:
- Use dask-expr: the query-optimization engine for Dask DataFrames.
- Use Parquet: CSVs are notably slow in distributed settings.
Don't:
- Don't use for small data: The overhead of the scheduler makes it slower than Pandas for <1GB.
References
Source
https://github.com/G1Joshi/Agent-Skills/blob/main/skills/ai-ml/dask/SKILL.md
Overview
Dask scales Python, offering familiar Pandas/NumPy-like APIs while running computations on a cluster. It enables out-of-core processing and distributed execution, ideal when data exceeds RAM. The 2025 updates emphasize High Performance Shuffle and GPU integration.
How This Skill Works
Dask exposes task-based collections (dask.dataframe, dask.array, dask.bag) that build lazy task graphs. A scheduler then dispatches work to local threads, processes, or a Distributed Cluster. A real-time Dashboard (port 8787) visualizes progress and resource usage.
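The lazy-graph behavior is easiest to see with `dask.delayed` (a minimal sketch assuming Dask is installed; `inc` and `add` are throwaway example functions):

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing executes yet: each call just extends the task graph.
total = add(inc(1), inc(2))

# compute() hands the graph to a scheduler (local threads by default).
print(total.compute())  # 5
```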
When to Use It
- Big data that exceeds RAM but isn't at BigQuery scale
- Distributing Python function execution on a Kubernetes cluster
- Geospatial or Xarray workloads needing a Dask backend
- Pandas-like workflows that benefit from lazy evaluation and parallelism
- Parquet-based pipelines to avoid slow CSVs in distributed settings
Quick Start
- Step 1: Install Dask and the distributed scheduler (e.g., pip install 'dask[complete]' and 'dask[distributed]')
- Step 2: Create a Dask client and connect to a LocalCluster or a remote cluster
- Step 3: Load data with dask.dataframe (or dask.array), perform operations, and call compute() when needed
Best Practices
- Use dask-expr for query optimization in Dask DataFrames
- Prefer Parquet over CSV for distributed workloads
- Don't use Dask for small datasets (under roughly 1 GB); scheduler overhead can make it slower than Pandas
- Leverage the Distributed Scheduler for scalable task execution
- Monitor and tune performance with the Dask Dashboard (port 8787)
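Using the Distributed Scheduler can be as simple as creating a `Client` (a sketch; `processes=False` keeps the cluster in-process for a quick look, and the dashboard is then available via `client.dashboard_link`, on port 8787 by default):

```python
from dask.distributed import Client

# In-process cluster: fine for experimenting; pass a scheduler address
# or use LocalCluster for real multi-worker setups.
client = Client(processes=False)

# submit() schedules a task on the cluster; result() blocks for the answer.
future = client.submit(sum, range(10))
result = future.result()
print(result)  # 45

client.close()
```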
Example Use Cases
- Processing multi-GB CSV or Parquet datasets with dask.dataframe on a cluster
- Building Parquet-based data pipelines on cloud storage (S3/GCS) using Dask
- Geospatial analysis with Xarray backed by Dask for large raster data
- Large-scale array computations with dask.array across distributed workers
- Pandas-style analytics at scale via lazy graphs and distributed scheduling
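A `dask.array` sketch of the large-scale array case (deliberately small numbers; the same code scales when chunks are spread across distributed workers):

```python
import numpy as np
import dask.array as da

# A 1000x1000 array split into 16 chunks of 250x250; operations are lazy
# and run chunk-by-chunk, so no single worker needs the whole array.
x = da.from_array(np.arange(1_000_000).reshape(1000, 1000), chunks=(250, 250))

total = (x + 1).sum().compute()
print(total)
```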