fast-mlx
Install: npx machina-cli add skill unsanitary-bek/mlx-skills/fast-mlx --openclaw
Fast MLX
Workflow
- Look for opportunities to compile functions made up of mostly elementwise operations (see the sketch after this list).
- For models with fixed-shape inputs, or where shapes rarely change, compile the entire graph.
- Replace slow implementations with MLX fast ops.
- Identify evaluation boundaries and unintended sync points (mx.eval, item(), NumPy conversions).
- Check dtype promotion and scalar usage; keep precision consistent with intent.
- Review compilation strategy; avoid unnecessary recompiles and closure captures.
- Reduce peak memory via lazy loading order and by releasing temporaries before mx.eval.
- Suggest profiling steps if the bottleneck is unclear.
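A minimal sketch of the first two workflow items, assuming a small elementwise-heavy function (fused_activation and the shapes are illustrative, not part of the skill):

```python
import mlx.core as mx

def fused_activation(x, scale, bias):
    # Mostly elementwise work: scale, shift, sigmoid gate, residual add.
    y = x * scale + bias
    return x + y * mx.sigmoid(y)

# Compile once; later calls with matching shapes and dtypes reuse the program.
compiled = mx.compile(fused_activation)

x = mx.random.normal((1024, 1024))
out = compiled(x, mx.array(0.5), mx.array(0.1))
mx.eval(out)  # one deliberate evaluation point after the compiled call
```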
References
- Read references/fast-mlx-guide.md for detailed tips and examples. Use it as the source of truth.
Output expectations
- Provide concrete code changes with a brief rationale.
- Call out changes that need user confirmation (e.g., enabling async eval or shapeless compile), as sketched below.
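For reference, a hedged sketch of those two opt-in features; the norm function and shapes are placeholders, and both options change evaluation behavior, which is why they warrant confirmation:

```python
import mlx.core as mx

def norm(x):
    # Shape-agnostic logic: safe for shapeless compilation.
    return x / (mx.sqrt((x * x).sum(axis=-1, keepdims=True)) + 1e-6)

# shapeless=True traces once and reuses the program across input shapes;
# only valid when the function does not branch on concrete shape values.
norm_shapeless = mx.compile(norm, shapeless=True)

a = norm_shapeless(mx.random.normal((8, 128)))
b = norm_shapeless(mx.random.normal((16, 256)))  # no recompile for the new shape

# async_eval launches the work without blocking the Python thread.
mx.async_eval(a)
mx.eval(b)  # hard sync when the results are actually needed
```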
Source
git clone https://github.com/unsanitary-bek/mlx-skills (skill file: mlx_skills/skills/fast-mlx/SKILL.md)
Overview
fast-mlx helps you squeeze performance from MLX by compiling key operation paths and tuning memory usage. It targets elementwise kernels and fixed-shape graphs, and trims costly eval boundaries to minimize latency and peak memory. The workflow emphasizes avoiding unnecessary recompiles and handling dtypes carefully.
How This Skill Works
The skill identifies opportunities to compile elementwise operations and, when shapes are stable, compiles the entire graph. It replaces slow implementations with MLX fast ops, tracks evaluation boundaries such as mx.eval and NumPy conversions, and manages memory through lazy loading and timely release of temporaries, while keeping dtype promotion aligned with the intended precision.
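A sketch of that pattern under the assumption of a fixed-shape step function (step, the shapes, and the loop count are illustrative):

```python
import mlx.core as mx

@mx.compile
def step(x, w):
    # Fixed shapes: (64, 256) @ (256, 256) -> (64, 256), so one compile suffices.
    h = mx.maximum(x @ w, 0.0)
    return h, (h * h).mean()

x = mx.random.normal((64, 256))
w = mx.random.normal((256, 256))

losses = []
for _ in range(100):
    x, loss = step(x, w)
    losses.append(loss)        # stays a lazy mx.array; no .item() in the loop

mx.eval(x, losses)             # one evaluation boundary at the end
print(losses[-1].item())       # convert to a Python scalar only after eval
```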
When to Use It
- You want to speed up MLX models or algorithms
- You're hitting latency or throughput bottlenecks
- Shapes are fixed or rarely change and graphs can be compiled
- You suspect synchronization points like mx.eval or Python scalars are slowing you down
- You need to reduce peak memory and optimize memory usage
Quick Start
- Step 1: Identify hot paths and enable fast-mlx optimizations (profiling recommended; a timing sketch follows these steps).
- Step 2: If inputs have fixed shapes, compile the entire graph; otherwise, compile the dominant elementwise ops.
- Step 3: Review eval boundaries (mx.eval, item(), NumPy conversions), manage memory by releasing temporaries before mx.eval, and re-profile.
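A minimal timing sketch for the profiling in Steps 1 and 3 (bench and candidate are illustrative helpers); because MLX is lazy, mx.eval has to sit inside the timed region or you only measure graph construction:

```python
import time
import mlx.core as mx

def bench(fn, *args, iters=50):
    out = fn(*args)
    mx.eval(out)                 # warm-up (and first compile, if applicable)
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
        mx.eval(out)             # force execution so the timing is real
    return (time.perf_counter() - start) / iters

def candidate(x):
    return mx.softmax(x * 1.7, axis=-1)

x = mx.random.normal((2048, 2048))
print("eager   :", bench(candidate, x))
print("compiled:", bench(mx.compile(candidate), x))
```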
Best Practices
- Compile functions for elementwise ops when feasible
- For fixed-shape inputs, compile the entire graph to minimize Python overhead
- Replace slow implementations with MLX fast ops (see the mx.fast sketch after this list)
- Explicitly identify evaluation boundaries (mx.eval, item(), NumPy conversions) and minimize their use
- Manage memory deliberately: arrange lazy loading, release temporaries before mx.eval, and monitor peak memory
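As an example of the fast-ops point, a hand-rolled RMSNorm can usually be swapped for the fused kernel in mx.fast, assuming your MLX version provides mx.fast.rms_norm:

```python
import mlx.core as mx

def rms_norm_manual(x, weight, eps=1e-5):
    # Several separate reduction/elementwise kernels.
    scale = mx.rsqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x * scale * weight

x = mx.random.normal((32, 1024, 512))
w = mx.ones((512,))

y_slow = rms_norm_manual(x, w)
y_fast = mx.fast.rms_norm(x, w, 1e-5)   # one fused kernel instead of several
mx.eval(y_slow, y_fast)
```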
Example Use Cases
- Speed up an elementwise layer by compiling the entire fixed-shape graph
- Reduce latency in a model by removing frequent mx.eval calls and delaying evaluation
- Switch to lazy loading sequences to lower peak memory during inference
- Optimize dtype promotion to ensure consistent precision while avoiding unnecessary casting (illustrated in the sketch below)
- Profile bottlenecks and recompile targeted paths with fast-mlx substitutions
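A small sketch of the dtype-promotion case, assuming MLX's standard promotion rules (a float16 array combined with a float32 array promotes to float32); constants and shapes are illustrative:

```python
import mlx.core as mx

x = mx.random.normal((1024, 1024)).astype(mx.float16)

scale32 = mx.array(0.125)                    # defaults to float32
y_promoted = x * scale32                     # result is promoted to float32
print(y_promoted.dtype)                      # float32

scale16 = mx.array(0.125, dtype=mx.float16)  # match the tensor's dtype
y_kept = x * scale16                         # stays float16, as intended
print(y_kept.dtype)                          # float16
```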