fast-mlx
Install: npx machina-cli add skill unsanitary-bek/mlx-skills/fast-mlx --openclaw
Fast MLX
Workflow
- Look for opportunities to compile functions made up of mostly elementwise operations (see the sketch after this list).
- For models with fixed-shape inputs, or where shapes rarely change, compile the entire graph.
- Replace slow implementations with MLX fast ops.
- Identify evaluation boundaries and unintended sync points (mx.eval, item(), NumPy conversions).
- Check dtype promotion and scalar usage; keep precision consistent with intent.
- Review compilation strategy; avoid unnecessary recompiles and closure captures.
- Reduce peak memory via lazy loading order and by releasing temporaries before mx.eval.
- Suggest profiling steps if the bottleneck is unclear.
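A minimal sketch of the first two workflow items, assuming a small elementwise-heavy function (fused_activation and the shapes are illustrative, not part of the skill):

```python
import mlx.core as mx

def fused_activation(x, scale, bias):
    # Mostly elementwise work: scale, shift, sigmoid gate, residual add.
    y = x * scale + bias
    return x + y * mx.sigmoid(y)

# Compile once; later calls with matching shapes and dtypes reuse the program.
compiled = mx.compile(fused_activation)

x = mx.random.normal((1024, 1024))
out = compiled(x, mx.array(0.5), mx.array(0.1))
mx.eval(out)  # one deliberate evaluation point after the compiled call
```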
References
- Read references/fast-mlx-guide.md for detailed tips and examples. Use it as the source of truth.
Output expectations
- Provide concrete code changes with a brief rationale.
- Call out changes that need user confirmation (e.g., enabling async eval or shapeless compile), as sketched below.
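For reference, a hedged sketch of those two opt-in features; the norm function and shapes are placeholders, and both options change evaluation behavior, which is why they warrant confirmation:

```python
import mlx.core as mx

def norm(x):
    # Shape-agnostic logic: safe for shapeless compilation.
    return x / (mx.sqrt((x * x).sum(axis=-1, keepdims=True)) + 1e-6)

# shapeless=True traces once and reuses the program across input shapes;
# only valid when the function does not branch on concrete shape values.
norm_shapeless = mx.compile(norm, shapeless=True)

a = norm_shapeless(mx.random.normal((8, 128)))
b = norm_shapeless(mx.random.normal((16, 256)))  # no recompile for the new shape

# async_eval launches the work without blocking the Python thread.
mx.async_eval(a)
mx.eval(b)  # hard sync when the results are actually needed
```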
Source
git clone https://github.com/unsanitary-bek/mlx-skills (skill file: mlx_skills/skills/fast-mlx/SKILL.md)
Overview
fast-mlx helps you squeeze performance from MLX by compiling key operation paths and tuning memory usage. It targets elementwise kernels and fixed-shape graphs, and trims costly eval boundaries to minimize latency and peak memory. The workflow emphasizes avoiding unnecessary recompiles and handling dtypes carefully.
How This Skill Works
The skill identifies opportunities to compile elementwise operations and, when shapes are stable, compiles the entire graph. It replaces slow implementations with MLX fast ops, tracks evaluation boundaries such as mx.eval and NumPy conversions, and manages memory through lazy loading and timely release of temporaries, while keeping dtype promotion aligned with the intended precision.
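A sketch of that pattern under the assumption of a fixed-shape step function (step, the shapes, and the loop count are illustrative):

```python
import mlx.core as mx

@mx.compile
def step(x, w):
    # Fixed shapes: (64, 256) @ (256, 256) -> (64, 256), so one compile suffices.
    h = mx.maximum(x @ w, 0.0)
    return h, (h * h).mean()

x = mx.random.normal((64, 256))
w = mx.random.normal((256, 256))

losses = []
for _ in range(100):
    x, loss = step(x, w)
    losses.append(loss)        # stays a lazy mx.array; no .item() in the loop

mx.eval(x, losses)             # one evaluation boundary at the end
print(losses[-1].item())       # convert to a Python scalar only after eval
```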
When to Use It
- You want to speed up MLX models or algorithms
- You're hitting latency or throughput bottlenecks
- Shapes are fixed or rarely change and graphs can be compiled
- You suspect synchronization points like mx.eval or Python scalars are slowing you down
- You need to reduce peak memory and optimize memory usage
Quick Start
- Step 1: Identify hot paths and enable fast-mlx optimizations (profiling recommended; a timing sketch follows these steps).
- Step 2: If inputs have fixed shapes, compile the entire graph; otherwise, compile the dominant elementwise ops.
- Step 3: Review eval boundaries (mx.eval, item(), NumPy conversions), manage memory by releasing temporaries before mx.eval, and re-profile.
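A minimal timing sketch for the profiling in Steps 1 and 3 (bench and candidate are illustrative helpers); because MLX is lazy, mx.eval has to sit inside the timed region or you only measure graph construction:

```python
import time
import mlx.core as mx

def bench(fn, *args, iters=50):
    out = fn(*args)
    mx.eval(out)                 # warm-up (and first compile, if applicable)
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
        mx.eval(out)             # force execution so the timing is real
    return (time.perf_counter() - start) / iters

def candidate(x):
    return mx.softmax(x * 1.7, axis=-1)

x = mx.random.normal((2048, 2048))
print("eager   :", bench(candidate, x))
print("compiled:", bench(mx.compile(candidate), x))
```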
Best Practices
- Compile functions for elementwise ops when feasible
- For fixed-shape inputs, compile the entire graph to minimize Python overhead
- Replace slow implementations with MLX fast ops (see the mx.fast sketch after this list)
- Explicitly identify evaluation boundaries (mx.eval, item(), NumPy conversions) and minimize their use
- Manage memory deliberately: arrange lazy loading, release temporaries before mx.eval, and monitor peak memory
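As an example of the fast-ops point, a hand-rolled RMSNorm can usually be swapped for the fused kernel in mx.fast, assuming your MLX version provides mx.fast.rms_norm:

```python
import mlx.core as mx

def rms_norm_manual(x, weight, eps=1e-5):
    # Several separate reduction/elementwise kernels.
    scale = mx.rsqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x * scale * weight

x = mx.random.normal((32, 1024, 512))
w = mx.ones((512,))

y_slow = rms_norm_manual(x, w)
y_fast = mx.fast.rms_norm(x, w, 1e-5)   # one fused kernel instead of several
mx.eval(y_slow, y_fast)
```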
Example Use Cases
- Speed up an elementwise layer by compiling the entire fixed-shape graph
- Reduce latency in a model by removing frequent mx.eval calls and delaying evaluation
- Switch to lazy loading sequences to lower peak memory during inference
- Optimize dtype promotion to ensure consistent precision while avoiding unnecessary casting (illustrated in the sketch below)
- Profile bottlenecks and recompile targeted paths with fast-mlx substitutions
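A small sketch of the dtype-promotion case, assuming MLX's standard promotion rules (a float16 array combined with a float32 array promotes to float32); constants and shapes are illustrative:

```python
import mlx.core as mx

x = mx.random.normal((1024, 1024)).astype(mx.float16)

scale32 = mx.array(0.125)                    # defaults to float32
y_promoted = x * scale32                     # result is promoted to float32
print(y_promoted.dtype)                      # float32

scale16 = mx.array(0.125, dtype=mx.float16)  # match the tensor's dtype
y_kept = x * scale16                         # stays float16, as intended
print(y_kept.dtype)                          # float16
```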