
mlx

npx machina-cli add skill itsmostafa/llm-engineering-skills/mlx --openclaw

Using MLX for LLMs on Apple Silicon

MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework for optimized performance with unified memory architecture.

Table of Contents

Core Concepts

Why MLX

| Aspect        | PyTorch on Mac          | MLX                       |
|---------------|-------------------------|---------------------------|
| Memory        | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization  | Generic Metal backend   | Apple Silicon native      |
| Model loading | Slower, more memory     | Lazy loading, efficient   |
| Quantization  | Limited support         | Built-in 4/8-bit          |

MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.

Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community organization on Hugging Face for pre-converted models.

Installation

pip install mlx-lm

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).

Text Generation

Python API

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model (from the Hugging Face Hub or a local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text. Recent mlx-lm versions take sampling parameters via a
# sampler object; older versions accepted temp= directly.
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
print(response)

Streaming Generation

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()

Batch Generation

from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")

CLI Generation

# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95

Interactive Chat

CLI Chat

# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit

Python Chat

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)

Model Conversion

Convert Hugging Face models to MLX format:

CLI Conversion

# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
    -q  # Quantize to 4-bit

# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx

Python Conversion

from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)

Conversion Options

| Option         | Default | Description                         |
|----------------|---------|-------------------------------------|
| --q-bits       | 4       | Quantization bits (4 or 8)          |
| --q-group-size | 64      | Group size for quantization         |
| --dtype        | float16 | Data type for non-quantized weights |
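These options determine the on-disk and in-memory footprint of the weights. As a rough back-of-the-envelope sketch (assuming a group-wise affine scheme that stores an fp16 scale and fp16 bias per group, and ignoring layers kept in higher precision), you can estimate the quantized weight size:

```python
def quantized_size_gb(n_params: float, bits: int = 4, group_size: int = 64) -> float:
    """Approximate weight storage for group-wise affine quantization.

    Each group of `group_size` weights stores `bits` per weight plus an
    assumed fp16 scale and fp16 bias of per-group overhead.
    """
    bits_per_weight = bits + (16 + 16) / group_size
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at 4-bit, group size 64: ~4.5 effective bits/weight
print(round(quantized_size_gb(7e9), 2))  # → 3.94
```

This is why a 4-bit 7B model fits comfortably on an 8 GB Mac while the fp16 original (~14 GB of weights) does not; larger group sizes shave the per-group overhead at some cost in accuracy.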

Quantization

MLX supports multiple quantization methods for different use cases:

| Method  | Best For            | Command              |
|---------|---------------------|----------------------|
| Basic   | Quick conversion    | mlx_lm.convert -q    |
| DWQ     | Quality-preserving  | mlx_lm.dwq           |
| AWQ     | Activation-aware    | mlx_lm.awq           |
| Dynamic | Per-layer precision | mlx_lm.dynamic_quant |
| GPTQ    | Established method  | mlx_lm.gptq          |

Quick Quantization

# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8

For detailed coverage of each method, see reference/quantization.md.

Fine-tuning with LoRA

MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.

Quick Start

# Prepare training data (JSONL format)
# {"text": "Your training text here"}
# or
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
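The data directory is easy to produce in Python. A minimal sketch, assuming the train.jsonl/valid.jsonl filenames that mlx_lm.lora looks for under --data (the example contents are placeholders):

```python
import json
from pathlib import Path

# Chat-format examples; a {"text": "..."} field per line also works,
# but keep a single format within each file
examples = [
    {"messages": [
        {"role": "user", "content": "What framework targets Apple Silicon?"},
        {"role": "assistant", "content": "MLX."},
    ]},
    {"messages": [
        {"role": "user", "content": "Name a pre-converted 4-bit model repo."},
        {"role": "assistant", "content": "mlx-community/Llama-3.2-3B-Instruct-4bit"},
    ]},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
for split in ("train", "valid"):
    with (data_dir / f"{split}.jsonl").open("w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
```

In practice the validation split should hold examples not seen in training; it is duplicated here only to keep the sketch short.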

Fuse Adapter into Model

# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf

For detailed LoRA configuration and training patterns, see reference/fine-tuning.md.

Serving Models

OpenAI-Compatible Server

# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Use with OpenAI client
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
    }'

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)

Best Practices

  1. Use pre-quantized models: Download from mlx-community on Hugging Face for immediate use

  2. Match quantization to your hardware: M1/M2 with 8GB: use 4-bit; M2/M3 Pro/Max: 8-bit for quality

  3. Leverage unified memory: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower but works)

  4. Use streaming for UX: stream_generate provides responsive output for interactive applications

  5. Cache prompt prefixes: Use mlx_lm.cache_prompt for repeated prompts with varying suffixes

  6. Batch similar requests: batch_generate is more efficient than sequential generation

  7. Start with 4-bit quantization: Good quality/size tradeoff; upgrade to 8-bit if quality issues

  8. Fuse adapters for deployment: After fine-tuning, fuse adapters for faster inference without loading separately

  9. Monitor memory with Activity Monitor: Watch memory pressure to avoid swap thrashing

  10. Use chat templates: Always apply tokenizer.apply_chat_template() for instruction-tuned models

References

See reference/ for detailed documentation:

  • quantization.md - Detailed quantization methods and when to use each
  • fine-tuning.md - Complete LoRA/QLoRA training guide with data formats and configuration

Source

git clone https://github.com/itsmostafa/llm-engineering-skills

The skill file lives at skills/mlx/SKILL.md in that repository.

Overview

MLX-LM is a Python package that enables running and fine-tuning large language models on Apple Silicon using the MLX framework. It supports converting Hugging Face models to MLX format, LoRA/QLoRA fine-tuning on Apple Silicon, and serving models via an HTTP API, all optimized for Macs.

How This Skill Works

MLX relies on unified memory accessible by both the CPU and GPU, enabling lazy loading and memory-efficient model handling on Apple Silicon. It offers built-in 4-bit and 8-bit quantization, and it provides Python APIs to load models from the Hugging Face Hub or local paths, run text generation (single, streaming, or batched), and convert models to MLX format.

When to Use It

  • You are running LLMs locally on an Apple Silicon Mac and want optimized performance
  • You need to convert a Hugging Face model to MLX format for MLX-specific loading
  • You plan to fine-tune models with LoRA or QLoRA on Apple Silicon
  • You want to serve a local model via HTTP API from your Mac
  • You want memory-efficient inference using 4-bit or 8-bit quantization

Quick Start

  1. Install mlx-lm with pip install mlx-lm
  2. Load a model, e.g. model, tokenizer = load('mlx-community/Llama-3.2-3B-Instruct-4bit')
  3. Generate text with generate or stream_generate using a prompt

Best Practices

  • Start with 4-bit quantization for most models to balance speed and accuracy
  • Use MLX unified memory and lazy loading to minimize copies and memory usage
  • Verify model conversions to MLX and test outputs against HF baseline
  • Leverage the mlx-community pre-converted models for easier setup
  • Profile memory and performance when enabling streaming or batch generation

Example Use Cases

  • Load and generate with Llama-3.2-3B-Instruct-4bit via mlx-lm
  • Stream output using Mistral-7B-Instruct-v0.3-4bit
  • Batch generate with Qwen2.5-7B-Instruct-4bit for multiple prompts
  • Interactively chat in a CLI session with a loaded MLX model
  • Serve a local model over HTTP for a small app on macOS
