How do I convert a PyTorch model to CoreML?

Use torch.jit.trace to produce a traced model and then coremltools.convert with proper inputs and minimum_deployment_target (ct.target.iOS18).

How do I make predictions asynchronously on iOS?

Load the MLModel asynchronously (MLModel.load) and call prediction from a background task to keep the UI responsive.

How can I compress a CoreML model without large accuracy loss?

Use post-training compression (palettization, quantization, pruning) starting from a Float16 baseline and evaluate accuracy at each step, targeting iOS 18+ features.

ax-coreml

npx machina-cli add skill Kasempiternal/axiom-v2/ax-coreml --openclaw

Files (1)

SKILL.md

12.5 KB

CoreML — On-Device Machine Learning

Quick Patterns

Basic Conversion (PyTorch to CoreML)

import coremltools as ct
import torch

model.eval()
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18
)
mlmodel.save("MyModel.mlpackage")

Load and Predict (Swift)

// Async load (preferred)
let config = MLModelConfiguration()
config.computeUnits = .all  // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
let model = try await MLModel.load(contentsOf: url, configuration: config)

// Async prediction (thread-safe)
let output = try await model.prediction(from: input)

Post-Training Compression

from coremltools.optimize.coreml import OpPalettizerConfig, OptimizationConfig, palettize_weights

config = OpPalettizerConfig(mode="kmeans", nbits=4, granularity="per_grouped_channel", group_size=16)
compressed = palettize_weights(model, OptimizationConfig(global_config=config))

Stateful Model (KV-Cache)

let state = model.makeState()
let output = try model.prediction(from: input, using: state) // state updated in-place

MLTensor (iOS 18+)

let scores = MLTensor(shape: [1, vocab_size], scalars: logits)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
let sampled = probs.multinomial(numSamples: 1)
let result = await sampled.shapedArray(of: Int32.self) // materialize

Decision Tree

Need on-device ML?
|
+-- Text generation (simple prompts, structured output)?
|   -> Foundation Models (ax-foundation-models), not CoreML
|
+-- Custom trained model / fine-tuned LLM?
|   -> CoreML
|   |
|   +-- PyTorch model to convert?
|   |   -> Pattern: Basic Conversion
|   +-- Model too large for device?
|   |   -> Pattern: Compression (palettization > quantization > pruning)
|   +-- Transformer with KV-cache?
|   |   -> Pattern: Stateful Models
|   +-- Multiple LoRA adapters?
|   |   -> Pattern: Multi-Function Models
|   +-- Pipeline stitching between models?
|   |   -> Pattern: MLTensor
|   +-- Concurrent predictions needed?
|       -> Pattern: Async Prediction
|
+-- Issue / not working?
    +-- Model won't load?          -> Diagnostics: Load Failures
    +-- Slow inference?            -> Diagnostics: Performance
    +-- High memory?               -> Diagnostics: Memory
    +-- Accuracy lost after compress? -> Diagnostics: Compression
    +-- Conversion fails?          -> Diagnostics: Conversion

Anti-Patterns

Loading models on main thread at launch

MLModel(contentsOf:) blocks. Use async loading in background task.

Reloading model for each prediction

Model loading is expensive. Load once, keep reference, reuse.

Compressing without profiling

Don't jump to 2-bit. Start with Float16 baseline, then 8-bit, 6-bit, 4-bit with grouped channels, testing accuracy at each step.

Missing deployment target

Always set minimum_deployment_target=ct.target.iOS18 to enable SDPA fusion, per-block quantization, MLTensor, and state support.

Unlimited concurrent predictions

Each prediction allocates input/output buffers. Limit concurrency to 2-3 to avoid memory pressure.

Deep Patterns

Model Compression

Three techniques with different tradeoffs:

Palettization (best for Neural Engine): Clusters weights into lookup tables.

from coremltools.optimize.coreml import OpPalettizerConfig, OptimizationConfig, palettize_weights

# 4-bit with grouped channels (iOS 18+)
config = OpPalettizerConfig(mode="kmeans", nbits=4, granularity="per_grouped_channel", group_size=16)
compressed = palettize_weights(model, OptimizationConfig(global_config=config))

Bits	Compression	Accuracy Impact
8-bit	2x	Minimal
6-bit	2.7x	Low
4-bit	4x	Moderate (use grouped channels)
2-bit	8x	High (requires training-time)

Quantization (best for GPU on Mac): Linear mapping to INT8/INT4.

from coremltools.optimize.coreml import OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights

config = OpLinearQuantizerConfig(mode="linear", dtype="int4", granularity="per_block", block_size=32)
compressed = linear_quantize_weights(model, OptimizationConfig(global_config=config))

Pruning: Sets weights to zero for sparse representation.

from coremltools.optimize.coreml import OpMagnitudePrunerConfig, OptimizationConfig, prune_weights

config = OpMagnitudePrunerConfig(target_sparsity=0.4)
sparse = prune_weights(model, OptimizationConfig(global_config=config))

Training-Time Compression

When post-training loses too much accuracy, fine-tune with compression:

from coremltools.optimize.torch.palettization import DKMPalettizerConfig, DKMPalettizer

config = DKMPalettizerConfig(global_config={"n_bits": 4})
palettizer = DKMPalettizer(model, config)
prepared = palettizer.prepare()

for epoch in range(epochs):
    train_epoch(prepared, data_loader)
    palettizer.step()

final = palettizer.finalize()

Calibration-Based Compression (iOS 18+)

Middle ground between post-training and full training:

from coremltools.optimize.torch.pruning import MagnitudePrunerConfig, LayerwiseCompressor

config = MagnitudePrunerConfig(target_sparsity=0.4, n_samples=128)
compressor = LayerwiseCompressor(model, config)
compressed = compressor.compress(calibration_loader)

Stateful Models (KV-Cache for LLMs)

PyTorch model registers state buffers, converted with ct.StateType:

mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
        ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
    ],
    states=[
        ct.StateType(name="keyCache", wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))),
        ct.StateType(name="valueCache", wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128)))
    ],
    minimum_deployment_target=ct.target.iOS18
)

Runtime: let state = model.makeState() then model.prediction(from: input, using: state). State updated in-place. 1.6x speedup on Mistral-7B (M3 Max) vs manual KV-cache I/O.

Multi-Function Models (Adapters/LoRA)

Deploy multiple adapters sharing base weights in one model:

from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction

desc = MultiFunctionDescriptor()
desc.add_function("sticker", "sticker.mlpackage")
desc.add_function("storybook", "storybook.mlpackage")
save_multifunction(desc, "MultiAdapter.mlpackage")

Load specific function:

let config = MLModelConfiguration()
config.functionName = "sticker"
let model = try MLModel(contentsOf: url, configuration: config)

MLTensor Operations (iOS 18+)

// Create
let tensor = MLTensor([[1.0, 2.0], [3.0, 4.0]])
let zeros = MLTensor(zeros: [3, 3], scalarType: Float.self)

// Math: +, *, mean(), sum(), max(), softmax()
// Comparison: .> for boolean masks
// Indexing: tensor[0], tensor[.all, 0], tensor[0..<2, 1..<3]
// Reshaping: reshaped(to:), expandingShape(at:)
// Materialize: await tensor.shapedArray(of: Float.self)

Operations are async. Call shapedArray() to materialize results (blocks until complete).

Async Prediction

Thread-safe concurrent predictions:

try await withThrowingTaskGroup(of: Output.self) { group in
    for image in images {
        group.addTask {
            try Task.checkCancellation()
            return try await model.prediction(from: self.prepareInput(image))
        }
    }
    return try await group.reduce(into: []) { $0.append($1) }
}

Limit concurrency to avoid memory pressure from multiple input/output buffers.

Caching Behavior

First load triggers device specialization (slow). Subsequent loads use cache. Cache keyed by (model path + configuration + device). Invalidated by: system updates, low disk space, model modification.

Prewarm at launch:

Task.detached(priority: .background) { _ = try? await MLModel.load(contentsOf: modelURL) }

Compute Availability

let devices = MLModel.availableComputeDevices  // CPU, GPU, Neural Engine

ComputeUnits	Behavior
`.all`	Best performance (default)
`.cpuOnly`	CPU only
`.cpuAndGPU`	Exclude Neural Engine
`.cpuAndNeuralEngine`	Exclude GPU

Conversion Shape Types

ct.TensorType(shape=(1, 3, 224, 224))                    # Fixed
ct.TensorType(shape=(1, ct.RangeDim(1, 2048)))           # Range
ct.TensorType(shape=ct.EnumeratedShapes(shapes=[...]))   # Enumerated

Deployment Target Features

Target	Key Features
iOS 16	Weight compression (palettization, quantization, pruning)
iOS 17	Async prediction, MLComputeDevice, activation quantization
iOS 18	MLTensor, State, SDPA fusion, per-block quantization, multi-function

Diagnostics

Load Failures

Symptom	Cause	Fix
"Unsupported model version"	Spec version > device iOS	Re-convert with lower `minimum_deployment_target`
"Failed to create compute plan"	Unsupported ops for compute unit	Use `.cpuOnly` or convert with `FLOAT16` precision
General load error	File missing, not compiled, corrupt	Check `.mlmodelc` exists, disk space, re-convert

Spec version mapping: 4=iOS13, 5=iOS14, 6=iOS15, 7=iOS16, 8=iOS17, 9=iOS18.

Performance

Symptom	Cause	Fix
First load slow, subsequent fast	Cache miss	Prewarm in background at launch
All predictions slow	Wrong compute units	Profile with Instruments, check `computeUnits` config
Slow on specific device	Hardware mismatch	Palettization for NE, quantization for GPU, profile on target
Dynamic shapes recompiling	Variable input sizes	Use fixed or enumerated shapes

Profile compute unit usage:

let plan = try await MLComputePlan.load(contentsOf: modelURL)
for op in plan.modelStructure.operations {
    let info = plan.computeDeviceInfo(for: op)
    print("\(op.name): \(info.preferredDevice)")
}

Memory

Symptom	Cause	Fix
Memory grows during predictions	Concurrent prediction buffers	Limit concurrent predictions (2-3 max)
Out of memory on load	Model too large	Compress (8-bit = 2x, 4-bit = 4x smaller)

Compression Accuracy Loss

Technique	Fix Progression
Palettization	per_grouped_channel (iOS 18+) -> more bits -> calibration -> training-time
Quantization	per_block (iOS 18+) -> calibration data -> higher dtype
Pruning	Lower sparsity -> calibration-based -> training-time (>50% needs training)

Key insight: 4-bit per-tensor = only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per group = much better.

Conversion Issues

Symptom	Fix
Unsupported operation	Upgrade coremltools, decompose into supported ops, or register custom op
Correct conversion but wrong output	Check input normalization, shape ordering (NCHW vs NHWC), precision (Float16 vs 32), mode

Debug output differences:

torch_out = model(input).detach().numpy()
coreml_out = mlmodel.predict({"input": input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_out - coreml_out))}")

Profiling Tools

Xcode Performance Reports: Open model in Xcode > Performance tab > Create report
Core ML Instrument: Load events ("cached" vs "prepare and cache"), prediction intervals, compute unit usage
MLComputePlan (iOS 18+): Programmatic per-operation compute device and cost inspection

ax-foundation-models -- Apple's on-device LLM (for text generation, not custom models)
ax-metal -- GPU programming and Metal migration
WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161

Checklist Before Deploying

Set minimum_deployment_target to latest supported iOS
Profile Float16 baseline performance on target devices
Compress incrementally with accuracy testing at each step
Use async prediction for concurrent workloads
Limit concurrent predictions to manage memory
Use state for transformer KV-cache
Use multi-function for adapter variants
Test on actual devices (not just simulator -- simulator uses host Mac hardware)

Source

git clone https://github.com/Kasempiternal/axiom-v2/blob/main/axiom-plugin/skills/ax-coreml/SKILL.mdView on GitHub

Overview

ax-coreml enables on-device machine learning with CoreML. It covers PyTorch-to-CoreML model conversion, post-training compression (palettization, quantization, pruning), stateful KV-cache for LLMs, multi-function models via MLTensor, and async prediction with diagnostics.

How This Skill Works

Convert PyTorch models to CoreML using torch.jit.trace and coremltools, then apply post-training compression (palettization, quantization, pruning) as needed. At inference time, load the CoreML model asynchronously in Swift, use a state object for KV-cache, and leverage MLTensor for multi-function outputs, with async predictions to keep the UI responsive.

When to Use It

Need offline inference for a mobile app with strict latency
You have a PyTorch model that must run on iOS devices via CoreML
Model size is too large for the device and you want to compress it
You require a KV-cache/stateful decoder for efficient LLM inference
You want to stitch multiple models into a single on-device pipeline using MLTensor

Quick Start

Step 1: Convert a PyTorch model to CoreML (trace with torch.jit.trace and ct.convert, target iOS18)
Step 2: In Swift, load the model asynchronously (MLModel.load) and run prediction (model.prediction(from: input))
Step 3: Optional: apply post-training compression (palettize_weights) and enable KV-cache / MLTensor for multi-function pipelines

Best Practices

Start with a Float16 baseline before quantization to avoid baseline accuracy loss across steps
Always set minimum_deployment_target to ct.target.iOS18 for SDPA fusion and MLTensor support
Profile memory and latency before and after compression (palettization, quantization, pruning)
Use async loading and async prediction to keep the UI responsive
Leverage KV-cache via model.makeState and state-aware predictions for LLM decoding

Example Use Cases

Offline language assistant: CoreML LLM with KV-cache running entirely on-device
Mobile translator: PyTorch-to-CoreML conversion with a multi-function pipeline via MLTensor
Edge analytics app: compressed models using palettization/quantization for Neural Engine
Async prediction demo: Swift app loading models asynchronously and predicting in background
Diagnostics suite: on-device ML with CoreML for privacy-preserving inference and performance checks

Frequently Asked Questions

Add this skill to your agents