
ax-coreml

npx machina-cli add skill Kasempiternal/axiom-v2/ax-coreml --openclaw
Files (1)
SKILL.md
12.5 KB

CoreML — On-Device Machine Learning

Quick Patterns

Basic Conversion (PyTorch to CoreML)

import coremltools as ct
import torch

model.eval()  # switch to inference mode before tracing
traced = torch.jit.trace(model, example_input)  # trace with a representative input
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18
)
mlmodel.save("MyModel.mlpackage")

Load and Predict (Swift)

// Async load (preferred)
let config = MLModelConfiguration()
config.computeUnits = .all  // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
let model = try await MLModel.load(contentsOf: url, configuration: config)

// Async prediction (thread-safe)
let output = try await model.prediction(from: input)

Post-Training Compression

from coremltools.optimize.coreml import OpPalettizerConfig, OptimizationConfig, palettize_weights

config = OpPalettizerConfig(mode="kmeans", nbits=4, granularity="per_grouped_channel", group_size=16)
compressed = palettize_weights(model, OptimizationConfig(global_config=config))

Stateful Model (KV-Cache)

let state = model.makeState()
let output = try model.prediction(from: input, using: state) // state updated in-place

MLTensor (iOS 18+)

let scores = MLTensor(shape: [1, vocabSize], scalars: logits)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
let sampled = probs.multinomial(numSamples: 1)
let result = await sampled.shapedArray(of: Int32.self) // materialize
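For intuition, the same temperature / top-k sampling loop can be sketched in plain NumPy (an illustrative Python equivalent, not a CoreML API; sample_top_k is a made-up helper name):

```python
import numpy as np

def sample_top_k(logits, k=10, temperature=0.8, rng=None):
    """Temperature + top-k sampling over a [vocab_size] logit vector."""
    rng = rng or np.random.default_rng()
    # Keep only the k highest-scoring token ids.
    top_idx = np.argpartition(logits, -k)[-k:]
    scaled = logits[top_idx] / temperature
    # Numerically stable softmax over the top-k scores.
    scaled = scaled - scaled.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Draw one token id according to the probabilities.
    return int(rng.choice(top_idx, p=probs))
```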

Decision Tree

Need on-device ML?
|
+-- Text generation (simple prompts, structured output)?
|   -> Foundation Models (ax-foundation-models), not CoreML
|
+-- Custom trained model / fine-tuned LLM?
|   -> CoreML
|   |
|   +-- PyTorch model to convert?
|   |   -> Pattern: Basic Conversion
|   +-- Model too large for device?
|   |   -> Pattern: Compression (palettization > quantization > pruning)
|   +-- Transformer with KV-cache?
|   |   -> Pattern: Stateful Models
|   +-- Multiple LoRA adapters?
|   |   -> Pattern: Multi-Function Models
|   +-- Pipeline stitching between models?
|   |   -> Pattern: MLTensor
|   +-- Concurrent predictions needed?
|       -> Pattern: Async Prediction
|
+-- Issue / not working?
    +-- Model won't load?          -> Diagnostics: Load Failures
    +-- Slow inference?            -> Diagnostics: Performance
    +-- High memory?               -> Diagnostics: Memory
    +-- Accuracy lost after compress? -> Diagnostics: Compression
    +-- Conversion fails?          -> Diagnostics: Conversion

Anti-Patterns

Loading models on main thread at launch

MLModel(contentsOf:) blocks the calling thread. Use async MLModel.load in a background task instead.

Reloading model for each prediction

Model loading is expensive. Load once, keep reference, reuse.

Compressing without profiling

Don't jump straight to 2-bit. Start with a Float16 baseline, then try 8-bit, 6-bit, and 4-bit with grouped channels, testing accuracy at each step.
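The progression can be sketched as a loop that keeps the most aggressive setting still within an accuracy budget (compress and evaluate here are hypothetical stand-ins for your coremltools compression call and your own accuracy harness):

```python
def pick_compression(compress, evaluate, baseline_acc, max_drop=0.01):
    """Return the most aggressive config whose accuracy drop stays in budget."""
    configs = [
        {"nbits": 8},                                                         # ~2x smaller
        {"nbits": 6},                                                         # ~2.7x
        {"nbits": 4, "granularity": "per_grouped_channel", "group_size": 16}, # ~4x
    ]
    best = None
    for cfg in configs:
        acc = evaluate(compress(cfg))
        if baseline_acc - acc > max_drop:
            break  # this step lost too much accuracy; keep the previous config
        best = cfg
    return best
```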

Missing deployment target

Always set minimum_deployment_target=ct.target.iOS18 to enable SDPA fusion, per-block quantization, MLTensor, and state support.

Unlimited concurrent predictions

Each prediction allocates input/output buffers. Limit concurrency to 2-3 to avoid memory pressure.
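One way to enforce the bound, sketched in Python with an asyncio semaphore (predict is a stand-in for a real model call; the Swift side would cap the number of in-flight tasks in a task group the same way):

```python
import asyncio

MAX_IN_FLIGHT = 3  # cap concurrent predictions to bound buffer memory

async def predict(x):
    # Stand-in for a real model call; each in-flight call holds I/O buffers.
    await asyncio.sleep(0)
    return x * 2

async def predict_all(inputs):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(x):
        async with sem:  # at most MAX_IN_FLIGHT predictions hold buffers at once
            return await predict(x)

    return await asyncio.gather(*(bounded(x) for x in inputs))

results = asyncio.run(predict_all(range(5)))
```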


Deep Patterns

Model Compression

Three techniques with different tradeoffs:

Palettization (best for Neural Engine): Clusters weights into lookup tables.

from coremltools.optimize.coreml import OpPalettizerConfig, OptimizationConfig, palettize_weights

# 4-bit with grouped channels (iOS 18+)
config = OpPalettizerConfig(mode="kmeans", nbits=4, granularity="per_grouped_channel", group_size=16)
compressed = palettize_weights(model, OptimizationConfig(global_config=config))

Bits    Compression    Accuracy Impact
8-bit   2x             Minimal
6-bit   2.7x           Low
4-bit   4x             Moderate (use grouped channels)
2-bit   8x             High (requires training-time)
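The ratios above follow directly from the bit widths: relative to a Float16 baseline (16 bits per weight), an n-bit palette index compresses weights by roughly 16/n, ignoring lookup-table overhead:

```python
# Compression ratio vs Float16 for each palette bit width (LUT overhead ignored).
ratios = {nbits: 16 / nbits for nbits in (8, 6, 4, 2)}
for nbits, r in ratios.items():
    print(f"{nbits}-bit: {r:.1f}x")
```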

Quantization (best for GPU on Mac): Linear mapping to INT8/INT4.

from coremltools.optimize.coreml import OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights

config = OpLinearQuantizerConfig(mode="linear", dtype="int4", granularity="per_block", block_size=32)
compressed = linear_quantize_weights(model, OptimizationConfig(global_config=config))

Pruning: Sets weights to zero for sparse representation.

from coremltools.optimize.coreml import OpMagnitudePrunerConfig, OptimizationConfig, prune_weights

config = OpMagnitudePrunerConfig(target_sparsity=0.4)
sparse = prune_weights(model, OptimizationConfig(global_config=config))

Training-Time Compression

When post-training loses too much accuracy, fine-tune with compression:

from coremltools.optimize.torch.palettization import DKMPalettizerConfig, DKMPalettizer

config = DKMPalettizerConfig(global_config={"n_bits": 4})
palettizer = DKMPalettizer(model, config)
prepared = palettizer.prepare()

for epoch in range(epochs):
    train_epoch(prepared, data_loader)
    palettizer.step()

final = palettizer.finalize()

Calibration-Based Compression (iOS 18+)

Middle ground between post-training and full training:

from coremltools.optimize.torch.pruning import MagnitudePrunerConfig, LayerwiseCompressor

config = MagnitudePrunerConfig(target_sparsity=0.4, n_samples=128)
compressor = LayerwiseCompressor(model, config)
compressed = compressor.compress(calibration_loader)

Stateful Models (KV-Cache for LLMs)

PyTorch model registers state buffers, converted with ct.StateType:

mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
        ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
    ],
    states=[
        ct.StateType(name="keyCache", wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))),
        ct.StateType(name="valueCache", wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128)))
    ],
    minimum_deployment_target=ct.target.iOS18
)

Runtime: let state = model.makeState() then model.prediction(from: input, using: state). State updated in-place. 1.6x speedup on Mistral-7B (M3 Max) vs manual KV-cache I/O.

Multi-Function Models (Adapters/LoRA)

Deploy multiple adapters sharing base weights in one model:

from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction

desc = MultiFunctionDescriptor()
desc.add_function("sticker", "sticker.mlpackage")
desc.add_function("storybook", "storybook.mlpackage")
save_multifunction(desc, "MultiAdapter.mlpackage")

Load specific function:

let config = MLModelConfiguration()
config.functionName = "sticker"
let model = try MLModel(contentsOf: url, configuration: config)

MLTensor Operations (iOS 18+)

// Create
let tensor = MLTensor([[1.0, 2.0], [3.0, 4.0]])
let zeros = MLTensor(zeros: [3, 3], scalarType: Float.self)

// Math: +, *, mean(), sum(), max(), softmax()
// Comparison: .> for boolean masks
// Indexing: tensor[0], tensor[.all, 0], tensor[0..<2, 1..<3]
// Reshaping: reshaped(to:), expandingShape(at:)
// Materialize: await tensor.shapedArray(of: Float.self)

Operations are async. Call shapedArray() to materialize results (blocks until complete).

Async Prediction

Thread-safe concurrent predictions:

try await withThrowingTaskGroup(of: Output.self) { group in
    for image in images {
        group.addTask {
            try Task.checkCancellation()
            return try await model.prediction(from: self.prepareInput(image))
        }
    }
    return try await group.reduce(into: []) { $0.append($1) }
}

Limit concurrency to avoid memory pressure from multiple input/output buffers.

Caching Behavior

First load triggers device specialization (slow). Subsequent loads use cache. Cache keyed by (model path + configuration + device). Invalidated by: system updates, low disk space, model modification.

Prewarm at launch:

Task.detached(priority: .background) { _ = try? await MLModel.load(contentsOf: modelURL) }

Compute Availability

let devices = MLModel.availableComputeDevices  // CPU, GPU, Neural Engine

Compute Units         Behavior
.all                  Best performance (default)
.cpuOnly              CPU only
.cpuAndGPU            Exclude Neural Engine
.cpuAndNeuralEngine   Exclude GPU

Conversion Shape Types

ct.TensorType(shape=(1, 3, 224, 224))                    # Fixed
ct.TensorType(shape=(1, ct.RangeDim(1, 2048)))           # Range
ct.TensorType(shape=ct.EnumeratedShapes(shapes=[...]))   # Enumerated

Deployment Target Features

Target    Key Features
iOS 16    Weight compression (palettization, quantization, pruning)
iOS 17    Async prediction, MLComputeDevice, activation quantization
iOS 18    MLTensor, State, SDPA fusion, per-block quantization, multi-function

Diagnostics

Load Failures

Symptom                           Cause                                 Fix
"Unsupported model version"       Spec version > device iOS             Re-convert with lower minimum_deployment_target
"Failed to create compute plan"   Unsupported ops for compute unit      Use .cpuOnly or convert with FLOAT16 precision
General load error                File missing, not compiled, corrupt   Check .mlmodelc exists, disk space, re-convert

Spec version mapping: 4=iOS13, 5=iOS14, 6=iOS15, 7=iOS16, 8=iOS17, 9=iOS18.

Performance

Symptom                            Cause                   Fix
First load slow, subsequent fast   Cache miss              Prewarm in background at launch
All predictions slow               Wrong compute units     Profile with Instruments, check computeUnits config
Slow on specific device            Hardware mismatch       Palettization for NE, quantization for GPU, profile on target
Dynamic shapes recompiling         Variable input sizes    Use fixed or enumerated shapes

Profile compute unit usage:

let plan = try await MLComputePlan.load(contentsOf: modelURL)
for op in plan.modelStructure.operations {
    let info = plan.computeDeviceInfo(for: op)
    print("\(op.name): \(info.preferredDevice)")
}

Memory

Symptom                           Cause                           Fix
Memory grows during predictions   Concurrent prediction buffers   Limit concurrent predictions (2-3 max)
Out of memory on load             Model too large                 Compress (8-bit = 2x, 4-bit = 4x smaller)
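A quick back-of-envelope size check before deciding how far to compress (est_size_mb is an illustrative helper; it counts weight storage only and ignores LUT/metadata overhead):

```python
def est_size_mb(num_params, nbits):
    """Approximate weight storage in MB (1e6 bytes) at the given bit width."""
    return num_params * nbits / 8 / 1e6

params = 7_000_000_000  # e.g. a 7B-parameter LLM (illustrative)
print(est_size_mb(params, 16))  # Float16 baseline
print(est_size_mb(params, 4))   # 4-bit palettized
```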

Compression Accuracy Loss

Technique       Fix Progression
Palettization   per_grouped_channel (iOS 18+) -> more bits -> calibration -> training-time
Quantization    per_block (iOS 18+) -> calibration data -> higher dtype
Pruning         Lower sparsity -> calibration-based -> training-time (>50% needs training)

Key insight: 4-bit per-tensor = only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per group = much better.
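In numbers, for an illustrative 4096-channel layer:

```python
# 4-bit palettization gives 2**4 = 16 clusters. Per-tensor, those 16 values
# must cover the entire weight matrix; per_grouped_channel with group_size=16
# gives every group of 16 channels its own 16-entry lookup table.
nbits, group_size, out_channels = 4, 16, 4096  # layer width is illustrative

clusters_per_table = 2 ** nbits             # 16 clusters per lookup table
num_groups = out_channels // group_size     # 256 independent tables
distinct_values = num_groups * clusters_per_table  # 4096 distinct weight values
```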

Conversion Issues

Symptom                                Fix
Unsupported operation                  Upgrade coremltools, decompose into supported ops, or register custom op
Conversion succeeds but output wrong   Check input normalization, shape ordering (NCHW vs NHWC), precision (Float16 vs Float32), and that the model was in eval mode

Debug output differences:

import numpy as np

torch_out = model(example_input).detach().numpy()
coreml_out = mlmodel.predict({"input": example_input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_out - coreml_out))}")

Profiling Tools

  • Xcode Performance Reports: Open model in Xcode > Performance tab > Create report
  • Core ML Instrument: Load events ("cached" vs "prepare and cache"), prediction intervals, compute unit usage
  • MLComputePlan (iOS 18+): Programmatic per-operation compute device and cost inspection

Related

  • ax-foundation-models -- Apple's on-device LLM (for text generation, not custom models)
  • ax-metal -- GPU programming and Metal migration
  • WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161

Checklist Before Deploying

  • Set minimum_deployment_target to latest supported iOS
  • Profile Float16 baseline performance on target devices
  • Compress incrementally with accuracy testing at each step
  • Use async prediction for concurrent workloads
  • Limit concurrent predictions to manage memory
  • Use state for transformer KV-cache
  • Use multi-function for adapter variants
  • Test on actual devices (not just simulator -- simulator uses host Mac hardware)

Source

git clone https://github.com/Kasempiternal/axiom-v2
Skill file: axiom-plugin/skills/ax-coreml/SKILL.md

Overview

ax-coreml enables on-device machine learning with CoreML. It covers PyTorch-to-CoreML model conversion, post-training compression (palettization, quantization, pruning), stateful KV-cache for LLMs, multi-function models for adapters, MLTensor pipeline stitching, and async prediction, plus diagnostics.

How This Skill Works

Convert PyTorch models to CoreML using torch.jit.trace and coremltools, then apply post-training compression (palettization, quantization, pruning) as needed. At inference time, load the CoreML model asynchronously in Swift, use a state object for the KV-cache, stitch models together with MLTensor, and run async predictions to keep the UI responsive.

When to Use It

  • Need offline inference for a mobile app with strict latency
  • You have a PyTorch model that must run on iOS devices via CoreML
  • Model size is too large for the device and you want to compress it
  • You require a KV-cache/stateful decoder for efficient LLM inference
  • You want to stitch multiple models into a single on-device pipeline using MLTensor

Quick Start

  1. Step 1: Convert a PyTorch model to CoreML (trace with torch.jit.trace and ct.convert, target iOS18)
  2. Step 2: In Swift, load the model asynchronously (MLModel.load) and run prediction (model.prediction(from: input))
  3. Step 3: Optional: apply post-training compression (palettize_weights), and add KV-cache state or MLTensor pipeline stitching as needed

Best Practices

  • Establish a Float16 baseline before compressing so accuracy loss can be measured at each step
  • Always set minimum_deployment_target to ct.target.iOS18 for SDPA fusion and MLTensor support
  • Profile memory and latency before and after compression (palettization, quantization, pruning)
  • Use async loading and async prediction to keep the UI responsive
  • Leverage KV-cache via model.makeState and state-aware predictions for LLM decoding

Example Use Cases

  • Offline language assistant: CoreML LLM with KV-cache running entirely on-device
  • Mobile translator: PyTorch-to-CoreML conversion with multiple models stitched into a pipeline via MLTensor
  • Edge analytics app: compressed models using palettization/quantization for Neural Engine
  • Async prediction demo: Swift app loading models asynchronously and predicting in background
  • Diagnostics suite: on-device ML with CoreML for privacy-preserving inference and performance checks

