Do Melange SDK calls block the UI thread?

Yes, if you call them from it. All MLange SDK calls are synchronous and blocking, so always dispatch them to a background thread (Android: Dispatchers.IO, iOS: Task/actor).

How do I prevent concurrent inferences from crashing the native driver?

Use a single-consumer access pattern with a Mutex on Android or an actor on iOS to serialize model.run() calls.

What model formats does Melange support for mobile?

Supported upload formats are PyTorch .pt2, ONNX .onnx, and TorchScript .pt. Load models using the account_name/project_name format and your Personal Key.

zetic-mlange

npx machina-cli add skill mouchegmouradian/claude-code-skills/zetic-mlange --openclaw

ZETIC Melange (MLange) SDK Integration Skill

This skill helps users integrate the ZETIC Melange on-device AI SDK into mobile applications.

Naming: The product is called Melange but source code uses MLange (e.g., ZeticMLangeModel).

Prerequisites

Users need two things from the Melange Dashboard:

Personal Key — authentication credential
Model Name — in account_name/project_name format (e.g., "Steve/YOLOv11_comparison")

Supported upload formats: PyTorch .pt2, ONNX .onnx, TorchScript .pt

SDK Setup

Android (Gradle):

// app/build.gradle.kts
dependencies {
    implementation("com.zeticai.mlange:mlange:+")
}

android {
    packaging { jniLibs.useLegacyPackaging = true }
}

Add <uses-permission android:name="android.permission.INTERNET" /> to AndroidManifest.xml (SDK downloads model weights on first run).

iOS (CocoaPods):

pod 'ZeticMLange'

iOS (SPM): Add https://github.com/zetic-ai/ZeticMLangeiOS as a Swift Package dependency.

CRITICAL: Threading & Concurrency Rules

All MLange SDK calls are synchronous and blocking. They directly interface with native hardware (NPU/GPU/CPU) using fixed memory buffers. Every code example you generate MUST follow these rules:

Rule 1: Never Call on the Main/UI Thread

model.run(), model.waitForNextToken(), and even model construction can block for significant time. Always dispatch to a background thread.

Android (Kotlin):

// Always use Dispatchers.IO for blocking SDK calls (large thread pool designed for blocking work).
// Do NOT use Dispatchers.Default — it's sized to CPU cores and blocking calls would starve other coroutines.
withContext(Dispatchers.IO) {
    val outputs = model.run(inputs)
}

iOS (Swift):

// Use an actor or Task to move off @MainActor
func infer(inputs: [Tensor]) async throws -> [Tensor] {
    try await Task.detached {
        try self.model.run(inputs: inputs)  // Runs on cooperative thread pool
    }.value
}

Rule 2: Single-Consumer Access

Model instances use fixed native memory buffers. Concurrent run() calls will overwrite inputs or crash the native driver. Guarantee serial access via Mutex + Dispatchers.IO (Android) or actors (iOS).

Android (Kotlin):

// Mutex serializes access; Dispatchers.IO borrows a thread from the shared pool only when needed.
private val modelMutex = Mutex()

suspend fun infer(inputs: Array<Tensor>): Array<Tensor> = modelMutex.withLock {
    withContext(Dispatchers.IO) {
        model.run(inputs)  // Only one call at a time, guaranteed by mutex
    }
}

iOS (Swift):

// Use a dedicated actor to serialize access
actor ModelInferenceActor {
    private let model: ZeticMLangeModel

    init(model: ZeticMLangeModel) {
        self.model = model
    }

    func infer(inputs: [Tensor]) throws -> [Tensor] {
        try model.run(inputs: inputs)  // Actor guarantees serial access
    }
}

// Usage — safe from any async context
let outputs = try await inferenceActor.infer(inputs: inputs)

Rule 3: Lifecycle-Aware Cleanup

Models hold significant native memory and hardware handles. Always close/deinit in lifecycle callbacks.

Android: Call model.close() in ViewModel.onCleared() or Activity.onDestroy().

iOS: Call cleanup from .onDisappear or an explicit teardown() method. Do NOT put cleanup in deinit via Task { }: deinit is synchronous, and the Task is fire-and-forget with no guarantee that it runs before the process suspends.

Failing to close a model can leak memory and prevent other apps from accessing the NPU.
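
The cleanup rule can be sketched without the SDK. In the snippet below, NativeModel and ModelHolder are hypothetical stand-ins (not SDK types). The point is an idempotent close() that a lifecycle callback such as ViewModel.onCleared() can call safely, even if more than one callback fires.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for an SDK model that owns native memory.
class NativeModel : AutoCloseable {
    var closeCount = 0
        private set

    override fun close() {
        closeCount++ // A real model would release native buffers and hardware handles here.
    }
}

// Owns the model and guarantees it is released at most once.
class ModelHolder(model: NativeModel) : AutoCloseable {
    private val ref = AtomicReference<NativeModel?>(model)

    val isOpen: Boolean
        get() = ref.get() != null

    // Safe to call from onCleared() and onDestroy() even if both fire.
    override fun close() {
        ref.getAndSet(null)?.close()
    }
}
```

Swapping the reference to null before closing means a second close() finds nothing to release, so double cleanup cannot reach the native layer twice.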

Rule 4: Zero-Allocation in Hot Loops

For real-time inference (camera feed, audio stream), pre-allocate buffers once and reuse them every frame. Creating new Tensor/FloatArray/ByteBuffer objects per frame triggers GC pauses.

// GOOD: Pre-allocate once
val inputBuffers = model.getInputBuffers()
// Reuse in every frame
fun onFrame(frameData: ByteArray) {
    inputBuffers[0].copy(frameData)
    val outputs = model.run()
}

// BAD: Allocating per frame
fun onFrame(frameData: ByteArray) {
    val tensor = Tensor(frameData) // GC pressure!
    val outputs = model.run(arrayOf(tensor))
}
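
Preprocessing is a common hidden source of per-frame allocation. The following is a minimal sketch, assuming 8-bit input data normalized to the range 0..1. FramePreprocessor is a hypothetical helper, not part of the SDK; it writes into a FloatArray that is allocated once.

```kotlin
// Hypothetical helper: converts 8-bit frame bytes to normalized floats
// without allocating on the hot path.
class FramePreprocessor(frameSize: Int) {
    // Allocated once, reused for every frame.
    private val scratch = FloatArray(frameSize)

    // Returns the same FloatArray instance on every call; callers must not
    // hold on to it across frames.
    fun normalize(frame: ByteArray): FloatArray {
        require(frame.size == scratch.size) {
            "Expected ${scratch.size} bytes, got ${frame.size}"
        }
        for (i in frame.indices) {
            // Treat the byte as unsigned (0..255) and scale to 0.0..1.0.
            scratch[i] = (frame[i].toInt() and 0xFF) / 255f
        }
        return scratch
    }
}
```

Because the returned array is shared, this only works together with the single-consumer rule: the previous frame must be fully consumed before the next normalize() call.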

Rule 5: Drop-Frame Strategy for Real-Time Data

If the model is busy when a new frame arrives, skip it rather than queueing. This keeps the app responsive.

private val isRunning = AtomicBoolean(false)

fun onCameraFrame(frame: ByteArray) {
    if (!isRunning.compareAndSet(false, true)) return // Model busy — drop frame
    scope.launch(Dispatchers.IO) {
        try {
            inputBuffers[0].copy(frame)
            val outputs = model.run()
            withContext(Dispatchers.Main) { updateUI(outputs) }
        } finally {
            isRunning.set(false)
        }
    }
}
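
The same gate can be factored into a small class and exercised without a camera. FrameGate is a hypothetical helper, not an SDK type; it runs the work synchronously so the behavior is easy to test, and it counts dropped frames for diagnostics.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: admits one unit of work at a time and drops the rest.
class FrameGate {
    private val busy = AtomicBoolean(false)
    val dropped = AtomicInteger(0)

    // Runs `work` if the gate is free; otherwise counts a dropped frame.
    // Returns true when the work was executed.
    fun tryRun(work: () -> Unit): Boolean {
        if (!busy.compareAndSet(false, true)) {
            dropped.incrementAndGet()
            return false
        }
        try {
            work()
        } finally {
            busy.set(false) // Always reopen the gate, even if work() throws.
        }
        return true
    }
}
```

In an app, tryRun would be called from the background coroutine rather than from the camera callback thread, since the work itself is blocking.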

Rule 6: Reset State Between Sessions (LLM)

LLM models maintain internal KV cache state. Always call cleanUp() before starting a new conversation or prompt session to avoid stale context contamination.

// Before each new conversation
model.cleanUp()
model.run("New conversation prompt...")

General Model Inference (ZeticMLangeModel)

Android (Kotlin)

import com.zeticai.mlange.core.model.ZeticMLangeModel
import com.zeticai.mlange.core.tensor.Tensor
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// --- In a Repository or ViewModel ---

private val modelMutex = Mutex()
private var model: ZeticMLangeModel? = null

// Initialize off main thread. Mutex prevents double-init races.
suspend fun initModel(context: Context) = modelMutex.withLock {
    if (model == null) {
        withContext(Dispatchers.IO) {
            model = ZeticMLangeModel(context, "PERSONAL_KEY", "account/model_name")
        }
    }
}

// Thread-safe inference — mutex serializes, IO provides background thread.
suspend fun infer(inputs: Array<Tensor>): Array<Tensor> = modelMutex.withLock {
    val m = model ?: throw IllegalStateException("Model not initialized — call initModel() first")
    withContext(Dispatchers.IO) {
        m.run(inputs)
    }
}

// Lifecycle cleanup (e.g., in ViewModel.onCleared or Repository.close)
fun close() {
    model?.close()
    model = null
}

Why Mutex + Dispatchers.IO: The mutex guarantees only one call touches the model at a time. Dispatchers.IO borrows a thread from the shared pool only during the blocking call — no dedicated thread sits idle between inferences. This is lightweight and scales to multiple model instances (pipelines, encoder-decoder pairs). For the simplest single-model case, newSingleThreadContext("Model") is an alternative that avoids the mutex entirely via thread confinement.
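
The thread-confinement alternative mentioned above can be illustrated with a plain JDK single-thread executor. This sketch deliberately uses java.util.concurrent instead of coroutines so that it is self-contained; ConfinedModel and its run lambda are hypothetical stand-ins for a real model.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.Future

// Hypothetical wrapper: every call to the model executes on one dedicated
// thread, so calls can never overlap and no lock is needed.
class ConfinedModel(private val run: (FloatArray) -> FloatArray) : AutoCloseable {
    private val executor = Executors.newSingleThreadExecutor { r ->
        Thread(r, "Model").apply { isDaemon = true }
    }

    // Queues the inference; the returned Future completes on the model thread.
    fun infer(input: FloatArray): Future<FloatArray> =
        executor.submit(Callable { run(input) })

    // Stops accepting new work; already queued inferences still finish.
    override fun close() {
        executor.shutdown()
    }
}
```

The cost is the one the paragraph above names: the dedicated thread sits idle between inferences, which is why the Mutex approach is preferable once several models are involved.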

Constructor variants (all blocking — always wrap in background dispatcher):

(context, personalKey, name) — auto mode
(context, personalKey, name, modelMode=) — specify inference mode
(context, personalKey, name, target=) — specify hardware target (Pro tier)
(context, personalKey, name, target=, apType=) — target + processor (Pro tier)

All constructors accept optional: version: Int?, onProgress: ((Float) -> Unit)?, onStatusChanged: ((ModelLoadingStatus) -> Unit)?

iOS (Swift)

import ZeticMLange

// --- Use a dedicated actor for thread-safe serial access ---

actor MLangeModelActor {
    private var model: ZeticMLangeModel?

    func initialize(personalKey: String, name: String) throws {
        model = try ZeticMLangeModel(personalKey: personalKey, name: name)
    }

    func infer(inputs: [Tensor]) throws -> [Tensor] {
        guard let model else { throw ZeticMLangeError("ModelNotInitialized", "Call initialize() before using the model") }
        return try model.run(inputs: inputs)
    }

    func cleanup() {
        model = nil  // releases native resources
    }
}

// --- In a ViewModel (@MainActor) ---

@MainActor
class InferenceViewModel: ObservableObject {
    private let modelActor = MLangeModelActor()

    func setup() async throws {
        try await modelActor.initialize(
            personalKey: "PERSONAL_KEY", name: "account/model_name")
    }

    func runInference(inputs: [Tensor]) async throws -> [Tensor] {
        try await modelActor.infer(inputs: inputs)
    }

    // Call explicitly from .onDisappear or when done — do NOT rely on deinit.
    func teardown() async {
        await modelActor.cleanup()
    }
}

Constructors are all blocking; always call them from an actor or a detached Task.

LLM Inference (ZeticMLangeLLMModel)

Android (Kotlin)

import com.zeticai.mlange.core.model.llm.*
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

private val llmMutex = Mutex()
private var llmModel: ZeticMLangeLLMModel? = null

// Initialize off main thread — model downloads weights on first run (~700MB for LLMs).
// Always show download progress to the user.
suspend fun initLLM(context: Context, onProgress: (Float) -> Unit) = llmMutex.withLock {
    if (llmModel == null) {
        withContext(Dispatchers.IO) {
            llmModel = ZeticMLangeLLMModel(context, "PERSONAL_KEY", "account/model_name",
                target = LLMTarget.LLAMA_CPP,
                quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
                onProgress = onProgress)
        }
    }
}

// Stream tokens as a Flow — idiomatic Kotlin, supports cancellation between emissions.
// Mutex prevents concurrent generate() calls (e.g., from screen rotation re-collecting).
fun generate(prompt: String): Flow<String> = flow {
    llmMutex.withLock {
        val model = llmModel ?: throw IllegalStateException("Model not initialized")
        model.cleanUp() // Reset KV cache for new conversation
        model.run(prompt)

        while (currentCoroutineContext().isActive) {
            val result: LLMNextTokenResult = model.waitForNextToken()
            if (result.status != 0) break
            emit(result.token)
        }
    }
}.flowOn(Dispatchers.IO)

// Cleanup
fun close() {
    llmModel?.deinit()
    llmModel = null
}

Cancellation note: waitForNextToken() is a blocking JNI call, so Job.cancel() cannot interrupt it mid-block. Cancellation takes effect after the current token returns, when the while loop re-checks isActive before the next waitForNextToken() call. For an instant stop, check whether the SDK's model exposes a cancel() or stop() method.
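
The cancellation timing can be demonstrated without the SDK. In this sketch, nextToken is a hypothetical stand-in for a blocking token call that returns null when generation ends. The cancel flag is consulted only between calls, so a token that is already in flight is still delivered.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Collects tokens until the source finishes or `cancelled` is set.
// The flag is checked only between blocking calls, never during one.
fun collectTokens(cancelled: AtomicBoolean, nextToken: () -> String?): List<String> {
    val tokens = mutableListOf<String>()
    while (!cancelled.get()) {
        val token = nextToken() ?: break // Blocking call; null means finished.
        tokens += token
    }
    return tokens
}
```

If cancellation arrives while the second token is being produced, that token is still appended and the loop exits before requesting a third.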

LLM Quantization Types: GGUF_QUANT_ORG, GGUF_QUANT_F16, GGUF_QUANT_BF16, GGUF_QUANT_Q8_0, GGUF_QUANT_Q6_K, GGUF_QUANT_Q4_K_M, GGUF_QUANT_Q3_K_M, GGUF_QUANT_Q2_K

LLM KV Cache Policies: CLEAN_UP_ON_FULL (default), DO_NOT_CLEAN_UP

iOS (Swift)

import ZeticMLange

actor LLMModelActor {
    private var model: ZeticMLangeLLMModel?

    // Model downloads weights on first run (~700MB for LLMs).
    // Pass onDownload to show progress to the user.
    func initialize(personalKey: String, name: String,
                    onDownload: ((Float) -> Void)? = nil) throws {
        model = try ZeticMLangeLLMModel(personalKey: personalKey, name: name,
            target: .LLAMA_CPP, quantType: .GGUF_QUANT_Q4_K_M,
            onDownload: onDownload)
    }

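    /// Returns an AsyncStream of tokens. Call from any async context.
    /// Note: AsyncStream runs its build closure synchronously, so the loop below
    /// generates every token (buffered) before chat() returns. For live streaming,
    /// one option is to move the loop into a Task started inside the closure.
    /// waitForNextToken() blocks its thread, so cancellation takes effect
    /// after the current token returns, not mid-block.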
    func chat(prompt: String) throws -> AsyncStream<String> {
        guard let model else { throw ZeticMLangeError("ModelNotInitialized", "Call initialize() before using the model") }
        try model.cleanUp()
        let _ = try model.run(prompt)

        return AsyncStream { continuation in
            while true {
                let result = model.waitForNextToken()
                if result.isFinished || Task.isCancelled {
                    continuation.finish()
                    break
                }
                continuation.yield(result.token)
            }
        }
    }

    func cleanup() {
        model?.forceDeinit()
        model = nil
    }
}

// --- Usage in a @MainActor ViewModel ---

@MainActor
class ChatViewModel: ObservableObject {
    @Published var response = ""
    private let llmActor = LLMModelActor()
    private var generateTask: Task<Void, Error>?

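    // Stores the task so teardown() can actually cancel an in-flight generation.
    func send(prompt: String) {
        generateTask?.cancel()
        generateTask = Task {
            response = ""
            let stream = try await llmActor.chat(prompt: prompt)
            for await token in stream {
                response += token  // Task inherits @MainActor, so UI updates are safe
            }
        }
    }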

    // Call explicitly from .onDisappear — do NOT rely on deinit for cleanup.
    // deinit is synchronous; Task {} inside it is fire-and-forget with no guarantee of execution.
    func teardown() async {
        generateTask?.cancel()
        await llmActor.cleanup()
    }
}

HuggingFace Model Loading (ZeticMLangeHFModel)

Android (Kotlin)

import com.zeticai.mlange.core.model.ZeticMLangeHFModel

// Load from HuggingFace repo (public)
val model = ZeticMLangeHFModel(context, "username/repo-name")

// With access token (private repos)
val model = ZeticMLangeHFModel(context, "username/repo-name",
    userAccessToken = "hf_your_token")

val outputs = model.run(inputTensors)
model.close()

iOS (Swift)

import ZeticMLange

let model = try await ZeticMLangeHFModel("username/repo-name",
    userAccessToken: "hf_your_token")
let outputs = try model.run(inputs: inputTensors)

Note: iOS HF model loading is async — use await.

Common Patterns

Pipeline (e.g., Face Detection → Emotion Recognition)

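// Steps run sequentially inside one coroutine on Dispatchers.IO.
// If detectEmotion() can be called concurrently, also guard it with a Mutex (Rule 2).
suspend fun detectEmotion(bitmap: Bitmap): EmotionResult = withContext(Dispatchers.IO) {
    // Step 1: Detect faces (preprocess() is an app-specific helper, like the others below)
    val detectionOutputs = detectionModel.run(preprocess(bitmap))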

    // Step 2: Extract face region
    val faceRegion = extractFaceRegion(bitmap, detectionOutputs)

    // Step 3: Classify emotion
    val emotionOutputs = emotionModel.run(faceRegion)
    parseEmotionResult(emotionOutputs)
}

Encoder-Decoder (e.g., Whisper)

// Constructors and run() are blocking: execute this inside withContext(Dispatchers.IO).
val encoder = ZeticMLangeModel(context, personalKey, "OpenAI/whisper-tiny-encoder")
val decoder = ZeticMLangeModel(context, personalKey, "OpenAI/whisper-tiny-decoder")

val encoderOutputs = encoder.run(melSpectrogramInputs)
// Feed encoder outputs into decoder
val decoderOutputs = decoder.run(arrayOf(decoderInputIds, encoderOutputs[0], attentionMask))

Hardware Targets (General Models)

Target	Description
TFLITE_FP32/FP16/QUANT	TensorFlow Lite backends
ORT / ORT_NNAPI	ONNX Runtime (with optional NNAPI)
QNN / QNN_FP16 / QNN_QUANT	Qualcomm Neural Network
COREML / COREML_FP32 / COREML_QUANT	Apple CoreML
NEUROPILOT / NEUROPILOT_QUANT	MediaTek NeuroPilot
EXYNOS / EXYNOS_QUANT	Samsung Exynos
KIRIN / KIRIN_QUANT	Huawei Kirin
LITERT_FP32/FP16/QUANT	Google LiteRT

ModelMode.RUN_AUTO automatically selects the best target for the device.

Platform Notes

Progress callback naming: Android uses onProgress: ((Float) -> Unit)?, iOS uses onDownload: ((Float) -> Void)?. Same behavior, different parameter names.
Error handling: model.run() throws ZeticMLangeException (Android) or is marked throws (iOS) on input size mismatch and other failures. Wrap in try/catch in production code.
Actor blocking (iOS): waitForNextToken() blocks the actor's serial executor. This is fine for a dedicated model actor, but don't add unrelated methods to the same actor — they'll be blocked until token generation finishes.
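
For the error-handling note, one option is to convert exceptions into a result type at the call site so that callers cannot forget the failure path. This is a sketch only: InferenceResult and safeInfer are hypothetical names, and the generic Exception catch stands in for the SDK's own exception type.

```kotlin
// Hypothetical result type for an inference call.
sealed class InferenceResult<out T> {
    data class Success<T>(val outputs: T) : InferenceResult<T>()
    data class Failure(val reason: String) : InferenceResult<Nothing>()
}

// Runs `block` (the blocking model call) and maps any exception to Failure.
// In real code, catch the SDK's exception type instead of Exception.
fun <T> safeInfer(block: () -> T): InferenceResult<T> =
    try {
        InferenceResult.Success(block())
    } catch (e: Exception) {
        InferenceResult.Failure(e.message ?: e.javaClass.simpleName)
    }
```

If this is adapted to a suspend function, CancellationException should be rethrown rather than mapped to Failure, otherwise coroutine cancellation is silently swallowed.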

Troubleshooting

ANR / App Not Responding: You called model.run() or a constructor on the main thread. All SDK calls are blocking, so always use a background dispatcher.
Native crash / SIGSEGV: Likely concurrent run() calls on the same model. Use a Mutex + Dispatchers.IO (Android) or an actor (iOS) to serialize access — one inference at a time per model instance.
Memory leak / NPU unavailable: Forgot to call model.close() / deinit() / forceDeinit() in lifecycle callbacks.
Micro-stutters in real-time apps: Allocating new Tensor/ByteBuffer objects per frame. Pre-allocate with getInputBuffers() and reuse.
Wrong input size error: Ensure inputs.size matches the model's expected input count.
Target not supported: Some targets require specific hardware (e.g., QNN = Qualcomm SoC, CoreML = Apple).
NPU access on Free tier: NPU is only available on Samsung S25/S25 Ultra in Free tier; Pro tier unlocks all.
Model name format: Must be "account_name/project_name" — throws ZeticMLangeException otherwise.
LLM produces stale/mixed output: Forgot to call cleanUp() before starting a new conversation. Always reset KV cache between sessions.

Demo Model Keys (for testing)

"face_detection" — Face detection
"face_landmark" — Face landmark
"deepseek-r1-distill-qwen-1.5b-f16" — LLM demo
"Steve/YOLOv11_comparison" — Object detection
"OpenAI/whisper-tiny-encoder" / "OpenAI/whisper-tiny-decoder" — Speech recognition

Source

https://github.com/mouchegmouradian/claude-code-skills/blob/main/skills/zetic-mlange/SKILL.md

Overview

This skill helps you integrate the ZETIC Melange on-device AI SDK into Android and iOS apps. It covers loading and running models (including ZeticMLangeModel, ZeticMLangeLLMModel, and ZeticMLangeHFModel), LLM streaming, and HuggingFace model loading for on-device inference with NPU acceleration. It also addresses prerequisites like the Personal Key and model name, plus workflows for quantized GGUF LLMs and mobile deployment.

How This Skill Works

Melange SDK calls are synchronous and blocking and execute directly against the device hardware via fixed memory buffers. All in-app inference must be dispatched to a background thread (Android: Dispatchers.IO, iOS: Task/actor) to avoid UI stalls. Access to model instances must be serialized (Android: Mutex + Dispatchers.IO; iOS: actor) to prevent data races. The workflow typically involves authenticating with a Personal Key, selecting a model via account_name/project_name, loading supported formats (PyTorch .pt2, ONNX .onnx, TorchScript .pt), and then invoking model.run() for inference or streaming tokens for LLMs.

When to Use It

Running LLMs on-device with GGUF quantization for mobile privacy and latency benefits
Deploying ML models to Android/iOS with on-device NPU/GPU acceleration
Loading HuggingFace models for on-device inference and experimentation
Converting and deploying PyTorch (.pt2) or ONNX (.onnx) models for mobile inference
General model inference and LLM inference (including streaming token generation) on mobile

Quick Start

Step 1: Add the Melange SDK dependency to your project (Android: implementation 'com.zeticai.mlange:mlange:+'; iOS: CocoaPods or SPM) and request internet permission on Android
Step 2: In the Melange Dashboard, obtain your Personal Key and set modelName to your account_name/project_name (e.g., 'Steve/YOLOv11_comparison') and ensure the model is uploaded in a supported format (.pt2, .onnx, .pt)
Step 3: Load the model in a background thread and call model.run(inputs) (Android: with Dispatchers.IO and Mutex; iOS: Task/actor) to perform inference or streaming token generation

Best Practices

Always dispatch SDK calls to a background thread; never run model.run() on the main UI thread
Serialize access to a model instance using a Mutex (Android) or an actor (iOS) to ensure single-consumer behavior
Use the supported upload formats (.pt2, .onnx, .pt) and load models via the Melange dashboard (account_name/project_name) with a Personal Key
Prefer ZeticMLangeModel for general inference, ZeticMLangeLLMModel for LLM tasks, and ZeticMLangeHFModel for HuggingFace integrations
Plan for NPU/accelerator usage and test performance on target devices; monitor memory and thread usage to avoid bottlenecks

Example Use Cases

Android app (Kotlin) loading a PyTorch .pt2 model via ZeticMLangeModel and performing image inference on an NPU
iOS app (Swift) running an LLM using ZeticMLangeLLMModel with streaming token generation on a background task
HuggingFace model loading on-device using ZeticMLangeHFModel and streaming outputs to the UI
Converting and deploying a PyTorch/ONNX model to mobile with GGUF quantization for reduced latency
Thread-safe inference workflow using a Mutex (Android) or actor (iOS) to ensure serial access during concurrent inferences
