zetic-mlange
npx machina-cli add skill mouchegmouradian/claude-code-skills/zetic-mlange --openclaw
ZETIC Melange (MLange) SDK Integration Skill
This skill helps users integrate the ZETIC Melange on-device AI SDK into mobile applications.
Naming: The product is called Melange but source code uses MLange (e.g., ZeticMLangeModel).
Key Resources
- Dashboard: https://melange.zetic.ai
- Docs: https://docs.zetic.ai
- GitHub: https://github.com/zetic-ai
- Contact: contact@zetic.ai
Prerequisites
Users need two things from the Melange Dashboard:
- Personal Key: authentication credential
- Model Name: in account_name/project_name format (e.g., "Steve/YOLOv11_comparison")
Supported upload formats: PyTorch .pt2, ONNX .onnx, TorchScript .pt
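The two prerequisites and the upload formats above can be checked locally before any SDK call is made. The following is a minimal sketch, not part of the MLange SDK: `isValidModelName` and `isSupportedUpload` are hypothetical helpers, and the naming rule (exactly one slash, non-empty parts, no whitespace) is an assumption about what the dashboard accepts. Note that some demo keys listed under Demo Model Keys are bare names without a slash, so treat the name check as advisory.

```kotlin
// Hypothetical pre-flight helpers; not part of the MLange SDK.

// Upload extensions listed above: PyTorch .pt2, ONNX .onnx, TorchScript .pt
val supportedUploadExtensions = setOf("pt2", "onnx", "pt")

// Assumption: "account_name/project_name" means exactly one '/'
// separating two non-empty parts that contain no whitespace.
fun isValidModelName(name: String): Boolean {
    val parts = name.split("/")
    return parts.size == 2 &&
        parts.all { part -> part.isNotEmpty() && part.none { c -> c.isWhitespace() } }
}

// Compares the text after the last '.' (case-insensitive) with the supported set.
fun isSupportedUpload(fileName: String): Boolean {
    val extension = fileName.substringAfterLast('.', "").lowercase()
    return extension in supportedUploadExtensions
}
```

Running a check like this before constructing a model lets the app report a malformed name right away instead of waiting for an exception from the SDK.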
SDK Setup
Android (Gradle):
// app/build.gradle.kts
dependencies {
    implementation("com.zeticai.mlange:mlange:+")
}
android {
    packaging { jniLibs.useLegacyPackaging = true }
}
Add <uses-permission android:name="android.permission.INTERNET" /> to AndroidManifest.xml (SDK downloads model weights on first run).
iOS (CocoaPods):
pod 'ZeticMLange'
iOS (SPM): Add https://github.com/zetic-ai/ZeticMLangeiOS as a Swift Package dependency.
CRITICAL: Threading & Concurrency Rules
All MLange SDK calls are synchronous and blocking. They directly interface with native hardware (NPU/GPU/CPU) using fixed memory buffers. Every code example you generate MUST follow these rules:
Rule 1: Never Call on the Main/UI Thread
model.run(), model.waitForNextToken(), and even model construction can block for significant time. Always dispatch to a background thread.
Android (Kotlin):
// Always use Dispatchers.IO for blocking SDK calls (large thread pool designed for blocking work).
// Do NOT use Dispatchers.Default — it's sized to CPU cores and blocking calls would starve other coroutines.
withContext(Dispatchers.IO) {
    val outputs = model.run(inputs)
}
iOS (Swift):
// Use an actor or a detached Task to move off @MainActor
func infer(inputs: [Tensor]) async throws -> [Tensor] {
    try await Task.detached {
        try self.model.run(inputs: inputs) // Runs on the cooperative thread pool
    }.value
}
Rule 2: Single-Consumer Access
Model instances use fixed native memory buffers. Concurrent run() calls will overwrite inputs or crash the native driver. Guarantee serial access via Mutex + Dispatchers.IO (Android) or actors (iOS).
Android (Kotlin):
// Mutex serializes access; Dispatchers.IO borrows a thread from the shared pool only when needed.
private val modelMutex = Mutex()
suspend fun infer(inputs: Array<Tensor>): Array<Tensor> = modelMutex.withLock {
    withContext(Dispatchers.IO) {
        model.run(inputs) // Only one call at a time, guaranteed by mutex
    }
}
iOS (Swift):
// Use a dedicated actor to serialize access
actor ModelInferenceActor {
    private let model: ZeticMLangeModel
    init(model: ZeticMLangeModel) {
        self.model = model
    }
    func infer(inputs: [Tensor]) throws -> [Tensor] {
        try model.run(inputs: inputs) // Actor guarantees serial access
    }
}
// Usage — safe from any async context
let outputs = try await inferenceActor.infer(inputs: inputs)
Rule 3: Lifecycle-Aware Cleanup
Models hold significant native memory and hardware handles. Always close/deinit in lifecycle callbacks.
Android: Call model.close() in ViewModel.onCleared() or Activity.onDestroy().
iOS: Call cleanup from .onDisappear or an explicit teardown() method. Do NOT put cleanup in deinit via Task { } — deinit is synchronous and the Task is fire-and-forget with no guarantee it runs before the process suspends.
Failing to close a model can leak memory and prevent other apps from accessing the NPU.
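Several lifecycle callbacks can fire for the same screen, so it helps if cleanup is safe to call more than once. The sketch below shows one way to make that idempotent. `FakeNativeModel` is a hypothetical stand-in for an SDK model, and `ModelHolder` is an illustrative wrapper, not an SDK class; whether the real `close()` tolerates repeated calls is not stated in this document, so the wrapper assumes it might not.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for an SDK model that owns native memory.
class FakeNativeModel {
    var closeCount = 0
        private set

    fun close() {
        closeCount++
    }
}

// Owns the model and forwards close() to it at most once,
// even if both onCleared() and onDestroy() end up calling it.
class ModelHolder(private val model: FakeNativeModel) : AutoCloseable {
    private val closed = AtomicBoolean(false)

    val isClosed: Boolean
        get() = closed.get()

    override fun close() {
        // compareAndSet lets exactly one caller win the right to release.
        if (closed.compareAndSet(false, true)) {
            model.close()
        }
    }
}
```

On Android, a holder like this would typically be closed from `ViewModel.onCleared()`; any later call from another callback becomes a no-op.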
Rule 4: Zero-Allocation in Hot Loops
For real-time inference (camera feed, audio stream), pre-allocate buffers once and reuse them every frame. Creating new Tensor/FloatArray/ByteBuffer objects per frame triggers GC pauses.
// GOOD: Pre-allocate once
val inputBuffers = model.getInputBuffers()
// Reuse in every frame
fun onFrame(frameData: ByteArray) {
    inputBuffers[0].copy(frameData)
    val outputs = model.run()
}
// BAD: Allocating per frame
fun onFrame(frameData: ByteArray) {
    val tensor = Tensor(frameData) // GC pressure!
    val outputs = model.run(arrayOf(tensor))
}
Rule 5: Drop-Frame Strategy for Real-Time Data
If the model is busy when a new frame arrives, skip it rather than queueing. This keeps the app responsive.
private val isRunning = AtomicBoolean(false)
fun onCameraFrame(frame: ByteArray) {
    if (!isRunning.compareAndSet(false, true)) return // Model busy: drop frame
    scope.launch(Dispatchers.IO) {
        try {
            inputBuffers[0].copy(frame)
            val outputs = model.run()
            withContext(Dispatchers.Main) { updateUI(outputs) }
        } finally {
            isRunning.set(false)
        }
    }
}
Rule 6: Reset State Between Sessions (LLM)
LLM models maintain internal KV cache state. Always call cleanUp() before starting a new conversation or prompt session to avoid stale context contamination.
// Before each new conversation
model.cleanUp()
model.run("New conversation prompt...")
General Model Inference (ZeticMLangeModel)
Android (Kotlin)
import com.zeticai.mlange.core.model.ZeticMLangeModel
import com.zeticai.mlange.core.tensor.Tensor
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
// --- In a Repository or ViewModel ---
private val modelMutex = Mutex()
private var model: ZeticMLangeModel? = null
// Initialize off main thread. Mutex prevents double-init races.
suspend fun initModel(context: Context) = modelMutex.withLock {
if (model == null) {
withContext(Dispatchers.IO) {
model = ZeticMLangeModel(context, "PERSONAL_KEY", "account/model_name")
}
}
}
// Thread-safe inference — mutex serializes, IO provides background thread.
suspend fun infer(inputs: Array<Tensor>): Array<Tensor> = modelMutex.withLock {
val m = model ?: throw IllegalStateException("Model not initialized — call initModel() first")
withContext(Dispatchers.IO) {
m.run(inputs)
}
}
// Lifecycle cleanup (e.g., in ViewModel.onCleared or Repository.close)
fun close() {
model?.close()
model = null
}
Why Mutex + Dispatchers.IO: The mutex guarantees only one call touches the model at a time. Dispatchers.IO borrows a thread from the shared pool only during the blocking call — no dedicated thread sits idle between inferences. This is lightweight and scales to multiple model instances (pipelines, encoder-decoder pairs). For the simplest single-model case, newSingleThreadContext("Model") is an alternative that avoids the mutex entirely via thread confinement.
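The thread-confinement alternative mentioned above can be illustrated without coroutines by using a plain single-thread executor from the JDK. This is a sketch of the same idea, not the `newSingleThreadContext` API itself: `FakeModel` is a hypothetical stand-in for an SDK model, and `ConfinedModel` is an illustrative wrapper. Because every call runs on the one worker thread, no lock is needed.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for a model whose run() must never be entered concurrently.
class FakeModel {
    @Volatile
    var lastThread: String = ""

    fun run(input: Int): Int {
        lastThread = Thread.currentThread().name
        return input * 2
    }
}

// Funnels every call through a single worker thread named "Model".
class ConfinedModel(private val model: FakeModel) : AutoCloseable {
    private val executor = Executors.newSingleThreadExecutor { r -> Thread(r, "Model") }

    // Blocks the caller until the worker returns a result,
    // so this must itself be called off the main thread.
    fun infer(input: Int): Int =
        executor.submit(Callable { model.run(input) }).get()

    override fun close() {
        executor.shutdown()
        executor.awaitTermination(1, TimeUnit.SECONDS)
    }
}
```

The trade-off is the one described above: the worker thread stays alive between inferences, whereas the mutex approach only borrows a pooled thread during the blocking call.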
Constructor variants (all blocking — always wrap in background dispatcher):
- (context, personalKey, name): auto mode
- (context, personalKey, name, modelMode=): specify inference mode
- (context, personalKey, name, target=): specify hardware target (Pro tier)
- (context, personalKey, name, target=, apType=): target + processor (Pro tier)
All constructors accept optional: version: Int?, onProgress: ((Float) -> Unit)?, onStatusChanged: ((ModelLoadingStatus) -> Unit)?
iOS (Swift)
import ZeticMLange
// --- Use a dedicated actor for thread-safe serial access ---
actor MLangeModelActor {
private var model: ZeticMLangeModel?
func initialize(personalKey: String, name: String) throws {
model = try ZeticMLangeModel(personalKey: personalKey, name: name)
}
func infer(inputs: [Tensor]) throws -> [Tensor] {
guard let model else { throw ZeticMLangeError("ModelNotInitialized", "Call initialize() before using the model") }
return try model.run(inputs: inputs)
}
func cleanup() {
model = nil // releases native resources
}
}
// --- In a ViewModel (@MainActor) ---
@MainActor
class InferenceViewModel: ObservableObject {
private let modelActor = MLangeModelActor()
func setup() async throws {
try await modelActor.initialize(
personalKey: "PERSONAL_KEY", name: "account/model_name")
}
func runInference(inputs: [Tensor]) async throws -> [Tensor] {
try await modelActor.infer(inputs: inputs)
}
// Call explicitly from .onDisappear or when done — do NOT rely on deinit.
func teardown() async {
await modelActor.cleanup()
}
}
Constructor variants (all blocking — always call from actor or detached Task):
LLM Inference (ZeticMLangeLLMModel)
Android (Kotlin)
import com.zeticai.mlange.core.model.llm.*
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
private val llmMutex = Mutex()
private var llmModel: ZeticMLangeLLMModel? = null
// Initialize off main thread — model downloads weights on first run (~700MB for LLMs).
// Always show download progress to the user.
suspend fun initLLM(context: Context, onProgress: (Float) -> Unit) = llmMutex.withLock {
if (llmModel == null) {
withContext(Dispatchers.IO) {
llmModel = ZeticMLangeLLMModel(context, "PERSONAL_KEY", "account/model_name",
target = LLMTarget.LLAMA_CPP,
quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
onProgress = onProgress)
}
}
}
// Stream tokens as a Flow — idiomatic Kotlin, supports cancellation between emissions.
// Mutex prevents concurrent generate() calls (e.g., from screen rotation re-collecting).
fun generate(prompt: String): Flow<String> = flow {
llmMutex.withLock {
val model = llmModel ?: throw IllegalStateException("Model not initialized")
model.cleanUp() // Reset KV cache for new conversation
model.run(prompt)
while (currentCoroutineContext().isActive) {
val result: LLMNextTokenResult = model.waitForNextToken()
if (result.status != 0) break
emit(result.token)
}
}
}.flowOn(Dispatchers.IO)
// Cleanup
fun close() {
llmModel?.deinit()
llmModel = null
}
Cancellation note: waitForNextToken() is a blocking JNI call — Job.cancel() cannot interrupt it mid-block. Cancellation takes effect after the current token returns, when flow {} checks isActive before the next emit(). For instant stop, check if the SDK's model exposes a cancel() or stop() method.
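The behaviour described in the cancellation note can be shown in isolation: a stop request is only observed between blocking calls, never during one. The sketch below uses a hypothetical `FakeTokenSource` in place of the SDK model and a plain `AtomicBoolean` in place of coroutine cancellation; neither is SDK API.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a blocking token source; not the SDK's API.
class FakeTokenSource(private val tokens: List<String>) {
    private var index = 0

    // Returns the next token, or null once generation is finished.
    fun waitForNextToken(): String? =
        if (index < tokens.size) tokens[index++] else null
}

// Drains tokens until the source finishes or a stop is requested.
// The flag is consulted once per iteration, before the next blocking call.
fun collectTokens(
    source: FakeTokenSource,
    stopRequested: AtomicBoolean,
    onToken: (String) -> Unit
) {
    while (!stopRequested.get()) {
        val token = source.waitForNextToken() ?: break
        onToken(token)
    }
}
```

A stop requested while a token is being produced still lets that token through; the loop exits at the next check, which matches the latency described above.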
LLM Quantization Types: GGUF_QUANT_ORG, GGUF_QUANT_F16, GGUF_QUANT_BF16, GGUF_QUANT_Q8_0, GGUF_QUANT_Q6_K, GGUF_QUANT_Q4_K_M, GGUF_QUANT_Q3_K_M, GGUF_QUANT_Q2_K
LLM KV Cache Policies: CLEAN_UP_ON_FULL (default), DO_NOT_CLEAN_UP
iOS (Swift)
import ZeticMLange
actor LLMModelActor {
private var model: ZeticMLangeLLMModel?
// Model downloads weights on first run (~700MB for LLMs).
// Pass onDownload to show progress to the user.
func initialize(personalKey: String, name: String,
onDownload: ((Float) -> Void)? = nil) throws {
model = try ZeticMLangeLLMModel(personalKey: personalKey, name: name,
target: .LLAMA_CPP, quantType: .GGUF_QUANT_Q4_K_M,
onDownload: onDownload)
}
/// Returns an AsyncStream of tokens. Call from any async context.
/// Note: waitForNextToken() blocks its thread — cancellation takes effect
/// after the current token returns, not mid-block.
func chat(prompt: String) throws -> AsyncStream<String> {
guard let model else { throw ZeticMLangeError("ModelNotInitialized", "Call initialize() before using the model") }
try model.cleanUp()
let _ = try model.run(prompt)
return AsyncStream { continuation in
while true {
let result = model.waitForNextToken()
if result.isFinished || Task.isCancelled {
continuation.finish()
break
}
continuation.yield(result.token)
}
}
}
func cleanup() {
model?.forceDeinit()
model = nil
}
}
// --- Usage in a @MainActor ViewModel ---
@MainActor
class ChatViewModel: ObservableObject {
@Published var response = ""
private let llmActor = LLMModelActor()
private var generateTask: Task<Void, Error>?
func send(prompt: String) async throws {
response = ""
let stream = try await llmActor.chat(prompt: prompt)
for await token in stream {
response += token // @MainActor — safe to update UI
}
}
// Call explicitly from .onDisappear — do NOT rely on deinit for cleanup.
// deinit is synchronous; Task {} inside it is fire-and-forget with no guarantee of execution.
func teardown() async {
generateTask?.cancel()
await llmActor.cleanup()
}
}
HuggingFace Model Loading (ZeticMLangeHFModel)
Android (Kotlin)
import com.zeticai.mlange.core.model.ZeticMLangeHFModel
// Load from HuggingFace repo (public)
val model = ZeticMLangeHFModel(context, "username/repo-name")
// With access token (private repos)
val model = ZeticMLangeHFModel(context, "username/repo-name",
userAccessToken = "hf_your_token")
val outputs = model.run(inputTensors)
model.close()
iOS (Swift)
import ZeticMLange
let model = try await ZeticMLangeHFModel("username/repo-name",
userAccessToken: "hf_your_token")
let outputs = try model.run(inputs: inputTensors)
Note: iOS HF model loading is async — use await.
Common Patterns
Pipeline (e.g., Face Detection → Emotion Recognition)
// Both models serialized by running on Dispatchers.IO within a single coroutine
suspend fun detectEmotion(bitmap: Bitmap): EmotionResult = withContext(Dispatchers.IO) {
// Step 1: Detect faces
val detectionOutputs = detectionModel.run(preprocessedImage)
// Step 2: Extract face region
val faceRegion = extractFaceRegion(bitmap, detectionOutputs)
// Step 3: Classify emotion
val emotionOutputs = emotionModel.run(faceRegion)
parseEmotionResult(emotionOutputs)
}
Encoder-Decoder (e.g., Whisper)
val encoder = ZeticMLangeModel(context, personalKey, "OpenAI/whisper-tiny-encoder")
val decoder = ZeticMLangeModel(context, personalKey, "OpenAI/whisper-tiny-decoder")
val encoderOutputs = encoder.run(melSpectrogramInputs)
// Feed encoder outputs into decoder
val decoderOutputs = decoder.run(arrayOf(decoderInputIds, encoderOutputs[0], attentionMask))
Hardware Targets (General Models)
| Target | Description |
|---|---|
| TFLITE_FP32/FP16/QUANT | TensorFlow Lite backends |
| ORT / ORT_NNAPI | ONNX Runtime (with optional NNAPI) |
| QNN / QNN_FP16 / QNN_QUANT | Qualcomm Neural Network |
| COREML / COREML_FP32 / COREML_QUANT | Apple CoreML |
| NEUROPILOT / NEUROPILOT_QUANT | MediaTek NeuroPilot |
| EXYNOS / EXYNOS_QUANT | Samsung Exynos |
| KIRIN / KIRIN_QUANT | Huawei Kirin |
| LITERT_FP32/FP16/QUANT | Google LiteRT |
ModelMode.RUN_AUTO automatically selects the best target for the device.
Platform Notes
- Progress callback naming: Android uses onProgress: ((Float) -> Unit)?, iOS uses onDownload: ((Float) -> Void)?. Same behavior, different parameter names.
- Error handling: model.run() throws ZeticMLangeException (Android) or is marked throws (iOS) on input size mismatch and other failures. Wrap in try/catch in production code.
- Actor blocking (iOS): waitForNextToken() blocks the actor's serial executor. This is fine for a dedicated model actor, but don't add unrelated methods to the same actor; they'll be blocked until token generation finishes.
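For the error-handling note, one possible shape is to convert SDK failures into a value the UI layer can render instead of letting them propagate. This is a sketch under stated assumptions: `FakeMLangeException` stands in for the SDK's exception type so the example compiles on its own, and `InferenceOutcome` and `safeInfer` are hypothetical names.

```kotlin
// Hypothetical stand-in for the SDK's exception type.
class FakeMLangeException(message: String) : RuntimeException(message)

// Result type the UI layer can branch on.
sealed class InferenceOutcome {
    class Success(val outputs: List<FloatArray>) : InferenceOutcome()
    class Failure(val reason: String) : InferenceOutcome()
}

// Wraps a blocking inference call. Only the SDK-style exception is caught;
// anything else (for example a programming error) still propagates.
fun safeInfer(run: () -> List<FloatArray>): InferenceOutcome =
    try {
        InferenceOutcome.Success(run())
    } catch (e: FakeMLangeException) {
        InferenceOutcome.Failure(e.message ?: "inference failed")
    }
```

In real code the lambda would wrap the actual `model.run(...)` call and the catch clause would name the SDK's own exception type.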
Troubleshooting
- ANR / App Not Responding: You called model.run() or a constructor on the main thread. All SDK calls are blocking, so always use a background dispatcher.
- Native crash / SIGSEGV: Likely concurrent run() calls on the same model. Use a Mutex + Dispatchers.IO (Android) or an actor (iOS) to serialize access: one inference at a time per model instance.
- Memory leak / NPU unavailable: Forgot to call model.close() / deinit() / forceDeinit() in lifecycle callbacks.
- Micro-stutters in real-time apps: Allocating new Tensor/ByteBuffer objects per frame. Pre-allocate with getInputBuffers() and reuse.
- Wrong input size error: Ensure inputs.size matches the model's expected input count.
- Target not supported: Some targets require specific hardware (e.g., QNN = Qualcomm SoC, CoreML = Apple).
- NPU access on Free tier: NPU is only available on Samsung S25/S25 Ultra in Free tier; Pro tier unlocks all.
- Model name format: Must be "account_name/project_name"; throws ZeticMLangeException otherwise.
- LLM produces stale/mixed output: Forgot to call cleanUp() before starting a new conversation. Always reset KV cache between sessions.
Demo Model Keys (for testing)
- "face_detection": Face detection
- "face_landmark": Face landmark
- "deepseek-r1-distill-qwen-1.5b-f16": LLM demo
- "Steve/YOLOv11_comparison": Object detection
- "OpenAI/whisper-tiny-encoder" / "OpenAI/whisper-tiny-decoder": Speech recognition
Source
https://github.com/mouchegmouradian/claude-code-skills/blob/main/skills/zetic-mlange/SKILL.md
Overview
This skill helps you integrate the ZETIC Melange on-device AI SDK into Android and iOS apps. It covers loading and running models (including ZeticMLangeModel, ZeticMLangeLLMModel, and ZeticMLangeHFModel), LLM streaming, and HuggingFace model loading for on-device inference with NPU acceleration. It also addresses prerequisites like the Personal Key and model name, plus workflows for quantized GGUF LLMs and mobile deployment.
How This Skill Works
Melange SDK calls are synchronous and blocking and execute directly against the device hardware via fixed memory buffers. All in-app inference must be dispatched to a background thread (Android: Dispatchers.IO, iOS: Task/actor) to avoid UI stalls. Access to model instances must be serialized (Android: Mutex + Dispatchers.IO; iOS: actor) to prevent data races. The workflow typically involves authenticating with a Personal Key, selecting a model via account_name/project_name, loading supported formats (PyTorch .pt2, ONNX .onnx, TorchScript .pt), and then invoking model.run() for inference or streaming tokens for LLMs.
When to Use It
- Running LLMs on-device with GGUF quantization for mobile privacy and latency benefits
- Deploying ML models to Android/iOS with on-device NPU/GPU acceleration
- Loading HuggingFace models for on-device inference and experimentation
- Converting and deploying PyTorch (.pt2) or ONNX (.onnx) models for mobile inference
- General model inference and LLM inference (including streaming token generation) on mobile
Quick Start
- Step 1: Add the Melange SDK dependency to your project (Android: implementation 'com.zeticai.mlange:mlange:+'; iOS: CocoaPods or SPM) and request internet permission on Android
- Step 2: In the Melange Dashboard, obtain your Personal Key and set modelName to your account_name/project_name (e.g., 'Steve/YOLOv11_comparison') and ensure the model is uploaded in a supported format (.pt2, .onnx, .pt)
- Step 3: Load the model in a background thread and call model.run(inputs) (Android: with Dispatchers.IO and Mutex; iOS: Task/actor) to perform inference or streaming token generation
Best Practices
- Always dispatch SDK calls to a background thread; never run model.run() on the main UI thread
- Serialize access to a model instance using a Mutex (Android) or an actor (iOS) to ensure single-consumer behavior
- Use the supported upload formats (.pt2, .onnx, .pt) and load models via the Melange dashboard (account_name/project_name) with a Personal Key
- Prefer ZeticMLangeModel for general inference, ZeticMLangeLLMModel for LLM tasks, and ZeticMLangeHFModel for HuggingFace integrations
- Plan for NPU/accelerator usage and test performance on target devices; monitor memory and thread usage to avoid bottlenecks
Example Use Cases
- Android app (Kotlin) loading a PyTorch .pt2 model via ZeticMLangeModel and performing image inference on an NPU
- iOS app (Swift) running an LLM using ZeticMLangeLLMModel with streaming token generation on a background task
- HuggingFace model loading on-device using ZeticMLangeHFModel and streaming outputs to the UI
- Converting and deploying a PyTorch/ONNX model to mobile with GGUF quantization for reduced latency
- Thread-safe inference workflow using a Mutex (Android) or actor (iOS) to ensure serial access during concurrent inferences