
MLOps

Verified

@ivangdavila

npx machina-cli add skill @ivangdavila/mlops --openclaw
Files (1)
SKILL.md
1.8 KB

Quick Reference

Topic                 File                  Key Trap
CI/CD and DAGs        pipelines.md          Coupling training/inference deps
Model serving         serving.md            Cold start with large models
Drift and alerts      monitoring.md         Only technical metrics
Versioning            reproducibility.md    Not versioning preprocessing
GPU infrastructure    gpu.md                GPU request = full device

Critical Traps

Training-Serving Skew:

  • Preprocessing in notebook ≠ preprocessing in service → silent bugs
  • Pandas in notebook → memory leaks in production (use native types)
  • Feature store values at training time ≠ serving time without proper joins
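One common defense against the skew above is to have the training job and the serving endpoint import the same preprocessing function, so the transform logic can never diverge. A minimal sketch (the `preprocess` function and `FEATURES` list are illustrative, not from the skill's files):

```python
# Sketch: one preprocessing function imported by BOTH the training job and
# the serving endpoint, instead of re-implementing the transform twice.

FEATURES = ["amount", "age"]  # placeholder feature names

def preprocess(record: dict) -> list[float]:
    """Pure-Python transform using native types (no pandas in the hot path)."""
    out = []
    for name in FEATURES:
        value = float(record.get(name, 0.0))
        # The SAME clipping rule applies at training and serving time.
        out.append(min(max(value, 0.0), 1e6))
    return out

# Training code and the serving handler both call preprocess():
train_row = preprocess({"amount": 120.5, "age": 34})
serve_row = preprocess({"amount": 120.5, "age": 34})
assert train_row == serve_row  # identical by construction
```

Because both paths share one function, a change to the transform ships to training and serving together instead of silently drifting apart.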

GPU Memory:

  • nvidia.com/gpu: 1 in a pod's resource requests reserves the ENTIRE GPU, not partial memory
  • MIG/MPS sharing has real limitations (not plug-and-play)
  • OOM on GPU kills pod with no useful logs
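To make the first trap concrete, here is a Kubernetes container resource spec sketched as a Python dict (pod and image names are placeholders): the `nvidia.com/gpu` value counts whole devices, so there is no fractional-memory unit at this level.

```python
# Sketch of a Kubernetes container resource spec, illustrating that
# nvidia.com/gpu counts WHOLE devices: the value 1 reserves an entire GPU.

container = {
    "name": "inference",
    "image": "registry.example.com/model-server:latest",  # placeholder
    "resources": {
        "limits": {
            "nvidia.com/gpu": 1,  # integer count of full GPUs, not memory
            "memory": "8Gi",      # host RAM; separate from GPU memory
        }
    },
}

# Fractional sharing requires MIG profiles or MPS instead, each with its
# own scheduling and isolation caveats (not plug-and-play).
assert isinstance(container["resources"]["limits"]["nvidia.com/gpu"], int)
```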

Model Versioning ≠ Code Versioning:

  • Model artifacts need separate versioning (MLflow, W&B, DVC)
  • Training data version + preprocessing version + code version = reproducibility
  • Rollback requires keeping old model versions deployable
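The reproducibility equation above can be sketched as a manifest that pins all three versions next to the model artifact. This is a stdlib-only illustration of the idea; in practice MLflow, W&B, or DVC record this lineage for you, and the hashed payloads below are placeholders:

```python
# Minimal sketch: pin data version + preprocessing version + code version
# in one manifest, so a model artifact traces back to exactly what built it.
import hashlib
import json

def digest(payload: bytes) -> str:
    """Short content hash used as a version identifier."""
    return hashlib.sha256(payload).hexdigest()[:12]

manifest = {
    "data_version":    digest(b"train.parquet contents"),    # hash of dataset
    "preproc_version": digest(b"def preprocess(...): ..."),  # hash of transform code
    "code_git_sha":    "a1b2c3d",                            # from `git rev-parse`
}

# Store the manifest alongside the model artifact; rollback then means
# redeploying the artifact whose manifest matches the last known-good run.
print(json.dumps(manifest, indent=2))
```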

Drift Detection Timing:

  • Retraining trigger isn't just "drift > threshold" → cost/benefit matters
  • Delayed ground truth makes concept drift detection lag weeks
  • Upstream data pipeline changes cause drift without model issues
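A retraining gate that combines a drift score with a cost/benefit check, as the first bullet suggests, can be sketched with the Population Stability Index (PSI). The bin proportions, threshold, and dollar figures below are illustrative assumptions, not values from the skill:

```python
# Sketch of a drift gate: retrain only when drift exceeds a threshold AND
# the expected benefit outweighs the retraining cost.
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over pre-binned proportions."""
    eps = 1e-6  # avoids log(0) for empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

train_bins = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_bins  = [0.10, 0.20, 0.30, 0.40]  # same feature observed in production

drift = psi(train_bins, live_bins)     # ~0.23 here; >0.2 is a common alarm level

retraining_cost_usd = 500.0            # compute + engineering time (placeholder)
estimated_gain_usd  = 1200.0           # e.g. from backtests on recent labels

should_retrain = drift > 0.2 and estimated_gain_usd > retraining_cost_usd
```

Note the gate still inherits the ground-truth lag problem: `estimated_gain_usd` depends on labeled outcomes, which for concept drift may arrive weeks late.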

Scope

This skill ONLY covers:

  • CI/CD pipelines for models
  • Model serving and scaling
  • Monitoring and drift detection
  • Reproducibility practices
  • GPU infrastructure patterns

Does NOT cover: ML algorithms, feature engineering, hyperparameter tuning.

Source

git clone https://clawhub.ai/ivangdavila/mlops

Overview

MLOps is the practice of deploying ML models to production using repeatable CI/CD pipelines, scalable model serving, and continuous monitoring. It emphasizes reproducibility, drift detection, and efficient GPU infrastructure, covering versioning, serving, pipelines, and monitoring patterns.

How This Skill Works

MLOps stitches together CI/CD pipelines for training and deployment, using DAGs and containerized artifacts, with separate model versioning (MLflow, W&B, DVC). It then serves models at scale with autoscaling endpoints, while monitoring drift and alerting to trigger retraining when needed.

When to Use It

  • You need repeatable pipelines from training to deployment with clear versioning
  • You must serve models at scale with autoscaling and reliable endpoints
  • You require drift detection and alerts to trigger retraining or rollback
  • You want end-to-end reproducibility across training data, preprocessing, and code
  • You're optimizing GPU infrastructure and memory usage for model workloads

Quick Start

  1. Define artifact/versioning strategy (MLflow/W&B/DVC) and version data, preprocessing, and code separately
  2. Build CI/CD pipelines and DAGs for training, validation, and deployment; containerize dependencies
  3. Deploy a scalable model serving endpoint, enable drift monitoring and alerts, and configure GPU resource requests
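The steps above can be sketched as an ordered pipeline of stages. Stage names and bodies are placeholders; a real pipeline would run these as DAG nodes in an orchestrator such as Airflow or Kubeflow:

```python
# Minimal sketch of the Quick Start as an ordered pipeline of stages.
# Stage names, metrics, and endpoint paths are placeholders.

def version_artifacts(ctx: dict) -> dict:
    ctx["version"] = "model-v1"          # step 1: pin artifact version
    return ctx

def train_and_validate(ctx: dict) -> dict:
    ctx["metrics"] = {"auc": 0.91}       # step 2: train + validation gate
    return ctx

def deploy_endpoint(ctx: dict) -> dict:
    ctx["endpoint"] = "/predict"         # step 3: serve + monitor
    return ctx

PIPELINE = [version_artifacts, train_and_validate, deploy_endpoint]

def run(pipeline, ctx=None):
    ctx = ctx or {}
    for stage in pipeline:               # stages run in dependency order
        ctx = stage(ctx)
    return ctx

result = run(PIPELINE)
```

Keeping each stage a pure function over a shared context makes the pipeline easy to containerize per stage, which is what avoids coupling training and inference dependencies.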

Best Practices

  • Version model artifacts separately from code, using MLflow, W&B, or DVC
  • Version training data, preprocessing, and code to ensure end-to-end reproducibility
  • Keep preprocessing consistent between training and serving to avoid training-serving skew
  • Build CI/CD pipelines and DAGs with dependency isolation and clear failure modes
  • Plan GPU resource requests and memory management, remembering that a GPU request reserves a full device and that MIG/MPS sharing has real constraints

Example Use Cases

  • A fintech team uses MLflow for artifact versioning, a GitOps-style CI/CD pipeline, and Kubernetes for scalable serving with drift alerts
  • A retailer deploys automated retraining triggered by drift alerts rather than fixed schedules, reducing stale models
  • A data science team separates training and deployment with DAG-based pipelines to avoid training-serving skew
  • An AI team uses MIG partitioning so small workloads avoid reserving full GPUs, while handling MIG's scheduling constraints
  • An organization maintains rollback by keeping old model versions deployable and testable in staging before production

