MLOps
Verified · @ivangdavila
`npx machina-cli add skill @ivangdavila/mlops --openclaw`
Quick Reference
| Topic | File | Key Trap |
|---|---|---|
| CI/CD and DAGs | pipelines.md | Coupling training/inference deps |
| Model serving | serving.md | Cold start with large models |
| Drift and alerts | monitoring.md | Only technical metrics |
| Versioning | reproducibility.md | Not versioning preprocessing |
| GPU infrastructure | gpu.md | GPU request = full device |
Critical Traps
Training-Serving Skew:
- Preprocessing in the notebook ≠ preprocessing in the service → silent bugs (see the sketch after this list)
- Pandas-based preprocessing in the notebook → memory leaks in a long-running service (use native types in the serving path)
- Feature store values at training time ≠ values at serving time without point-in-time-correct joins
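One way to close that gap is to ship preprocessing and model as a single artifact. A minimal sketch, assuming scikit-learn and joblib, with hypothetical column names:

```python
# Bundle preprocessing and model into ONE artifact so training and serving
# cannot drift apart. Column names below are hypothetical.
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "age"]),                  # hypothetical numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # hypothetical categorical column
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
# model.fit(X_train, y_train)  # training data comes from your pipeline

# Persist a single artifact; the serving process loads the same object,
# so notebook-only preprocessing can never diverge from production.
joblib.dump(model, "model.joblib")
served_model = joblib.load("model.joblib")
# served_model.predict(incoming_dataframe)
```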
GPU Memory:
- `nvidia.com/gpu: 1` in resource requests reserves the ENTIRE GPU, not a slice of its memory (see the sketch after this list)
- MIG/MPS sharing has real limitations (not plug-and-play)
- OOM on the GPU kills the pod with no useful logs
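To make the whole-device behavior concrete, here is a hedged sketch of a GPU request using the Kubernetes Python client; the container image and names are illustrative:

```python
# A whole-GPU request expressed with the Kubernetes Python client.
# Assumes the `kubernetes` package and the NVIDIA device plugin on the cluster.
from kubernetes import client

gpu_container = client.V1Container(
    name="model-server",
    image="registry.example.com/model-server:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        # nvidia.com/gpu is counted in whole devices: "1" reserves an ENTIRE GPU.
        # Fractional requests are invalid; sharing needs MIG profiles or MPS.
        limits={"nvidia.com/gpu": "1"},
    ),
)
print(gpu_container.resources.limits)
```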
Model Versioning ≠ Code Versioning:
- Model artifacts need separate versioning (MLflow, W&B, DVC)
- Training data version + preprocessing version + code version = reproducibility (example after this list)
- Rollback requires keeping old model versions deployable
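A minimal sketch of tying those three versions to one model artifact, assuming MLflow is installed; the tag values and metric are illustrative:

```python
# Tag a model run with data, preprocessing, and code versions, then log the
# artifact. Assumes MLflow is installed; tag values are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.set_tags({
        "data_version": "snapshot-2024-06-01",  # hypothetical dataset snapshot id
        "preprocessing_version": "prep-v3",     # hypothetical transform version
        "git_commit": "abc1234",                # code version injected by CI
    })
    mlflow.log_metric("val_auc", 0.91)          # illustrative validation metric
    mlflow.sklearn.log_model(model, artifact_path="model")
    # With a registry-backed tracking server you would also pass
    # registered_model_name="..." so old versions stay deployable for rollback.
```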
Drift Detection Timing:
- Retraining trigger isn't just "drift > threshold" → cost/benefit matters (see the drift check after this list)
- Delayed ground truth makes concept drift detection lag weeks
- Upstream data pipeline changes cause drift without model issues
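As a rough illustration of a drift check that feeds such a decision, here is a population stability index (PSI) sketch in plain NumPy; the 0.2 threshold is a rule of thumb, not a universal trigger:

```python
# Population stability index (PSI) drift check in plain NumPy.
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid dividing by or taking log of zero
    o_pct = np.clip(o_pct, 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)  # reference distribution
live_feature = rng.normal(0.3, 1.0, 10_000)      # shifted live traffic

score = psi(training_feature, live_feature)
print(f"PSI = {score:.3f}")
# PSI > 0.2 is often read as significant drift, but the retraining decision
# should also weigh cost vs. expected benefit, and delayed ground truth means
# concept drift shows up weeks after covariate drift does.
if score > 0.2:
    print("significant drift: evaluate retraining cost/benefit")
```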
Scope
This skill ONLY covers:
- CI/CD pipelines for models
- Model serving and scaling
- Monitoring and drift detection
- Reproducibility practices
- GPU infrastructure patterns
Does NOT cover: ML algorithms, feature engineering, hyperparameter tuning.
Overview
MLOps is the practice of deploying ML models to production using repeatable CI/CD pipelines, scalable model serving, and continuous monitoring. It emphasizes reproducibility, drift detection, and efficient GPU infrastructure, covering versioning, serving, pipelines, and monitoring patterns.
How This Skill Works
MLOps stitches together CI/CD pipelines for training and deployment, using DAGs and containerized artifacts, with separate model versioning (MLflow, W&B, DVC). It then serves models at scale with autoscaling endpoints, while monitoring drift and alerting to trigger retraining when needed.
When to Use It
- You need repeatable pipelines from training to deployment with clear versioning
- You must serve models at scale with autoscaling and reliable endpoints
- You require drift detection and alerts to trigger retraining or rollback
- You want end-to-end reproducibility across training data, preprocessing, and code
- You're optimizing GPU infrastructure and memory usage for model workloads
Quick Start
- Step 1: Define artifact/versioning strategy (MLflow/W&B/DVC) and version data, preprocessing, and code separately
- Step 2: Build CI/CD pipelines and DAGs for training, validation, and deployment; containerize dependencies
- Step 3: Deploy a scalable model serving endpoint, enable drift monitoring and alerts, and configure GPU resource requests (serving sketch below)
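For Step 3, a minimal serving sketch, assuming fastapi, uvicorn, and a joblib-serialized artifact; the artifact path and payload shape are illustrative:

```python
# Minimal serving endpoint that loads one versioned artifact at startup.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # artifact produced by the training pipeline

class PredictRequest(BaseModel):
    features: list[float]  # flat feature vector for a single example

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}

# Run with: uvicorn serve:app --host 0.0.0.0 --port 8000
```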
Best Practices
- Version model artifacts separately from code, using MLflow, W&B, or DVC
- Version training data, preprocessing, and code to ensure end-to-end reproducibility
- Keep preprocessing consistent between training and serving to avoid training-serving skew
- Build CI/CD pipelines and DAGs with dependency isolation and clear failure modes (see the DAG sketch after this list)
- Plan GPU resource requests and memory management up front: a plain GPU request reserves the whole device, and MIG/MPS sharing comes with real constraints
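A minimal train → validate → deploy DAG sketch, assuming Apache Airflow 2.4+; the task bodies are placeholders:

```python
# Train -> validate -> deploy DAG, assuming Apache Airflow 2.4+.
# In practice each step runs in its own image so training and inference
# dependencies stay isolated.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def train():
    print("train the model and log the artifact to the registry")

def validate():
    print("compare the candidate against the current production model")

def deploy():
    print("promote the artifact and roll out the serving endpoint")

with DAG(
    dag_id="model_release",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered by CI or a drift alert rather than a fixed cron
    catchup=False,
) as dag:
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)
    t_deploy = PythonOperator(task_id="deploy", python_callable=deploy)

    t_train >> t_validate >> t_deploy
```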
Example Use Cases
- A fintech team uses MLflow for artifact versioning, a GitOps-style CI/CD pipeline, and Kubernetes for scalable serving with drift alerts
- A retailer deploys automated retraining triggered by drift alerts rather than fixed schedules, reducing stale models
- A data science team separates training and deployment with DAG-based pipelines to avoid training-serving skew
- An AI team partitions GPUs with MIG so small workloads don't reserve entire devices, working within MIG's configuration constraints
- An organization maintains rollback by keeping old model versions deployable and testable in staging before production