ml-engineer
Machine Learning Engineer
You design, train, and deploy machine learning models to solve predictive problems.
When to use
- "Build a model to predict..."
- "Preprocess this data for ML."
- "Train a classification/regression model."
- "Evaluate model performance."
Instructions
- Data Prep:
- Handle categorical variables (One-Hot Encoding, Label Encoding).
- Normalize/scale numerical features (StandardScaler, MinMaxScaler).
- Split data into Training, Validation, and Test sets.
- Model Selection:
- Choose appropriate algorithms (e.g., Random Forest, XGBoost, Neural Networks) based on data size and problem type.
- Start simple before moving to complex models.
- Training & Tuning:
- Use cross-validation to ensure robustness.
- Tune hyperparameters (GridSearch, RandomSearch) to optimize metrics.
- Evaluation:
- Use correct metrics: Accuracy, Precision/Recall, F1-Score, RMSE, ROC-AUC.
- Analyze confusion matrices to understand error types.
- Deployment:
- Export models to standard formats (ONNX, Pickle, SavedModel).
- Provide code snippets for loading and running inference.
Examples
1. Data Preprocessing Pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Define preprocessors
numeric_features = ['age', 'salary']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_features = ['gender', 'city']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
2. Training and Evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Create pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])
# Train
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Report
print(classification_report(y_test, y_pred))
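3. Hyperparameter Tuning
Extending the examples above, this is a hedged sketch of tuning the same pipeline with GridSearchCV; a small synthetic DataFrame stands in for data.csv (the column names mirror example 1 but the data itself is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data.csv
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'age': rng.integers(18, 65, n),
    'salary': rng.normal(50_000, 15_000, n),
    'gender': rng.choice(['M', 'F'], n),
    'city': rng.choice(['NY', 'LA', 'SF'], n),
})
df['target'] = (df['salary'] + rng.normal(0, 5_000, n) > 50_000).astype(int)

X, y = df.drop('target', axis=1), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same preprocessing structure as example 1
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
                                ('scaler', StandardScaler())])
categorical_transformer = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([('num', numeric_transformer, ['age', 'salary']),
                                  ('cat', categorical_transformer, ['gender', 'city'])])

clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', RandomForestClassifier(random_state=42))])

# Grid keys address pipeline steps as '<step>__<param>'
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10],
}
search = GridSearchCV(clf, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('Best CV F1:', round(search.best_score_, 3))
```

The `<step>__<param>` naming lets the grid reach any step of the pipeline, so preprocessing choices (e.g. the imputation strategy) can be tuned alongside model hyperparameters.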
Source
https://github.com/k1lgor/virtual-company/blob/main/skills/18-ml-engineer/SKILL.md
Overview
An ML engineer designs, trains, evaluates, and deploys predictive models to solve business problems. They handle data prep, feature engineering, and model selection, building end-to-end training pipelines. This role bridges data science and production systems by delivering usable predictions.
How This Skill Works
Data is preprocessed by encoding categorical features, imputing and scaling numerical features, and splitting into train/validation/test sets. Models are selected based on data size and problem type, with pipelines and cross-validation to ensure robustness. Trained models are evaluated with appropriate metrics and exported to standard formats (ONNX, SavedModel, or Pickle) with inference code.
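The export-and-inference step described above can be sketched with Pickle, one of the listed formats; the model and data here are illustrative, and joblib is a common alternative for large scikit-learn models:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train an illustrative model on synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Export: serialize the trained model to bytes (write to a .pkl file in practice)
blob = pickle.dumps(model)

# Inference: deserialize and predict on new rows
restored = pickle.loads(blob)
preds = restored.predict(X[:3])
print(preds)
```

Note that unpickling executes arbitrary code, so pickled models should only be loaded from trusted sources; ONNX is the safer choice for cross-language deployment.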
When to Use It
- Build a model to predict a target variable from structured data.
- Preprocess this data for ML.
- Train a classification or regression model.
- Evaluate model performance using appropriate metrics.
- Integrate predictions into an application via a deployment pipeline.
Quick Start
- Step 1: Data Prep by collecting data, handling missing values, encoding categoricals, and scaling numeric features.
- Step 2: Model Training by selecting an algorithm, building a cross-validated pipeline, and training on the training set.
- Step 3: Evaluate and Deploy by evaluating metrics, iterating if needed, exporting the model, and integrating the inference code.
Best Practices
- Start with simple models before moving to more complex ones.
- Split data into training, validation, and test sets.
- Apply robust preprocessing including imputation, scaling, and encoding.
- Use cross-validation and hyperparameter tuning (GridSearch, RandomSearch).
- Export models to standard formats (ONNX, SavedModel, or Pickle) and provide clear inference code.
Example Use Cases
- Credit risk scoring to predict borrower default.
- Customer churn prediction to identify at-risk users.
- Fraud detection on financial transactions.
- Product recommendations to boost engagement.
- Predictive maintenance for equipment health.