From skillry-data-ml-ai-engineering
Use when you need to review or design model serving and inference — batch vs online patterns, latency/throughput, model versioning, rollback, and monitoring for drift and skew.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skillry-data-ml-ai-engineering:314-model-serving-and-inference
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Review or design how a trained model is served so that predictions are correct, fast enough, versioned, rollback-able, and monitored. Cover serving pattern selection (batch scoring vs online/real-time vs streaming), latency and throughput targets, request validation, model versioning and registry, safe rollout/rollback, training/serving skew, and production monitoring for input drift and prediction quality. The goal is a serving path where the deployed model version is known, can be reverted instantly, applies the exact same preprocessing as training, and raises an alert when inputs or outputs drift.
find . -name "serve*.py" -o -name "predict*.py" -o -name "*inference*.py" -o -name "app.py" -o -name "main.py" 2>/dev/null | grep -v __pycache__ | head -30
grep -rn "@app.post\|FastAPI\|predict(\|load_model\|TorchServe\|BentoML\|mlflow.pyfunc\|triton" . --include="*.py" | head -25
Classify as batch (scheduled scoring to a table/file), online (sync request/response), or streaming.
grep -rn "model_version\|registry\|load_model\|stage=\|@latest\|MODEL_URI\|artifact" . --include="*.py" --include="*.yaml" | head -25
Confirm the served version is pinned and recorded (registry stage or explicit URI), not an implicit "latest" that changes silently.
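The pinning check above could also be enforced at service startup rather than only in review. The following is a minimal sketch, assuming that a floating reference is spelled as a trailing `latest` after `/`, `@`, or `:`; the helper names `is_floating` and `require_pinned` are hypothetical and not part of any registry API.

```python
import re

# Assumed spellings of a floating reference: ".../latest", "...@latest", "...:latest".
# Extend the pattern for whatever your registry actually uses.
_FLOATING = re.compile(r"[/@:]latest$", re.IGNORECASE)


def is_floating(uri: str) -> bool:
    """Return True when the URI ends in a floating 'latest' reference."""
    return _FLOATING.search(uri.strip()) is not None


def require_pinned(uri: str) -> str:
    """Fail fast at startup instead of silently serving a moving target."""
    if not uri.strip() or is_floating(uri):
        raise ValueError(f"model URI is not pinned: {uri!r}")
    return uri
```

Calling `require_pinned(MODEL_URI)` before loading the model turns a silent version change into a startup error; a registry stage or an explicit version number passes the check.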
grep -rn "transform\|preprocess\|StandardScaler\|tokenizer\|feature_\|pipeline\|joblib.load\|pickle.load" . --include="*.py" | head -25
The serving path must apply the identical preprocessing artifact used at training (same fitted scaler/encoder), ideally a saved pipeline — not a hand-rewritten transform that can drift.
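Where a hand-rewritten serving transform cannot be avoided, a parity test can at least detect when it diverges from the training one. The sketch below assumes both transforms are plain callables that map a raw row to a sequence of floats; `training_transform` and `serving_transform` are illustrative stand-ins, with the serving copy deliberately using a stale mean.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Row = Sequence[float]
Transform = Callable[[Row], Sequence[float]]


def parity_mismatches(
    train_tf: Transform, serve_tf: Transform, rows: Iterable[Row], tol: float = 1e-9
) -> List[Tuple[int, int]]:
    """Return (row_index, feature_index) pairs where the two transforms disagree."""
    mismatches = []
    for i, row in enumerate(rows):
        t, s = list(train_tf(row)), list(serve_tf(row))
        if len(t) != len(s):
            mismatches.append((i, -1))  # -1 marks a differing feature count
            continue
        for j, (x, y) in enumerate(zip(t, s)):
            if not math.isclose(x, y, rel_tol=tol, abs_tol=tol):
                mismatches.append((i, j))
    return mismatches


# Illustrative transforms only: the serving copy uses a stale mean for feature 1.
def training_transform(row: Row) -> List[float]:
    return [(row[0] - 10.0) / 2.0, (row[1] - 50.0) / 5.0]


def serving_transform(row: Row) -> List[float]:
    return [(row[0] - 10.0) / 2.0, (row[1] - 48.0) / 5.0]
```

Running this over a fixed sample of raw training rows in CI gives a concrete `file:line`-style finding (which feature diverged) instead of a vague suspicion of skew.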
grep -rn "batch_size\|timeout\|max_workers\|torch.no_grad\|eval()\|@torch.inference_mode\|onnx\|half()\|to(device)" . --include="*.py" | head -25
Confirm inference uses no-grad/eval mode, batches where possible, sets request timeouts, and bounds concurrency.
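For the timeout and concurrency part of this check, one possible shape is a thin async wrapper around a blocking predict function. This is a sketch under assumptions: `BoundedPredictor` is a hypothetical name, the defaults are placeholders rather than recommended values, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio


class BoundedPredictor:
    """Wrap a blocking predict function with a concurrency cap and a per-request timeout."""

    def __init__(self, predict_fn, max_concurrency: int = 4, timeout_s: float = 0.5):
        self._predict_fn = predict_fn
        self._sem = asyncio.Semaphore(max_concurrency)
        self._timeout_s = timeout_s

    async def predict(self, features):
        async with self._sem:  # at most max_concurrency awaited calls at once
            # Run the blocking call in a worker thread so the event loop stays responsive.
            call = asyncio.to_thread(self._predict_fn, features)
            # On timeout the caller gets asyncio.TimeoutError; the worker thread
            # itself is not interrupted and finishes in the background.
            return await asyncio.wait_for(call, timeout=self._timeout_s)
```

Because a timed-out thread keeps running, the semaphore bounds awaited requests rather than threads; a model server with real cancellation needs support from the inference runtime itself.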
grep -rn "canary\|shadow\|rollback\|previous_version\|drift\|evidently\|prometheus\|log_prediction\|monitor" . --include="*.py" --include="*.yaml" | head -25
Confirm there is a documented rollback to the prior version and monitoring for input drift, prediction distribution, latency, and error rate.
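A canary rollout with an instant rollback can be as small as a deterministic router in front of two loaded versions. The sketch below is an assumption about how that might look, not something taken from a specific serving framework; `route_version` is a hypothetical helper, and the version labels in the usage note reuse the `churn:3` / `churn:2` naming from this document.

```python
import hashlib


def route_version(request_id: str, stable: str, canary: str, canary_pct: float) -> str:
    """Deterministically send roughly canary_pct percent of request ids to the canary."""
    if not 0.0 <= canary_pct <= 100.0:
        raise ValueError("canary_pct must be within [0, 100]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return canary if bucket < canary_pct * 100 else stable
```

With `route_version(req_id, "churn:2", "churn:3", 10.0)`, about one request in ten reaches the new version and the same id always lands on the same side, which keeps per-version metrics comparable. Setting `canary_pct` to `0` is the rollback: no redeploy, only a config change.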
no_grad/inference_mode; batching is used where throughput matters.
Online endpoint with input validation, pinned version, and parity (FastAPI):
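import logging

import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, confloat, conint

logger = logging.getLogger("predictions")
app = FastAPI()
MODEL_VERSION = "churn:3"  # pinned, not "latest"
pipeline = joblib.load("artifacts/churn_v3.joblib")  # same fitted pipeline as training


class Features(BaseModel):  # validates schema + ranges
    tenure: conint(ge=0, le=120)
    monthly_charge: confloat(ge=0, le=10_000)


def log_prediction(version, features, proba):
    # Placeholder sink, assumed here so the example runs; swap in your prediction store.
    logger.info("version=%s features=%s proba=%.4f", version, features, proba)


@app.post("/predict")
def predict(f: Features):
    try:
        proba = float(pipeline.predict_proba([[f.tenure, f.monthly_charge]])[0][1])
    except Exception as e:
        # FastAPI already rejects schema errors with 422, so a failure here is server-side.
        raise HTTPException(status_code=500, detail=f"inference error: {e}")
    # Logged for drift monitoring; on Pydantic v2 prefer f.model_dump() over f.dict().
    log_prediction(MODEL_VERSION, f.dict(), proba)
    return {"model_version": MODEL_VERSION, "churn_proba": round(proba, 4)}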
Efficient batch scoring (PyTorch):
import torch


@torch.inference_mode()  # no autograd overhead
def score_batches(model, loader, device):
    model.eval()  # inference behavior for dropout and batch norm
    out = []
    for batch in loader:  # batched, not row-by-row
        preds = model(batch.to(device)).cpu()
        out.append(preds)
    return torch.cat(out)
Input drift check with a population stability index (Python):
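import numpy as np


def psi(expected, actual, bins=10):
    """Population Stability Index. >0.2 typically signals meaningful drift."""
    # Bin edges come from quantiles of the expected (training) sample.
    qs = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip serving values into the training range so out-of-range inputs
    # count toward the outermost bins instead of being dropped by np.histogram.
    actual = np.clip(actual, qs[0], qs[-1])
    e = np.histogram(expected, qs)[0] / len(expected) + 1e-6
    a = np.histogram(actual, qs)[0] / len(actual) + 1e-6
    return float(np.sum((a - e) * np.log(a / e)))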
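MLflow model registry promote/rollback (CLI and Python client):
# Serve whichever version currently holds the Production stage
# (older MLflow releases use --no-conda instead of --env-manager local)
mlflow models serve -m "models:/churn/Production" -p 5001 --env-manager local
# Roll back: re-point the Production stage to the prior version.
# The mlflow CLI has no stage-transition subcommand, so call the Python client,
# then restart the serving process so it reloads the stage.
python -c "from mlflow.tracking import MlflowClient; MlflowClient().transition_model_version_stage(name='churn', version='2', stage='Production', archive_existing_versions=True)"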
models:/name/latest so the served model changes silently on every new registration.
Produce a structured report with:
file:line.
file:line | issue | risk | concrete fix.
npx claudepluginhub fluxonlab/skillry --plugin skillry-data-ml-ai-engineering