Skill

ml-canary

This skill should be used after a model deployment — when the user says "canary", "watch the rollout", "monitor the new model", "is the deploy healthy", "post-deploy check", or a release produced a CANARY_PLAN.md that now needs executing. Audits production behavior against the model card's stated expectations.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/mlforge:ml-canary

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Post-deploy watch. The model card made promises; the canary checks them. This is the boomerang — plan vs reality, made explicit and written down.

SKILL.md

50 lines · ~752 tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 11, 2026

ML canary

Post-deploy watch. The model card made promises; the canary checks them. This is the boomerang — plan vs reality, made explicit and written down.

Inputs

ml/releases/<version>/MODEL_CARD.md ("Expected production behavior" section) and CANARY_PLAN.md. No model card → run against best-known expectations, and note that the release skipped ml-ship (a process leak for the next retro).

The checks, per ramp stage

Prediction distribution vs card: mean/p10/p90 of live scores against the card's stated expectations. Shifted distribution with stable inputs = serving-path suspect → ml-production-debug step 3 (don't wait for the online metric to confirm what the score histogram already shows).
Feature health: per-feature null/default/out-of-range rates vs the card's assumptions. A feature silently going 100% default flattens output with zero errors thrown.
System: latency p50/p99 vs budget, error rate, throughput at current traffic share.
Flip spot-check: sample entities scored by both old and new models; flip rate in line with the offline flip analysis? Offline said 8%, live shows 20% → stop the ramp.
Online metric at the pre-registered horizon: only at the horizon and maturation window the experiment plan committed to. No peeking-driven decisions; sequential bounds if the plan prescribed them.
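
The first, second, and fourth checks can be run as plain comparisons once the live scores and feature values are in hand. The following is a minimal sketch under stated assumptions, not the skill's own implementation: the function names, the absolute `tolerance`, and the `max_ratio` threshold are all hypothetical, and the real thresholds should come from the model card and CANARY_PLAN.md.

```python
from statistics import mean, quantiles


def score_summary(scores):
    """Return mean, p10, and p90 of a sequence of live scores."""
    # quantiles(n=10) returns the 9 decile cut points; index 0 is p10, index 8 is p90.
    deciles = quantiles(scores, n=10, method="inclusive")
    return {"mean": mean(scores), "p10": deciles[0], "p90": deciles[8]}


def distribution_check(live, expected, tolerance):
    """Per statistic: True if the live value is within `tolerance` of the card's value.

    `tolerance` is an assumed absolute band; the card may define a different rule.
    """
    return {name: abs(live[name] - expected[name]) <= tolerance for name in expected}


def default_rate(values, default=None):
    """Fraction of feature values that are null or equal to the default value."""
    if not values:
        raise ValueError("no feature values to check")
    return sum(v is None or v == default for v in values) / len(values)


def flip_rate(old_decisions, new_decisions):
    """Fraction of entities whose decision differs between the old and new model."""
    if not old_decisions or len(old_decisions) != len(new_decisions):
        raise ValueError("need two non-empty, equal-length decision lists")
    flips = sum(o != n for o, n in zip(old_decisions, new_decisions))
    return flips / len(old_decisions)


def flip_within_offline(live_rate, offline_rate, max_ratio=1.5):
    """True if the live flip rate is at most `max_ratio` times the offline rate.

    `max_ratio` is an assumed threshold, not one the skill prescribes.
    """
    return live_rate <= offline_rate * max_ratio
```

With the figures from the flip spot-check above, `flip_within_offline(0.20, 0.08)` is `False`, which corresponds to stopping the ramp.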

Stage gates

Each ramp stage (shadow → N% → 100%) advances only on its CANARY_PLAN numeric gates. A failed gate triggers the rollback plan — which the card already wrote, with numeric triggers and an owner. Execute it; don't renegotiate it mid-incident.
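
One way to keep gate decisions mechanical is to encode each numeric gate as data and let a small function return the decision. This is a sketch only: the `Gate` structure, the metric names, and the choice to return "hold" when a measurement is missing are assumptions, since the format of CANARY_PLAN.md is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """A single numeric gate. Names and limits here are hypothetical examples."""
    name: str
    limit: float
    higher_is_worse: bool = True  # False for metrics such as throughput


def evaluate_stage(gates, observed):
    """Return (decision, reasons) for one ramp stage.

    A breached gate yields "rollback"; a missing measurement yields "hold"
    (an assumed policy); otherwise the stage may "advance".
    """
    breaches, missing = [], []
    for gate in gates:
        value = observed.get(gate.name)
        if value is None:
            missing.append(f"{gate.name}: no measurement")
            continue
        breached = value > gate.limit if gate.higher_is_worse else value < gate.limit
        if breached:
            breaches.append(f"{gate.name}: observed {value}, limit {gate.limit}")
    if breaches:
        return "rollback", breaches
    if missing:
        return "hold", missing
    return "advance", []


# Illustrative gates; real values come from the release's CANARY_PLAN.md.
example_gates = [
    Gate("latency_p99_ms", 250.0),
    Gate("error_rate", 0.01),
    Gate("throughput_rps", 100.0, higher_is_worse=False),
]
```

Returning the reasons alongside the decision gives the numeric justification that each checkpoint in the report asks for.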

Output — boomerang report

Append to ml/releases/<version>/CANARY_REPORT.md per checkpoint:

## Checkpoint [stage, date]
| Check | Card said | Production says | Verdict |
[score dist | nulls | latency | flips | online metric]
**Decision**: advance / hold / rollback — [numeric reason]
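
A checkpoint in this shape can be rendered and appended with a short helper. The sketch below assumes one table row per check and adds the markdown separator row so the table renders; the function names and the plain hyphen before the reason are choices made for this example, not part of the skill.

```python
from pathlib import Path

ALLOWED_DECISIONS = {"advance", "hold", "rollback"}


def render_checkpoint(stage, date, rows, decision, reason):
    """Render one checkpoint section as markdown text.

    `rows` is an iterable of (check, card_said, production_says, verdict) tuples.
    """
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")
    lines = [
        f"## Checkpoint [{stage}, {date}]",
        "| Check | Card said | Production says | Verdict |",
        "| --- | --- | --- | --- |",
    ]
    for check, card_said, production_says, verdict in rows:
        lines.append(f"| {check} | {card_said} | {production_says} | {verdict} |")
    lines.append(f"**Decision**: {decision} - {reason}")
    return "\n".join(lines) + "\n"


def append_checkpoint(report_path, checkpoint_text):
    """Append a rendered checkpoint to the report, creating the file if needed."""
    path = Path(report_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n" + checkpoint_text)
```

Opening the file in append mode keeps earlier checkpoints untouched, so the report accumulates one section per checkpoint.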

On completion (100% + horizon passed): final verdict in the report, canary_passed (or rolled_back) gate appended to ml/gates.json, outcome logged in experiments/journal.md, and any card-vs-reality miss handed to ml-retro's boomerang audit — systematically wrong cards are a calibration problem, not bad luck.
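
Recording the final gate might look like the sketch below. The schema of ml/gates.json is not described in this skill, so the code assumes a top-level JSON array of objects with `gate`, `version`, and `date` keys; treat that layout, and both function names, as hypothetical and adapt them to the real file.

```python
import json
from pathlib import Path

FINAL_GATES = {"canary_passed", "rolled_back"}


def with_gate(entries, gate, version, date):
    """Return a new list of gate entries with the final canary gate appended.

    The entry layout (gate/version/date) is an assumption, not a documented schema.
    """
    if gate not in FINAL_GATES:
        raise ValueError(f"gate must be one of {sorted(FINAL_GATES)}")
    return list(entries) + [{"gate": gate, "version": version, "date": date}]


def append_gate(gates_path, gate, version, date):
    """Read the gates file (assumed to hold a JSON array), append, and write it back."""
    path = Path(gates_path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    updated = with_gate(entries, gate, version, date)
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
    return updated
```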

Rules

Checks are computed from real logs/metrics when available — ask for the data, run the comparison in code.
Drift alerts inform; only guardrail breaches page. Don't convert the canary into alarm fatigue.
The watch has an end date. A canary that never concludes is monitoring, and belongs to the standing monitors recommended by ml-production-debug.
