From skillry-cloud-and-infrastructure
Use when you need to review or design a deployment topology and rollback strategy — blue-green, canary, and rolling deploys, health gates, automated rollback, zero-downtime cutover, and safe infrastructure change sequencing.
Slash command
/skillry-cloud-and-infrastructure:336-deploy-topology-and-rollback
Review and design how a service is rolled out and rolled back — selecting blue-green, canary, or rolling strategies, wiring health and metric gates that automatically halt or revert a bad release, sequencing infrastructure changes so they are reversible, and achieving zero-downtime cutover including backward-compatible database changes. The goal is that every deploy can be paused and reverted quickly on a clear signal, with no manual heroics and no extended outage. This is a planning and authoring skill: it defines and reviews the rollout mechanism but does not trigger production deploys without human approval. Ground analysis in the project's actual deploy config and traffic-shaping resources.
# Kubernetes strategy / progressive delivery controllers
grep -rn "strategy:\|RollingUpdate\|maxUnavailable\|maxSurge" . --include="*.yaml"
grep -rln "kind: Rollout\|Argo\|Flagger\|canary\|blueGreen" . --include="*.yaml"
# Traffic shaping: weighted target groups, service mesh, DNS/LB weights
grep -rn "weight\|target_group\|listener\|traffic\|VirtualService\|DestinationRule" . --include="*.tf" --include="*.yaml" | head
# Readiness probes feed rollout progress; without them rollouts "succeed" while broken
grep -rn "readinessProbe\|livenessProbe\|minReadySeconds\|progressDeadlineSeconds" . --include="*.yaml"
# Analysis/metric gates for canary (success rate, latency thresholds)
grep -rn "analysis\|successCondition\|metric\|threshold\|prometheus" . --include="*.yaml" | head
# Kubernetes keeps revision history for rollback
grep -rn "revisionHistoryLimit" . --include="*.yaml"
# Documented/automated rollback hooks or pipeline steps
grep -rniE "rollback|revert|undo|previous version|abort" . --include="*.yaml" --include="*.sh" --include="*.md" docs/ 2>/dev/null | head
# Graceful shutdown: preStop hook + terminationGracePeriod so in-flight requests drain
grep -rn "preStop\|terminationGracePeriodSeconds\|lifecycle:" . --include="*.yaml"
# PodDisruptionBudget protects availability during rollout/node drain
grep -rln "kind: PodDisruptionBudget" . --include="*.yaml"
# Look for destructive/blocking migrations that break the old version mid-rollout
grep -rniE "drop column|drop table|rename column|alter .*not null|drop constraint" . --include="*.sql" migrations/ 2>/dev/null | head
# Expand/contract pattern markers
grep -rniE "expand|contract|backfill|add column|nullable" . --include="*.sql" migrations/ 2>/dev/null | head
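The destructive-migration search above can be wrapped in a small read-only helper for use in a review script. This is a minimal sketch: the function name `classify_migration` is hypothetical, and the pattern is the same heuristic as the grep above, so it will also flag some safe statements (for example `NOT NULL` with a default) that a human still needs to judge.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label one migration file as "destructive" or "ok".
# Read-only: it only greps the file, it never touches a database.
classify_migration() {
  local file="$1"
  # Same heuristic pattern as the grep above; case-insensitive.
  if grep -qiE 'drop column|drop table|rename column|alter .*not null|drop constraint' "$file"; then
    echo "destructive"
  else
    echo "ok"
  fi
}
```

A file reported as "destructive" is a candidate for splitting into expand and contract steps that ship in separate releases.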
Order changes so each step is independently revertible: add the new resource, shift a small traffic slice, observe gates, increase, then retire the old resource last. Avoid changes that are hard to undo (DNS TTL flips, deleting the old environment) until the new one is proven.
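The ordering described above can be sketched as a dry-run loop. Everything here is an assumption for illustration: `set_weight` and `gate_ok` are hypothetical stubs that only print or compare values, the weights 10/50/100 are arbitrary, and `GATE_FAIL_AT` exists only to simulate a failing gate. A real implementation would replace the stubs with the project's actual traffic-shaping and metric checks, run under human approval.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a gated, stepwise traffic shift.
# set_weight and gate_ok are hypothetical stubs, not real tooling.
set_weight() { echo "weight=$1"; }                 # stub: would change LB/mesh weight
gate_ok()    { [ "${GATE_FAIL_AT:-}" != "$1" ]; }  # stub: would query health/metrics

shift_traffic() {
  local w
  for w in 10 50 100; do
    set_weight "$w"
    if ! gate_ok "$w"; then
      set_weight 0            # revert: all traffic back to the old version
      echo "aborted at $w"
      return 1
    fi
  done
  echo "promoted"             # only after this point is the old resource retired
}
```

Running `shift_traffic` prints each weight and ends with `promoted`; running `GATE_FAIL_AT=50 shift_traffic` prints `weight=0` and `aborted at 50`, showing that the revert happens before anything irreversible.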
[…] maxUnavailable/maxSurge and progressDeadlineSeconds.
[…] (rollout undo, redeploy previous, flip weight back).
revisionHistoryLimit retains enough history to roll back.
[…] (preStop, terminationGracePeriodSeconds) so connections drain.
PodDisruptionBudget (or equivalent) preserves capacity during the rollout.
# Rolling update with health gating (zero-downtime): review target
apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 120   # rollout is reported as failed if stuck; reverting is left to the pipeline or controller
  revisionHistoryLimit: 10       # keeps revisions for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          lifecycle:
            preStop: { exec: { command: ["sleep", "10"] } }   # drain before exit
# Canary with automatic abort (Argo Rollouts style)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis: { templates: [{ templateName: success-rate }] }   # aborts on regression
        - setWeight: 50
        - pause: { duration: 5m }
# Rollback commands — RUN ONLY WITH HUMAN APPROVAL
kubectl rollout status deploy/api --timeout=120s # observe (safe)
kubectl rollout history deploy/api # inspect revisions (safe)
kubectl rollout undo deploy/api --to-revision=<n> # MUTATING — approval required
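One way to make the approval requirement hard to bypass is to wrap the mutating command. The sketch below is an assumption, not part of the skill: `guarded_undo` and the `APPROVED=yes` convention are hypothetical names, and a real pipeline would more likely use its own manual-approval step.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: refuse the mutating rollback unless a human has
# explicitly set APPROVED=yes for this invocation.
guarded_undo() {
  local target="$1" revision="$2"
  if [ "${APPROVED:-no}" != "yes" ]; then
    echo "refusing: kubectl rollout undo $target --to-revision=$revision needs APPROVED=yes"
    return 1
  fi
  kubectl rollout undo "$target" --to-revision="$revision"
}
```

For example, `guarded_undo deploy/api 3` only prints the refusal, while `APPROVED=yes guarded_undo deploy/api 3` runs the rollback. The read-only `rollout status` and `rollout history` commands need no such guard.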
[…] (maxUnavailable: 100%) on a user-facing service: guaranteed downtime.
[…] (DROP COLUMN, NOT NULL without default) shipped with the code that needs it: old pods crash mid-rollout and rollback is impossible.
Produce a structured report with:
file:line | issue | severity | concrete fix.
[…] (kubectl rollout undo/restart, argo rollouts promote, weight changes, DNS/LB cutover) without explicit human approval.
[…] (rollout status, rollout history, kubectl get, reading config) is safe to run unattended.
[…] ****; do not print them in the report.
npx claudepluginhub fluxonlab/skillry --plugin skillry-cloud-and-infrastructure