From sre-skills
Investigates live Kubernetes incidents: anchors timeline, bisects recent changes (rollouts, ConfigMaps, RBAC, HPA), classifies failure paths (OOM, DNS, cascading), and proposes mitigations.
Slash command: /sre-skills:kubectl-investigator
Methodology skill for investigating a live or recent incident on **Kubernetes**. Produces a timeline, a ranked set of hypotheses, a blast-radius estimate, and a recommended mitigation. Hands off cleanly to `postmortem-author` once the incident is mitigated.
Scope: workloads running on Kubernetes (Deployments, StatefulSets, DaemonSets, Jobs/CronJobs) and the cluster primitives around them (Services, Ingress, CoreDNS, ConfigMaps/Secrets, RBAC, HPA, nodes). External dependencies (third-party APIs, partner TLS endpoints, managed databases) are in scope only as seen from a Kubernetes workload — the methodology investigates the cluster-side symptom and the in-cluster change surface.
- PrometheusRule / Alertmanager alert just fired on a workload and the agent needs to triage before paging a human.
- kubectl rollout / Helm release / Argo CD sync went out in the last hour and a metric moved; need to know whether they are linked.
- OOMKilled, or Pending, or customer impact is reported with no alert yet; need to find the failing surface.

The order matters. Skipping a step produces confident wrong answers.
Lock two timestamps before doing anything else:
- OOMKilled event" timestamp just because one exists; the alert fire time is the agreed-upon coordination point for the incident.
- OOMKilled event, first SERVFAIL, first error-rate inflection), and mark T0 as ambiguous (see below).

Every later signal is filtered to [T0 - 15min, Tnow]. The 15-minute lead-in catches changes that landed just before the symptom surfaced (a rollout's pods take time to roll, an HPA scale-down takes time to bite).
If T0 is ambiguous (operator-triggered with no alert, or "slow all morning"-class reports), the methodology's recommended mitigation in step 6 must begin with "re-run the investigation with a widened window" before any irreversible action. The change identified within the original narrow window is likely incomplete; the actual causal change may sit outside it. Do not silently round and do not skip the re-run step.
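The window rule above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation: the `Window` type, the `investigation_window` helper, and the sample timestamps are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LEAD_IN = timedelta(minutes=15)  # the 15-minute lead-in described in step 1


@dataclass(frozen=True)
class Window:
    start: datetime
    end: datetime
    t0_ambiguous: bool  # carried along so step 6 can force the widened re-run

    def contains(self, ts: datetime) -> bool:
        return self.start <= ts <= self.end


def investigation_window(t0: datetime, tnow: datetime,
                         t0_ambiguous: bool = False) -> Window:
    """Build [T0 - 15min, Tnow]; hypothetical helper, not part of the skill."""
    if tnow < t0:
        raise ValueError("Tnow must not precede T0")
    return Window(start=t0 - LEAD_IN, end=tnow, t0_ambiguous=t0_ambiguous)


# Hypothetical incident: alert fired at 12:00 UTC, investigation starts 12:40.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tnow = datetime(2024, 1, 1, 12, 40, tzinfo=timezone.utc)
window = investigation_window(t0, tnow)
```

Keeping the ambiguity flag on the window object is one way to make sure it cannot be silently dropped between step 1 and step 6.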
Pull every change event that overlaps the window. On Kubernetes the change surface is:
- kubectl rollout / kubectl apply / kubectl set image, new image tags, new ReplicaSets, Helm releases, Argo CD / Flux syncs.
- requests/limits edits, HPA / VPA changes, PodDisruptionBudget edits, PV/PVC/StorageClass changes.
- Corefile ConfigMap edits, Ingress/NetworkPolicy changes, feature-flag flips.

If the window has zero change events, treat it as a strong signal in itself: the failure is likely external (upstream provider, certificate expiry, DNS, capacity drift from organic growth) rather than a self-inflicted regression.
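As a rough sketch of the overlap filter, assuming each change has already been normalized into a start and finish time (the `ChangeEvent` shape, the helper names, and the sample events are hypothetical, not the skill's fixture schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeEvent:
    kind: str        # e.g. "rollout", "configmap" (illustrative labels)
    target: str
    started: datetime
    finished: datetime


def changes_in_window(events, start, end):
    """Keep every change whose [started, finished] interval overlaps the window."""
    hits = [e for e in events if e.started <= end and e.finished >= start]
    return sorted(hits, key=lambda e: e.started)


def change_surface_summary(events, start, end):
    hits = changes_in_window(events, start, end)
    if not hits:
        # Zero changes is itself a signal: look outside the cluster first.
        return "zero changes in window: suspect an external cause"
    return f"{len(hits)} change(s) in window"


utc = timezone.utc
start = datetime(2024, 1, 1, 11, 45, tzinfo=utc)
end = datetime(2024, 1, 1, 12, 40, tzinfo=utc)
events = [
    # Started before the window but was still rolling inside it: counts.
    ChangeEvent("rollout", "deploy/payments",
                datetime(2024, 1, 1, 11, 40, tzinfo=utc),
                datetime(2024, 1, 1, 11, 50, tzinfo=utc)),
    # Landed hours earlier: outside the window.
    ChangeEvent("configmap", "kube-system/coredns",
                datetime(2024, 1, 1, 9, 0, tzinfo=utc),
                datetime(2024, 1, 1, 9, 0, tzinfo=utc)),
]
print(change_surface_summary(events, start, end))
```

Testing for interval overlap rather than start time alone is a deliberate choice here: a rollout that began just before the window can still be replacing pods inside it.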
Match the failure shape to one of these four canonical paths first. They cover the majority of Kubernetes incidents; only branch out once they are ruled out.
| Path | Tell-tale signals | Confirming evidence |
|---|---|---|
| OOM | Container restart count climbing, CrashLoopBackOff, RSS / working-set at the container memory limit, OOMKilled pod events, retry storm from upstream | Container exit code 137, reason: OOMKilled in pod events / kubectl describe pod, working-set metric at or above resources.limits.memory at T0, a recent rollout that increased per-pod memory footprint |
| DNS | Connection failures with NXDOMAIN / SERVFAIL in logs, getaddrinfo / no such host errors, sudden latency on in-cluster Service calls, *.svc.cluster.local resolution failing while external hosts resolve | CoreDNS error / SERVFAIL counts elevated, a recent change to the CoreDNS Corefile ConfigMap, kube-dns/CoreDNS pod restarts, ndots/search-domain or NetworkPolicy change in the window |
| Cascading-failure | One in-cluster dependency degrades, retry counts spike across callers, connection pools / thread pools / sidecar (Envoy) circuits saturate, queue depth grows | Latency increases hop-by-hop toward the root Service, retry-budget metrics, circuit-breaker state changes, 2nd-order Deployments start failing, upstream Pod Unhealthy / readiness-probe failures |
| Deploy-correlator | Metric breaks within 5 minutes of a rollout on the failing surface, only pods from the new ReplicaSet show the symptom | Canary / blue-green or rolling-update split shows old-RS healthy / new-RS failing, kubectl rollout undo restores the metric, the rollout diff touches the failing code path |
If the failure does not match any of the four, classify as "outside reference paths" and document why. Outside-reference-paths means the methodology has no reference path for the failure shape, so its confidence in the root cause is low and escalation to a human is mandatory (step 6). It does not mean the agent does nothing: a pre-approved safe mitigation (traffic-shift to a healthy peer, feature-flag-off) is still recommended as the top action when available, with the root-cause investigation escalated in parallel. See step 6 for the exact ordering.
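To make one row of the table concrete, the deploy-correlator checks can be sketched as a single predicate. This is an illustrative simplification, not the skill's classifier: the function name and its parameters are hypothetical, and the three conditions are taken from the table (break within 5 minutes, only new-ReplicaSet pods failing, diff touching the failing code path).

```python
from datetime import datetime, timedelta, timezone

CORRELATION_LIMIT = timedelta(minutes=5)  # "within 5 minutes of a rollout"


def looks_deploy_correlated(rollout_at, metric_break_at,
                            failing_replicasets, new_replicaset,
                            diff_touches_failing_path):
    """Return True only when all three deploy-correlator conditions hold."""
    delay = metric_break_at - rollout_at
    breaks_soon_after = timedelta(0) <= delay <= CORRELATION_LIMIT
    only_new_rs_fails = (bool(failing_replicasets)
                         and set(failing_replicasets) == {new_replicaset})
    # Timing alone is not enough: without the diff check, a rollout that merely
    # coincides with another change would be blamed for it.
    return breaks_soon_after and only_new_rs_fails and diff_touches_failing_path


# Hypothetical values for illustration.
rollout_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
break_at = rollout_at + timedelta(minutes=3)
print(looks_deploy_correlated(rollout_at, break_at,
                              ["payments-7d9f"], "payments-7d9f", True))
```

Requiring the diff check alongside the timing check keeps a time coincidence from being read as causation.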
Never declare a hypothesis on one signal. Require at least three of the following, drawn from independent sources:
- kubectl get events, kubelet: OOMKilled, BackOff, Unhealthy, FailedScheduling).

Two signals from the same source (e.g. two log lines) count as one. The independence requirement is the guard against confirmation bias.
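The independence rule amounts to counting distinct sources rather than individual signals. A minimal sketch, assuming each signal is tagged with the source it came from (the `Signal` type and the source labels are hypothetical):

```python
from dataclasses import dataclass

MIN_INDEPENDENT_SOURCES = 3  # "at least three ... drawn from independent sources"


@dataclass(frozen=True)
class Signal:
    source: str  # illustrative labels such as "logs", "metrics", "pod_events"
    detail: str


def independent_source_count(signals):
    """Two signals from the same source collapse into one."""
    return len({s.source for s in signals})


def hypothesis_supported(signals):
    return independent_source_count(signals) >= MIN_INDEPENDENT_SOURCES


signals = [
    Signal("logs", "getaddrinfo failure in caller"),
    Signal("logs", "second getaddrinfo failure"),   # same source, counts once
    Signal("metrics", "SERVFAIL count elevated"),
]
print(independent_source_count(signals), hypothesis_supported(signals))
```

In this sample three signals yield only two independent sources, so the hypothesis would not yet be declared.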
Split aggregate signals before trusting them. Error rate, latency, and saturation are usually reported as a single number across every region, cluster, AZ, shard, or canary/stable split. Before classifying, break each aggregate down along these dimensions. A per-dimension asymmetry — one region failing while its peer is healthy, one shard hot while the rest are flat — is a first-class diagnostic signal, and aggregate metrics actively hide it (a 25% failure in one of two equally-sized regions shows up as a moderate ~12% aggregate that matches no clean reference path).
When the failing and healthy slices run the same image tag / same code, a code-regression path (OOM, deploy-correlator) is ruled out by construction: identical code cannot fail in one slice and not the other. The cause is environmental — config/GitOps drift, a stale Service reference, per-region capacity, an external dependency reachable from only one slice. A confirmed asymmetry short-circuits the four-path search: stop trying to fit OOM/DNS/cascade/deploy-correlator on the aggregate, classify "outside reference paths" with a regional-asymmetry (or shard/AZ-asymmetry) reason, and move to step 5. Continuing to hunt for a reference-path match on aggregate signals after an asymmetry is detected wastes the investigation and is the single most common way this step runs long.
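The arithmetic behind the aggregate-hiding effect, and the same-image shortcut, can be worked through in a short sketch. The slice names, request counts, and image tags below are hypothetical; only the 25% and roughly 12% figures come from the text.

```python
def aggregate_error_rate(slices):
    """slices maps a slice name to (request_count, error_rate)."""
    total_requests = sum(requests for requests, _ in slices.values())
    total_errors = sum(requests * rate for requests, rate in slices.values())
    return total_errors / total_requests


def per_slice_spread(slices):
    """Return (worst slice, best slice, gap between their error rates)."""
    rates = {name: rate for name, (_, rate) in slices.items()}
    worst = max(rates, key=rates.get)
    best = min(rates, key=rates.get)
    return worst, best, rates[worst] - rates[best]


def code_regression_ruled_out(image_by_slice, failing, healthy):
    """Identical image in the failing and healthy slice points at the environment."""
    return image_by_slice[failing] == image_by_slice[healthy]


# Two equally sized regions, one failing at 25%, the other healthy.
regions = {"region-a": (1000, 0.25), "region-b": (1000, 0.0)}
print(aggregate_error_rate(regions))   # 0.125
print(per_slice_spread(regions))       # ('region-a', 'region-b', 0.25)

images = {"region-a": "payments:1.4.2", "region-b": "payments:1.4.2"}
print(code_regression_ruled_out(images, "region-a", "region-b"))  # True
```

The aggregate of 12.5% looks like a moderate, shapeless degradation, while the per-slice view shows one region fully accounting for it.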
Before recommending action, estimate:
A wrong mitigation that touches more surface than the incident itself is worse than the incident.
Mitigation comes first. Root cause comes after the bleeding stops.
Hard constraints, in order. Check these before ranking the standard actions below:
If the classification from step 3 is "outside reference paths", escalation to a human is mandatory, but it is not automatically the top action. Two mitigations are pre-approved as safe because they are reversible and contained, and when one of them is available it becomes the top recommended action: traffic-shift to a healthy peer, or feature-flag-off.
Every other option — contacting an external provider, irreversible config/RBAC/state changes, anything touching the failing slice directly — is surfaced as an alternative for the human to approve, not executed by the agent. The principle (from FAILURE_MODES M1): outside-reference-paths means low confidence in root cause, so escalate the root cause; it does not forbid the safe, reversible mitigation that an on-call would reach for first.
If T0 was flagged as ambiguous in step 1, the top recommended action is "re-run the investigation with a widened window". Only after that re-run identifies a fuller change surface should any irreversible mitigation (rollout undo, RBAC change, cluster/infra rollback) be recommended.
If the implicated change has bundle_size > 1 (multiple changes shipped in one rollout), rollout undo remains the top recommendation but requires explicit human approval before execution. Surface the asymmetry explicitly: "the rollback reverts N changes when the incident affects only K of them".
If the classification is "cascading-failure", the top action is to break the amplification loop at its source, not to undo a rollout. A pure cascade typically has no rollout in the window (the trigger is a degraded dependency, not a deploy), so there is nothing to revert. Recommend, in order: open the circuit breaker on / shed load from the degraded dependency itself (the root of the cascade), then cap or disable the retry budget at the callers driving the retry storm. Shedding the callers' retries alone treats the symptom (the amplification) while leaving the degraded dependency saturated; opening the circuit at the dependency stops the loop at its origin and lets the dependency recover.
Standard mitigation order (applies when the constraints above do not fire):
- kubectl rollout undo the workload identified in step 2, if one rollout is clearly implicated and reversible (or revert the implicated ConfigMap / RBAC change).
- kubectl scale / raise the HPA ceiling / raise resources.limits), if the path is capacity-bound and not regression-bound.
- kubectl delete pod to force a fresh restart, kill a stuck Job) as a last resort, with explicit acknowledgement that it does not address cause: pods will recreate from the same broken spec.

If no safe mitigation exists even after applying the above, surface that explicitly and escalate.
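One way to read the hard constraints is as an ordered chain of checks that runs before the standard order. The sketch below is an interpretation under stated assumptions, not the skill's implementation: the `Findings` and `Recommendation` types and the classification strings are hypothetical, the checks are evaluated in the order the constraints are listed, and escalation is assumed to become the top action when no pre-approved mitigation is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Findings:
    classification: str            # e.g. "cascading-failure" (illustrative labels)
    t0_ambiguous: bool = False
    bundle_size: int = 1
    safe_mitigation: Optional[str] = None  # e.g. "feature-flag-off"


@dataclass(frozen=True)
class Recommendation:
    top_action: str
    escalate: bool = False
    needs_human_approval: bool = False


def recommend(f: Findings) -> Recommendation:
    if f.classification == "outside-reference-paths":
        # Escalation is mandatory; a safe, reversible mitigation still goes first.
        top = f.safe_mitigation or "escalate to a human"
        return Recommendation(top, escalate=True)
    if f.t0_ambiguous:
        return Recommendation("re-run the investigation with a widened window")
    if f.bundle_size > 1:
        return Recommendation("kubectl rollout undo", needs_human_approval=True)
    if f.classification == "cascading-failure":
        return Recommendation(
            "open the circuit breaker on the degraded dependency, "
            "then cap caller retry budgets")
    return Recommendation("follow the standard mitigation order")


print(recommend(Findings("outside-reference-paths",
                         safe_mitigation="feature-flag-off")))
```

Encoding the constraints as early returns keeps the precedence explicit, so a later rule can never override an earlier one by accident.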
Produce a structured handoff for postmortem-author. All four elements (timeline, ranked hypotheses, mitigation, and open questions) are mandatory and must appear as labelled sections, even when an element is empty: write "Open questions: none identified" rather than leaving the section out, because a silently missing section reads as "investigation incomplete" to the next responder.
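A small sketch of how the "never omit a section" rule could be enforced when rendering the handoff. The `render_handoff` helper and its plain-text layout are hypothetical; the four section names follow the elements the skill requires (timeline, ranked hypotheses, mitigation, open questions).

```python
SECTIONS = ("Timeline", "Ranked hypotheses", "Mitigation", "Open questions")


def render_handoff(content):
    """Render all four labelled sections; empty ones say so explicitly."""
    lines = []
    for title in SECTIONS:
        items = content.get(title) or []
        # An empty section is written out as "none identified", never dropped.
        lines.append(f"{title}:" if items else f"{title}: none identified")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


handoff = render_handoff({
    "Timeline": ["T0: alert fired", "T0+6min: rollout undone"],
    "Ranked hypotheses": ["deploy-correlator on the new ReplicaSet"],
    "Mitigation": ["kubectl rollout undo"],
})
print(handoff)
```

Iterating over a fixed tuple of section names, rather than over whatever keys the caller supplied, is what guarantees the empty "Open questions" section still appears.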
The agent's final message in any invocation must include:
- T0 = ..., Tnow = ....
- postmortem-author, containing all four labelled sections from step 7 (timeline, ranked hypotheses, mitigation, and open questions). Do not collapse or omit any of them; an absent "open questions" section is treated as an incomplete handoff.

Eleven end-to-end examples are committed under examples/, each with fixtures and a runnable replay test.
Reference paths (one canonical example per path):
- examples/01-oom-cascade.md: OOM in a payments Deployment triggering a retry storm from the API gateway.
- examples/02-dns-resolution-failure.md: CoreDNS Corefile ConfigMap misconfiguration causing intermittent SERVFAIL for an internal Service.
- examples/03-cascading-failure-retry-storm.md: pure cascade from an upstream DB query-plan slowdown; no rollout in window.
- examples/04-deploy-correlator-serialization.md: pure deploy-correlator regression (serialization change breaks downstream parsers).

Escalation cases (exercise the FAILURE_MODES.md rules):
- examples/05-outside-reference-paths-third-party-rate-limit.md: a third-party API rate-limits a workload; the methodology escalates rather than force-fitting one of the four paths (M1).
- examples/06-ambiguous-t0-slow-burn.md: slow-burn memory leak where T0 is genuinely ambiguous; escalation (M2) recommends re-running with a widened window.
- examples/07-blast-radius-asymmetric-revert.md: a rollout bundling six unrelated changes; rollout undo is the top mitigation but escalates (M3) because the rollback blast radius exceeds the incident.
- examples/08-deploy-correlator-confirmation-bias.md: a rollout and an RBAC change collide in time; the methodology rejects the deploy-correlator classification (M4 guard) because the rollout diff does not touch the failing surface.

Edge / boundary cases:
- examples/09-zero-changes-external-cert-expiry.md: zero changes in window, failure is an external partner TLS certificate expiry seen from a cluster workload.
- examples/10-multi-region-asymmetry.md: same image deployed to two clusters/regions, one fails; the methodology surfaces the per-region asymmetry as a first-class signal.
- examples/11-capacity-bound-organic-growth.md: organic traffic growth saturates capacity; the methodology recommends scaling (HPA) instead of rollout undo.

The examples mirror the seven methodology steps so contributors can see the methodology in motion, not just described.
Every example has a replay test in tests/ that runs the methodology against committed fixtures, with no external credentials (no live cluster needed). Run from the skill directory:
for t in tests/replay_*.py; do python "$t" || exit 1; done
The 11 tests cover the four reference paths, the FAILURE_MODES.md escalation rules (M1, M2, M3, M4), and the edge cases (zero changes, multi-region asymmetry, capacity saturation). Tests exit non-zero if the methodology produces the wrong classification, mitigation, or escalation against known-good fixtures. See tests/README.md for the fixture schema and how to add a new replay test.
This skill is wrong in predictable ways. Read FAILURE_MODES.md before relying on it for production triage.
The methodology above runs end-to-end with whatever telemetry, rollout/event source, and RBAC audit log you already have for your cluster (kubectl, the Kubernetes events API, kube-state-metrics, Prometheus). No Anyshift dependency.
The Anyshift MCP can act as a context primer for step 2 (change surface) by exposing a versioned resource graph that links rollouts, RBAC changes, and cluster/infrastructure changes to the specific Kubernetes resources implicated in the incident. See the per-skill README for the measured "with vs without" delta on the OOM and DNS examples (published once the integration has been exercised against the replay tests).
npx claudepluginhub anyshift-io/claude-plugins --plugin sre-skills