From ai-skills
Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow — including when the user describes a prod symptom without asking to "analyze" — to analyze the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking and diagnosing issues, supporting GCP, Azure, and AWS via the `cloud-platforms` skill and applying the SRE or DevOps role.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ai-skills:analyze-prod service name, symptom, or incident descriptionservice name, symptom, or incident descriptionThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.
Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.
Cloud platform detection: Read CLAUDE.md to identify the cloud platform, then load the matching reference module (GCP / Azure / AWS) from @cloud-platforms.
⚠️ SAFETY: READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.
Ask the user:
If invoked as part of a bugfix / incident flow, extract context from the parent conversation instead.
Select role by problem type:
| Problem Type | Primary Role |
|---|---|
| Pod crashes, restarts, health-check failures | Agent(sre-engineer) |
| High latency, error-rate spikes, SLO burn | Agent(sre-engineer) |
| Managed DB issues (connections, replication, CPU) | Agent(sre-engineer) |
| Networking, ingress, DNS, connectivity | Agent(sre-engineer) + Agent(devops-engineer) |
| Deployment failures, rollback needed | Agent(devops-engineer) |
| Terraform / infra drift, resource config | Agent(devops-engineer) |
| Application errors in logs | Stack-specific (Agent(java-engineer) / Agent(python-engineer) / Agent(frontend-engineer)) |
| General / unclear | Agent(sre-engineer) |
Announce the applied role(s). For P1/P2 incidents, always apply Agent(sre-engineer).
Detect platform from CLAUDE.md and verify authentication via @cloud-platforms:
gcloud config get-value project + gcloud auth listaz account show + az aks listaws sts get-caller-identity + aws eks list-clustersConfirm the active project / subscription / account matches the user's target.
// turbo
kubectl config current-context
kubectl cluster-info
Run the platform-specific cluster list command from @cloud-platforms to verify cluster status.
Record: Cluster name, location, version, node count, current kubectl context.
Run the following read-only diagnostics. Present results as a structured summary.
// turbo
kubectl get nodes -o wide
kubectl top nodes
Flag: NotReady nodes, CPU/memory >80%, version skew.
For the affected namespace (or all namespaces if unspecified):
// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
Flag: CrashLoopBackOff / Error / Pending / ImagePullBackOff, restart count >3 in last hour, replica count mismatch (kubectl get deployments).
kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>
Look for: OOMKilled (exit 137), app exceptions, connection errors, startup failures, readiness probe failures.
// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>
Flag: Pods near memory limits, HPA at max replicas, CPU throttling.
DB diagnostic commands per cloud — see @cloud-platforms. Key metrics: CPU / memory / disk utilization, active vs max connections, replication lag (HA / read replicas), failed connection count.
// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.
// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal
Record: Warning / Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.
Observability methodology (Golden Signals / RED / USE / Distributed Tracing) — see @observability-methods. Pick a named method per problem class, then cross-reference SLI metrics, error-budget burn, and active alerts against that method's signals.
Telemetry stack patterns + per-vendor queries — see @telemetry-stacks. Identify the stack from CLAUDE.md, helm charts, prometheus-operator CRDs, or the OTel collector config, then query directly using the vendor patterns documented there.
Using the applied role's expertise:
Agent(sre-engineer) is active): error budget consumed, burn rate.<common_prod_issues>
Structure the diagnosis:
## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s: [version] | Managed DB: [instance] ([state], [tier])
## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]% | Burn rate: [Nx] | Time to exhaustion: [duration]
## Findings
### [Issue 1: title]
- **Symptom / Evidence / Root cause / Severity (P1–P4) / Blast radius**
## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]
## Environment Health: [HEALTHY | DEGRADED | OUTAGE]
⚠️ All production mutations require explicit user approval.
/analyze-local), deploy via normal CI/CD.Agent(devops-engineer) — never direct apply.After any fix, re-run relevant Step 4 commands to verify resolution.
/bugfix (production environment diagnostics)Agent(sre-engineer), Agent(devops-engineer)@cloud-platforms (platform-specific CLI commands, managed service diagnostics), @observability-methods (Golden Signals / RED / USE / Tracing), @telemetry-stacks (Prometheus / Datadog / Honeycomb / New Relic / Sentry / OTel queries)npx claudepluginhub alex-voloshin-dev/ai-skills --plugin ai-skillsProvides CDSS development patterns for drug interaction checking, dose validation, clinical scoring (NEWS2, qSOFA), and alert classification integrated into EMR workflows.