Skill

analyze-prod

Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow — including when the user describes a prod symptom without asking to "analyze" — to analyze the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking and diagnosing issues, supporting GCP, Azure, and AWS via the `cloud-platforms` skill and applying the SRE or DevOps role.

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ai-skills:analyze-prod service name, symptom, or incident description

User invocable

Model invocable

Forked subagent

Default effort

Argument hintservice name, symptom, or incident description

Tool Access

This skill is limited to the following tools:

ReadGrepGlobBash

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.

SKILL.md

216 lines · ~2.3k tokens

Stats

LanguagePython

Parent stars1

MaintenanceExcellent

Last CommitMay 23, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Analyze Production Environment

Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.

Cloud platform detection: Read CLAUDE.md to identify the cloud platform, then load the matching reference module (GCP / Azure / AWS) from @cloud-platforms.

⚠️ SAFETY: READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.

1. Clarify the Scope

Ask the user:

Problem / question? (e.g., "pods crashlooping", "high latency on /api/orders", "DB connections exhausted", "general health check")
Affected service(s)? (deployment name, namespace, or "all")
Cloud environment? (project / subscription / account + cluster name)
When did it start? (deploy, config change, traffic spike, or unknown)
Active incident? (P1–P4 severity, or none)

If invoked as part of a bugfix / incident flow, extract context from the parent conversation instead.

2. Apply Appropriate Role

Select role by problem type:

Problem Type	Primary Role
Pod crashes, restarts, health-check failures	`Agent(sre-engineer)`
High latency, error-rate spikes, SLO burn	`Agent(sre-engineer)`
Managed DB issues (connections, replication, CPU)	`Agent(sre-engineer)`
Networking, ingress, DNS, connectivity	`Agent(sre-engineer)` + `Agent(devops-engineer)`
Deployment failures, rollback needed	`Agent(devops-engineer)`
Terraform / infra drift, resource config	`Agent(devops-engineer)`
Application errors in logs	Stack-specific (`Agent(java-engineer)` / `Agent(python-engineer)` / `Agent(frontend-engineer)`)
General / unclear	`Agent(sre-engineer)`

Announce the applied role(s). For P1/P2 incidents, always apply Agent(sre-engineer).

3. Establish Context

3a. Cloud Platform Context

Detect platform from CLAUDE.md and verify authentication via @cloud-platforms:

GCP: gcloud config get-value project + gcloud auth list
Azure: az account show + az aks list
AWS: aws sts get-caller-identity + aws eks list-clusters

Confirm the active project / subscription / account matches the user's target.

3b. Kubernetes Cluster

// turbo
kubectl config current-context
kubectl cluster-info

Run the platform-specific cluster list command from @cloud-platforms to verify cluster status.

Record: Cluster name, location, version, node count, current kubectl context.

4. Collect Production Snapshot

Run the following read-only diagnostics. Present results as a structured summary.

4a. Node Health

// turbo
kubectl get nodes -o wide
kubectl top nodes

Flag: NotReady nodes, CPU/memory >80%, version skew.

4b. Pod Status

For the affected namespace (or all namespaces if unspecified):

// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running

Flag: CrashLoopBackOff / Error / Pending / ImagePullBackOff, restart count >3 in last hour, replica count mismatch (kubectl get deployments).

4c. Pod Details (for problematic pods)

kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>

Look for: OOMKilled (exit 137), app exceptions, connection errors, startup failures, readiness probe failures.

4d. Resource Usage

// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>

Flag: Pods near memory limits, HPA at max replicas, CPU throttling.

4e. Managed Database Health

DB diagnostic commands per cloud — see @cloud-platforms. Key metrics: CPU / memory / disk utilization, active vs max connections, replication lag (HA / read replicas), failed connection count.

4f. Networking and Ingress

// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.

4g. Recent Events

// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal

Record: Warning / Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.

4h. Observability Methodology

Observability methodology (Golden Signals / RED / USE / Distributed Tracing) — see @observability-methods. Pick a named method per problem class, then cross-reference SLI metrics, error-budget burn, and active alerts against that method's signals.

4i. Production Telemetry Surface

Telemetry stack patterns + per-vendor queries — see @telemetry-stacks. Identify the stack from CLAUDE.md, helm charts, prometheus-operator CRDs, or the OTel collector config, then query directly using the vendor patterns documented there.

5. Analyze Findings

Using the applied role's expertise:

Correlate the user's problem with collected data.
Identify root cause using the role's debugging methodology.
Assess SLO impact (when Agent(sre-engineer) is active): error budget consumed, burn rate.

<common_prod_issues>

CrashLoopBackOff: OOM (check limits), app startup failure, missing config/secret, failed dependency (DB unreachable)
High latency: DB slow queries, connection-pool exhaustion, CPU throttling (requests vs limits), upstream timeout
DB connection exhaustion: too many connections from pods, app leak, pool misconfig, no pooler / proxy
Pods Pending: insufficient cluster resources (autoscaler max), affinity/taint mismatch, PVC not bound
5xx errors: app exception, upstream timeout, readiness probe failing, resource exhaustion
Deployment stuck: image pull error, readiness probe failure, PDB blocking rollout, insufficient quota
Network unreachable: NetworkPolicy block, service selector mismatch, DNS failure, VPC / firewall rule </common_prod_issues>

6. Present Diagnosis

Structure the diagnosis:

## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s: [version] | Managed DB: [instance] ([state], [tier])

## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]% | Burn rate: [Nx] | Time to exhaustion: [duration]

## Findings
### [Issue 1: title]
- **Symptom / Evidence / Root cause / Severity (P1–P4) / Blast radius**

## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]

## Environment Health: [HEALTHY | DEGRADED | OUTAGE]

7. Fix or Escalate

⚠️ All production mutations require explicit user approval.

Immediate mitigation (rollback, restart, scale): present the exact command, wait for approval, execute, verify.
Application code fix: transition to a stack-specific role. Fix, test locally (optionally /analyze-local), deploy via normal CI/CD.
Infrastructure fix (Terraform, Helm, K8s config): apply via PR with Agent(devops-engineer) — never direct apply.
Root cause unclear: propose additional diagnostics (increased log verbosity, tracing, profiling, query analysis).

After any fix, re-run relevant Step 4 commands to verify resolution.

8. Summary

Problem: original report / incident ID
Severity: P1–P4
Role(s) applied
Root cause
SLO impact: error budget consumed
Fix applied (or "investigation only — no fix applied")
Verification: confirmation the issue is resolved
Action items: postmortem (if P1/P2), prevention, monitoring improvements

Integration

Called by: /bugfix (production environment diagnostics)
Roles: Agent(sre-engineer), Agent(devops-engineer)
Skills: @cloud-platforms (platform-specific CLI commands, managed service diagnostics), @observability-methods (Golden Signals / RED / USE / Tracing), @telemetry-stacks (Prometheus / Datadog / Honeycomb / New Relic / Sentry / OTel queries)

analyze-prod

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

analyze-prod

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Analyze Production Environment

1. Clarify the Scope

2. Apply Appropriate Role

3. Establish Context

3a. Cloud Platform Context

3b. Kubernetes Cluster

4. Collect Production Snapshot

4a. Node Health

4b. Pod Status

4c. Pod Details (for problematic pods)

4d. Resource Usage

4e. Managed Database Health

4f. Networking and Ingress

4g. Recent Events

4h. Observability Methodology

4i. Production Telemetry Surface

5. Analyze Findings

6. Present Diagnosis

7. Fix or Escalate

8. Summary

Integration

Similar Skills

Analyze Production Environment

1. Clarify the Scope

2. Apply Appropriate Role

3. Establish Context

3a. Cloud Platform Context

3b. Kubernetes Cluster

4. Collect Production Snapshot

4a. Node Health

4b. Pod Status

4c. Pod Details (for problematic pods)

4d. Resource Usage

4e. Managed Database Health

4f. Networking and Ingress

4g. Recent Events

4h. Observability Methodology

4i. Production Telemetry Surface

5. Analyze Findings

6. Present Diagnosis

7. Fix or Escalate

8. Summary

Integration

Similar Skills