Skill

k8s-debug

Use when the user wants to debug, investigate, or troubleshoot Kubernetes clusters, pods, deployments, services, nodes, or any k8s resource. Trigger on keywords like "pod crashing", "CrashLoopBackOff", "OOMKilled", "ImagePullBackOff", "pending pod", "node pressure", "cluster health", "kubectl", "k8s issue", "what's wrong with my deployment", "debug namespace", "check logs", "pod not starting", "service not reachable", "resource limits", "evicted pods", "kubeconfig", "switch cluster", "which context". Also trigger when the user asks about Kubernetes events, resource usage, Helm release status, or wants to inspect anything running in a cluster.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/k8s-debug:k8s-debug

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Safe, structured investigation of Kubernetes issues using local kubeconfigs.

SKILL.md

300 lines · ~2.1k tokens

Stats

Parent stars: 0

Maintenance: Good

Last commit: May 13, 2026

Kubernetes Cluster Debugging

Safe, structured investigation of Kubernetes issues using local kubeconfigs. Always read before you act: gather full context before suggesting any change.

Prerequisites

kubectl installed and on PATH
One or more kubeconfig files (default: ~/.kube/config)
Optional but recommended: kubectx/kubens, stern, kubecolor, helm

Cluster and context selection

Multiple clusters are common. Always confirm the target before running anything.

# List all contexts across all kubeconfigs
kubectl config get-contexts

# Show current context
kubectl config current-context

# Switch context (ask user to confirm first)
kubectl config use-context <context-name>

# Or use kubectx for fast switching
kubectx                  # list
kubectx <context-name>   # switch

If the user has kubeconfigs in non-default locations:

# Point to a specific file
kubectl --kubeconfig /path/to/config get nodes

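# Merge multiple files for this session (bash/zsh: colon-separated list)
export KUBECONFIG="$HOME/.kube/config-prod:$HOME/.kube/config-staging"
# PowerShell equivalent (semicolon-separated list)
$env:KUBECONFIG = "C:\Users\me\.kube\config-prod;C:\Users\me\.kube\config-staging"
kubectl config get-contexts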

Always use --context and --namespace flags explicitly in commands you run for the user — never rely on the ambient default silently targeting the wrong cluster.
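
One way to make the explicit-flag rule hard to forget is a small wrapper that refuses to run without both values. The sketch below is illustrative only: the function name `kq` and the `KUBECTL` override variable are assumptions, not part of kubectl.

```shell
# kq: hypothetical wrapper that forces an explicit context and namespace.
# Usage: kq <context> <namespace> <kubectl args...>
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command
# instead of running it.
kq() {
  if [ "$#" -lt 3 ]; then
    echo "usage: kq <context> <namespace> <kubectl args...>" >&2
    return 2
  fi
  local ctx="$1" ns="$2"
  shift 2
  "${KUBECTL:-kubectl}" --context "$ctx" --namespace "$ns" "$@"
}
```

Running `KUBECTL=echo kq prod web get pods` prints the flags and arguments that would be passed, which is a cheap way to double-check the target before touching a real cluster.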

Investigation order

Work read-only from broad to narrow. Never suggest a fix before completing steps 1 through 5.

1. Cluster health: nodes, resource pressure
2. Namespace overview: what's unhealthy?
3. Resource detail: describe the specific failing object
4. Logs: container stdout/stderr
5. Events: timeline of what happened
6. Hypothesis → fix: only after the above

Step 1 — Cluster health

# Node status and conditions
kubectl get nodes -o wide --context <ctx>

# Node resource usage (requires metrics-server)
kubectl top nodes --context <ctx>

# Check for node pressure conditions (header, separator, and up to five condition rows)
kubectl describe nodes --context <ctx> | grep -A7 "Conditions:"

# All pods across all namespaces — spot non-Running at a glance
kubectl get pods -A --context <ctx> | grep -v Running | grep -v Completed
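
The `grep -v Running` filter also hides pods that are Running but not Ready (for example READY `0/1`). A minimal sketch of a stricter filter follows. The function name `unhealthy_pods` is hypothetical, and it assumes the column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: hypothetical filter for `kubectl get pods -A` output.
# Keeps the header, any pod that is not Running/Completed, and Running
# pods whose READY column (e.g. 1/2) shows containers that are not ready.
# Assumes READY is column 3 and STATUS is column 4, as with -A.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # header row
    $4 == "Completed" { next }         # finished jobs are fine
    $4 != "Running" { print; next }    # Pending, CrashLoopBackOff, etc.
    { split($3, r, "/"); if (r[1] != r[2]) print }   # Running, not Ready
  '
}

# Usage (placeholder as in the commands above):
#   kubectl get pods -A --context <ctx> | unhealthy_pods
```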

Step 2 — Namespace overview

# List namespaces
kubectl get namespaces --context <ctx>

# Everything in a namespace
kubectl get all -n <ns> --context <ctx>

# Just pods with status
kubectl get pods -n <ns> -o wide --context <ctx>

# Pod resource usage
kubectl top pods -n <ns> --context <ctx>

# Recent events in namespace (sorted by time)
kubectl get events -n <ns> --sort-by='.lastTimestamp' --context <ctx>

Step 3 — Describe the failing resource

describe is the single most useful command — always run it before checking logs.

# Pod detail: conditions, events, resource requests, image, mounts
kubectl describe pod <pod-name> -n <ns> --context <ctx>

# Deployment rollout status
kubectl describe deployment <name> -n <ns> --context <ctx>

# Service endpoints (is it selecting any pods?)
kubectl describe service <name> -n <ns> --context <ctx>

# PVC binding state
kubectl describe pvc <name> -n <ns> --context <ctx>

Step 4 — Logs

# Current container logs
kubectl logs <pod> -n <ns> --context <ctx>

# Previous container (after a crash)
kubectl logs <pod> -n <ns> --previous --context <ctx>

# Specific container in a multi-container pod
kubectl logs <pod> -c <container> -n <ns> --context <ctx>

# Tail + follow
kubectl logs <pod> -n <ns> --tail=100 -f --context <ctx>

# All pods matching a label (requires stern; -l takes the label selector)
stern -n <ns> --context <ctx> -l <label-selector>

Step 5 — Events

# All events in namespace, newest last
kubectl get events -n <ns> --sort-by='.lastTimestamp' --context <ctx>

# Filter to a specific pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name> --context <ctx>

# Warning events only
kubectl get events -n <ns> --field-selector type=Warning --context <ctx>

Common failure patterns

CrashLoopBackOff

kubectl describe pod <pod> -n <ns> --context <ctx>   # check Exit Code and Last State
kubectl logs <pod> -n <ns> --previous --context <ctx>  # logs from crashed container

Exit codes to know:

1 — application error (check app logs)
137 — OOMKilled or SIGKILL (check memory limits)
139 — segfault
143 — SIGTERM (graceful shutdown, usually harmless)
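
The codes above 128 follow the shell convention of 128 plus the signal number (137 = 128 + 9, 139 = 128 + 11, 143 = 128 + 15). A small lookup along these lines can decode codes not listed here. The function name `explain_exit_code` is made up for illustration.

```shell
# explain_exit_code: hypothetical helper mapping a container exit code
# to a likely cause. Codes above 128 conventionally mean 128 + signal.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "success" ;;
    1)   echo "application error" ;;
    137) echo "SIGKILL (signal 9): OOMKilled or forced kill" ;;
    139) echo "SIGSEGV (signal 11): segfault" ;;
    143) echo "SIGTERM (signal 15): graceful shutdown" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi
      ;;
  esac
}
```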

OOMKilled

# Confirm OOMKilled in describe output
kubectl describe pod <pod> -n <ns> --context <ctx> | grep -A3 "Last State"

# Check current memory usage vs limits
kubectl top pod <pod> -n <ns> --context <ctx>

# Check what limits are set
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources}' --context <ctx>
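
To compare the `kubectl top` figure against the configured limit, both values need the same unit. The following is a rough sketch under stated assumptions: the helper names `to_mib` and `mem_pct` are hypothetical, and only whole-number quantities with Ki, Mi, or Gi suffixes are handled.

```shell
# to_mib: hypothetical helper converting a memory quantity with a
# Ki, Mi, or Gi suffix to whole MiB. Decimal suffixes (K, M, G),
# fractional values, and plain byte counts are not handled here.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# mem_pct: usage as a whole-number percentage of the limit (rounded down).
mem_pct() {
  local used limit
  used="$(to_mib "$1")" || return 1
  limit="$(to_mib "$2")" || return 1
  echo $(( used * 100 / limit ))
}

# Usage: mem_pct <usage from kubectl top> <limit from the pod spec>
#   mem_pct 480Mi 512Mi
```

A container sitting close to 100% of its limit shortly before each restart is consistent with an OOM kill, though the `Last State` reason in `describe` remains the authoritative signal.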

ImagePullBackOff / ErrImagePull

kubectl describe pod <pod> -n <ns> --context <ctx>  # look at Events section
# Common causes: wrong image name/tag, missing imagePullSecret, registry unreachable

Pending pod

kubectl describe pod <pod> -n <ns> --context <ctx>
# Look for: Insufficient cpu/memory, no nodes match affinity, PVC not bound, taint/toleration mismatch
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name> --context <ctx>
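
The scheduler's FailedScheduling message usually names the cause directly. As a rough illustration, the sketch below sorts a message into the categories listed above. The function name `classify_pending` is hypothetical, and the patterns are fragments of common scheduler messages, not an exhaustive list.

```shell
# classify_pending: hypothetical helper. Takes a FailedScheduling event
# message and prints a likely category based on common message fragments.
classify_pending() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "resources" ;;
    *"untolerated taint"*|*"had taint"*)          echo "taint/toleration" ;;
    *"node affinity"*|*"node selector"*)          echo "affinity/selector" ;;
    *"PersistentVolumeClaim"*)                    echo "pvc" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Usage: pass the message text copied from the Events section
#   classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```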

Service not reachable

# Check endpoints — if empty, selector doesn't match any pods
kubectl get endpoints <svc> -n <ns> --context <ctx>
kubectl describe service <svc> -n <ns> --context <ctx>

# Check pod labels match service selector
kubectl get pods -n <ns> --show-labels --context <ctx>
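
Rather than comparing labels by eye, the service's selector can be fed back into a pod query. This is a minimal sketch: the helper name `selector_to_labels` is hypothetical, and it assumes the jsonpath output for `.spec.selector` is a flat JSON object, which is worth verifying on your kubectl version.

```shell
# selector_to_labels: hypothetical helper. Converts a flat JSON object
# of string labels, e.g. {"app":"web","tier":"frontend"}, into the
# comma-separated key=value form accepted by `kubectl get pods -l`.
# Assumes keys and values contain no colons, commas, or quotes, which
# holds for valid Kubernetes labels.
selector_to_labels() {
  printf '%s' "$1" | tr -d '{}" ' | sed 's/:/=/g'
}

# Usage (placeholders as in the commands above):
#   sel="$(kubectl get service <svc> -n <ns> --context <ctx> -o jsonpath='{.spec.selector}')"
#   kubectl get pods -n <ns> --context <ctx> -l "$(selector_to_labels "$sel")"
```

If the second command returns no pods, the selector matches nothing, which explains an empty endpoints list.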

Node pressure / evictions

kubectl describe nodes --context <ctx> | grep -E "Pressure|Evict|Condition"
kubectl get pods -A --field-selector status.phase=Failed --context <ctx>
kubectl get events -A --field-selector reason=Evicted --context <ctx>

Helm releases

# List all releases
helm list -A --kube-context <ctx>

# Release status and last deployed
helm status <release> -n <ns> --kube-context <ctx>

# Values currently in use
helm get values <release> -n <ns> --kube-context <ctx>

# Rendered manifests
helm get manifest <release> -n <ns> --kube-context <ctx>

# History of rollouts
helm history <release> -n <ns> --kube-context <ctx>

Quick health snapshot (run as a first pass)

# Paste these 4 commands to get a full picture fast
kubectl get nodes -o wide --context <ctx>
kubectl get pods -A --context <ctx> | grep -v -E "Running|Completed"
kubectl get events -A --sort-by='.lastTimestamp' --context <ctx> | tail -30
kubectl top nodes --context <ctx>

Safety rules

Never run mutating commands (kubectl delete, kubectl apply, kubectl rollout restart, helm upgrade, helm rollback) without explicit user confirmation and stating exactly what will change.
Always include --context explicitly — never rely on the current-context ambient default when the user has multiple clusters.
Always include --namespace (-n) — never assume default.
When suggesting a fix, state: which cluster, which namespace, what the command does, and what the rollback looks like.
Prefer --dry-run=client when available to preview changes before applying.
For destructive operations (delete, force-delete), stop and confirm even if the user said "just fix it".
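
These rules can be approximated mechanically with a read-only allowlist, so that anything not known to be safe requires confirmation. The sketch below is an assumption about how such a guard might look. The function name `is_mutating` is hypothetical and the allowlist is deliberately conservative (for example, every `kubectl config` and `kubectl rollout` subcommand is treated as mutating).

```shell
# is_mutating: hypothetical guard. Returns 0 (true) unless the
# tool/subcommand pair is on a read-only allowlist. Anything
# unrecognized is treated as mutating and should be confirmed first.
is_mutating() {
  local tool="$1" verb="$2"
  case "$tool:$verb" in
    kubectl:get|kubectl:describe|kubectl:logs|kubectl:top|kubectl:explain) return 1 ;;
    helm:list|helm:status|helm:get|helm:history) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
#   if is_mutating kubectl delete; then echo "confirm with the user first"; fi
```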

Useful aliases to suggest

# Add to shell profile
alias kctx='kubectl config use-context'
alias kns='kubectl config set-context --current --namespace'
alias kgp='kubectl get pods -o wide'
alias kge='kubectl get events --sort-by=.lastTimestamp'
alias kdp='kubectl describe pod'

Works well with

argocd skill — for GitOps sync/rollback after diagnosing
helm-qa skill — for validating charts before re-deployment
dips-core:spector skill — for checking what version is deployed per environment
