Use when the user wants to debug, investigate, or troubleshoot Kubernetes clusters, pods, deployments, services, nodes, or any k8s resource. Trigger on keywords like "pod crashing", "CrashLoopBackOff", "OOMKilled", "ImagePullBackOff", "pending pod", "node pressure", "cluster health", "kubectl", "k8s issue", "what's wrong with my deployment", "debug namespace", "check logs", "pod not starting", "service not reachable", "resource limits", "evicted pods", "kubeconfig", "switch cluster", "which context". Also trigger when the user asks about Kubernetes events, resource usage, Helm release status, or wants to inspect anything running in a cluster.
Slash command: /k8s-debug:k8s-debug
Safe, structured investigation of Kubernetes issues using local kubeconfigs. Always read before you act: gather full context before suggesting any change.
- kubectl installed and on PATH
- A kubeconfig file (~/.kube/config)
- Optional: kubectx/kubens, stern, kubecolor, helm
Multiple clusters are common. Always confirm the target before running anything.
# List all contexts across all kubeconfigs
kubectl config get-contexts
# Show current context
kubectl config current-context
# Switch context (ask user to confirm first)
kubectl config use-context <context-name>
# Or use kubectx for fast switching
kubectx # list
kubectx <context-name> # switch
If the user has kubeconfigs in non-default locations:
# Point to a specific file
kubectl --kubeconfig /path/to/config get nodes
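# Merge multiple files for this session (PowerShell; entries are separated by ";")
$env:KUBECONFIG = "C:\Users\me\.kube\config-prod;C:\Users\me\.kube\config-staging"
# Same in bash/zsh (entries are separated by ":")
export KUBECONFIG="$HOME/.kube/config-prod:$HOME/.kube/config-staging"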
kubectl config get-contexts
Always pass --context and --namespace explicitly in commands you
run for the user. Never rely on the ambient default, which can silently
target the wrong cluster.
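One way to enforce the explicit-flags rule is a small wrapper that refuses to run without both values. This is only a sketch: the function name `kq` and the `KQ_ECHO` switch are hypothetical, not part of kubectl or of this skill.

```shell
# Hypothetical helper: run kubectl only when context and namespace are given.
# Usage: kq <context> <namespace> <kubectl args...>
kq() {
  local ctx="$1" ns="$2"
  if [ -z "$ctx" ] || [ -z "$ns" ]; then
    echo "usage: kq <context> <namespace> <kubectl args...>" >&2
    return 2
  fi
  shift 2
  if [ "${KQ_ECHO:-0}" = "1" ]; then
    # Preview mode (assumed convention): print the command instead of running it
    echo "kubectl --context $ctx --namespace $ns $*"
  else
    kubectl --context "$ctx" --namespace "$ns" "$@"
  fi
}
```

For example, `KQ_ECHO=1 kq prod payments get pods` prints the full command for the user to confirm before anything touches the cluster.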
Work read-only from broad to narrow. Never suggest a fix before completing step 3.
# Node status and conditions
kubectl get nodes -o wide --context <ctx>
# Node resource usage (requires metrics-server)
kubectl top nodes --context <ctx>
# Check for node pressure conditions
kubectl describe nodes --context <ctx> | grep -A5 "Conditions:"
# All pods across all namespaces — spot non-Running at a glance
kubectl get pods -A --context <ctx> | grep -v Running | grep -v Completed
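The `grep -v Running` filter above hides pods that are in the Running phase but not fully ready (for example `0/1`). A hedged alternative is to filter on the READY column as well. The sketch below assumes the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...), and `unhealthy_pods` is an illustrative name.

```shell
# Hypothetical filter: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every pod that is neither Completed nor fully ready.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }             # keep the header row
    $4 == "Completed" { next }          # finished jobs are expected
    {
      split($3, ready, "/")             # READY column looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }'
}

# Usage (not executed here):
#   kubectl get pods -A --context <ctx> | unhealthy_pods
```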
# List namespaces
kubectl get namespaces --context <ctx>
# Everything in a namespace
kubectl get all -n <ns> --context <ctx>
# Just pods with status
kubectl get pods -n <ns> -o wide --context <ctx>
# Pod resource usage
kubectl top pods -n <ns> --context <ctx>
# Recent events in namespace (sorted by time)
kubectl get events -n <ns> --sort-by='.lastTimestamp' --context <ctx>
describe is the single most useful command — always run it before checking logs.
# Pod detail: conditions, events, resource requests, image, mounts
kubectl describe pod <pod-name> -n <ns> --context <ctx>
# Deployment rollout status
kubectl describe deployment <name> -n <ns> --context <ctx>
# Service endpoints (is it selecting any pods?)
kubectl describe service <name> -n <ns> --context <ctx>
# PVC binding state
kubectl describe pvc <name> -n <ns> --context <ctx>
# Current container logs
kubectl logs <pod> -n <ns> --context <ctx>
# Previous container (after a crash)
kubectl logs <pod> -n <ns> --previous --context <ctx>
# Specific container in a multi-container pod
kubectl logs <pod> -c <container> -n <ns> --context <ctx>
# Tail + follow
kubectl logs <pod> -n <ns> --tail=100 -f --context <ctx>
# All pods matching a label (requires stern)
stern -l <label-selector> -n <ns> --context <ctx>
# All events in namespace, newest last
kubectl get events -n <ns> --sort-by='.lastTimestamp' --context <ctx>
# Filter to a specific pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name> --context <ctx>
# Warning events only
kubectl get events -n <ns> --field-selector type=Warning --context <ctx>
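When a namespace is noisy, it can help to count Warning events by reason before reading them one by one. The sketch below assumes the default namespaced `kubectl get events` layout, where data rows start with the last-seen age, then TYPE, then REASON. `warning_reasons` is an illustrative name.

```shell
# Hypothetical summarizer: reads `kubectl get events -n <ns>` output on stdin
# and prints "<count> <reason>" for Warning events, most frequent first.
warning_reasons() {
  awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
       END { for (reason in count) print count[reason], reason }' | sort -rn
}

# Usage (not executed here):
#   kubectl get events -n <ns> --context <ctx> | warning_reasons
```

With `-A` the output gains a leading NAMESPACE column, so the field numbers would need to shift by one.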
kubectl describe pod <pod> -n <ns> --context <ctx> # check Exit Code and Last State
kubectl logs <pod> -n <ns> --previous --context <ctx> # logs from crashed container
Exit codes to know:
- 1: application error (check app logs)
- 137: OOMKilled or SIGKILL (check memory limits)
- 139: segfault
- 143: SIGTERM (graceful shutdown, usually harmless)
# Confirm OOMKilled in describe output
kubectl describe pod <pod> -n <ns> --context <ctx> | grep -A3 "Last State"
# Check current memory usage vs limits
kubectl top pod <pod> -n <ns> --context <ctx>
# Check what limits are set
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources}' --context <ctx>
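The exit codes listed earlier follow the usual convention that a code above 128 means the process died from signal number (code - 128): 137 = 128 + 9 (SIGKILL), 139 = 128 + 11 (SIGSEGV), 143 = 128 + 15 (SIGTERM). A small decoder, sketched here under the hypothetical name `explain_exit_code`, applies the same arithmetic to codes that are not in the list.

```shell
# Hypothetical decoder for the "Exit Code" shown by `kubectl describe pod`.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exit 0: clean exit" ;;
    1)   echo "exit 1: application error, check app logs" ;;
    137) echo "exit 137: signal 9 (SIGKILL), often OOMKilled, check memory limits" ;;
    139) echo "exit 139: signal 11 (SIGSEGV), segfault" ;;
    143) echo "exit 143: signal 15 (SIGTERM), graceful shutdown" ;;
    *)
      # Codes above 128 conventionally encode 128 + signal number
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "exit $code: killed by signal $((code - 128))"
      else
        echo "exit $code: application-defined"
      fi
      ;;
  esac
}
```

For instance, `explain_exit_code 134` reports signal 6, which is SIGABRT on Linux.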
kubectl describe pod <pod> -n <ns> --context <ctx> # look at Events section
# Common causes: wrong image name/tag, missing imagePullSecret, registry unreachable
kubectl describe pod <pod> -n <ns> --context <ctx>
# Look for: Insufficient cpu/memory, no nodes match affinity, PVC not bound, taint/toleration mismatch
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name> --context <ctx>
# Check endpoints — if empty, selector doesn't match any pods
kubectl get endpoints <svc> -n <ns> --context <ctx>
kubectl describe service <svc> -n <ns> --context <ctx>
# Check pod labels match service selector
kubectl get pods -n <ns> --show-labels --context <ctx>
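The label check above comes down to a subset test: every `key=value` pair in the service selector must appear in the pod's labels. A minimal sketch of that comparison follows. It assumes both sides are given as comma-separated `key=value` strings, as printed in the Selector line of `kubectl describe service` and the LABELS column of `--show-labels`. `selector_matches` is an illustrative name.

```shell
# Hypothetical check: succeeds when every pair in the selector is in the labels.
# Usage: selector_matches "<selector k=v,k=v>" "<pod labels k=v,k=v>"
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;        # this pair is present on the pod
      *) return 1 ;;         # one missing pair means the service skips the pod
    esac
  done
  return 0
}
```

A typo such as `tier=fronted` in the selector makes the check fail, which lines up with an empty endpoints list.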
kubectl describe nodes --context <ctx> | grep -E "Pressure|Evict|Condition"
kubectl get pods -A --field-selector status.phase=Failed --context <ctx>
kubectl get events -A --field-selector reason=Evicted --context <ctx>
# List all releases
helm list -A --kube-context <ctx>
# Release status and last deployed
helm status <release> -n <ns> --kube-context <ctx>
# Values currently in use
helm get values <release> -n <ns> --kube-context <ctx>
# Rendered manifests
helm get manifest <release> -n <ns> --kube-context <ctx>
# History of rollouts
helm history <release> -n <ns> --kube-context <ctx>
# Paste these 4 commands to get a full picture fast
kubectl get nodes -o wide --context <ctx>
kubectl get pods -A --context <ctx> | grep -v -E "Running|Completed"
kubectl get events -A --sort-by='.lastTimestamp' --context <ctx> | tail -30
kubectl top nodes --context <ctx>
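The four commands above can be bundled into one function so the context is typed once and can never be omitted. This is only a sketch; `triage` is a hypothetical name, and the section labels are illustrative.

```shell
# Hypothetical wrapper around the four triage commands; read-only by design.
# Usage: triage <context>
triage() {
  local ctx="$1"
  if [ -z "$ctx" ]; then
    echo "usage: triage <context>" >&2
    return 2
  fi
  echo "== nodes =="
  kubectl get nodes -o wide --context "$ctx"
  echo "== pods not Running/Completed =="
  kubectl get pods -A --context "$ctx" | grep -v -E "Running|Completed"
  echo "== last 30 events =="
  kubectl get events -A --sort-by='.lastTimestamp' --context "$ctx" | tail -30
  echo "== node usage (requires metrics-server) =="
  kubectl top nodes --context "$ctx" ||
    echo "kubectl top failed; metrics-server may be missing" >&2
  return 0
}
```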
- Never run mutating commands (kubectl delete, kubectl apply, kubectl rollout restart, helm upgrade, helm rollback) without explicit user confirmation and stating exactly what will change.
- Always pass --context explicitly; never rely on the current-context ambient default when the user has multiple clusters.
- Always pass --namespace (-n); never assume default.
- Use --dry-run=client when available to preview changes before applying.
# Add to shell profile
alias kctx='kubectl config use-context'
alias kns='kubectl config set-context --current --namespace'
alias kgp='kubectl get pods -o wide'
alias kge='kubectl get events --sort-by=.lastTimestamp'
alias kdp='kubectl describe pod'
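The confirmation rule for mutating commands can also be backed by a guard in the shell profile. The sketch below is an assumption-laden illustration: `is_mutating` and `run_guarded` are hypothetical names, and the match only covers the five commands named in the safety rules when the verb directly follows the binary.

```shell
# Hypothetical classifier: succeeds for the mutating commands listed in the rules.
is_mutating() {
  case "$1" in
    "kubectl delete"*|"kubectl apply"*|"kubectl rollout restart"*) return 0 ;;
    "helm upgrade"*|"helm rollback"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard: shows the exact command and requires a typed "yes"
# before running anything that is_mutating flags; read-only commands pass through.
run_guarded() {
  local cmd="$*" answer
  if is_mutating "$cmd"; then
    printf 'About to run: %s\nType "yes" to continue: ' "$cmd" >&2
    read -r answer
    if [ "$answer" != "yes" ]; then
      echo "aborted" >&2
      return 1
    fi
  fi
  "$@"
}
```

Usage would look like `run_guarded kubectl delete pod <pod> -n <ns> --context <ctx>`. Flags placed before the verb would slip past this simple prefix match, so it complements the rules rather than replacing them.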
- argocd skill, for GitOps sync/rollback after diagnosing
- helm-qa skill, for validating charts before re-deployment
- dips-core:spector skill, for checking what version is deployed per environment
npx claudepluginhub thashmikax/marketplace --plugin k8s-debug