From debugging-pipeline-failures
Systematically debugs Konflux CI/CD pipeline failures using kubectl and logs. Covers PipelineRun/TaskRun issues, build errors, and resource constraints.
Slash command: /debugging-pipeline-failures:debugging-pipeline-failures
**Core Principle**: Systematic investigation of Konflux CI/CD failures by correlating logs, events, and resource states to identify root causes.
Key Abbreviations: PR = PipelineRun, TR = TaskRun, SA = ServiceAccount, PVC = PersistentVolumeClaim
Invoke when encountering:
| Symptom | First Check | Common Cause |
|---|---|---|
| ImagePullBackOff | Pod events, image name | Registry auth, typo, missing image |
| TaskRun timeout | Step execution time in logs | Slow operation, network issues |
| Pending TaskRun | Resource quotas, node capacity | Quota exceeded, insufficient resources |
| Permission denied | ServiceAccount, RBAC | Missing Role/RoleBinding |
| Volume mount error | PVC status, workspace config | PVC not bound, wrong access mode |
| Exit code 127 | Container logs, command | Command not found, wrong image |
PipelineRun Status Check:
kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>
Look for:
TaskRun Identification:
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
Identify failed TaskRuns by status.
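Picking failed TaskRuns out of the status column by eye gets error-prone once a pipeline has many tasks. The following is a minimal sketch, assuming the first entry in each TaskRun's `status.conditions` is the `Succeeded` condition. The helper name `failed_taskruns` is hypothetical and not part of Tekton or kubectl.

```shell
# Hypothetical helper: reads "name<TAB>status<TAB>reason" lines on stdin and
# prints only the TaskRuns whose condition status is "False" (i.e. failed).
failed_taskruns() {
  awk -F'\t' '$2 == "False" { printf "%s\t%s\n", $1, $3 }'
}

# Usage against a cluster (same placeholders as the commands above):
# kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace> \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[0].status}{"\t"}{.status.conditions[0].reason}{"\n"}{end}' \
#   | failed_taskruns
```

TaskRuns that are still running report a status of `Unknown`, so they are deliberately left out of the output.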
Get TaskRun Pod Logs:
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
# Get logs from specific step
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n <namespace>
# Logs from the previous instance of the container (only available if it restarted)
kubectl logs <pod-name> -c step-<step-name> --previous -n <namespace>
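When a step produces a lot of output, a keyword scan is a quick first pass before reading the log in full. This is a rough sketch; the helper name `scan_log_errors` and the keyword list are assumptions, not something the skill prescribes, and should be adjusted to the tools your steps run.

```shell
# Hypothetical helper: print log lines (prefixed with their line numbers) that
# match common failure keywords, case-insensitively.
scan_log_errors() {
  grep -n -i -E 'error|fatal|failed|denied|not found|timed out'
}

# Usage:
# kubectl logs <pod-name> -c step-<step-name> -n <namespace> | scan_log_errors
```

The line numbers make it easy to jump back into the full log and read the lines just before the first match, which is usually where the last successful operation appears.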
What to Look For:
Check Kubernetes Events:
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Filter for specific resource
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
Critical Events:
FailedScheduling → Resource constraints
FailedMount → Volume/PVC issues
ImagePullBackOff → Registry/image problems
Evicted → Resource pressure
PipelineRun Details:
kubectl get pipelinerun <pr-name> -n <namespace> -o yaml
Check:
TaskRun Details:
kubectl get taskrun <tr-name> -n <namespace> -o yaml
Examine:
Pod Inspection:
kubectl describe pod <pod-name> -n <namespace>
Look for:
Correlate Findings:
Distinguish Symptom from Cause:
Symptoms: ImagePullBackOff, ErrImagePull
Investigation:
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events"
Check:
Common Fixes:
Symptoms: OOMKilled, Pending pods, quota errors
Investigation:
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
kubectl top pods -n <namespace>
kubectl describe node | grep -A5 "Allocated resources"
Common Causes:
Fixes:
Symptoms: Non-zero exit code, "command not found"
Investigation:
kubectl logs <pod-name> -c step-build -n <namespace>
Check:
Fixes:
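For build failures, the exit code of the failed step often narrows the search before any log reading. The sketch below maps a few well-known codes to their usual meaning; `explain_exit_code` is a hypothetical helper, and the wording of each explanation is only a hint, not a diagnosis.

```shell
# Hypothetical helper: translate a step's exit code into a likely meaning.
# 126 and 127 are shell conventions; 137 and 143 are 128 + the signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "success" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found (check image and PATH)" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM (e.g. timeout or cancellation)" ;;
    *)   echo "tool-specific failure; read the step log" ;;
  esac
}

# Usage: list each container's exit code for a TaskRun pod, then explain one.
# kubectl get pod <pod-name> -n <namespace> \
#   -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.exitCode}{"\n"}{end}'
# explain_exit_code 127
```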
Symptoms: TaskRun shows timeout in status
Investigation:
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}{"\n"}{.status.completionTime}'
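To compare the actual run time against the configured timeout, the two timestamps can be converted to a duration. A small sketch, assuming GNU `date` (the `-d` flag is not available in BSD/macOS `date`); the helper name `elapsed_seconds` is hypothetical.

```shell
# Hypothetical helper: seconds between two RFC 3339 timestamps such as the
# startTime and completionTime printed by the command above.
# Assumes GNU date; on macOS, install coreutils and use gdate instead.
elapsed_seconds() {
  local start end
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "$2" +%s)" || return 1
  echo $(( end - start ))
}

# Usage:
# elapsed_seconds <startTime> <completionTime>
```

If the result is close to the value of `.spec.timeout`, the TaskRun most likely ran until the timeout fired rather than failing earlier for another reason.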
Common Causes:
Fixes:
Symptoms: CreateContainerError, volume mount failures
Investigation:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
Check:
Fixes:
Symptoms: "Forbidden", "unauthorized", RBAC errors
Investigation:
kubectl get sa <sa-name> -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl auth can-i create pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
Check:
Fixes:
Avoid: "Pipeline failed, let me rerun it immediately"
Instead: "Let me check logs and events to understand why it failed, then fix the root cause"
Avoid: "Build timed out. I'll set timeout to 2 hours"
Instead: "Let me check what operation is slow in the logs, then optimize or increase timeout if truly needed"
Avoid: "Too many logs to read, I'll just try changing something"
Instead: "I'll search logs for error keywords and check the last successful step before failure"
1. GET PIPELINERUN STATUS
↓
2. IDENTIFY FAILED TASKRUN(S)
↓
3. CHECK POD LOGS (specific step that failed)
↓
4. REVIEW EVENTS (timing correlation)
↓
5. INSPECT RESOURCE YAML (config issues)
↓
6. CORRELATE FINDINGS → IDENTIFY ROOT CAUSE
↓
7. APPLY FIX → VERIFY → DOCUMENT
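The read-only part of this workflow (steps 1 to 5) can be wrapped in one function so the same evidence is collected every time. This is only a sketch: the function name `investigate_pipelinerun` and the `KUBECTL` override variable are assumptions of this example, and it relies on Tekton propagating the `tekton.dev/pipelineRun` label to pods.

```shell
# Assumption of this sketch: KUBECTL can be overridden, e.g. KUBECTL=oc.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: run the investigation commands in workflow order for
# one PipelineRun. Nothing here modifies the cluster.
investigate_pipelinerun() {
  local pr="$1" ns="$2"

  echo "== 1. PipelineRun status =="
  "$KUBECTL" get pipelinerun "$pr" -n "$ns"

  echo "== 2. TaskRuns =="
  "$KUBECTL" get taskruns -l "tekton.dev/pipelineRun=$pr" -n "$ns"

  echo "== 3. Pods (then: kubectl logs <pod-name> -c step-<step-name>) =="
  "$KUBECTL" get pods -l "tekton.dev/pipelineRun=$pr" -n "$ns"

  echo "== 4. Events =="
  "$KUBECTL" get events -n "$ns" --sort-by='.lastTimestamp'

  echo "== 5. PipelineRun YAML =="
  "$KUBECTL" get pipelinerun "$pr" -n "$ns" -o yaml
}

# Usage:
# investigate_pipelinerun <pr-name> <namespace> > investigation.txt
```

Steps 6 and 7 are left out on purpose, since correlating findings and applying a fix need judgment rather than automation.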
Q: Is the PipelineRun stuck in "Running"?
Q: Which TaskRun failed first?
Q: What does the pod log show?
Q: Do events show image, volume, or scheduling issues?
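The last question can be answered semi-mechanically by mapping event reasons to the categories listed under Critical Events. A minimal sketch with a hypothetical helper, `classify_event_reason`; note that some clusters report image pull problems with the generic reasons `Failed` or `BackOff` and mention ImagePullBackOff only in the event message, so an "unclassified" result still deserves a look.

```shell
# Hypothetical helper: map a Kubernetes event reason to the failure category
# used in this guide.
classify_event_reason() {
  case "$1" in
    FailedScheduling)              echo "resource constraints" ;;
    FailedMount)                   echo "volume/PVC issue" ;;
    ImagePullBackOff|ErrImagePull) echo "registry/image problem" ;;
    Evicted)                       echo "resource pressure" ;;
    *)                             echo "unclassified" ;;
  esac
}

# Usage: classify every event reason recorded for one pod.
# kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> \
#   -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' \
#   | while read -r reason; do
#       printf '%s -> %s\n' "$reason" "$(classify_event_reason "$reason")"
#     done
```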
Konflux pipeline failure, Tekton debugging, PipelineRun failed, TaskRun errors, build failures, CI/CD troubleshooting, ImagePullBackOff, OOMKilled, kubectl logs, pipeline timeout, workspace errors, RBAC permissions
npx claudepluginhub konflux-ci/agent-plugins --plugin debugging-pipeline-failures