Automatically retry failed Prow CI jobs with intelligent failure analysis and exponential backoff. Monitors job status, classifies failures (infrastructure vs code), and retries infrastructure failures automatically. Use when users want to monitor Prow jobs, automatically retry failed tests, or handle flaky infrastructure. Triggers on phrases like "retry this prow job", "monitor and retry", "auto-retry failed test", or any mention of automatic Prow job retry with failure analysis.
Slash command: /ci:prow-job-retry
Automatically monitor Prow CI job status, analyze failures to distinguish infrastructure vs. code issues, and retry infrastructure failures with intelligent exponential backoff.
Optimized for long-running tests (2-4 hours):
User request example:
Monitor ci job https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30585/pull-ci-openshift-origin-main-okd-scos-images/2021101652626378752 for PR https://github.com/openshift/origin/pull/30585
What happens:
Default timeline for 4-hour test:
T+0h: Initial run starts
T+4h: Fails (infrastructure issue detected)
T+5h: Retry #1 starts (waited 60min ± 6min)
T+9h: Retry #1 fails
T+11h: Retry #2 starts (waited 120min ± 12min)
T+15h: Complete (max 3 runs, ~15 hours total)
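The arithmetic behind this timeline can be reproduced with a few lines of Python. This is only a worked sketch: it assumes every run takes the full 4 hours and fails, ignores jitter, and the function name `retry_timeline` is hypothetical rather than part of the skill's scripts.

```python
def retry_timeline(run_hours, base_backoff_hours, max_retries):
    """Return (label, start_hour) pairs and total elapsed hours, ignoring jitter."""
    events = []
    clock = 0.0
    for attempt in range(max_retries + 1):
        label = "Initial run" if attempt == 0 else f"Retry #{attempt}"
        events.append((label, clock))
        clock += run_hours  # duration of the run itself
        if attempt < max_retries:
            # Backoff doubles before each further retry: base * 2^attempt
            clock += base_backoff_hours * 2 ** attempt
    return events, clock


events, total = retry_timeline(run_hours=4, base_backoff_hours=1, max_retries=2)
for label, start in events:
    print(f"T+{start:g}h: {label} starts")
print(f"Total: ~{total:g} hours")
```

With the default values (4h runs, 60min base backoff, 2 retries) the starts land at T+0h, T+5h and T+11h, and the last run ends at T+15h.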
Extract:
Prow job URL: https://prow.ci.openshift.org/view/gs/test-platform-results/...
PR URL: https://github.com/{org}/{repo}/pull/{number}
import re
prow_pattern = r'https://prow\.ci\.openshift\.org/view/gs/(test-platform-results/.+?)(?:\s|$)'
pr_pattern = r'https://github\.com/([\w-]+/[\w-]+)/pull/(\d+)'
prow_match = re.search(prow_pattern, user_input)
pr_match = re.search(pr_pattern, user_input)
bucket_path = prow_match.group(1) # e.g., "test-platform-results/pr-logs/pull/30585/..."
repo = pr_match.group(1) # e.g., "openshift/origin"
pr_number = pr_match.group(2) # e.g., "30585"
build_id = bucket_path.split('/')[-1] # Last segment
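The snippet above raises an `AttributeError` if either URL is missing from the request. A slightly more defensive variant might look like the sketch below. It reuses the same two patterns; the `JobRequest` type, the `parse_request` name, and the assumption that the job name is the second-to-last path segment (as it is in the example URL) are illustrative and not taken from the skill's scripts.

```python
import re
from typing import NamedTuple

PROW_PATTERN = re.compile(
    r"https://prow\.ci\.openshift\.org/view/gs/(test-platform-results/.+?)(?:\s|$)"
)
PR_PATTERN = re.compile(r"https://github\.com/([\w-]+/[\w-]+)/pull/(\d+)")


class JobRequest(NamedTuple):
    bucket_path: str
    repo: str
    pr_number: str
    build_id: str
    job_name: str


def parse_request(user_input: str) -> JobRequest:
    """Extract the Prow bucket path and PR coordinates, failing loudly if absent."""
    prow_match = PROW_PATTERN.search(user_input)
    if prow_match is None:
        raise ValueError("no Prow job URL found in request")
    pr_match = PR_PATTERN.search(user_input)
    if pr_match is None:
        raise ValueError("no GitHub PR URL found in request")

    bucket_path = prow_match.group(1).rstrip("/")
    segments = bucket_path.split("/")
    return JobRequest(
        bucket_path=bucket_path,
        repo=pr_match.group(1),
        pr_number=pr_match.group(2),
        build_id=segments[-1],   # last segment
        job_name=segments[-2],   # assumption: job name precedes the build ID
    )


request = parse_request(
    "Monitor ci job https://prow.ci.openshift.org/view/gs/test-platform-results/"
    "pr-logs/pull/30585/pull-ci-openshift-origin-main-okd-scos-images/"
    "2021101652626378752 for PR https://github.com/openshift/origin/pull/30585"
)
print(request.build_id, request.job_name)
```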
Execute the scripts/monitor_job.sh script with parsed parameters:
./scripts/monitor_job.sh \
--bucket-path "{bucket_path}" \
--pr-repo "{repo}" \
--pr-number "{pr_number}"
The script handles:
The script outputs real-time progress:
=================================================================
CI Job Monitor Started
=================================================================
Build ID: 2021101652626378752
PR: https://github.com/openshift/origin/pull/30585
Start Time: 2026-02-10 14:30:00
Check Interval: 600s (10min)
Max Retries: 5
=================================================================
┌────────────────────────────────────────────────────────────┐
│ Check #1 at 2026-02-10 14:30:00
│ Retry Count: 0/5
└────────────────────────────────────────────────────────────┘
Status: ONGOING
⏳ Job still running, checking again in 600s...
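The check loop behind this output can be pictured roughly as follows. This is a sketch, not the contents of monitor_job.sh: `wait_for_completion` and its parameters are hypothetical, the status lookup is injected as a callable, and any status other than `ONGOING` is assumed to be final.

```python
import time
from typing import Callable, Tuple


def wait_for_completion(
    get_status: Callable[[], str],
    check_interval: float = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Poll until the status is no longer ONGOING; return (final_status, checks_made)."""
    checks = 0
    while True:
        checks += 1
        status = get_status()
        print(f"Check #{checks}: Status: {status}")
        if status != "ONGOING":
            return status, checks
        sleep(check_interval)


# Demonstration with a canned status sequence and a no-op sleep,
# so the example finishes immediately instead of waiting 10 minutes per check.
canned = iter(["ONGOING", "ONGOING", "SUCCESS"])
final_status, checks_made = wait_for_completion(
    lambda: next(canned), check_interval=600, sleep=lambda seconds: None
)
print(final_status, checks_made)
```

Injecting `sleep` keeps the loop testable; a real run would leave the default `time.sleep` in place.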
When the job completes or the maximum number of retries is reached, a report is generated at .work/prow-job-retry/{build_id}/report.md.
Success report:
# CI Job Monitoring Report - SUCCESS ✅
## Job Information
- Build ID: 2021101652626378752
- Job Name: pull-ci-openshift-origin-main-okd-scos-images
- PR: https://github.com/openshift/origin/pull/30585
- Status: SUCCESS
- Total Retries: 2/5
## Summary
The CI job completed successfully after 2 retries.
Failure report includes:
Short tests (< 30min): Quick iterations, more retries
export CI_MONITOR_CHECK_INTERVAL=300 # 5 minutes
export CI_MONITOR_MAX_RETRIES=3 # 3 retries
export CI_MONITOR_BASE_BACKOFF=300 # 5 minutes
export CI_MONITOR_BACKOFF_JITTER=10 # ±10%
Medium tests (30min-2h): Balanced approach
export CI_MONITOR_CHECK_INTERVAL=600 # 10 minutes
export CI_MONITOR_MAX_RETRIES=2 # 2 retries
export CI_MONITOR_BASE_BACKOFF=1200 # 20 minutes
export CI_MONITOR_BACKOFF_JITTER=10 # ±10%
Long tests (2-4h): ⭐ DEFAULT - Conservative, cost-effective
export CI_MONITOR_CHECK_INTERVAL=1800 # 30 minutes
export CI_MONITOR_MAX_RETRIES=2 # 2 retries
export CI_MONITOR_BASE_BACKOFF=3600 # 60 minutes
export CI_MONITOR_BACKOFF_JITTER=10 # ±10%
Super long tests (> 4h): Minimal retries, long recovery
export CI_MONITOR_CHECK_INTERVAL=3600 # 60 minutes
export CI_MONITOR_MAX_RETRIES=1 # 1 retry only
export CI_MONITOR_BASE_BACKOFF=7200 # 120 minutes
export CI_MONITOR_BACKOFF_JITTER=15 # ±15%
See references/retry_strategy_presets.md for detailed analysis and rationale.
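As an illustration, the four presets can be expressed as a lookup keyed on expected test duration. The values are copied from the lists above; the preset names, the helper functions, and the handling of the exact boundaries (30min, 2h, 4h) are assumptions made for this sketch.

```python
VARIABLES = (
    "CI_MONITOR_CHECK_INTERVAL",
    "CI_MONITOR_MAX_RETRIES",
    "CI_MONITOR_BASE_BACKOFF",
    "CI_MONITOR_BACKOFF_JITTER",
)

# (check interval s, max retries, base backoff s, jitter %) per preset
PRESETS = {
    "short": (300, 3, 300, 10),
    "medium": (600, 2, 1200, 10),
    "long": (1800, 2, 3600, 10),       # default
    "super_long": (3600, 1, 7200, 15),
}


def preset_for(expected_minutes: float) -> str:
    """Pick a preset name from the expected test duration in minutes."""
    if expected_minutes < 30:
        return "short"
    if expected_minutes <= 120:
        return "medium"
    if expected_minutes <= 240:
        return "long"
    return "super_long"


def export_lines(preset: str) -> list:
    """Render the preset as shell export statements."""
    return [f"export {name}={value}" for name, value in zip(VARIABLES, PRESETS[preset])]


print("\n".join(export_lines(preset_for(180))))
```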
Full list of configuration options:
export CI_MONITOR_CHECK_INTERVAL=1800 # Check interval in seconds (default: 1800 = 30min)
export CI_MONITOR_MAX_RETRIES=2 # Max retry attempts (default: 2)
export CI_MONITOR_BASE_BACKOFF=3600 # Base backoff in seconds (default: 3600 = 60min)
export CI_MONITOR_BACKOFF_JITTER=10 # Jitter percentage (default: 10 = ±10%)
export CI_MONITOR_ANALYZE_FAILURES=true # Enable failure classification (default: true)
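For readers who want to see how these variables and their defaults fit together, here is a minimal sketch that reads them into a typed object. The `MonitorConfig` class and `load_config` function are hypothetical, and the way the boolean is parsed (only the string `true`, case-insensitive, enables analysis) is an assumption; the actual script may interpret it differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitorConfig:
    check_interval: int = 1800    # seconds (30min)
    max_retries: int = 2
    base_backoff: int = 3600      # seconds (60min)
    backoff_jitter: int = 10      # percent, i.e. +/-10%
    analyze_failures: bool = True


def load_config(env: Mapping[str, str] = os.environ) -> MonitorConfig:
    """Build a config from CI_MONITOR_* variables, falling back to the documented defaults."""
    d = MonitorConfig()
    return MonitorConfig(
        check_interval=int(env.get("CI_MONITOR_CHECK_INTERVAL", d.check_interval)),
        max_retries=int(env.get("CI_MONITOR_MAX_RETRIES", d.max_retries)),
        base_backoff=int(env.get("CI_MONITOR_BASE_BACKOFF", d.base_backoff)),
        backoff_jitter=int(env.get("CI_MONITOR_BACKOFF_JITTER", d.backoff_jitter)),
        # Assumption: any value other than "true" disables failure analysis.
        analyze_failures=env.get("CI_MONITOR_ANALYZE_FAILURES", "true").strip().lower() == "true",
    )


print(load_config({"CI_MONITOR_MAX_RETRIES": "3"}))
```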
# Skip failure analysis (faster, but less intelligent)
./scripts/monitor_job.sh ... --fast
# Disable automatic retries
./scripts/monitor_job.sh ... --no-retry
# Custom retry count
./scripts/monitor_job.sh ... --max-retries 3
# Quiet mode (less output)
./scripts/monitor_job.sh ... --quiet
The scripts/classify_failure.sh script analyzes failures:
Keywords detected in build-log.txt:
Action: Retry with exponential backoff
Keywords detected:
Action: Report to user, manual fix required
Scenarios:
Action: Conservative retry (default behavior)
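The three outcomes above amount to a small decision table. The sketch below encodes it in Python for illustration; the `Action` enum and `action_for` function are hypothetical names, and what "conservative" means in practice is left to the script.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "Retry with exponential backoff"
    REPORT_TO_USER = "Report to user, manual fix required"
    CONSERVATIVE_RETRY = "Conservative retry (default behavior)"


ACTIONS = {
    "INFRASTRUCTURE": Action.RETRY_WITH_BACKOFF,
    "CODE": Action.REPORT_TO_USER,
    "UNKNOWN": Action.CONSERVATIVE_RETRY,
}


def action_for(classification: str) -> Action:
    """Map a failure classification to the action described above."""
    try:
        return ACTIONS[classification]
    except KeyError:
        raise ValueError(f"unexpected classification: {classification!r}") from None


for name in ("INFRASTRUCTURE", "CODE", "UNKNOWN"):
    print(f"{name}: {action_for(name).value}")
```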
Retry delays increase exponentially to give infrastructure time to recover, with random jitter to avoid thundering herd:
Default (long tests, 2-4h):
Retry #1: 60 minutes ± 6min (3600s * 2^0 ± 10%)
Retry #2: 120 minutes ± 12min (3600s * 2^1 ± 10%)
Total: ~15 hours for 3 runs (4h each + backoff)
Short tests (< 30min):
Retry #1: 5 minutes ± 30s (300s * 2^0 ± 10%)
Retry #2: 10 minutes ± 1min (300s * 2^1 ± 10%)
Retry #3: 20 minutes ± 2min (300s * 2^2 ± 10%)
Total: ~2 hours for 4 runs
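The delays listed above follow the formula base * 2^(n-1) ± jitter. A minimal Python sketch of that calculation is shown here; the function name is hypothetical, and drawing the jitter uniformly from the ± range is an assumption, since the distribution used by the script is not stated.

```python
import random


def backoff_seconds(retry_number, base_backoff=3600, jitter_percent=10, rng=random):
    """Delay before retry `retry_number` (1-based): base * 2^(n-1), +/- jitter percent."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    nominal = base_backoff * 2 ** (retry_number - 1)
    spread = nominal * jitter_percent / 100
    # Assumption: jitter is uniform over [-spread, +spread].
    return nominal + rng.uniform(-spread, spread)


for n in (1, 2):
    delay = backoff_seconds(n)
    print(f"Retry #{n}: {delay / 60:.1f} minutes")
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible, which can help when testing a schedule.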
Why exponential + jitter?
When an infrastructure failure is detected:
Post comment to PR:
/test {job-name}
Automatic retry #1/5 triggered by CI monitor.
Previous job failed due to infrastructure issue.
Backoff: 10 minutes.
Wait for the backoff period, which doubles on each retry (e.g. 10min, 20min, 40min with a 600s base backoff)
Find new job ID from PR checks
Switch monitoring to new job
Continue monitoring until success or max retries
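To make the first step concrete, the sketch below assembles the retry comment and the matching `gh pr comment` invocation as an argument list. It only builds the strings and does not run anything; the function names are hypothetical, and the exact line layout of the comment is inferred from the template above.

```python
def retry_comment(job_name, retry_number, max_retries, backoff_minutes):
    """Format the PR comment that triggers a retest."""
    return "\n".join([
        f"/test {job_name}",
        f"Automatic retry #{retry_number}/{max_retries} triggered by CI monitor.",
        "Previous job failed due to infrastructure issue.",
        f"Backoff: {backoff_minutes} minutes.",
    ])


def comment_command(repo, pr_number, body):
    """Argument list for posting the comment with the GitHub CLI."""
    return ["gh", "pr", "comment", str(pr_number), "--repo", repo, "--body", body]


body = retry_comment("{job-name}", retry_number=1, max_retries=5, backoff_minutes=10)
print(body)
print(comment_command("openshift/origin", 30585, body)[:6])
```

Building the command as a list rather than a shell string avoids quoting problems when the body spans several lines.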
After posting the retry comment, the script queries the GitHub API to find the new job:
gh pr view {pr-number} --repo {org/repo} --json statusCheckRollup | \
jq -r '.statusCheckRollup[] |
select(.name | contains("{job-name}")) |
.detailsUrl' | \
grep -oP '\d{19}' | tail -1
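For clarity, here is a rough Python equivalent of the filtering done by the `jq`, `grep` and `tail` stages. It operates on the JSON text that `gh pr view ... --json statusCheckRollup` prints and mirrors the `name` and `detailsUrl` fields used above; the function name and the sample data are made up for illustration, and entries lacking those fields are simply skipped.

```python
import json
import re


def latest_build_id(rollup_json, job_name):
    """Return the last 19-digit build ID among checks whose name contains job_name."""
    data = json.loads(rollup_json)
    ids = []
    for check in data.get("statusCheckRollup", []):
        name = check.get("name") or ""
        url = check.get("detailsUrl") or ""
        if job_name in name:
            ids.extend(re.findall(r"\d{19}", url))
    return ids[-1] if ids else None  # same effect as `tail -1`


# Made-up sample input for demonstration only.
sample = json.dumps({
    "statusCheckRollup": [
        {"name": "ci/prow/unit", "detailsUrl": "https://example.invalid/view/1111111111111111111"},
        {"name": "ci/prow/e2e-aws", "detailsUrl": "https://example.invalid/view/2222222222222222222"},
        {"name": "ci/prow/e2e-aws", "detailsUrl": "https://example.invalid/view/3333333333333333333"},
    ]
})
print(latest_build_id(sample, "e2e-aws"))
```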
Run in background and check later:
nohup ./scripts/monitor_job.sh \
--bucket-path "..." \
--pr-repo "..." \
--pr-number "..." \
> monitor.log 2>&1 &
# Check progress
tail -f monitor.log
# Or view report
cat .work/prow-job-retry/{build_id}/report.md
Adjust backoff for different job types:
# Aggressive (short backoff for fast tests)
export CI_MONITOR_BASE_BACKOFF=300 # 5 minutes
# Conservative (long backoff for expensive E2E)
export CI_MONITOR_BASE_BACKOFF=1200 # 20 minutes
Skip failure analysis for quicker monitoring:
./scripts/monitor_job.sh ... --fast
Trade-off: Retries all failures (no intelligent classification)
Cause: Invalid job URL or GCS access denied
Solution:
gcloud auth list
gcloud storage cat gs://test-platform-results/.../prowjob.json
Cause: GitHub authentication or permission issue
Solution:
gh auth status
gh pr comment {pr} --repo {repo} --body "test"
Cause: Job hasn't started yet or GitHub API delay
Solution: This is normal behavior. The script continues monitoring the current job and picks up the new job on the next check.
Cause: No clear keywords in build-log.txt
Solution: The script retries conservatively. Review .work/prow-job-retry/{build_id}/logs/build-log.txt manually.
monitor_job.sh (Main script)
classify_failure.sh (Failure analyzer)
{
"classification": "INFRASTRUCTURE|CODE|UNKNOWN",
"confidence": "HIGH|MEDIUM|LOW",
"reason": "Description of why classified this way"
}
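A consumer of this output may want to validate it before acting on it. The following sketch checks the three fields against the values listed above; `parse_analysis` is a hypothetical helper and not part of the skill's scripts.

```python
import json

CLASSIFICATIONS = {"INFRASTRUCTURE", "CODE", "UNKNOWN"}
CONFIDENCES = {"HIGH", "MEDIUM", "LOW"}


def parse_analysis(text):
    """Parse classifier output and reject values outside the documented sets."""
    data = json.loads(text)
    if data.get("classification") not in CLASSIFICATIONS:
        raise ValueError(f"bad classification: {data.get('classification')!r}")
    if data.get("confidence") not in CONFIDENCES:
        raise ValueError(f"bad confidence: {data.get('confidence')!r}")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason must be a string")
    return data


analysis = parse_analysis(
    '{"classification": "UNKNOWN", "confidence": "LOW", "reason": "No clear keywords"}'
)
print(analysis["classification"], analysis["confidence"])
```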
prow_api.md (Detailed reference)
job_types_and_commands.md (Job type identification)
Load these references when you need details about:
Use appropriate retry counts
Monitor important jobs only
Review failure reports
Tune for your environment
Check authentication first
gcloud auth list # GCS access
gh auth status # GitHub access
User: "monitor ci job https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30585/pull-ci-openshift-origin-main-e2e-aws/123456789 for PR https://github.com/openshift/origin/pull/30585"
Actions:
Run ./scripts/monitor_job.sh with defaults
User: "monitor ci job <url> for PR <url>, skip failure analysis"
Actions:
Run with the --fast flag
User: "monitor ci job <url> for PR <url>, just notify me when done"
Actions:
Run with the --no-retry flag
User: "monitor ci job <url> for PR <url>, retry up to 3 times"
Actions:
Run with --max-retries 3
All artifacts are stored in .work/prow-job-retry/{build_id}/:
.work/prow-job-retry/2021101652626378752/
├── prowjob.json # Job metadata
├── logs/
│ ├── monitor.log # Full monitoring log
│ └── build-log.txt # Downloaded build log
├── tmp/
│ ├── analysis_0.json # Initial failure analysis
│ ├── analysis_1.json # Retry #1 analysis
│ └── analysis_2.json # Retry #2 analysis
└── report.md # Final report (SUCCESS/FAILURE)
Access artifacts:
# View monitoring progress
cat .work/prow-job-retry/{build_id}/logs/monitor.log
# Check failure analysis
cat .work/prow-job-retry/{build_id}/tmp/analysis_*.json
# Read final report
cat .work/prow-job-retry/{build_id}/report.md
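When these files are consumed from another tool, the layout above can be captured in one place. This is a small sketch using `pathlib`; the function names and dictionary keys are hypothetical, while the directory and file names come from the tree shown earlier.

```python
from pathlib import Path


def artifact_paths(build_id, work_root=".work/prow-job-retry"):
    """Return the fixed artifact locations for one monitored build."""
    root = Path(work_root) / build_id
    return {
        "prowjob": root / "prowjob.json",
        "monitor_log": root / "logs" / "monitor.log",
        "build_log": root / "logs" / "build-log.txt",
        "report": root / "report.md",
    }


def analysis_path(build_id, run_index, work_root=".work/prow-job-retry"):
    """Path of the failure analysis for a run (0 = initial run, 1 = retry #1, ...)."""
    return Path(work_root) / build_id / "tmp" / f"analysis_{run_index}.json"


paths = artifact_paths("2021101652626378752")
print(paths["report"].as_posix())
print(analysis_path("2021101652626378752", 1).as_posix())
```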
npx claudepluginhub wangke19/my-claude-skills --plugin ci