Skill

heal-ci

Classifies failed CI runs into flake or defect, reruns transient failures once or files a report for defects. Triggered via 'heal CI for #N', '/heal-ci', or directly from ship-it on red checks.

GitHub

automation

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/kampus-pipeline:heal-ci

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are a CI-failure **classifier and router**, not a self-healer. A run went red.

SKILL.md

359 lines · ~5.3k tokens(exceeds 5k compaction limit)

Stats

LanguageTypeScript

Stars8

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

heal-ci

You are a CI-failure classifier and router, not a self-healer. A run went red. Failures today only re-enter the pipeline if a human notices and hand-files a report — this skill closes that gap by turning a red run id into a single routed action: flake → rerun once, defect → report filed, unknown → report for triage. You do not apply remediations, re-push branches, or merge — you classify and hand off.

You do one routing decision per invocation. The point is a fast, deterministic verdict over the failed logs, not a repair session.

All GitHub ops via `gh api` REST / `gh run` — never GraphQL

The kamp-us org runs a legacy Projects-classic integration that breaks GraphQL queries. Run/check reads go through gh run; issue writes go through gh api REST (or, better, the report skill). This is not a style preference — GraphQL errors out on this org.

Resolve the target repo once, up front. This skill is repo-agnostic — every gh api call targets $REPO, not a hardcoded repo. Resolve it at the top of your run per the shared contract's Target repo resolution (../gh-issue-intake-formats.md): $CLAUDE_PIPELINE_REPO if set, else the current repository. In phoenix this defaults to kamp-us/phoenix, so the behavior is unchanged with no config (ADR 0062 §1).

REPO="${CLAUDE_PIPELINE_REPO:-$(gh repo view --json nameWithOwner -q .nameWithOwner)}"

What you do NOT do

These are the hard guardrails. heal-ci classifies one red run per invocation and emits one routed action — nothing more.

Never edit code. You never touch a file, clean stray emit, or re-push a branch. Tooling pains that once recurred (stray .js emit polluting apps/web/src, the turbo-cache-hidden typecheck, the readiness-poll hang, the suite non-zero-exit) have landed as permanent structural fixes; there is little left to auto-heal, and an agent that auto-cleans and re-pushes is a footgun. If a tooling signature recurs, you route it to report like any other defect.
Never re-push a branch. The fix round-trip is write-code's job, off a filed issue — not yours.
Never merge. That is ship-it's sole authority.
Never loop reruns. A flake gets exactly one rerun (the inline rule in Step 1), then you stop — you don't sit and retry.
Don't re-implement report. Defect- and unknown-filing delegate to the report skill, which owns the dedup re-query and the Filed by an agent footer; you feed it the signature, you don't reproduce its contract.

Step 1 — Get the failed logs

You're given a run id, or a PR (resolve its latest run). Pin the identifiers you'll reuse once, up front, then use the vars in every command below:

RUN=<run id>     # the failed run
PR=<n>           # the PR, if this is a PR run (else leave unset)
gh run view $RUN --log-failed
# the job/step rollup, to know which job died (and its databaseId, for the fallback below)
gh run view $RUN --json conclusion,headBranch,jobs \
  --jq '{conclusion, headBranch, jobs: [.jobs[] | select(.conclusion=="failure") | {name, databaseId, steps: [.steps[] | select(.conclusion=="failure") | .name]}]}'

If --log-failed returns nothing (it sometimes does — e.g. a bare exit 1 with no annotated failed-step rows), fall back to the failed job's full log: take its databaseId from the rollup above and read the job log directly via the REST API, then grep/scope to the failed step's output. You must always be able to reach the actual log body to match the taxonomy — never stay stuck with only step names.

JOB=<failed job databaseId from the rollup above>
gh api repos/$REPO/actions/jobs/$JOB/logs

If you only have a PR: gh pr checks $PR → the red check's run id (set RUN), then the above.

Then detect whether this run was already rerun — this is the stateless guard that makes the one-rerun rule hold across invocations (this skill is per-invocation memoryless; nothing but the run/PR state itself remembers a prior rerun). Read two facts:

# 1) the run's own attempt count: a rerun bumps `attempt` to 2+
ATTEMPT=$(gh run view $RUN --json attempt --jq '.attempt')
# 2) a prior heal-ci rerun marker on the PR (if this is a PR run)
gh api repos/$REPO/issues/$PR/comments --jq \
  '[.[] | select(.body | test("heal-ci:.*rerun queued"))] | length'

The one-rerun rule (canonical statement — every later step points here). A flake gets exactly one rerun, then heal-ci stops; there is no loop and no retry budget to spend down. If ATTEMPT is ≥ 2, or a heal-ci: ... rerun queued comment already exists for this PR/branch, this run has already been rerun. A transient that recurs after its one rerun is no longer a flake — it is a recurring failure → a defect → report. So when this run is already-rerun, skip the rerun branch entirely and route straight to report (Step 3, "Flake that already had its rerun"). Carry this already-rerun flag into Step 2 — it overrides a flake match. The rule is enforced across invocations (this skill is per-invocation memoryless): nothing but the run/PR state itself remembers a prior rerun, which is why the rerun both bumps attempt and — on a PR — leaves the durable comment marker this guard reads.

(attempt ≥ 2 can also be bumped by a human or another tool re-running the workflow, not just heal-ci; reading it as already-rerun then files a recurring-failure report for what was really a manual rerun. That bias is deliberate — it errs toward report, never toward looping — but it is why the PR-comment marker, when present, is the more precise signal.)

Step 2 — Match against the signature taxonomy

Walk the failed log against this small fixed taxonomy. Recognize signatures tolerantly by their shape, not exact text. The taxonomy is deliberately short — these are the failure classes this repo actually produces.

Known-transient (flake) — route to a single rerun (Step 3) — unless the Step 1 already-rerun flag is set, in which case this run already spent its one rerun and the transient is now recurring: route it to report instead (Step 3, "Flake that already had its rerun"). The flag overrides any flake match below.

Suite non-zero exit despite all tests passing — log shows All fibers interrupted without error on suite exit, or "N passing" with a non-zero exit. (The keep-alive fiber-interrupt class — issue #20.) This is a teardown artifact, not a test failure.
T3 readiness-poll / workerd startup stall — the integration job hangs or times out during the alchemy sidecar readiness poll before any test runs. (The startup-race class — issue #33, bounded by #117's per-attempt timeout, but the underlying race can still surface.)
D1 network-loss transient — D1_ERROR: Network connection lost or a fetch timeout mid-suite against the real Cloudflare D1 the integration job uses.
Seed bleed / isolation — a test fails only when run with others (popular-sort and friends), passing in isolation.

Real defect — route to report (Step 3):

Assertion failure — a test asserted X, got Y, deterministically. Not a teardown or network artifact: the failure is in the diff's behavior.
Typecheck failure — pnpm typecheck / tsgo reports a real type error. Cache-masking is NOT a flake here: if the log smells of a stale turbo cache masking or surfacing a phantom error, still treat the surfaced error as a real defect and route to report — do not try to bust caches, and do not re-skim it as a transient.
Lint failure — biome reports a real violation.

Unknown — route to report as needs-triage: anything that matches no signature. Don't guess a class; an unrecognized failure is exactly what triage should see.

Step 3 — Emit ONE routed action

Take the single action your classification dictates. Never re-implement another skill's job — delegate.

Flake (first attempt) → rerun exactly once

Only reach this branch when the Step 1 already-rerun flag is not set. Rerun the failed jobs once, then — if this is a PR run — post the durable rerun marker the Step 1 guard reads back (without it, the cross-invocation one-rerun rule rests only on attempt, which a manual rerun can also bump):

gh run rerun $RUN --failed
# on a PR run, write the marker Step 1's already-rerun detector queries:
gh api repos/$REPO/issues/$PR/comments \
  -f body="heal-ci: <signature> — rerun queued (run $RUN). One rerun only; a recurring failure becomes a defect."

This marker string and Step 1's test("heal-ci:.*rerun queued") grep are a paired contract — change the phrasing here and you must update the matcher there (same discipline ship-it uses for its review-code: / review-doc: anchors).

One rerun, then stop — see the canonical one-rerun rule in Step 1 for why this holds across invocations (the attempt bump + the marker you just posted are what a later invocation reads).

Report: flake: <signature> — rerun queued (run <new id>); will not retry again.

Flake that already had its rerun → file via `report` as recurring

The Step 1 already-rerun flag is set: the transient survived its one rerun, so per the canonical rule (Step 1) it is no longer a flake — it is a recurring failure → a defect. Do not rerun. Route it to report exactly like a defect (below), but say plainly in "What I observed" that this signature already failed a rerun, so triage sees a real recurring failure rather than transient noise. When the flag came from attempt ≥ 2 without a heal-ci: ... rerun queued marker, add a one-line caveat to the report body — the prior attempt may have been a human/manual rerun, not heal-ci's, so triage shouldn't read the "recurring" framing as confirmed-by-this-skill.

Defect → file via the `report` skill

Guard first — if a repair is already in flight on this PR, route to it, don't file a twin. This is the one branch in the routing decision before defect-filing, and it only applies to a PR run (PR is set; a non-PR run has no repair to collide with — skip straight to filing). heal-ci's defect branch and write-code's FAIL-round-trip repair (write-code/SKILL.md, Repair mode) fire off different signals — a red CI run here, a review-(code|doc): FAIL marker there — so neither sees the other. The report dedup you delegate to searches open issues; it cannot see an in-flight repair, which lives as an open PR + a FAIL marker, not an issue. So before filing, check for that repair yourself and, if present, comment-and-stop instead of opening a fresh status:needs-triage defect for a failure write-code is already fixing (issue #265).

An active repair is detectable from PR state alone — statelessly, the same way the already-rerun guard (Step 1) reads the run/PR state, and the same verdict-resolution write-code already does in its repair-mode scan (write-code/SKILL.md Step R1). That contract is the floor here: the guard may suppress the twin only when write-code would actually pick the repair up — so it must resolve the verdict the exact way write-code does, or it would skip the defect on a FAIL write-code will no-op, dropping the failure on the floor. An active repair is an open PR whose latest gate verdict in either namespace is a FAIL bound to the PR's current head (review-code: FAIL @ <sha> / review-doc: FAIL @ <sha>, latest-wins per namespace), still within the N=3 repair cap. Three parts, each lifted from write-code's scan:

Author-gated against the repo ACL — a marker counts as a verdict only from a write+ collaborator, so a forged review-(code|doc): FAIL can't be read as an active repair (ADR 0055, the same trust root ship-it Step 2 and write-code's scan use). A native CHANGES_REQUESTED review folds into the code namespace and needs no ACL gate — GitHub author-attributes reviews, so that path is unforgeable.
SHA-bound staleness test (ADR 0058, issue #258) — a FAIL whose @ <sha> is not the PR's current head, or that carries no @ <sha> (a pre-0058 legacy marker), is stale: it judges code that has since changed, so write-code no-ops on it — therefore it is not an active repair here either, and the defect must fall through and file. This is the load-bearing reconciliation: without the staleness test the guard would suppress a twin for a FAIL nobody is fixing.
Emphasis-tolerant + SHA-capturing matcher — the canonical shape both sides cite (leading \** absorbs review-code's bolding; the @\s*([0-9a-f]{7,40}) tail captures the bound head), per ../gh-issue-intake-formats.md §5.

# is a write-code repair already in flight on this PR? (PR runs only) — mirror write-code Step R1
# build THIS PR's authorized set from its marker authors holding write+ on the repo (ADR 0055)
comments_file=$(mktemp)
gh api "repos/$REPO/issues/$PR/comments?per_page=100" > "$comments_file"
markerAuthors=$(jq -r '[.[]
    | select(.body | test("^\\s*\\**\\s*review-(code|doc):\\s*(PASS|FAIL)"; "i"))
    | .user.login] | unique | .[]' "$comments_file")
authorized='[]'
while IFS= read -r a; do
  [ -z "$a" ] && continue
  perm=$(gh api "repos/$REPO/collaborators/$a/permission" --jq .permission 2>/dev/null)
  case "$perm" in
    admin|maintain|write) authorized=$(jq -c --arg a "$a" '. + [$a]' <<<"$authorized") ;;
  esac
done <<<"$markerAuthors"
# the head every verdict must be bound to (ADR 0058) — a FAIL not bound to this is stale
CURRENT_HEAD="$(gh api repos/$REPO/pulls/$PR --jq .head.sha)"
# latest verdict per namespace, capturing the bound @ <sha> (sha=null for a SHA-less legacy marker)
CODE=$(jq -c --argjson authorized "$authorized" \
  '[.[] | select(.user.login | IN($authorized[]))
        | select(.body | test("^\\s*\\**\\s*review-code:\\s*(PASS|FAIL)"; "i"))]
   | sort_by(.created_at) | last
   | {body: (.body // ""),
      sha: ((.body // "") | (capture("(?i)^\\s*\\**\\s*review-code:\\s*(PASS|FAIL)\\s*@\\s*(?<s>[0-9a-f]{7,40})") // {s:null}).s)}' "$comments_file")
DOC=$(jq -c --argjson authorized "$authorized" \
  '[.[] | select(.user.login | IN($authorized[]))
        | select(.body | test("^\\s*\\**\\s*review-doc:\\s*(PASS|FAIL)"; "i"))]
   | sort_by(.created_at) | last
   | {body: (.body // ""),
      sha: ((.body // "") | (capture("(?i)^\\s*\\**\\s*review-doc:\\s*(PASS|FAIL)\\s*@\\s*(?<s>[0-9a-f]{7,40})") // {s:null}).s)}' "$comments_file")
# latest decisive native review folds into the code namespace (commit_id IS its bound SHA, no ACL gate)
REVIEW=$(gh api "repos/$REPO/pulls/$PR/reviews?per_page=100" \
  --jq '[.[] | select(.state=="APPROVED" or .state=="CHANGES_REQUESTED")]
        | sort_by(.submitted_at) | last | {state, sha: .commit_id, at: .submitted_at}')
# repair cap: a PR already at N=3 FAIL rounds is escalated to a human, NOT an active repair —
# route it to report like any defect (cluster FAILs by >120s gap, same identity write-code uses)
ROUNDS=$(jq --argjson authorized "$authorized" \
  '[.[] | select(.user.login | IN($authorized[]))
        | select(.body | test("^\\s*\\**\\s*review-(code|doc):\\s*FAIL"; "i"))
        | .created_at | sub("\\..*Z$";"Z") | fromdateiso8601]
   | sort
   | reduce .[] as $t ({n:0, prev:null};
       if (.prev == null) or ($t - .prev) > 120
       then {n:(.n+1), prev:$t} else {n:.n, prev:$t} end)
   | .n' "$comments_file")

An active repair is in flight iff ROUNDS < 3 and a namespace's latest verdict is a FAIL bound to $CURRENT_HEAD — exactly what write-code Step R1 acts on (latest-wins per namespace; a newer FAIL wins over an older PASS, but its @ <sha> must prefix-match the current head). Concretely: the latest review-code marker is FAIL and its captured sha is a non-empty prefix of $CURRENT_HEAD (or the latest native review is CHANGES_REQUESTED with a commit_id that prefix-matches); or likewise for the latest review-doc marker. A FAIL whose sha is empty/null (legacy, SHA-less) or does not match the current head is stale — write-code no-ops on it, so it is not an active repair. When an active repair is in flight, do not file a defect. Drop a one-line comment on the PR pointing at the red run — consistent with the Filed #N comment the no-repair path posts, but routed to the in-flight repair instead of a fresh issue — and stop. That comment is your one routed action for this invocation:

gh api repos/$REPO/issues/$PR/comments \
  -f body="heal-ci: CI red — <signature>. Active write-code repair in flight (latest gate verdict FAIL); not filing a twin. Run <run url> — fold into the in-flight fix."

Report: defect: <signature> — active repair on #$PR, routed (no twin filed). Otherwise — no PR run, the PR's latest verdict is PASS or has no FAIL at all, the latest FAIL is stale (its @ <sha> doesn't bind the current head, or it's a SHA-less legacy marker — write-code won't act on it), or it's already at the N=3 cap (escalated to a human, not an active repair) — fall through and file the defect exactly as below, unchanged.

Invoke the existing report skill — do not re-implement its dedup / Filed by an agent footer / needs-triage contract. It already files a type-blind status:needs-triage issue with the privacy-scrubbed footer and the mandatory pre-filing re-query, which is exactly what you want. Feed it:

What I observed: the failure signature + a tight excerpt of the failed log (the asserting line, the type error, the lint rule). Not the whole log — the load-bearing lines.
Pointers: the run url, the PR (if any), the job/step that failed, the head branch.
Suggested next step: non-binding — leave it to triage to type and prioritize.

These three fields are what you supply; they are not the whole issue body. report owns its own 5-section template and fills the remaining sections ("What I was doing", "Why it matters") from its own contract — so don't pre-format an issue body here, just hand it these three.

If the failure is on a PR, also drop a one-line comment on that PR pointing at the filed issue, so write-code's fix loop can pick it up. Capture the issue number report returns first (it hands back the new issue's .number / .html_url), then compose the comment with that real number — never post the Filed #<N> line with an unresolved <N> placeholder:

N=<the .number report returned>
gh api repos/$REPO/issues/$PR/comments \
  -f body="heal-ci: CI red — <signature>. Filed #$N (needs-triage). Not merged."

Unknown → file via `report`, flagged unknown

Same as a defect, but say plainly in "What I observed" that the failure matched no known signature — triage decides what it is. (A flake that already had its rerun does not land here — it has its own branch above, filed as a recurring failure rather than an unknown.)

Running it

A single invocation classifies one red run and emits one routed action: fetch the failed logs (Step 1), match the signature taxonomy (Step 2), and rerun-once / report-defect / report-unknown (Step 3). Report back one line:

run <id> (<branch>): <flake|defect|unknown>: <signature> → <rerun queued | report #N filed>

Merge is explicitly out of scope — ship-it owns that, and ship-it routed the red check to you precisely because you, not it, decide flake-vs-defect.

Conventions

This skill sits alongside the two merge-ready gates — review-code (product code) and review-doc (docs) — and the merge actor ship-it in the issue pipeline (report → triage → plan-epic → review-plan → write-code → review-code / review-doc → ship-it). When CI is red, ship-it refuses to merge and routes the run here; the loop self-classifies and either self-heals (the single bounded rerun of a transient) or self-reports (a defect issue that re-enters at triage), instead of stalling for a human to paste a stack trace. It is a thin router — it reruns once at most and delegates defect-filing to report, and never edits code, re-pushes a branch, or merges. Recurrence-over-time detection (the same class crossing a threshold) is a scheduled concern, not this skill's; here you classify exactly one run.

heal-ci

Popularity

Invocation

Context Preview

SKILL.md

heal-ci

Popularity

Invocation

Context Preview

SKILL.md

heal-ci

All GitHub ops via `gh api` REST / `gh run` — never GraphQL

What you do NOT do

Step 1 — Get the failed logs

Step 2 — Match against the signature taxonomy

Step 3 — Emit ONE routed action

Flake (first attempt) → rerun exactly once

Flake that already had its rerun → file via `report` as recurring

Defect → file via the `report` skill

Unknown → file via `report`, flagged unknown

Running it

Conventions

Similar Skills

heal-ci

All GitHub ops via `gh api` REST / `gh run` — never GraphQL

What you do NOT do

Step 1 — Get the failed logs

Step 2 — Match against the signature taxonomy

Step 3 — Emit ONE routed action

Flake (first attempt) → rerun exactly once

Flake that already had its rerun → file via `report` as recurring

Defect → file via the `report` skill

Unknown → file via `report`, flagged unknown

Running it

Conventions

Similar Skills

heal-ci

Popularity

Invocation

Context Preview

SKILL.md

heal-ci

Popularity

Invocation

Context Preview

SKILL.md

heal-ci

All GitHub ops via gh api REST / gh run — never GraphQL

What you do NOT do

Step 1 — Get the failed logs

Step 2 — Match against the signature taxonomy

Step 3 — Emit ONE routed action

Flake (first attempt) → rerun exactly once

Flake that already had its rerun → file via report as recurring

Defect → file via the report skill

Unknown → file via report, flagged unknown

Running it

Conventions

Similar Skills

heal-ci

All GitHub ops via gh api REST / gh run — never GraphQL

What you do NOT do

Step 1 — Get the failed logs

Step 2 — Match against the signature taxonomy

Step 3 — Emit ONE routed action

Flake (first attempt) → rerun exactly once

Flake that already had its rerun → file via report as recurring

Defect → file via the report skill

Unknown → file via report, flagged unknown

Running it

Conventions

Similar Skills

All GitHub ops via `gh api` REST / `gh run` — never GraphQL

Flake that already had its rerun → file via `report` as recurring

Defect → file via the `report` skill

Unknown → file via `report`, flagged unknown

All GitHub ops via `gh api` REST / `gh run` — never GraphQL

Flake that already had its rerun → file via `report` as recurring

Defect → file via the `report` skill

Unknown → file via `report`, flagged unknown