Diagnose a failed Orchestra pipeline, open a fix PR, validate it on a branch run, then present a human-readable triage summary and STOP for user approval before merging. Use this when the user wants to review the fix before it goes to main — not for fully automated fixes. Trigger on phrases like "triage my pipeline", "show me what's broken", "investigate but don't fix yet", "prepare a fix for review", or when the user explicitly wants a review gate before applying changes. Also trigger when the user describes a symptom in a downstream system ("dashboard looks wrong", "chart is stale", "dbt model has bad data") even if no pipeline error exists — the skill will trace the symptom upstream.
Slash command
/orchestra:triage-orchestra-pipeline
Diagnose a failed (or symptom-reported) pipeline, open a fix PR, validate it on the branch, then present a compact triage report and stop for human approval. Nothing merges to main until the user says so.
Orchestra docs live under references/orchestra/. Index and layout: ../../references/orchestra/README.md. Paths below are relative to this skill folder.
Pipeline
- ../../references/orchestra/pipeline/diagnosis-patterns.md: error classification
- ../../references/orchestra/pipeline/remediation-playbooks.md: fix strategies
- ../../references/orchestra/pipeline/knowledge-store.md: (optional) local fix history; prefer your client's persistent memory
MCP
- ../../references/orchestra/mcp/tools-quick-ref.md: tool arguments and behaviour (use get_pipeline to read a pipeline's full definition)

Before doing anything, classify the input:
Error-first — the pipeline has a red status, or the user gives a run ID/URL/UUID, or says "my pipeline failed". Skip to Step 1.
Symptom-first — the user describes something wrong in a downstream system without mentioning a pipeline error: "Lightdash dashboard looks wrong", "numbers are off", "chart is stale", "dbt model returning bad values". Go to Step 0b.
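The two-way classification above can be sketched as a small heuristic. This is an illustration only: `classify_input`, the phrase list, and the choice to treat a `FAILED` status as "red" are assumptions, not part of the skill or the Orchestra API.

```python
import re
from typing import Optional

# Hypothetical heuristics; the pattern and phrases are illustrative, not an official list.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
ERROR_PHRASES = ("pipeline failed", "run failed")


def classify_input(prompt: str, pipeline_status: Optional[str] = None) -> str:
    """Return "error-first" or "symptom-first" for a user prompt."""
    text = prompt.lower()
    # Assumption: a "red status" corresponds to a FAILED pipeline run.
    if pipeline_status == "FAILED":
        return "error-first"
    # A run UUID or an explicit failure phrase also means error-first.
    if UUID_RE.search(text) or any(phrase in text for phrase in ERROR_PHRASES):
        return "error-first"
    # Anything else is treated as a downstream symptom to trace upstream.
    return "symptom-first"
```

Defaulting to symptom-first keeps the upstream trace as the fallback when no explicit failure signal is present.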
Goal: Find the root cause by starting at the system the user named and walking backwards through the pipeline until something is wrong.
Extract the system keyword from the prompt (e.g. "Lightdash", "dbt", "Fivetran").
Call list_pipeline_runs for the relevant pipeline to get the most recent run.
Call list_task_runs for that run to get all tasks and their execution order.
Match the keyword to a task via the integration field (e.g. LIGHTDASH, DBT_CORE,
FIVETRAN). This is the entry task. If no pipeline is named, ask the user which
pipeline to check before proceeding.
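A minimal sketch of the keyword-to-task match described above. Only the three integration values come from the text; the keyword mapping, the `find_entry_task` helper, and the task dictionary shape (`name`, `integration`) are assumptions about what list_task_runs returns.

```python
# Assumed mapping from user-facing keywords to integration values.
KEYWORD_TO_INTEGRATION = {
    "lightdash": "LIGHTDASH",
    "dbt": "DBT_CORE",
    "fivetran": "FIVETRAN",
}


def find_entry_task(prompt, task_runs):
    """Return the first task whose integration matches a keyword in the prompt."""
    text = prompt.lower()
    for keyword, integration in KEYWORD_TO_INTEGRATION.items():
        if keyword not in text:
            continue
        for task in task_runs:
            # Hypothetical field name; check the real task-run payload.
            if task.get("integration") == integration:
                return task
    # No match: the caller should ask the user which task or pipeline to check.
    return None
```

For example, with tasks for Fivetran, dbt, and Lightdash in one run, the prompt "Lightdash dashboard looks wrong" selects the Lightdash task as the entry task.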
For the entry task in the most recent run, check three things in order:
Run status. Was it SUCCEEDED, WARNING, SKIPPED, or FAILED? Note it — even a SUCCEEDED task can produce wrong output if its inputs were bad.
Recent code commits. Identify the git repo for this task from runParameters
(branch, commit, platformLink). Fetch the last 10 commits on that branch:
gh api "repos/<owner>/<repo>/commits?sha=<branch>&per_page=10"
For each commit in the last 7 days, check the diff for changes to files relevant to the symptom. Examples:
.yml metric/dimension definitions, schema.yml joins
Artifacts/logs. Download available artifacts (e.g. dbt run_results.json) and check for warnings or row count anomalies even on technically passing runs.
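The seven-day window in the commit check can be sketched as a filter over the parsed JSON from the commits request. The `recent_commits` helper and its `now` parameter are hypothetical; the `commit.committer.date` field follows the shape of GitHub's list-commits response, which is worth confirming against the actual payload.

```python
from datetime import datetime, timedelta, timezone


def recent_commits(commits, now=None, days=7):
    """Keep only commits whose committer date falls within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    recent = []
    for item in commits:
        # GitHub timestamps look like 2024-05-08T09:00:00Z; normalise the
        # trailing Z so fromisoformat accepts it on older Python versions.
        stamp = item["commit"]["committer"]["date"].replace("Z", "+00:00")
        if datetime.fromisoformat(stamp) >= cutoff:
            recent.append(item)
    return recent
```

Passing `now` explicitly makes the window reproducible when a triage is re-run later.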
Decision:
Walk backwards through the pipeline task list by execution order. For each upstream task, repeat Step 0b-2: check run status, recent commits in its repo, and artifacts.
Example traversal order for a Fivetran → dbt → Lightdash pipeline:
Prompt: "Lightdash dashboard looks wrong"
Layer 1: Lightdash task
→ status: SUCCEEDED
→ commits: no changes in last 7 days
→ no issue found → move upstream
Layer 2: dbt task
→ status: SUCCEEDED
→ commits: found commit 3 days ago changing model SQL in affected area
→ STOP — root cause identified
(Layer 3: Fivetran task — not reached)
Stop as soon as a suspicious commit or anomaly is found. Do not keep looking once a cause is identified.
If traversal reaches the source with nothing found across all layers, report: "No issues found in the last 7 days across all layers" with a table of what was checked at each layer (task, status, commits reviewed, conclusion).
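The stop-at-first-finding traversal can be sketched as follows. `trace_upstream` and `check_layer` are hypothetical names: `check_layer` stands in for the status, commit, and artifact checks, and returns a finding string or `None`.

```python
def trace_upstream(layers, check_layer):
    """Walk layers from the entry task back to the source, stopping at the first finding.

    Returns (root_cause_layer_or_None, rows_checked); rows_checked feeds the
    "what was checked at each layer" table.
    """
    checked = []
    for layer in layers:
        finding = check_layer(layer)
        checked.append({"layer": layer, "finding": finding or "Clean"})
        if finding:
            # Stop immediately: upstream layers are left unvisited.
            return layer, checked
    # Reached the source with nothing found across all layers.
    return None, checked
```

In the Fivetran → dbt → Lightdash example, the layers would be passed in the order Lightdash, dbt, Fivetran, and the walk ends at dbt without touching Fivetran.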
Once identified, classify it exactly as you would an error-first diagnosis:
Proceed to Step 2.
If the commit represents an intentional business logic change (not a bug), do not open a revert PR. Instead, present the triage summary with a [User action needed] resolution: explain which commit introduced the change and ask the user to decide whether to revert or adjust the downstream layer to match the new logic.
Run the full diagnosis: find the failed run, get task runs, fetch logs/artifacts/operations,
classify the error, identify root cause. Read ../../references/orchestra/pipeline/diagnosis-patterns.md;
optionally recall similar past fixes from your client's persistent memory (or a local
../../references/orchestra/pipeline/knowledge-store.md if the user keeps one).
Do not present a verbose diagnosis block to the user yet. Collect all findings silently — the output comes at the end in the triage summary.
Apply the fix exactly as in fix-orchestra-pipeline Step 5:
- gh pr create
- update_pipeline if Orchestra-backed
Critical difference: Do NOT merge. Do NOT call gh pr merge. Leave the PR open.
Collect:
Trigger a pipeline run against the fix branch (not main) to prove the fix works before anyone merges it:
start_pipeline(
alias_or_pipeline_id=<pipeline_id>,
branch=<fix-branch-name>,
environment=<same env as the failed run>
)
Poll get_pipeline_run_status every 60 s until terminal (SUCCEEDED / FAILED / WARNING).
Do not present anything to the user until the branch run reaches a terminal state.
If the pipeline writes data (ingestion, reverse-ETL), warn the user before running: "This pipeline writes data — branch run will also write. Proceed?"
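The polling rule above can be sketched as a loop around a status callable. `wait_for_terminal` is a hypothetical helper: `get_status` stands in for a call to get_pipeline_run_status, and the `max_polls` safety cap is an added assumption, not something the skill specifies.

```python
import time

# Terminal states named in the step above.
TERMINAL = {"SUCCEEDED", "FAILED", "WARNING"}


def wait_for_terminal(get_status, interval_s=60, sleep=time.sleep, max_polls=None):
    """Poll get_status() every interval_s seconds until a terminal state is returned."""
    polls = 0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        polls += 1
        # Optional cap so a stuck run cannot block the triage forever.
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"run still {status} after {polls} polls")
        sleep(interval_s)
```

Injecting `sleep` keeps the loop testable without waiting a real minute between polls.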
Output the triage summary in the format below, then explicitly stop. Do not schedule a wakeup. Do not merge anything. Wait for the user.
Optimised for Slack: scannable, no preamble, decision-ready.
## Triage: `<pipeline name>`
**What broke:** <one sentence — which task, which test/error>
**Why:** <one sentence — specific root cause>
| Value / change | Field | Origin |
|---|---|---|
| `VALUE` | `column_name` | API enum expansion / bad commit / etc. |
---
**How it was found:**
1. `list_task_runs` → dbt task FAILED, exit code 1
2. Downloaded `run_results.json` → 3 `accepted_values` test failures
3. `list_operations` → `REUSED` present in live operation data
---
**PRs opened:**
| PR | File | Change |
|---|---|---|
| [#N](url) | `models/staging/schema.yml` | Added `REUSED` to `operation_status` accepted values |
**Branch validation:** Run `<run-id>` on `<branch>` — ✅ SUCCEEDED / ❌ FAILED
---
**Why this fixes it:** <2–3 sentences on what the test does, why the new value broke it,
and why this is the right fix rather than a data issue.>
---
> **Ready to apply?** Reply `merge` to merge the PR(s) and trigger a production run,
> or `reject` to close them.
## Triage: `<pipeline name>` — symptom: "<user's description>"
**Entry layer:** <system the user named>
**Root cause found at:** <layer where the issue was identified>
**What changed:** Commit `<sha>` on `<date>` — "<commit message>"
---
**Layers checked:**
| Layer | Task status | Commits checked | Finding |
|---|---|---|---|
| Lightdash | SUCCEEDED | 0 in last 7d | Clean |
| dbt Core | SUCCEEDED | 4 in last 7d | ⚠️ Commit `abc1234` changed `store_sales_example.sql` |
| Fivetran | — | not reached | — |
---
**Suspect commit diff:**
- File: `models/clean/store_sales_example.sql`
- Change: `WHERE sales > 0` → `WHERE sales > 1000` (filters out low-value stores)
---
**PRs opened:**
| PR | File | Change |
|---|---|---|
| [#N](url) | `models/clean/store_sales_example.sql` | Reverted threshold change |
**Branch validation:** Run `<run-id>` on `<branch>` — ✅ SUCCEEDED / ❌ FAILED
---
**Why this fixes it:** <explanation of the causal chain from the commit to the symptom>
---
> **Ready to apply?** Reply `merge` to merge and trigger a production run,
> or `reject` to close the PR(s).
> If this was an intentional change, reply `intentional` and I'll investigate the
> downstream layer instead.
merge (or: "yes", "approve", "ship it", "lgtm"):
- gh pr merge <N> --repo <owner/repo> --squash --delete-branch for each PR
- start_pipeline(pipeline_id, environment=Production)
- ../../references/orchestra/pipeline/knowledge-store.md if the user keeps one. Off by default.
reject (or: "no", "close it", "abandon"):
- gh pr close <N> --repo <owner/repo> for each PR
intentional (symptom-first only; the commit was deliberate, not a bug):
User provides feedback ("change X to Y", "also fix Z"):
knowledge-store.md as an opt-in fallback —
see fix-orchestra-pipeline. Never commit workspace-specific fix history to this repo.
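The handling of user replies after the triage summary can be sketched as a small router. The reply synonyms come from the text; `route_reply` itself, and the choice to treat `intentional` as ordinary feedback outside symptom-first triage, are assumptions.

```python
MERGE_REPLIES = {"merge", "yes", "approve", "ship it", "lgtm"}
REJECT_REPLIES = {"reject", "no", "close it", "abandon"}


def route_reply(reply, symptom_first=False):
    """Map a user reply to one of: merge, reject, intentional, feedback."""
    text = reply.strip().lower()
    if text in MERGE_REPLIES:
        return "merge"
    if text in REJECT_REPLIES:
        return "reject"
    # "intentional" is only offered in the symptom-first summary.
    if text == "intentional" and symptom_first:
        return "intentional"
    # Anything else is treated as feedback on the proposed fix.
    return "feedback"
```

Exact matching is deliberately strict: an ambiguous reply falls through to feedback rather than triggering a merge.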