Automatically diagnose and fix failed Orchestra data pipelines. Use this skill whenever a user mentions Orchestra pipeline failures, broken pipelines, pipeline errors, task failures, pipeline debugging, or wants to understand why an Orchestra pipeline run failed. Also trigger when the user says things like "fix my pipeline", "what's broken", "why did my pipeline fail", "debug this run", "retry my pipeline", or references Orchestra pipeline runs, task runs, or pipeline errors. This skill handles the full lifecycle: identify failures, fetch logs and artifacts, diagnose the root cause, apply fixes where possible, retry, and learn from past fixes. It supports all Orchestra integrations including dbt, Snowflake, Python, HTTP, Fivetran, Airbyte, and more. Trigger this skill even if the user just pastes an Orchestra error message, pipeline run URL, pipeline run link from the Orchestra UI, a UUID, a Slack alert, or a pipeline name/alias.
Slash command: /orchestra:fix-orchestra-pipeline
Diagnose, fix, and retry failed Orchestra pipelines — and optionally remember what worked.
This skill assumes the Orchestra MCP server is connected. All MCP calls are scoped to the user's workspace.
Use Orchestra MCP tools for all operations in this skill (read a pipeline's full definition with
get_pipeline). Argument summaries: ../../references/orchestra/mcp/tools-quick-ref.md.
The user may provide their problem in several forms. Parse the input before entering the workflow:
Users often paste a link from the Orchestra UI. Extract IDs from any of these URL patterns:
https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}
https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}/lineage
https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}/task-runs/{task_run_id}
https://app.getorchestra.io/pipelines/{pipeline_id}/runs/{pipeline_run_id}
https://app.getorchestra.io/pipelines/{pipeline_id}
The IDs are UUIDs (e.g. 123e4567-e89b-12d3-a456-426614174000). If a URL contains a
pipeline_run_id, skip straight to Step 2. If it only contains a pipeline_id, query for
the latest failed run of that pipeline.
For custom/self-hosted Orchestra instances, the base domain may differ (e.g.
https://orchestra.company.com/...). The path structure is the same — extract UUIDs
from path segments.
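
The URL patterns above can be parsed by looking only at path segments, which also covers self-hosted base domains. The following is a minimal sketch, not part of the Orchestra tooling: the function name `extract_orchestra_ids` and the segment-to-field mapping are illustrative assumptions derived from the patterns listed above.

```python
import re
from urllib.parse import urlparse

UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

# Path segment that precedes each ID, mapped to the ID name used in this skill.
# (Assumed mapping, inferred from the URL patterns listed above.)
SEGMENT_TO_FIELD = {
    "pipelines": "pipeline_id",
    "pipeline-runs": "pipeline_run_id",
    "runs": "pipeline_run_id",
    "task-runs": "task_run_id",
}


def extract_orchestra_ids(url: str) -> dict[str, str]:
    """Return whichever of pipeline_id, pipeline_run_id, task_run_id the URL contains."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    ids: dict[str, str] = {}
    # Walk adjacent (label, value) pairs, e.g. ("pipeline-runs", "<uuid>").
    for label, value in zip(segments, segments[1:]):
        field = SEGMENT_TO_FIELD.get(label)
        if field and UUID_RE.match(value):
            ids[field] = value.lower()
    return ids
```

If the result contains `pipeline_run_id`, go to Step 2; if it contains only `pipeline_id`, look up that pipeline's latest failed run.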
If the user pastes a bare UUID, try it as a pipeline run ID first with
get_pipeline_run_status. If that returns a result, proceed to Step 2.
If not, treat it as a pipeline ID and query recent runs with list_pipeline_runs.
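
The bare-UUID fallback above amounts to "try as a run ID, then as a pipeline ID". The sketch below shows that control flow only. The two callables are hypothetical stand-ins for the `get_pipeline_run_status` and `list_pipeline_runs` MCP tools, and it assumes a missing run comes back as `None` or an empty result, which this document does not specify.

```python
from typing import Any, Callable, Optional


def resolve_bare_uuid(
    uuid: str,
    get_run_status: Callable[[str], Optional[dict]],
    list_runs_for_pipeline: Callable[[str], list],
) -> dict[str, Any]:
    """Try the UUID as a pipeline run ID first, then fall back to a pipeline ID."""
    run = get_run_status(uuid)  # stand-in for get_pipeline_run_status
    if run:
        # A result came back: treat the UUID as a pipeline run and go to Step 2.
        return {"kind": "pipeline_run", "pipeline_run_id": uuid, "run": run}
    # No run found (assumed to be signalled by None/empty): treat it as a pipeline ID.
    runs = list_runs_for_pipeline(uuid)  # stand-in for list_pipeline_runs
    return {"kind": "pipeline", "pipeline_id": uuid, "recent_runs": runs}
```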
If the user says "fix the daily-etl pipeline", search for it with list_pipelines
and match by name or alias, then query for its latest failed run.
If the user pastes an error message or log snippet, skip to Step 4 (Diagnose) — you already have the evidence. Ask for the pipeline run ID only if you need to fetch additional context like logs or artifacts.
Orchestra alert messages (from Slack, Teams, email, webhooks) typically contain the
pipeline name, task name, and status. Extract these and use them to find the
corresponding pipeline run via list_pipeline_runs/list_task_runs.
Execute these steps in order. Each step feeds the next. If the parsed input provides enough information to skip ahead, jump to the relevant step.
Step 1
Goal: Find which pipeline runs have failed recently.
If the user provided a pipeline run ID or URL: Skip to Step 2 using the extracted ID.
If the user said "what's broken" or similar: Query for recent failures:
list_pipeline_runs with status=FAILED
If multiple failures exist: Ask the user which one to investigate, or offer to triage all of them starting with the most recent.
Key fields from the response:
- id: the pipeline run ID (needed for all subsequent steps)
- pipelineName: human-readable name
- pipelineId: the pipeline definition ID
- message: Orchestra's summary of what happened
- triggeredBy: what started the run (cron, sensor, manual, webhook)
- completedAt: when it finished failing
- envName: which environment (Production, Staging, etc.)

Step 2
Goal: Identify exactly which task(s) within the pipeline failed.
- Call list_task_runs filtered by the failed pipeline run IDs and status=FAILED.
- Also check status=WARNING task runs; they may contain useful context.

Key fields from each task run:
- id: task run ID (needed for logs/artifacts)
- taskName: human-readable task name
- taskId: the task identifier in the pipeline YAML
- integration: which integration (e.g. SNOWFLAKE, DBT_CORE, HTTP, PYTHON)
- integrationJob: the specific job type (e.g. SNOWFLAKE_RUN_QUERY, DBT_CORE_RUN_COMMAND)
- status: FAILED or WARNING
- message: Orchestra's task-level message
- externalStatus: the status from the underlying platform (e.g. HTTP 500, dbt error code)
- externalMessage: the platform's error message
- taskParameters: what was configured on the task
- runParameters: runtime parameters including connection details
- connectionId: which credential/connection was used
- numberOfAttempts: how many times Orchestra retried

Present the findings: Show the user which tasks failed, in what order, and their error messages. If the pipeline has multiple tasks, note which ones succeeded (they ran before the failure point) and which were skipped (downstream of the failure).
Step 3
Goal: Get the raw evidence: logs, artifacts, and operations.
For each failed task run:
- Logs: Call list_task_run_logs to list available log files. Then fetch each log with download_task_run_log. For large logs, fetch only the last ~256 KB using range_header (for example bytes=-262144).
- Artifacts: Call list_task_run_artifacts. For dbt tasks, look for run_results.json and manifest.json. Download relevant artifacts with download_task_run_artifact.
- Operations: Call list_operations filtered by task_run_id to see sub-operations (individual dbt models, Snowflake queries, etc.) and their statuses.
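
Two small pieces of the evidence-gathering step lend themselves to code: building the suffix byte range for a log tail, and pulling failed nodes out of a dbt `run_results.json`. This is a rough sketch under assumptions: the helper names are hypothetical, and the dbt artifact is assumed to follow its usual layout (a top-level `results` list whose entries carry `unique_id`, `status`, and `message`).

```python
import json


def tail_range_header(num_bytes: int = 256 * 1024) -> str:
    """Build a suffix byte range that requests only the last num_bytes of a file."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return f"bytes=-{num_bytes}"


# Statuses treated as failures here (assumption: "error" for models, "fail" for tests).
FAILED_STATUSES = {"error", "fail"}


def failed_dbt_nodes(run_results_text: str) -> list[dict]:
    """Return unique_id, status, and message for each failed node in run_results.json."""
    results = json.loads(run_results_text).get("results", [])
    return [
        {
            "unique_id": r.get("unique_id"),
            "status": r.get("status"),
            "message": r.get("message"),
        }
        for r in results
        if r.get("status") in FAILED_STATUSES
    ]
```

Pass the string from `tail_range_header()` as the `range_header` argument when downloading a large log.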
Read ../../references/orchestra/pipeline/diagnosis-patterns.md before proceeding to Step 4. It contains
integration-specific error patterns that will help classify the failure.
Step 4 (Diagnose)
Goal: Classify the failure and identify the root cause.
This is the analytical step. Using all evidence from Steps 1-3 plus the patterns in
../../references/orchestra/pipeline/diagnosis-patterns.md:
Decide code vs platform. If the failure is ingestion/sync infrastructure or another
vendor-managed integration (Fivetran, Airbyte, Estuary, etc.), surface platformLink and
connectionId and stop — do not open a Git PR. If the failure is in repo SQL, dbt, Python,
or misconfigured pipeline YAML, proceed with remediation. See
../../references/orchestra/pipeline/diagnosis-patterns.md (TOOL_OR_INFRASTRUCTURE).
Classify the error category. Common categories:
- AUTH_FAILURE: credentials expired, rotated, or insufficient permissions
- TIMEOUT: task exceeded configured timeout or underlying platform timed out
- QUERY_ERROR: SQL syntax error, missing table/column, type mismatch
- RESOURCE_CONFLICT: sync job already running, resource locked
- NETWORK_ERROR: firewall, VPN, DNS resolution, connection refused
- CONFIG_ERROR: invalid parameters, missing required fields, wrong environment
- DEPENDENCY_FAILURE: upstream task failed, missing input data
- PLATFORM_ERROR: the underlying platform (Snowflake, dbt Cloud, etc.) had an outage
- CODE_ERROR: Python script error, dbt model compilation failure
- RATE_LIMIT: API rate limit hit on the underlying platform
- DATA_ERROR: data quality test failure, schema drift, unexpected nulls

Identify the root cause. Be specific. Not just "query error" but "column user_email does not exist in table analytics.users, likely because a schema migration removed or renamed the column."
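
As a first pass, an error message can be mapped to one of the categories above with simple keyword rules. The sketch below is illustrative only: the keyword lists are assumptions made up for this example, not the patterns in diagnosis-patterns.md, and a result of `None` means the message needs a human-style read of the logs.

```python
from typing import Optional

# Ordered (category, keywords) rules; the first match wins.
# These keywords are hypothetical examples, not Orchestra's own patterns.
RULES: list[tuple[str, tuple[str, ...]]] = [
    ("AUTH_FAILURE", ("authentication failed", "invalid credentials", "permission denied", "unauthorized")),
    ("RATE_LIMIT", ("rate limit", "too many requests")),
    ("TIMEOUT", ("timed out", "timeout")),
    ("NETWORK_ERROR", ("connection refused", "could not resolve host")),
    ("QUERY_ERROR", ("sql compilation error", "syntax error", "does not exist")),
]


def classify_error(message: str) -> Optional[str]:
    """Return the first category whose keywords appear in the message, else None."""
    text = message.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return None
```

A keyword match only narrows the category; the root cause still has to name the specific object, column, or credential involved.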
(Optional) Recall past fixes. Past-fix memory is deferred to the calling agentic
client. If your client exposes persistent memory (e.g. Claude Code memory, Cursor
rules/memories), check it first for similar past fixes. As a fallback, read a local
../../references/orchestra/pipeline/knowledge-store.md if the user keeps one — it ships
empty and may not exist. This step is optional: skip it entirely when no memory is
available. Treat any recalled entry as historical context and re-verify it still applies
before acting on it.
Present the diagnosis clearly to the user, following the diagnosis template given later in this skill.
Read ../../references/orchestra/pipeline/remediation-playbooks.md before proceeding to Step 5.
Step 5
Goal: Fix the issue or tell the user exactly what to do.
Based on the diagnosis, consult ../../references/orchestra/pipeline/remediation-playbooks.md and take action:
Fixes the agent can apply directly:
- Use start_pipeline to rerun the pipeline.
- Use update_pipeline to fix configuration errors like wrong parameters, missing environment variables, or incorrect task ordering.
- Use run_inputs in start_pipeline to override problematic input values.
- For code changes in Git-backed pipelines, use the gh CLI to create a pull request with the fix directly. Do not ask the user to make the change themselves. Workflow:
  - Create a fix branch (e.g. fix/missing-pyproject-toml) and commit the change.
  - Open the pull request with gh pr create targeting the failing branch.

Fixes that require user action (explain clearly, but still poll after the PR if one was opened) are things the agent cannot do itself, such as credential rotation, firewall changes, and permission grants.
Always explain what you're doing and why before taking action.
Step 5b
Goal: Watch the PR and trigger the pipeline rerun once it merges, without making the user babysit it.
After sharing the PR URL, emit one status line and keep watching the PR until it reaches a terminal state (or the user asks you to stop):
⏳ PR #178 open — checking every 60 s; will trigger the pipeline on merge.
Polling loop:
Check PR state:
gh pr view {pr_number} --repo {owner/repo} --json state,mergedAt
If state == "MERGED": Proceed immediately to Step 6 — trigger start_pipeline
using the original pipeline ID and environment. No confirmation needed (the user
already approved the fix by merging the PR).
If state == "CLOSED" (not merged): The PR was closed without merging. Report
this and ask the user how to proceed — do not auto-retry.
If state == "OPEN": Wait ~60 seconds, then check again; after several checks with
no merge, widen the interval to a few minutes. Use whatever scheduling mechanism your
client provides — if it can re-invoke you on a timer, schedule the next check and hand
back control; otherwise keep polling in the same conversation. Either way, retain the PR
number, repo, pipeline ID, and environment so each check resumes the same fix workflow.
Polling output format (one line per check, not a full summary):
⏳ PR #178 — OPEN (2 min elapsed, next check in 60 s)
⏳ PR #178 — OPEN (3 min elapsed, next check in 60 s)
✅ PR #178 — MERGED — triggering pipeline rerun…
Do not re-diagnose or re-explain the fix on each poll tick. One line only.
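
The polling loop above can be expressed as three small functions: fetch the PR state, decide what to do, and pick the next wait. This is a sketch under assumptions. `fetch_pr_state` shells out to the same `gh pr view` command shown above and needs an authenticated `gh` on the PATH; the action names, the threshold of five fast checks, and the 180-second wider interval are hypothetical choices standing in for "several checks" and "a few minutes".

```python
import json
import subprocess


def fetch_pr_state(pr_number: int, repo: str) -> str:
    """Run the gh command from the polling loop and return OPEN, CLOSED, or MERGED."""
    completed = subprocess.run(
        ["gh", "pr", "view", str(pr_number), "--repo", repo, "--json", "state,mergedAt"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)["state"]


def next_action(state: str) -> str:
    """Map a PR state to the action described in the polling loop."""
    if state == "MERGED":
        return "trigger_pipeline"  # proceed to Step 6, no confirmation needed
    if state == "CLOSED":
        return "ask_user"  # closed without merging: do not auto-retry
    if state == "OPEN":
        return "wait"
    raise ValueError(f"unexpected PR state: {state!r}")


def poll_interval_seconds(checks_done: int, fast_checks: int = 5, slow_interval: int = 180) -> int:
    """Wait 60 s for the first few checks, then widen the interval (assumed values)."""
    return 60 if checks_done < fast_checks else slow_interval
```

Keeping the decision logic separate from the `gh` call makes it usable both in a same-conversation loop and in a client that re-invokes the agent on a timer.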
Step 6
Goal: Confirm the fix worked.
This step is entered either (a) directly after a non-PR fix, or (b) automatically from Step 5b once the PR is merged.
- Trigger the rerun with start_pipeline.
- Pass run_inputs if applicable.
- Poll get_pipeline_run_status every ~30 seconds.

Persisting fixes is optional and deferred to the calling agentic client. Only do it when the user wants a durable record; it is off by default, and nothing workspace-specific should be committed to this repository.
- If your client exposes persistent memory, record the fix there.
- If the user keeps a local ../../references/orchestra/pipeline/knowledge-store.md, append an entry using the template at the bottom of that file. The published file ships empty.

When you do record a fix, capture: date, pipeline name, error category, integration, root cause, fix applied, and whether the first diagnosis was correct.
If you discover a genuinely new, generic diagnosis pattern, consider noting it in
../../references/orchestra/pipeline/diagnosis-patterns.md — but keep workspace-specific detail
(pipeline IDs, connection names, account identifiers) out of shared reference files.
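
The fields to capture for a recorded fix can be held in a small record type, with a guard that flags UUIDs before anything is written to a shared reference file. This is only a sketch: `FixRecord`, `contains_uuid`, and the JSON rendering are hypothetical, and the real entry layout is whatever template the knowledge-store file defines.

```python
import json
import re
from dataclasses import asdict, dataclass

UUID_ANYWHERE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


@dataclass
class FixRecord:
    """One recorded fix, with the fields this skill asks to capture."""
    recorded_on: str  # date, e.g. an ISO 8601 string
    pipeline_name: str
    error_category: str
    integration: str
    root_cause: str
    fix_applied: str
    first_diagnosis_correct: bool

    def to_json(self) -> str:
        """Serialize the record for a client memory store or a local file."""
        return json.dumps(asdict(self), indent=2)


def contains_uuid(text: str) -> bool:
    """Flag workspace-specific IDs that should stay out of shared reference files."""
    return UUID_ANYWHERE.search(text) is not None
```

A record that trips `contains_uuid` belongs in client memory or a local knowledge store, not in a shared pattern file.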
Be succinct. Users are engineers dealing with broken pipelines — they want facts and actions, not explanations. No preamble, no summaries of what you just did, no "great news". If the answer fits in one line, use one line. Cut any sentence that doesn't add new information.
All user-facing output must follow these templates exactly. Consistent structure makes it easy to scan at a glance — especially when there are multiple failures to triage.
Use a table. One row per distinct pipeline (deduplicate by pipelineId). Newest failure first.
After the table, add a one-line callout for any that are feature branch runs (skip them).
## Failing Pipelines — Workspace Triage
| Pipeline | Category | Failing Task | Root Cause |
|---|---|---|---|
| `name` | `CATEGORY` | Task name | One-line plain-English description |
> **Skipped (feature branch):** `pipeline-name` — failures are on branch `x`, main is healthy.
Always end the triage with a prompt offering to dig into specific pipelines:
Which would you like me to investigate? I'd suggest starting with X because Y.
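
The triage rules above (one row per pipelineId, newest failure first) can be sketched as follows. The function names and the row dictionary keys are hypothetical, and the sort assumes `completedAt` values are ISO 8601 timestamps in a single consistent format so that string comparison matches time order.

```python
def latest_failure_per_pipeline(runs: list[dict]) -> list[dict]:
    """Keep the most recent failed run per pipelineId, sorted newest first."""
    latest: dict[str, dict] = {}
    for run in runs:
        pipeline_id = run["pipelineId"]
        current = latest.get(pipeline_id)
        if current is None or run["completedAt"] > current["completedAt"]:
            latest[pipeline_id] = run
    return sorted(latest.values(), key=lambda r: r["completedAt"], reverse=True)


def render_triage_table(rows: list[dict]) -> list[str]:
    """Render the triage table lines; each row needs pipeline, category, task, root_cause."""
    lines = [
        "| Pipeline | Category | Failing Task | Root Cause |",
        "|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            f"| `{row['pipeline']}` | `{row['category']}` | {row['task']} | {row['root_cause']} |"
        )
    return lines
```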
Use a consistent header and structured block every time, then evidence and confidence.
## Diagnosis: `<pipeline name>`
**Error category:** CATEGORY
**Root cause:** One specific sentence — not "query error" but exactly which object/column/table.
**Integration:** DBT_CORE / DBT_CORE_EXECUTE (or similar)
**Connection:** connection-id-here
**Confidence:** High / Medium / Low
**Evidence:**
- Exact log line or error message (quoted)
- Which operation or model failed
- Any corroborating signals (exit code, attempt count, etc.)
Always present options as a numbered list with a clear owner label on each:
**Fix options:**
1. **[Agent can apply]** Short description of what will be done and why it fixes the issue.
2. **[Agent opens a PR]** For code changes in Git-backed pipelines — describe what file/change
will be committed and which branch the PR targets.
3. **[User action needed]** Only for things the agent genuinely cannot do: credential rotation,
firewall changes, permission grants. Include the specific UI path, command, or SQL.
4. **[Needs more info]** What you'd need to know to proceed (e.g. "Does the EVENTS table
exist? Run: SELECT COUNT(*) FROM SNOWFLAKE_WORKING.PUBLIC.EVENTS").
Recommended: Option N — one sentence on why this is the right call.
Never present options without a recommendation. Never use vague labels like "you could try".
Never label a code fix as [User action needed] — open the PR instead.
During polling, emit a single line per status check. Do not repeat the full context.
▶ Run e2049b86 — RUNNING (0:42 elapsed)
▶ Run e2049b86 — RUNNING (1:30 elapsed)
✅ Run e2049b86 — SUCCEEDED (2:15 elapsed)
On failure during retry, immediately switch to diagnosis format (don't just say "it failed").
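
The one-line status format above is easy to get subtly wrong (run ID length, zero-padded seconds), so a small formatter helps keep it consistent. This is a sketch: `format_status_line` is a hypothetical helper, and it assumes the short run ID shown in the template is the first eight characters of the run UUID.

```python
EM_DASH = "\u2014"  # the dash character used in the status-line template


def format_status_line(run_id: str, status: str, elapsed_seconds: int) -> str:
    """Format one polling line, e.g. for RUNNING or SUCCEEDED runs."""
    icon = "✅" if status == "SUCCEEDED" else "▶"
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    # Assumption: the template's short ID is the first 8 characters of the UUID.
    return f"{icon} Run {run_id[:8]} {EM_DASH} {status} ({minutes}:{seconds:02d} elapsed)"
```

A failed retry should not go through this formatter; switch to the diagnosis format instead.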
Always end a successful fix with a compact summary block:
## Fixed: `<pipeline name>`
- **Run:** `<new-run-id>` — SUCCEEDED
- **Root cause:** (one line)
- **Fix applied:** (one line)
- **Duration:** X min
- **Recorded:** Saved to client memory ✓ (only if the user opted in — omit this line otherwise)
Pipelines that are not Git-backed can be changed directly with update_pipeline.
Git-backed pipelines require code changes committed to the repository. Check storageProvider
in the list_pipelines response to determine which type it is.