Skill

fix-orchestra-pipeline

Automatically diagnose and fix failed Orchestra data pipelines. Use this skill whenever a user mentions Orchestra pipeline failures, broken pipelines, pipeline errors, task failures, pipeline debugging, or wants to understand why an Orchestra pipeline run failed. Also trigger when the user says things like "fix my pipeline", "what's broken", "why did my pipeline fail", "debug this run", "retry my pipeline", or references Orchestra pipeline runs, task runs, or pipeline errors. This skill handles the full lifecycle: identify failures, fetch logs and artifacts, diagnose the root cause, apply fixes where possible, retry, and learn from past fixes. It supports all Orchestra integrations including dbt, Snowflake, Python, HTTP, Fivetran, Airbyte, and more. Trigger this skill even if the user just pastes an Orchestra error message, pipeline run URL, pipeline run link from the Orchestra UI, a UUID, a Slack alert, or a pipeline name/alias.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/orchestra:fix-orchestra-pipeline

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Diagnose, fix, and retry failed Orchestra pipelines — and optionally remember what worked.

SKILL.md

424 lines · ~4.7k tokens

Stats

Parent stars: 0

Maintenance: Good

Last commit: Jun 11, 2026

Fix Orchestra Pipeline

Diagnose, fix, and retry failed Orchestra pipelines — and optionally remember what worked.

Prerequisites

This skill assumes the Orchestra MCP server is connected. All MCP calls are scoped to the user's workspace.

MCP tools

Use Orchestra MCP tools for all operations in this skill; for example, get_pipeline returns a pipeline's full definition. Argument summaries are in ../../references/orchestra/mcp/tools-quick-ref.md.

Parsing user input

The user may provide their problem in several forms. Parse the input before entering the workflow:

Orchestra URLs

Users often paste a link from the Orchestra UI. Extract IDs from any of these URL patterns:

https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}
https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}/lineage
https://app.getorchestra.io/pipeline-runs/{pipeline_run_id}/task-runs/{task_run_id}
https://app.getorchestra.io/pipelines/{pipeline_id}/runs/{pipeline_run_id}
https://app.getorchestra.io/pipelines/{pipeline_id}

The IDs are UUIDs (e.g. 123e4567-e89b-12d3-a456-426614174000). If a URL contains a pipeline_run_id, skip straight to Step 2. If it only contains a pipeline_id, query for the latest failed run of that pipeline.

For custom/self-hosted Orchestra instances, the base domain may differ (e.g. https://orchestra.company.com/...). The path structure is the same — extract UUIDs from path segments.
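As a rough sketch of the extraction described above, the path can be walked in (segment, next segment) pairs so the hostname never matters. The function name extract_ids and the returned key names are illustrative choices, not part of Orchestra's API.

```python
import re
from typing import Dict
from urllib.parse import urlparse

# Canonical 8-4-4-4-12 hex UUID, matching the example ID format above.
UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Path segment -> name of the ID that follows it, per the URL patterns above.
# "runs" covers the /pipelines/{pipeline_id}/runs/{pipeline_run_id} form.
SEGMENT_TO_KEY = {
    "pipeline-runs": "pipeline_run_id",
    "runs": "pipeline_run_id",
    "task-runs": "task_run_id",
    "pipelines": "pipeline_id",
}


def extract_ids(url: str) -> Dict[str, str]:
    """Return whichever Orchestra IDs appear in the URL path.

    Only the path is inspected, so custom or self-hosted domains work unchanged.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    ids: Dict[str, str] = {}
    for name, value in zip(segments, segments[1:]):
        key = SEGMENT_TO_KEY.get(name)
        if key and re.fullmatch(UUID, value):
            ids[key] = value.lower()
    return ids
```

If the result contains pipeline_run_id, go to Step 2; if it contains only pipeline_id, look up the latest failed run.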

Raw UUIDs

If the user pastes a bare UUID, try it as a pipeline run ID first with get_pipeline_run_status. If that returns a result, proceed to Step 2. If not, treat it as a pipeline ID and query recent runs with list_pipeline_runs.
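A minimal sketch of that fallback order. The get_run_status callable is a stand-in for however your client invokes get_pipeline_run_status; the assumption that a miss surfaces as None (rather than an error) is ours, so adapt it to the real response.

```python
import uuid
from typing import Callable, Optional, Tuple


def resolve_bare_uuid(
    value: str,
    get_run_status: Callable[[str], Optional[dict]],
) -> Tuple[str, str]:
    """Decide whether a bare UUID is a pipeline run ID or a pipeline ID.

    `get_run_status` is a hypothetical wrapper around the
    get_pipeline_run_status MCP tool, assumed to return None on a miss.
    Raises ValueError if `value` is not a UUID at all.
    """
    candidate = str(uuid.UUID(value.strip()))
    if get_run_status(candidate) is not None:
        return ("pipeline_run_id", candidate)  # proceed to Step 2
    return ("pipeline_id", candidate)  # query recent runs with list_pipeline_runs
```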

Pipeline names or aliases

If the user says "fix the daily-etl pipeline", search for it with list_pipelines and match by name or alias, then query for its latest failed run.

Error messages

If the user pastes an error message or log snippet, skip to Step 4 (Diagnose) — you already have the evidence. Ask for the pipeline run ID only if you need to fetch additional context like logs or artifacts.

Slack / alert messages

Orchestra alert messages (from Slack, Teams, email, webhooks) typically contain the pipeline name, task name, and status. Extract these and use them to find the corresponding pipeline run via list_pipeline_runs/list_task_runs.

Workflow overview

Execute these steps in order. Each step feeds the next. If the parsed input provides enough information to skip ahead, jump to the relevant step.

Step 1 — Identify failed pipeline runs

Goal: Find which pipeline runs have failed recently.

If the user provided a pipeline run ID or URL: Skip to Step 2 using the extracted ID.

If the user said "what's broken" or similar: Query for recent failures:

Call list_pipeline_runs with status=FAILED
Default to the last 7 days if no time range specified
Present results as a concise summary: pipeline name, when it failed, trigger type, message

If multiple failures exist: Ask the user which one to investigate, or offer to triage all of them starting with the most recent.

Key fields from the response:

id — the pipeline run ID (needed for all subsequent steps)
pipelineName — human-readable name
pipelineId — the pipeline definition ID
message — Orchestra's summary of what happened
triggeredBy — what started the run (cron, sensor, manual, webhook)
completedAt — when it finished failing
envName — which environment (Production, Staging, etc.)
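A small sketch of the default time window and the one-line summary, assuming each run arrives as a dict keyed by the field names listed above. The helper names and the " | " separator are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def default_window(now: Optional[datetime] = None, days: int = 7) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the last `days` days, 7 by default."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(days=days), end


def summarize_run(run: dict) -> str:
    """Render one failed run as a single line: name, time, trigger, message."""
    return " | ".join(
        [
            run["pipelineName"],
            f"failed {run['completedAt']}",
            f"trigger: {run['triggeredBy']}",
            run["message"],
        ]
    )
```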

Step 2 — Get failed task runs

Goal: Identify exactly which task(s) within the pipeline failed.

Call list_task_runs filtered by the failed pipeline run IDs and status=FAILED
Also fetch status=WARNING task runs — they may contain useful context

Key fields from each task run:

id — task run ID (needed for logs/artifacts)
taskName — human-readable task name
taskId — the task identifier in the pipeline YAML
integration — which integration (e.g. SNOWFLAKE, DBT_CORE, HTTP, PYTHON)
integrationJob — the specific job type (e.g. SNOWFLAKE_RUN_QUERY, DBT_CORE_RUN_COMMAND)
status — FAILED or WARNING
message — Orchestra's task-level message
externalStatus — the status from the underlying platform (e.g. HTTP 500, dbt error code)
externalMessage — the platform's error message
taskParameters — what was configured on the task
runParameters — runtime parameters including connection details
connectionId — which credential/connection was used
numberOfAttempts — how many times Orchestra retried

Present the findings: Show the user which tasks failed, in what order, and their error messages. If the pipeline has multiple tasks, note which ones succeeded (they ran before the failure point) and which were skipped (downstream of the failure).

Step 3 — Fetch diagnostics

Goal: Get the raw evidence — logs, artifacts, and operations.

For each failed task run:

Logs: Call list_task_run_logs to list available log files. Then fetch each log with download_task_run_log. Focus on the last ~256KB of large logs using range_header (for example bytes=-262144).
Artifacts: Call list_task_run_artifacts. For dbt tasks, look for run_results.json and manifest.json. Download relevant artifacts with download_task_run_artifact.
Operations: Call list_operations filtered by task_run_id to see sub-operations (individual dbt models, Snowflake queries, etc.) and their statuses.
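Two helpers that may be useful here, offered as a sketch. The first builds the suffix byte range for the tail of a large log (256 KiB gives the bytes=-262144 value above). The second pulls failing nodes out of a parsed dbt run_results.json, relying on the standard results[].status, unique_id, and message fields. Function names are our own.

```python
from typing import List, Tuple


def tail_range_header(kib: int = 256) -> str:
    """Build an HTTP suffix byte range selecting the last `kib` KiB."""
    if kib <= 0:
        raise ValueError("kib must be positive")
    return f"bytes=-{kib * 1024}"


def failed_dbt_nodes(run_results: dict) -> List[Tuple[str, str, str]]:
    """Return (unique_id, status, message) for dbt nodes that errored or failed.

    `run_results` is the parsed content of run_results.json.
    """
    failed = []
    for result in run_results.get("results", []):
        if result.get("status") in {"error", "fail"}:
            failed.append(
                (result.get("unique_id", ""), result["status"], result.get("message") or "")
            )
    return failed
```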

Read ../../references/orchestra/pipeline/diagnosis-patterns.md before proceeding to Step 4. It contains integration-specific error patterns that will help classify the failure.

Step 4 — Diagnose the error

Goal: Classify the failure and identify the root cause.

This is the analytical step. Using all evidence from Steps 1-3 plus the patterns in ../../references/orchestra/pipeline/diagnosis-patterns.md:

Decide code vs platform. If the failure is ingestion/sync infrastructure or another vendor-managed integration (Fivetran, Airbyte, Estuary, etc.), surface platformLink and connectionId and stop — do not open a Git PR. If the failure is in repo SQL, dbt, Python, or misconfigured pipeline YAML, proceed with remediation. See ../../references/orchestra/pipeline/diagnosis-patterns.md (TOOL_OR_INFRASTRUCTURE).
Classify the error category. Common categories:
- AUTH_FAILURE — credentials expired, rotated, or insufficient permissions
- TIMEOUT — task exceeded configured timeout or underlying platform timed out
- QUERY_ERROR — SQL syntax error, missing table/column, type mismatch
- RESOURCE_CONFLICT — sync job already running, resource locked
- NETWORK_ERROR — firewall, VPN, DNS resolution, connection refused
- CONFIG_ERROR — invalid parameters, missing required fields, wrong environment
- DEPENDENCY_FAILURE — upstream task failed, missing input data
- PLATFORM_ERROR — the underlying platform (Snowflake, dbt Cloud, etc.) had an outage
- CODE_ERROR — Python script error, dbt model compilation failure
- RATE_LIMIT — API rate limit hit on the underlying platform
- DATA_ERROR — data quality test failure, schema drift, unexpected nulls
Identify the root cause. Be specific. Not just "query error" but "column user_email does not exist in table analytics.users — likely a schema migration that removed or renamed the column."
(Optional) Recall past fixes. Past-fix memory is deferred to the calling agentic client. If your client exposes persistent memory (e.g. Claude Code memory, Cursor rules/memories), check it first for similar past fixes. As a fallback, read a local ../../references/orchestra/pipeline/knowledge-store.md if the user keeps one — it ships empty and may not exist. This step is optional: skip it entirely when no memory is available. Treat any recalled entry as historical context and re-verify it still applies before acting on it.
Present the diagnosis clearly to the user:
- Error category
- Root cause (specific)
- Evidence (which log line, which error message, which operation failed)
- Confidence level (high/medium/low)
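For a first-pass guess at the category, a keyword lookup like the one below can help before reading the full patterns file. The regexes are illustrative assumptions, not the contents of diagnosis-patterns.md, and they cover only a few of the categories; treat a match as a hint and a None as "read the logs".

```python
import re
from typing import Optional

# Ordered rules: first match wins. Patterns are illustrative guesses only.
RULES = [
    ("AUTH_FAILURE", r"unauthori[sz]ed|authentication failed|invalid credentials|permission denied|\b40[13]\b"),
    ("RATE_LIMIT", r"rate limit|too many requests|\b429\b"),
    ("TIMEOUT", r"timed out|timeout"),
    ("NETWORK_ERROR", r"connection refused|name resolution|\bdns\b|unreachable"),
    ("RESOURCE_CONFLICT", r"already running|locked"),
    ("QUERY_ERROR", r"sql compilation error|syntax error|does not exist|invalid identifier"),
]


def guess_category(message: str) -> Optional[str]:
    """Return a likely error category for `message`, or None if nothing matches."""
    text = message.lower()
    for category, pattern in RULES:
        if re.search(pattern, text):
            return category
    return None
```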

Read ../../references/orchestra/pipeline/remediation-playbooks.md before proceeding to Step 5.

Step 5 — Apply the fix

Goal: Fix the issue or tell the user exactly what to do.

Based on the diagnosis, consult ../../references/orchestra/pipeline/remediation-playbooks.md and take action:

Fixes the agent can apply directly:

Retry — if the error is transient (timeout, rate limit, platform blip), trigger a re-run with start_pipeline
Update pipeline YAML — if the pipeline is Orchestra-backed (not Git-backed), use update_pipeline to fix configuration errors like wrong parameters, missing environment variables, or incorrect task ordering
Re-run with different inputs — use run_inputs in start_pipeline to override problematic input values
Open a PR — for any code fix in a Git-backed pipeline (missing file, wrong config, dbt model error, etc.), use the gh CLI to create a pull request with the fix directly. Do not ask the user to make the change themselves. Workflow:
1. Clone the repo (or use the working directory if already cloned)
2. Check out a new branch off the failing branch (e.g. fix/missing-pyproject-toml)
3. Apply the fix
4. Commit and push
5. Open a PR via gh pr create targeting the failing branch
6. Share the PR URL with the user
7. Immediately begin polling the PR — proceed to Step 5b without waiting for the user
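The PR workflow above could be scripted roughly as follows. This sketch only builds the argument lists (pass each to subprocess.run with check=True to execute). It assumes the remote is named origin and reuses the PR title as the commit message; both are our choices, not requirements.

```python
from typing import List


def branch_commands(failing_branch: str, fix_branch: str) -> List[List[str]]:
    """Commands to run before editing: branch off the failing branch."""
    return [
        ["git", "fetch", "origin", failing_branch],
        ["git", "checkout", "-b", fix_branch, f"origin/{failing_branch}"],
    ]


def publish_commands(failing_branch: str, fix_branch: str, title: str, body: str) -> List[List[str]]:
    """Commands to run after editing: commit, push, and open the PR."""
    return [
        ["git", "add", "--all"],  # also stages newly created files
        ["git", "commit", "-m", title],
        ["git", "push", "--set-upstream", "origin", fix_branch],
        [
            "gh", "pr", "create",
            "--base", failing_branch,  # PR targets the failing branch
            "--head", fix_branch,
            "--title", title,
            "--body", body,
        ],
    ]
```

Apply the fix between the two command groups, then share the URL that gh pr create prints.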

Fixes that require user action (explain clearly — but still poll after PR if one was opened):

Credential rotation — tell them exactly which connection to update and where in the UI
Firewall/network — provide the Orchestra IPs to whitelist
Permission grants — show the exact GRANT statement or IAM policy change

Always explain what you're doing and why before taking action.

Step 5b — Poll the PR and trigger the retry on merge

Goal: Watch the PR and trigger the pipeline rerun once it merges — without making the user babysit it.

After sharing the PR URL, emit one status line and keep watching the PR until it reaches a terminal state (or the user asks you to stop):

⏳ PR #178 open — checking every 60 s; will trigger the pipeline on merge.

Polling loop:

Check PR state:

gh pr view {pr_number} --repo {owner/repo} --json state,mergedAt

If state == "MERGED": Proceed immediately to Step 6 — trigger start_pipeline using the original pipeline ID and environment. No confirmation needed (the user already approved the fix by merging the PR).
If state == "CLOSED" (not merged): The PR was closed without merging. Report this and ask the user how to proceed — do not auto-retry.
If state == "OPEN": Wait ~60 seconds, then check again; after several checks with no merge, widen the interval to a few minutes. Use whatever scheduling mechanism your client provides — if it can re-invoke you on a timer, schedule the next check and hand back control; otherwise keep polling in the same conversation. Either way, retain the PR number, repo, pipeline ID, and environment so each check resumes the same fix workflow.
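The three branches above reduce to a small decision function over the JSON that gh pr view prints. The returned action labels, and the specific numbers used for widening the interval (after 5 checks, to 180 seconds), are assumptions chosen to illustrate "several checks" and "a few minutes".

```python
import json


def next_action(gh_json: str) -> str:
    """Map `gh pr view --json state,mergedAt` output to the next action."""
    state = json.loads(gh_json)["state"]
    if state == "MERGED":
        return "trigger_rerun"  # proceed to Step 6
    if state == "CLOSED":
        return "ask_user"  # closed without merging: do not auto-retry
    if state == "OPEN":
        return "wait"
    raise ValueError(f"unexpected PR state: {state!r}")


def poll_interval(checks_done: int, base: int = 60, widened: int = 180, widen_after: int = 5) -> int:
    """Seconds to wait before the next check; widens after `widen_after` checks."""
    return base if checks_done < widen_after else widened
```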

Polling output format (one line per check, not a full summary):

⏳ PR #178 — OPEN (2 min elapsed, next check in 60 s)
⏳ PR #178 — OPEN (3 min elapsed, next check in 60 s)
✅ PR #178 — MERGED — triggering pipeline rerun…

Do not re-diagnose or re-explain the fix on each poll tick. One line only.

Step 6 — Retry and monitor

Goal: Confirm the fix worked.

This step is entered either (a) directly after a non-PR fix, or (b) automatically from Step 5b once the PR is merged.

Trigger a new pipeline run via start_pipeline
- Use the same pipeline ID and environment as the failed run
- Pass any corrected run_inputs if applicable
- For PR-triggered reruns, no confirmation needed — merge was the approval
Poll get_pipeline_run_status every ~30 seconds
Report the outcome:
- Succeeded: Optionally record the fix (see Learning loop below).
- Failed again (same error): The fix didn't work. Go back to Step 4 with new evidence.
- Failed (different error): New problem uncovered. Restart from Step 3.
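The outcome routing above, sketched as a function. Comparing errors by whitespace- and case-normalised equality is a simplification we assume here; in practice two messages for the same root cause can differ in IDs or timestamps, so treat "different" with some judgement.

```python
def _normalize(message: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(message.lower().split())


def next_step(status: str, original_error: str, new_error: str = "") -> str:
    """Route the retry outcome to the next workflow step."""
    if status == "SUCCEEDED":
        return "record_fix_if_opted_in"
    if status == "FAILED":
        if _normalize(new_error) == _normalize(original_error):
            return "step_4"  # same error: the fix didn't work, re-diagnose
        return "step_3"  # different error: gather fresh diagnostics
    return "keep_polling"  # e.g. still RUNNING
```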

Learning loop — (Optional) Record what you learned

Persisting fixes is optional and deferred to the calling agentic client. Only do it when the user wants a durable record — it is off by default, and nothing workspace-specific should be committed to this repository.

Preferred: if your client exposes persistent memory (Claude Code memory, Cursor rules/memories), save a short note there so future runs can recall it.
Fallback: if the user keeps a local ../../references/orchestra/pipeline/knowledge-store.md, append an entry using the template at the bottom of that file. The published file ships empty.

When you do record a fix, capture: date, pipeline name, error category, integration, root cause, fix applied, and whether the first diagnosis was correct.
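One possible shape for such a record, as a sketch. The class name, field names, and the key: value note layout are hypothetical; if the user's knowledge-store.md has its own template, follow that instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FixRecord:
    """The fields listed above; the names here are illustrative."""

    recorded_on: date
    pipeline_name: str
    error_category: str
    integration: str
    root_cause: str
    fix_applied: str
    first_diagnosis_correct: bool

    def to_note(self) -> str:
        """Render a short plain-text note suitable for client memory."""
        return "\n".join(
            [
                f"date: {self.recorded_on.isoformat()}",
                f"pipeline: {self.pipeline_name}",
                f"category: {self.error_category}",
                f"integration: {self.integration}",
                f"root cause: {self.root_cause}",
                f"fix: {self.fix_applied}",
                f"first diagnosis correct: {'yes' if self.first_diagnosis_correct else 'no'}",
            ]
        )
```

Keep workspace-specific identifiers out of anything that ends up in a shared file.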

If you discover a genuinely new, generic diagnosis pattern, consider noting it in ../../references/orchestra/pipeline/diagnosis-patterns.md — but keep workspace-specific detail (pipeline IDs, connection names, account identifiers) out of shared reference files.

Output formatting

Be succinct. Users are engineers dealing with broken pipelines — they want facts and actions, not explanations. No preamble, no summaries of what you just did, no "great news". If the answer fits in one line, use one line. Cut any sentence that doesn't add new information.

All user-facing output must follow these templates exactly. Consistent structure makes it easy to scan at a glance — especially when there are multiple failures to triage.

Triage view (Step 1 — multiple pipelines)

Use a table. One row per distinct pipeline (deduplicate by pipelineId). Newest failure first. After the table, add a one-line callout for any that are feature branch runs (skip them).

## Failing Pipelines — Workspace Triage

| Pipeline | Category | Failing Task | Root Cause |
|---|---|---|---|
| `name` | `CATEGORY` | Task name | One-line plain-English description |

> **Skipped (feature branch):** `pipeline-name` — failures are on branch `x`, main is healthy.

Always end the triage with a prompt offering to dig into specific pipelines:

Which would you like me to investigate? I'd suggest starting with X because Y.
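A sketch of building the table rows: keep the newest failure per pipelineId, sort newest first, and emit the column layout from the template. It assumes completedAt values are ISO 8601 strings in one consistent format (so string comparison orders them correctly); the category and rootCause keys are hypothetical fields you would fill in from your own diagnosis.

```python
from typing import Dict, List


def triage_table(failures: List[dict]) -> str:
    """Render one table row per distinct pipeline, newest failure first."""
    latest: Dict[str, dict] = {}
    for failure in failures:
        seen = latest.get(failure["pipelineId"])
        # Keep only the most recent failure for each pipeline.
        if seen is None or failure["completedAt"] > seen["completedAt"]:
            latest[failure["pipelineId"]] = failure

    rows = sorted(latest.values(), key=lambda f: f["completedAt"], reverse=True)
    lines = [
        "| Pipeline | Category | Failing Task | Root Cause |",
        "|---|---|---|---|",
    ]
    for f in rows:
        lines.append(
            f"| `{f['pipelineName']}` | `{f['category']}` | {f['taskName']} | {f['rootCause']} |"
        )
    return "\n".join(lines)
```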

Single pipeline diagnosis (Step 4)

Use a consistent header and structured block every time, then evidence and confidence.

## Diagnosis: `<pipeline name>`

**Error category:** CATEGORY
**Root cause:** One specific sentence — not "query error" but exactly which object/column/table.
**Integration:** DBT_CORE / DBT_CORE_EXECUTE (or similar)
**Connection:** connection-id-here
**Confidence:** High / Medium / Low

**Evidence:**
- Exact log line or error message (quoted)
- Which operation or model failed
- Any corroborating signals (exit code, attempt count, etc.)

Fix options (Step 5)

Always present options as a numbered list with a clear owner label on each:

**Fix options:**

1. **[Agent can apply]** Short description of what will be done and why it fixes the issue.
2. **[Agent opens a PR]** For code changes in Git-backed pipelines — describe what file/change
   will be committed and which branch the PR targets.
3. **[User action needed]** Only for things the agent genuinely cannot do: credential rotation,
   firewall changes, permission grants. Include the specific UI path, command, or SQL.
4. **[Needs more info]** What you'd need to know to proceed (e.g. "Does the EVENTS table
   exist? Run: SELECT COUNT(*) FROM SNOWFLAKE_WORKING.PUBLIC.EVENTS").

Recommended: Option N — one sentence on why this is the right call.

Never present options without a recommendation. Never use vague labels like "you could try". Never label a code fix as [User action needed] — open the PR instead.

Retry status (Step 6)

During polling, emit a single line per status check. Do not repeat the full context.

▶ Run e2049b86 — RUNNING (0:42 elapsed)
▶ Run e2049b86 — RUNNING (1:30 elapsed)
✅ Run e2049b86 — SUCCEEDED (2:15 elapsed)

On failure during retry, immediately switch to diagnosis format (don't just say "it failed").

Resolution summary (after successful fix)

Always end a successful fix with a compact summary block:

## Fixed: `<pipeline name>`

- **Run:** `<new-run-id>` — SUCCEEDED
- **Root cause:** (one line)
- **Fix applied:** (one line)
- **Duration:** X min
- **Recorded:** Saved to client memory ✓ (only if the user opted in — omit this line otherwise)

Important notes

Never expose secrets. Log contents may contain credentials, tokens, or connection strings. Summarise errors without reproducing sensitive values.
Confirm before destructive actions. Always ask the user before retrying a pipeline that writes data (ingestion, materialisation, reverse ETL). Idempotent pipelines (tests, queries) can be retried with just a heads-up.
Respect environments. If the failure was in Production, be extra cautious. Suggest testing the fix in a Development/Staging environment first if one exists.
Rate limit awareness. The Orchestra metadata backend returns 7 days of data by default and paginates results. Don't make excessive MCP calls; batch where possible.
Git-backed vs Orchestra-backed. Only Orchestra-backed pipelines can be updated via update_pipeline. Git-backed pipelines require code changes committed to the repository. Check storageProvider in the list_pipelines response to determine which type it is.
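For the "never expose secrets" note, a best-effort redaction pass over log excerpts might look like this. The patterns are assumptions covering a few common shapes (key=value secrets, bearer tokens, passwords embedded in URLs) and will miss others, so still read what you are about to quote.

```python
import re

# (pattern, replacement) pairs applied in order. Illustrative, not exhaustive.
_PATTERNS = [
    # password=..., token: ..., api_key=... and similar key/value pairs
    (re.compile(r"(?i)\b(password|passwd|pwd|token|secret|api[_-]?key)\b(\s*[=:]\s*)\S+"), r"\1\2[REDACTED]"),
    # Authorization header style bearer tokens
    (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # scheme://user:password@host connection strings
    (re.compile(r"(://[^/\s:@]+):[^@\s/]+@"), r"\1:[REDACTED]@"),
]


def redact(text: str) -> str:
    """Mask likely credentials in `text` before showing it to the user."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```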
