From tsuga
Primary entry point for active-incident investigation, post-incident RCA, or recurring-degradation triage. Use when a monitor fires, a customer reports slow / errored / missing telemetry, an incident is declared (P1 through P5), or someone asks 'what's wrong with X right now?'. Coordinates parallel evidence branches — telemetry sweep (`tsuga` CLI), change correlation (git log / gh pr), analogue search (`$incident-history`), codebase-grep (local repos), challenger review — while tracking hypotheses and evidence gates. Classifies the investigation mode (known-symptom, novel-symptom, broad-degradation, cheat-check) up front so the branch plan matches the scope. Produces an operator-ready verdict with cited evidence, a time-bounded `Latest-cited-evidence:` trailer, and explicit 'insufficient evidence' output when probes don't converge.
How this skill is triggered — by the user, by Claude, or both
Slash command
/tsuga:incident-investigationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Primary entry point. Read-only unless the user explicitly asks for mutations.
Primary entry point. Read-only unless the user explicitly asks for mutations.
This skill covers the full investigation loop — orchestrator-level workflow AND the detailed procedures for each parallel branch. Read selectively based on your role:
## Branch: telemetry sweep. Read its Procedure + Branch output + Branch guardrails. Skip everything above (workflow, ledger, gate, verdict) and the Change-correlation branch — those aren't your job.## Branch: change correlation. Same — read only that section.{{CODEBASES_DIR}} for that literal string; return the file:line and ~5 lines of surrounding context (the enclosing function + nearby conditions that trigger the emission). Nothing else. You do not interpret — you locate.Subagents return branch-shaped output (their section's Branch output contract). The orchestrator synthesizes them into the final verdict.
Minimum viable case:
declared_at is your "now")Accept a human summary or a case manifest (references/case-manifest.md). If scope is unclear, ask for the smallest missing fact — do not launch a generic sweep.
Treat declared_at as the present. The investigation must produce the conclusion a responder could have reached at the moment the page fired, not the one a historian can see after the fix shipped.
declared_at. Not to "verify" a hypothesis. Not to "confirm" a PR's diff. Not to quote a fix PR's title or body. Seeing the fix is not allowed to influence — or validate — the causal chain you return.git log --until=<declared_at>, git log --before=<declared_at>, or inspect only commits whose author/commit date is strictly before it.$gh, filter to merged:<<declared_at> (or created:<<declared_at> when arguing about what existed to be deployed). Treat any PR you happen to see that postdates declared_at as invisible.declared_at, return insufficient evidence with the probes you would run next. That outcome is strictly better than a causal chain contaminated by hindsight.Citing any PR / commit / merge dated at or after declared_at in your final verdict invalidates the result. Verdict must state, once, what its latest cited change is and confirm that date is < declared_at.
Check these during synthesis:
what changed? Deploys, config drift, key rotation, quotas, schema changes explain most incidents.file:line that emits it first, then check whether the PR's diff modifies that exact path.Saw of your diagnostic path. Pull it with tsuga monitors get <id> before running broader log searches."invalid type: map, expected a sequence", the payload being an array is the first thing to check — don't stack indirections on top. If literal reading contradicts your hypothesis, restart from the literal reading.Pick one: outage_RCA | monitoring_watch | setup_onboarding | migration_decommission | targeted_subtask | validation_noise.
Non-outage modes skip the full telemetry sweep and produce task-shaped output.
Healthy-signature fast-path. Before defaulting to outage_RCA, scan the case for these early-exit signals:
resolved / normal / ok AND no downstream user-visible impact in the case hints → pick monitoring_watch, verdict healthy after ONE confirmation query. Do not spawn branches.info / none / empty AND no explicit customer complaint → validation_noise.resolved_at − declared_at < 2 min with no user-visible-impact hint → self-resolved stale alert; verdict healthy.Fast-path triage is cheap: one probe, one classification, publish. Silence after verification IS a valid healthy signal (no errors found = no problem) — just say so explicitly in the verdict and cite the query that confirmed it.
Track five fields, separate:
Do not let failing subsystem silently replace root cause.
If the alert or case manifest references a monitor ID, pull it first — the monitor's own filter + threshold + groupBy IS the telemetry shape that crossed, by definition.
tsuga monitors get <monitor-id>
Treat the monitor query as the first entry in the diagnostic path:
Saw: monitor <id> "<name>" crossed threshold <N> → Check: <monitor filter + aggregation> → Confirms: <actual value in bad window vs control>.
Then re-run that same query against the incident window AND a control window — confirm the cross, quantify the delta. This pins the investigation to the exact signal that fired the page, not a rediscovered one. Do not start with a broad logs search when a specific monitor query is already framed for you.
One sentence of matching is enough to load one. Zero is fine. All-of-them-from-fear is not.
If scope names a specific tech (Postgres, Redis, Kafka, …), the telemetry branch loads its $knowledge-technology reference. You don't need to orchestrate that.
Disjoint goals, run concurrently. Each branch is its own subagent; do not serialize.
$gh, strict mechanism fit (diff must touch emitter line). Full procedure in Branch: change correlation below.$incident-history (prior incident archive mining, if mounted).{{CODEBASES_DIR}} to find the file:line where that signal is emitted. Output: file path, surrounding function context, nearby conditions that trigger the emission. 5 signals → 5 parallel greps — this scales linearly and each one returns a sharp pin, not a vague area.Why codebase-grep matters. Knowing where a signal is emitted lets you:
file:line?)if / match / error-handling block)Without this, you're matching PRs by directory. With it, you're matching PRs by the emitting code path — a much stronger mechanism.
Merge results as they land; don't wait on all.
Cost discipline. Prefer cheap, high-signal probes before expensive ones:
logs new-error-patterns / logs error-pattern-increases (cheap — pre-computed) before logs searchlogs patterns (bounded cluster) before logs search --query '*' (unbounded scan)aggregation scalar before aggregation timeseries at broad group-by limitsservices get for 24h counters before drilling into per-log detailgrep -rn across codebases (cheap, parallel) before elaborate change-correlation hypotheses
A broad logs search with no service / team scope is the most expensive move — save it for when you have a specific error string to chase.Per candidate cause, record:
file:line from codebase-grep)Carry ≥ 2 candidates until one is confirmed or alternatives are falsified.
Steelman alternatives. For each non-leading candidate, write one sentence: "what evidence would I need to see to promote this to the leading hypothesis?" If any of those sentences describes a probe you haven't run yet AND it's cheap, run it before committing to the current leader.
Before locking a verdict, run this check on the leading hypothesis:
tsuga_logs + tsuga_aggregation, or tsuga_aggregation + gh_pr)? A single-source confirmation is too fragile for confirmed.file:line where the observed signal is emitted (from the codebase-grep branch)? If not, spawn that grep now — it's cheap and usually decisive.If any answer is no AND the missing check is cheap (one more aggregation, logs patterns, or codebase grep), loop back to the relevant branch for that specific query before assigning the verdict. Don't run a second full sweep — run one targeted probe.
If the missing check is expensive or the signal is genuinely unreachable, state that explicitly in Open unknowns and downgrade verdict to most likely or insufficient evidence.
Run this procedure on every Validated claim before publishing:
[evidence: <source>] tag.1247) → rewrite claim with the exact value, keep validated.Non-validated and say what would confirm it.Mechanistic-fit check for [evidence: local_git] / [evidence: gh_pr] claims. Temporal correlation + surface match is necessary but NOT sufficient. A claim that a PR caused the incident needs:
F.F emits observation O (confirmed via codebase-grep).O is what the telemetry actually recorded.If the diff is "in the same area" but you can't trace diff → emitter → observation, downgrade the verdict from confirmed root cause to most likely and flag mechanism not fully traced in Open unknowns.
Hallucinated citations are worse than missing ones. When in doubt, demote.
Verdict (pick one):
confirmed root cause — ≥ 2 evidence types AND (direct artifact OR clear trigger with strong symptom alignment)most likely root causesymptom diagnosis only — subsystem known, trigger unknownnot an RCA taskinsufficient evidenceCategory (pick one, orthogonal to verdict):
configuration_error — wrong value, missing env, flag flip, IAM mismatchcode_defect — bug in recently shipped codedata_quality — malformed input, schema drift, upstream API changeresource_exhaustion — memory, CPU, connections, disk, quota, FDsdependency_failure — upstream service, DNS, auth provider, 3rd-party APIinfrastructure — node, network, cloud provider, cert expiryhealthy — alert stale, metric normal, self-recoveredunknown — insufficient evidence to categorizeSubagent scope: if you were spawned as a telemetry-sweep subagent, this section + its Procedure + Branch output + Branch guardrails is everything you need. You don't synthesize a verdict; you surface facts + verbatim signals for codebase-grep + a completeness-check report.
Read-only evidence gathering from Tsuga. This branch does not declare root cause on its own — it produces facts the orchestrator synthesizes.
Time window, scope hint (service / cluster / env / customer / monitor), reported symptom. Missing scope → return only the discovery steps needed to resolve it.
tsuga monitors get <monitor-id> FIRST. Read the monitor's filter, aggregation, threshold, and groupBy — this IS the exact telemetry shape that crossed. Re-run the same query against the incident window AND a control window (same weekday + hour, 7 days earlier). Record the crossed value, the control value, and the ratio.tsuga services list|get. Capture canonical name, env, team, versions, sources, 24h log/trace counts.tsuga defaults. Always set explicit --from, --to, --max-results. For tsuga aggregation, convert windows to epoch seconds.$knowledge-technology reference to target the sweep.CrashLoopBackOff, OOMKilled, throttling, "too many", "insufficient" — spend one probe asking "is there a single config knob that would fix this?" before any elaborate change-correlation. Grep mounted codebases / helm / Pulumi for patterns like *BatchSize, *PoolSize, *Concurrency, *MaxConnections, *FailureThreshold, *InFlightBatches, *MemPoolSize scoped to the affected service.logs new-error-patterns / logs error-pattern-increases (when team scope exists)logs patterns to cluster failure shapeslogs search only after pattern discoverytraces search for exact failing spansaggregation scalar|timeseries for counts, rates, comparisonsmonitors list|get for signal semantics (not live truth)dashboards list|get / quality-reports list as supporting context onlyfile:line. Do not try to explain what a signal means until its emitting code is found.new-error-patterns OR patterns)A metric or log pattern being absent is NOT equivalent to its value being zero. Causes of absence include receiver scope / permission issues, feature not enabled, instrumentation gap, or scrape failure.
When you checked and found nothing, say which: (absent) — did not appear in window | (not instrumented) — scope lacks the receiver | (denied) — permission error | (empty) — query ran, returned 0 rows. Never report silent absence as "metric is zero" in a Validated claim.
Observed symptom: <one sentence>
Monitor anchor: <monitor id + filter + threshold, or (none) if case didn't cite a monitor>
Confirmed failing subsystem: <one sentence, or (unknown)>
Signals that support it:
- <fact with exact value> [evidence: tsuga_logs | tsuga_traces | tsuga_aggregation | tsuga_monitors | service_metadata]
Signals that do not yet support causality:
- <what you checked that was silent>
Control-window comparison: <bad vs good: counts / rates / ratio, or (skipped) with reason>
Verbatim signals for codebase-grep:
- "exact error string 1"
- "exact error string 2"
- metric.name.to_grep
- log pattern
Config-threshold preflight result: <summary, or (N/A — non-capacity symptom)>
Best next non-Tsuga check: <one action, or (none)>
Every claim carries [evidence: …]. No tag = hypothesis, belongs in the non-causal section.
metrics list|get is metadata. Use aggregation for values.symptom diagnosis only when you can't tie the subsystem to a trigger.More detail on tsuga command patterns: references/tsuga-rules.md.
Subagent scope: if you were spawned as a change-correlation subagent, this section is yours. Don't assign verdicts or synthesize root causes — produce the change timeline + candidate classifications (mechanism_confirmed / mechanism_plausible / area_only / ruled_out) for the orchestrator.
Answer: what changed, was it deployed, and could it plausibly cause the symptom?
{{CODEBASES_DIR}}/ — check there first for subdirectories (each is a git repo you can inspect with git log, git diff, git blame).Nothing past declared_at is admissible. Restrict every git log, git show, and gh pr call to strictly-before the incident start. Every shell you run in this branch must include the time bound:
git log --until="$DECLARED_AT" ...git log --before="$DECLARED_AT" ...gh pr list --search "merged:<$DECLARED_AT"A post-incident PR titled "Fix " is the answer key leaking backward — do not use it, do not quote it, do not let it validate your leader. If you see one anyway (because a broad query returned it), drop it and rerun with the --until bound. Violating this invalidates the verdict.
file:line pins from codebase-grep. The orchestrator spawns codebase-grep subagents for every verbatim signal in the telemetry output. Their results (one file:line per signal) are your highest-leverage inputs — each tells you exactly which source file a PR would need to touch to be a real candidate.declared_at, files changed in config / helm / infra / auth / feature-flag / routing paths. Dirty working tree = not a safe proxy for the incident window. For each file:line from step 2, run git log -L <line>,<line>:<file> --until=<declared_at> to see which commits modified that specific line before the incident started.$gh — time-bounded. Workflow runs, merged PRs, releases, commits, deployments with merged:<<declared_at> / created:<<declared_at> filters. A PR is a candidate only when it merged BEFORE declared_at AND a deploy completed between its merge and the incident start.file:line that emits the observed signal? If no → not a candidate, regardless of timing.mechanism_confirmed — diff → emitter → observation traces cleanly.mechanism_plausible — diff touches nearby code that could plausibly affect the observation; not directly on the emitter line.area_only — diff is in the right repo / service but doesn't touch the emitter.ruled_out — wrong surface or wrong timing.Most relevant changes:
- <PR/SHA/tag> [evidence: gh_pr | gh_run | gh_release | gh_api | local_git]
emitting signal: "<verbatim error/metric>"
emitting file:line: <path>:<line>
diff touches emitter line?: yes | no
classification: mechanism_confirmed | mechanism_plausible | area_only | ruled_out
Strongest causal candidate:
<change> — timestamp | artifact | surface | deploy status (deployed | merged only) | trace: diff→emitter→observation
Changes ruled out:
- <change> — why it doesn't fit (wrong file, wrong surface, wrong timing, not deployed)
Best verification or rollback step:
<concrete command or action>
Deploy status unknown? Say so explicitly and lower candidate confidence.
area_only classification? Say so explicitly — don't promote to "strongest candidate" without a mechanism trace.
If no repos are mounted (ENABLE_CODEBASES=0, no paths given) AND $gh is unavailable or returns nothing useful:
Return exactly:
Most relevant changes: (none — no repo / gh access)
Strongest causal candidate: (unavailable)
Best verification or rollback step: Operator should check deploy timestamps and config-change audit out-of-band.
Do not infer changes from telemetry alone. Do not cite PRs / SHAs / file paths that were not actually retrieved.
merged ≠ deployed.diff → emitter → observation. "Area match" is not a mechanism.Return these sections in order. Write (none) if a section is empty — never TBD.
Verdict: <one label>
Confidence: <low | medium | high>
Category: <one category>
Headline:
<one sentence, < 120 chars, paste-ready for Slack>
Causal chain:
- <trigger>
- <propagation>
- <symptom>
(If only symptom is known, say so. Don't invent a trigger.)
Validated claims:
- <claim> [evidence: tsuga_logs | tsuga_traces | tsuga_aggregation | tsuga_monitors | gh_pr | gh_run | gh_release | local_git | incident_archive | service_metadata]
- ...
Non-validated claims:
- <inference — state what evidence would confirm / refute>
Alternatives considered:
- <hypothesis — why deprioritized or falsified>
What changed:
- <timestamp | repo | artifact type | surface | fit>
Remediation:
Stop the bleeding: <fastest restore action, or (none needed)>
Likely root fix: <durable fix>
Verify before acting: <confirm fix addresses root cause, not symptom>
Open unknowns:
- <what you couldn't answer + what would unblock it>
Validated claim carries an [evidence: <source>] tag from the allowed list.1247 errors, not thousands of errors.tsuga_monitors is a clue about signal semantics, not live incident truth.what changed? early.merged ≠ deployed. subsystem ≠ root cause. symptom ≠ trigger.Load when needed:
tsuga command patterns for the telemetry branchnpx claudepluginhub tsuga-dev/agent-plugins --plugin tsugaProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.