From Issue Tracing
Use when the user provides a Grafana or Kibana/ELK URL and asks to investigate an alert, error, incident, or anomaly, or when the user runs /issue-tracing.
Slash command
/issue-tracing:issue-tracing
On-call triage assistant. Takes a Grafana or Kibana URL (or alert description) and produces a structured **Root Cause / Impact / How to Resolve / Unknowns** report in both Traditional Chinese and English.
This skill is a step-by-step SOP, not a reference document. Each step has rules and gates you MUST execute and acknowledge in chat before moving to the next. Reading the rules without writing the required outputs (scope table, query plan, etc.) violates the skill. If you find yourself thinking "I'll just write the report now" before completing every step, you are skipping work.
Input ($ARGUMENTS optional):
Preflight: resolve project root + preload tools (do this FIRST, before any URL work)
Pre-load only the always-used entry-point tools so the very first ES / Grafana calls don't break narrative. Single ToolSearch with:
- mcp__elasticsearch__search
- mcp__elasticsearch__list_indices
- mcp__grafana__list_datasources

All other tools (panel queries, dashboard JSON, Prometheus / Loki query, panel image, etc.) are loaded on demand when actually needed. ToolSearch is cheap and saves context vs. preloading schemas you may not use.
Code reading is required later (step 10) to determine user impact and confirm root cause; without a project root, the whole flow stalls partway. Resolve up front:
a. Check add-dir paths loaded in the session — if there's a plausible service repo under one, treat its parent (or the path itself) as the root.
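b. Ask the user: if (a) does not yield a root, ask: "請執行 /add-dir <你的服務 repo 根目錄>(例如 /add-dir ~/Project),讓我之後能讀 code 判斷使用者影響。" (in English: "Please run /add-dir <your service repo root> (e.g. /add-dir ~/Project) so that I can read the code later to assess user impact.") Wait for the user to add it, then re-check.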
Cache the resolved root in conversation context. Then continue to step 2.
Parse the input URL
Grafana URL (*-grafana.*/d/<uid>/...):
- Extract uid, viewPanel (panel id, optional), from, to, var-* template vars
- If from/to are relative (e.g. now-1h), resolve to absolute UTC

Kibana URL (*log*/app/discover#/... or view/<savedSearchId>):
- Time range: time.from, time.to
- Filters: env, project.keyword, level.keyword, KQL query
- dataViewId to determine index pattern (see step 4)
- For view/<id> (saved search), the saved search bundles a data view + filters; use the filters in the URL state

Time buffer: expand the parsed range by ±5 minutes when querying.
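The Grafana parsing rules and the time buffer above can be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names, the returned dictionary keys, the `now-1h` / `now` defaults for a missing range, and the assumption that any non-relative value is epoch milliseconds are all hypothetical. Rounded forms such as `now/d` and ISO timestamps are not handled.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

BUFFER = timedelta(minutes=5)  # the +/-5 minute query buffer
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def resolve_time(value, now):
    """Resolve a Grafana from/to value to an absolute UTC datetime."""
    if value == "now":
        return now
    match = re.fullmatch(r"now-(\d+)([smhdw])", value)
    if match:
        amount, unit = int(match.group(1)), _UNITS[match.group(2)]
        return now - timedelta(**{unit: amount})
    # Assumption: anything else is epoch milliseconds.
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def parse_grafana_url(url, now=None):
    """Extract uid, viewPanel, var-* and a buffered absolute time range."""
    now = now or datetime.now(timezone.utc)
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    uid_match = re.search(r"/d/([^/]+)", parsed.path)
    # Assumption: fall back to the last hour when the URL carries no range.
    start = resolve_time(params.get("from", ["now-1h"])[0], now)
    end = resolve_time(params.get("to", ["now"])[0], now)
    return {
        "uid": uid_match.group(1) if uid_match else None,
        "view_panel": params.get("viewPanel", [None])[0],
        "vars": {k[4:]: v for k, v in params.items() if k.startswith("var-")},
        "query_from": start - BUFFER,
        "query_to": end + BUFFER,
    }
```

Passing `now` explicitly keeps the relative-time resolution reproducible, which helps when the same URL is re-checked later in the conversation.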
Determine investigation path
| Input | Path |
|---|---|
| Grafana dashboard URL (no viewPanel) | Read whole dashboard summary + every panel query, identify which panels show anomaly in the time window |
| Grafana single panel URL (viewPanel=panel-N) | Focus on that panel only |
| Kibana URL | Skip Grafana, go directly to ES query (step 7) |
Map Kibana data view → ES index pattern (skip when possible)
Fast path — prefer this: if the input already gives you a project.keyword filter (or any other strong field filter that pins the data), query with index: "_all" and the filter directly. The right index can be inferred afterwards from the _index field on hits if you really need it. This skips the entire discovery flow below and avoids the 80KB+ list_indices truncation.
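As a rough sketch of the fast path, the query body below uses standard Elasticsearch Query DSL (a `bool` filter with `range` and `term` clauses). The helper names are hypothetical, and how the body and the `_all` index are passed to `mcp__elasticsearch__search` is not specified here, so only the body itself is built. The second helper shows how the index could be inferred afterwards from `_index` on the hits.

```python
from collections import Counter


def fast_path_body(project, level, start_iso, end_iso, size=50):
    """Build a bounded, filter-only query body for index "_all"."""
    return {
        "size": size,  # always cap the number of hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": start_iso, "lte": end_iso}}},
                    {"term": {"project.keyword": project}},
                    {"term": {"level.keyword": level}},
                ]
            }
        },
    }


def indices_seen(response):
    """Count which concrete indices the hits came from."""
    return Counter(hit["_index"] for hit in response["hits"]["hits"])
```

Because every clause sits under `filter`, no relevance scoring is involved; the query only narrows the data by time, project, and level.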
Use the discovery flow below ONLY when:
Discovery flow (when needed):
- Call mcp__elasticsearch__list_indices, extract unique prefixes from .ds-<prefix>-* entries.
- Take the data view name (e.g. "<product>-<region>"). Match it against the data stream prefixes. A data view like <x>-<y> typically maps to a data stream like <y>-logs-<x>*, <x>-<y>*, or similar. Show candidates and pick the one that returns hits with the expected project.keyword filter.

Honor any excluded patterns the user mentions (e.g. test / lower-priority product lines). Default: include everything.
If the data center / cloud is unclear, query all confirmed patterns in parallel.
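The prefix extraction and candidate matching in the discovery flow might look like the sketch below. It assumes backing indices follow the usual Elasticsearch naming `.ds-<data-stream>-<yyyy.MM.dd>-<generation>` (with the date part optional for older clusters). The token-containment matching is an illustrative heuristic of this example, not a rule from the skill; the candidates it returns still have to be confirmed by running the project.keyword query.

```python
import re

# .ds-<stream>-<yyyy.MM.dd>-<6-digit generation>; the date part is optional.
_BACKING = re.compile(r"^\.ds-(?P<stream>.+?)(?:-\d{4}\.\d{2}\.\d{2})?-\d{6}$")


def data_stream_prefixes(index_names):
    """Return the unique data stream names behind .ds-* backing indices."""
    streams = set()
    for name in index_names:
        match = _BACKING.match(name)
        if match:
            streams.add(match.group("stream"))
    return sorted(streams)


def candidate_streams(data_view, streams):
    """Keep streams that contain every token of the data view name."""
    tokens = [t for t in re.split(r"[-_.*]+", data_view.lower()) if t]
    return [s for s in streams if all(t in s.lower() for t in tokens)]
```

Matching on tokens rather than on the literal name covers both the `<y>-logs-<x>*` and `<x>-<y>*` shapes mentioned above, at the cost of occasional false positives.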
Run Grafana panel queries (when needed)
For each relevant panel from step 3:
- Call mcp__grafana__get_dashboard_panel_queries with uid (and panelId if known)
- Branch on datasource.type:
  - elasticsearch → take the KQL string and run via ES (step 7). Do not call Grafana to execute; query ES directly.
  - prometheus / loki / other → use the corresponding Grafana mcp tool (query_prometheus, query_loki_logs, etc.) with the panel's processedQuery and the parsed time range.
- For a whole-dashboard URL (no viewPanel), report each panel: title, what it measures, observed value vs expected, whether it shows anomaly.

Identify candidate projects (before running ES counts)
Source the candidate project list from these, in order:
a. Grafana panel legend (most reliable for dashboard URLs): when a panel groups by project.keyword, the legend table lists projects with totals directly. Two ways to obtain it:
mcp__grafana__get_panel_image to render the panel and inspect the legend
b. Sample hits: if no panel legend, run one query with size: 50 over the time window with the panel's filters but no project filter. Tally project field across hits client-side. This is a coarse top-K but bounded by size.
c. Kibana URL filters: if the URL has project.keyword: is one of [...], use that list verbatim.

Do NOT enumerate by trying every known project name: it is brittle and slow.
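The client-side tally from option (b) can be sketched as below. The function name is hypothetical, and the hits are assumed to be the standard Elasticsearch shape with the document under `_source` and a top-level `project` field.

```python
from collections import Counter


def tally_projects(hits, field="project"):
    """Count how often each project appears in a bounded sample of hits."""
    counts = Counter()
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        if value:
            counts[value] += 1
    # Highest count first; this is a coarse top-K limited by the sample size.
    return counts.most_common()
```

Since the sample is capped at `size: 50`, the result ranks the dominant projects but says nothing reliable about rare ones.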
Run ES queries (per-project counts)
GATE — Read references/step7-es-query.md NOW for the full procedure (filter requirements, aggs ban, token budget, stack-trace dedupe). Do not write any ES query before reading it; this content does not survive context dilution if you only read it once at the start of the conversation.
Distinct user count (when meaningful)
Many backend logs do NOT carry customerid / accountid. Workflow:
a. Sample 1 hit from the dominant project: size: 1, _source: "*"
b. Inspect fields. If no customerid, accountid, userid, account_id etc. → write n/a in the report.
c. If a user-id field exists: pull size: 100 hits, dedupe client-side, report as a lower bound (e.g. ~38+). Note the cap.
d. Do NOT spend extra effort if the field is absent — the report is useful without it.
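Steps a to c above could be expressed as the following sketch. The function names and the exact wording of the returned string are assumptions for illustration; the field list mirrors the names given above, compared case-insensitively and only at the top level of `_source`.

```python
USER_ID_FIELDS = ("customerid", "accountid", "userid", "account_id")


def find_user_field(sample_source):
    """Return the first top-level key that looks like a user id, else None."""
    for key in sample_source:
        if key.lower() in USER_ID_FIELDS:
            return key
    return None


def distinct_user_summary(sample_source, hits, cap=100):
    """Report a lower bound on distinct users, or n/a if no id field exists."""
    field = find_user_field(sample_source)
    if field is None:
        return "n/a"
    used = hits[:cap]
    ids = {
        hit["_source"][field]
        for hit in used
        if hit.get("_source", {}).get(field) is not None
    }
    return f"~{len(ids)}+ (deduped from {len(used)} hits, cap {cap})"
```

The trailing `+` and the explicit cap make it clear that the figure is a floor taken from a bounded sample, not a full count.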
Cross-project drill-down
GATE — Read references/step9-cross-project-drill.md NOW for the trigger priority ladder, hop limit, and the mandatory pre-incident baseline correlation check. Do not pull a sibling error pattern into Root Cause without the procedure in that file.
Code inspection & scope check
For the originating project AND every upstream identified in step 9, look it up under the project root from step 1 using these rules:
a. Direct match: <root>/<project.keyword>
b. Dot variant: replace - with . (e.g. service-a-b → service.a.b)
c. Hyphen variant: replace . with -
d. Strip separators: ls <root> | grep -i <stripped>
e. Frontend / consuming repo: if the project ends with -backend / .backend, also check <root>/<base>.frontend / <root>/<base>-frontend.
Record the result in conversation context as a scope table:
Investigation scope:
- <project>: in-scope (path: <path>)
- <upstream-A>: in-scope (path: <path>)
- <upstream-B>: out-of-scope
In-scope = repo found under project root. Out-of-scope = no match after a–e (likely owned by another team / not on this machine).
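Lookup rules a to e and the scope-table line format can be sketched as follows. The function names are hypothetical; rule d is implemented as a case-insensitive substring match on directory names, mirroring `ls <root> | grep -i <stripped>`, and the first match wins, which is a simplification of this example.

```python
import re
from pathlib import Path


def find_repo(root, project):
    """Apply rules a-d: direct, dot variant, hyphen variant, stripped match."""
    root = Path(root)
    variants = [project, project.replace("-", "."), project.replace(".", "-")]
    for name in dict.fromkeys(variants):  # de-duplicate, keep order
        if (root / name).is_dir():
            return root / name
    stripped = re.sub(r"[-.]", "", project).lower()
    for child in sorted(root.iterdir()):
        if child.is_dir() and stripped in child.name.lower():
            return child
    return None


def find_frontend(root, project):
    """Apply rule e: look for <base>.frontend or <base>-frontend."""
    root = Path(root)
    for suffix in ("-backend", ".backend"):
        if project.endswith(suffix):
            base = project[: -len(suffix)]
            for name in (f"{base}.frontend", f"{base}-frontend"):
                if (root / name).is_dir():
                    return root / name
    return None


def scope_line(root, project):
    """Format one row of the scope table."""
    path = find_repo(root, project)
    if path is None:
        return f"- {project}: out-of-scope"
    return f"- {project}: in-scope (path: {path})"
```

Calling `scope_line` once for the originating project and once per upstream yields the rows of the scope table in the format shown above.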
The scope table drives the depth of step 10b and step 11:
GATE — write the scope table out in chat before continuing, even if it has only one row. Without an explicit scope table, you have no basis to decide whether step 11 is mandatory or optional.
Hard rule for frontend lookup: if the originating service is -backend / .backend and the report's Impact will describe user-visible behavior, locating the frontend repo via 10a.e is REQUIRED — backend code alone cannot tell you how the error surfaces to the user. If the frontend repo is not found under the project root, Impact must say so explicitly in Unknowns instead of guessing.
If a–e all failed for a repo you needed (e.g. originating project's frontend not under root), list in Unknowns: "需要 <project> 的 repo 路徑 / 請執行 /add-dir <path>" (in English: "Need the repo path for <project> / please run /add-dir <path>").
Verify infra metrics
GATE — Read references/step11-infra-metrics.md NOW for the full procedure (in-scope vs out-of-scope branch, service-token extraction, datasource-first query flow with Prometheus + InfluxDB recipes, dashboard fallback, mandatory CPU + Memory + restart scan, mean+max + 1-min-bin aggregation, Plan block requirement).
The reference is the source of truth for this step; this skill body is intentionally thin so the rules arrive fresh in context when you actually run the step, not stale at conversation start.
Produce the report
GATE — Read references/step12-report.md NOW for the full procedure (pre-report final check, HARD RULES on Impact wording, GMT+8 rule, Chinese full template, English short template). Do not start writing the report before reading; the formatting rules and Impact constraints will not survive context dilution from the preceding tool calls.
- Every ES query carries an @timestamp range + project.keyword + level.keyword + size cap. No * / match_all queries.
- When a value cannot be determined, write n/a, never estimate.
- Resolve the project root from add-dir paths or ask the user to /add-dir. Never hardcode paths and never persist to memory (per-project memory doesn't help cross-project triage).
npx claudepluginhub bingeli1379/eli-marketplace --plugin issue-tracing