From UModel
Autonomous root-cause analysis using a UModel object-graph semantic layer. Explores entities, fetches scoped telemetry, and traverses relationships to diagnose incidents, SLO breaches, or degraded services.
Slash command: /umodel:umodel-rca
Given a symptom, investigate autonomously to a root cause over the UModel object graph. You decide the path; this skill gives the method, not a script.
It builds on the read toolkit in the umodel-query skill (entity / topology /
model reads via umctl query run <ws> "<SPL>" -o json; rows in data.data,
columns in data.header). Load both. The essentials you'll use are recapped
inline below.
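As a minimal sketch of reading that output shape: the helper below pairs each row in data.data with the column names in data.header. It assumes (the text above does not confirm this) that header is a list of column names and each row is a list in the same order. The name rows_as_dicts and the sample payload are hypothetical.

```python
import json


def rows_as_dicts(payload: dict) -> list[dict]:
    """Pair each row in data.data with the column names in data.header.

    Assumption: header is a list of names and every row is a list in the
    same column order. Adjust if the real response is shaped differently.
    """
    data = payload.get("data", {})
    header = data.get("header", [])
    rows = data.get("data", [])
    return [dict(zip(header, row)) for row in rows]


# Hypothetical payload, shaped like the `-o json` output described above.
sample = json.loads(
    '{"data": {"header": ["id", "name"], "data": [["a1", "payment-gateway"]]}}'
)
records = rows_as_dicts(sample)
print(records)
```

Keyed records make the later steps easier to read, since each finding can name the column it came from rather than a positional index.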
Same server and CLI as umodel-query. For the bundled demo:
make quickstart QUICKSTART_SAMPLE=examples/incident-investigation # serves http://localhost:8080
get_metrics / get_logs are driven by the object graph: the model knows which
metric/log set hangs off an entity and the fields_mapping, so it fills in
service_id for you — you never hand-write PromQL or guess an ID.
umctl query run demo ".entity_set with(domain='platform', name='platform.service', ids=['63718b78868895d2590551b27ec6f51c']) | entity-call get_metrics('platform','platform.service.metrics','latency_p99_ms', step='30s')" -o json
umctl query run demo ".entity_set with(domain='platform', name='platform.service', ids=['…']) | entity-call get_logs('platform','platform.service.logs', query='level = \"ERROR\"')" -o json
Open source returns a query plan (the rendered PromQL / ES query, with the id substituted); a downstream executor runs it. Against a PaaS-backed endpoint (umctl --addr <paas> with mode='data') the same call returns the actual rows as the PaaS API response ({__labels__, __ts__, __value__} for metrics). Either way, the object graph produced the exact, correctly scoped query.
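The two calls above share one shape: an entity_set selector piped into an entity-call. A hedged sketch of composing that SPL string and the matching umctl argument list follows. The helper names (quote, entity_call_spl, umctl_argv) are hypothetical, and the sketch only reproduces the syntax visible in the examples; it does not know SPL's escaping rules, so it rejects values containing a single quote.

```python
def quote(value: str) -> str:
    """Wrap a value in single quotes, as the SPL examples above do."""
    if "'" in value:
        # SPL escaping rules are not shown in the source, so refuse to guess.
        raise ValueError("values containing a single quote are not handled")
    return f"'{value}'"


def entity_call_spl(domain: str, name: str, ids: list[str],
                    method: str, args: list[str], **options: str) -> str:
    """Compose `.entity_set with(...) | entity-call method(...)`."""
    id_list = ", ".join(quote(i) for i in ids)
    selector = (
        f".entity_set with(domain={quote(domain)}, "
        f"name={quote(name)}, ids=[{id_list}])"
    )
    positional = ",".join(quote(a) for a in args)
    keyword = "".join(f", {k}={quote(v)}" for k, v in options.items())
    return f"{selector} | entity-call {method}({positional}{keyword})"


def umctl_argv(workspace: str, spl: str) -> list[str]:
    """Argument list for `umctl query run <ws> "<SPL>" -o json`."""
    return ["umctl", "query", "run", workspace, spl, "-o", "json"]


metrics_spl = entity_call_spl(
    "platform", "platform.service", ["63718b78868895d2590551b27ec6f51c"],
    "get_metrics",
    ["platform", "platform.service.metrics", "latency_p99_ms"],
    step="30s",
)
argv = umctl_argv("demo", metrics_spl)
print(metrics_spl)
```

Passing the query as a single list element (for example to subprocess.run) avoids a second layer of shell quoting around the SPL.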
Run this loop; let evidence — not a fixed script — drive your next query.
1. Find the symptomatic entity (.entity … query='degraded'). Read its methods, datasets, neighbors, and linked runbook (the umodel-query reads).
2. Fetch scoped telemetry (get_metrics / get_logs) to confirm and quantify the symptom.
3. Traverse .topo to upstream callers and their recent config_change / deployment; follow links into the business domain (promotions / traffic) or runtime domain (nodes / pods). Cross-domain reach is where the object graph beats a flat metrics dump.
4. Read each candidate change's change_summary and rule out trivial ones (the red herring trap). Prefer a cause with a stated, ideally quantified, mechanism.

If the entity links a runbook_set (read it with .umodel with(kind='runbook_set', name='…')), use its observations as a reasoning frame: each is a hypothesis, how to check it, and a conclusion rule. Structure your reasoning with it; you may still form hypotheses it didn't list. Cite its knowledge (failure patterns) and toolkits (allowed remediation tools).
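One way to picture a runbook observation as a reasoning frame is the sketch below: a hypothesis, a check that gathers evidence, and a conclusion rule applied to that evidence. The Observation class, its field names, and the hard-coded evidence are all hypothetical; the real runbook_set schema is not shown here, and in practice each check would run a query instead of returning a constant.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    hypothesis: str
    check: Callable[[], dict]         # how to check it: returns evidence
    conclude: Callable[[dict], bool]  # conclusion rule: True means supported


def evaluate(observations: list[Observation]) -> list[tuple[str, bool, dict]]:
    """Run every check and apply its conclusion rule to the evidence."""
    results = []
    for obs in observations:
        evidence = obs.check()
        results.append((obs.hypothesis, obs.conclude(evidence), evidence))
    return results


# Illustrative stand-in evidence; real checks would call get_metrics,
# get_logs, or a graph read and return what they found.
observations = [
    Observation(
        hypothesis="upstream retry config amplifies load",
        check=lambda: {"max_retries_before": 2, "max_retries_after": 5},
        conclude=lambda e: e["max_retries_after"] > e["max_retries_before"],
    ),
    Observation(
        hypothesis="recent deployment changed request handling",
        check=lambda: {"change_kind": "logging"},
        conclude=lambda e: e["change_kind"] != "logging",
    ),
]
results = evaluate(observations)
supported = [h for h, ok, _ in results if ok]
ruled_out = [h for h, ok, _ in results if not ok]
print(supported, ruled_out)
```

Keeping the evidence next to each verdict means ruled-out hypotheses can be reported with the reason they were dropped.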
## Diagnosis
Symptom: <what's broken, quantified>
Evidence chain:
- <finding> ← <SPL / graph path traversed>
Root cause: <cause>, mechanism: <stated / quantified>
Ruled out: <red herrings and why>
Confidence: <high|medium|low>
Recommended action: <tool> — <input> (risk, requires confirmation, ETA)
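If you assemble the report programmatically, a small renderer keeps the field order of the template above fixed. This is a sketch only: the Diagnosis class and its field names are hypothetical, and the sample values are illustrative placeholders, not output from a real investigation.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    symptom: str
    evidence: list[tuple[str, str]]  # (finding, SPL or graph path traversed)
    root_cause: str
    mechanism: str
    ruled_out: str
    confidence: str  # one of "high", "medium", "low"
    action: str      # pre-formatted recommended action text

    def render(self) -> str:
        """Return the report in the same line order as the template."""
        if self.confidence not in ("high", "medium", "low"):
            raise ValueError("confidence must be high, medium, or low")
        lines = ["## Diagnosis", f"Symptom: {self.symptom}", "Evidence chain:"]
        lines += [f"- {finding} ← {path}" for finding, path in self.evidence]
        lines += [
            f"Root cause: {self.root_cause}, mechanism: {self.mechanism}",
            f"Ruled out: {self.ruled_out}",
            f"Confidence: {self.confidence}",
            f"Recommended action: {self.action}",
        ]
        return "\n".join(lines)


# Illustrative values only.
report = Diagnosis(
    symptom="latency_p99_ms breaching SLO",
    evidence=[("P99 breaching", "get_metrics latency_p99_ms")],
    root_cause="upstream config change",
    mechanism="more retries per call under elevated traffic",
    ruled_out="deployment with a trivial logging change",
    confidence="medium",
    action="rollback_config_change (medium risk, confirm first)",
).render()
print(report)
```

Validating the confidence value up front keeps the report within the three levels the template allows.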
Symptom: payment-gateway (platinum SLO) is degraded. A good agent reaches the
root cause without being told the steps:
1. .entity … query='degraded' → payment-gateway (63718b78…), links runbook platform.service.ops + datasets platform.service.metrics/.logs.
2. get_metrics(… 'latency_p99_ms' …) → P99 breaching; get_logs(… level="ERROR") → upstream-timeout signatures.
3. .topo getNeighborNodes … 'calls' → upstream checkout-service (149632df…); .entity … platform.config_change query='checkout' → cfg-checkout-retry, max_retries 2→5 24h ago.
4. .entity … platform.deployment query='payment' → payment-gw v3.2.1, trivial logging change → ruled out (red herring).
5. .entity … business.promotion query='active' → 618 Flash Sale, actual 38000 vs expected 12000 QPS (about 3.2×).
6. Recommended action: rollback_config_change (medium risk, confirm first).

- Use the umodel-query reads for everything except telemetry fetch; this skill adds the fetch + the reasoning loop.
- get_metrics / get_logs: plan in open source, data against a PaaS endpoint.
- Where the query_spl_execute tool is available, call it with { "workspace", "query" } instead of the CLI; same SPL.
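A quantified mechanism is easier to defend when the arithmetic is written down. The sketch below computes the traffic ratio from the two QPS figures in the worked example, and a worst-case attempt multiplier for the retry change. The second calculation assumes max_retries counts retries on top of the first attempt, which the source does not state; both function names are hypothetical.

```python
def traffic_ratio(actual_qps: float, expected_qps: float) -> float:
    """How many times the expected load the service actually received."""
    if expected_qps <= 0:
        raise ValueError("expected_qps must be positive")
    return actual_qps / expected_qps


def worst_case_attempt_multiplier(retries_before: int, retries_after: int) -> float:
    """Upper bound on extra attempts per call after a retry-limit change.

    Assumption: each call makes 1 initial attempt plus up to max_retries
    retries, and every retry is used. Real amplification is usually lower.
    """
    return (1 + retries_after) / (1 + retries_before)


ratio = traffic_ratio(38000, 12000)
retry_bound = worst_case_attempt_multiplier(2, 5)
print(f"traffic: {ratio:.2f}x, retry upper bound: {retry_bound:.1f}x")
# prints "traffic: 3.17x, retry upper bound: 2.0x"
```

The retry figure is an upper bound under the stated assumption, so treat it as a way to size the hypothesis rather than as a measurement.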