Skill

umodel-rca

Autonomous root-cause analysis using a UModel object-graph semantic layer. Explores entities, fetches scoped telemetry, and traverses relationships to diagnose incidents, SLO breaches, or degraded services.

monitoring

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/umodel:umodel-rca

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Given a symptom, **investigate autonomously to a root cause** over the UModel

SKILL.md

122 lines · ~1.5k tokens

Stats

Language: Python

Stars: 146

Forks: 26

Maintenance: Excellent

Last commit: Jun 12, 2026

UModel RCA — autonomous root-cause analysis

Given a symptom, investigate autonomously to a root cause over the UModel object graph. You decide the path; this skill gives the method, not a script.

It builds on the read toolkit in the umodel-query skill: entity, topology, and model reads run via umctl query run <ws> "<SPL>" -o json, which returns rows in data.data and columns in data.header. Load both skills. The essentials you'll use are recapped inline below.
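Since every read returns rows in data.data and columns in data.header, it can help to pair them up before reasoning over the result. A minimal Python sketch, assuming data.header is a list of column names and data.data is a list of row lists; the helper name rows_as_dicts and the sample column names are hypothetical, not part of umctl:

```python
import json


def rows_as_dicts(raw_json: str) -> list[dict]:
    """Pair each row in data.data with the column names in data.header."""
    payload = json.loads(raw_json)
    header = payload["data"]["header"]  # assumed: list of column names
    rows = payload["data"]["data"]      # assumed: list of rows, each a list of values
    return [dict(zip(header, row)) for row in rows]


# Hypothetical response; the column names "id" and "name" are illustrative only.
sample = json.dumps({
    "data": {
        "header": ["id", "name"],
        "data": [["63718b78868895d2590551b27ec6f51c", "payment-gateway"]],
    }
})

for row in rows_as_dicts(sample):
    print(row["id"], row["name"])
```

Working with dicts keyed by column name keeps later steps independent of column order.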

Setup

Same server and CLI as umodel-query. For the bundled demo:

make quickstart QUICKSTART_SAMPLE=examples/incident-investigation   # serves http://localhost:8080

Model-guided data fetch (autonomous retrieval)

get_metrics / get_logs are driven by the object graph: the model knows which metric/log set hangs off an entity and the fields_mapping, so it fills in service_id for you — you never hand-write PromQL or guess an ID.

umctl query run demo ".entity_set with(domain='platform', name='platform.service', ids=['63718b78868895d2590551b27ec6f51c']) | entity-call get_metrics('platform','platform.service.metrics','latency_p99_ms', step='30s')" -o json
umctl query run demo ".entity_set with(domain='platform', name='platform.service', ids=['…']) | entity-call get_logs('platform','platform.service.logs', query='level = \"ERROR\"')" -o json
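Both commands follow one shape, so the SPL can be assembled from its parts rather than typed by hand. A minimal Python sketch that reproduces the get_metrics command above; the helper names (entity_set, get_metrics_spl, umctl_argv) are hypothetical, the sketch assumes the get_metrics domain argument equals the entity's domain as in the example, and it only builds the argument list without running anything:

```python
def entity_set(domain: str, name: str, ids: list[str]) -> str:
    """Render the .entity_set source clause for the given entity ids."""
    id_list = ", ".join(f"'{i}'" for i in ids)
    return f".entity_set with(domain='{domain}', name='{name}', ids=[{id_list}])"


def get_metrics_spl(domain: str, entity_name: str, ids: list[str],
                    metric_set: str, metric: str, step: str = "30s") -> str:
    """Render the full SPL: entity set piped into an entity-call get_metrics."""
    call = f"entity-call get_metrics('{domain}','{metric_set}','{metric}', step='{step}')"
    return f"{entity_set(domain, entity_name, ids)} | {call}"


def umctl_argv(workspace: str, spl: str) -> list[str]:
    """Argument list equivalent to: umctl query run <ws> "<SPL>" -o json."""
    return ["umctl", "query", "run", workspace, spl, "-o", "json"]


spl = get_metrics_spl(
    "platform", "platform.service",
    ["63718b78868895d2590551b27ec6f51c"],
    "platform.service.metrics", "latency_p99_ms",
)
print(umctl_argv("demo", spl))
```

Passing the list to subprocess.run (rather than a shell string) avoids having to escape the quotes inside the SPL.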

The open-source edition returns a query plan (the rendered PromQL / ES query, with the id substituted) for a downstream executor to run. Against a PaaS-backed endpoint (umctl --addr <paas> with mode='data'), the same call returns the actual rows as the PaaS API response ({__labels__, __ts__, __value__} for metrics). Either way, the object graph produces the exact, correctly scoped query.
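When the call does return metric rows, a small amount of post-processing makes the symptom quantifiable. The sketch below is illustrative only: it assumes each row is a dict with the __ts__ and __value__ keys named above and that both coerce to numbers; the helper names, sample values, and the 500 ms threshold are hypothetical.

```python
def to_points(rows: list[dict]) -> list[tuple[float, float]]:
    """Turn metric rows into (timestamp, value) pairs sorted by timestamp."""
    return sorted((float(r["__ts__"]), float(r["__value__"])) for r in rows)


def breaches(points: list[tuple[float, float]], threshold: float) -> list[tuple[float, float]]:
    """Keep only the points whose value exceeds the threshold."""
    return [(ts, value) for ts, value in points if value > threshold]


# Hypothetical rows in the {__labels__, __ts__, __value__} shape.
sample_rows = [
    {"__labels__": {}, "__ts__": 1700000060, "__value__": 1200.0},
    {"__labels__": {}, "__ts__": 1700000000, "__value__": 180.0},
    {"__labels__": {}, "__ts__": 1700000030, "__value__": 950.0},
]

points = to_points(sample_rows)
over = breaches(points, threshold=500.0)  # hypothetical P99 threshold in ms
print(f"{len(over)} of {len(points)} samples above threshold")
```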

The autonomous RCA loop

Run this loop; let evidence — not a fixed script — drive your next query.

ORIENT — locate the symptomatic entity (.entity … query='degraded'). Read its methods, datasets, neighbors, and linked runbook (the umodel-query reads).
CHARACTERIZE (fetch) — pull its own signals (get_metrics / get_logs) to confirm and quantify the symptom.
HYPOTHESIZE — candidate causes: upstream dependency, recent change (config / deploy), capacity / traffic, downstream resource. Keep several alive.
GATHER EVIDENCE (multi-hop, cross-domain) — traverse .topo to upstream callers and their recent config_change / deployment; follow links into the business domain (promotions / traffic) or runtime domain (nodes / pods). Cross-domain reach is where the object graph beats a flat metrics dump.
CORRELATE & DISCRIMINATE — line up changes × topology × telemetry × business context on a timeline. Separate root cause from coincidence: a recent deploy is not guilty just because it's recent — read its change_summary and rule out trivial ones (the red herring trap). Prefer a cause with a stated, ideally quantified, mechanism.
CONCLUDE — root cause + evidence chain (cite the graph path per link) + quantified mechanism + confidence + a reversible, confirmation-required recommendation.
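The CORRELATE & DISCRIMINATE step can be pictured as a filter over recent changes: recency alone keeps a change in view, but a trivial change_summary rules it out. A minimal Python sketch under stated assumptions: the Change record, the trivial flag (set by whoever reads the change_summary), the 48-hour window, and the deployment's 6-hour offset are all hypothetical and not part of UModel.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    kind: str                  # e.g. "config_change" or "deployment"
    hours_before_onset: float  # how long before the symptom started
    change_summary: str
    trivial: bool              # judged from change_summary, never from recency


def discriminate(changes: list[Change], window_hours: float = 48.0):
    """Split in-window changes into plausible causes and ruled-out red herrings."""
    in_window = [c for c in changes if 0 <= c.hours_before_onset <= window_hours]
    kept = [c for c in in_window if not c.trivial]
    ruled_out = [c for c in in_window if c.trivial]
    return kept, ruled_out


changes = [
    Change("cfg-checkout-retry", "config_change", 24.0, "max_retries 2→5", trivial=False),
    # The 6.0 below is a made-up offset for illustration.
    Change("payment-gw v3.2.1", "deployment", 6.0, "logging change", trivial=True),
]

kept, ruled_out = discriminate(changes)
print("candidates:", [c.name for c in kept])
print("ruled out:", [c.name for c in ruled_out])
```

Returning the ruled-out list alongside the candidates keeps the red herrings available for the "Ruled out" line of the final diagnosis.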

Runbook as scaffold

If the entity links a runbook_set (read it with .umodel with(kind='runbook_set', name='…')), use its observations as a reasoning frame — each is a hypothesis + how to check it + a conclusion rule. Structure your reasoning with it; you may still form hypotheses it didn't list. Cite its knowledge (failure patterns) and toolkits (allowed remediation tools).

Output

## Diagnosis
Symptom: <what's broken, quantified>
Evidence chain:
- <finding>  ← <SPL / graph path traversed>
Root cause: <cause>, mechanism: <stated / quantified>
Ruled out: <red herrings and why>
Confidence: <high|medium|low>
Recommended action: <tool> — <input> (risk, requires confirmation, ETA)
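If the diagnosis is assembled programmatically, the template above maps naturally onto a small record with a render method. This is a sketch only, not part of the skill: the Diagnosis class and its field names are hypothetical, and the example values are taken loosely from the worked example in this document.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    symptom: str
    evidence: list[tuple[str, str]]  # (finding, SPL / graph path traversed)
    root_cause: str
    mechanism: str
    ruled_out: str
    confidence: str                  # "high", "medium", or "low"
    action: str                      # tool, input, risk, confirmation, ETA

    def render(self) -> str:
        """Render the record in the Diagnosis layout shown above."""
        if self.confidence not in {"high", "medium", "low"}:
            raise ValueError(f"unexpected confidence: {self.confidence}")
        lines = ["## Diagnosis", f"Symptom: {self.symptom}", "Evidence chain:"]
        lines += [f"- {finding}  ← {path}" for finding, path in self.evidence]
        lines += [
            f"Root cause: {self.root_cause}, mechanism: {self.mechanism}",
            f"Ruled out: {self.ruled_out}",
            f"Confidence: {self.confidence}",
            f"Recommended action: {self.action}",
        ]
        return "\n".join(lines)


report = Diagnosis(
    symptom="payment-gateway P99 latency breaching",
    evidence=[("max_retries raised 2→5 on checkout-service",
               ".entity … platform.config_change query='checkout'")],
    root_cause="cfg-checkout-retry",
    mechanism="retry amplification under promotion traffic",
    ruled_out="payment-gw v3.2.1 (trivial logging change)",
    confidence="medium",
    action="rollback_config_change (medium risk, requires confirmation)",
)
print(report.render())
```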

Worked example — incident-investigation demo (a TEST of the method, not a script)

Symptom: payment-gateway (platinum SLO) is degraded. A good agent reaches the root cause without being told the steps:

ORIENT: .entity … query='degraded' → payment-gateway (63718b78…), links runbook platform.service.ops + datasets platform.service.metrics/.logs.
CHARACTERIZE: get_metrics(… 'latency_p99_ms' …) → P99 breaching; get_logs(… level="ERROR") → upstream-timeout signatures.
GATHER: .topo getNeighborNodes … 'calls' → upstream checkout-service (149632df…); .entity … platform.config_change query='checkout' → cfg-checkout-retry, max_retries 2→5 24h ago.
DISCRIMINATE: .entity … platform.deployment query='payment' → payment-gw v3.2.1, trivial logging change → ruled out (red herring).
CROSS-DOMAIN: .entity … business.promotion query='active' → 618 Flash Sale, actual 38000 vs expected 12000 QPS (≈3.2×).
CONCLUDE: retry amplification (×2.5, max_retries 2→5) × promotion traffic (≈3.2×) ≈ 7.9× load → recommend rollback_config_change (medium risk, confirm first).
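The quantified mechanism in the CONCLUDE step is a product of two ratios, which is easy to recompute from the raw figures. A minimal Python sketch, assuming (as the example does) that retry amplification is modelled as the ratio of the max_retries values; counting the original attempt as well would give a different factor. The function name compound_load is hypothetical.

```python
def compound_load(retries_before: int, retries_after: int,
                  actual_qps: float, expected_qps: float) -> tuple[float, float, float]:
    """Return (retry factor, traffic factor, combined load multiplier)."""
    retry_factor = retries_after / retries_before
    traffic_factor = actual_qps / expected_qps
    return retry_factor, traffic_factor, retry_factor * traffic_factor


# Figures quoted in the worked example: max_retries 2→5, 38000 vs 12000 QPS.
retry, traffic, total = compound_load(2, 5, 38000, 12000)
print(f"retry x{retry:.2f}, traffic x{traffic:.2f}, combined x{total:.2f}")
```

Recomputing from the raw counts, rather than from rounded ratios, keeps the stated mechanism consistent with the evidence it cites.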

Notes

Reuse the umodel-query reads for everything except telemetry fetch; this skill adds the fetch + the reasoning loop.
get_metrics / get_logs: return a query plan in the open-source edition, actual data against a PaaS endpoint.
Stay read-only — recommend remediation, do not execute it.
MCP alternative: call query_spl_execute with { "workspace", "query" } instead of the CLI; same SPL.
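For the MCP route, the same SPL travels as the query argument of the query_spl_execute tool. The sketch below builds a standard MCP tools/call request around it; the helper name, the request id, and the choice of the "demo" workspace are illustrative assumptions, and how the request is sent depends on the MCP client in use (not shown).

```python
import json


def mcp_tool_call(workspace: str, query: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC tools/call request for the query_spl_execute tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,  # arbitrary id chosen for this sketch
        "method": "tools/call",
        "params": {
            "name": "query_spl_execute",
            "arguments": {"workspace": workspace, "query": query},
        },
    }


spl = (".entity_set with(domain='platform', name='platform.service', "
       "ids=['63718b78868895d2590551b27ec6f51c']) | entity-call "
       "get_metrics('platform','platform.service.metrics','latency_p99_ms', step='30s')")

request = mcp_tool_call("demo", spl)
print(json.dumps(request, indent=2))
```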
