From 97
Provides rigid checks for structured logs, request IDs, tracing, and metrics in production request handlers, RPCs, and background jobs, so that the code stays diagnosable.
Slash command
/97:observability
Code that runs fine in dev and goes inert in production is the dominant operational failure mode for modern services. **When you add code that will run for users, you also add the diagnosability of that code: structured logs, trace context across process boundaries, metrics with bounded cardinality, signals an operator can read without your help.**
This is a rigid skill. Jump to the sub-section that matches what you're writing and run that sub-section's checks.
These checks matter most when adding a request handler, RPC, or background job that will run in production with users depending on diagnosability. In MVPs, prototypes, internal dev tools, and one-off scripts, structured-logging, tracing, and SLO discipline are premature — prefer the simplest thing that works.
Invoke when you're about to:
- log.info / log.warn / log.error calls in code that will run under load

If the change adds an observability call to production code even slightly, invoke anyway; the cardinality and trace-context bugs are not.
timestamp, level, event (a short stable name like user_login_failed), plus the relevant context fields (request_id, user_id when not sensitive, route, duration_ms, status). Example: logger.info(f"user {user.id} logged in via {provider} at {ts}") is unsearchable; logger.info("user_login", user_id=user.id, provider=provider) is queryable. (OTel/StructuredLogs.)

security-and-trust-boundaries); whether log files belong on disk or stdout (build-deploy-and-tooling 12F/XI). This skill decides what fields go on the line and how they are shaped.

requests.get, manual queue producer). Example: a handler that reads from one service and writes to another with no propagation. The trace breaks at the boundary and the operator cannot see the cross-service path. (OTel/TraceContext.)

failed_logins_total{user_id="...", reason="..."} produces a new time series per user: millions of series for a system with millions of users, and the metrics backend falls over. Per-user, per-request, per-trace-id data belongs in logs and traces, not metric labels. Metric labels are for low-cardinality, bounded sets: HTTP method, route template, status class, region, downstream name. (OE/CardinalityDiscipline.)

SRE/GoldenSignals.)

These thoughts mean STOP. Apply the domain check before committing:
| Thought | Reality |
|---|---|
| "I'll log a single human-readable string — it's easier to grep." | Free-form strings are unsearchable in production aggregators. Log structured key-value with stable event names; the operator queries by field, not by substring. (OTel/StructuredLogs) |
| "I'll add the user id as a metric label so we can see per-user failures." | Per-user labels create a time series per user. Use a metric for the count; put the user id in logs and traces where high cardinality is fine. (OE/CardinalityDiscipline) |
| "I'll add the full URL path as a label." | Same problem — /users/12345 and /users/12346 are different series. Use the route template (/users/:id), not the realized path. (OE/CardinalityDiscipline) |
| "I'll instrument every helper function with a span." | Spans cover meaningful units of work; one per private helper buries the trace in noise. Span per request / transaction / job, not per function. (OTel/TraceContext) |
| "The downstream call uses raw requests.get; no need to thread the trace headers." | The trace breaks at the boundary; the operator cannot see the cross-service path. Propagate W3C Trace Context, even when bypassing the tracer SDK. (OTel/TraceContext) |
| "We don't measure latency on this background job — it'll be fine." | Without latency / traffic / errors / saturation visibility, the only way to know it broke is a user complaint. Wire at least the four signals for production service code. (SRE/GoldenSignals) |
| "The request id is in the trace — we don't need it in the log." | Logs without the request id force the operator to traverse the trace just to correlate one error line. Put the request id on every log line for the request. (OTel/StructuredLogs) |
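
The structured-log and request-id rows above can be illustrated with a minimal sketch that uses only Python's standard `logging` module. The names `JsonFormatter`, `make_logger`, and `log_event`, and the `fields` key passed through `extra`, are assumptions of this sketch and not part of the skill. The keyword-argument call style shown earlier (`logger.info("user_login", user_id=...)`) is typical of structured-logging libraries; the standard library needs `extra` instead.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object keyed by a stable event name."""

    def format(self, record: logging.LogRecord) -> str:
        line = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Context fields arrive via extra={"fields": {...}} (a convention of this sketch).
        line.update(getattr(record, "fields", {}))
        return json.dumps(line, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("structured_sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_event(logger: logging.Logger, event: str, request_id: str, **fields) -> None:
    """Log a named event; the request id is required so every line carries it."""
    logger.info(event, extra={"fields": {"request_id": request_id, **fields}})


if __name__ == "__main__":
    log = make_logger(sys.stdout)
    log_event(log, "user_login", "req-123", user_id=42, provider="github")
```

Making `request_id` a required positional parameter is one way to keep it on every line: a call site that forgets it fails immediately instead of producing an uncorrelatable log line.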
For every observability surface your change touches, all of the following are true:
- [ ] event name, and includes the request id.
- [ ] security-and-trust-boundaries).

If any box that applies to your change is unchecked, you are not done. Either finish, or revert and re-plan.
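
For the metric-label checks, the following sketch shows why the route template keeps cardinality bounded. It is a toy in-memory counter, not a real metrics client; `ROUTES`, `route_template`, `status_class`, and `RequestCounter` are hypothetical names introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical route table for this sketch: (compiled pattern, route template).
ROUTES = [
    (re.compile(r"^/users/\d+$"), "/users/:id"),
    (re.compile(r"^/users/\d+/orders/\d+$"), "/users/:id/orders/:order_id"),
]


def route_template(path: str) -> str:
    """Map a realized path to its template so the label set stays bounded."""
    for pattern, template in ROUTES:
        if pattern.match(path):
            return template
    # One shared bucket, so unknown paths cannot grow the label set.
    return "unmatched"


def status_class(status: int) -> str:
    """Collapse a status code into its class, e.g. 404 becomes '4xx'."""
    return f"{status // 100}xx"


class RequestCounter:
    """Toy counter: one series per distinct (method, route, status class)."""

    def __init__(self) -> None:
        self.series = Counter()

    def observe(self, method: str, path: str, status: int) -> None:
        labels = (method, route_template(path), status_class(status))
        self.series[labels] += 1


if __name__ == "__main__":
    counter = RequestCounter()
    for user_id in range(1000):
        counter.observe("GET", f"/users/{user_id}", 200)
    # A thousand distinct users collapse into a single series.
    print(len(counter.series))
```

Labelling by the realized path instead would create one series per user in the same loop, which is the failure the cardinality rule describes.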
| ID | Principle | Source |
|---|---|---|
| OTel/StructuredLogs | Structured key/value logs with stable event names | OpenTelemetry semantic conventions; SRE book |
| OTel/TraceContext | W3C Trace Context propagated across every cross-process call | OpenTelemetry semantic conventions; Observability Engineering |
| SRE/GoldenSignals | The four signals for service code: latency, traffic, errors, saturation | Site Reliability Engineering, ch. 6 |
| OE/CardinalityDiscipline | High-cardinality data belongs in logs and traces, not metric labels | Observability Engineering (Majors et al.) |
See principles.md for the long-form distillations and source citations.
npx claudepluginhub oribarilan/97 --plugin 97