From hatch3r
Verifies observability completeness before shipping a service — OTel span coverage, log-trace correlation, SLOs, error tracking, and GenAI semantic conventions.
Slash command: /hatch3r:hatch3r-observability-verify
This skill defines what "done" means for any feature shipping a service. Run before declaring a feature complete. The 9 gates below mix automated checks (machine-checkable on every PR) with one release-cadence gate (SLO + burn-rate alert review per release). Skipping any gate = the feature is not done. Reviewer approval and passing unit tests alone do not satisfy this bar.
Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per agents/shared/user-question-protocol.md. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: service scope (which routes), trace vendor (OTel collector vs vendor SDK), sample rates (head vs tail), SLO target values, and Gate 7 applicability (LLM-in-path vs pure service).
Fan-out scales with task size; token cost never justifies serializing independent work (rules/hatch3r-fan-out-discipline.md P8 B2; agents/shared/efficiency-patterns.md). Emit sub_agents_spawned: { count, rationale } in your output.
This skill is the verification HARNESS — it declares HOW each observability gate is checked. The DISPATCHER that decides WHEN to run it is the CQ specialist agent:
agents/hatch3r-reliability.md: invokes this skill as the telemetry sub-vector gate of CQ4 (OTel span coverage, structured-log + trace-id correlation, RED/USE metrics, GenAI semconv), alongside skills/hatch3r-reliability-verify for the SLO/probes/runbook sub-vector. The agent contributes the review trigger and Phase-4 dispatch; this skill contributes the 9-gate procedure.
No duplication: the agent decides WHEN, this skill defines HOW. The agent body cites this skill (agents/hatch3r-reliability.md: "cite ... skills/hatch3r-observability-verify as the closing gates"); this subsection is the symmetric upstream citation per rules/hatch3r-agent-orchestration.md (Phase-4 dispatch).
Find inbound routes via grep -rE 'app\.(get|post|put|patch|delete)|router\.|@Get|@Post|fastify\.route' src/ and outbound calls via grep -rE 'fetch\(|axios|prisma|redis|pg\.query' src/. Each match must have a tracer call on the same path: grep -E 'tracer|startSpan|@WithSpan' against the file.
Auto-instrumentation packages (@opentelemetry/auto-instrumentations-node, opentelemetry-instrumentation for Python) satisfy the spec when loaded before app imports; verify via process arg --require @opentelemetry/auto-instrumentations-node/register or equivalent loader.
HTTP spans carry http.request.method, http.route, http.response.status_code, url.scheme. DB spans carry db.system + db.operation.name. Span status ERROR set on every 5xx + every caught exception.
Sources: rules/hatch3r-observability-tracing.md, OpenTelemetry semconv v1.41.1 (the HTTP/DB attributes named here are stable from the >=1.29 floor onward).

A structured logger is in use (for example slog). No console.log for application logs in production code paths.
Every log line carries trace_id and span_id from the active OTel context. Verify via Playwright or vitest fixture that emits a request and asserts both fields appear on the captured log line.
@opentelemetry/instrumentation-pino for Node or LoggingInstrumentor for Python auto-injects trace_id + span_id. Manual injection is acceptable when auto-instrumentation is unavailable for the logger.
W3C Trace Context (traceparent + tracestate headers) propagated on every outbound HTTP call. Test: send a request, inspect the outbound call recorded by nock / msw / a recording proxy, and assert the header is present and parses as a valid traceparent string 00-{32hex}-{16hex}-{2hex}.
trace_id + traceparent propagated on every outbound call.
Sources: rules/hatch3r-observability-logging.md, W3C Trace Context Level 1 (W3C Recommendation 2020-02).

SeverityNumber mapping documented in the logger initialization. Replace ad-hoc level strings with the OTel-aligned set: TRACE / DEBUG / INFO / WARN / ERROR / FATAL mapped to SeverityNumber 1 / 5 / 9 / 13 / 17 / 21.
Log messages are constant strings with dynamic values passed as fields, e.g. "created order" {order_id, amount}. Never embed dynamic values into the message string.
Required fields on every log record: service.name, service.version, deployment.environment, trace_id, span_id, severity_number, timestamp (RFC 3339 with millisecond precision).
No console.log for app logs. Enforced via eslint rule no-console with error severity in production code paths; test code is exempt via override.
Sources: rules/hatch3r-observability-logging.md, OpenTelemetry Logs Data Model.

Request metrics are labeled by route, method, status. Histogram buckets follow the rule default [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000] ms.
Metric names follow {service}.{domain}.{metric}_{unit} in snake_case. Counter names end in _total; histogram names end in the unit (_ms, _bytes).
Label cardinality stays within the limit in rules/hatch3r-observability-metrics.md (<100 unique values per label). Never use raw user_id or unbucketed path as a label.
Source: rules/hatch3r-observability-metrics.md.

SLO declared in slo.yaml or service.yaml at the service root.
SLI defined as good_events / valid_events. Source the numerator and denominator from the same signal (load-balancer logs OR application metrics, never mixed).
Source: rules/hatch3r-observability-metrics.md.

Error-tracking SDK initialized with release: process.env.GIT_SHA or equivalent.
Source maps uploaded via sentry-cli sourcemaps upload or vendor equivalent. Source-map upload step runs on tag-push and on every production deploy.
A beforeSend hook strips email, IP, password, authorization tokens from event payloads. Test via a fixture event with PII and assert the captured payload is clean.
Sources: rules/hatch3r-observability-logging.md, Sentry release tracking guide.

Applies only when the feature calls an LLM or runs an agent:
gen_ai.* conventions are Development-status as of SemConv v1.41.1, so names may change; pin the SemConv version you emit and re-verify these attribute keys each P3 currency cycle.
LLM call spans carry gen_ai.operation.name, gen_ai.provider.name (renamed from the deprecated gen_ai.system in SemConv v1.37.0), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons. Cache-hit flag emitted as a span attribute when the provider returns one.
Tool calls emit execute_tool {gen_ai.tool.name} spans per rules/hatch3r-observability-tracing.md § "AI Agent Instrumentation". Each tool span carries gen_ai.operation.name, gen_ai.tool.name, plus the app.tool.input_hash/app.tool.output_status/app.tool.duration_ms extras.
Metrics: gen_ai.client.token.usage (Histogram, attribute gen_ai.token.type) and gen_ai.client.operation.duration (Histogram).
Cross-reference: rules/hatch3r-ai-evals.md (Slice 5), OpenLLMetry semantic conventions.
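The required-attribute list for LLM call spans lends itself to a small automated check. The following is a minimal TypeScript sketch, not part of hatch3r or the OpenTelemetry SDK: the helper name missingGenAiAttributes and the plain-object attribute map are hypothetical, and the keys are copied from the list above, so they need re-verification whenever the pinned SemConv version changes.

```typescript
// Attribute keys this gate requires on every LLM call span (copied from the list above).
const REQUIRED_GEN_AI_ATTRIBUTES: readonly string[] = [
  "gen_ai.operation.name",
  "gen_ai.provider.name",
  "gen_ai.request.model",
  "gen_ai.usage.input_tokens",
  "gen_ai.usage.output_tokens",
  "gen_ai.response.finish_reasons",
];

// Hypothetical helper: returns the required keys that are absent (or null/undefined)
// in a span's attribute map. An empty result means the span satisfies the list.
function missingGenAiAttributes(attributes: Record<string, unknown>): string[] {
  return REQUIRED_GEN_AI_ATTRIBUTES.filter(
    (key) => attributes[key] === undefined || attributes[key] === null,
  );
}
```

A span that still emits only the deprecated gen_ai.system key would be reported as missing gen_ai.provider.name, which is the behavior this gate asks for.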
ParentBased(TraceIdRatioBased(0.1)) head sample + tail-sampling policy keeping 100% of error traces and 100% of traces with latency > p95.
target_sps = qps * head_sample * (1 + retry_factor). Re-check on every deploy.
Drop user_id from spans before export when count > 10k unique values per 5-minute window.
Source: rules/hatch3r-observability-tracing.md.

Every alert carries a runbook_url annotation linking to a runbook in docs/runbooks/ or equivalent. Runbook contains: symptoms, likely causes, diagnostic steps, remediation actions, owner team, escalation policy.
The check fails when runbook_url is missing or the target runbook file does not exist. Provide a validate-alerts script under scripts/ or rely on promtool check rules for Prometheus.
Source: rules/hatch3r-observability-metrics.md.

All 9 gates pass = the feature is "done". Anything less = not done.
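Two of the checks in the gates above reduce to pure functions: the traceparent format assertion (00-{32hex}-{16hex}-{2hex}) and the sampling formula target_sps = qps * head_sample * (1 + retry_factor). The following is a minimal TypeScript sketch under stated assumptions. The function names isValidTraceparent and targetSps are hypothetical. Rejecting all-zero trace and parent IDs comes from the W3C Trace Context specification rather than from the format string quoted above, and the numbers in the usage note are illustrative only.

```typescript
// Version 00 traceparent: 00-{32 hex}-{16 hex}-{2 hex}, lowercase hex only.
// The negative lookaheads reject an all-zero trace-id or parent-id,
// which W3C Trace Context treats as invalid.
const TRACEPARENT_V00 =
  /^00-(?!0{32})[0-9a-f]{32}-(?!0{16})[0-9a-f]{16}-[0-9a-f]{2}$/;

// Hypothetical helper for the outbound-header assertion in a test fixture.
function isValidTraceparent(header: string): boolean {
  return TRACEPARENT_V00.test(header);
}

// Direct transcription of: target_sps = qps * head_sample * (1 + retry_factor).
function targetSps(qps: number, headSample: number, retryFactor: number): number {
  return qps * headSample * (1 + retryFactor);
}
```

As a worked example, a service at 200 requests per second with the 0.1 head sample ratio and an assumed retry factor of 0.05 gives targetSps(200, 0.1, 0.05), which is 200 * 0.1 * 1.05 = 21 up to floating-point rounding.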
The orchestrator running this skill emits a single-line verdict per gate (GATE_N: PASS|FAIL <evidence-path>) and aggregates them. One FAIL on a required gate blocks the merge regardless of reviewer approval status.
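The verdict format and blocking rule can be sketched as a small parser and aggregator. This is an illustrative TypeScript sketch, not the orchestrator's actual code: parseVerdict and mergeAllowed are hypothetical names, and treating a required gate with no verdict line as not passed is an assumption drawn from "Skipping any gate = the feature is not done".

```typescript
type Status = "PASS" | "FAIL";
type Verdict = { gate: number; status: Status; evidence: string };

// Parses one line of the form "GATE_N: PASS|FAIL <evidence-path>".
// Returns null when the line does not match that shape.
function parseVerdict(line: string): Verdict | null {
  const match = /^GATE_(\d+): (PASS|FAIL) (\S.*)$/.exec(line.trim());
  if (match === null) return null;
  const [, gate = "", status = "", evidence = ""] = match;
  return { gate: Number(gate), status: status as Status, evidence };
}

// Returns true only when every required gate has a PASS verdict.
// Assumption: a required gate with no verdict line counts as not passed.
function mergeAllowed(lines: string[], requiredGates: number[]): boolean {
  const statusByGate = new Map<number, Status>();
  for (const line of lines) {
    const verdict = parseVerdict(line);
    if (verdict === null) continue; // ignore lines that are not verdicts
    // A FAIL is sticky: a later PASS for the same gate does not clear it.
    if (statusByGate.get(verdict.gate) !== "FAIL") {
      statusByGate.set(verdict.gate, verdict.status);
    }
  }
  return requiredGates.every((gate) => statusByGate.get(gate) === "PASS");
}
```

Passing the required gates as a parameter lets a pure service omit Gate 7 from the list, while an LLM-in-path feature passes all nine.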
After hatch3r-implementer finishes service code and before hatch3r-qa-validation runs.
On changes to src/routes/, src/handlers/, src/services/, src/api/, src/middleware/, src/controllers/, src/lib/, or any file matching the four observability rule globs.

rules/hatch3r-observability-logging.md
rules/hatch3r-observability-metrics.md
rules/hatch3r-observability-tracing.md (includes AI agent instrumentation; was previously split as -detail)
opentelemetry.io/docs/specs/semconv/
opentelemetry.io/docs/specs/semconv/gen-ai/
www.w3.org/TR/trace-context/
sre.google/workbook/alerting-on-slos/
grafana.com/docs/grafana/latest/alerting/
docs.sentry.io/product/releases/
github.com/traceloop/openllmetry

npx claudepluginhub hatch3r/hatch3r --plugin hatch3r