Skill

debug-instrument

Design a strategic logging, metrics, and tracing plan for production debugging and observability. Use when production issues are hard to diagnose because there aren't enough logs, when setting up monitoring for a new service, or when existing logs are too noisy or not useful. Trigger for: "we can't tell what's happening in production", "add logging to this service", "design observability for X", "we need better metrics", or after a production incident where lack of visibility was a problem.

Invocation

Slash command: /debug-agent:debug-instrument

User invocable · Model invocable · Inline context · Default effort

Tool Access

This skill is limited to the following tools:

Read, Grep, Glob, Write

SKILL.md

250 lines · ~1.8k tokens

Stats

Parent stars: 0

Maintenance: Good

Last Commit: Mar 20, 2026

Instrumentation Design — Logging, Metrics, and Tracing

Good instrumentation makes invisible production problems visible. The challenge is adding enough signal without creating noise or performance overhead.

Ask the user which component or service to instrument if not already specified. Then analyze the code and design a comprehensive instrumentation strategy.

Phase 1: Identify Critical Paths

Before deciding where to add instrumentation, map what matters:

User-facing paths — request → response chains that users feel directly
Business-critical paths — payment processing, authentication, data persistence
Performance-sensitive paths — high-frequency operations where overhead matters
Error-prone paths — areas that have historically had issues (check git log)
Integration points — calls to external services, databases, queues

For each path: what does success look like? What does failure look like? What intermediate states matter?
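For the error-prone paths item, "check git log" can be turned into a rough ranking. The sketch below is one possible approach, not part of the skill itself: it counts how often each file appears in commits whose message mentions "fix". The function names, the six-month window, and the "fix" keyword are assumptions chosen for illustration.

```python
import subprocess
from collections import Counter


def churn_by_file(git_log_output):
    """Count how many of the listed commits touched each file."""
    files = (line.strip() for line in git_log_output.splitlines())
    return Counter(path for path in files if path)


def fix_commit_churn(repo_dir=".", since="6 months ago"):
    """Run git and rank files by how many fix-related commits touched them."""
    output = subprocess.run(
        # --pretty=format: suppresses commit headers, leaving only file names.
        ["git", "log", f"--since={since}", "-i", "--grep=fix",
         "--name-only", "--pretty=format:"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return churn_by_file(output).most_common()
```

Files near the top of the ranking are candidates for Level 1 or Level 2 instrumentation; the ranking is a hint, not a substitute for knowing the service.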

Phase 2: Instrumentation Strategy

Three levels — choose based on the path's criticality:

Level 1 — Essential (Every production service needs this)

Entry/Exit Points:

Every API endpoint: log request received (method, path, user ID) and response sent (status, duration)
Every database transaction: log start/commit/rollback with duration
Every external service call: log attempt, result, and duration
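The entry/exit items above share one shape: a record when the operation starts and a record with status and duration when it ends. A minimal sketch of that shape as a decorator follows. The name `log_boundary`, the logger name, and the `charge_card` function are hypothetical; a real endpoint would also include method, path, and user ID.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("instrumentation.sketch")


def log_boundary(operation):
    """Decorator that logs one entry record and one exit record per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            logger.info(json.dumps({"operation": operation, "event": "started"}))
            status = "failure"  # overwritten only if the call returns normally
            try:
                result = func(*args, **kwargs)
                status = "success"
                return result
            finally:
                duration_ms = round((time.perf_counter() - start) * 1000, 3)
                logger.info(json.dumps({
                    "operation": operation,
                    "event": "finished",
                    "status": status,
                    "duration_ms": duration_ms,
                }))
        return wrapper
    return decorator


@log_boundary("charge_card")  # hypothetical external service call
def charge_card(amount_cents):
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"charged": amount_cents}
```

Putting the exit record in `finally` means a duration is logged on both the success and the failure path.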

Error Conditions:

Every exception catch: log error type, message, stack trace, and relevant context
Every validation failure: log what was invalid and why
Every timeout: log which operation timed out and after how long
Every retry: log attempt number and reason
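As an illustration of the retry and exception items, the sketch below logs each failed attempt with its number and reason, and attaches the stack trace on the final failure. `call_with_retries` and its parameters are hypothetical names, and the fixed delay is a simplification; real services usually add backoff.

```python
import logging
import time

logger = logging.getLogger("instrumentation.retry")


def call_with_retries(func, max_attempts=3, delay_s=0.0):
    """Call func(), logging every failed attempt with its number and reason."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            final = attempt == max_attempts
            log = logger.error if final else logger.warning
            log(
                "attempt %d/%d failed: %s: %s",
                attempt, max_attempts, type(exc).__name__, exc,
                exc_info=final,  # include the stack trace once, on the final failure
            )
            if final:
                raise
            time.sleep(delay_s)
```

Intermediate failures go out at WARN and only the exhausted retry is an ERROR, which keeps error-rate alerts tied to failures the caller actually saw.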

Performance Markers:

Operation start/end times for anything taking >100ms
Queue depths at enqueue/dequeue
Cache hit/miss rates
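Two of the performance markers above can be sketched briefly: a timer that only reports operations slower than the 100 ms threshold, and a read-through cache that counts hits and misses. All names here (`timed`, `cached_get`, `hit_rate`) are hypothetical, and the module-level counter is an assumption that suits a single process only.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("instrumentation.perf")
SLOW_THRESHOLD_MS = 100  # threshold taken from the list above
cache_stats = Counter()  # counts "hit" and "miss"


@contextmanager
def timed(operation, threshold_ms=SLOW_THRESHOLD_MS):
    """Log the operation's duration only when it exceeds threshold_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        if duration_ms > threshold_ms:
            logger.warning("slow operation %s took %.1f ms", operation, duration_ms)


def cached_get(cache, key, loader):
    """Read through a dict cache, recording a hit or a miss."""
    if key in cache:
        cache_stats["hit"] += 1
        return cache[key]
    cache_stats["miss"] += 1
    cache[key] = loader(key)
    return cache[key]


def hit_rate():
    """Fraction of lookups served from the cache so far."""
    total = cache_stats["hit"] + cache_stats["miss"]
    return cache_stats["hit"] / total if total else 0.0
```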

Level 2 — Diagnostic (Add for complex services)

State Changes:

User authentication events (login/logout/failure)
Order/payment status changes
Configuration updates (what changed, who changed it)
Feature flag state changes
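For the configuration-update item, "what changed, who changed it" maps naturally onto a diff of the old and new settings plus an actor field. The following is a small sketch under that assumption; `diff_config`, `log_config_update`, and the record layout are illustrative, and secret values would need redacting before logging.

```python
import json
import logging

logger = logging.getLogger("instrumentation.audit")


def diff_config(old, new):
    """Return only the keys whose values differ, with old and new values."""
    keys = sorted(set(old) | set(new))
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in keys
        if old.get(key) != new.get(key)
    }


def log_config_update(old, new, actor):
    """Emit one audit record per update that actually changed something."""
    changes = diff_config(old, new)
    if changes:
        logger.info(json.dumps({
            "operation": "config_update",
            "actor": actor,
            "changes": changes,
        }))
    return changes
```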

Business Events:

Key user actions (significant ones, not every click)
Background job start/completion/failure
Data pipeline stage completion

Level 3 — Deep Debug (Add temporarily for investigations)

Detailed Tracing:

Parameter values for complex or buggy functions
Intermediate calculation results
Loop iteration counts for operations that vary widely
Branch decisions for complex conditionals

Remove Level 3 instrumentation once the investigation is complete, or gate it behind a feature flag.
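Gating deep-debug output behind a flag might look like the sketch below. It assumes an environment variable stands in for a real feature-flag client; the flag name `DEEP_DEBUG_PRICING` and the function `apply_discounts` are hypothetical.

```python
import logging
import os

logger = logging.getLogger("instrumentation.deepdebug")


def deep_debug_enabled(flag="DEEP_DEBUG_PRICING"):
    """Check the flag; an environment variable stands in for a flag service."""
    return os.environ.get(flag, "").lower() in {"1", "true", "on"}


def apply_discounts(prices, discount_rate):
    """Hypothetical function under investigation."""
    debug = deep_debug_enabled()
    total = 0.0
    for index, price in enumerate(prices):
        discounted = price * (1 - discount_rate)
        if debug:
            # Level 3: parameter values and intermediate results per iteration.
            logger.debug("iteration=%d price=%r discounted=%r", index, price, discounted)
        total += discounted
    if debug:
        logger.debug("iterations=%d total=%r", len(prices), total)
    return total
```

With the flag off, the only added cost per call is one environment lookup, so the instrumentation can stay in place until the investigation ends.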

Phase 3: Log Format Design

Structured logging makes logs queryable. Use JSON format with these standard fields:

{
  "timestamp": "ISO-8601 format",
  "level": "DEBUG | INFO | WARN | ERROR",
  "service": "service-name",
  "trace_id": "correlation ID for distributed tracing",
  "span_id": "ID for this specific operation",
  "user_id": "if request is user-scoped",
  "operation": "what is happening",
  "duration_ms": "for completed operations",
  "status": "success | failure | timeout",
  "error": "error details if applicable",
  "metadata": {
    "custom": "contextual fields"
  }
}
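One way to produce records with these fields is a custom formatter for Python's standard `logging` module, sketched below. The class name `JsonFormatter` and the choice to read optional fields from `extra=` are assumptions; the log message is used as the `operation` field.

```python
import json
import logging
from datetime import datetime, timezone

# logging calls the level WARNING; the format above uses WARN.
LEVEL_NAMES = {"WARNING": "WARN"}
OPTIONAL_FIELDS = ("trace_id", "span_id", "user_id", "duration_ms", "status", "metadata")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object using the field names above."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            "operation": record.getMessage(),
        }
        # Optional fields arrive through logging's `extra=` argument.
        for name in OPTIONAL_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)
```

A call such as `logger.info("fetch_order", extra={"trace_id": tid, "duration_ms": 42})` would then carry the correlation fields without string formatting.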

Sampling strategy (to control volume):

100% for errors and warnings — never sample these out
10% for normal successful operations in high-traffic paths
1% for health checks and very high-frequency operations
Make sampling rate configurable at runtime
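The sampling rules above reduce to a small decision function. This sketch assumes two hypothetical category names, `normal` and `high_frequency`, and keeps the rates in a plain dict that can be changed while the process runs; the injectable `rng` parameter exists only to make the behavior testable.

```python
import random

# Rates from the list above; update this dict at runtime to retune sampling.
SAMPLE_RATES = {"normal": 0.10, "high_frequency": 0.01}


def should_log(level, category="normal", rng=random.random):
    """Return True if this record should be emitted."""
    if level in ("WARN", "ERROR"):
        return True  # never sample out warnings or errors
    rate = SAMPLE_RATES.get(category, 1.0)  # unknown categories are kept in full
    return rng() < rate
```

Sampling per record, as here, can split one request's logs; sampling on a hash of the trace ID is a common alternative when whole requests should be kept or dropped together.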

Phase 4: Metrics Design

Unlike individual log lines, metrics are designed to be aggregated. Design metrics that answer operational questions:

RED Metrics (for every service):

Rate — requests per second (counter)
Errors — error count and rate (counter)
Duration — response time distribution (histogram with P50, P95, P99)
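To make the three RED measurements concrete, here is an in-memory sketch. It is not a replacement for a metrics library: the class `RedMetrics` is hypothetical, raw duration samples stand in for a real histogram, and the request rate would be derived by dividing the change in the request counter by the scrape interval.

```python
import math
from collections import defaultdict


class RedMetrics:
    """In-memory RED metrics for one service (sketch, not thread-safe)."""

    def __init__(self):
        self.requests = defaultdict(int)       # counter: total requests per operation
        self.errors = defaultdict(int)         # counter: failed requests per operation
        self.durations_ms = defaultdict(list)  # raw samples standing in for a histogram

    def observe(self, operation, duration_ms, ok):
        """Record one finished request."""
        self.requests[operation] += 1
        if not ok:
            self.errors[operation] += 1
        self.durations_ms[operation].append(duration_ms)

    def error_rate(self, operation):
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0

    def percentile(self, operation, p):
        """Nearest-rank percentile for p in (0, 100]; None if no samples."""
        samples = sorted(self.durations_ms[operation])
        if not samples:
            return None
        rank = math.ceil(p * len(samples) / 100)
        return samples[rank - 1]
```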

USE Metrics (for resources):

Utilization — resource usage percentage (gauge)
Saturation — queue depth, wait time (gauge)
Errors — resource allocation failures (counter)
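The USE readings can be illustrated with a fixed-size connection pool, assuming a simplified model in which callers either get a connection, wait in a bounded queue, or are rejected. `PoolGauges` and its fields are hypothetical names, not an API of any pool library.

```python
class PoolGauges:
    """USE readings for a hypothetical fixed-size connection pool."""

    def __init__(self, capacity, max_waiters):
        self.capacity = capacity
        self.max_waiters = max_waiters
        self.in_use = 0
        self.waiting = 0
        self.allocation_failures = 0

    def acquire(self):
        if self.in_use < self.capacity:
            self.in_use += 1
            return "acquired"
        if self.waiting < self.max_waiters:
            self.waiting += 1
            return "queued"
        self.allocation_failures += 1
        return "rejected"

    def release(self):
        if self.waiting:
            self.waiting -= 1  # the freed connection goes straight to a waiter
        elif self.in_use:
            self.in_use -= 1

    def snapshot(self):
        return {
            "utilization_pct": 100 * self.in_use / self.capacity,  # gauge
            "saturation_waiting": self.waiting,                    # gauge
            "errors_total": self.allocation_failures,              # counter
        }
```

Utilization can sit at 100% while the service is healthy; it is the saturation gauge climbing alongside it that signals work is backing up.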

Business Metrics (per domain):

Transaction success rate
Feature usage counts
Revenue-impacting events

Phase 5: Distributed Tracing

For microservices, design trace context propagation:

Create spans for:

Every HTTP request (inbound and outbound)
Every database query
Every cache operation
Every message queue publish/consume
Every background job execution

Propagate context via:

HTTP headers: X-Trace-Id, X-Span-Id, X-Parent-Span-Id
Message metadata for async flows
Database query comments for DB-level tracing
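Header propagation comes down to two operations: inject the context into outbound headers and extract it from inbound ones. The sketch below uses the header names listed above; `inject` and `extract` are hypothetical helpers. Many tracing libraries use the W3C Trace Context `traceparent` header instead, so treat the custom names as this document's convention.

```python
import secrets


def inject(headers, trace_id, span_id, parent_span_id=None):
    """Return a copy of the outbound headers with trace context added."""
    out = dict(headers)
    out["X-Trace-Id"] = trace_id
    out["X-Span-Id"] = span_id
    if parent_span_id:
        out["X-Parent-Span-Id"] = parent_span_id
    return out


def extract(headers):
    """Read inbound context; start a new trace when none is present."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "trace_id": lowered.get("x-trace-id") or secrets.token_hex(16),
        "span_id": secrets.token_hex(8),             # new span for this service's work
        "parent_span_id": lowered.get("x-span-id"),  # the caller's span becomes the parent
    }
```

The same two functions apply to async flows if the message metadata is treated as the header dictionary.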

Output Format

# Instrumentation Plan

**Service/Component:** [name]
**Current coverage:** [Low/Medium/High]
**Performance budget:** [acceptable overhead %]

---

## Priority 1 — Implement This Week

### [Component/Endpoint name]

**Entry point logging:**
```[language]
// Location: [file:line]
logger.info("Operation started", {
  operation: "name",
  trace_id: ctx.traceId,
  user_id: ctx.userId,
  // key inputs
});
```

Exit/error logging:

// On success
logger.info("Operation completed", { duration_ms: Date.now() - start, status: "success" });

// On error
logger.error("Operation failed", { error: err.message, stack: err.stack, duration_ms: ... });

Metric:

operation_duration_ms{operation="name", status="success|failure"}

[Repeat for each critical path]

Metrics Implementation

// Recommended metric definitions
[concrete metric code for the target language/framework]

Alert Recommendations

- name: high_error_rate
  condition: error_rate > 1%
  severity: critical

- name: high_latency
  condition: p95_latency > 1000ms
  severity: warning

Log Level Configuration

| Level | Development | Staging | Production |
| --- | --- | --- | --- |
| DEBUG | On | On | Off |
| INFO | On | On | Sampled (10%) |
| WARN | On | On | On |
| ERROR | On | On | On (100%) |

Performance Impact Estimate

| Type | CPU overhead | Memory overhead | I/O impact |
| --- | --- | --- | --- |
| Structured logging | <1% | <10 MB | Async buffered |
| Metrics | <0.5% | <5 MB | Batched |
| Tracing | <2% | <20 MB | Sampled |

Rollout Plan

Week 1

Add error logging to all exception handlers
Add service boundary logging (API entry/exit)
Set up basic RED metrics

Week 2

Add performance tracking for slow operations
Implement distributed tracing
Create initial dashboards

Week 3

Add business event tracking
Configure sampling strategies
Set up alerting rules
