From godmode
Assesses and implements observability: RED/USE metrics, structured logging, OpenTelemetry tracing, SLOs, alerting rules, and dashboards. For adding monitoring, Prometheus, or shipping without instrumentation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/godmode:observeThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- `/godmode:observe`, "add monitoring", "set up logging"
/godmode:observe, "add monitoring", "set up logging"| Pillar | Status | Coverage | Tools |
| Metrics | PARTIAL | 40% | Prometheus |
| Logging | BASIC | 60% | stdout |
| Tracing | NONE | 0% | — |
| Alerting | MINIMAL | 20% | PagerDuty |
# Detect observability libraries
grep -rE "prom-client|datadog|opentelemetry" \
package.json requirements.txt 2>/dev/null
RED (request-driven):
http_requests_total (counter: method, path, status)
http_request_duration_sec (histogram: method, path)
http_requests_in_flight (gauge)
USE (infrastructure):
CPU, memory, disk, network, connections
Business:
user_signups_total, orders_completed_total,
payment_failures_total, active_sessions
IF cardinality > 10K labels: remove high-cardinality (user_id, request_id go in logs, not metrics).
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "info",
"service": "api-gateway",
"trace_id": "abc123",
"request_id": "req-456",
"message": "request completed",
"duration_ms": 45
}
| Level | When | Alert? |
| FATAL | Cannot continue | Page on-call |
| ERROR | Operation failed | Alert channel |
| WARN | Unexpected but handled | Track trend |
| INFO | Business events | No |
| DEBUG | Diagnostic (dev only) | No |
NEVER log PII, tokens, passwords, or credit cards. ALWAYS include request_id in every log entry.
OpenTelemetry setup (recommended):
Auto-instrument: HTTP, DB, Redis, gRPC
Propagate: trace_id across all service calls
Backend: Jaeger|Tempo|Zipkin|Datadog
IF sampling rate too low (< 1%): miss rare errors. IF 100% sampling: too expensive at scale. Default: 10% sampling, 100% for errors.
| SLI | Target | Window | Error Budget |
| Availability | 99.9% | 30d | 43.2 min |
| Latency p99 | < 500ms | 30d | 0.1% slow |
| Error rate | < 0.1% | 30d | budget: 0.1% |
Burn rate alerts: Fast: 14.4x over 1h (consumes 2% budget) Slow: 6x over 6h (consumes 5% budget)
| Alert | Condition | Severity |
| HighErrorRate | 5xx > 1% for 5m | Critical |
| HighLatencyP99 | p99 > 2s for 5m | Critical |
| DiskSpaceLow | usage > 85% | Warning |
| PodCrashLoop | restarts > 3/5m | Critical |
EVERY alert MUST have for duration (min 2m critical,
5m warning). No flapping alerts.
Row 1: Request Rate, Error Rate, Latency, Saturation
Row 2: SLO status, error budget remaining
Row 3: Per-service breakdown
Row 4: Infrastructure (CPU, memory, disk, network)
for duration.Append .godmode/observe-results.tsv:
timestamp pillar tool items_configured coverage_pct status
KEEP if: pillar produces verified output AND
latency increase < 5%.
DISCARD if: no output OR perf degrades > 5%
OR cardinality explosion.
STOP when FIRST of:
- All 3 pillars verified (metrics/logs/traces)
- SLOs defined + alerts configured
- User requests stop
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Cardinality explosion | Remove labels, use histograms |
| Missing trace spans | Check sampling, propagation |
| Alert fatigue | Tune thresholds, add for: duration |
| Log volume exceeds budget | Drop DEBUG, sample paths |
npx claudepluginhub arbazkhan971/godmodeSets up observability with structured logging, metrics collection, distributed tracing, alerting rules, dashboards, SLOs using ELK, Prometheus, Grafana, Datadog, OpenTelemetry. For monitoring, production debugging, observability architecture.
Audits and designs observability instrumentation: structured logging, metrics, tracing, and alerting. Use when reviewing coverage gaps or defining SLIs/SLOs.
Designs production-grade monitoring, logging, and tracing systems with SLI/SLO management, alerting, and incident response workflows.