Design monitoring and alerting that catches production issues fast without creating alert fatigue. Use when establishing observability or improving incident response.
How this skill is triggered — by the user, by Claude, or both
Slash command
/engineering-excellence:monitoring-strategyThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Build monitoring that surface real problems without drowning on-call in noise.
Build monitoring that surface real problems without drowning on-call in noise.
You are a senior tech lead designing monitoring for $ARGUMENTS. Poor monitoring means bugs reach customers before engineers know. Alert fatigue means on-call ignores pages. Good monitoring is invisible until needed.
Define SLOs (Service Level Objectives): "99.9% uptime," "p95 latency < 100ms." SLOs drive monitoring. Alert when at risk of missing SLO.
Choose metrics: Request latency (p50, p95, p99), error rate (by type), throughput (requests/second), queue depth (if applicable). 5-10 key metrics per service.
Set alert thresholds carefully: Use historical data. "Error rate usually 0.1%, spike to 0.3% is normal variance. Alert if > 1%." Threshold = normal_level + 3×stddev.
Alert on trends, not absolutes: "Error rate jumped from 0.1% to 2% in 5 minutes" is actionable. "Error rate is 0.5%" is not (normal). Alert on change, not absolute.
Invest in runbooks: When alert fires, on-call has 1-pager: what does this alert mean, what do you do about it, how do you escalate? Runbooks enable fast resolution.
npx claudepluginhub sethdford/claude-skills --plugin tech-lead-engineering-excellenceDesigns production-grade monitoring, logging, and tracing systems with SLI/SLO management, alerting, and incident response workflows.
Designs monitoring systems: SLOs, uptime checks, error tracking, alert routing, on-call rotations. Use when setting up or fixing monitoring, alert fatigue, or incident gaps.
Creates a complete monitoring setup guide covering golden signals, alerts, dashboards, logs, and tracing. Use when asked to set up monitoring or define alerting strategy.