From uptime
This skill should be used when the user asks to "check alerts", "investigate an outage", "why is my site down", "triage incidents", "what's alerting", "show current outages", "diagnose downtime", or needs to investigate monitoring alerts and outages. Covers alert review, outage analysis, upstream provider correlation, and escalation guidelines.
Slash command
/uptime:incident-triage
Workflow for investigating alerts, outages, and service degradation.
Use list_alerts to see all active alerts. Key fields:
| Field | Meaning |
|---|---|
| alert_type | Check type that triggered (HTTP, DNS, SSL, etc.) |
| is_up | false = currently down, true = recovered |
| created_at | When the alert fired |
| output | Raw check output; the most useful diagnostic field |
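
As a rough illustration of reading these fields, the sketch below filters a list of alerts down to the ones still down and orders them by when they fired. The record shape (a list of dicts with ISO 8601 timestamps) and the sample values are assumptions made for the example; only the four field names come from the table above.

```python
from datetime import datetime

# Hypothetical alert records. The real list_alerts response may be shaped differently.
alerts = [
    {"alert_type": "HTTP", "is_up": False,
     "created_at": "2024-01-01T14:25:00+00:00", "output": "Connection timed out"},
    {"alert_type": "DNS", "is_up": False,
     "created_at": "2024-01-01T14:23:00+00:00", "output": "SERVFAIL"},
    {"alert_type": "SSL", "is_up": True,
     "created_at": "2024-01-01T09:00:00+00:00", "output": "OK"},
]


def still_down(alerts):
    """Return alerts that have not recovered (is_up is false), oldest first."""
    down = [a for a in alerts if not a["is_up"]]
    return sorted(down, key=lambda a: datetime.fromisoformat(a["created_at"]))


for alert in still_down(alerts):
    # output is printed last because it usually carries the actual error.
    print(alert["created_at"], alert["alert_type"], alert["output"])
```

Sorting oldest first puts the earliest failure at the top, which is often the one closest to the root cause.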
Use get_check on the alerting check to see:
Use list_outages filtered by check to see the timeline:
Multiple simultaneous alerts often point to a root cause upstream of any individual check.
| Alerts firing | Likely root cause |
|---|---|
| DNS A + HTTP | DNS resolution failure; HTTP can't connect because DNS is broken |
| DNS NS + DNS A + HTTP | Nameserver failure, cascading into all resolution |
| DNS MX + SMTP | DNS-level mail routing failure |
Triage action: check DNS checks first. If NS is down, that's the root cause.
| Alerts firing | Likely root cause |
|---|---|
| SSL + HTTP (certificate error in output) | Expired or misconfigured certificate |
| SSL only (HTTP still passing) | Certificate issue browsers warn on but don't block yet |
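
The two correlation tables above can be read as a small rule set. The following is a minimal sketch of that reading, not part of the skill itself: the check-type labels (`"DNS NS"`, `"DNS A"`, and so on) and the function name are assumptions chosen to mirror the table rows.

```python
def likely_root_cause(alert_types, http_output=""):
    """Map simultaneously firing check types to a probable root cause.

    alert_types: iterable of labels such as "DNS A" or "HTTP" (assumed naming).
    http_output: raw output of the HTTP check, used to spot certificate errors.
    """
    firing = set(alert_types)

    # DNS rules come first: if NS is down, that is treated as the root cause.
    if "DNS NS" in firing:
        return "Nameserver failure, cascading into all resolution"
    if {"DNS A", "HTTP"} <= firing:
        return "DNS resolution failure"
    if {"DNS MX", "SMTP"} <= firing:
        return "DNS-level mail routing failure"

    # SSL rules.
    if {"SSL", "HTTP"} <= firing and "certificate" in http_output.lower():
        return "Expired or misconfigured certificate"
    if firing == {"SSL"}:
        return "Certificate issue that browsers warn on but do not block yet"

    return "No known correlation pattern"


print(likely_root_cause(["DNS A", "HTTP"]))
print(likely_root_cause(["SSL", "HTTP"], http_output="certificate has expired"))
```

Checking the DNS rules before the SSL rules follows the triage action above: a broken resolver explains failures in everything that depends on it.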
If checks across multiple unrelated domains fail simultaneously:
This step is mandatory during every triage. Many outages that appear local are actually caused by infrastructure provider incidents.
Key providers to check: Cloudflare, AWS, Google Cloud, Microsoft Azure, Fastly, Akamai.
Even without explicit dependency monitoring, infer providers from DNS records:
| Record type | Pattern | Provider |
|---|---|---|
| CNAME | *.cloudfront.net | AWS CloudFront |
| CNAME | *.cdn.cloudflare.net | Cloudflare CDN |
| CNAME | *.fastly.net | Fastly |
| NS | *.cloudflare.com | Cloudflare DNS |
| NS | awsdns-* | AWS Route 53 |
| MX | *.google.com | Google Workspace |
| MX | *.outlook.com | Microsoft 365 |
If DNS checks show CNAME pointing to *.cloudfront.net and AWS CloudFront is reporting an incident, that's the root cause.
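
A minimal sketch of the inference above, using glob-style matching on record targets. The function name and sample hostnames are illustrative. The Route 53 pattern is written as `*awsdns-*` here, on the assumption that the `awsdns-` token appears in the middle of the nameserver hostname rather than at its start.

```python
from fnmatch import fnmatchcase

# (record type, glob pattern, provider), taken from the table above.
PROVIDER_PATTERNS = [
    ("CNAME", "*.cloudfront.net", "AWS CloudFront"),
    ("CNAME", "*.cdn.cloudflare.net", "Cloudflare CDN"),
    ("CNAME", "*.fastly.net", "Fastly"),
    ("NS", "*.cloudflare.com", "Cloudflare DNS"),
    ("NS", "*awsdns-*", "AWS Route 53"),  # leading * is an assumption, see above
    ("MX", "*.google.com", "Google Workspace"),
    ("MX", "*.outlook.com", "Microsoft 365"),
]


def infer_provider(record_type, target):
    """Return the provider implied by a DNS record target, or None if unknown."""
    # Normalise: drop the trailing root dot and compare in lowercase.
    target = target.rstrip(".").lower()
    for rtype, pattern, provider in PROVIDER_PATTERNS:
        if rtype == record_type.upper() and fnmatchcase(target, pattern):
            return provider
    return None


print(infer_provider("CNAME", "d111111abcdef8.cloudfront.net."))
print(infer_provider("NS", "ns-123.awsdns-45.com"))
```

The inferred provider names can then be compared against whichever providers are currently reporting incidents.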
Upstream incident detected: Cloudflare is reporting degraded performance in EU regions (started 14:23 UTC). Your checks failing from EU probe locations are likely caused by this. No action needed on your side; monitor Cloudflare's status page for resolution.
Before triaging, note whether alerting checks have escalation rules. If escalation rules exist, someone may already be notified. If not and the outage is critical, manually notify appropriate stakeholders.
is_up: false for > 5 minutes
False positives are alerts that fire when the service is actually healthy.
If false positives are frequent, recommend adjusting the monitoring configuration:
| Problem | Fix |
|---|---|
| Single-location flapping | Increase sensitivity to >= 2 (require multiple locations to confirm) |
| Probe location unreliable | Replace with a different location in the same region |
| Timeout-based false alerts | Increase timeout value to accommodate normal latency variance |
| Interval too aggressive | Increase interval for non-critical checks (e.g. 1 min -> 5 min) |
| All checks share same locations | Diversify locations across regions to reduce correlated false alerts |
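
To show why raising sensitivity suppresses single-location flapping, here is a simplified model that assumes sensitivity means the number of probe locations that must fail before an alert is raised, as the first table row suggests. The location names and function name are hypothetical.

```python
def confirmed_down(results_by_location, sensitivity=2):
    """Return True only if at least `sensitivity` locations report a failure.

    results_by_location maps a probe location name to True (check passed)
    or False (check failed).
    """
    failing = [loc for loc, passed in results_by_location.items() if not passed]
    return len(failing) >= sensitivity


# One flaky probe is not enough to alert when sensitivity is 2.
single_failure = {"US-East": False, "EU-West": True, "AP-South": True}
print(confirmed_down(single_failure, sensitivity=2))

# Two locations agreeing is treated as a confirmed outage.
double_failure = {"US-East": False, "EU-West": False, "AP-South": True}
print(confirmed_down(double_failure, sensitivity=2))
```

With sensitivity set to 1, the first case would alert, which is the flapping behaviour the table describes.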
False negatives are real outages that monitoring fails to detect. These are more dangerous than false positives because they create a false sense of security.
| Cause | Why it happens | Fix |
|---|---|---|
| Missing check types | Only HTTP monitored, but DNS was the actual failure point | Add DNS, SSL, ICMP checks for comprehensive coverage |
| Wrong endpoint monitored | Health endpoint returns 200 even when the app is broken | Monitor a functional endpoint that exercises the real code path |
| expect_string not set | HTTP check passes on any 200 response, even error pages | Add expect_string to verify response content |
| Too few locations | All probes are in one region, so an outage confined to other regions goes undetected | Use 3-5 locations across multiple continents |
| Check is paused | Forgotten manual pause or stale maintenance window | Review paused checks; convert to scheduled maintenance windows |
| No upstream monitoring | Provider outage causes degradation but no check covers the dependency | Add CloudStatus checks for critical upstream providers |
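
The expect_string row is easiest to see with a concrete case. The sketch below is a simplified model of an HTTP check, not the monitoring service's actual logic: it passes on status 200 alone unless an expected string is supplied, in which case the body must contain it.

```python
def http_check_passes(status_code, body, expect_string=None):
    """Simplified HTTP check: status must be 200 and, if given, the body
    must contain expect_string."""
    if status_code != 200:
        return False
    if expect_string is None:
        return True  # any 200 response counts as healthy
    return expect_string in body


# A hypothetical error page that is still served with status 200.
error_page = "<html><body>Something went wrong. Please try again.</body></html>"

print(http_check_passes(200, error_page))                           # false negative
print(http_check_passes(200, error_page, expect_string="Sign in"))  # outage detected
```

Choosing a string that only appears when the page renders correctly is what turns the second call into a real health signal.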
Frequent false negatives indicate the monitoring strategy needs a broader review. Recommend invoking the monitoring-optimization skill to run a full audit: gap analysis, configuration review, and upstream dependency check.
Use ignore_alert to exclude a confirmed false positive from outage calculations. This is important for accurate SLA reporting: ignored alerts don't count as downtime.
When to ignore:
When NOT to ignore:
Always confirm with the user before ignoring alerts, as it affects SLA metrics.
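
The effect of ignoring an alert on SLA figures can be sketched as follows. The outage record shape (`duration_seconds`, `ignored`) and the numbers are assumptions for illustration; the only behaviour taken from the text is that ignored alerts do not count as downtime.

```python
def uptime_percent(period_seconds, outages):
    """Uptime over a period, skipping outages flagged as ignored."""
    counted = sum(
        o["duration_seconds"] for o in outages if not o.get("ignored", False)
    )
    return 100.0 * (period_seconds - counted) / period_seconds


THIRTY_DAYS = 30 * 24 * 3600  # 2,592,000 seconds

outages = [
    {"duration_seconds": 1800, "ignored": False},  # real outage
    {"duration_seconds": 3600, "ignored": True},   # confirmed false positive
]

# Same outages with nothing ignored, for comparison.
nothing_ignored = [dict(o, ignored=False) for o in outages]

print(round(uptime_percent(THIRTY_DAYS, outages), 3))
print(round(uptime_percent(THIRTY_DAYS, nothing_ignored), 3))
```

The gap between the two figures is the reason to confirm with the user first: ignoring a real outage would overstate uptime by the same mechanism.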
npx claudepluginhub uptime-com/uptime-skills --plugin uptime