Skill

ops-investigate-alert

Investigates monitoring alerts end-to-end by pulling metrics, logs, traces, and recent code changes to identify root causes. For on-call engineers handling alerts via Datadog, Grafana, or PagerDuty MCPs.

Bash

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ai-toolkit:ops-investigate-alert

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Investigate a monitoring alert by pulling metrics, logs, traces, and related service code. Symptoms in, root cause hypothesis out.

SKILL.md

166 lines · ~1.4k tokens

Stats

Language: JavaScript

Parent stars: 70

Parent forks: 5

Maintenance: Excellent

Last Commit: Apr 8, 2026

Investigate Alert

Investigate a monitoring alert by pulling metrics, logs, traces, and related service code. Symptoms in, root cause hypothesis out.

When to Use

A monitoring alert fired and you need to understand why
On-call engineer needs a structured investigation starting point
Alert is noisy and you want to determine if it's actionable or a false positive

Process

1. Verify Monitoring MCP Availability

Check which monitoring MCP servers are available. Look for any mcp__* tools related to monitoring platforms (Datadog, Grafana, PagerDuty, etc.).

Recommended: Datadog MCP — provides the richest investigation surface (monitors, metrics, logs, traces, events in one platform).

If no monitoring MCP is available, stop with:

Error: No monitoring MCP server found. This skill requires a monitoring MCP to query alert data. Recommended: add the Datadog MCP to your Claude Code MCP settings.

Also check for optional tools:

GitHub CLI (gh) — for reading related service code and recent deploys
Kubernetes MCP — for checking pod status

Note which are available — adapt the investigation accordingly.
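The availability check in this step can be sketched as a simple scan over tool names. This is a minimal illustration, not the skill's actual implementation: the `mcp__` prefix and the platform names come from the text above, while the keyword substrings and the example tool names are assumptions about how a given MCP server names its tools.

```python
# Hypothetical keyword map: the platform names come from the skill text, but
# the substrings matched here are an assumption about MCP tool naming.
MONITORING_KEYWORDS = {
    "datadog": "Datadog",
    "grafana": "Grafana",
    "pagerduty": "PagerDuty",
}


def detect_monitoring_platforms(tool_names):
    """Return monitoring platforms that expose at least one mcp__* tool."""
    found = []
    for name in tool_names:
        lowered = name.lower()
        if not lowered.startswith("mcp__"):
            continue  # not an MCP tool, ignore it
        for keyword, platform in MONITORING_KEYWORDS.items():
            if keyword in lowered and platform not in found:
                found.append(platform)
    return found


def require_monitoring_mcp(tool_names):
    """Return detected platforms, or raise when none is available (the stop condition)."""
    platforms = detect_monitoring_platforms(tool_names)
    if not platforms:
        raise RuntimeError(
            "No monitoring MCP server found. This skill requires a "
            "monitoring MCP to query alert data."
        )
    return platforms
```

For example, a tool list containing a hypothetical `mcp__datadog__get_monitor` would yield `["Datadog"]`, while a list with only `Bash` would raise the error and stop the investigation.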

2. Parse Input

If the input is a monitoring platform URL:

Extract the monitor/alert ID from the URL
Proceed to Step 3

If the input is an alert name or description:

Search monitors using the available monitoring MCP
If multiple monitors match, list them and ask the user to confirm which one
If none match, ask the user for more details
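The branching in this step can be sketched as a small classifier. It assumes a URL shape in which the monitor ID is a numeric path segment after `/monitors/` (as in `https://app.datadoghq.com/monitors/12345678`); other platforms use different URL layouts, so the pattern and the `parse_alert_input` name are illustrative only.

```python
import re
from urllib.parse import urlparse

# Assumption: the monitor ID is the digits that follow a "/monitors/" path
# segment. Adjust the pattern for platforms with a different URL layout.
_MONITOR_PATH = re.compile(r"/monitors/(\d+)")


def parse_alert_input(text):
    """Classify user input.

    Returns ("monitor_id", id) for a recognized monitor URL,
    ("unknown_url", url) for a URL with no recognizable ID,
    and ("search", query) for a free-text name or description.
    """
    text = text.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        match = _MONITOR_PATH.search(parsed.path)
        if match:
            return ("monitor_id", match.group(1))
        return ("unknown_url", text)  # ask the user for more details
    return ("search", text)  # search monitors by name or description
```

A `"search"` result leads into the monitor search described above; a `"monitor_id"` result goes straight to Step 3.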

3. Fetch Monitor Details

Retrieve the monitor configuration and current state:

Monitor name, type, and query
Current status (OK / Alert / Warn / No Data)
Last triggered time
Affected service(s) and environment(s)
Alert message and runbook link (if any)

4. Query Metrics

Fetch the metric(s) that triggered the alert:

Time window: from 1 hour before the trigger to now if the alert is still active, or to 1 hour after resolution if it has resolved
What to look for: anomalies, spikes, drops, flat lines, threshold crossings
Compare against the monitor's threshold to understand severity
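The time window described above can be computed as follows. This is a sketch under stated assumptions: timestamps are timezone-aware `datetime` objects, and the end of the window is capped at the current time when the alert resolved less than an hour ago (a choice made here, not something the skill specifies).

```python
from datetime import datetime, timedelta, timezone

PADDING = timedelta(hours=1)  # the 1-hour margin from the step above


def investigation_window(triggered_at, resolved_at=None, now=None):
    """Return (start, end) for metric and log queries.

    start: 1 hour before the trigger.
    end:   now if the alert is ongoing, else 1 hour after resolution,
           capped at now so the window never extends into the future.
    """
    now = now or datetime.now(timezone.utc)
    start = triggered_at - PADDING
    if resolved_at is None:
        end = now
    else:
        end = min(resolved_at + PADDING, now)
    return start, end
```

The same `(start, end)` pair can be reused for the log and trace queries in later steps so that all signals line up on one timeline.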

5. Analyze Logs

Search logs for the affected service and environment:

Time window: same as Step 4
What to look for: errors, stack traces, timeouts, connection failures, unusual patterns
Filter by severity: focus on ERROR and WARN levels first, then broaden if needed
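The "ERROR and WARN first, then broaden" approach can be sketched like this. The record shape (dicts with `level` and `message` keys) and the `min_hits` threshold are hypothetical; real log records returned by a monitoring MCP will have their own field names.

```python
from collections import Counter

PRIORITY_LEVELS = ("ERROR", "WARN")


def triage_logs(records, min_hits=1):
    """Group log messages by frequency, preferring ERROR/WARN records.

    records: list of dicts with "level" and "message" keys (assumed shape).
    Falls back to all records when fewer than min_hits priority records exist.
    """
    focused = [r for r in records if r.get("level", "").upper() in PRIORITY_LEVELS]
    selected = focused if len(focused) >= min_hits else list(records)
    counts = Counter(r["message"] for r in selected)
    return counts.most_common()  # most frequent message first
```

Counting by message gives the "N errors of type X" figures that the summary's Logs section asks for.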

6. Check Traces (if available)

Search for distributed traces:

Filter by service name and time window
What to look for: slow spans, error spans, unusual latency distribution, failing downstream calls
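As a rough illustration of what to pull out of trace data, the sketch below counts slow and failing spans and lists the services that produced errors. The span fields (`service`, `duration_ms`, `error`) and the 1000 ms cutoff are assumptions chosen for the example, not values from any tracing platform.

```python
def summarize_spans(spans, slow_ms=1000.0):
    """Summarize spans: count slow and error spans, list failing services.

    spans: list of dicts with "service", "duration_ms", and optional
    "error" keys (assumed shape). slow_ms is an illustrative cutoff.
    """
    slow = [s for s in spans if s["duration_ms"] >= slow_ms]
    errors = [s for s in spans if s.get("error")]
    return {
        "slow": len(slow),
        "errors": len(errors),
        "failing_services": sorted({s["service"] for s in errors}),
    }
```

A failing service that differs from the alerting service points toward a downstream dependency as the place to look next.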

7. Check Infrastructure (if available)

If Kubernetes MCP or cloud CLI is available:

Pod status, restart counts, OOM kills
Resource usage (CPU, memory) near the alert time
Recent deployment events

If these tools are unreachable (for example, because of VPN or permission restrictions), note that and continue with the data you have.

8. Check Recent Code Changes (if `gh` available)

gh auth status

If authenticated:

Identify the repo from the service name

Check recent releases/tags to see what's currently deployed:

gh api repos/<org>/<service>/tags --jq '.[0:3] | .[] | {name: .name, sha: .commit.sha}'

Diff the two most recent tags to see what changed in the latest release:

gh api repos/<org>/<service>/compare/<prev-tag>...<latest-tag> --jq '.commits[] | {sha: .sha[:7], message: .commit.message, author: .commit.author.name}'

Look at relevant code based on the error type:
- HTTP errors → route handlers, middleware
- DB errors → query code, connection pooling
- Timeout errors → external call clients, timeout configs
- OOM → memory-heavy operations, unbounded collections

NEVER create, push, or modify tags.
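If these lookups are scripted rather than typed, building the commands as argument lists avoids shell-quoting problems with the `--jq` expressions. The sketch below only constructs the two read-only `gh api` calls shown above; the function names are hypothetical, and it assumes, as this step does, that the repository name matches the service name.

```python
def tags_command(org, service, limit=3):
    """Build the read-only gh call that lists the first `limit` tags."""
    return [
        "gh", "api", f"repos/{org}/{service}/tags",
        "--jq", f".[0:{limit}] | .[] | {{name: .name, sha: .commit.sha}}",
    ]


def compare_command(org, service, prev_tag, latest_tag):
    """Build the read-only gh call that lists commits between two tags."""
    return [
        "gh", "api",
        f"repos/{org}/{service}/compare/{prev_tag}...{latest_tag}",
        "--jq",
        ".commits[] | {sha: .sha[:7], message: .commit.message, "
        "author: .commit.author.name}",
    ]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Neither command passes request fields or a method flag, so both stay plain GET requests, in keeping with the rule against modifying tags.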

9. Present Investigation Summary

## Alert Investigation: <Alert Name>

**Status:** <OK / Alert / Warn / No Data>
**Service:** <service> | **Env:** <env>
**Triggered:** <timestamp> | **Duration:** <duration or "Ongoing">

### Metrics
<key observations — spike at X time, value Y vs threshold Z>

### Logs
<key log lines or patterns — N errors of type X, stack trace summary>

### Traces
<latency or error observations — if available>

### Infrastructure
<pod status, resource usage — if available>

### Recent Code Changes
<commits near trigger time, or "No recent changes" or "gh CLI not available">

### Root Cause Hypothesis
<best assessment based on available data — be explicit about confidence level>

### Recommended Next Steps
1. <most impactful action>
2. <secondary action>
3. <what to check if hypothesis is wrong>

If data is inconclusive, say so explicitly and suggest what to check manually (e.g., VPN access to k8s, direct DB query, checking with the team).
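One way to guarantee that every section of the template appears, even when a step produced nothing, is to render it from a fixed section order. This is a minimal sketch: the `render_summary` name, its parameters, and the "No data available" placeholder are assumptions, while the headings and field labels follow the template above.

```python
SECTION_ORDER = (
    "Metrics",
    "Logs",
    "Traces",
    "Infrastructure",
    "Recent Code Changes",
    "Root Cause Hypothesis",
    "Recommended Next Steps",
)


def render_summary(alert_name, status, service, env, triggered,
                   findings, duration="Ongoing"):
    """Render the investigation summary as markdown text.

    findings: dict mapping section title to its text. Missing or empty
    sections get an explicit placeholder instead of being dropped.
    """
    lines = [
        f"## Alert Investigation: {alert_name}",
        "",
        f"**Status:** {status}",
        f"**Service:** {service} | **Env:** {env}",
        f"**Triggered:** {triggered} | **Duration:** {duration}",
    ]
    for title in SECTION_ORDER:
        body = findings.get(title) or "No data available"
        lines += ["", f"### {title}", body]
    return "\n".join(lines)
```

Rendering from a fixed order keeps the summary structured in the inconclusive case too, which matches the rule that it is always presented at the end.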

Interaction Style

Lead with data, not guesses — show metrics/logs/traces before forming a hypothesis
Be explicit about confidence: "high confidence", "likely", "inconclusive"
If a step yields no data, say so and move on — don't speculate

Rules

Never skip Steps 3-5 (monitor, metrics, logs) — these are the core investigation
Steps 6-8 (traces, infra, code) are optional based on tool availability
Never create, push, or modify tags or deployments during investigation
Always present the structured summary at the end, even if inconclusive

Output

Present the investigation summary inline in the conversation. No file output unless the user asks to save it.
