From ai-toolkit
Investigates monitoring alerts end-to-end by pulling metrics, logs, traces, and recent code changes to identify root causes. For on-call engineers handling alerts via Datadog, Grafana, or PagerDuty MCPs.
Slash command
/ai-toolkit:ops-investigate-alert
Investigate a monitoring alert by pulling metrics, logs, traces, and related service code. Symptoms in, root cause hypothesis out.
Check which monitoring MCP servers are available. Look for any mcp__* tools related to monitoring platforms (Datadog, Grafana, PagerDuty, etc.).
Recommended: Datadog MCP — provides the richest investigation surface (monitors, metrics, logs, traces, events in one platform).
If no monitoring MCP is available, stop with:
Error: No monitoring MCP server found. This skill requires a monitoring MCP to query alert data. Recommended: add the Datadog MCP to your Claude Code MCP settings.
Also check for optional tools:
GitHub CLI (gh): for reading related service code and recent deploys
Note which are available and adapt the investigation accordingly.
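The CLI part of this check can be sketched in a few lines of shell. This is only an illustration: `have_tool` is a hypothetical helper, and `kubectl` is an assumed stand-in for whichever cloud CLI applies. MCP tools are not visible from the shell, so this covers command-line tools only.

```shell
# Sketch: report which optional CLIs are on PATH.
# kubectl is an assumed example of a cloud CLI; swap in whatever you use.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in gh kubectl; do
  if have_tool "$tool"; then
    echo "$tool: available"
  else
    echo "$tool: not available, skipping steps that need it"
  fi
done
```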
If the input is a monitoring platform URL:
If the input is an alert name or description:
Retrieve the monitor configuration and current state:
Fetch the metric(s) that triggered the alert:
Search logs for the affected service and environment:
Search for distributed traces:
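Once the triggering metric has been fetched, the "value vs threshold" observation for the summary could be derived roughly as follows. This is a sketch under assumptions: the input format (`<timestamp> <value>` per line), the function name `summarize_breach`, and the "alert fires above the threshold" direction are all hypothetical, not something the monitoring MCP provides in this shape.

```shell
# Sketch: find the first sample above a threshold and the peak value.
# Input: lines of "<timestamp> <value>" on stdin (hypothetical export format).
# Assumes the alert fires when the value rises above the threshold.
summarize_breach() {
  awk -v threshold="$1" '
    # Remember the first sample that exceeds the threshold.
    $2 + 0 > threshold + 0 && first == "" { first = $1; first_val = $2 }
    # Track the highest value seen so far.
    NR == 1 || $2 + 0 > peak + 0 { peak = $2; peak_ts = $1 }
    END {
      if (first == "")
        print "no breach of threshold " threshold
      else
        print "first breach at " first " (value " first_val " vs threshold " threshold "), peak " peak " at " peak_ts
    }'
}
```

For example, `printf '10:00 40\n10:01 95\n10:02 120\n' | summarize_breach 90` reports the first breach at 10:01 and the peak at 10:02.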
If Kubernetes MCP or cloud CLI is available:
If not available (VPN, permissions), note it and continue with available data.
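The "note it and continue" behavior might look like the sketch below when a cloud CLI is used. It assumes `kubectl` is the CLI in question and that the namespace is known; `infra_snapshot` is a hypothetical helper name. The command is read-only.

```shell
# Sketch: collect pod status if possible, otherwise record why not and move on.
# Assumes kubectl is the cloud CLI in use; the namespace argument is a placeholder.
infra_snapshot() {
  ns="$1"
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "Infrastructure: kubectl not available"
    return 0
  fi
  # Short timeout so an unreachable cluster (VPN down) does not stall the investigation.
  if ! kubectl get pods -n "$ns" --request-timeout=10s 2>/dev/null; then
    echo "Infrastructure: could not reach the cluster (VPN or permissions?), continuing with available data"
  fi
}
```

The function always returns success, so a missing tool or an unreachable cluster becomes a line in the summary instead of aborting the investigation.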
Check recent code changes (if gh is available):
gh auth status
If authenticated:
gh api repos/<org>/<service>/tags --jq '.[0:3] | .[] | {name: .name, sha: .commit.sha}'
gh api repos/<org>/<service>/compare/<prev-tag>...<latest-tag> --jq '.commits[] | {sha: .sha[:7], message: .commit.message, author: .commit.author.name}'
NEVER create, push, or modify tags.
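The two commands above can be chained so the tag names do not have to be copied by hand. The following is a minimal sketch, not the skill's own implementation: `recent_changes` is a hypothetical helper, and it assumes (as the commands above do) that the tags endpoint lists the newest tag first. It only issues read requests.

```shell
# Sketch: list commits between the two most recent tags of a service repo.
# Assumes the tags endpoint returns the newest tag first. Read-only: no tags are touched.
recent_changes() {
  rc_org="$1"
  rc_service="$2"
  rc_tags=$(gh api "repos/$rc_org/$rc_service/tags" --jq '.[0:2] | .[].name') || return 1
  rc_latest=$(printf '%s\n' "$rc_tags" | sed -n '1p')
  rc_prev=$(printf '%s\n' "$rc_tags" | sed -n '2p')
  if [ -z "$rc_latest" ] || [ -z "$rc_prev" ]; then
    echo "Fewer than two tags found; nothing to compare"
    return 0
  fi
  # One line per commit: short SHA, author, first line of the message.
  gh api "repos/$rc_org/$rc_service/compare/$rc_prev...$rc_latest" \
    --jq '.commits[] | "\(.sha[:7]) \(.commit.author.name): \(.commit.message | split("\n")[0])"'
}
```

Usage would be `recent_changes <org> <service>`, with the same placeholders as above. If tags are not created in release order, the newest-first assumption may not hold and the tag pair should be chosen manually.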
## Alert Investigation: <Alert Name>
**Status:** <OK / Alert / Warn / No Data>
**Service:** <service> | **Env:** <env>
**Triggered:** <timestamp> | **Duration:** <duration or "Ongoing">
### Metrics
<key observations — spike at X time, value Y vs threshold Z>
### Logs
<key log lines or patterns — N errors of type X, stack trace summary>
### Traces
<latency or error observations — if available>
### Infrastructure
<pod status, resource usage — if available>
### Recent Code Changes
<commits near trigger time, or "No recent changes" or "gh CLI not available">
### Root Cause Hypothesis
<best assessment based on available data — be explicit about confidence level>
### Recommended Next Steps
1. <most impactful action>
2. <secondary action>
3. <what to check if hypothesis is wrong>
If data is inconclusive, say so explicitly and suggest what to check manually (e.g., VPN access to k8s, direct DB query, checking with the team).
Present the investigation summary inline in the conversation. No file output unless the user asks to save it.
npx claudepluginhub c0x12c/ai-toolkit --plugin ai-toolkit