Skill

dd-monitors

From pup

Manages Datadog monitors: create, update, mute/unmute, with best practices for alerting, scoping, recovery thresholds, and safe deletion workflows.

Datadog

monitoring

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/pup:dd-monitors

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Create, manage, and maintain monitors for alerting.

SKILL.md

198 lines · ~1.1k tokens

Stats

Language: Rust

Stars: 911

Forks: 85

Maintenance: Excellent

Last Commit: Jun 18, 2026

Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

Requires the `pup` binary on your PATH.

Install it with: cargo install --git https://github.com/DataDog/pup

Quick Start

pup auth login

Common Operations

List Monitors

pup monitors list
pup monitors list --tags "team:platform"
pup monitors search --query "status:Alert"

Get Monitor

pup monitors get <id>

Create Monitor

pup monitors create --file monitor.json
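
The command above expects a monitor definition file whose contents are not shown here. A minimal sketch of generating one, assuming `pup` accepts the same JSON fields as the Datadog Monitors API (`name`, `type`, `query`, `message`, `tags`, `options`). The monitor name, tags, and notification handle are placeholders, and `render_monitor_json` is a hypothetical helper:

```python
import json

# Hypothetical monitor definition. Field names follow the Datadog Monitors API;
# whether pup expects exactly this shape is an assumption.
monitor = {
    "name": "High CPU on prod API hosts",
    "type": "query alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
    "message": "CPU above 80% on {{host.name}}. @slack-ops",
    "tags": ["team:platform", "env:prod"],
    "options": {
        "thresholds": {"critical": 80, "critical_recovery": 70},
    },
}


def render_monitor_json(definition: dict) -> str:
    """Serialize a monitor definition to pretty-printed JSON."""
    return json.dumps(definition, indent=2)


if __name__ == "__main__":
    # Write the file that `pup monitors create --file monitor.json` would read.
    with open("monitor.json", "w", encoding="utf-8") as fh:
        fh.write(render_monitor_json(monitor))
```

Keeping the definition in a file also makes it easy to review and version alongside other configuration.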

Mute/Unmute

# Mute for a fixed duration (the mute settings are defined in the JSON file)
pup monitors update 12345 --file monitor-muted.json

# Or mute until a specific end time
pup monitors update 12345 --file monitor-muted-until.json

# Unmute
pup monitors update 12345 --file monitor-unmuted.json

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

Rule	Why
No flapping alerts	Use a multi-minute evaluation window such as `last_5m`, not `last_1m`
Meaningful thresholds	Base them on SLOs, not guesses
Actionable alerts	If no action is needed, don't alert
Include runbook	`@runbook-url` in the message

# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
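
To see why the longer window helps, here is a small simulation on made-up per-minute CPU samples. It is only a sketch: `rolling_mean` approximates what `avg(last_Nm)` does, the data is synthetic, and real monitor evaluation differs in detail.

```python
def rolling_mean(values, window):
    """Mean of the last `window` samples at each position (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def count_state_changes(series, threshold):
    """Count how often the series crosses the threshold in either direction."""
    states = [v > threshold for v in series]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Synthetic data: 60% baseline, three isolated one-minute spikes,
# and one sustained ten-minute period of real load.
samples = [60] * 60
for minute in (5, 15, 25):
    samples[minute] = 95
for minute in range(40, 50):
    samples[minute] = 95

flaps_1m = count_state_changes(rolling_mean(samples, 1), 80)
flaps_5m = count_state_changes(rolling_mean(samples, 5), 80)
print(f"last_1m: {flaps_1m} state changes, last_5m: {flaps_5m} state changes")
# prints "last_1m: 8 state changes, last_5m: 2 state changes"
```

The one-minute window reacts to every isolated spike, while the five-minute window changes state only for the sustained load.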

2. Use Proper Scoping

# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
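
One way to keep unscoped queries from slipping in is to build the query string through a helper that refuses an empty or wildcard scope. The helper name and its parameters are hypothetical; only the resulting query syntax comes from the examples above.

```python
def build_metric_query(metric, scope_tags, threshold,
                       window="last_5m", group_by=("host",)):
    """Build an avg-over-window metric query, rejecting unscoped ({*}) queries."""
    if not scope_tags or "*" in scope_tags:
        raise ValueError("refusing to build an unscoped query ({*})")
    scope = "{" + ",".join(scope_tags) + "}"
    by = (" by {" + ",".join(group_by) + "}") if group_by else ""
    return f"avg({window}):avg:{metric}{scope}{by} > {threshold}"


print(build_metric_query("system.cpu.user", ["env:prod", "service:api"], 80))
# prints "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"
```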

3. Set Recovery Thresholds

monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
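
The effect of a recovery threshold can be illustrated with a simplified state machine. This is a sketch on made-up values, not Datadog's actual evaluation logic: it assumes a `>` monitor that recovers once the value is at or below the recovery threshold.

```python
def count_transitions(values, critical, critical_recovery=None):
    """Count ALERT<->OK transitions for a simplified `>` monitor.

    Without a recovery threshold the monitor recovers as soon as the value is
    no longer above `critical`; with one, the value must fall to
    `critical_recovery` or below before the monitor recovers.
    """
    recover_at = critical if critical_recovery is None else critical_recovery
    alerting = False
    transitions = 0
    for v in values:
        if not alerting and v > critical:
            alerting = True
            transitions += 1
        elif alerting and v <= recover_at:
            alerting = False
            transitions += 1
    return transitions


# Synthetic evaluated values hovering around the 80% critical threshold.
values = [78, 82, 79, 83, 78, 84, 77, 81, 69, 75]

print(count_transitions(values, critical=80))                        # prints 8
print(count_transitions(values, critical=80, critical_recovery=70))  # prints 2
```

With the gap between 80 and 70, values that hover just around the critical threshold produce one alert and one recovery instead of a notification on every crossing.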

4. Include Context in Messages

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""

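A message template like the one above can be checked before the monitor is created. The following is a hypothetical lint helper, not part of `pup` or the Datadog API; the three checks are just the points this section recommends (a notification handle, a runbook, and the current value).

```python
import re


def lint_monitor_message(message: str) -> list[str]:
    """Return a list of problems found in a monitor message (empty if none)."""
    problems = []
    # A handle is an @ at the start of a word, e.g. @slack-ops.
    if not re.search(r"(?<!\S)@[\w.-]+", message):
        problems.append("no @notification handle")
    if "runbook" not in message.lower():
        problems.append("no runbook section or link")
    if "{{value}}" not in message:
        problems.append("current value ({{value}}) not included")
    return problems


good = "## High CPU\nCurrent Value: {{value}}\n### Runbook\n1. Check top processes\n@slack-ops"
print(lint_monitor_message(good))           # prints []
print(lint_monitor_message("CPU is high"))  # lists all three problems
```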
⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (the same one used for dashboards): rename the monitor to mark it for deletion instead of deleting it outright.

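def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Rename a monitor to flag it for deletion instead of deleting it.

    `client` is any object exposing get_monitor(id) -> dict and
    update_monitor(id, fields). Returns True if the monitor was newly marked.
    """
    marker = "[MARKED FOR DELETION]"
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    # Idempotent: skip monitors that already carry the marker.
    if marker in name:
        print(f"Already marked: {name}")
        return False

    new_name = f"{marker} {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True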
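
Marked monitors still need a human review before anything is removed. A minimal sketch of collecting them from an already-fetched monitor list, assuming each monitor is a dict with `id` and `name` keys; the helper name and the sample data are made up for illustration.

```python
MARKER = "[MARKED FOR DELETION]"


def marked_for_deletion(monitors):
    """Return (id, name) pairs for monitors whose name carries the marker."""
    return [
        (m["id"], m.get("name", ""))
        for m in monitors
        if MARKER in m.get("name", "")
    ]


# Made-up sample data standing in for a fetched monitor list.
sample = [
    {"id": 101, "name": "[MARKED FOR DELETION] Old disk alert"},
    {"id": 102, "name": "High CPU on prod API hosts"},
]

for monitor_id, name in marked_for_deletion(sample):
    print(monitor_id, name)
# prints "101 [MARKED FOR DELETION] Old disk alert"
```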

Monitor Types

Type	Use Case
`metric alert`	CPU, memory, custom metrics
`query alert`	Complex metric queries
`service check`	Agent check status
`event alert`	Event stream patterns
`log alert`	Log pattern matching
`composite`	Combine multiple monitors
`apm`	APM metrics

Audit Monitors

# Find monitors without owners
pup monitors list | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find monitors whose state changed most recently (a rough proxy for noisy monitors)
pup monitors list | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
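
The ownerless-monitor audit can also be done in Python when the result feeds into further tooling. This sketch assumes the monitor list has already been parsed from JSON into dicts with `id`, `name`, and `tags` keys; the function name and sample data are hypothetical.

```python
import json


def monitors_without_team(monitors):
    """Return {id, name} for every monitor that has no tag starting with 'team:'."""
    return [
        {"id": m["id"], "name": m.get("name", "")}
        for m in monitors
        if not any(tag.startswith("team:") for tag in (m.get("tags") or []))
    ]


# Made-up sample standing in for the JSON output of a monitor listing.
sample = json.loads("""
[
  {"id": 1, "name": "High CPU", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "Disk full", "tags": ["env:prod"]},
  {"id": 3, "name": "Queue depth", "tags": []}
]
""")

print(monitors_without_team(sample))
# prints [{'id': 2, 'name': 'Disk full'}, {'id': 3, 'name': 'Queue depth'}]
```

Unlike the jq `contains` filter, which matches `team:` anywhere inside a tag, this version matches only tags that begin with `team:`.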

Downtime vs Muting

Use	When
Mute monitor	Quick one-off, < 1 hour
Downtime	Scheduled maintenance, recurring

# Downtime (preferred)
pup downtime create --file downtime.json

Failure Handling

Problem	Fix
Alert not firing	Check that the query returns data and that the thresholds can actually be crossed
Too many alerts	Increase the evaluation window, add a recovery threshold
No data alerts	Check agent connectivity and that the metric exists
Auth error	`pup auth refresh`
