From trogonstack-datadog
Design new Datadog dashboards, redesign existing ones, or audit dashboards for operational readiness. Covers widget selection, layout organization, template variables, group structure, tab organization, alert threshold validation, and zero-knowledge readability. Uses pup CLI for inspecting dashboards and validating designs. Use when designing new dashboards, redesigning existing ones, auditing before on-call handoff, or reviewing after dashboard changes. Do not use for: (1) Datadog agent installation or configuration, (2) monitor/alert rule design, (3) APM instrumentation or tracing setup, (4) log pipeline configuration.
How this skill is triggered — by the user, by Claude, or both
Slash command
/trogonstack-datadog:datadog-design-dashboardThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Design a dashboard layout that tells a clear story — from high-level health signals down to granular diagnostics — using proper widget types, group organization, and template variables for reusability.
Design a dashboard layout that tells a clear story — from high-level health signals down to granular diagnostics — using proper widget types, group organization, and template variables for reusability.
Important: Always check for existing dashboards first with pup dashboards list --agent. Do not create a new dashboard if one already exists for the same service or purpose — update the existing one instead. Only create a new dashboard when no relevant one exists or the user explicitly asks for a new one.
Philosophy: The frameworks, layouts, and widget guides in this skill are starting points — not rigid rules. Every product and business is different. Understand the domain first, then adapt the frameworks to fit. The best dashboards reflect how the business actually works, not how a generic template says they should.
First, determine the mode:
Skip if ALL of these are already specified: dashboard purpose, target audience, data sources, template variable needs, dashboard strategy.
pup dashboards get <id> --agent before designing.Skip if ALL of these are already specified: dashboard ID or URL, service name or team context.
Always interview if: No dashboard ID is provided or multiple dashboards may be relevant.
Applies to design mode. Skip if auditing only.
Before designing, understand what you are building observability for. The metrics that matter depend entirely on the product and business context.
Ask the user:
If the user points you to a codebase: Read it. Look at the entry points, the API routes, the database models, the queue consumers, the external service calls. Understanding the code gives you the context to choose metrics that actually matter — not just generic RED/USE signals.
If the user describes the business: Use that context to tailor the Business (B) section. An e-commerce service cares about checkout success rates. A messaging service cares about delivery latency. A payment service cares about transaction completion. Generic "request rate" and "error rate" are a starting point, but the real value comes from metrics that map to customer-visible outcomes.
Skip domain discovery if: You already have deep context about the service from prior conversations or the user has provided detailed specifications.
Gate: Before designing the Business group, you must be able to name at least 3 domain outcomes specific to this service — in plain language a product manager would recognize. Examples: "order placed", "payment completed", "message delivered". If you cannot name them, ask the user before proceeding. Do not substitute transport-layer metrics (gRPC error rate, HTTP request rate) as placeholders — those are P, not B. See the B trap in references/widgets.md.
Skip to Audit if the user only wants to review an existing dashboard.
pup dashboards list --agent
pup dashboards get <dashboard-id> --agent
If auditing an existing dashboard, fetch its definition first and analyze its current structure before redesigning.
Before designing widgets, check what metrics and tag values actually exist for the service. This prevents designing around metrics that don't exist or using the wrong tag values in queries.
# See what metrics are available for the service
pup metrics list --filter="<service-name>.*" --agent
# Verify the service tag is active and see what metrics are flowing
pup metrics list --filter="trace.*" --tag-filter="service:<service-name>" --agent
Use the actual metric names and tag values you find here when writing widget queries — do not guess or invent them. If a metric you expect does not appear, flag it to the user before building widgets around it.
This applies to all query types: metric queries, APM span filters (operation_name, resource_name, span tags), and log filters. The **Configuration** sections in references/widgets.md describe JSON structure and field constraints only — they are not prescriptive queries. Always verify the actual filter values with pup before using them.
Match the dashboard purpose to a framework. Read references/frameworks.md for detailed metric mappings and group structures.
| Purpose | Framework |
|---|---|
| Service overview | RED (Rate, Errors, Duration) |
| Infrastructure | USE (Utilization, Saturation, Errors) |
| Executive/business | Golden Signals |
| SLO tracking | SLI/SLO |
| Debugging | Drill-down |
Using your domain understanding and the chosen framework, design the group structure and select widgets. Read these references before designing:
If the dashboard has 7+ top-level groups, evaluate whether tabs would reduce scroll fatigue. Read references/tabs.md for the organization pattern, JSON schema, and common mistakes.
Before adding tabs to an existing dashboard, audit all widget IDs for duplicates — tabs trigger strict uniqueness validation.
Present the design using this template:
# Dashboard Design: [Dashboard Title]
## Purpose
[1-2 sentences: what this monitors, who uses it]
## Template Variables
| Variable | Tag | Default |
|----------|-----|---------|
| ... | ... | `*` |
## Layout
### Group: [Group Title]
| Widget | Type | Query/Metric | Width | Alert Threshold |
|--------|------|-------------|-------|----------------|
| ... | ... | ... | ... | ... |
[Repeat for each group]
Applies to both modes. Run after design, or directly if auditing an existing dashboard.
The core principles are: graphs should earn their place with alert thresholds, thresholds should sit close to normal traffic, a business section should exist at the top, and the dashboard should be readable by someone with zero service knowledge.
These are guiding principles — not a rigid checklist. Apply judgment based on the product and business context. A context-providing metric (like deployment events) may earn its place without a threshold. A service with unusual traffic patterns may need different proximity rules.
# If given a service name, list all dashboards and identify the relevant one by title
pup dashboards list --agent
# If given a URL, extract the dashboard ID from the path (e.g., /dashboard/abc-def-ghi/...)
# Get the full dashboard definition (includes the dashboard URL in the response)
pup dashboards get <dashboard-id> --agent
# Verify real metric names exist
pup metrics list --filter="trace.http.request.*" --agent
Parse the response to build an inventory of all widgets, groups, and their configurations.
Read references/widgets.md for the full widget prefix system before cataloging.
Catalog every widget in the dashboard:
| Widget Title | Prefix | Type | Group | Has Alert Threshold | Threshold Value | Notes |
|---|---|---|---|---|---|---|
| ... | I0/P1/D0/B0/— | ... | ... | ... | ... | ... |
Focus on timeseries and query value widgets — these are the primary candidates for alert threshold markers.
Read references/thresholds.md for threshold marker principles, configuration details, and findings format.
For each timeseries widget, check:
Read references/thresholds.md for proximity guidance, Y-axis configuration rules, and findings format.
For each widget with a threshold, check:
Principle: A dedicated Business (B) group should exist at the top of the dashboard with 5-8 key metrics for immediate outage identification. Business metrics are customer-visible outcomes — not infrastructure or domain internals. The specific metrics should reflect the product's business transactions, not generic traffic and error rates.
Check:
B0-N: prefix?P, not B, regardless of where it is placed. Flag and recommend moving to the appropriate platform group.Findings format:
#### Business Section Audit
**Status**: MISSING / INCOMPLETE / OK
**Current state**: [Description of what exists]
**Recommended metrics** (if missing or incomplete):
1. B0: Key transaction success rate (are critical flows completing?)
2. B0: Customer-facing error rate (are requests failing for customers?)
3. B1: API p99 latency (are responses slow for customers?)
4. B1: Total request rate (are we receiving traffic?)
5. B2: Queue depth or processing lag (is async work backing up?)
6. B2: Key business event throughput (e.g. orders created, payments processed)
Principle: Someone with zero knowledge of the service should be able to spot problems by looking for red indicators.
Evaluate:
Findings format:
#### Zero-Knowledge Readability Audit
| Check | Status | Finding |
|-------|--------|---------|
| Problems visible in <10s | FAIL | No red lines on 8 of 12 graphs |
| Conditional formatting on QV widgets | PARTIAL | 2 of 4 QV widgets have thresholds |
| Group names self-explanatory | OK | All groups use clear names |
| Runbook/ownership note | MISSING | No note widget with team info |
If the dashboard has 7+ top-level groups, check whether tabs are in use. See references/tabs.md.
Compile all findings into a structured report:
# Dashboard Audit: [Dashboard Title]
**Dashboard ID**: [id]
**URL**: [url]
**Review date**: [date]
## Summary
[2-3 sentence summary: overall health of the dashboard, critical issues count]
## Critical Issues
[List issues that must be fixed before the dashboard is production-ready]
## Alert Threshold Audit
[From step 3]
## Threshold Proximity Audit
[From step 4]
## Business Section Audit
[From step 5]
## Zero-Knowledge Readability Audit
[From step 6]
## Recommended Actions
### Must Fix
1. [Action item with specific widget and group reference]
### Should Fix
1. [Action item]
### Nice to Have
1. [Action item]
pup metrics list --agent — no invented metric names[service] Purpose pattern — no "Dashboard" suffix, no environment in the titletitle field updated in the JSON (not just the filename) — redeploy after any title changepup metrics list --agent — no hardcoded env, service, or host values; use the variable set appropriate for the dashboard type (see references/layouts.md)B-prefixed metrics tailored to the service's customer-visible outcomesI0:, P1:, D0:, B0:, etc.) — see references/widgets.md@N positional references, common mistakesSearches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.
npx claudepluginhub trogonstack/agentskills --plugin trogonstack-datadog