From enterprise-harness-engineering
Queries Prometheus metrics and alert rules via HTTP API for CPU/memory/disk utilization, service health, capacity trends, and target checks using PromQL.
Slash command: `/enterprise-harness-engineering:prometheus`
Query monitoring metrics, check alerts, and verify target health via the Prometheus HTTP API. API and PromQL syntax are referenced through Context7 MCP; only environment-specific rules are documented here.
Configure your Prometheus endpoint before using this skill:
| Variable | Description | Required |
|---|---|---|
| `PROMETHEUS_URL` | Your Prometheus server URL (e.g. `http://prometheus.internal:9090`) | Yes |
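A minimal sketch of checking the endpoint configuration before running any query, assuming a POSIX shell. The helper name `prom_base_url` is hypothetical and not part of this skill; `/-/healthy` is Prometheus's built-in health endpoint.

```shell
# Hypothetical helper: print PROMETHEUS_URL without a trailing slash,
# or fail with a message if the variable is unset or empty.
prom_base_url() {
  if [ -z "${PROMETHEUS_URL:-}" ]; then
    echo "PROMETHEUS_URL is not set" >&2
    return 1
  fi
  printf '%s\n' "${PROMETHEUS_URL%/}"
}

# Usage (requires network access to your server):
#   curl -sf "$(prom_base_url)/-/healthy"
```

Stripping the trailing slash keeps later paths such as `/api/v1/query` from producing a double slash.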
Common metric prefixes to monitor:
- `node_*`: Node Exporter (host metrics: CPU, memory, disk, network)
- `kube_*`: kube-state-metrics (K8s object state: deployments, pods, nodes)
- `container_*`: cAdvisor (container resource usage)
- `apiserver_*`: K8s API Server metrics
- `kubelet_*`: Kubelet metrics
- `prometheus_*`: Prometheus self-monitoring

If you have additional exporters (Kafka, Redis, custom applications), add their metric prefixes here:
| Prefix | Source | Description |
|---|---|---|
| `kafka_*` | Kafka Exporter | Broker and consumer group metrics |
| `fluentbit_*` | Fluent Bit | Log pipeline metrics |
| (add your own) | | |
Authentication: Configure as needed for your environment (none, basic auth, or bearer token).
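The three authentication modes could be handled by one wrapper around `curl`. This is a sketch under assumptions: the function name `prom_curl` and the environment variables `PROMETHEUS_BEARER_TOKEN`, `PROMETHEUS_USER`, and `PROMETHEUS_PASSWORD` are hypothetical names chosen for illustration, not settings defined by this skill.

```shell
# Hypothetical wrapper: pick curl auth options from the environment.
# Bearer token wins over basic auth; with neither set, no auth is sent.
prom_curl() {
  if [ -n "${PROMETHEUS_BEARER_TOKEN:-}" ]; then
    curl -s -H "Authorization: Bearer $PROMETHEUS_BEARER_TOKEN" "$@"
  elif [ -n "${PROMETHEUS_USER:-}" ]; then
    curl -s -u "$PROMETHEUS_USER:${PROMETHEUS_PASSWORD:-}" "$@"
  else
    curl -s "$@"
  fi
}

# Usage (requires a reachable server):
#   prom_curl -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=up'
```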
API endpoints and PromQL syntax can be found in the official Prometheus documentation.
- `PROMETHEUS_URL` accordingly
- `step` should not be smaller than the scrape interval (typically 15s-60s) to avoid invalid interpolation
- `rate()` / `sum by()` aggregations
- On macOS, use `date -v-1H +%s` instead of the Linux `date -d '1 hour ago' +%s`

Job labels are the key to locating services. Common naming patterns:
| Pattern | Example | Description |
|---|---|---|
| `{env}-{region}-{service}` | `prod-us-gateway` | Service by environment and region |
| `kubernetes-{resource}` | `kubernetes-pods` | Standard K8s metrics |
| `{component}-exporter` | `kafka-exporter` | Dedicated exporters |
Configure your own job naming convention here to help the agent locate services correctly.
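The step, timestamp, and job-label rules above can be combined in one range query. The following is a minimal sketch assuming a POSIX shell; the helper names `clamp_step`, `epoch_ago`, and `job_up_query` are hypothetical, and the job name in the usage comment is taken from the example table, so substitute your own.

```shell
# Clamp a requested step (seconds) so it is never below the scrape interval.
clamp_step() {
  requested=$1
  scrape_interval=$2
  if [ "$requested" -lt "$scrape_interval" ]; then
    echo "$scrape_interval"
  else
    echo "$requested"
  fi
}

# Portable "N seconds ago" epoch: plain arithmetic avoids the
# GNU `date -d` vs BSD/macOS `date -v` difference.
epoch_ago() {
  echo $(( $(date +%s) - $1 ))
}

# Build a selector on the standard `up` metric for one job label.
job_up_query() {
  printf 'up{job="%s"}\n' "$1"
}

# Usage (requires a reachable server):
#   curl -s -G "$PROMETHEUS_URL/api/v1/query_range" \
#     --data-urlencode "query=$(job_up_query kubernetes-pods)" \
#     --data-urlencode "start=$(epoch_ago 3600)" \
#     --data-urlencode "end=$(date +%s)" \
#     --data-urlencode "step=$(clamp_step 5 15)"
```

Passing parameters through `-G --data-urlencode` lets curl handle the URL encoding of braces, quotes, and spaces in the PromQL expression.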
If you run Kafka with a Kafka Exporter, this is a common pattern:
# Aggregate consumer lag by consumergroup and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)
Normal lag range depends on your workload. Sustained growth indicates consumer processing capacity issues.
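One way to check for sustained growth is to look at the change in lag over a window rather than its current value. This sketch assumes a POSIX shell; the helper name `lag_growth_query` and the default 30m window are illustrative choices, not values prescribed by this skill.

```shell
# Build a PromQL expression for the change in consumer lag over a window.
# delta() is used because lag is a gauge; a positive result means lag grew.
lag_growth_query() {
  window=${1:-30m}
  printf 'sum by (consumergroup, topic) (delta(kafka_consumergroup_lag[%s]))\n' "$window"
}

# Usage (requires a reachable server): keep only groups whose lag grew.
#   curl -s -G "$PROMETHEUS_URL/api/v1/query" \
#     --data-urlencode "query=$(lag_growth_query 1h) > 0"
```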
- `node_cpu_seconds_total` -> `node_memory_MemAvailable_bytes` -> `node_filesystem_avail_bytes` -> locate high-load nodes
- `kafka_brokers` (broker count) -> `kafka_consumergroup_lag` (consumer lag) -> `kafka_topic_partition_under_replicated_partition` (under-replicated partitions)
- `container_cpu_usage_seconds_total` -> `container_memory_working_set_bytes` -> aggregate by pod/namespace
- `kube_node_status_condition` -> `kube_pod_status_phase` -> `kube_deployment_status_replicas_unavailable`

# High-cardinality label aggregation -- can cause Prometheus OOM on large clusters
curl -s -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'
# pod label cardinality is too high (hundreds of pods); aggregate by namespace or deployment instead
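Before grouping by a label, it can help to count how many distinct values it has. The sketch below assumes a POSIX shell; the helper names `label_cardinality_query` and `cpu_by_namespace_query` are hypothetical.

```shell
# Count how many distinct values a label has on a metric, so you can
# judge cardinality before grouping by that label.
label_cardinality_query() {
  metric=$1
  label=$2
  printf 'count(count by (%s) (%s))\n' "$label" "$metric"
}

# Lower-cardinality alternative: group CPU usage by namespace instead of pod.
cpu_by_namespace_query() {
  printf 'sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))\n'
}

# Usage (requires a reachable server):
#   curl -s -G "$PROMETHEUS_URL/api/v1/query" \
#     --data-urlencode "query=$(label_cardinality_query container_cpu_usage_seconds_total pod)"
```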
# Check Kafka consumer lag
curl -s -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=sum by (consumergroup, topic) (kafka_consumergroup_lag)' | jq '.data.result[] | {group: .metric.consumergroup, topic: .metric.topic, lag: .value[1]}'
# Top 10 nodes by CPU usage (idle rate averaged across cores per instance)
curl -s -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=topk(10, 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))))' | jq '.data.result[] | {node: .metric.instance, cpu_pct: .value[1]}'
# Disk space prediction (will it be full in 24h); a predicted value <= 0 means the filesystem is expected to fill
curl -s -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[24h], 86400)' | jq '.data.result[] | {instance: .metric.instance, predicted_bytes: .value[1]}'
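To turn the predicted byte count into a yes/no answer, the value can be compared against zero. This is a sketch assuming a POSIX shell with `awk` available; the helper name `classify_prediction` and its output labels are hypothetical.

```shell
# Classify a predict_linear() result: a predicted value at or below zero
# means the filesystem is expected to run out of space within the horizon.
# awk does the comparison because Prometheus returns sample values as
# strings that may use scientific notation, such as "-1.2e+09".
classify_prediction() {
  awk -v v="$1" 'BEGIN { if (v + 0 <= 0) print "will_fill"; else print "ok" }'
}

# Usage: feed it the predicted_bytes value from the query above, e.g.
#   classify_prediction "-1.2e+09"
```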
npx claudepluginhub addxai/enterprise-harness-engineering --plugin enterprise-harness-engineering