From home-lab-ops
Validates Prometheus metrics, Grafana dashboards, and the monitoring stack configuration to prevent silent breakage when exporter versions change
Slash command
/home-lab-ops:monitoring-guard
The monitoring stack runs on **VM 203** (10.220.1.63) as Docker Compose services:
monitoring/
├── prometheus/ # Metrics collection + alerting
├── grafana/ # Dashboards + visualization
├── loki/ # Log aggregation
├── promtail/ # Log shipping (runs on all cluster nodes)
└── exporters/
    ├── node-exporter # All 6 Proxmox nodes + all VMs
    ├── pve-exporter # Proxmox VE metrics (runs on monitoring VM)
    ├── ceph-exporter # Ceph cluster metrics
    ├── ipmi-exporter # iDRAC hardware metrics
    ├── pbs-exporter # Proxmox Backup Server metrics
    └── unifi-poller # UniFi Dream Machine Pro metrics
# SSH to monitoring VM
ssh <user>@10.220.1.63
# Check all containers running
docker compose ps
# Check each exporter is actually scraping
curl -s localhost:9100/metrics | head -5 # node_exporter
curl -s -o /dev/null -w '%{http_code}\n' localhost:8082/pve # pve-exporter (expect 200)
curl -s localhost:9283/metrics | head -5 # ceph-exporter
# Prometheus health
curl -s localhost:9090/-/healthy
curl -s localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"health"|"job"'
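The grep above prints every job/health pair, which gets noisy with many targets. A minimal Python sketch that lists only the targets that are not up, assuming the standard `/api/v1/targets` response shape (`data.activeTargets[]` carrying `labels`, `health` and `lastError`). The function name `unhealthy_targets` and the sample payload are hypothetical and exist only for illustration.

```python
def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every active target that is not 'up'."""
    bad = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            bad.append((
                labels.get("job", "?"),
                labels.get("instance", "?"),
                target.get("lastError", ""),
            ))
    return bad


# Made-up sample in the shape Prometheus returns; real data comes from
# `curl -s localhost:9090/api/v1/targets` parsed with json.load().
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "10.220.1.8:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "pve", "instance": "localhost:8082"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(sample):
    print(f"{job} {instance}: {error or 'no error reported'}")
```

An empty result means every active target was scraped successfully on its last attempt.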
# On monitoring VM — verify exporter returns data
curl -s localhost:<exporter_port>/metrics | grep <metric_name>
# Check target health in Prometheus
curl -s 'localhost:9090/api/v1/targets' | python3 -m json.tool | grep -A3 '"job": "<job_name>"'
# Query for a metric
curl -s 'localhost:9090/api/v1/query?query=<metric_name>' | python3 -m json.tool
See references/metric-registry.md for known metric name changes between exporter versions.
In Prometheus, explore available metrics:
curl -s 'localhost:9090/api/v1/label/__name__/values' | python3 -m json.tool | grep <keyword>
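When an exporter upgrade renames a metric, the old name simply disappears from this list. One way to catch that is to compare the names a dashboard or alert depends on against what Prometheus currently reports. The sketch below assumes the standard `/api/v1/label/__name__/values` response (`status` plus a flat `data` list); the helper names and the sample response are hypothetical.

```python
def names_from_label_values(payload):
    """Pull the metric-name list out of a /api/v1/label/__name__/values response."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus returned status {payload.get('status')!r}")
    return payload.get("data", [])


def missing_metrics(expected, available):
    """Return the expected metric names that Prometheus does not currently know."""
    known = set(available)
    return sorted(name for name in expected if name not in known)


# Made-up response for illustration; real data comes from the curl command above.
response = {"status": "success", "data": ["node_load1", "up"]}

# Hypothetical list of names that dashboards and alert rules rely on.
expected = ["up", "node_load1", "node_cpu_seconds_total"]

for name in missing_metrics(expected, names_from_label_values(response)):
    print(f"missing from Prometheus: {name}")
```

A name reported as missing is a candidate for lookup in references/metric-registry.md before editing any dashboard.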
# List rule names and their evaluation health
curl -s 'localhost:9090/api/v1/rules' | python3 -m json.tool | grep -E '"name"|"health"'
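The grep output interleaves group names and rule names, so a failing rule is easy to miss. A short sketch that reports only rules whose health is not `ok`, assuming the standard `/api/v1/rules` response shape (`data.groups[].rules[]` with `name`, `health` and an optional `lastError`). The function name `failing_rules` and the sample data are hypothetical.

```python
def failing_rules(payload):
    """Return (group, rule, health, last_error) for rules whose health is not 'ok'."""
    failing = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("health") != "ok":
                failing.append((
                    group.get("name", "?"),
                    rule.get("name", "?"),
                    rule.get("health", "unknown"),
                    rule.get("lastError", ""),
                ))
    return failing


# Made-up sample for illustration; real data comes from the curl command above.
sample = {
    "status": "success",
    "data": {"groups": [{
        "name": "node",
        "rules": [
            {"name": "HostDown", "health": "ok"},
            {"name": "DiskFull", "health": "err", "lastError": "parse error"},
        ],
    }]},
}

for group, rule, health, error in failing_rules(sample):
    print(f"{group}/{rule}: {health} {error}")
```

Rules that have not been evaluated yet report `unknown`, so a few entries right after a Prometheus restart are expected and should clear after one evaluation interval.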
When modifying a Grafana dashboard JSON:
- Verify the metric names used in the "expr": fields in the dashboard JSON
- instance, job, host labels must match what Prometheus records
# Extract all metric expressions from a dashboard JSON
cat roles/monitoring/files/grafana/dashboards/<dashboard>.json | \
python3 -c "import json,sys; d=json.load(sys.stdin); \
[print(t.get('expr','')) for panel in d.get('panels',[]) \
for t in panel.get('targets',[])]"
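The one-liner above only looks at top-level panels. Grafana can also nest panels inside collapsed row panels, and older dashboards keep them under a top-level `rows` list, so expressions there would be skipped. A sketch of a recursive variant follows; the function names and the sample dashboard are hypothetical, and it assumes each query target stores its PromQL in an `expr` field as in the command above.

```python
def iter_panels(node):
    """Yield every panel under node, descending into nested rows."""
    for panel in node.get("panels", []):
        yield panel
        # Collapsed row panels carry their children in their own "panels" list.
        yield from iter_panels(panel)
    # Older dashboard schema: top-level "rows", each holding "panels".
    for row in node.get("rows", []):
        yield from iter_panels(row)


def dashboard_exprs(dashboard):
    """Return every non-empty 'expr' string found in the dashboard's panel targets."""
    return [
        target["expr"]
        for panel in iter_panels(dashboard)
        for target in panel.get("targets", [])
        if target.get("expr")
    ]


# Made-up dashboard for illustration; a real one would be loaded with
# json.load() from roles/monitoring/files/grafana/dashboards/<dashboard>.json.
sample = {
    "panels": [
        {"type": "row", "panels": [{"targets": [{"expr": "up"}]}]},
        {"targets": [{"expr": "node_load1"}, {"refId": "B"}]},
    ]
}

for expr in dashboard_exprs(sample):
    print(expr)
```

Each printed expression can then be pasted into the `/api/v1/query` check to confirm it still returns data.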
Docker containers that run as a non-root user can't read config files that are owned by another user (typically root) and were created with mode 0600.
# Check permissions on config files
ls -la /opt/monitoring/prometheus/
# All .yml files should be 0644
sudo chmod 644 /opt/monitoring/prometheus/*.yml
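To find the offending files before changing anything, the same check can be scripted. This is a minimal sketch that flags files lacking the world-read bit; the function name `not_world_readable` is hypothetical, and testing the world-read bit is a deliberately conservative stand-in for "readable by the container user", since a file owned by that user would be readable even at 0600.

```python
import stat
from pathlib import Path


def not_world_readable(directory, pattern="*.yml"):
    """Return files matching pattern whose mode lacks the world-read bit."""
    return sorted(
        str(path)
        for path in Path(directory).glob(pattern)
        if not path.stat().st_mode & stat.S_IROTH
    )


# Example call on the monitoring VM (path taken from the commands above):
#   for name in not_world_readable("/opt/monitoring/prometheus"):
#       print(f"not world-readable: {name}")
```

Files listed by this check are the ones the `chmod 644` above would actually change.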
# Verify node_exporter is running on a Proxmox host
ssh <user>@10.220.1.8 "systemctl status prometheus-node-exporter"
# Should be active/running and listening on port 9100
# Verify from monitoring VM
curl -s 10.220.1.8:9100/metrics | head
Promtail runs on each Proxmox node. Check:
# On a Proxmox node
systemctl status promtail
journalctl -u promtail -n 30
# rsyslog must be forwarding to promtail's port
cat /etc/rsyslog.d/99-promtail.conf
# Should contain: *.* action(type="omfwd" target="localhost" port="1514" protocol="tcp")
When adding a new exporter to the monitoring stack:
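- Add the container definition to roles/monitoring/templates/docker-compose.yml.j2
- Place the exporter's config under roles/monitoring/templates/ or files/
- Add a scrape job to roles/monitoring/files/prometheus/prometheus.yml
- Keep any credentials in the Ansible vault (e.g. vault_<exporter>_api_token)
- Set the config file mode to 0644 in the role task
- Test with docker compose up -d <new_container> before applying via Ansible
See references/metric-registry.md for the metric names exposed by each exporter.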
npx claudepluginhub infiquetra/infiquetra-claude-plugins --plugin home-lab-ops