Monitor and diagnose Obol DVT cluster performance using their hosted Grafana (Prometheus metrics + Loki logs)
Slash command: /obol:obol-monitoring
Diagnose health issues, duty failures, and performance issues and opportunities across Obol Distributed Validator Technology (DVT) clusters using Grafana datasources (Prometheus for metrics, Loki for logs).
Charon is distributed validator middleware for Ethereum proof-of-stake. A cluster of 4-7 independent nodes each run Charon to coordinate threshold BLS signing for shared validators. If a threshold t of n nodes agree on duty data and sign it, the duty succeeds. Typical configurations: 3-of-4 (fault tolerance 1) or 5-of-7 (fault tolerance 2).
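The relationship between cluster size, threshold, and fault tolerance can be sketched as follows. This is a minimal sketch that assumes the threshold is ceil(2n/3), which is consistent with both configurations quoted above (3-of-4 and 5-of-7); the function names are hypothetical, and the authoritative value for a real cluster should be read from its lock file or metrics rather than computed.

```python
def assumed_threshold(operators: int) -> int:
    """Assumed signing threshold for n operators: ceil(2n/3), in integer arithmetic."""
    return -(-2 * operators // 3)


def fault_tolerance(operators: int, threshold: int) -> int:
    """Number of nodes that can be offline while duties can still reach threshold."""
    return operators - threshold


if __name__ == "__main__":
    for n in (4, 7):
        t = assumed_threshold(n)
        print(f"{t}-of-{n}: fault tolerance {fault_tolerance(n, t)}")
```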
Each node in a cluster runs:
Deployed via docker-compose from CDVN or LCDVN repos.
Every validator duty flows through these stages. A failure at any stage produces a specific reason code:
Scheduler → Fetcher → Consensus (QBFT) → DutyDB → ValidatorAPI → ParSigDB → ParSigEx → SigAgg → AggSigDB → Broadcast
| Duty Type | Timing | Economic Impact | Notes |
|---|---|---|---|
| proposer | Slot start (T+0) | High: missed block reward | Rare but high-value |
| attester | T + slot_duration/3 (~T+4s) | Medium: missed attestation reward + inactivity penalty | Most common duty |
| sync_message / sync_contribution | Various | Low: small sync committee rewards | Only for validators in sync committee |
| aggregator / prepare_aggregator | T + 2*slot_duration/3 (~T+8s) | None: no direct economic reward or penalty | Non-economic, deprioritize |
Slot timing: 12 seconds per slot, 32 slots per epoch (6.4 minutes).
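The slot arithmetic above can be turned into wall-clock times when correlating a slot number with log timestamps. The sketch below assumes the Ethereum mainnet beacon chain genesis time (1606824023, Unix seconds); the helper names are hypothetical, and other networks would need their own genesis value.

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
MAINNET_GENESIS_TIME = 1606824023  # assumption: mainnet beacon chain genesis (Unix seconds)


def slot_start(slot: int, genesis: int = MAINNET_GENESIS_TIME) -> int:
    """Unix time at which the given slot starts (T+0)."""
    return genesis + slot * SECONDS_PER_SLOT


def duty_times(slot: int, genesis: int = MAINNET_GENESIS_TIME) -> dict:
    """Nominal duty trigger times for a slot, following the timing column above."""
    t0 = slot_start(slot, genesis)
    return {
        "proposer": t0,                                # slot start
        "attester": t0 + SECONDS_PER_SLOT / 3,         # ~T+4s
        "aggregator": t0 + 2 * SECONDS_PER_SLOT / 3,   # ~T+8s
    }


def epoch_of(slot: int) -> int:
    """Epoch that contains the given slot."""
    return slot // SLOTS_PER_EPOCH


if __name__ == "__main__":
    print(slot_start(13867535), epoch_of(13867535), duty_times(13867535))
```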
Set the Grafana API token:
export OBOL_GRAFANA_API_TOKEN="glsa_..."
python3 ${CLAUDE_SKILL_DIR}/scripts/cluster_triage.py "Cluster Name" [--network mainnet] [--hours 1]
Outputs JSON with: cluster config, health status (readyz), versions, consensus performance, per-peer participation rates, duty failure reasons, P2P connectivity, beacon node health, balance summary, and log availability.
python3 ${CLAUDE_SKILL_DIR}/scripts/duty_analysis.py "Cluster Name" 13867535 [--duty attester] [--network mainnet]
Reconstructs a chronological timeline of consensus events for a specific slot from Loki logs. Shows per-peer: when consensus started, when pre-prepares arrived, round changes, BN call timings, and the final outcome.
python3 ${CLAUDE_SKILL_DIR}/scripts/fleet_overview.py [--network mainnet] [--hours 1]
Aggregates across all clusters: version distribution, BN/VC client diversity, failure rates by reason, worst-performing clusters, and Loki log coverage.
If the agent cannot execute scripts, use these Prometheus queries directly via the Grafana API:
- Cluster health: `app_monitoring_readyz{cluster_name="X",cluster_network="mainnet"}`
- Attester failure rate: `sum(rate(core_tracker_failed_duties_total{cluster_name="X",duty="attester"}[1h])) / sum(rate(core_tracker_expect_duties_total{cluster_name="X",duty="attester"}[1h]))`
- Per-peer missed participation (per hour): `sum(rate(core_tracker_participation_missed_total{cluster_name="X",duty="attester"}[1h])) by (cluster_peer, peer) * 3600`
- Failure reasons (per hour): `sum(rate(core_tracker_failed_duty_reasons_total{cluster_name="X"}[1h])) by (duty, reason) * 3600`
- Consensus timeouts (per hour): `sum(rate(core_consensus_timeout_total{cluster_name="X",duty="attester"}[1h])) by (cluster_peer) * 3600`
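One way to issue these queries without the bundled scripts is through Grafana's datasource proxy. The sketch below only builds the request and does not send it. The Grafana host and the Prometheus datasource UID are hypothetical placeholders, and the proxy path is an assumption about how the hosted Grafana is exposed; only the PromQL text and the OBOL_GRAFANA_API_TOKEN variable come from this document.

```python
import os
import urllib.parse
import urllib.request

GRAFANA_URL = "https://grafana.example.com"  # hypothetical host
PROM_DATASOURCE_UID = "prometheus-uid"       # hypothetical datasource UID


def attester_failure_rate_query(cluster: str, window: str = "1h") -> str:
    """Build the attester failure-rate PromQL listed above for one cluster."""
    sel = f'cluster_name="{cluster}",duty="attester"'
    return (
        f"sum(rate(core_tracker_failed_duties_total{{{sel}}}[{window}])) / "
        f"sum(rate(core_tracker_expect_duties_total{{{sel}}}[{window}]))"
    )


def build_request(query: str) -> urllib.request.Request:
    """Prepare an instant-query request against the assumed proxy path."""
    params = urllib.parse.urlencode({"query": query})
    url = (
        f"{GRAFANA_URL}/api/datasources/proxy/uid/"
        f"{PROM_DATASOURCE_UID}/api/v1/query?{params}"
    )
    token = os.environ.get("OBOL_GRAFANA_API_TOKEN", "")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    req = build_request(attester_failure_rate_query("X"))
    print(req.full_url)
    # To send it: json.load(urllib.request.urlopen(req)) once host and UID are real.
```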
When a user reports a cluster issue, follow this sequence:
Get the cluster name and network (default: mainnet). Run cluster_triage.py.
If any peer has readyz != 1, that's the first thing to report:
Compare attester timeout rate to decision rate. If timeout / (timeout + decisions) > 5%, the cluster has a consensus problem. Check:
Look at per-peer participation rates. The peer with the lowest success rate is likely the bottleneck. Common causes:
- High BN API latency (`app_eth2_latency_seconds` for the attestation_data endpoint): Teku BNs have ~170ms median, Lighthouse ~5ms
- Low BN peer count (`app_beacon_node_peers`): a low count means blocks arrive late

Use the failure reason reference below to provide actionable guidance.
If the issue is intermittent or slot-specific, use duty_analysis.py for a specific failing slot to reconstruct the timeline.
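The decision rules in this sequence can be sketched as a small function. This is an illustration only: the input shapes (a readyz value per peer, timeout and decision counts, a success rate per peer) and the peer names are hypothetical and are not the output format of cluster_triage.py.

```python
def triage(readyz: dict, timeouts: float, decisions: float, participation: dict) -> list:
    """Apply the workflow checks in order and return human-readable findings."""
    findings = []

    # Any peer with readyz != 1 is reported first.
    unhealthy = sorted(peer for peer, value in readyz.items() if value != 1)
    if unhealthy:
        findings.append(f"readyz != 1 on: {', '.join(unhealthy)}")

    # timeout / (timeout + decisions) above 5% indicates a consensus problem.
    total = timeouts + decisions
    ratio = timeouts / total if total else 0.0
    if ratio > 0.05:
        findings.append(f"consensus problem: timeout ratio {ratio:.1%}")

    # The peer with the lowest success rate is the likely bottleneck.
    if participation:
        worst = min(participation, key=participation.get)
        findings.append(f"lowest participation: {worst} ({participation[worst]:.1%})")

    return findings


if __name__ == "__main__":
    for finding in triage(
        readyz={"peer-a": 1, "peer-b": 4},
        timeouts=12,
        decisions=188,
        participation={"peer-a": 0.99, "peer-b": 0.80},
    ):
        print(finding)
```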
| Code | What Broke | What To Do |
|---|---|---|
| fetch_bn_error | BN API call failed during fetch | Check BN health, syncing status, connectivity. Check app_monitoring_readyz. |
| broadcast_bn_error | BN rejected the broadcast submission | Check BN logs, version compatibility, disk space. |
| not_included_onchain | Duty was broadcast but not included on-chain | Check broadcast delay via core_bcast_broadcast_delay_seconds. Investigate BN network health and relay connectivity. |

| Code | What Broke | What To Do |
|---|---|---|
| no_consensus | QBFT consensus didn't complete | Check peer connectivity (p2p_ping_success), P2P latency, firewall rules. If one cluster: likely a slow/offline peer. If many clusters: likely a relay issue. |

| Code | What Broke | What To Do |
|---|---|---|
| no_local_vc_signature | VC didn't submit a partial signature | Check VC is running, connected to Charon port 3600, keys loaded correctly. Check VC logs. |

| Code | What Broke | What To Do |
|---|---|---|
| no_peer_signatures | No signatures received from any peer | All peers offline or complete P2P failure. Check relay URLs, firewall. |
| insufficient_peer_signatures | Some but not enough signatures for threshold | Identify which peers are missing via core_tracker_participation_missed_total. |

| Code | What Broke | What To Do |
|---|---|---|
| failed_proposer_randao | Prerequisite RANDAO duty failed | Fix the RANDAO failure first; likely a peer connectivity issue. |
| proposer_insufficient_randaos | Not enough partial RANDAO signatures | Some VCs didn't sign RANDAO. Check VC connectivity. |
| proposer_zero_randaos | Zero RANDAO signatures from local VCs | Local VC not signing. Check VC health. |
| proposer_no_external_randaos | No RANDAO signatures from peers | P2P exchange failed. Check peer connectivity. |

| Code | What Broke | What To Do |
|---|---|---|
| failed_aggregator_selection | Prepare aggregator duty failed | Non-economic. Check if underlying attester duty also failed. |
| insufficient_aggregator_selections | Not enough partial beacon committee selections | Non-economic. VC connectivity issue. |
| zero_aggregator_prepares | Zero selections from local VCs | Non-economic. VC not participating. |
| no_aggregator_selections | No selections received from peers | Non-economic. P2P exchange issue. |
| missing_aggregator_attestation | Attestation data unavailable for aggregation | The underlying attester duty failed. Fix attester first. |

| Code | What Broke | What To Do |
|---|---|---|
| sync_contribution_no_sync_msg | Prerequisite sync message duty failed | Check sync message consensus. |
| sync_contribution_few_prepares | Insufficient sync contribution selections | VC connectivity. |
| sync_contribution_zero_prepares | Zero sync contribution selections | VC not participating. |
| sync_contribution_failed_prepare | Prepare sync contribution failed | Check underlying duty. |
| sync_contribution_no_external_prepares | No selections from peers | P2P exchange. |
| par_sig_db_inconsistent_sync | Inconsistent sync committee signatures | Known limitation, expected in current versions. Not a bug. |

| Code | What Broke | What To Do |
|---|---|---|
| bug_fetch_error | Unexpected fetch error | Report to Obol with version, cluster name, slot. |
| bug_par_sig_db_inconsistent | Inconsistent partial signatures (non-sync) | Report to Obol. |
| bug_par_sig_db_external | Failed to store external signatures | Report to Obol. |
| bug_par_sig_db_internal | ParSigDB didn't trigger exchange | Report to Obol. May occur due to expiry race. |
| bug_sig_agg | BLS threshold aggregation failed | Report to Obol. Inconsistent signed data. |
| bug_aggregation_error | Failed to store aggregated signature | Report to Obol. |
| bug_duty_db_error | Failed to store in DutyDB | Report to Obol. |
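When many reason codes show up at once, the tables above suggest a rough ordering: `bug_` codes are reported to Obol, the aggregator selection codes are non-economic, and par_sig_db_inconsistent_sync is a known limitation. A minimal sketch of that sorting follows; the category labels and function names are hypothetical and exist only for this illustration.

```python
NON_ECONOMIC = {
    "failed_aggregator_selection",
    "insufficient_aggregator_selections",
    "zero_aggregator_prepares",
    "no_aggregator_selections",
}
KNOWN_LIMITATION = {"par_sig_db_inconsistent_sync"}

# Lower number = look at it sooner (hypothetical ordering for this sketch).
PRIORITY = {"investigate": 0, "report_to_obol": 1, "deprioritize": 2, "expected": 3}


def classify_reason(code: str) -> str:
    """Map a failure reason code to a coarse action category."""
    if code.startswith("bug_"):
        return "report_to_obol"
    if code in KNOWN_LIMITATION:
        return "expected"
    if code in NON_ECONOMIC:
        return "deprioritize"
    return "investigate"


def order_reasons(codes: list) -> list:
    """Sort reason codes so actionable ones come first, keeping input order within a category."""
    return sorted(codes, key=lambda code: PRIORITY[classify_reason(code)])


if __name__ == "__main__":
    seen = ["zero_aggregator_prepares", "bug_sig_agg", "no_consensus"]
    for code in order_reasons(seen):
        print(code, "->", classify_reason(code))
```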
- `app_monitoring_readyz`: Operational status. Values 1-8 (see Step 2 above).
- `core_tracker_expect_duties_total{duty="..."}`: Total expected duties
- `core_tracker_success_duties_total{duty="..."}`: Successfully completed
- `core_tracker_failed_duties_total{duty="..."}`: Failed duties
- `core_tracker_failed_duty_reasons_total{duty="...",reason="..."}`: By reason code
- `core_tracker_participation_total{duty="...",peer="..."}`: Peer participated
- `core_tracker_participation_missed_total{duty="...",peer="..."}`: Peer missed
- `core_consensus_timeout_total{duty="...",timer="..."}`: Consensus failures (the duty timed out entirely)
- `core_consensus_decided_rounds{duty="..."}`: Rounds needed to decide (1 is ideal)
- `core_consensus_duration_seconds{duty="..."}`: Time to reach consensus
- `eager_dlinear` = EagerDoubleLinear (default). Round N budget = N seconds from duty start time.
- `app_eth2_latency_seconds{endpoint="..."}`: BN API latency. Key: attestation_data (should be <50ms for Lighthouse, ~170ms for Teku)
- `app_beacon_node_peers`: BN's own peer count. <50 is concerning.
- `app_beacon_node_version{version="..."}`: BN software version.
- `app_eth2_errors_total{endpoint="..."}`: BN API errors by endpoint.
- `p2p_ping_latency_secs{peer="..."}`: Latency per peer. <100ms good, >500ms problematic.
- `p2p_ping_success{peer="..."}`: 1=connected, 0=not. Best proxy for live connectivity.
- `p2p_peer_connection_types{peer="...",type="...",protocol="..."}`: direct (good) or relay (higher latency).
- `app_version{version="..."}`: Charon version per peer.
- `core_validatorapi_vc_user_agent{user_agent="..."}`: VC client. Missing UA = likely Lighthouse VC.
- `cluster_operators`: Authoritative cluster size (from lock file). Use this, NOT count of reporting peers.
- `cluster_threshold`: Minimum signatures needed.
- `cluster_validators`: Number of validators in cluster.
- `app_feature_flags{feature_flags="..."}`: Custom-enabled features. Absence does NOT mean feature is off: stable features (EagerDoubleLinear, ConsensusParticipate, ProposalTimeout, FetchOnlyCommIdx0) are on by default without this metric.
- `core_scheduler_validator_balance_gwei{pubkey="...",pubkey_full="..."}`: Per-validator balance in gwei.
- `core_bcast_broadcast_total{duty="..."}`: Successful broadcasts per duty type.
- `core_bcast_broadcast_delay_seconds{duty="..."}`: Delay from slot start to broadcast.

Base selector: `{cluster_name="X",cluster_network="mainnet"}`
Timestamp handling: Always parse the embedded ts= field (e.g., ts=2026-03-09T20:53:59.123Z). Loki receipt timestamps can skew 1-2 seconds due to batching. The ts= field is the application-level timestamp.
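Extracting the application-level timestamp can be sketched as below. Only the `ts=` field and its example value come from this document; the rest of the sample log line (the `level=` and `msg=` fields) is a hypothetical layout used for illustration.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the embedded ts= field anywhere in a log line.
TS_RE = re.compile(r"\bts=(\S+)")


def parse_log_ts(line: str) -> Optional[datetime]:
    """Return the embedded ts= value as an aware UTC datetime, or None if absent."""
    match = TS_RE.search(line)
    if match is None:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(match.group(1).replace("Z", "+00:00"))


if __name__ == "__main__":
    sample = 'ts=2026-03-09T20:53:59.123Z level=info msg="QBFT consensus decided"'
    print(parse_log_ts(sample))
```

Sorting events by this value rather than by the Loki receipt time avoids the 1-2 second batching skew when reconstructing a slot timeline.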
Log presence detection: Use the Loki /series endpoint with match[]={cluster_name="X"} to check which peers send logs. Do NOT rely on query_range with limit=1.
| Event | LogQL filter |
|---|---|
| Slot start | `\|= "Slot ticked"` |
| Consensus start | `\|= "QBFT consensus instance starting"` |
| Pre-prepare received | `\|= "QBFT upon rule triggered"` |
| Round change | `\|= "QBFT round changed"` |
| Consensus decided | `\|= "QBFT consensus decided"` |
| Consensus timeout | `\|= "consensus timeout"` |
| BN call finished | `\|= "Beacon node call finished"` |
| Slow BN call | `\|= "Beacon node call took longer"` |
| Duplicate signatures | `\|= "Ignoring duplicate partial signature"` |
| Signature aggregated | `\|= "Successfully aggregated partial signatures"` |
| Attestation submitted | `\|= "Successfully submitted v2 attestations"` |
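Combining the base selector with one of the line filters above gives a complete LogQL query. The sketch below builds the query string and the URL parameters for a range query without sending anything. The helper names are hypothetical, and the parameter names (`query`, `start`, `end`, `limit`, `direction`) assume Loki's standard query_range endpoint is reachable through the hosted Grafana.

```python
import urllib.parse
from typing import Optional


def logql(cluster: str, network: str = "mainnet", contains: Optional[str] = None) -> str:
    """Build a LogQL query from the base selector plus an optional line filter."""
    query = f'{{cluster_name="{cluster}",cluster_network="{network}"}}'
    if contains:
        query += f' |= "{contains}"'
    return query


def query_range_params(query: str, start_ns: int, end_ns: int, limit: int = 1000) -> str:
    """URL-encode parameters for a range query (timestamps in Unix nanoseconds)."""
    return urllib.parse.urlencode(
        {
            "query": query,
            "start": start_ns,
            "end": end_ns,
            "limit": limit,
            "direction": "forward",  # oldest first, convenient for timelines
        }
    )


if __name__ == "__main__":
    q = logql("X", contains="QBFT round changed")
    print(q)
    print(query_range_params(q, 1773234443 * 10**9, 1773234455 * 10**9))
```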
| Client | attestation_data p50 | Notes |
|---|---|---|
| Lighthouse | ~5ms | Fast, most common |
| Teku | ~170ms | Significantly slower, can cause round 1 misses in tight clusters |
| Prysm | ~5ms | Fast |
| Nimbus | ~5-10ms | Generally fast |
| Lodestar | ~5ms | Fast |
| Grandine | ~5ms | Fast |
- Check the `app_log_loki_dropped_total` metric: if non-zero, the Loki buffer is overflowing (better than blocking, but indicates high log volume).
- Take cluster size from the `cluster_operators` metric, NOT the count of observed peers. Some nodes don't send metrics.
- `cluster_operators` - observed peers = unknown/offline nodes. Report them explicitly.
- Do not infer that a feature is off from its absence in `app_feature_flags`.
- `not_included_onchain`: Investigate by checking broadcast delay metrics and BN connectivity.
- Use the embedded `ts=` field, not the Loki receipt timestamp (can skew 1-2s).