Analyzes Windows Server Failover Cluster (WSFC) CLUSTER.LOG files for Always On Availability Group root-cause diagnosis. Use this skill when an availability group has gone offline, a failover occurred unexpectedly, or a node was evicted, and you need to identify the WSFC-level cause that SQL Server DMVs cannot see. Applies 30 checks (L1–L30) covering lease timeouts, health check failures, quorum loss, node eviction, network partition, RHS crashes, AG resource transitions, Cloud Witness, Azure Arc, and Contained AG.
Slash command: /mssql-performance-skills:sqlclusterlog-review
Analyze Windows Server Failover Cluster (WSFC) CLUSTER.LOG files to diagnose Always On Availability Group failures at the cluster level — the layer below SQL Server DMVs. Applies 30 checks (L1–L30) across five categories:
Accept any of:
- CLUSTER.LOG (live log: C:\Windows\Cluster\cluster.log; log generated via Get-ClusterLog: C:\Windows\Cluster\Reports\CLUSTER.LOG)

For full analysis, the log should cover at least the 10 minutes before the incident and include entries from all cluster nodes. If only a partial extract is available, note which time range and nodes are covered and flag L25 if node coverage appears incomplete.
WSFC log entries follow this pattern (process ID first, then thread ID, both in hexadecimal):
<pid>.<tid>::<YYYY>/<MM>/<DD>-<HH>:<mm>:<SS>.<ms> <LEVEL> [<COMPONENT>] <message>
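As a rough illustration of how the pattern above could be parsed, here is a minimal Python sketch. The regular expression, the field names (`first_id`, `second_id`, and so on), the three-digit millisecond assumption, and the sample line are all illustrative; the sample is a synthetic line built from the pattern, not real cluster output.

```python
import re
from datetime import datetime

# Mirrors the documented layout: two dot-separated hex IDs, "::", a timestamp,
# a severity level, a bracketed component, then the free-text message.
LINE_RE = re.compile(
    r"^(?P<first_id>[0-9a-fA-F]+)\.(?P<second_id>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s*"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for a line that matches the pattern, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp text into a datetime so entries can be compared.
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


# Synthetic example line (assumption: made up to match the pattern only).
sample = "00001a2c.00002b40::2026/05/16-14:32:01.123 ERR   [RES] example message text"
entry = parse_line(sample)
```

Lines that do not match (continuation lines, blank lines, headers) come back as `None`, so a caller can skip them or attach them to the previous entry.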
Key components:
- [RES]: Resource DLL host (hadrres.dll operations)
- [hadrag]: AG-specific resource agent inside RES
- [RHS]: Resource Hosting Subsystem (manages RES process lifecycle)
- [RCM]: Resource Control Manager (orchestrates state transitions)
- [NM]: Network Manager
- [NODE]: Node membership and heartbeat
- ERR / WARN / INFO: Severity prefixes in log lines

Thresholds Reference:

| Threshold | Value | Used by |
|---|---|---|
| Error burst window | >10 ERR lines in 5 min → Critical; >5 → Warning | L4 |
| Failover cycling | ≥3 group moves in 30 min → Critical; ≥2 → Warning | L5 |
| Log time gap | >30 min → Critical; >5 min → Warning | L8 |
| Pending state duration | >120 sec → Critical; >30 sec → Warning | L12 |
| Lease timeout | 20 sec (SQL Server default LeaseTimeout — distinct from HealthCheckTimeout) | L1 |
| Health check timeout | 30 sec (SQL Server default HealthCheckTimeout for sp_server_diagnostics) | L2 |
| Heartbeat timeout | 5 missed heartbeats (SameSubnetThreshold and CrossSubnetThreshold both default to 5 through Windows Server 2012 R2; Windows Server 2016 and later default to 10 and 20 respectively) | L20 |
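The windowed thresholds in the table can be evaluated with a sliding window over entry timestamps. The sketch below is one possible reading, not the skill's actual implementation: the function names are hypothetical, and it assumes that two events exactly one window apart still count as inside the window.

```python
from datetime import datetime, timedelta


def max_in_window(timestamps, window):
    """Largest number of events that fall inside any span of length `window`."""
    ts = sorted(timestamps)
    best = start = 0
    for end in range(len(ts)):
        # Advance the left edge until the span fits inside the window.
        while ts[end] - ts[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def classify(value, warning, critical, inclusive=False):
    """Map a measured value to a severity using the table's two cut-offs."""
    def over(v, t):
        return v >= t if inclusive else v > t
    if over(value, critical):
        return "Critical"
    if over(value, warning):
        return "Warning"
    return None


def error_burst(err_times):
    # >10 ERR lines in 5 min -> Critical; >5 -> Warning
    return classify(max_in_window(err_times, timedelta(minutes=5)), 5, 10)


def failover_cycling(move_times):
    # >=3 group moves in 30 min -> Critical; >=2 -> Warning
    return classify(max_in_window(move_times, timedelta(minutes=30)), 2, 3, inclusive=True)


def log_time_gap(all_times):
    # Longest silence between consecutive entries: >30 min -> Critical; >5 min -> Warning
    ts = sorted(all_times)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    longest = max(gaps, default=timedelta(0))
    return classify(longest, timedelta(minutes=5), timedelta(minutes=30))
```

The same `classify` helper would also cover the pending state duration row by passing 30 and 120 seconds as the cut-offs.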
Evaluate these first — they reveal root causes that explain all downstream AG failures.
[hadrag] Lease Thread terminated, lease time expired, HealthCheckTimeout associated with a lease expiry message, or LeaseExpired in [RES] or [hadrag] context
sp_server_diagnostics timeout via CLUSTER_DIAGNOSTICS_TIMEOUT or raise HealthCheckTimeout in the AG resource properties.

IsAlive check failed, LooksAlive check failed, HealthCheckTimeout, or sp_server_diagnostics returning a failure state (STATE = 3 or STATE = 4) in [RES]/[hadrag] messages
sys.dm_os_ring_buffers for the incident time and review the SQL ERRORLOG for the matching SPID N error.

RHS process terminated, RHS.EXE terminated unexpectedly, creating new RHS process, rhs.exe exit in [RHS] context, or RHS exiting / unhandled exception in RHS
Event ID 1146 (RHS terminated) and corresponding Dr. Watson / crash dump. Common causes: a resource DLL (hadrres.dll or another DLL) threw an unhandled exception, or a third-party DLL was loaded into RHS and faulted. Enable SeparateMonitor on the AG resource to isolate hadrres.dll in its own RHS process (see L24).

[RCM] or [hadrag] group Move, Online, or Offline events for the same AG resource within any 30-minute window
MaxRestarts and eventually leaves the AG permanently offline. Root cause is almost always L1, L2, or L18: a recurring condition that fails every recovery attempt. Identify and fix the root cause before the next failover window. Consider temporarily suspending the AG resource to prevent further cycling while the root cause is resolved.

quorum loss, quorum not achieved, no quorum, lost quorum, or cluster service stopping — no quorum in any component

removing node, node was removed from membership, evicted from cluster, or NodeMembership showing a node leaving in [NODE] or [NM] context
[NODE] entries immediately before the eviction for heartbeat failures (L20) or network partition signals (L18). If eviction was manual, the cluster is operating as expected. If unexpected, review NIC bonding configuration and cross-subnet latency.

These checks fire on SQL Server AG-specific resource events within the WSFC log layer.
TransitionToState ... Online-->Offline, TransitionToState ... OnlinePending-->Offline, OfflineCallIssued, or resource going offline in [RCM] or [hadrag] for an AG resource
[hadrag] log entries immediately before the transition; they explain the SQL-side reason (replica disconnect, data sync failure, or SQL Server error).

Disconnect from SQL Server, SQL Server connection failed, ODBC error, or SqlConnect failed in [hadrag] or [RES] context
sys.dm_exec_sessions vs. max connections); (3) the dedicated admin connection (DAC) is in use: hadrres.dll uses a regular connection, not DAC, so this is not a DAC issue. Restart SQL Server if terminated unexpectedly.

forced failover in [hadrag], [RCM], or [RES] context, or FAILOVER_MODE = MANUAL paired with an Online event on a formerly secondary node
last_commit_lsn on the former secondary (now primary) with the last confirmed last_hardened_lsn on the old primary. If administrator-initiated: document the reason and verify the replica is in synchronized state.

OnlinePending or OfflinePending state for longer than the pending state thresholds (see Thresholds Reference), calculated from the timestamp of the Pending entry to the next state transition entry

hadrres.dll initialization error, failed to initialize, or DLL could not be loaded in [RES] or [RHS] context
sfc /scannow to check for corrupted system files; (4) if after a SQL Server patch, the cluster resource DLL path may need to be updated manually.

API call timed out, Resource DLL returned ... after ... ms, or Dll timeout in [RCM] or [RHS] context for an AG resource
DllWatchdogTimeout value or fix the underlying SQL Server performance issue.

[hadrag] messages showing the primary replica transitioning to Resolving or Secondary role without a corresponding planned failover command

DISCONNECTED, replica disconnected, or connectivity failure messages in [hadrag] or [RES] context that refer to a remote replica endpoint
/sqlwait-review and check for HADR_SYNC_COMMIT and HADR_WORK_QUEUE waits that signal the send queue is backing up.

network partition, split brain, lost quorum due to network, or unable to communicate with a quorum of nodes in any component

cluster network offline, NIC failure, network interface, adapter, or NetworkInterface going to failed state in [NM] context

missed heartbeats, heartbeat timeout, node is not responding, or connectivity timeout between nodes in [NODE] or [NM] context, particularly when the count of missed heartbeats reaches or exceeds the CrossSubnetThreshold or SameSubnetThreshold (see Thresholds Reference)
CrossSubnetDelay and CrossSubnetThreshold for geographically distributed clusters (higher latency = higher threshold needed). Do not tune same-subnet thresholds unless explicitly recommended by Microsoft; reducing them increases false evictions.

witness resource failed, disk witness offline, cannot access file share witness, or cloud witness errors in [RES] or [RCM]
*.blob.core.windows.net), and that the storage account key in the cluster configuration matches the current key.

node isolated, unable to communicate with followed by multiple node names, or all communication lost for a node in [NODE] or [NM] context

VerboseLogging = 0 or VerboseLogging disabled explicitly, or critical diagnostic context (API call durations, resource state details) is absent from entries that would normally include it
(Get-ClusterResource "AG Resource Name") | Set-ClusterParameter VerboseLogging 1. Verbose logging captures API call durations, state transition details, and health check results that are essential for post-incident diagnosis. Note that verbose logging increases disk I/O for the cluster log on busy clusters; test the disk impact before enabling in production.

SeparateMonitor is not enabled for the AG resource
(Get-ClusterResource "AG Resource Name") | Set-ClusterParameter SeparateMonitor 1. This causes the AG resource DLL to run in a dedicated rhs.exe process. An RHS crash in another resource DLL will no longer affect the AG. This is a Microsoft best practice for SQL Server AG resources on Windows Server 2012 R2 and later.

[NODE] membership entries) is greater than the number of distinct node identifiers that appear as log entry sources
Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 60 (omit -Node to collect from all nodes by default). Without logs from all nodes, an isolated node failure or network partition seen only from the failing node's perspective may not be visible. State which nodes are covered in the analysis summary.

CloudWitness entries with Timeout or Unable to reach repeated ≥ 3 times in any 10-minute window; Windows Server 2016+ (Cloud Witness requires WS2016 or later)
<storageaccount>.blob.core.windows.net:443. Verify no firewall or NSG change was made. Rotate or re-enter the access key in Failover Cluster Manager. If the witness is in a different Azure region than the cluster, consider a witness in the nearest region or fail over to a File Share Witness during the outage.

ArcSqlExtension or HybridConnectivity entries with disconnected or heartbeat failure; any SQL Server version with Azure Arc agent installed on cluster nodes
Get-Service -Name 'himds','ArcSqlInstanceExtension'. Check for service restarts in the Windows Event Log. Verify outbound connectivity to *.arc.azure.com:443. Re-run azcmagent connect if the MSI certificate has expired.

<ag_name>_master) in FAILED or OFFLINE state; SQL 2022+ only
Start-ClusterResource -Name '<ag_name>_master'. If it fails, check SQL Server ERRORLOG for the contained system database for corruption or I/O errors. Correlate with /sqlhadr-review (H23) and /sqlerrorlog-review (E31).

CrossSubnet probe entries with FAILED or No response: indicates cross-subnet heartbeat connectivity loss between cluster nodes in different subnets or sites
RouteHistoryLength parameter is set appropriately and that multisite DNS resolution is functioning.

[RHS] Resource 'SQL Server' IsAlive check failed or sp_server_diagnostics output showing state=WARNING or state=ERROR for any component; SQL 2012+
WARNING); Critical (ERROR): sp_server_diagnostics is the health check procedure SQL Server uses to report its own health to the WSFC; component warnings often precede lease timeouts and AG failovers
system (scheduler/I/O non-yielding), resource (memory pressure), query_processing (blocking/deadlock/spinlock), io_subsystem (I/O errors), or events (recent critical events). Each maps to a specific diagnosis path: query_processing warnings → /sqlwait-review; io_subsystem warnings → /sqlerrorlog-review (E15–E19); resource warnings → /sqlerrorlog-review (E9–E14). Capture the full sp_server_diagnostics output: EXEC sys.sp_server_diagnostics.

If the SQL Server version is stated by the user, read VERSION_COMPATIBILITY.md (~/.claude/skills/VERSION_COMPATIBILITY.md if installed, or skills/VERSION_COMPATIBILITY.md from the repo). If unavailable, skip silently. For checks whose minimum version exceeds the instance version: verbose mode → log as SKIP (version: requires SQL 20XX+, instance is SQL 20YY); standard report → omit entirely. Do not suppress NOT ASSESSED rows from missing input; only suppress version-inapplicable checks.
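The version-gating rule can be pictured as a small decision function. This is a sketch under stated assumptions: `version_gate` is a hypothetical name, versions are represented as release years (2012, 2019, 2022), and the real minimum versions would come from VERSION_COMPATIBILITY.md, which is not parsed here.

```python
def version_gate(check_min_year, instance_year, verbose):
    """Decide how a check is reported given its minimum SQL Server version.

    Returns "EVALUATE" when the check applies, a SKIP string in verbose mode,
    or None when a version-inapplicable check is left out of a standard report.
    """
    # No stated instance version or no minimum for the check: nothing to gate on.
    if instance_year is None or check_min_year is None:
        return "EVALUATE"
    if check_min_year <= instance_year:
        return "EVALUATE"
    if verbose:
        return (f"SKIP (version: requires SQL {check_min_year}+, "
                f"instance is SQL {instance_year})")
    return None


# A SQL 2022+ check against a SQL 2019 instance, in verbose and standard mode.
verbose_row = version_gate(2022, 2019, verbose=True)
standard_row = version_gate(2022, 2019, verbose=False)
```

Only the version comparison leads to suppression here; a check that cannot be assessed because input is missing would still be reported as NOT ASSESSED by the caller.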
Structure the report as follows. The reference output in
skills/sqlclusterlog-review/examples/cluster-analysis.md demonstrates the expected quality level.
## Cluster Log Analysis
### Summary
- X Critical, Y Warnings, Z Info
- Time range: [first timestamp] – [last timestamp]
- Nodes covered: [node list from log entries]
- Highest-risk finding: [check name and check ID]
### Critical Issues
### [C1 — L1] Lease Timeout — ag_primary (14:32:01)
- **Observed:** [specific log lines, timestamps, and component tags]
- **Impact:** [why this matters at runtime — what failed and what the user experienced]
- **Fix:** [concrete action referencing the check fix steps]
### Warnings
### [W1 — L4] Error Burst — 8 ERR lines in 3 min (14:31:58–14:34:47)
- **Observed:** ...
- **Impact:** ...
- **Fix:** ...
### Info
### [I1 — L23] VerboseLogging = 0 — sparse event density
- **Observed:** ...
- **Impact:** ...
- **Fix:** ...
### Passed Checks
| Check | Result |
|-------|--------|
| L6 — Quorum Loss | PASS — no quorum loss entries in log |
| L7 — Node Eviction | PASS — no eviction events found |
---
*Analyzed by: [state the AI model and version you are running as, e.g. "Claude Sonnet 4.6", "DeepSeek R1", "GPT-4o"] · [current date and time in the user's local timezone, or UTC if timezone is unknown, e.g. "2026-05-16 20:15 NZST"]*
Labeling convention: Output labels use [C1], [W1], [I1], not raw check IDs.
Check IDs (for example L1 or L9) follow the label inside the same brackets in finding headers, as the template above shows.
Each finding states Observed (exact log evidence) → Impact (runtime effect) → Fix (actionable step). The Passed Checks table explicitly lists every L-check that was evaluated and not triggered, to signal analysis confidence.
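One way to picture the labeling convention is a per-severity counter that builds each finding header. The sketch below is illustrative only: `label_findings` and the dictionary keys are hypothetical, and it assumes findings are numbered in the order they are supplied.

```python
from collections import defaultdict

SEVERITY_PREFIX = {"Critical": "C", "Warning": "W", "Info": "I"}


def label_findings(findings):
    """Attach output labels (C1, W1, I1, ...) numbered separately per severity."""
    counters = defaultdict(int)
    labeled = []
    for finding in findings:
        prefix = SEVERITY_PREFIX[finding["severity"]]
        counters[prefix] += 1
        label = f"{prefix}{counters[prefix]}"
        # \u2014 is the dash the report template places between label and check ID.
        header = f"### [{label} \u2014 {finding['check_id']}] {finding['title']}"
        labeled.append({**finding, "label": label, "header": header})
    return labeled


findings = [
    {"severity": "Critical", "check_id": "L1", "title": "Lease Timeout"},
    {"severity": "Warning", "check_id": "L4", "title": "Error Burst"},
    {"severity": "Critical", "check_id": "L6", "title": "Quorum Loss"},
]
labeled = label_findings(findings)
```

Checks that were evaluated but produced no finding would go to the Passed Checks table instead of receiving a label.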
If fewer than two cluster nodes are represented in the log, note this in the Summary and flag L25. If the log covers less than 5 minutes, note the limited time window.
--brief — Omit the Passed Checks table and attribution footer. Output the Summary, Findings, and Prioritized Fix Sequence sections only. Use when a quick scan of what fired is all that's needed.
--critical-only — Suppress Warning and Info findings. Show only Critical findings. The Passed Checks table is also omitted. Use when triaging an incident and only actionable blockers matter.
Both flags can be combined: --brief --critical-only produces the Summary section plus Critical findings only.
When neither flag is present, produce the full report as documented above.
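The flag behaviour described above can be summarised as a function from flags to report sections. This is a sketch of one reading of the rules: `report_sections` is a hypothetical helper, it assumes the attribution footer stays when only --critical-only is given (the text does not say), and it includes Prioritized Fix Sequence only for --brief on its own because that is the only place the text mentions it.

```python
def report_sections(args):
    """Return the ordered list of report sections for the given flags."""
    brief = "--brief" in args
    critical_only = "--critical-only" in args

    sections = ["Summary", "Critical Issues"]
    if not critical_only:
        # Warning and Info findings are suppressed by --critical-only.
        sections += ["Warnings", "Info"]
    if brief and not critical_only:
        sections.append("Prioritized Fix Sequence")
    if not brief and not critical_only:
        # Both flags drop the Passed Checks table.
        sections.append("Passed Checks")
    if not brief:
        sections.append("Attribution footer")
    return sections


full = report_sections([])
combined = report_sections(["--brief", "--critical-only"])
```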
When the user's request includes --verbose, --trace, or the word verbose:
1. Append a ## Check Evaluation Log section after the Passed Checks table.
Include one row for every check in this skill's ruleset, in check-ID order:
| Check | Evidence | Threshold | Result |
|---|---|---|---|
| [ID — Name] | [key attribute(s) and value found, or "absent"] | [threshold or condition] | PASS / FIRE → [severity] / NOT ASSESSED |
Result conventions:
- PASS: attribute present, threshold not met
- **FIRE → Critical/Warning/Info**: threshold met; bold to distinguish from passes
- NOT ASSESSED: required attribute absent from input

2. Save both files to the current working directory using the Write tool:
output//-/analysis.md ← full report
output//-/trace.md ← Check Evaluation Log
Derive <input-prefix>:
horrible.sqlplan → horrible)run
Sanitize: alphanumeric + hyphens/underscores only, max 32 chars.
File headers:
analysis.md → # Analysis — <skill-name> / # Input: <first 80 chars> / # Generated: <UTC timestamp>
trace.md → # Check Evaluation Log — <skill-name> / # Input: <first 80 chars> / # Generated: <UTC timestamp>
Create directories as needed. When --verbose is not present, write nothing to disk.
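The prefix sanitization and file headers could look roughly like the sketch below. Several details are assumptions rather than documented behaviour: disallowed characters are dropped instead of replaced, the " / " in the header description is read as a line break, the timestamp format is ISO-like, and `input_prefix` and `file_header` are hypothetical names. The output directory layout is left out because its placeholders are not recoverable from the text.

```python
import re
from datetime import datetime, timezone
from pathlib import PurePath


def input_prefix(filename):
    """Derive a prefix from a bare file name: drop the extension, then sanitize."""
    stem = PurePath(filename).stem                 # horrible.sqlplan -> horrible
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", stem)  # keep alphanumerics, hyphens, underscores
    return cleaned[:32]                            # cap at 32 characters


def file_header(kind, skill_name, input_text, now=None):
    """Build the three header lines for analysis.md ("analysis") or trace.md ("trace")."""
    now = now or datetime.now(timezone.utc)
    title = {"analysis": "Analysis", "trace": "Check Evaluation Log"}[kind]
    return "\n".join([
        f"# {title} \u2014 {skill_name}",   # \u2014 matches the dash in the header spec
        f"# Input: {input_text[:80]}",      # first 80 characters of the input
        f"# Generated: {now.strftime('%Y-%m-%dT%H:%M:%SZ')}",
    ])


prefix = input_prefix("horrible.sqlplan")
```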
/sqlhadr-review — SQL-side AG state snapshot: replica sync health, redo/send queue sizes, estimated data loss — the complement to CLUSTER.LOG root-cause analysis
/sqlerrorlog-review — SQL Server ERRORLOG timeline: AG failover events, lease expiry messages, memory pressure, and I/O warnings that correspond to WSFC events
/sqlwait-review — correlate HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and HADR_REPLICA_DDL_END waits with cluster log timestamps to connect the SQL-side wait signal to the WSFC-level event
/sqlquerystore-review — after an AG failover identified in CLUSTER.LOG, use Query Store to detect plan regressions on the new primary
/sqlplan-review — if scheduler starvation caused L1 or L2, analyze the long-running query that blocked the health check thread
mssql-performance-review — Orchestrator that routes mixed artifacts to multiple specialised skills (this one included), runs an adversarial root-cause check, and produces a single consolidated report with evidence chain, risk-rated fixes, and rollback. Use when you have several artifact types together or describe a symptom without knowing which skill to run.
npx claudepluginhub vanterx/mssql-performance-skills --plugin mssql-performance-skills