Troubleshooting procedure for high memory usage and OOM (Out of Memory) conditions on Azure Virtual Machines. Covers both Linux and Windows VMs. Use when a VM shows high memory utilization, OOM kills occur, swap pressure is elevated, or a user reports slow VM performance related to memory exhaustion.
Slash command: `/vm-sre-skills:high-memory-oom-troubleshooting`
> ⚠️ **Example skill** — designed as a starting point. Test and customize for your environment.
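Use this skill when:
- A VM shows high memory utilization
- OOM kills occur
- Swap pressure is elevated
- A user reports slow VM performance related to memory exhaustion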
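This skill guides you through a structured investigation that combines Azure Monitor metrics, in-guest diagnostics run through `az vm run-command`, and the Azure Activity Log.
Before running any commands, determine the target VM's details: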
az vm show --resource-group {rg} --name {vm-name} --query "{os:storageProfile.osDisk.osType, size:hardwareProfile.vmSize, location:location}" -o json
This tells you whether to use Linux or Windows commands in the following steps.
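As a minimal sketch of that branching, the helper below maps the `os` value returned by `az vm show` to the Run Command ID used in the later steps. The function name `command_id_for_os` and the `$rg` / `$vm` variables are hypothetical and not part of the Azure CLI.

```shell
# Map the osType reported by `az vm show` to the Run Command ID used in the
# in-guest steps. `command_id_for_os` is a hypothetical helper name.
command_id_for_os() {
  case "$1" in
    Linux)   echo "RunShellScript" ;;
    Windows) echo "RunPowerShellScript" ;;
    *)       echo "unknown OS type: $1" >&2; return 1 ;;
  esac
}

# Example usage (requires an authenticated Azure CLI session):
#   os=$(az vm show --resource-group "$rg" --name "$vm" \
#          --query "storageProfile.osDisk.osType" -o tsv)
#   command_id=$(command_id_for_os "$os")
```

Failing on an unrecognized value, rather than defaulting to one OS, keeps a mistyped or empty query result from silently selecting the wrong command set.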
Query Azure Monitor for the VM's memory trend over the last hour. Use your built-in Kusto/metrics capabilities to check:
- `Available Memory Bytes` metric: is it steadily declining, or did it drop suddenly?
az monitor metrics list --resource-group {rg} --resource {vm-name} --resource-type Microsoft.Compute/virtualMachines --metric "Available Memory Bytes" --interval PT1M --start-time {start-time} --end-time {end-time} -o json
This gives you a timeline before looking inside the VM.
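To make the "steadily declining or sudden drop" question concrete, here is a rough sketch that classifies a chronological series of samples. The function name `classify_memory_trend` and the 10% and 50% thresholds are assumptions chosen for illustration, not Azure guidance; tune them for your workloads.

```shell
# Classify a chronological series of Available Memory Bytes samples
# (one number per line on stdin) as "stable", "gradual", or "sudden".
# The thresholds below are illustrative assumptions.
classify_memory_trend() {
  awk '
    $1 !~ /^[0-9]+(\.[0-9]+)?$/ { next }   # ignore blanks and non-numeric rows
    n == 0 { first = $1 }
    n > 0  { drop = prev - $1; if (drop > maxdrop) maxdrop = drop }
           { prev = $1; n++ }
    END {
      if (n < 2) { print "insufficient-data"; exit }
      total = first - prev                 # net decline across the window
      if (total <= first * 0.10) {
        print "stable"                     # lost 10% or less of the starting value
      } else if (maxdrop >= total * 0.50) {
        print "sudden"                     # one interval explains half the decline
      } else {
        print "gradual"
      }
    }'
}

# Example usage (assumes the metric values sit under value[0].timeseries[0].data):
#   az monitor metrics list ... \
#     --query "value[0].timeseries[0].data[].average" -o tsv | classify_memory_trend
```

A "gradual" result points toward a leak or steadily growing workload, while "sudden" suggests a discrete event worth matching against the Activity Log.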
> **Important:** `az vm run-command invoke` is a write operation and requires the `RunAzCliWriteCommands` tool, not `RunAzCliReadCommands`.
Before running any commands, confirm the VM agent is healthy. If the agent is unhealthy, Run Command will fail.
az vm get-instance-view --resource-group {rg} --name {vm-name} --query "instanceView.vmAgent.statuses[0]" -o json
If the agent status is not "Ready", do not proceed with Run Command. Instead, check VM boot diagnostics or serial console.
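The gate described above could be scripted roughly as follows. This is a sketch: `agent_is_ready` is a hypothetical helper, and reading the `displayStatus` field of the first status entry is an assumption about the instance view output.

```shell
# Return success only when the VM agent display status is exactly "Ready".
# `agent_is_ready` is a hypothetical helper name.
agent_is_ready() {
  [ "$1" = "Ready" ]
}

# Example usage (requires an authenticated Azure CLI session):
#   status=$(az vm get-instance-view --resource-group "$rg" --name "$vm" \
#              --query "instanceView.vmAgent.statuses[0].displayStatus" -o tsv)
#   if agent_is_ready "$status"; then
#     echo "Agent ready: Run Command can be used"
#   else
#     echo "Agent status is '$status': use boot diagnostics or serial console" >&2
#   fi
```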
Based on the OS type returned by `az vm show`, run the appropriate commands.
On Linux VMs, check the memory usage overview:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "free -m && echo '---' && cat /proc/meminfo | head -20"
Top memory-consuming processes:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "ps aux --sort=-%mem | head -20"
Check swap usage:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "swapon --show && echo '---' && vmstat 1 3"
Check OOM killer activity:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "dmesg -T | grep -i 'oom\|out of memory\|killed process' | tail -30"
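Once the kernel log lines are in hand, a small filter can turn them into the OOM kill count and process names the report asks for. The sketch below assumes the kernel wording `Killed process <pid> (<name>)`, which can vary between kernel versions; `oom_killed_processes` is a hypothetical helper name.

```shell
# Print "pid name" for each process the OOM killer reported killing.
# Reads kernel log text on stdin; the line shape "Killed process <pid> (<name>)"
# is an assumption about the kernel wording.
oom_killed_processes() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example usage on a saved copy of the Run Command output:
#   oom_killed_processes < dmesg_output.txt            # list victims
#   oom_killed_processes < dmesg_output.txt | wc -l    # count OOM kills
```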
Check memory pressure over time:
> **Note:** `sar` requires the `sysstat` package, which is not installed by default on many Linux distributions. The command below attempts `sar` first and falls back to `/proc/meminfo` if unavailable.
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "sar -r 1 5 2>/dev/null || cat /proc/meminfo"
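When the fallback returns raw `/proc/meminfo` text, the used percentage for the report can be derived from `MemTotal` and `MemAvailable`. The following is a sketch under that assumption; `meminfo_used_pct` is a hypothetical helper, and on very old kernels without `MemAvailable` it would report 100%.

```shell
# Compute used-memory percentage from /proc/meminfo-style text on stdin,
# treating MemAvailable (not MemFree) as the memory still usable by workloads.
# `meminfo_used_pct` is a hypothetical helper name.
meminfo_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { exit 1 }           # MemTotal missing: signal failure
      printf "%.1f\n", (total - avail) / total * 100
    }'
}

# Example usage on the VM itself:
#   meminfo_used_pct < /proc/meminfo
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable page cache as memory pressure.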
Check for memory leaks (growing RSS):
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "ps -eo pid,ppid,rss,vsz,comm --sort=-rss | head -20"
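A single `ps` snapshot shows which processes are large, not which are growing. One way to spot growth is to take two snapshots some time apart and compare RSS per PID. The sketch below does that comparison; `rss_growth`, the snapshot file paths, and the 60-second gap in the usage example are illustrative assumptions.

```shell
# Compare two snapshots of `ps -eo pid,rss,comm` output (header lines removed)
# and print "pid comm growth_kb" for processes whose RSS grew, largest first.
# `rss_growth` is a hypothetical helper; its two arguments are snapshot files.
# Only the first word of a command name is kept if the name contains spaces.
rss_growth() {
  awk '
    NR == FNR { before[$1] = $2; next }              # first file: RSS by PID
    ($1 in before) && ($2 + 0 > before[$1] + 0) {    # second file: PIDs that grew
      print $1, $3, $2 - before[$1]
    }' "$1" "$2" | sort -k3,3nr
}

# Example usage on the VM (GNU procps assumed for --no-headers):
#   ps -eo pid,rss,comm --no-headers > /tmp/rss_before
#   sleep 60
#   ps -eo pid,rss,comm --no-headers > /tmp/rss_after
#   rss_growth /tmp/rss_before /tmp/rss_after | head -10
```

Growth over one short interval is only a hint; a process that keeps appearing at the top across several intervals is a stronger leak candidate.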
Check cgroup memory limits:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null && echo '---' && cat /sys/fs/cgroup/memory/memory.usage_in_bytes 2>/dev/null || echo 'cgroup v1 memory controller not available'"
On Windows VMs, run these commands to identify what is consuming memory:
Top memory-consuming processes:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-Process | Sort-Object WorkingSet64 -Descending | Select-Object -First 20 Name, Id, @{N='MemMB';E={[math]::Round($_.WorkingSet64/1MB,1)}}, @{N='VMMB';E={[math]::Round($_.VirtualMemorySize64/1MB,1)}} | Format-Table -AutoSize"
System memory overview:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-CimInstance Win32_OperatingSystem | Select-Object @{N='TotalGB';E={[math]::Round($_.TotalVisibleMemorySize/1MB,2)}}, @{N='FreeGB';E={[math]::Round($_.FreePhysicalMemory/1MB,2)}}, @{N='UsedPct';E={[math]::Round(($_.TotalVisibleMemorySize - $_.FreePhysicalMemory)/$_.TotalVisibleMemorySize * 100,1)}}"
Check memory pressure over time:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-Counter '\Memory\Available MBytes' -SampleInterval 2 -MaxSamples 5"
Page file usage:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-CimInstance Win32_PageFileUsage | Select-Object Name, @{N='AllocatedMB';E={$_.AllocatedBaseSize}}, @{N='UsedMB';E={$_.CurrentUsage}}"
Check pool exhaustion and OOM-equivalent events:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-WinEvent -FilterHashtable @{LogName='System'; Id=2019,2020} -MaxEvents 10 | Format-Table TimeCreated, Id, ProviderName, Message -Wrap"
Check job object or container memory limits:
> **Note:** Windows services may run under Job Objects or inside containers. Use Job Object status for host-level constraints, and `docker stats` or `crictl stats` when the workload is containerised.
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-CimInstance Win32_NamedJobObjectActgInfo | Select-Object Name, PeakJobMemoryUsed | Format-Table -AutoSize"
Recent memory-related events:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-EventLog -LogName System -EntryType Error,Warning -Newest 20 | Where-Object {$_.Message -match 'memory|resource exhaustion|pool'}"
Check the Azure Activity Log for recent operations on the VM or its resource group:
az monitor activity-log list --resource-group {rg} --offset 24h --query "[?contains(to_lower(resourceId), to_lower('{vm-name}'))].{time:eventTimestamp, op:operationName.localizedValue, status:status.localizedValue, caller:caller}" -o table
Look for operations whose timestamps coincide with the onset of memory pressure.
After gathering evidence, produce a report in this format:
## High Memory / OOM Investigation Report
**VM**: {vm-name}
**Resource Group**: {rg}
**OS**: {Linux/Windows}
**VM Size**: {size}
**Investigation Time**: {timestamp}
### Timeline
- Memory pressure started: {time}
- Current available memory: {available} MB / {total} MB ({used-pct}% used)
- Linux swap usage: {swap-used} MB / {swap-total} MB
- Windows pagefile usage: {pagefile-used} MB / {pagefile-total} MB
- Linux OOM kills detected: {yes/no, count from dmesg if applicable}
- Windows pool exhaustion / WerFault events: {yes/no, Event ID 2019/2020 count and crash count if applicable}
- Duration: {duration}
### Root Cause
{Description of what is consuming memory and why}
### Evidence
- Top process: {process-name} (PID {pid}) using {mem} MB RSS/working set
- Linux OOM killer log: {relevant dmesg excerpt or "no OOM events found"}
- Windows pool exhaustion events: {Event ID 2019/2020 excerpt or "no pool exhaustion events found"}
- {Additional findings from commands above}
### Recommendation
{One of the following, with justification:}
- **No action needed** — transient spike, already resolving
- **Restart process** — {process} has a memory leak, safe to restart
- **Scale up VM** — workload legitimately needs more memory ({current-size} → {recommended-size})
- **Investigate application** — {process} is leaking memory, needs application-level profiling
- **Add or increase swap/pagefile** — workload has occasional spikes that Linux swap or the Windows pagefile can absorb
- **Adjust cgroup/container limits** — memory limits are too restrictive for the workload
- **Rollback recent change** — memory spike correlates with {change} at {time}
### Next Steps
{Specific actions the operator should take}
- If `dmesg` shows no OOM events but the issue is suspected, check `/var/log/syslog` or `/var/log/messages` for older entries.
- Do not judge memory pressure only by what `free` shows under "used", since Linux aggressively uses free memory for caching.
npx claudepluginhub raskip/azure-sre-agent-stuff --plugin vm-sre-skills