Investigation procedure for disk IOPS and throughput throttling on Azure Virtual Machines. Covers both Linux and Windows VMs, Premium SSD, Standard SSD, and Ultra Disk configurations. Use when a VM shows slow disk performance, high IO wait, disk latency spikes, IOPS or throughput throttling, or when an application is slow but CPU and memory appear normal.
Slash command: `/vm-sre-skills:disk-iops-throttling`
> ⚠️ **Example skill** — designed as a starting point. Test and customize for your environment.
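Use this skill when:
- A VM shows slow disk performance, high IO wait, or disk latency spikes
- IOPS or throughput throttling is suspected
- An application is slow but CPU and memory appear normal

This skill guides you through a structured investigation:
1. Get the VM's size, OS type, and attached disks
2. Retrieve the performance limits of each disk
3. Look up the VM-level cached and uncached limits for the VM size
4. Compare actual usage from Azure Monitor against those limits
5. Collect in-guest IO statistics (Linux or Windows)
6. Locate the throttling with the decision tree and write the report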
Step 1: Get the VM's size, OS type, and all attached disks:
az vm show --resource-group {rg} --name {vm-name} --query "{size:hardwareProfile.vmSize, os:storageProfile.osDisk.osType, osDisk:{name:storageProfile.osDisk.name, caching:storageProfile.osDisk.caching}, dataDisks:storageProfile.dataDisks[].{name:name, lun:lun, sizeGB:diskSizeGb, caching:caching}}" -o json
Note the VM size (needed for Step 3), OS type (determines commands in Step 5), and caching settings for each disk.
Step 2: For each disk identified in Step 1, retrieve its performance limits:
az disk show --resource-group {rg} --name {disk-name} --query "{name:name, sizeGB:diskSizeGb, sku:sku.name, tier:tier, iops:diskIOPSReadWrite, throughputMBps:diskMBpsReadWrite, burstingEnabled:burstingEnabled}" -o json
Run this for the OS disk and every data disk. Record the iops and throughputMBps values — these are the per-disk limits you will compare against actual usage in Step 4.
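The comparison against actual usage comes down to a percentage of the recorded limit. A minimal sketch of that arithmetic, assuming a hypothetical helper named `pct_of_limit` (not part of the Azure CLI) and illustrative numbers rather than real measurements:

```shell
# pct_of_limit OBSERVED LIMIT
# Prints OBSERVED as a percentage of LIMIT, rounded to one decimal place.
# Hypothetical helper for illustration; it is not an Azure CLI command.
pct_of_limit() {
  awk -v observed="$1" -v limit="$2" 'BEGIN {
    if (limit + 0 <= 0) { print "n/a"; exit }   # guard against a missing limit
    printf "%.1f\n", observed * 100 / limit
  }'
}

# Illustrative values: a disk peaking at 4,600 IOPS against a 5,000 IOPS limit.
pct_of_limit 4600 5000
```

The same helper works for throughput if both arguments are in MBps.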
Step 3: Query the VM size capabilities for the VM's location:
az vm list-sizes --location {location} --query "[?name=='{vm-size}']" -o json
Note: The `az vm list-sizes` output includes `maxDataDiskCount` and resource limits, but does not include the VM-level cached/uncached IOPS and throughput caps. These limits are critical for diagnosing VM-level throttling. Refer to the following common limits or consult the Azure VM sizes documentation:
| VM Size | Uncached IOPS | Uncached Throughput (MBps) | Cached IOPS | Cached Throughput (MBps) |
|---|---|---|---|---|
| Standard_D2s_v3 | 3,200 | 48 | 4,000 | 100 |
| Standard_D4s_v3 | 6,400 | 96 | 8,000 | 200 |
| Standard_D8s_v3 | 12,800 | 192 | 16,000 | 400 |
| Standard_D16s_v3 | 25,600 | 384 | 32,000 | 800 |
| Standard_E4s_v3 | 6,400 | 96 | 8,000 | 200 |
| Standard_E8s_v3 | 12,800 | 192 | 16,000 | 400 |
| Standard_L8s_v2 | 6,400 | 160 | 8,000 | 200 |
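The VM-level cap matters because it bounds the combined IO of all attached disks. As a rough sketch, assuming all data disks are uncached and using a hypothetical helper named `effective_iops_cap`, the binding limit is the lower of the VM cap and the sum of the disk limits:

```shell
# effective_iops_cap VM_UNCACHED_IOPS DISK_IOPS...
# Prints the lower of the VM cap and the summed disk limits, and which one binds.
# Hypothetical helper; assumes every listed disk has host caching set to None.
effective_iops_cap() {
  vm_cap="$1"; shift
  total=0
  for disk_iops in "$@"; do
    total=$((total + disk_iops))
  done
  if [ "$total" -gt "$vm_cap" ]; then
    echo "$vm_cap IOPS (VM-level cap binds; disks total $total)"
  else
    echo "$total IOPS (disk-level limits bind; VM cap $vm_cap)"
  fi
}

# Standard_D4s_v3 (6,400 uncached IOPS) with two disks of 5,000 IOPS each.
effective_iops_cap 6400 5000 5000
```

In this example, upgrading the disks would not help, because the VM size is the bottleneck.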
Step 4: Query Azure Monitor for actual usage over the past hour.
Disk-level IOPS (run for each disk):
az monitor metrics list --resource {disk-resource-id} --metric "Composite Disk Read Operations/sec" "Composite Disk Write Operations/sec" --aggregation Average Maximum --interval PT5M --start-time {1-hour-ago}
Disk-level throughput (run for each disk):
az monitor metrics list --resource {disk-resource-id} --metric "Composite Disk Read Bytes/sec" "Composite Disk Write Bytes/sec" --aggregation Average Maximum --interval PT5M --start-time {1-hour-ago}
VM-level IOPS consumed percentage:
az monitor metrics list --resource {vm-resource-id} --metric "VM Cached IOPS Consumed Percentage" "VM Uncached IOPS Consumed Percentage" --aggregation Average Maximum --interval PT5M --start-time {1-hour-ago}
VM-level bandwidth consumed percentage:
az monitor metrics list --resource {vm-resource-id} --metric "VM Cached Bandwidth Consumed Percentage" "VM Uncached Bandwidth Consumed Percentage" --aggregation Average Maximum --interval PT5M --start-time {1-hour-ago}
If consumed percentage metrics are consistently above 90%, throttling is occurring at that level.
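The `{1-hour-ago}` placeholder in the metric queries needs a concrete timestamp. One way to produce it is sketched below, assuming GNU `date` (as found on most Linux distributions) and a hypothetical helper name `one_hour_ago`:

```shell
# Print the UTC timestamp for one hour ago in ISO 8601 form.
# Assumes GNU date; the -d option behaves differently on BSD/macOS date.
one_hour_ago() {
  date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ'
}

# Substitute the result for {1-hour-ago} in the az monitor metrics list commands.
START_TIME="$(one_hour_ago)"
echo "--start-time $START_TIME"
```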
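Step 5: Collect in-guest disk statistics using the commands for the OS type noted in Step 1.
Important: `az vm run-command invoke` is a write operation and requires the `RunAzCliWriteCommands` tool, not `RunAzCliReadCommands`.
On Linux VMs, collect IO statistics and IO wait: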
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunShellScript --scripts "iostat -xdm 1 3 2>/dev/null || cat /proc/diskstats && echo '---IOWAIT---' && vmstat 1 3"
Key indicators:
- `wa` column in vmstat (%iowait): high values (>20%) indicate processes are blocked waiting for disk IO
- `await` in iostat (`r_await`/`w_await` in newer sysstat versions): average time (ms) for IO requests to complete; values >20 ms on SSD suggest throttling
- `avgqu-sz` in iostat (`aqu-sz` in newer sysstat versions): average queue depth; high values mean requests are queuing behind the throttle
- `%util` in iostat: values near 100% indicate the disk is saturated

On Windows VMs, collect disk performance counters:
az vm run-command invoke --resource-group {rg} --name {vm-name} --command-id RunPowerShellScript --scripts "Get-Counter '\PhysicalDisk(*)\Avg. Disk Queue Length', '\PhysicalDisk(*)\Disk Reads/sec', '\PhysicalDisk(*)\Disk Writes/sec', '\PhysicalDisk(*)\Avg. Disk sec/Read', '\PhysicalDisk(*)\Avg. Disk sec/Write' -SampleInterval 1 -MaxSamples 3 | Select-Object -ExpandProperty CounterSamples | Format-Table InstanceName, Path, CookedValue -AutoSize"
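Key indicators:
- `Avg. Disk sec/Read` and `Avg. Disk sec/Write`: average latency in seconds; values above 0.020 (20 ms) on SSD suggest throttling
- `Avg. Disk Queue Length`: sustained high values mean requests are queuing behind the throttle
- `Disk Reads/sec` and `Disk Writes/sec`: their sum is the disk's current IOPS; compare it with the limits from Steps 2 and 3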
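The two operating systems report latency in different units: iostat reports `await` in milliseconds, while the Windows `Avg. Disk sec/...` counters report seconds. A small sketch for normalizing both against the 20 ms threshold used above, with a hypothetical helper named `latency_status` and made-up sample values:

```shell
# latency_status VALUE UNIT
# UNIT is "ms" (iostat await) or "s" (Windows Avg. Disk sec/Read or sec/Write).
# Prints the latency in milliseconds, then HIGH if it exceeds 20 ms, else OK.
latency_status() {
  awk -v value="$1" -v unit="$2" 'BEGIN {
    ms = (unit == "s") ? value * 1000 : value + 0
    printf "%.1f ms %s\n", ms, (ms > 20 ? "HIGH" : "OK")
  }'
}

latency_status 0.035 s   # illustrative Windows counter sample
latency_status 4.2 ms    # illustrative iostat await sample
```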
Use this decision tree to identify where throttling is occurring:
Is disk-level IOPS near disk tier limit?
├── YES → Disk-level IOPS throttling
│ → Upgrade disk SKU (e.g., P20 → P30) or enable on-demand bursting
│
Is disk-level throughput near disk tier limit?
├── YES → Disk-level throughput throttling
│ → Upgrade disk SKU or switch to Premium SSD v2 / Ultra Disk
│
Is VM Cached IOPS Consumed % > 90%?
├── YES → VM-level cached IOPS throttling
│ → Resize VM to a larger SKU with higher cached IOPS limits
│
Is VM Uncached IOPS Consumed % > 90%?
├── YES → VM-level uncached IOPS throttling
│ → Resize VM or move workloads to cached disks (ReadOnly caching)
│
Is VM Cached/Uncached Bandwidth Consumed % > 90%?
├── YES → VM-level throughput throttling
│ → Resize VM to a larger SKU with higher throughput limits
│
None of the above near limits?
└── Issue is not IOPS/throughput throttling
→ Investigate network, application-level locks, or other bottlenecks
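The decision tree above can be sketched as a small function. This is an illustration under stated assumptions, not part of the skill: the name `classify_throttling` and the `THRESHOLD` variable are hypothetical, and "near the disk tier limit" is interpreted as the same 90% threshold used for the VM-level metrics.

```shell
# classify_throttling DISK_IOPS_PCT DISK_MBPS_PCT VM_CACHED_IOPS_PCT \
#                     VM_UNCACHED_IOPS_PCT VM_CACHED_BW_PCT VM_UNCACHED_BW_PCT
# Arguments are peak utilization percentages. Prints one line per finding.
# THRESHOLD (default 90) is an assumed cut-off for "near the limit".
classify_throttling() {
  awk -v t="${THRESHOLD:-90}" \
      -v disk_iops="$1" -v disk_mbps="$2" \
      -v cached_iops="$3" -v uncached_iops="$4" \
      -v cached_bw="$5" -v uncached_bw="$6" 'BEGIN {
    t += 0; hits = 0
    if (disk_iops + 0 > t)     { print "disk-level IOPS throttling"; hits++ }
    if (disk_mbps + 0 > t)     { print "disk-level throughput throttling"; hits++ }
    if (cached_iops + 0 > t)   { print "VM-level cached IOPS throttling"; hits++ }
    if (uncached_iops + 0 > t) { print "VM-level uncached IOPS throttling"; hits++ }
    if (cached_bw + 0 > t || uncached_bw + 0 > t) {
      print "VM-level throughput throttling"; hits++
    }
    if (hits == 0) print "no IOPS/throughput throttling detected"
  }'
}

# Illustrative input: disk IOPS at 97% and VM uncached IOPS at 92%.
classify_throttling 97 60 35 92 40 55
```

Each branch is checked independently, so disk-level and VM-level throttling can both be reported for the same VM.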
Enable ReadOnly caching on data disks with read-heavy workloads. This uses the VM's local SSD as a read cache and can dramatically reduce throttling on the managed disk.

Premium SSD limits by disk SKU:

| Disk SKU | Size (GiB) | Provisioned IOPS | Provisioned Throughput (MBps) | Burst IOPS | Burst Throughput (MBps) |
|---|---|---|---|---|---|
| P10 | 128 | 500 | 100 | 3,500 | 170 |
| P15 | 256 | 1,100 | 125 | 3,500 | 170 |
| P20 | 512 | 2,300 | 150 | 3,500 | 170 |
| P30 | 1,024 | 5,000 | 200 | 30,000 | 1,000 |
| P40 | 2,048 | 7,500 | 250 | 30,000 | 1,000 |
| P50 | 4,096 | 7,500 | 250 | 30,000 | 1,000 |
| P60 | 8,192 | 16,000 | 500 | 30,000 | 1,000 |
| P70 | 16,384 | 18,000 | 750 | 30,000 | 1,000 |
| P80 | 32,767 | 20,000 | 900 | 30,000 | 1,000 |
Note: For P20 and smaller (≤512 GiB), burst values apply through credit-based bursting, which is enabled by default. For P30 and above, burst values apply only with on-demand bursting enabled.
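When the recommendation is a disk SKU upgrade, the table can be searched for the smallest SKU whose provisioned limits cover the observed peak. A minimal sketch, assuming a hypothetical helper named `smallest_premium_sku` and copying the provisioned (non-burst) values from the table above:

```shell
# smallest_premium_sku REQUIRED_IOPS REQUIRED_MBPS
# Prints the first SKU whose provisioned IOPS and throughput cover both values,
# or "none" if no listed SKU is large enough. Hypothetical helper.
smallest_premium_sku() {
  awk -v need_iops="$1" -v need_mbps="$2" '
    $2 >= need_iops + 0 && $3 >= need_mbps + 0 { print $1; found = 1; exit }
    END { if (!found) print "none" }
  ' <<'EOF'
P10 500 100
P15 1100 125
P20 2300 150
P30 5000 200
P40 7500 250
P50 7500 250
P60 16000 500
P70 18000 750
P80 20000 900
EOF
}

# Illustrative peak: 4,000 IOPS and 180 MBps.
smallest_premium_sku 4000 180
```

A result of "none" suggests looking at striping, Premium SSD v2, or Ultra Disk instead of a larger Premium SSD.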
After gathering evidence, produce a report in this format:
## Disk IOPS / Throughput Throttling Investigation Report
**VM**: {vm-name}
**Resource Group**: {rg}
**OS**: {Linux/Windows}
**VM Size**: {size}
**Location**: {location}
**Investigation Time**: {timestamp}
### Disk Inventory
| Disk Name | Type | SKU | Size (GiB) | Max IOPS | Max Throughput (MBps) | Caching | Bursting |
|-----------|------|-----|-----------|----------|----------------------|---------|----------|
| {disk} | OS | {sku} | {size} | {iops} | {throughput} | {cache} | {yes/no} |
| {disk} | Data | {sku} | {size} | {iops} | {throughput} | {cache} | {yes/no} |
### VM-Level Limits
| Metric | Limit | Current (Avg) | Current (Max) | Status |
|--------|-------|---------------|---------------|--------|
| Cached IOPS | {limit} | {avg}% | {max}% | {OK/THROTTLED} |
| Uncached IOPS | {limit} | {avg}% | {max}% | {OK/THROTTLED} |
| Cached Bandwidth | {limit} | {avg}% | {max}% | {OK/THROTTLED} |
| Uncached Bandwidth | {limit} | {avg}% | {max}% | {OK/THROTTLED} |
### Per-Disk Analysis
| Disk Name | Metric | Current (Avg) | Current (Max) | Disk Limit | % Used | Status |
|-----------|--------|---------------|---------------|------------|--------|--------|
| {disk} | IOPS | {avg} | {max} | {limit} | {pct}% | {OK/THROTTLED} |
| {disk} | Throughput | {avg} MBps | {max} MBps | {limit} | {pct}% | {OK/THROTTLED} |
### Throttling Source
{disk-level / VM-level / both / none detected}
### Root Cause
{Description of what is causing the throttling and which workload is driving IO}
### Recommendation
{One or more of the following, with justification:}
- **No action needed** — IO is within limits, investigate elsewhere
- **Upgrade disk SKU** — {disk-name} is hitting {current-sku} limits → upgrade to {recommended-sku}
- **Enable bursting** — enable on-demand bursting on {disk-name} for intermittent spikes
- **Resize VM** — VM-level throttling at {percentage}% → resize from {current-size} to {recommended-size}
- **Enable caching** — add ReadOnly caching on {disk-name} for read-heavy workload
- **Switch to Premium SSD v2 / Ultra Disk** — workload needs dynamically adjustable IOPS/throughput
- **Stripe disks** — aggregate IOPS across multiple disks using LVM or Storage Spaces
### Next Steps
{Specific actions the operator should take}