From mtv-skills
Check Ceph storage health on OpenShift OCS/ODF clusters. Use when PVCs are stuck in Pending, storage provisioning fails, Ceph is degraded, OSDs are full, or cluster storage needs diagnosis.
Slash command: /mtv-skills:check-ceph-health
Use this guide to diagnose and remediate Ceph storage issues on OpenShift clusters running OCS/ODF (OpenShift Data Foundation).
This skill requires:
oc metrics (kubectl-metrics) -- for Ceph metrics (health, capacity, OSD, PG)
oc debug-queries (kubectl-debug-queries) -- for listing resources, logs, events
If any tool is missing, install with:
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-metrics/main/install.sh | bash
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-debug-queries/main/install.sh | bash
--query for Filtering
Use --query to filter, sort, and project results server-side. Pipe output to jq, grep, or other post-processing tools only when --query cannot express what you need.
The --query flag accepts TSL (Tree Search Language) with four optional clauses:
[select <field>, ...] [where <condition>] [order by <field> [asc|desc]] [limit N]
Note: select only affects table output (the default). With --output json, all fields are always returned regardless of select.
Operators: =, !=, <, >, <=, >=, like (% wildcard), ilike (case-insensitive), ~= (regex), and, or, not, in [...], between X and Y.
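The four clauses always appear in the same order, so a query string can be assembled from optional parts. The following is a minimal sketch; build_tsl_query is a hypothetical helper name, not part of oc debug-queries, and it only concatenates text without validating field names or operators.

```shell
# build_tsl_query is a hypothetical helper (not shipped with oc debug-queries).
# Arguments, each of which may be empty:
#   $1 select fields, $2 where condition, $3 order-by expression, $4 limit
build_tsl_query() {
  q=""
  if [ -n "${1:-}" ]; then q="select $1"; fi
  if [ -n "${2:-}" ]; then q="${q:+$q }where $2"; fi
  if [ -n "${3:-}" ]; then q="${q:+$q }order by $3"; fi
  if [ -n "${4:-}" ]; then q="${q:+$q }limit $4"; fi
  printf '%s\n' "$q"
}
```

The printed string can then be passed as the --query value, for example with an empty select and order-by and only a where condition and a limit.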
Before writing queries, discover actual field names with --output json:
oc debug-queries list --resource pods --namespace openshift-storage --limit 2 --output json
oc metrics query --query "ceph_health_status"
Health values: 0=OK, 1=WARN, 2=ERR.
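When the metric value is consumed by a script, a small lookup keeps the numeric codes readable. This is a sketch only; ceph_health_label is a hypothetical function name, and it assumes the value has already been extracted from the query output as a bare integer.

```shell
# ceph_health_label is a hypothetical helper that maps the numeric
# ceph_health_status value to the matching Ceph health state name.
ceph_health_label() {
  case "${1:-}" in
    0) echo "HEALTH_OK" ;;
    1) echo "HEALTH_WARN" ;;
    2) echo "HEALTH_ERR" ;;
    *) echo "UNKNOWN"; return 1 ;;  # anything else is not a documented value
  esac
}
```

The non-zero return for unexpected input lets a calling script stop instead of treating a parsing failure as a healthy cluster.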
oc metrics query --query "ceph_cluster_total_bytes"
oc metrics query --query "ceph_cluster_total_used_bytes"
oc metrics query --query "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100"
oc debug-queries get --resource cephcluster --namespace openshift-storage --output json
Health states:
HEALTH_OK -- cluster is healthy
HEALTH_WARN -- degraded but functional (backfillfull, nearfull, degraded PGs)
HEALTH_ERR -- critical, writes may be blocked (full OSDs, too few OSDs, down PGs)
oc metrics query --query "ceph_osd_stat_bytes"
oc metrics query --query "ceph_osd_stat_bytes_used"
oc metrics query --query "rate(ceph_osd_op_latency_sum[5m]) / rate(ceph_osd_op_latency_count[5m])"
oc debug-queries list --resource pods --namespace openshift-storage --selector "app=rook-ceph-osd"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*osd-prepare.*'"
oc debug-queries list --resource pvc --namespace openshift-storage --selector "app=rook-ceph-osd"
oc metrics query --query "ceph_pg_total"
oc metrics query --query "ceph_pg_active"
oc metrics query --query "ceph_pg_degraded"
oc metrics query --query "ceph_pool_percent_used * 100"
oc metrics query --query "rate(ceph_pool_rd[5m])"
oc metrics query --query "rate(ceph_pool_wr[5m])"
oc metrics query --query "ceph_pool_stored"
oc metrics query --query "ceph_pool_max_avail"
PVC provisioning is handled by CSI driver pods. If these are unhealthy, no volumes can be created.
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*rbd.*ctrlplugin.*'"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*cephfs.*ctrlplugin.*'"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*rbd.*nodeplugin.*'"
Check CSI provisioner logs:
oc debug-queries logs --name <rbd-ctrlplugin-pod> --namespace openshift-storage --container csi-rbdplugin --tail 50
oc debug-queries list --resource pvc --all-namespaces --query "where status.phase = 'Pending'"
oc debug-queries get --resource pvc --name <pvc-name> --namespace <namespace>
oc debug-queries list --resource pv --all-namespaces --query "where status.phase = 'Released'"
oc debug-queries list --resource storageclass --all-namespaces
Symptoms: PVCs stuck in Pending, provisioning errors with DeadlineExceeded or operation already exists.
Diagnosis:
oc metrics query --query "ceph_health_status"
oc metrics query --query "ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100"
oc debug-queries get --resource cephcluster --namespace openshift-storage --output json
Look for OSD_FULL and POOL_FULL messages in the CephCluster status.
Remediation: See "Requires Shell" section below for oc delete pv and ceph osd set-full-ratio.
Symptoms: Cluster functional but approaching full. Warnings about nearfull or backfillfull OSDs.
Remediation:
Symptoms: HEALTH_WARN with messages about degraded or undersized placement groups.
Diagnosis:
oc metrics query --query "ceph_pg_degraded"
oc metrics query --query "ceph_pg_total - ceph_pg_active"
oc debug-queries events --namespace openshift-storage --query "where type = 'Warning'"
Remediation:
Symptoms: PVC events say "waiting for external provisioner" but no ProvisioningFailed errors.
Diagnosis:
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*ctrlplugin.*'"
oc debug-queries logs --name <rbd-ctrlplugin-pod> --namespace openshift-storage --container csi-rbdplugin --tail 100
Remediation:
Symptoms: POOL_FULL warning but individual OSDs have space.
Diagnosis:
oc metrics query --query "ceph_pool_percent_used * 100"
oc metrics query --query "ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100"
Remediation:
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*ocs-operator.*|.*odf-operator.*|.*rook-ceph-operator.*'"
oc debug-queries logs --name deployment/rook-ceph-operator --namespace openshift-storage --tail 50
Check for high restart counts:
oc metrics query --query "topk(10, kube_pod_container_status_restarts_total)"
Run these periodically to avoid surprise outages:
oc metrics query --query "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100"
oc debug-queries list --resource pv --all-namespaces --query "where status.phase = 'Released'"
oc debug-queries list --resource pvc --all-namespaces --query "where status.phase = 'Pending'"
Act when usage exceeds 70% -- start cleaning up or expanding capacity before hitting the 85% full threshold.
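The two thresholds can be turned into a simple classification for a periodic check. This is a sketch under stated assumptions: usage_action is a hypothetical function name, the labels ok, act, and full are illustrative, and the input is assumed to be the usage percentage already read from the capacity query.

```shell
# usage_action is a hypothetical helper. It classifies a usage percentage
# against the 70% action point and the 85% full threshold named above.
usage_action() {
  awk -v p="${1:-0}" 'BEGIN {
    if (p + 0 >= 85)      print "full";  # at or past the full threshold
    else if (p + 0 > 70)  print "act";   # clean up or expand capacity now
    else                  print "ok";
  }'
}
```

awk is used for the comparison because the percentage is usually fractional, which plain shell integer tests cannot compare.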
These remediation operations require shell access:
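# List Released PVs with JSONPath; oc get pv may reject a status.phase field selector
oc get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'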
oc delete pv <released-pv-names>
MON_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}')
MON_ADDR=$(oc -n openshift-storage get pod $MON_POD -o jsonpath='{.spec.containers[0].env[?(@.name=="ROOK_CEPH_MON_HOST")].value}' | sed 's/\[//;s/\]//')
# Raise to 0.92 to unblock writes temporarily
oc -n openshift-storage exec $MON_POD -c mon -- \
ceph -m $MON_ADDR --keyring /etc/ceph/keyring-store/keyring \
osd set-full-ratio 0.92
# After space is freed, reset to default
oc -n openshift-storage exec $MON_POD -c mon -- \
ceph -m $MON_ADDR --keyring /etc/ceph/keyring-store/keyring \
osd set-full-ratio 0.85
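Because a mistyped ratio changes when the cluster blocks writes, it can help to check the value before passing it to set-full-ratio. The following is a minimal sketch; valid_full_ratio is a hypothetical helper, and the rule it applies (a decimal fraction strictly between 0 and 1) is an assumption for illustration, not a limit taken from Ceph documentation.

```shell
# valid_full_ratio is a hypothetical guard. It succeeds only for a decimal
# fraction strictly between 0 and 1, such as 0.85 or 0.92.
valid_full_ratio() {
  awk -v r="${1:-}" 'BEGIN {
    ok = (r ~ /^0?\.[0-9]+$/) && (r + 0 > 0) && (r + 0 < 1)
    if (ok) exit 0
    exit 1
  }'
}
```

A script could call valid_full_ratio first and run the oc exec command only when it succeeds.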
When you need to discover available flags or verify syntax:
oc debug-queries list --help
oc debug-queries logs --help
oc metrics query --help
npx claudepluginhub kubev2v/mtv-skills --plugin mtv-skills