From cel-kdev
Live kernel debugging with drgn. Guides inspection of /proc/kcore, per-cpu variables, stack traces, slab caches, and data structure traversal. Covers correct API patterns, type introspection, container_of usage, and common pitfalls for SUNRPC, NFS, and AIO subsystems.
Slash command: /cel-kdev:drgn
Inspect running kernel state through `/proc/kcore`. Use for
diagnosing hangs, verifying data structure contents, tracing
reference counts, and examining queue states.
Always use -k for live kernel debugging:
sudo drgn -k
Do NOT use -c /proc/kcore -s <path> -e vmlinux. The -e
flag is parsed as inline Python, not as a vmlinux path. The
-k flag handles symbol resolution automatically.
For one-shot commands:
sudo drgn -k -e 'print(prog["jiffies"])'
For multi-line scripts, write to a temp file and execute:
cat > /tmp/drgn-script.py << 'PYEOF'
from drgn.helpers.linux.pid import find_task
task = find_task(prog, 1234)
print(task.comm)
PYEOF
sudo drgn -k /tmp/drgn-script.py
To create a typed object from a raw address:
from drgn import Object, cast
# From an address
obj = Object(prog, 'struct kioctx', address=addr)
# Cast an existing object to a different type
page = cast('struct page *', folio)
Do NOT use prog.object(type_=..., value=...) -- that API
does not exist. Use Object() from drgn directly.
Printing a pointer field dumps the entire target struct.
Use .value_() to get the raw address:
# BAD: dumps the whole struct rpc_clnt that tk_client points to
print(task.tk_client)
# GOOD: prints the pointer address
print(hex(task.tk_client.value_()))
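Discover struct members when field names have changed across kernel versions. The expression below assumes obj is a pointer, so obj.type_.type is the pointed-to struct type; for a struct object, use obj.type_.members instead:
members = [m.name for m in obj.type_.type.members if m.name]
print(members)
This is essential when a field has been renamed (e.g., ring_pages
became ring_folios in the AIO subsystem).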
When a struct is embedded inside another, recover the outer
struct with container_of rather than casting the inner
pointer directly:
from drgn import container_of
# xprt (a struct rpc_xprt *) is embedded in rpcrdma_xprt as rx_xprt
rdma_xprt = container_of(xprt, 'struct rpcrdma_xprt', 'rx_xprt')
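The address arithmetic behind container_of can be tried without a kernel. The following is a minimal sketch using Python's ctypes; Inner, Outer, and container_of_addr are hypothetical stand-ins invented for illustration, not the kernel's rpc_xprt or rpcrdma_xprt and not part of drgn. It shows that the outer address is the embedded member's address minus that member's offset.

```python
import ctypes

# Hypothetical stand-ins for an embedded struct and its container.
class Inner(ctypes.Structure):
    _fields_ = [("id", ctypes.c_int)]

class Outer(ctypes.Structure):
    _fields_ = [("flags", ctypes.c_long), ("inner", Inner)]

def container_of_addr(member_addr, outer_type, member_name):
    """Return the address of the outer struct that contains the member."""
    return member_addr - getattr(outer_type, member_name).offset

outer = Outer(flags=7, inner=Inner(id=42))

# Address of the embedded member, as a pointer to it would hold.
inner_addr = ctypes.addressof(outer) + Outer.inner.offset

# Walk back from the member to its container.
outer_addr = container_of_addr(inner_addr, Outer, "inner")
recovered = Outer.from_address(outer_addr)
print(recovered.flags, recovered.inner.id)
```

Interpreting inner_addr directly as an Outer would be off by Outer.inner.offset bytes; a direct cast is only correct when the member happens to sit at offset 0.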
Per-cpu variables require computing the actual address from the base symbol and the per-cpu offset for each CPU:
from drgn import Object
from drgn.helpers.linux.cpumask import for_each_online_cpu
from drgn.helpers.linux.percpu import per_cpu_ptr
# Method 1: per_cpu_ptr helper (preferred); it takes a pointer,
# so pass the address of the per-cpu variable
symbol = prog['runqueues']
for cpu in for_each_online_cpu(prog):
    rq = per_cpu_ptr(symbol.address_of_(), cpu)
    print(f"CPU {cpu}: nr_running={rq.nr_running.value_()}")
# Method 2: manual offset calculation
# base_addr is the raw per-cpu base address (for example, the
# value of a __percpu pointer)
offsets = prog['__per_cpu_offset']
for cpu in for_each_online_cpu(prog):
    addr = base_addr + offsets[cpu].value_()
    obj = Object(prog, 'struct kioctx_cpu', address=addr)
    print(f"CPU {cpu}: reqs_available={obj.reqs_available.value_()}")
To list the online CPUs (len() of the list gives their number):
from drgn.helpers.linux.cpumask import for_each_online_cpu
cpus = list(for_each_online_cpu(prog))
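The arithmetic in Method 2 can be checked in isolation. This is a small sketch with made-up numbers: per_cpu_address is a hypothetical helper, and the base and offset values are illustrative only. On a live system they would come from the per-cpu symbol's address and from prog['__per_cpu_offset'].

```python
def per_cpu_address(base_addr, per_cpu_offsets, cpu):
    """Address of one CPU's copy: the base plus that CPU's offset."""
    return base_addr + per_cpu_offsets[cpu]

# Illustrative values only, not read from any real kernel.
base = 0x1A2C0
offsets = [0xFFFF888237C00000, 0xFFFF888237D00000]

for cpu in range(len(offsets)):
    print(f"CPU {cpu}: {per_cpu_address(base, offsets, cpu):#x}")
```

Reading the structure at the bare base address, without adding the offset, does not read any CPU's actual copy, which matches the "Wrong per-cpu value" symptom in the table.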
To print the kernel stack trace of a task:
from drgn.helpers.linux.pid import find_task
# PID is a placeholder for the numeric process ID
task = find_task(prog, PID)
print(prog.stack_trace(task))
Use frame[name] to access local variables in stack frames.
Do NOT use frame.locals() for values -- it returns only the
variable names, not the variables themselves.
Variables may be <optimized out> -- the frame[name] call
succeeds but the value is absent and unusable. Check by printing
it before dereferencing fields. Wrap access in try/except:
trace = prog.stack_trace(task)
for frame in trace:
    try:
        ctx = frame['ctx']
        print(f"frame {frame}: ctx={ctx}")
    except (KeyError, LookupError):
        pass
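The loop above can be folded into a reusable helper. This is a sketch under assumptions: find_frame_variable is a hypothetical name, and the demonstration uses plain dicts as stand-ins for stack frames so it runs without drgn. It relies only on frame[name] raising a LookupError for unknown names and on drgn objects exposing an absent_ attribute for optimized-out values.

```python
def find_frame_variable(trace, name):
    """Return (frame_index, value) for the first frame with a usable `name`."""
    for index, frame in enumerate(trace):
        try:
            value = frame[name]
        except (KeyError, LookupError):
            continue  # this frame has no variable by that name
        if getattr(value, "absent_", False):
            continue  # known to the debug info but optimized out
        return index, value
    return None

# Stand-in frames (plain dicts); with drgn, pass prog.stack_trace(task).
fake_trace = [{"req": 10}, {"ctx": 20}, {"ctx": 30}]
print(find_frame_variable(fake_trace, "ctx"))
print(find_frame_variable(fake_trace, "missing"))
```

Returning the frame index along with the value makes it easy to report which frame the variable was found in.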
WARNING: slab iteration is extremely slow on live systems. Iterating all allocated objects in a slab can take minutes and may time out. Prefer targeted approaches first:
/sys/kernel/slab/<name>/ stats for object counts
Wrap the invocation in timeout (sudo timeout 60 drgn -k script.py)
for slab iterations or other operations that may take
unbounded time.
See references/helpers.md for slab iteration code patterns.
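When drgn is driven from another Python program, the timeout wrapper can be built programmatically. The following is a sketch, not part of the skill: drgn_command and run_drgn_script are hypothetical helper names, and the 124 exit status is the one GNU coreutils timeout uses when the limit expires.

```python
import subprocess

def drgn_command(script_path, timeout_s=60):
    """Build the argv for running a drgn script on the live kernel with a time limit."""
    return ["sudo", "timeout", str(timeout_s), "drgn", "-k", script_path]

def run_drgn_script(script_path, timeout_s=60):
    """Run the script and return its stdout; raise TimeoutError if the limit is hit."""
    result = subprocess.run(
        drgn_command(script_path, timeout_s),
        capture_output=True,
        text=True,
    )
    if result.returncode == 124:  # GNU timeout's "timed out" status
        raise TimeoutError(f"{script_path} exceeded {timeout_s}s")
    return result.stdout

# Show the command line without executing it.
print(" ".join(drgn_command("/tmp/drgn-script.py")))
```

Only drgn_command is exercised here; run_drgn_script needs root and a drgn installation.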
Struct layouts change across kernel versions. Always check member names before accessing unfamiliar fields:
members = [m.name for m in obj.type_.type.members
if m.name]
print(members)
atomic_t has .counter. atomic_long_t has .counter
directly (NOT .refs.counter). refcount_t wraps
atomic_t as .refs.counter. When in doubt, introspect:
# atomic_t / atomic_long_t
val = obj.counter.value_()
# refcount_t
val = obj.refs.counter.value_()
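The two access patterns can be combined into one helper that picks the right path. This is a sketch under assumptions: counter_value is a hypothetical name, and SimpleNamespace objects stand in for kernel objects so the example runs without drgn. It assumes that accessing a missing member raises AttributeError, which is what lets hasattr distinguish the cases.

```python
from types import SimpleNamespace

def counter_value(obj):
    """Read the integer behind an atomic_t, atomic_long_t, or refcount_t."""
    # refcount_t wraps an atomic_t named refs; the atomic types
    # expose counter directly.
    inner = obj.refs if hasattr(obj, "refs") else obj
    raw = inner.counter
    # drgn objects convert with value_(); the stand-ins hold plain ints.
    return raw.value_() if hasattr(raw, "value_") else raw

# Stand-ins shaped like the kernel types.
atomic = SimpleNamespace(counter=3)
refcount = SimpleNamespace(refs=SimpleNamespace(counter=1))
print(counter_value(atomic), counter_value(refcount))
```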
Every drgn invocation, /sys/kernel/slab/ read, and
/proc/kcore access requires root. Do not attempt
unprivileged access first -- it wastes a round-trip.
When iterating tasks or slab objects, filter inside the drgn script rather than dumping everything and scanning the output. Large unfiltered dumps consume context and often time out:
from drgn.helpers.linux.pid import for_each_task
# BAD: dump all tasks
for task in for_each_task(prog):
    print(task.pid.value_(), task.comm.string_())
# GOOD: filter to what matters
for task in for_each_task(prog):
    if task.comm.string_() == b'fio':
        # the field is named __state on kernels 5.14 and later
        print(task.pid.value_(), hex(task.state.value_()))
Separate scripts for "discover the struct layout" and "read the values" cost two drgn invocations and double the startup overhead. Probe the layout and act on it in one script:
members = {m.name for m in obj.type_.type.members if m.name}
if 'ring_folios' in members:
    folio = obj.ring_folios[0]
elif 'ring_pages' in members:
    folio = obj.ring_pages[0]
print(folio)
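When several fields may have been renamed, the if/elif chain can be generalized. The following is a sketch: first_present is a hypothetical helper, and the two member sets are stand-ins for the set built from obj.type_.type.members, so the example runs without drgn.

```python
def first_present(members, candidates):
    """Return the first candidate field name that exists in members."""
    for name in candidates:
        if name in members:
            return name
    raise LookupError(
        f"none of {candidates} found; members are {sorted(members)}"
    )

# Stand-in member sets for a newer and an older struct layout.
newer_layout = {"ring_folios", "nr_pages", "mmap_base"}
older_layout = {"ring_pages", "nr_pages", "mmap_base"}

for layout in (newer_layout, older_layout):
    print(first_present(layout, ["ring_folios", "ring_pages"]))
```

In a drgn script the chosen name could then be passed to obj.member_(name) to fetch the field. Listing the newest name first keeps the common case fast, and the LookupError message reports the actual members when nothing matches.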
| Symptom | Cause | Fix |
|---|---|---|
| NameError: vmlinux | -e vmlinux parsed as Python | Use drgn -k |
| AttributeError: ring_pages | Field renamed | Introspect members first |
| TypeError: prog.object(value=) | Wrong API | Use Object(prog, type, address=) |
| FaultError on cast | Embedded struct | Use container_of() |
| Huge output on print | Printing struct pointer | Use .value_() for address |
| frame.locals() confusion | Returns names, not values | Use frame[name] directly |
| Wrong per-cpu value | Missing offset | Add __per_cpu_offset[cpu] |
| folio_address import error | Not in drgn helpers | Cast to page, use page_to_virt |
| .refs.counter on atomic_long_t | Wrong atomic type | Use .counter directly |
| slab iteration hangs | Too many objects | Use timeout, prefer targeted lookup |
| list_empty(array[i]) fails | Need pointer, not value | Use array[i].address_of_() |
npx claudepluginhub chucklever/cel-kdev --plugin cel-kdev