hpc-netcdf-parallelization

Make large gridded/meteorological array computes fast and portable on ANY HPC. Detects the cluster's backends (netCDF4, HDF5, zarr, NCZarr, PnetCDF), MPI/parallel-HDF5, filesystem (NFS/Lustre/GPFS), and core/NUMA/RAM layout, then recommends + applies the optimal chunking, read pattern, storage backend, and worker count. 1 skill + 3 commands + 4 CLI/JSON tools.

HPC netCDF Parallelization — for Claude Code

A Claude Code skill that makes large gridded (climate / meteorological) array computes fast and portable on any HPC by fixing the two things that actually slow them down: I/O (chunking + read patterns) and parallelism (worker sizing + load balance). It detects your cluster's available backends and hardware, then recommends and applies the right strategy — so the same workflow works on NFS laptops, NUMA fat-nodes, and Lustre/MPI clusters alike.

The core lesson: a "slow parallel compute" is usually slow serial I/O (redundant decompression), not slow math. This skill makes you profile I/O before blaming compute or adding cores. In the case that motivated it, a 3.5-hour "compute" was 99% netCDF decompression; a contiguous-slab read was 77× faster, and adding cores would have made it worse.
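
One way to act on that lesson is to time the read and the math separately before touching the worker count. The sketch below is only an illustration of the idea, not code from this skill: `stage`, `load_block`, and `compute` are hypothetical names, and `load_block` stands in for whatever netCDF read your pipeline performs.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

import numpy as np

timings = defaultdict(float)  # stage name -> accumulated wall-clock seconds


@contextmanager
def stage(name):
    """Add the time spent inside the `with` block to timings[name]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - t0


def load_block(i):
    # Hypothetical stand-in for a netCDF read; swap in your real read call.
    return np.full((64, 64), float(i))


def compute(block):
    # Hypothetical stand-in for the per-block math.
    return float(block.mean())


results = []
for i in range(10):
    with stage("io"):
        block = load_block(i)
    with stage("compute"):
        results.append(compute(block))

total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name:8s} {seconds:.6f} s  ({100 * seconds / total:.1f}%)")
```

If the `io` share dominates, changing the read pattern or chunking is likely to pay off more than adding workers.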

Install (Claude Code, v3.7.0+)

/plugin marketplace add yanxingjianken/hpc-netcdf-parallelization
/plugin install hpc-netcdf-parallelization

Then run /hpc-detect on your cluster, or just describe a slow netCDF/parallel job and the skill activates. (Traditional flow: clone and symlink the skill dir into ~/.claude/skills/ or your agent's skills path.)

What it does

Detects the environment — detect_env.py probes netCDF4 / HDF5 / zarr / NCZarr / PnetCDF, MPI + parallel-HDF5, the filesystem (NFS/Lustre/GPFS), and cores / NUMA / RAM, then prints a JSON recommendation (write backend, worker count for memory- vs CPU-bound work, NUMA + multi-node path).
Fixes I/O — contiguous-slab reads instead of scattered isel(time=[…]); access-pattern chunking on write (time=1); data-preserving rechunking of existing files; backend choice.
Parallelizes correctly — fine-grained load-balanced pmap (process or Dask), combinable per-pixel (sum, valid-count) accumulators for correct means, BLAS pinning, NUMA pinning, and worker counts sized to the memory-bandwidth knee (not the core count).
Detaches long jobs so they survive an editor/SSH disconnect.
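
The contiguous-slab idea from the I/O item can be sketched in a few lines: read one contiguous range that covers the wanted time steps, then pick the steps in memory. This is an assumption-laden illustration, not the bundled `read_slab.py`; `slab_bounds` and `read_via_slab` are hypothetical names, and a NumPy array stands in for a netCDF variable.

```python
import numpy as np


def slab_bounds(indices):
    """Return (start, stop) of the smallest contiguous range covering `indices`."""
    idx = np.asarray(indices)
    return int(idx.min()), int(idx.max()) + 1


def read_via_slab(var, time_indices):
    """Read one contiguous slab along axis 0, then select the wanted steps in memory.

    `var` is anything that supports basic slicing along its first axis
    (a NumPy array here; an on-disk variable in a real pipeline).
    """
    idx = np.asarray(time_indices)
    start, stop = slab_bounds(idx)
    slab = np.asarray(var[start:stop])  # a single contiguous read
    return slab[idx - start]            # fancy indexing happens in RAM


# Toy data: 100 time steps of a 4 x 5 grid.
data = np.arange(100 * 4 * 5).reshape(100, 4, 5)
wanted = [10, 12, 17, 11]
picked = read_via_slab(data, wanted)
print(picked.shape)
```

The trade-off is extra bytes read when the wanted steps are far apart, so it is worth benchmarking both patterns on your own file, as the `bench` mode of `read_slab.py` is described as doing.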

Bundled tools (CLI + JSON → MCP/agent-friendly)

Tool	Purpose
`scripts/detect_env.py`	Environment + backend detection → JSON recommendation (`--human` for a table)
`scripts/read_slab.py`	Contiguous-slab reader + `bench` (scattered vs slab on your file)
`scripts/rechunk.py`	Data-preserving rechunk to the access pattern (`nccopy`/xarray, `--verify`)
`scripts/parallel_map.py`	Load-balanced `pmap` + combinable-mean `add_accumulators`, BLAS-pinned
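
To give a feel for what environment detection involves, here is a minimal probe using only the standard library. It is a sketch under stated assumptions and does not reproduce `detect_env.py` or its JSON schema: the package list and the output keys (`packages`, `logical_cpus`) are chosen here for illustration.

```python
import importlib.util
import json
import os


def probe_environment():
    """Report which optional I/O packages are importable and how many CPUs are visible."""
    # Assumed package list; the real tool may check different or additional backends.
    packages = ["netCDF4", "h5py", "zarr", "dask", "mpi4py"]
    available = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }
    return {"packages": available, "logical_cpus": os.cpu_count()}


report = probe_environment()
print(json.dumps(report, indent=2))
```

A probe like this only shows what is importable; it says nothing about the filesystem, NUMA layout, or whether HDF5 was built with parallel support, which the bundled tool is described as checking.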

Commands

/hpc-detect — report the cluster's backends + recommendation.
/hpc-rechunk — benchmark + rechunk file(s) to fast access-pattern chunks.
/hpc-parallelize — diagnose a slow pipeline and apply detect → fix-I/O → parallelize.

Quick start (standalone, no plugin)

PY=python   # any env with xarray + netCDF4 (zarr/dask optional)
$PY hpc-netcdf-parallelization/scripts/detect_env.py --human
$PY hpc-netcdf-parallelization/scripts/read_slab.py bench myfile.nc --vars t u v --n 30
$PY hpc-netcdf-parallelization/scripts/rechunk.py in.nc out.nc --chunks "time/1,lev/9,lat/192,lon/288" --verify

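# Python: run with the scripts/ directory on sys.path so parallel_map is importable.
# process_event and events are placeholders for your own per-event function and work list.
import functools

import numpy as np
from parallel_map import pmap, add_accumulators

parts = pmap(process_event, events, n_workers=96)          # fine-grained, BLAS-pinned
total = functools.reduce(add_accumulators, parts)           # (sum, count) merge
mean  = {v: np.where(c > 0, s / c, np.nan) for v, (s, c) in total.items()}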
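
The (sum, count) merge is what keeps the mean correct when workers see different numbers of valid samples per pixel. The following self-contained sketch shows the principle with NaN gaps; `to_accumulator` and `merge` are hypothetical helpers written for this example, not the bundled `add_accumulators`.

```python
import functools

import numpy as np


def to_accumulator(field):
    """Turn one field with NaN gaps into a per-pixel (sum, valid_count) pair."""
    valid = ~np.isnan(field)
    return np.where(valid, field, 0.0), valid.astype(np.int64)


def merge(a, b):
    """Combine two (sum, count) pairs; associative, so partials merge in any order."""
    return a[0] + b[0], a[1] + b[1]


fields = [
    np.array([[1.0, np.nan]]),
    np.array([[3.0, 4.0]]),
    np.array([[5.0, np.nan]]),
]

s, c = functools.reduce(merge, map(to_accumulator, fields))
with np.errstate(invalid="ignore", divide="ignore"):
    mean = np.where(c > 0, s / c, np.nan)  # pixels with no valid data stay NaN
print(mean)
```

Averaging per-worker means instead would weight each worker equally regardless of how many valid samples it saw, which is why sums and counts are carried separately until the very end.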

Layout

hpc-netcdf-parallelization/
  SKILL.md                  # the skill (when/how to use)
  scripts/                  # detect_env, read_slab, rechunk, parallel_map  (CLI + JSON)
  references/               # chunking.md, backends.md, parallel.md
  examples/                 # example_workflow.md (worked diagnosis)
commands/                   # /hpc-detect, /hpc-rechunk, /hpc-parallelize
.claude-plugin/             # plugin.json + marketplace.json

Golden rules

1. Profile I/O before blaming compute.
2. Contiguous slabs in, access-pattern chunks out.
3. More cores ≠ faster for memory-bound work.
4. Detect the cluster; don't assume.
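
For the rule about memory-bound work, one simple way to pick a worker count is to measure throughput at a few counts and take the smallest one that is close to the best. The sketch below assumes that definition of the "knee"; `knee_worker_count` and its `tolerance` parameter are hypothetical, and the numbers are made up purely to show the shape of the curve.

```python
def knee_worker_count(throughput_by_workers, tolerance=0.10):
    """Smallest worker count whose throughput is within `tolerance` of the best measured."""
    best = max(throughput_by_workers.values())
    for n in sorted(throughput_by_workers):
        if throughput_by_workers[n] >= (1.0 - tolerance) * best:
            return n


# Illustrative numbers only (not measurements from any real cluster):
# throughput rises, flattens, then drops as workers contend for memory bandwidth.
measured = {1: 1.0, 2: 1.9, 4: 3.6, 8: 5.8, 16: 6.1, 32: 6.0, 64: 5.2}
knee = knee_worker_count(measured)
print(knee)
```

With a curve shaped like this, running on every available core would be slower than stopping near the knee, which is the point of the rule.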
