Safely rechunks Zarr datasets with validation, progress reporting, memory limits, and rollback safety via the rechunk.py CLI. Supports local and S3 cloud storage.
Slash command: /zarr-chunk-optimization:rechunking
Rechunking is one of the most expensive operations you can perform on a Zarr dataset. Depending on chunk size and dataset volume, rechunking can take anywhere from approximately 6 minutes to over 46 hours (Nguyen et al., 2023, DOI: 10.1002/essoar.10511054.2). The operation rewrites every byte of data, and an interrupted or misconfigured rechunk can corrupt an entire dataset. This skill wraps the rechunk.py script with a safety-first workflow: validate inputs, estimate costs, rechunk to a new location, verify outputs, and only then swap the result into place.
# Basic rechunk (local)
python rechunk.py --input /data/source.zarr \
--output /data/rechunked.zarr \
--chunks "50,512,512"
# Rechunk with memory limit
python rechunk.py --input /data/source.zarr \
--output /data/rechunked.zarr \
--chunks "100,256,256" \
--max-mem "4GB"
# Rechunk cloud data
python rechunk.py --input s3://bucket/source.zarr \
--output s3://bucket/rechunked.zarr \
--chunks "50,512,512" \
--max-mem "4GB"
# Overwrite existing output
python rechunk.py --input /data/source.zarr \
--output /data/rechunked.zarr \
--chunks "200,128,128" \
--overwrite
# Save summary to specific path
python rechunk.py --input /data/source.zarr \
--output /data/rechunked.zarr \
--chunks "50,512,512" \
--summary results/rechunk_report.json
# Verbose mode for debugging
python rechunk.py --input /data/source.zarr \
--output /data/rechunked.zarr \
--chunks "50,512,512" \
--verbose
CLI flags:
| Flag | Required | Default | Description |
|---|---|---|---|
| --input, -i | Yes | - | Input Zarr store path (local or s3://) |
| --output, -o | Yes | - | Output Zarr store path (local or s3://) |
| --chunks, -c | Yes | - | Target chunk shape, comma-separated (e.g., "50,512,512") |
| --max-mem | No | 2GB | Maximum memory for rechunker library |
| --overwrite | No | false | Overwrite output if it exists |
| --summary | No | auto-generated | Path for JSON summary file |
| --verbose, -v | No | false | Enable debug-level logging |
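To make the --chunks format concrete, here is a minimal sketch of how a comma-separated chunk spec could be parsed and checked against an array shape. The function name `parse_chunks` and its error messages are hypothetical; this is not the code inside rechunk.py, only an illustration of the input format the table describes.

```python
def parse_chunks(spec: str, shape: tuple[int, ...]) -> tuple[int, ...]:
    """Parse a spec such as "50,512,512" and check it against an array shape.

    Hypothetical helper: rechunk.py may validate its input differently.
    """
    try:
        chunks = tuple(int(part) for part in spec.split(","))
    except ValueError as exc:
        raise ValueError(f"non-integer chunk size in {spec!r}") from exc
    # One chunk size is needed per array dimension.
    if len(chunks) != len(shape):
        raise ValueError(
            f"{len(chunks)} chunk sizes given for a {len(shape)}-dimensional array"
        )
    if any(c <= 0 for c in chunks):
        raise ValueError("chunk sizes must be positive")
    return chunks


print(parse_chunks("50,512,512", (365, 1024, 1024)))
```

Checking the rank before starting matters because a mismatch would otherwise surface only after the expensive operation has begun.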
Use this skill when:
Do not use this skill if you have not first benchmarked the proposed chunk configuration. Rechunking is expensive and irreversible without a backup.
Nguyen et al. (2023) measured rechunking times across a range of chunk sizes for multi-dimensional scientific datasets stored in cloud object stores:
| Chunk Size Category | Approximate Time | Example Shape |
|---|---|---|
| Large chunks | ~6 minutes | (100, 1024, 1024) |
| Medium chunks | ~1–4 hours | (50, 512, 512) |
| Small chunks | ~10–46 hours | (1, 64, 64) |
Factors affecting rechunking cost:
The rechunk.py script reports the Nguyen et al. range (6 minutes to 46 hours) as a guideline because precise estimation requires knowing the specific storage backend and network conditions.
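One plausible contributor to the spread in the table above is the sheer number of chunks that must be written. The sketch below counts chunks and uncompressed chunk size for the three example shapes. The array shape (1000, 2048, 2048) and the 4-byte item size are assumptions chosen for illustration, not values from Nguyen et al., and the helper name `chunk_stats` is hypothetical.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


# Assumed array: 1000 time steps of 2048 x 2048 float32 values.
shape = (1000, 2048, 2048)
for chunks in [(100, 1024, 1024), (50, 512, 512), (1, 64, 64)]:
    n, size = chunk_stats(shape, chunks, itemsize=4)
    print(f"{chunks}: {n} chunks of {size / 2**20:.3f} MiB each")
```

Under these assumptions the smallest chunk shape produces over a million objects, which hints at why per-object overhead can dominate on cloud object stores.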
The scripts/rechunk.py script performs the following steps automatically:
- Stops if the output path already exists (pass --overwrite to proceed).
- Uses the rechunker library if available (preferred); falls back to a manual chunk-by-chunk copy via zarr.copy.
# Standard local rechunk
python scripts/rechunk.py \
--input /data/observations.zarr \
--output /data/observations_rechunked.zarr \
--chunks "50,512,512"
# Cloud-to-cloud with increased memory budget
python scripts/rechunk.py \
--input s3://my-bucket/raw/data.zarr \
--output s3://my-bucket/optimized/data.zarr \
--chunks "100,256,256" \
--max-mem "8GB"
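As a rough illustration of the kind of guard the script applies before writing, the following sketch checks local paths only. The function `preflight` is hypothetical and is not taken from rechunk.py; it assumes the rules stated in this document (write to a separate output, and require --overwrite when the output exists).

```python
from pathlib import Path


def preflight(input_path: str, output_path: str, overwrite: bool = False) -> None:
    """Raise before any data is touched if the local paths look unsafe."""
    src, dst = Path(input_path), Path(output_path)
    if not src.exists():
        raise FileNotFoundError(f"input store not found: {src}")
    # Rechunking in place would destroy the only copy of the data.
    if src.resolve() == dst.resolve():
        raise ValueError("output must differ from input (never rechunk in place)")
    if dst.exists() and not overwrite:
        raise FileExistsError(f"{dst} exists; pass overwrite=True to replace it")
```

Running a check like this first keeps a typo in --output from turning into data loss.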
Follow this four-step protocol for every rechunking operation:
Rechunk a small subset of the data to verify the configuration works before committing to the full dataset.
# Extract a sample (e.g., first 10 time steps)
python -c "
import zarr
# Pass mode by keyword; newer zarr releases do not accept it positionally
src = zarr.open('/data/full.zarr', mode='r')
sample = zarr.open('/tmp/sample.zarr', mode='w',
                   shape=(10,) + src.shape[1:],
                   chunks=src.chunks, dtype=src.dtype)
sample[:] = src[:10]
"
# Rechunk the sample
python scripts/rechunk.py \
--input /tmp/sample.zarr \
--output /tmp/sample_rechunked.zarr \
--chunks "5,512,512"
Inspect the rechunked sample to confirm data integrity:
import zarr
import numpy as np
source = zarr.open('/tmp/sample.zarr', mode='r')
result = zarr.open('/tmp/sample_rechunked.zarr', mode='r')
assert source.shape == result.shape, "Shape mismatch"
assert source.dtype == result.dtype, "Dtype mismatch"
assert np.array_equal(source[:], result[:]), "Data mismatch"
print("Sample validation passed")
Once the sample passes validation, rechunk the full dataset to a new path (never in place):
python scripts/rechunk.py \
--input /data/full.zarr \
--output /data/full_rechunked.zarr \
--chunks "50,512,512" \
--max-mem "4GB"
After the full rechunk completes, verify the output and then swap it into the production path:
# Verify (the script does basic validation automatically)
# For extra safety, spot-check a few values:
python -c "
import zarr, numpy as np
src = zarr.open('/data/full.zarr', mode='r')
dst = zarr.open('/data/full_rechunked.zarr', mode='r')
# Check the first, middle, and last slices along axis 0
n = src.shape[0]
for idx in [0, n // 2, n - 1]:
    assert np.array_equal(src[idx], dst[idx]), f'Mismatch at index {idx}'
print('Spot-check passed')
"
# Swap
mv /data/full.zarr /data/full_backup.zarr
mv /data/full_rechunked.zarr /data/full.zarr
The --max-mem flag controls how much memory the rechunker library is allowed to use. This is critical for large datasets that do not fit in RAM.
How it works:
If the rechunker library is available, --max-mem is passed directly to rechunker.rechunk(), which plans an execution graph that respects the memory bound.
Guidelines for setting --max-mem:
| System Memory | Recommended --max-mem | Reasoning |
|---|---|---|
| 8 GB | 2GB | Leave headroom for OS and other processes |
| 16 GB | 4GB | Safe default for most workloads |
| 64 GB | 16GB | HPC nodes with dedicated rechunking jobs |
| 128+ GB | 32GB | Large-scale production rechunking |
Warning: Setting --max-mem too high can cause OOM kills. Setting it too low increases the number of read/write cycles and slows down the operation. Start conservative and increase if rechunking is too slow.
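Every row of the table above works out to one quarter of system memory. The sketch below turns that observation into a small helper. The 25% fraction is inferred from the table rather than stated by the script, and `recommend_max_mem` is a hypothetical name.

```python
def recommend_max_mem(system_memory_gb: float, fraction: float = 0.25) -> str:
    """Suggest a --max-mem value as a fraction of system memory.

    The default fraction (0.25) is an assumption read off the guideline table.
    """
    if system_memory_gb <= 0:
        raise ValueError("system memory must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    # Round down to whole gigabytes, but never suggest less than 1GB.
    budget_gb = max(1, int(system_memory_gb * fraction))
    return f"{budget_gb}GB"


for total in (8, 16, 64, 128):
    print(total, "GB system ->", recommend_max_mem(total))
```

Treat the result as a starting point: shared machines or jobs with other memory-hungry steps may need a smaller fraction.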
When rechunking data stored in S3 or GCS, additional considerations apply:
The rechunker library requires an intermediate (temporary) storage location. The script automatically creates a temporary directory for this purpose. For cloud-to-cloud rechunking, the intermediate store is created locally by default. To use cloud intermediate storage:
# In custom scripts, specify cloud intermediate storage
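import rechunker
# Assumes source (an opened Zarr array) and chunks (the target chunk tuple)
# are defined earlier in the script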
plan = rechunker.rechunk(
source, target_chunks=chunks,
max_mem="4GB",
target_store="s3://bucket/output.zarr",
temp_store="s3://bucket/tmp/rechunk_temp"
)
plan.execute()
The script relies on s3fs for S3 access. Ensure AWS credentials are configured (~/.aws/credentials or environment variables).
The script automatically validates:
For additional validation beyond what the script provides:
import zarr
import numpy as np
src = zarr.open("source.zarr", mode="r")
dst = zarr.open("output.zarr", mode="r")
# Dtype check
assert src.dtype == dst.dtype, "Dtype mismatch"
# Metadata check
for key in src.attrs:
    assert key in dst.attrs, f"Missing attribute: {key}"
    assert src.attrs[key] == dst.attrs[key], f"Attribute mismatch: {key}"
# Value sampling (spot-check random positions)
rng = np.random.default_rng(42)
for _ in range(100):
    idx = tuple(int(rng.integers(0, s)) for s in src.shape)
    assert src[idx] == dst[idx], f"Value mismatch at {idx}"
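Random spot-checks cover only a handful of positions. When a full comparison is wanted but the dataset does not fit in RAM, one option is to compare it slab by slab along the first axis. The sketch below only plans the slab boundaries; `slab_ranges` is a hypothetical helper, and the commented usage assumes `src` and `dst` are the arrays opened above.

```python
def slab_ranges(n_rows: int, row_bytes: int, budget_bytes: int):
    """Yield (start, stop) pairs along axis 0 so each slab fits in budget_bytes.

    A slab always holds at least one row, even if that row exceeds the budget.
    """
    rows_per_slab = max(1, budget_bytes // row_bytes)
    for start in range(0, n_rows, rows_per_slab):
        yield start, min(start + rows_per_slab, n_rows)


# Assumed usage with the arrays opened above (needs numpy and math):
#   row_bytes = math.prod(src.shape[1:]) * src.dtype.itemsize
#   for start, stop in slab_ranges(src.shape[0], row_bytes, 512 * 2**20):
#       assert np.array_equal(src[start:stop], dst[start:stop])
print(list(slab_ranges(10, 100, 350)))
```

Each comparison holds one slab from the source and one from the output, so peak memory is roughly twice the budget passed in.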
The safest approach to rechunking follows a write-verify-swap pattern:
# Swap pattern
mv /data/dataset.zarr /data/dataset_backup.zarr
mv /data/dataset_rechunked.zarr /data/dataset.zarr
# After validation in production
rm -rf /data/dataset_backup.zarr
For cloud storage, use versioned buckets or copy to a backup prefix before swapping.
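For local stores, the swap can also be scripted so that a failed second rename restores the original. This is a minimal sketch, assuming all three paths sit on the same filesystem so that renames are cheap; `swap_into_place` is a hypothetical helper and not part of rechunk.py.

```python
from pathlib import Path


def swap_into_place(production: Path, rechunked: Path, backup: Path) -> None:
    """Move production to backup, then move the rechunked store into its place."""
    if not rechunked.exists():
        raise FileNotFoundError(f"rechunked store not found: {rechunked}")
    if backup.exists():
        raise FileExistsError(f"backup path already exists: {backup}")
    production.rename(backup)
    try:
        rechunked.rename(production)
    except OSError:
        # Roll back so the production path is never left empty.
        backup.rename(production)
        raise
```

The backup is deliberately left on disk; delete it only after the new store has been validated in production.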
- Failing to set --max-mem appropriately can cause OOM errors that corrupt the output. Start conservative.
- Never use --overwrite on the source path. Write to a separate output and swap after validation.
- Use --verbose to watch memory usage during rechunking. If the process is killed by OOM, reduce --max-mem.
- Use --summary to keep a record of every rechunking operation for reproducibility.
- Prefer rechunker for better memory management and execution planning. The fallback chunk-by-chunk copy works but is slower and less memory-efficient.

Install the plugin with: npx claudepluginhub uw-ssec/rse-plugins --plugin zarr-chunk-optimization