Configure and optimize numcodecs compression for Zarr arrays: Blosc, Zstd, LZ4, Gzip, LZMA; pre-filters (Delta, Quantize); pipelines, Blosc thread safety, speed/ratio trade-offs.
Slash command: /zarr-data-format:compression-codecs
Configure, select, and optimize compression codecs for Zarr arrays using **numcodecs**. This skill covers every compressor and filter in the Zarr ecosystem, thread safety for multi-process workloads, codec pipelines in Zarr v3, and performance trade-offs.
Zarr Performance Guide: https://zarr.readthedocs.io/en/latest/user-guide/performance/
numcodecs Reference: https://numcodecs.readthedocs.io/
GitHub: https://github.com/zarr-developers/numcodecs
| Codec | Speed (compress) | Speed (decompress) | Ratio | Best For |
|---|---|---|---|---|
| Blosc+LZ4 | Very Fast | Very Fast | Low-Med | Real-time analysis, frequent reads (zarr-python 2.x default) |
| Blosc+Zstd | Medium | Fast | High | General purpose |
| Zstd standalone | Medium | Fast | High | Zarr v3 default |
| Blosc+LZ4HC | Slow | Very Fast | Medium | Write-once, read-many |
| Gzip | Slow | Medium | Med-High | Interop with non-Python tools |
| LZ4 standalone | Very Fast | Very Fast | Low | Maximum throughput |
| LZMA | Very Slow | Very Slow | Very High | Archival only |
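Blosc is a meta-compressor: it wraps an internal algorithm (cname) and adds byte shuffling, which is usually the single most impactful setting for numerical data. Shuffle regroups the bytes of multi-byte elements so that bytes of equal significance sit next to each other, which exposes repetition to the compressor. On favorable data this can improve ratios by as much as 10–40×; the gain depends on the dtype and on how smooth the values are.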
| Parameter | Options | Default (numcodecs) |
|---|---|---|
| cname | blosclz, lz4, lz4hc, snappy, zlib, zstd | lz4 |
| clevel | 0 (none) – 9 (max) | 5 |
| shuffle | NOSHUFFLE (0), SHUFFLE (1), BITSHUFFLE (2) | SHUFFLE |
from numcodecs import Blosc
Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE) # balanced
Blosc(cname='lz4', clevel=1, shuffle=Blosc.SHUFFLE) # max speed
Blosc(cname='zstd', clevel=9, shuffle=Blosc.BITSHUFFLE) # max ratio
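To make the effect of shuffling concrete, here is a minimal stdlib-only sketch. It is not Blosc's implementation: `byte_shuffle` and `byte_unshuffle` are hypothetical helpers, and `zlib` stands in for Blosc's internal compressor. The input is an assumed ramp of int32 values whose high bytes rarely change.

```python
import struct
import zlib

ITEMSIZE = 4  # bytes per int32 element


def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle by interleaving the byte planes again."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)


# Slowly increasing int32 values: the high bytes barely change between elements.
values = list(range(1_000_000, 1_010_000))
raw = struct.pack(f"<{len(values)}i", *values)
shuffled = byte_shuffle(raw, ITEMSIZE)

plain_size = len(zlib.compress(raw, 5))
shuffled_size = len(zlib.compress(shuffled, 5))
print(f"raw: {len(raw)} B, zlib: {plain_size} B, shuffle+zlib: {shuffled_size} B")
```

The exact sizes depend on the data, so treat the printed numbers as an illustration and measure on a representative chunk of your own array before choosing between SHUFFLE and BITSHUFFLE.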
Blosc's internal threading is not fork-safe. Multi-process use (Dask workers, multiprocessing, joblib) can cause silent data corruption.
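from numcodecs import blosc
blosc.use_threads = False  # set this in every process of a multi-process workload
# For Dask distributed, apply the setting on every worker
# (assumes `client` is an existing dask.distributed.Client):
def disable_blosc_threads():
    import numcodecs
    numcodecs.blosc.use_threads = False
client.run(disable_blosc_threads)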
| Codec | Import | Key Config |
|---|---|---|
| Zstd (v3 default) | from numcodecs import Zstd | Zstd(level=3) — levels 1–22 |
| LZ4 | from numcodecs import LZ4 | LZ4(acceleration=1) |
| Gzip | from numcodecs import GZip | GZip(level=5) — levels 1–9 |
| Zlib | from numcodecs import Zlib | Zlib(level=4) — levels 1–9 |
| BZ2 | from numcodecs import BZ2 | BZ2(level=5) — levels 1–9 |
| LZMA | from numcodecs import LZMA | LZMA(preset=6) — presets 0–9 |
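The speed and ratio trade-offs are easiest to judge by measuring on your own data. The sketch below uses only the Python standard library, whose `zlib`, `bz2`, and `lzma` modules implement the same algorithms as the Zlib, BZ2, and LZMA codecs in the table. The payload is an assumed int64 ramp and the levels simply mirror the table's example settings. Timings vary by machine.

```python
import bz2
import lzma
import struct
import time
import zlib

# Sample payload: 50,000 sequential int64 values (hypothetical data).
values = list(range(1_700_000_000, 1_700_050_000))
payload = struct.pack(f"<{len(values)}q", *values)

# name -> (compress, decompress)
codecs = {
    "zlib level=4": (lambda b: zlib.compress(b, 4), zlib.decompress),
    "bz2 level=5": (lambda b: bz2.compress(b, 5), bz2.decompress),
    "lzma preset=6": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    encoded = compress(payload)
    elapsed = time.perf_counter() - start
    if decompress(encoded) != payload:  # every codec here is lossless
        raise RuntimeError(f"{name} failed to round-trip")
    ratio = len(payload) / len(encoded)
    results[name] = (ratio, elapsed)
    print(f"{name:14s} ratio={ratio:6.2f} compress_time={elapsed * 1e3:7.1f} ms")
```

To benchmark the numcodecs codecs themselves, the same loop should work with each codec object's `encode` and `decode` methods in place of the stdlib functions.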
Filters transform data before compression to improve ratios. They are applied in list order when writing and in reverse order when reading.
| Filter | Use Case | Example |
|---|---|---|
| Delta | Monotonic data (timestamps, indices) | Delta(dtype='int64') |
| Quantize | Reduce float precision | Quantize(digits=3, dtype='float64') |
| FixedScaleOffset | Convert floats to ints | FixedScaleOffset(offset=273.15, scale=100, dtype='float64', astype='int32') |
| PackBits | Boolean arrays (8× reduction) | PackBits() |
| Categorize | String→integer encoding | Categorize(labels=['a','b','c'], dtype='U10', astype='u1') |
import zarr
from numcodecs import Blosc, Delta, Quantize
# v2: filters + compressor (zarr-python 3 may also need zarr_format=2 here)
z = zarr.open_array('data.zarr', mode='w', shape=(10000,), dtype='int64', chunks=(1000,),
                    filters=[Delta(dtype='int64')], compressor=Blosc(cname='zstd', clevel=5))
# Chain: Delta → Quantize → compressor (pass this list as filters=...)
filters = [Delta(dtype='float64'), Quantize(digits=3, dtype='float64')]
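v3 replaces compressor + filters with a unified codec pipeline of three stages: array→array codecs (the role filters play in v2), then exactly one array→bytes codec (the serializer), then bytes→bytes codecs (compressors). Reading runs the stages in reverse.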
import zarr
# v3 with default Zstd
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', zarr_format=3)
# v3 with explicit compressor (overwrite=True replaces the array created above)
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', compressors=zarr.codecs.ZstdCodec(level=5), overwrite=True)
# No compression
z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='float64', compressors=None, overwrite=True)
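The three pipeline stages can be modelled in a few lines of plain Python. This is a conceptual sketch, not Zarr's implementation: the helper functions are hypothetical, a delta transform stands in for an array→array codec, `struct` for the array→bytes serializer, and `zlib` for a bytes→bytes compressor such as Zstd.

```python
import struct
import zlib

values = [100, 101, 103, 106, 110, 115]


# Stage 1, array -> array: a delta transform (stand-in for a v3 filter).
def delta_encode(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]


def delta_decode(ds):
    out, total = [], 0
    for d in ds:
        total += d
        out.append(total)
    return out


# Stage 2, array -> bytes: serialize as little-endian int64.
def to_bytes(xs):
    return struct.pack(f"<{len(xs)}q", *xs)


def from_bytes(buf):
    return list(struct.unpack(f"<{len(buf) // 8}q", buf))


# Stage 3, bytes -> bytes: compress (zlib stands in for Zstd).
encoded = zlib.compress(to_bytes(delta_encode(values)))

# Reading runs the stages in reverse order.
decoded = delta_decode(from_bytes(zlib.decompress(encoded)))
print(decoded)
```

The fixed stage order is the main design point: transforms that need element structure must run before serialization, and byte-level compressors can only run after it.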
Primary constraint?
├── STORAGE SIZE → Zstd level 9 or LZMA (archival only)
├── READ SPEED → Blosc+LZ4 with SHUFFLE (numerical) or LZ4 standalone
├── WRITE SPEED → LZ4(acceleration=10) or Blosc+LZ4 clevel=1
├── BALANCED → Blosc+Zstd clevel=3 (v2) or Zstd level=3 (v3)
├── INTEROP → Gzip (universal) or Zlib (NetCDF compat)
└── DATA TYPE
├── Monotonic → Delta filter + any compressor
├── Boolean → PackBits + LZ4
├── Integer → Blosc BITSHUFFLE
└── Limited precision float → Quantize filter + Zstd
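The tree above can also be written as a small lookup, which is convenient when the choice is made in a script or config loader. `suggest_codec` and its key names are hypothetical. The returned strings are copied from the tree and are recommendations to benchmark, not guarantees.

```python
from typing import Optional

BY_CONSTRAINT = {
    "storage_size": "Zstd level 9 or LZMA (archival only)",
    "read_speed": "Blosc+LZ4 with SHUFFLE (numerical) or LZ4 standalone",
    "write_speed": "LZ4(acceleration=10) or Blosc+LZ4 clevel=1",
    "balanced": "Blosc+Zstd clevel=3 (v2) or Zstd level=3 (v3)",
    "interop": "Gzip (universal) or Zlib (NetCDF compat)",
}

BY_DATA_KIND = {
    "monotonic": "Delta filter + any compressor",
    "boolean": "PackBits + LZ4",
    "integer": "Blosc BITSHUFFLE",
    "limited_precision_float": "Quantize filter + Zstd",
}


def suggest_codec(constraint: str, data_kind: Optional[str] = None) -> str:
    """Return the tree's recommendation for a primary constraint."""
    if constraint == "data_type":
        if data_kind is None:
            raise ValueError("data_kind is required when constraint='data_type'")
        return BY_DATA_KIND[data_kind]
    return BY_CONSTRAINT[constraint]


print(suggest_codec("balanced"))
print(suggest_codec("data_type", "boolean"))
```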
Set blosc.use_threads = False in any multi-process environment.