Speed up Python on the Yale SOM HPC cluster with profiling, DuckDB/Polars/Numba, and right-sized parallelism before reaching for multiprocessing or GPUs. TRIGGER when a Python Slurm job on the Yale SOM HPC cluster is slow/CPU-bound/memory-heavy, or when considering parallelism or GPU acceleration on the cluster.
Slash command: /hpc:accelerating-python
Rule: profile first. Then choose the smallest acceleration that matches the bottleneck.
Do not start with multiprocessing if the real bottleneck is CSV parsing, GPFS metadata, a database query, or network latency.
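import cProfile
import pstats

# Profile the whole entry point (main() is your job's top-level function).
with cProfile.Profile() as profiler:
    main()

# Show the 25 most expensive calls by cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumtime")
stats.print_stats(25)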
For running jobs, py-spy dump --pid PID is often more useful if available.
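Before a full profile, a coarse split of wall time by phase can already show whether a job is dominated by reading, computing, or writing. The sketch below is a minimal illustration using only the standard library; the `timed` and `slowest_phase` helpers and the three stand-in phases are hypothetical names, not part of any cluster tooling.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Add the wall-clock seconds spent in the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


def slowest_phase(timings):
    """Return the label with the largest accumulated time."""
    return max(timings, key=timings.get)


timings = {}
with timed("load", timings):
    rows = [str(i) for i in range(50_000)]   # stand-in for reading input
with timed("compute", timings):
    total = sum(int(r) for r in rows)        # stand-in for the real work
with timed("write", timings):
    text = "\n".join(rows)                   # stand-in for writing output

print(timings, "slowest:", slowest_phase(timings))
```

If the slowest phase is loading or writing, adding CPU workers is unlikely to help; that is the case where changing the file format or pushing work into a query engine tends to pay off first.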
import duckdb
duckdb.sql("""
COPY (
SELECT id, score
FROM read_parquet('/gpfs/project/myproject/data/raw/*.parquet')
WHERE score IS NOT NULL
) TO '/gpfs/project/myproject/data/derived/scores.parquet'
(FORMAT PARQUET)
""")
Pushing the filter and column selection into DuckDB like this is often faster and simpler than writing Python loops.
Use Numba when you have loops over NumPy arrays and Python overhead dominates.
import numpy as np
from numba import njit

@njit(cache=True)
def scale(arr):
    # Explicit loop: Numba compiles this to machine code.
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * 2
    return out
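Use cache=True so compiled code is reused across runs. nogil=True is not implied by @njit: it releases the GIL while the compiled function runs, which only helps when the function is called from several Python threads. Leave it off otherwise.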
prange is parallel only with parallel=True.
from numba import njit, prange

@njit(parallel=True, cache=True)
def add_one(arr):
    # Each iteration writes to its own element, so this is safe to parallelize.
    for i in prange(arr.shape[0]):
        arr[i] += 1
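Parallelize the outer loop. When prange loops are nested, Numba parallelizes only the outermost one and runs the inner ones as ordinary range loops, so an inner prange adds nothing.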
Bad:
@njit(parallel=True, cache=True)
def count_bad(counts, idx):
    for i in prange(idx.shape[0]):
        # Race: two iterations may hit the same counts element at once.
        counts[idx[i]] += 1
Multiple workers can update the same counts element at the same time, so increments can be lost. Use per-worker accumulators and reduce, or use a serial loop if correctness matters more than speed. For sparse scatter-add patterns where the parallel speedup is modest, np.add.at(counts, idx, 1) outside Numba is clear and handles repeated indices correctly (it runs serially, so there is no race).
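The per-worker accumulator idea can be sketched as follows: split the index array into chunks, give each chunk a private row of counts, and sum the rows at the end. This is one possible layout, not the only one. The function name `count_chunked` and the `n_bins` / `n_chunks` parameters are illustrative, cache=True is omitted to keep the sketch short, and the import fallback is an assumption added only so the sketch still runs (serially) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for this sketch only: run as plain Python without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


@njit(parallel=True)
def count_chunked(idx, n_bins, n_chunks):
    # One private row of counts per chunk, so no two prange iterations
    # ever write to the same element.
    local_counts = np.zeros((n_chunks, n_bins), dtype=np.int64)
    n = idx.shape[0]
    chunk = (n + n_chunks - 1) // n_chunks  # ceiling division
    for c in prange(n_chunks):
        start = c * chunk
        stop = min(start + chunk, n)
        for i in range(start, stop):
            local_counts[c, idx[i]] += 1
    # Serial reduction over the chunk axis.
    return local_counts.sum(axis=0)


idx = np.array([0, 1, 1, 3, 3, 3], dtype=np.int64)
counts = count_chunked(idx, 4, 3)
```

The extra memory is n_chunks × n_bins integers, so this layout suits a modest number of bins. For plain counting, np.bincount(idx, minlength=n_bins) gives the same result without any Numba code.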
Numba compiles on first call. Benchmark steady-state runtime separately from compile time:
scale(arr) # compile
scale(arr) # measure this call
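The two-call pattern above can be wrapped in a small helper that reports the first call and the best of several later calls separately. This is a minimal sketch with a hypothetical `bench` function; it works on any callable, so it is demonstrated here with a plain Python function in place of a compiled one.

```python
import time


def bench(fn, *args, repeats=5):
    """Return (first_call_seconds, best_later_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)  # first call: includes JIT compilation, if any
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)


# Stand-in workload; with Numba this would be e.g. bench(scale, arr).
first, steady = bench(sum, list(range(100_000)))
print(f"first call: {first:.6f}s, steady state: {steady:.6f}s")
```

With cache=True, the first call in a later run loads the cached machine code instead of compiling, so the gap between the two numbers is usually much smaller from the second run onward.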
Numba: @njit(cache=True), prange, parallel reductions.
Polars: scan_* / sink_*.
cProfile / pstats: built-in profilers.
py-spy: py-spy dump --pid PID.