Skill

accelerating-python

From hpc

Speed up Python on the Yale SOM HPC cluster with profiling, DuckDB/Polars/Numba, and right-sized parallelism before reaching for multiprocessing or GPUs. TRIGGER when a Python Slurm job on the Yale SOM HPC cluster is slow/CPU-bound/memory-heavy, or when considering parallelism or GPU acceleration on the cluster.

Slash command

/hpc:accelerating-python


SKILL.md

133 lines · ~1.1k tokens


Last commit: Apr 29, 2026


Accelerating Python

Rule: profile first. Then choose the smallest acceleration that matches the bottleneck.

Order of operations

1. Measure the slow part.
2. Reduce data with filters/projections before loading.
3. Use better engines: DuckDB, Polars, Arrow, NumPy.
4. Vectorize simple array/dataframe operations.
5. Use Numba for tight numeric loops that cannot be expressed cleanly in vectorized form.
6. Use multiprocessing for CPU-bound Python functions.
7. Use GPUs only for GPU-shaped work.

Do not start with multiprocessing if the real bottleneck is CSV parsing, GPFS metadata, a database query, or network latency.
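
When profiling does point to a CPU-bound Python function, the pool size should match the cores Slurm allocated to the job, not the node's total core count. The sketch below is one way to do that, assuming the job was submitted with --cpus-per-task so that Slurm sets SLURM_CPUS_PER_TASK. The helper names worker_count and parallel_map are hypothetical and not part of the skill.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count(default=1):
    """Return a worker count that respects the Slurm allocation.

    Assumption: SLURM_CPUS_PER_TASK is set because the job requested
    --cpus-per-task. Otherwise fall back to the CPUs this process may use.
    """
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0)) or default
    return os.cpu_count() or default


def parallel_map(func, items, chunksize=1):
    """Map a picklable, module-level function over items in worker processes."""
    with ProcessPoolExecutor(max_workers=worker_count()) as pool:
        return list(pool.map(func, items, chunksize=chunksize))
```

Call parallel_map from under an if __name__ == "__main__": guard, and raise chunksize when each item is cheap so that inter-process overhead does not dominate.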

Quick profiler

import cProfile
import pstats

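# main() stands in for the job's entry point.
# Using Profile as a context manager requires Python 3.8+.
with cProfile.Profile() as profiler:
    main()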

stats = pstats.Stats(profiler).sort_stats("cumtime")
stats.print_stats(25)

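For a job that is already running, py-spy dump --pid PID prints the current Python stack of every thread without restarting the process, which is often more useful when py-spy is available.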
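
In a batch job it can help to capture the profile as text so it ends up in the Slurm log next to the job's other output. The following is a minimal sketch of that idea using only the standard library. profile_call and slow_sum are hypothetical names used for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Shorten file paths and rank by cumulative time, as in the snippet above.
    stats.strip_dirs().sort_stats("cumtime").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    """Deliberately loop-heavy stand-in for a real workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The report lists slow_sum near the top. In a real job, the function passed in would be the entry point that the profiler snippet above calls main().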

Prefer query engines for data work

import duckdb

duckdb.sql("""
COPY (
  SELECT id, score
  FROM read_parquet('/gpfs/project/myproject/data/raw/*.parquet')
  WHERE score IS NOT NULL
) TO '/gpfs/project/myproject/data/derived/scores.parquet'
(FORMAT PARQUET)
""")

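This is often faster and simpler than writing Python loops: DuckDB reads only the id and score columns from the Parquet files and writes the result without materializing the raw data in Python.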

Numba for tight numeric loops

Use Numba when you have loops over NumPy arrays and Python overhead dominates.

import numpy as np
from numba import njit

@njit(cache=True)
def scale(arr):
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * 2
    return out

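Use cache=True so compiled code is reused across runs. @njit does not imply nogil=True; that option releases the GIL while the compiled function runs, which only helps when the function is called from several Python threads. Leave it off otherwise.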

Parallel Numba

prange runs in parallel only with parallel=True; without it, prange behaves like range.

from numba import njit, prange

@njit(parallel=True, cache=True)
def add_one(arr):
    for i in prange(arr.shape[0]):
        arr[i] += 1

Parallelize the outer loop. Numba parallelizes only the outermost prange and runs any prange nested inside it as an ordinary range loop, so nesting adds no speedup.

Avoid race conditions

Bad:

@njit(parallel=True, cache=True)
def count_bad(counts, idx):
    for i in prange(idx.shape[0]):
        counts[idx[i]] += 1

Multiple workers can update the same counts element at once, so increments can be lost. Use per-worker accumulators and reduce them at the end, or keep the loop serial when correctness matters more than speed. For sparse scatter-add patterns where the parallel speedup would be modest, np.add.at(counts, idx, 1) outside Numba runs serially, handles repeated indices correctly, and is clear.

Compilation tax

Numba compiles a function the first time it is called with a given combination of argument types. Benchmark steady-state runtime separately from compile time:

scale(arr)  # compile
scale(arr)  # measure this call
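
The same idea can be wrapped in a small timing helper that reports the first call separately from later calls. This is a sketch using only the standard library. time_first_and_steady is a hypothetical name, and the built-in sum stands in for a Numba function such as scale.

```python
import time


def time_first_and_steady(func, *args, repeats=5):
    """Return (first-call seconds, best steady-state seconds) for func(*args)."""
    t0 = time.perf_counter()
    func(*args)  # for a Numba function this call includes compilation
    first = time.perf_counter() - t0

    steady = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        steady.append(time.perf_counter() - t0)
    # The minimum is the measurement least disturbed by other node activity.
    return first, min(steady)


first, steady = time_first_and_steady(sum, list(range(100_000)))
print(f"first call: {first:.6f}s, steady state: {steady:.6f}s")
```

For a jitted function, the gap between the two numbers approximates the compile cost. With cache=True that gap should shrink on later runs, once the compiled code is loaded from the cache.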

Checklist

The bottleneck is measured, not guessed.
Filters/projections happen before full materialization.
DuckDB/Polars/NumPy are considered before Python loops.
Numba functions use numeric arrays and @njit(cache=True).
Parallel Numba avoids shared write races.
Multiprocessing is used only when a single-process optimization is not enough.
GPU use is justified by actual CUDA/GPU-backed code.
