From hpx-dev
Generates `pytest-benchmark` scripts for HPyX bindings, runs HPX vs NumPy/pure-Python comparisons, measures thread scaling and binding overhead, and interprets timing results. Use when the user asks about "benchmarking", "performance testing", "pytest-benchmark", "benchmark HPX vs Python", "benchmark HPX vs NumPy", "measure binding overhead", "profile HPyX", "threadpoolctl", "benchmark scaling", "performance comparison", mentions the "benchmarks/" directory or "pixi run benchmark", or asks about performance characteristics of HPyX operations.
Slash command: /hpx-dev:benchmarking
HPyX uses pytest-benchmark for performance testing:
- `benchmarks/` directory
- `benchmark-py313t` pixi environment
- `pytest-benchmark>=5.1.0`, `threadpoolctl>=3.6.0`
- `pixi run benchmark`

# Run all benchmarks
pixi run benchmark
# Run specific benchmarks by keyword
pixi run benchmark keyword_expression="dot1d"
# The underlying command (for reference):
pytest ./benchmarks \
--benchmark-group-by=func \
--benchmark-warmup=on \
--benchmark-min-rounds=3 \
--benchmark-time-unit=ms
# Save results for comparison
pytest ./benchmarks --benchmark-save=baseline
# Compare against saved results
pytest ./benchmarks --benchmark-compare=baseline
# Generate JSON output
pytest ./benchmarks --benchmark-json=results.json
# Disable benchmarks (run tests only)
pytest ./benchmarks --benchmark-disable
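The JSON written by `--benchmark-json` can be post-processed to compare implementations. The following is a minimal sketch, assuming the report has a top-level `benchmarks` list whose entries carry `name`, `params`, and `stats` (with `mean` in seconds). The helper name `speedups` is hypothetical, and the sample report and its timings are invented placeholders, not measured results.

```python
import json

# Placeholder report mimicking the assumed shape of results.json.
# The timings are invented for illustration, not measured.
SAMPLE_REPORT = json.dumps({
    "benchmarks": [
        {"name": "test_bench_hpx_dot1d[10000000]",
         "params": {"size": 10_000_000},
         "stats": {"mean": 0.004}},
        {"name": "test_bench_np_dot1d[10000000]",
         "params": {"size": 10_000_000},
         "stats": {"mean": 0.008}},
    ]
})


def speedups(report_text, baseline_prefix, candidate_prefix):
    """Return {size: baseline_mean / candidate_mean} for sizes present in both."""
    report = json.loads(report_text)
    means = {}
    for entry in report["benchmarks"]:
        size = entry["params"]["size"]
        for prefix in (baseline_prefix, candidate_prefix):
            if entry["name"].startswith(prefix):
                means.setdefault(size, {})[prefix] = entry["stats"]["mean"]
    return {
        size: by_impl[baseline_prefix] / by_impl[candidate_prefix]
        for size, by_impl in means.items()
        if baseline_prefix in by_impl and candidate_prefix in by_impl
    }


result = speedups(SAMPLE_REPORT, "test_bench_np_dot1d", "test_bench_hpx_dot1d")
print(result)
```

In practice the report text would come from reading `results.json` rather than from an inline string. A ratio above 1 means the candidate ran faster than the baseline at that size.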
Compare HPyX bindings against NumPy equivalents — the primary benchmark pattern in this project.
import numpy as np
import pytest
from hpyx.runtime import HPXRuntime
import hpyx


@pytest.mark.parametrize("size", [10_000_000, 50_000_000, 100_000_000])
def test_bench_hpx_operation(benchmark, size):
    """Benchmark HPX implementation."""
    rng = np.random.default_rng()
    data = rng.random(size)
    with HPXRuntime():
        _ = benchmark(hpyx._core.operation, data)


@pytest.mark.parametrize("size", [10_000_000, 50_000_000, 100_000_000])
def test_bench_numpy_operation(benchmark, size):
    """Benchmark NumPy equivalent."""
    rng = np.random.default_rng()
    data = rng.random(size)
    _ = benchmark(np.operation, data)
Reference: benchmarks/test_bench_hpx_linalg.py
Measure how performance scales with thread count:
@pytest.mark.parametrize("threads", [1, 2, 4, 8])
@pytest.mark.parametrize("size", [1_000_000, 10_000_000])
def test_bench_scaling(benchmark, threads, size):
    """Benchmark thread scaling."""
    data = np.random.random(size)

    def run():
        with HPXRuntime(os_threads=threads):
            return hpyx._core.operation(data)

    benchmark(run)
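Thread-scaling timings are easier to read as speedup and parallel efficiency relative to the single-thread run. The sketch below is one way to compute them. `scaling_table` is a hypothetical helper, and the timings are invented placeholders rather than HPyX measurements.

```python
def scaling_table(timings):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run.

    `timings` maps thread count to mean seconds per call and must include 1.
    """
    base = timings[1]
    table = {}
    for threads, seconds in sorted(timings.items()):
        speedup = base / seconds
        # Efficiency is the fraction of ideal linear speedup achieved.
        table[threads] = (speedup, speedup / threads)
    return table


# Placeholder timings in seconds, invented for illustration only.
example = {1: 0.80, 2: 0.40, 4: 0.25, 8: 0.20}
table = scaling_table(example)
for threads, (speedup, efficiency) in table.items():
    print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Because `run` in the scaling pattern starts `HPXRuntime` on every call, timings collected that way include runtime startup, which lowers the apparent efficiency for small inputs.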
Use threadpoolctl to force single-threaded NumPy for fair comparison:
from threadpoolctl import threadpool_limits


@pytest.mark.parametrize("size", [10_000_000, 50_000_000])
def test_bench_hpx_single_thread(benchmark, size):
    data = np.random.random(size)
    with HPXRuntime(os_threads=1):
        _ = benchmark(hpyx._core.operation, data)


@pytest.mark.parametrize("size", [10_000_000, 50_000_000])
def test_bench_numpy_single_thread(benchmark, size):
    data = np.random.random(size)
    with threadpool_limits(limits=1):
        _ = benchmark(np.operation, data)
Reference: benchmarks/test_bench_hpx_linalg.py (single-thread variants)
Compare against pure Python loops to show binding overhead:
@pytest.mark.parametrize("size", [100_000, 1_000_000])
def test_bench_hpx_for_loop(benchmark, size):
    arr = list(range(size))

    def run():
        with HPXRuntime():
            hpyx.multiprocessing.for_loop(lambda x: x * 2, arr, "seq")

    benchmark(run)


@pytest.mark.parametrize("size", [100_000, 1_000_000])
def test_bench_python_for_loop(benchmark, size):
    arr = list(range(size))

    def run():
        for i in range(len(arr)):
            arr[i] = arr[i] * 2

    benchmark(run)
Isolate the overhead of crossing the Python/C++ boundary:
def test_bench_submit_overhead(benchmark):
    """Measure async submit overhead (trivial function)."""
    with HPXRuntime():
        def noop():
            return 42

        def run():
            f = hpyx.futures.submit(noop)
            return f.get()

        benchmark(run)


def test_bench_python_call_overhead(benchmark):
    """Baseline: Python function call overhead."""
    def noop():
        return 42

    benchmark(noop)
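The per-call overhead is the difference between the submit-and-wait time and the plain call time. The sketch below shows that subtraction using only the standard library. `ThreadPoolExecutor` is a stand-in for an async submit API, not the HPyX implementation, and `mean_seconds` is a hypothetical helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def noop():
    return 42


def mean_seconds(fn, rounds=2000):
    """Average wall-clock seconds per call of fn over `rounds` calls."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn()
    return (time.perf_counter() - start) / rounds


# ThreadPoolExecutor is only a stand-in for an async submit API here.
with ThreadPoolExecutor(max_workers=1) as pool:
    submit_cost = mean_seconds(lambda: pool.submit(noop).result())

direct_cost = mean_seconds(noop)
overhead = submit_cost - direct_cost
print(f"direct: {direct_cost * 1e6:.2f} us, submit: {submit_cost * 1e6:.2f} us")
print(f"estimated per-call overhead: {overhead * 1e6:.2f} us")
```

With pytest-benchmark, the same subtraction can be applied to the mean times reported for the two tests above.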
Create data before the benchmarked function, not inside it:
# CORRECT: Data created once, benchmark measures only the operation
def test_bench(benchmark, size):
    data = np.random.random(size)  # Outside benchmark
    with HPXRuntime():
        _ = benchmark(hpyx._core.dot1d, data, data)


# WRONG: Data creation is measured too
def test_bench(benchmark, size):
    def run():
        data = np.random.random(size)  # Inside benchmark: wrong!
        return hpyx._core.dot1d(data, data)

    benchmark(run)
Place HPXRuntime() context based on what to measure:
# Measure only the operation (exclude runtime startup):
def test_bench_operation(benchmark, size):
    data = np.random.random(size)
    with HPXRuntime():  # Started once
        _ = benchmark(op, data)  # Operation measured many times


# Measure operation + runtime startup:
def test_bench_with_startup(benchmark, size):
    data = np.random.random(size)

    def run():
        with HPXRuntime():  # Started each iteration
            return op(data)

    benchmark(run)
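The effect of context placement can be seen without HPX at all. The sketch below uses a hypothetical `fake_runtime` context manager with an invented 10 ms startup cost as a stand-in for a real runtime. Entering it inside the timed callable adds that cost to every round, while entering it once outside does not.

```python
import time
from contextlib import contextmanager

STARTUP_SECONDS = 0.01  # Invented startup cost for the stand-in runtime.


@contextmanager
def fake_runtime():
    """Stand-in for a runtime context manager with a fixed startup cost."""
    time.sleep(STARTUP_SECONDS)
    yield


def op():
    return sum(range(1000))


def mean_seconds(fn, rounds=20):
    """Average wall-clock seconds per call of fn over `rounds` calls."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn()
    return (time.perf_counter() - start) / rounds


# Runtime entered once: only op() is timed.
with fake_runtime():
    op_only = mean_seconds(op)


# Runtime entered inside the timed callable: startup is timed too.
def run_with_startup():
    with fake_runtime():
        return op()


with_startup = mean_seconds(run_with_startup)
print(f"op only: {op_only * 1e3:.3f} ms, with startup: {with_startup * 1e3:.3f} ms")
```

Subtracting the first figure from the second gives a rough estimate of the startup cost per round.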
Use parametrize with ranges that reveal scaling behavior:
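One way to build such a range is a geometric progression, so that each step multiplies the problem size by a fixed factor. This is a sketch only. `geometric_sizes` is a hypothetical helper, and the bounds are illustrative rather than the project's chosen values.

```python
def geometric_sizes(start, stop, factor=10):
    """Return sizes start, start*factor, ... up to and including stop."""
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size *= factor
    return sizes


# Illustrative bounds; pick values that span small and large inputs.
SIZES = geometric_sizes(1_000, 10_000_000)
print(SIZES)
```

The resulting list can be passed straight to `@pytest.mark.parametrize("size", SIZES)`.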
Follow the pattern test_bench_{implementation}_{operation}:
- `test_bench_hpx_dot1d`: HPX implementation
- `test_bench_np_dot1d`: NumPy baseline
- `test_bench_python_for_loop`: Pure Python baseline

Before trusting timing data, verify:
Key metrics to evaluate:
- `references/benchmark-analysis.md`: Guide to interpreting benchmark results, common performance bottlenecks, and optimization strategies for C++/Python bindings
npx claudepluginhub uw-ssec/hpyx --plugin hpx-dev