From bigdft
Create and manage RemoteManager Dataset workflows for running Python functions on remote HPC systems. Covers function definition, run submission, result retrieval, error handling, and dataset chaining. Use when setting up or debugging remote computation workflows.
Slash command: /bigdft:dataset
Help the user create and manage `Dataset` workflows for executing Python functions on remote machines. **Ask each question one at a time.** Skip questions whose answers are obvious from context.
Check if remotemanager is installed and whether the user already has a connection set up:
python3 -c "import remotemanager; print(remotemanager.__version__)" 2>/dev/null
Look for existing connection code (URL/Computer objects) or saved YAML configs in the working directory:
ls *.yaml *.py *.ipynb 2>/dev/null
If no connection exists, suggest running /bigdft:remote first.
What function do you want to run remotely?
1. I'll describe it and you write the function
2. I have an existing function
3. I want to run a BigDFT calculation
How do you want to set up the runs?
1. A single run
2. A parameter sweep (vary one or more arguments)
3. A set of specific configurations
Only ask if non-obvious:
What do you need from the results?
1. Just the return values
2. Return values plus output files from the remote
3. Chain into another computation (dataset chaining)
These rules are essential and must always be followed when writing functions for Dataset:
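# CORRECT -- the import lives inside the function body
def compute(x):
    import numpy as np
    return np.sqrt(x)

# WRONG -- will fail on remote: the module-level import does not travel with the function
import numpy as np
def compute(x):
    return np.sqrt(x)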
Pass the function object, not a call. Use function=compute, never function=compute().
Functions must be self-contained. They cannot reference variables, classes, or other functions from the local scope.
To use helper functions, decorate them with @RemoteFunction:
from remotemanager import Dataset, RemoteFunction

@RemoteFunction
def helper(x):
    return x ** 2

def main_func(a, b):
    return helper(a) + helper(b)

ds = Dataset(function=main_func, url=url)
For complex objects, use the dill serializer:
ds = Dataset(function=my_func, url=url,
             serialiser='dill')
This requires pip install dill on both local and remote.
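The self-containment rules above follow from the function being executed away from the module that defined it. The sketch below imitates that situation with plain Python by executing a function's source text in an empty namespace. It illustrates the principle only and is not remotemanager's actual mechanism; `run_isolated` and the two source strings are hypothetical.

```python
# Source text of a function with its import inside the body
GOOD_SRC = '''
def compute(x):
    import math
    return math.sqrt(x)
'''

# Same function, but relying on an import that lived at module level
BAD_SRC = '''
def compute(x):
    return math.sqrt(x)
'''


def run_isolated(src, func_name, *args):
    """Execute src in a fresh namespace and call the named function."""
    namespace = {}
    exec(src, namespace)
    return namespace[func_name](*args)


good_result = run_isolated(GOOD_SRC, "compute", 9.0)

try:
    run_isolated(BAD_SRC, "compute", 9.0)
    bad_error = None
except NameError as exc:
    # The isolated namespace never saw the module-level import
    bad_error = str(exc)

print(good_result)
print(bad_error)
```

The second call fails with a NameError because nothing outside the function body exists in the namespace where it runs.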
from remotemanager import Dataset, URL
# Connection (or load from YAML, or use Computer for SLURM)
url = URL(host='FILL', user='FILL')
# Define the function -- all imports MUST be inside
def FILL(FILL):
    # FILL: imports
    # FILL: computation
    return FILL
# Create dataset
ds = Dataset(
    function=FILL,
    url=url,
    local_dir='FILL',    # local staging directory
    remote_dir='FILL',   # directory on the remote machine
)
# Add runs
ds.append_run(args={FILL})
# Execute
ds.run()
# Wait for completion
ds.wait(interval=FILL, timeout=FILL) # interval in seconds, timeout in seconds
# Retrieve results
ds.fetch_results()
print(ds.results)
# Check for errors
if ds.errors:
    print("Errors:", ds.errors)
    for r in ds.failed:
        print(r.full_error_)
| Parameter | Default | Description |
|---|---|---|
| function | (required) | Python callable to execute remotely |
| url | localhost | URL or Computer object |
| name | None | Dataset identifier (auto-generated if None) |
| local_dir | None | Local staging directory |
| remote_dir | None | Working directory on remote |
| run_dir | None | Subdirectory within remote_dir for each run |
| transport | 'rsync' | File transfer: 'rsync', 'scp', or 'cp' |
| serialiser | 'json' | Data encoding: 'json', 'yaml', 'dill', 'jsonpickle' |
| script | None | Jobscript template string (for SLURM) |
| skip | True | Reuse existing database on reinit |
| extra_files_send | [] | Files to upload with every run |
| extra_files_recv | [] | Files to download after every run |
| verbose | True | Print status messages |
| asynchronous | True | Run all runners in parallel (False for sequential) |
def energy(structure_file):
    from BigDFT.Calculators import SystemCalculator
    from BigDFT.Inputfiles import Inputfile
    inp = Inputfile.from_yaml('input.yaml')
    calc = SystemCalculator()
    # Use the uploaded structure file as the atomic positions for this run
    log = calc.run(input=inp, posinp=structure_file, name='calc', run_dir='.')
    return log.energy

ds = Dataset(function=energy, url=url,
             local_dir='local_energy', remote_dir='remote_energy')
ds.append_run(
    args={'structure_file': 'molecule.xyz'},
    extra_files_send=['molecule.xyz', 'input.yaml'],
)
ds.run()
ds.wait(interval=30, timeout=3600)
ds.fetch_results()
print("Energy:", ds.results[0])
def converge(hgrid, rmult_coarse, rmult_fine):
    from BigDFT.Inputfiles import Inputfile
    from BigDFT.Calculators import SystemCalculator
    inp = Inputfile()
    inp.set_hgrid(hgrid)
    inp.set_rmult(coarse=rmult_coarse, fine=rmult_fine)
    inp.set_xc('PBE')
    calc = SystemCalculator()
    log = calc.run(input=inp, name='conv', run_dir='.')
    return {'energy': log.energy, 'hgrid': hgrid, 'rmult': [rmult_coarse, rmult_fine]}

ds = Dataset(function=converge, url=url,
             local_dir='convergence', remote_dir='convergence')
# Parameter sweep over grid spacing
for h in [0.5, 0.45, 0.4, 0.35, 0.3]:
    ds.append_run(args={'hgrid': h, 'rmult_coarse': 5.0, 'rmult_fine': 8.0})
ds.run()
ds.wait(interval=60, timeout=7200)
ds.fetch_results()
for r in ds.results:
    print(f"hgrid={r['hgrid']:.2f}  energy={r['energy']:.6f}")
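Once the sweep results are back, one possible way to choose a grid spacing is to take the coarsest hgrid whose energy differs from the next finer grid by less than a tolerance. The sketch below assumes result dictionaries shaped like the ones returned by converge above; the helper `first_converged`, the tolerance, and the energies are hypothetical placeholders, not BigDFT output.

```python
def first_converged(results, tol):
    """Return the coarsest hgrid whose energy is within tol of the next finer grid."""
    # Sort from coarse (large hgrid) to fine (small hgrid)
    ordered = sorted(results, key=lambda r: r["hgrid"], reverse=True)
    for coarse, fine in zip(ordered, ordered[1:]):
        if abs(coarse["energy"] - fine["energy"]) < tol:
            return coarse["hgrid"]
    return None  # nothing converged within the tolerance


# Synthetic placeholder values for illustration only
example_results = [
    {"hgrid": 0.5, "energy": -10.0},
    {"hgrid": 0.45, "energy": -10.2},
    {"hgrid": 0.4, "energy": -10.25},
    {"hgrid": 0.35, "energy": -10.252},
    {"hgrid": 0.3, "energy": -10.2525},
]

converged_hgrid = first_converged(example_results, tol=0.005)
print(converged_hgrid)
```

In practice the tolerance would be chosen to match the accuracy the study needs.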
For large sweeps, lazy_append is more efficient:
with ds.lazy_append():
    for h in [0.5, 0.45, 0.4, 0.35, 0.3]:
        for rmult in [(4, 6), (5, 7), (5, 8), (6, 9)]:
            ds.append_run(args={
                'hgrid': h,
                'rmult_coarse': rmult[0],
                'rmult_fine': rmult[1],
            })
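For sweeps over several parameters, it can help to build the list of argument dictionaries first and inspect it before appending any runs. The sketch below uses itertools.product to reproduce the nested loops above; the variable names are illustrative, and each resulting dictionary would be passed as args to ds.append_run.

```python
from itertools import product

hgrids = [0.5, 0.45, 0.4, 0.35, 0.3]
rmults = [(4, 6), (5, 7), (5, 8), (6, 9)]

# One argument dictionary per (hgrid, rmult) combination, in loop order
run_args = [
    {"hgrid": h, "rmult_coarse": coarse, "rmult_fine": fine}
    for h, (coarse, fine) in product(hgrids, rmults)
]

print(len(run_args))
print(run_args[0])
```

Checking the length and the first few entries before submission is a cheap way to catch an accidental combinatorial blow-up.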
Set SLURM parameters per-run or globally:
from remotemanager.connection.computer import Computer
conn = Computer.from_yaml('my_cluster.yaml')
ds = Dataset(function=my_func, url=conn,
             local_dir='slurm_runs', remote_dir='slurm_runs')
# Set defaults for all runs
ds.set_run_arg('nodes', 1)
ds.set_run_arg('ntasks', 4)
ds.set_run_arg('walltime', '2h')
ds.set_run_arg('account', 'my_project')
# Or override per run
ds.append_run(args={'size': 'small'})
ds.append_run(args={'size': 'large'},
              nodes=4, ntasks=16, walltime='8h')
ds.run()
# Set one parameter for all runners
ds.set_run_arg('nodes', 2)
# Set multiple parameters at once
ds.set_run_args(nodes=2, ntasks=8, walltime='4h')
# Update (merge) parameters
ds.update_run_args({'account': 'proj123'})
Parameter priority (highest wins): run() call > per-runner args > Dataset defaults > Computer defaults > template defaults
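The priority order can be pictured as a layered lookup in which the first layer that defines a key wins. The sketch below models that with collections.ChainMap; it is a mental model only, not how remotemanager resolves parameters internally, and all dictionaries and values are hypothetical.

```python
from collections import ChainMap

# Hypothetical values at each level, listed from lowest to highest priority
template_defaults = {"nodes": 1, "ntasks": 1, "walltime": "1h"}
computer_defaults = {"ntasks": 4}
dataset_defaults = {"walltime": "2h", "account": "my_project"}
runner_args = {"nodes": 4}
run_call_args = {"walltime": "8h"}

# ChainMap searches its maps left to right, so the highest priority goes first
resolved = dict(ChainMap(
    run_call_args,
    runner_args,
    dataset_defaults,
    computer_defaults,
    template_defaults,
))

print(resolved)
```

Here walltime comes from the run() call, nodes from the per-runner arguments, account from the Dataset defaults, and ntasks from the Computer defaults.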
Send input files to the remote and retrieve output files:
ds.append_run(
    args={'input_file': 'data.txt'},
    extra_files_send=['data.txt', 'config.yaml'],
    extra_files_recv=['output.dat', 'result.log'],
)
Or set files for all runs:
ds = Dataset(function=my_func, url=url,
             extra_files_send=['shared_data.h5'],
             extra_files_recv=['result.json'])
Chain datasets so the output of one feeds into the next. Child functions receive parent results via the loaded variable:
def generate(n_atoms):
    import random
    positions = [[random.random()*10 for _ in range(3)] for _ in range(n_atoms)]
    return {'positions': positions, 'n_atoms': n_atoms}

def calculate(method):
    # `loaded` is automatically available -- contains parent's result
    positions = loaded['positions']  # noqa: F821
    # ... run calculation with positions ...
    return {'energy': -42.0, 'method': method}

def analyze(threshold):
    energy = loaded['energy']  # noqa: F821
    return energy < threshold

ds_gen = Dataset(function=generate, url=url,
                 local_dir='chain/gen', remote_dir='chain/gen')
ds_calc = Dataset(function=calculate, url=url,
                  local_dir='chain/calc', remote_dir='chain/calc')
ds_post = Dataset(function=analyze, url=url,
                  local_dir='chain/post', remote_dir='chain/post')
# Set up the chain
ds_gen.set_downstream(ds_calc)
ds_calc.set_downstream(ds_post)
# Append runs -- propagates through the chain
ds_gen.append_run(args={'n_atoms': 10})
ds_calc.set_run_args(method='PBE')
ds_post.set_run_args(threshold=-40.0)
# Run the first dataset -- downstream runs automatically when upstream completes
ds_gen.run()
ds_gen.wait(interval=10, timeout=300)
ds_gen.fetch_results()
# Now run downstream
ds_calc.run()
ds_calc.wait(interval=30, timeout=3600)
ds_calc.fetch_results()
ds_post.run()
ds_post.wait(interval=5, timeout=60)
ds_post.fetch_results()
print(ds_post.results)
Important: In chained functions, the variable loaded is injected automatically -- it contains the deserialized result from the upstream dataset. Do not define it yourself; just use it. Add # noqa: F821 to suppress linter warnings.
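To see why a chained function can use loaded without defining it, the sketch below places a parent result into the namespace in which the child function is defined and then calls it. This only illustrates the idea of an injected global and makes no claim about how remotemanager implements it; `run_with_loaded` and the source string are hypothetical.

```python
# Source of a child function that reads the injected `loaded` variable
CHILD_SRC = '''
def analyze(threshold):
    energy = loaded["energy"]
    return energy < threshold
'''


def run_with_loaded(src, func_name, parent_result, **kwargs):
    """Define the function in a namespace that already contains `loaded`."""
    namespace = {"loaded": parent_result}
    exec(src, namespace)
    return namespace[func_name](**kwargs)


# Stand-in for the upstream dataset's result
parent_result = {"energy": -42.0, "method": "PBE"}

below_threshold = run_with_loaded(CHILD_SRC, "analyze", parent_result, threshold=-40.0)
print(below_threshold)
```

Calling analyze locally without such a namespace raises a NameError, which is why chained functions are best tested with a stand-in for loaded.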
The ds.run() call executes three stages internally:
1. stage() -- generates all files locally (run scripts, jobscripts, master script)
2. transfer() -- uploads files to remote via rsync/scp
3. execute() -- runs the master script via SSH, which submits each runner

You can also call these individually:
ds.stage()
ds.transfer()
# ... inspect files before executing ...
ds.execute()
Each runner progresses through these states:
created -> staged -> transferred -> submitted -> started -> completed (or failed) -> satisfied
Check states:
for i, runner in enumerate(ds.runners):
    print(f"Run {i}: {runner.state_}")
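Because the states form an ordered chain, a small helper can answer questions such as "which runs have at least been submitted?". The sketch below assumes each runner's state can be reduced to its name as a plain string, which may not match the type of runner.state_; the helper `has_reached` and the example list are hypothetical.

```python
from collections import Counter

# Order taken from the state chain above; "failed" shares the rank of "completed"
ORDER = ["created", "staged", "transferred", "submitted",
         "started", "completed", "satisfied"]
RANK = {name: index for index, name in enumerate(ORDER)}
RANK["failed"] = RANK["completed"]


def has_reached(state, milestone):
    """True if `state` is at or beyond `milestone` in the lifecycle."""
    return RANK[state] >= RANK[milestone]


# Hypothetical state names, e.g. collected from the runners of a dataset
example_states = ["completed", "started", "failed", "staged"]

submitted_or_later = [s for s in example_states if has_reached(s, "submitted")]
counts = Counter(example_states)

print(submitted_or_later)
print(counts["failed"])
```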
# Check for errors
print(ds.errors) # list of error summaries (one per runner)
print(ds.failed) # list of failed Runner objects
# Get full traceback for a failed runner
for runner in ds.failed:
    print(runner.full_error_)
# Retry all failed runs
ds.retry_failed()
ds.run()
ds.wait(interval=10, timeout=300)
ds.fetch_results()
# Remove a specific run and re-add with corrected args
ds.remove_run(3) # remove by index
ds.append_run(args={'corrected': 'value'})
Datasets auto-persist to YAML database files. This means you can restart your notebook/script and the dataset will pick up where it left off:
# First session
ds = Dataset(function=my_func, url=url, name='experiment1',
             local_dir='exp1', remote_dir='exp1')
ds.append_run(args={'x': 42})
ds.run()
# ... close notebook ...
# Later session -- automatically restores state because skip=True (default)
ds = Dataset(function=my_func, url=url, name='experiment1',
             local_dir='exp1', remote_dir='exp1')
ds.wait(interval=10, timeout=300)
ds.fetch_results()
print(ds.results)
# Full backup (database + all result files)
ds.backup('experiment_backup.zip', full=True)
# Lightweight backup (database only)
ds.pack('experiment_snapshot')
# Restore from backup
ds_restored = Dataset.restore('experiment_backup.zip')
# Clean up everything (local dirs, remote dirs, database)
ds.hard_reset()
| Serializer | Install | Use When |
|---|---|---|
| 'json' | built-in | Default. Simple types (numbers, strings, lists, dicts) |
| 'yaml' | built-in | Same as JSON, YAML formatting |
| 'dill' | pip install dill | Complex objects, custom classes, lambdas |
| 'jsonpickle' | pip install jsonpickle | Complex objects with JSON readability |
# For complex return types
ds = Dataset(function=my_func, url=url, serialiser='dill')
Important: If using dill or jsonpickle, the package must be installed on both local and remote machines.
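The difference between the simple and the complex serializers can be seen with the standard library alone. The sketch below uses pickle as a stand-in for dill (an assumption made so the example runs without extra packages) and shows two typical JSON limitations: some types cannot be encoded at all, and tuples come back as lists.

```python
import json
import pickle

value = {1, 2, 3}  # a set is not a JSON type

try:
    json.dumps(value)
    json_error = None
except TypeError as exc:
    json_error = str(exc)

# A pickle-style serializer round-trips the set unchanged
restored = pickle.loads(pickle.dumps(value))

# JSON silently converts tuples to lists on the way back
round_tripped = json.loads(json.dumps((1, 2)))

print(json_error)
print(restored == value)
print(round_tripped)
```

If a function returns tuples, sets, or custom objects, either convert them to lists and dicts before returning or choose one of the richer serializers.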
For one-off remote function calls without manually creating a Dataset:
from remotemanager import SanzuFunction, URL
@SanzuFunction(url=URL(host='cluster', user='jdoe'))
def compute(x, y):
    import math
    return math.sqrt(x**2 + y**2)
result = compute(x=3, y=4) # transparently runs on remote
print(result) # 5.0
For running notebook cells on a remote machine:
%load_ext remotemanager
Then in a cell:
%%sanzu url = URL(host='cluster', user='jdoe')
%%sanzu local_dir = "local_magic"
%%sanzu remote_dir = "remote_magic"
%%sargs x = 42
%%spull result
import math
result = math.sqrt(x)
After execution, result is available in the notebook namespace.
import time

ds.run()
# Poll until every runner has finished, printing progress on each pass
while not all(r.is_finished_ for r in ds.runners):
    ds.fetch_results()
    done = sum(1 for r in ds.runners if r.is_finished_)
    total = len(ds.runners)
    print(f"Progress: {done}/{total}")
    time.sleep(30)
ds.fetch_results()
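The loop above waits indefinitely if a run never finishes. A generic polling helper with a deadline is sketched below; `poll_until` is a hypothetical utility, not part of remotemanager, and the check function stands in for a test such as "all runners are finished".

```python
import time


def poll_until(check, interval, timeout):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a completion test: becomes true on the third call
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


finished = poll_until(fake_check, interval=0.01, timeout=1.0)
timed_out = poll_until(lambda: False, interval=0.01, timeout=0.05)

print(finished, calls["n"])
print(timed_out)
```

With a real dataset, the check could wrap the all(...) expression from the loop above, and the interval would be tens of seconds rather than milliseconds.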
ds.fetch_results()
# Pair each runner's arguments with its result
for i, (runner, result) in enumerate(zip(ds.runners, ds.results)):
    print(f"Run {i}: args={runner.args_} result={result}")
ds_new = Dataset(function=new_func, url=url,
                 local_dir='new', remote_dir='new')
ds_new.copy_runners(ds_old) # copies all runner args
Before submitting BigDFT calculations to a remote HPC system, always validate locally with a dry run. This catches input errors, missing pseudopotentials, and incorrect parameters in seconds instead of discovering them after waiting in a queue.
If the Dataset function runs BigDFT, structure it so you can test locally first:
from BigDFT.Calculators import SystemCalculator
# Test locally with dry run before creating the Dataset.
# `inp` is the Inputfile you intend to build inside the remote function.
calc_dry = SystemCalculator(dry_run=True)
log_dry = calc_dry.run(input=inp, name='test', run_dir='local_test')
# If this fails, fix the input before submitting remotely
# Only after dry run passes, create the remote Dataset
ds = Dataset(function=my_bigdft_func, url=url, ...)
For functions that aren't easily testable locally (e.g., they depend on remote-only software), at minimum verify that the input files are well-formed before ds.run().
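A minimal pre-flight check along those lines is sketched below: it reports input files that are missing or empty before anything is uploaded. The helper `missing_inputs` is hypothetical, and the demonstration runs in a temporary directory with placeholder files so that it works anywhere; it does not validate the file contents.

```python
import tempfile
from pathlib import Path


def missing_inputs(paths):
    """Return the paths that do not exist as files or are empty."""
    return [
        str(p) for p in map(Path, paths)
        if not p.is_file() or p.stat().st_size == 0
    ]


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "input.yaml"
    good.write_text("placeholder: 1\n")   # non-empty file, passes the check
    empty = Path(tmp) / "molecule.xyz"
    empty.touch()                          # exists but has no content
    absent = Path(tmp) / "missing.dat"     # never created

    problems = missing_inputs([good, empty, absent])
    problem_names = [Path(p).name for p in problems]

print(problem_names)
```

Running such a check on the extra_files_send list and refusing to submit when it returns anything costs a fraction of a second compared with a failed queue job.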
- skip=True (default) means reinitializing a Dataset with the same function and name will restore state from the database file. Set skip=False to force a fresh start.
- Database files are named dataset-{8char-hex}.yaml based on a hash of the function. Use name= for human-readable identification.
- ds.run() is asynchronous by default -- all runners execute simultaneously. Use asynchronous=False for sequential execution.
- local_dir and remote_dir should generally be unique per dataset to avoid file collisions.
- ds.wait() polls the remote for completion. The interval parameter controls how often it checks (in seconds). Don't set it too low, or SSH overhead will dominate.
- ds.results contains None for runners that haven't completed yet. Always call ds.fetch_results() before accessing ds.results.