Guidance for writing, reviewing, and running AI-generated code responsibly on Bristol Centre for Supercomputing (BriCS) shared HPC resources: Isambard-AI and Isambard 3. Use this skill whenever generating, adapting, or debugging code intended to run on BriCS facilities, writing Slurm job scripts, managing storage, installing software, or ensuring compliance with BriCS policies.
Slash command: /isambard:docs
This skill guides agents in producing correct, safe, and policy-compliant code for the Bristol Centre for Supercomputing (BriCS) HPC facilities.
Always follow these rules when generating or suggesting code for BriCS systems.
| Principle | Rule |
|---|---|
| Shared resource respect | Never generate code that runs heavy workloads directly on login nodes |
| Accurate resource requests | Always estimate realistic --time, --gpus, --ntasks in job scripts |
| Storage awareness | Use the correct storage area; never assume data persists after project end |
| Policy compliance | All generated code must be consistent with the BriCS Acceptable Use Policy |
| Architecture awareness | Isambard-AI and Isambard 3 Grace are aarch64 (Arm64); MACS has mixed archs |
| Verify before submit | Always review AI-generated scripts before sbatch—especially resource flags |
| Task | Correct approach |
|---|---|
| Run a short test command | srun --time=00:05:00 [--gpus=1] <cmd> |
| Run a batch workload | Write a script; submit with sbatch |
| Install Python packages | Conda (Miniforge) or uv; never pip install --user in $HOME |
| Share data with project members | Write to $PROJECTDIR |
| Share data with all users | Write to $PROJECTDIR_PUBLIC |
| Temporary/intermediate data | Use $SCRATCHDIR (auto-deleted after 60 days on Isambard 3) |
| Fast in-job scratch | Use $LOCALDIR (wiped at job end) |
| Long job (>24h) | Break into chained jobs with --dependency=afterok:<JOBID> |
| Check quota | lfs quota -hp $(lfs project -d $SCRATCHDIR \| awk '{print $1}') $SCRATCHDIR |
Each GPU requested allocates 1 Grace Hopper Superchip = 1 GH200 GPU + 72 CPU cores + 115 GiB RAM.
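The per-GPU figures above scale linearly with the number of GPUs requested. As a quick sanity check before choosing --gpus, the arithmetic can be sketched as a small shell function. The name gh200_allocation is a hypothetical helper, not a BriCS tool; the 72-core and 115 GiB figures are the ones stated above.

```shell
#!/bin/bash
# gh200_allocation GPUS
# Hypothetical helper: prints the CPU cores and RAM that come with GPUS
# Grace Hopper Superchips, using 72 cores and 115 GiB per GPU.
gh200_allocation() {
  local gpus="$1"
  # Accept positive integers only
  if ! [[ "$gpus" =~ ^[1-9][0-9]*$ ]]; then
    echo "usage: gh200_allocation <positive integer>" >&2
    return 1
  fi
  echo "gpus=${gpus} cpus=$(( gpus * 72 )) mem_gib=$(( gpus * 115 ))"
}
```

For example, gh200_allocation 4 reports 288 CPU cores and 460 GiB of RAM, which is worth comparing against what the workload can actually use.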
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=my_job_%j.out
#SBATCH --gpus=1 # Required: always specify GPU resource
#SBATCH --time=01:00:00 # Required: set a realistic time limit (max 24h)
module load cray-python # Or activate your Conda/venv environment
python3 my_script.py
Gotchas for Isambard-AI:
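- Always specify --gpus (or --gpus-per-*). Jobs without GPU directives will fail.
- The default partition is workq. Do not specify a partition unless you have a reason.
- The maximum walltime is 24 hours. For longer work, chain jobs with --dependency=afterok:<JOBID>.
- Jobs are subject to a QoS limit (32gpu_qos).

Isambard 3 job script:
#!/bin/bash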
#SBATCH --job-name=my_cpu_job
#SBATCH --output=my_cpu_job_%j.out
#SBATCH --ntasks=4
#SBATCH --time=02:00:00 # Max 24h on grace partition
module load cray-python
srun python3 my_script.py
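Both job scripts above note a 24-hour walltime cap. A minimal sketch of the arithmetic for deciding how many chained jobs a longer workload needs is shown here. It assumes that only the HH:MM:SS and D-HH:MM:SS forms of a Slurm time specification are used and that every chained job can run for the full 24 hours. The function names walltime_seconds and chained_jobs_needed are hypothetical.

```shell
#!/bin/bash
# walltime_seconds SPEC
# Converts HH:MM:SS or D-HH:MM:SS to seconds; returns 1 on any other form.
walltime_seconds() {
  local spec="$1" days=0 h m s
  if [[ "$spec" == *-* ]]; then
    days="${spec%%-*}"   # part before the first dash
    spec="${spec#*-}"    # part after the first dash
  fi
  IFS=: read -r h m s <<< "$spec"
  [[ "$days" =~ ^[0-9]+$ && "$h" =~ ^[0-9]+$ && "$m" =~ ^[0-9]+$ && "$s" =~ ^[0-9]+$ ]] || return 1
  # 10# forces base 10 so that values such as 08 are not read as octal
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# chained_jobs_needed SPEC
# Number of jobs of at most 24 hours needed to cover SPEC (rounded up).
chained_jobs_needed() {
  local total limit=$(( 24 * 3600 ))
  total=$(walltime_seconds "$1") || return 1
  echo $(( (total + limit - 1) / limit ))
}
```

A workload estimated at 2-12:00:00 (60 hours) would need three chained jobs under these assumptions, each saving state for the next one to restore.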
# Run two job steps concurrently on separate GPUs
srun --ntasks=1 --gpus=1 --exclusive step_a.sh &
srun --ntasks=1 --gpus=1 --exclusive step_b.sh &
wait
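A bare wait returns success even when one of the background steps failed, so a failure in step_a.sh or step_b.sh can go unnoticed. One way to surface such failures is to wait on each process ID in turn. This is a minimal sketch; run_steps_concurrently is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# run_steps_concurrently CMD...
# Starts each CMD string in the background, waits for all of them,
# and returns 1 if any of them exited with a non-zero status.
run_steps_concurrently() {
  local pids=() pid cmd status=0
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    # wait PID returns the exit status of that specific process
    wait "$pid" || status=1
  done
  return "$status"
}
```

Inside a job script this could be called as run_steps_concurrently "srun --ntasks=1 --gpus=1 --exclusive step_a.sh" "srun --ntasks=1 --gpus=1 --exclusive step_b.sh", so the job's exit status reflects a failed step.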
# Chain jobs that save/restore state
JOBID_1=$(sbatch --parsable job_part1.sh)
JOBID_2=$(sbatch --parsable --dependency=afterok:${JOBID_1} job_part2.sh)
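The two-job chain above extends to any number of scripts. The sketch below submits each script with an afterok dependency on the previous one. The function name submit_chain and the SBATCH_CMD override (useful for dry runs with a stand-in command) are assumptions made for illustration; only sbatch --parsable and --dependency=afterok come from the example above.

```shell
#!/bin/bash
# submit_chain SCRIPT...
# Submits the scripts in order; each job starts only if the previous one
# completed successfully. Prints one job ID per line.
# SBATCH_CMD is a hypothetical override for the submit command (default: sbatch).
submit_chain() {
  local sbatch_cmd="${SBATCH_CMD:-sbatch}"
  local prev="" jobid script
  for script in "$@"; do
    if [ -z "$prev" ]; then
      jobid=$("$sbatch_cmd" --parsable "$script") || return 1
    else
      jobid=$("$sbatch_cmd" --parsable --dependency="afterok:${prev}" "$script") || return 1
    fi
    # --parsable can print "jobid;cluster"; keep only the job ID
    jobid="${jobid%%;*}"
    echo "$jobid"
    prev="$jobid"
  done
}
```

For example, submit_chain job_part1.sh job_part2.sh job_part3.sh submits three jobs. If any submission fails, the function stops and returns 1, leaving the jobs already queued in place.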
All storage is working storage — not backed up. Data is deleted at project end.
| Variable | Path | Purpose | Quota (Isambard-AI) | Retention |
|---|---|---|---|---|
| $HOME | /home/<PROJECT>/<USER>.<PROJECT> | Config files, scripts, job outputs | 100 GiB | Project end |
| $SCRATCHDIR | /scratch/<PROJECT>/<USER>.<PROJECT> | Intermediate job data, containers | 5 TiB | 60 days (Isambard 3) / Project end (Isambard-AI) |
| $PROJECTDIR | /projects/<PROJECT> | Shared datasets, shared environments | 200 TiB | Project end |
| $PROJECTDIR_PUBLIC | /projects/public/<PROJECT> | Data readable by all users | 200 TiB | Project end |
| $LOCALDIR | /local/user/<UID> | Fast RAM-backed in-job scratch | 48 GiB (compute) | End of job/session |
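Because scratch data on Isambard 3 is removed after 60 days, it can help to list files that are approaching that age before they disappear. The sketch below assumes that file age is measured by modification time, which the table does not state, and uses a hypothetical helper name, list_files_older_than. The 50-day threshold in the usage note is an arbitrary safety margin.

```shell
#!/bin/bash
# list_files_older_than DIR DAYS
# Prints regular files under DIR that were last modified more than DAYS days ago.
# Assumption: modification time is the relevant age; the purge policy may differ.
list_files_older_than() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime "+${days}" -print
}
```

A call such as list_files_older_than "${SCRATCHDIR}" 50 gives a list of candidates to move to $PROJECTDIR or delete. On a large directory tree this walk is itself I/O heavy, so run it sparingly.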
Critical reminders:
- Never use /tmp directly. /tmp is node-local, may have very limited space, and is not reliably available or cleaned up across the cluster. Always reference storage locations through their environment variables ($SCRATCHDIR, $LOCALDIR, etc.) and use set -eu at the top of job scripts to catch unset variables early:
#!/bin/bash
set -eu # Exit on error (-e); treat unset variables as errors (-u)
WORKDIR="${SCRATCHDIR}/myjob_${SLURM_JOB_ID}"
mkdir -p "${WORKDIR}"
# ... your work here ...
# Explicitly clean up at end of job — do not rely on automated deletion
rm -rf "${WORKDIR}"
- $HOME is for scripts and configs, not large datasets.
- $LOCALDIR on compute nodes is a tmpfs RAM disk: very fast but limited.
- Never assume $LOCALDIR data survives between jobs.

Login nodes are shared and must not be used for compute-intensive or long-running work.
| Allowed on login node | NOT allowed on login node |
|---|---|
| Editing files | Running model training |
| Compiling small programs | Running data preprocessing pipelines |
| Submitting/monitoring jobs | Running benchmarks or tests |
| File transfer and compression | Long python / bash loops |
| Building containers | Using watch with squeue -i repeatedly |
Using squeue -i or watch squeue excessively disrupts all users and is a breach of the Acceptable Use Policy. Use squeue --me once to check, or set a reasonable interval.
# Install Miniforge (once per user)
cd $HOME
curl --location --remote-name \
"https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
rm Miniforge3-$(uname)-$(uname -m).sh
# Activate (each session — do NOT use conda init)
source ~/miniforge3/bin/activate
# Create and use isolated environments
conda create -n myenv python=3.11
conda activate myenv
conda install <PACKAGE>
Gotchas:
- Do not use conda init; it modifies shell startup scripts and causes problems.
- Avoid pip install --user; it installs into $HOME/.local, which is shared across architectures (aarch64 and x86_64 on Isambard 3). Use venvs instead.
# In .bashrc or setup scripts, guard architecture-specific code:
if [ "$(arch)" == "x86_64" ]; then
source ~/miniforge3/bin/activate # x86_64 env
elif [ "$(arch)" == "aarch64" ]; then
source ~/miniforge3_arm/bin/activate # aarch64 env
fi
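The same guard can be written as a function that maps an architecture name to an install prefix, which makes the unsupported case explicit and keeps the mapping in one place. This is a sketch only: miniforge_prefix_for is a hypothetical name, and the miniforge3 and miniforge3_arm directories are the ones used in the guard above.

```shell
#!/bin/bash
# miniforge_prefix_for ARCH
# Prints the Miniforge install prefix for ARCH, or returns 1 if ARCH is unknown.
miniforge_prefix_for() {
  case "$1" in
    x86_64)  echo "${HOME}/miniforge3" ;;
    aarch64) echo "${HOME}/miniforge3_arm" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Possible use in a setup script (left commented out so that loading this
# file has no side effects):
# prefix=$(miniforge_prefix_for "$(uname -m)") && source "${prefix}/bin/activate"
```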
module avail # List available modules
module load cray-python # Load Cray Python (pre-installed)
# Install Clifton (Linux)
curl -L https://github.com/isambard-sc/clifton/releases/latest/download/clifton-linux-musl-x86_64 -o clifton
mkdir -p ~/.local/bin # Make sure the target directory exists
chmod u+x clifton && mv clifton ~/.local/bin/
# Authenticate (required daily)
clifton auth
# Write SSH config (only needed when added to a new project)
clifton ssh-config write
# Connect
ssh <PROJECT>.aip2.isambard # Isambard-AI Phase 2
ssh <PROJECT>.3.isambard # Isambard 3
Gotchas:
- Run clifton auth each day.
- Do not leave persistent tmux/screen sessions on login nodes; they violate security policy and may be terminated without warning.

squeue --me # View your running/pending jobs
sacct # View current and completed jobs
scancel <JOBID> # Cancel a job
salloc --gpus=1 --time=00:30:00 # Reserve a node interactively (always set --time)
Gotchas:
- Release salloc allocations with scancel <JOBID> when finished.
- Use --time-min and --time together to allow backfill scheduling.
- Large job arrays (--array) can strain the scheduler; prefer chained job steps where possible.

Before submitting any AI-generated code or job script to BriCS:
- Resource flags --time, --gpus, --ntasks match your actual workload
- Large data is written to $SCRATCHDIR or $PROJECTDIR, not $HOME
- No pip install --user
- tmux/screen used only within a job, not left on login nodes
- Check sacct or output files after a job completes

| Mistake | Consequence | Fix |
|---|---|---|
| No --gpus on Isambard-AI | Job fails | Always include --gpus=1 (or more) |
| Running compute on login node | Account suspension | Use sbatch/srun |
| pip install --user across archs | Package conflicts | Use Conda env or venv |
| conda init in .bashrc | Shell startup failures | Use source ~/miniforge3/bin/activate |
| Leaving data in $SCRATCHDIR for >60 days (Isambard 3) | Data deleted | Move to $PROJECTDIR or back up |
| Persistent tmux on login node | Session terminated | Submit long jobs via Slurm |
| Using watch with any Slurm command | Disrupts scheduler for all users; AUP violation | Never combine watch with squeue, sinfo, sacct, or similar; check once manually |
| Using /tmp directly in scripts | /tmp is node-local, not guaranteed to exist, and not cleaned up reliably | Use $SCRATCHDIR, $LOCALDIR, or a subdirectory of a known env variable |
| Leaving temp files in $SCRATCHDIR or $LOCALDIR after a job | Wastes quota; may cause future jobs to fail on space | Explicitly delete temp files at the end of your job script |
| Raising a support ticket for a known outage | Unnecessary load on the helpdesk | Always check https://status.isambard.ac.uk before submitting a ticket |
| Forgetting clifton auth | SSH fails | Run daily before connecting |
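Several of the checks and mistakes listed above can be caught mechanically before running sbatch. The following is a rough sketch of such a pre-submit check, not an official BriCS tool. The function name check_job_script and the "isambard-ai" label are hypothetical, the patterns are simple heuristics that miss forms such as the short -t flag, and a clean result does not replace reading the script.

```shell
#!/bin/bash
# check_job_script FILE [SYSTEM]
# Prints one WARN line per problem found and returns 1 if there were any.
# SYSTEM is a hypothetical label; "isambard-ai" additionally requires --gpus.
check_job_script() {
  local script="$1" system="${2:-}" problems=0 body

  # Commands only: drop comment lines (this also drops #SBATCH directives)
  body=$(grep -Ev '^[[:space:]]*#' "$script" || true)

  if ! grep -Eq '^#SBATCH[[:space:]]+--time[[:space:]=]' "$script"; then
    echo "WARN: no #SBATCH --time directive"; problems=$((problems + 1))
  fi
  if [ "$system" = "isambard-ai" ] && ! grep -Eq '^#SBATCH[[:space:]]+--gpus' "$script"; then
    echo "WARN: no #SBATCH --gpus directive"; problems=$((problems + 1))
  fi
  if grep -Eq 'pip3?[[:space:]]+install[[:space:]].*--user' <<< "$body"; then
    echo "WARN: pip install --user found"; problems=$((problems + 1))
  fi
  if grep -Eq 'conda[[:space:]]+init' <<< "$body"; then
    echo "WARN: conda init found"; problems=$((problems + 1))
  fi
  if grep -Eq '(^|[[:space:]="])/tmp(/|[[:space:]"]|$)' <<< "$body"; then
    echo "WARN: direct use of /tmp"; problems=$((problems + 1))
  fi
  if grep -Eq '(^|[[:space:];|&])watch[[:space:]]' <<< "$body"; then
    echo "WARN: watch command found"; problems=$((problems + 1))
  fi

  [ "$problems" -eq 0 ]
}
```

A typical call would be check_job_script my_job.sh isambard-ai && sbatch my_job.sh, so that submission only happens when no warning was raised.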
npx claudepluginhub isambard-sc/skills.isambard.ac.uk --plugin isambard