From research-skills
Use when working on PSC Bridges-2 — SSH/login to bridges2.psc.edu, SLURM job submission (sbatch, srun, interact), partitions (RM, RM-shared, RM-512, EM, GPU, GPU-shared, ROBO H100), GPU types (h100-80, l40s-48, v100-32, v100-16), allocations/SU accounting, parallel-allocation racing to cut queue wait (submit to multiple allocations/partitions, cancel losers on first start), Ocean/jet filesystems, $LOCAL/$RAMDISK, modules, Singularity/Apptainer containers (Docker→SIF via bundled scripts/singularity_pull_docker_local.sh + scripts/start_sif.sh), pinned-workspace rsync bridge via scripts/sync_local_to_psc.sh + scripts/sync_psc_to_local.sh (driven by .psc-config, safety-checked), Rerun port-forwarding, AirLab (<allocation-id>) workflows, or data transfers via data.bridges2.psc.edu.
Slash command: /research-skills:psc-bridges2
Reference for running work on the Pittsburgh Supercomputing Center's Bridges-2 system. See partitions.md, job-scripts.md, filesystems.md, airlab-fast-setup.md (AirLab-specific workflow: Singularity, ROBO/H100, Rerun, <allocation-id>), docker-to-singularity.md (running Docker images on PSC via the bundled scripts/singularity_pull_docker_local.sh + scripts/start_sif.sh), and remote-workspace.md (pinned rsync bridge: .psc-config + scripts/sync_local_to_psc.sh / scripts/sync_psc_to_local.sh, with mandatory workspace safety checks).
- ssh [email protected] (port 22, HPN-SSH supported)
- Password changes: apr.psc.edu (8+ chars, 3 of 4 char groups, annual change)
- File transfers: use the data.bridges2.psc.edu DTNs, not login nodes
- projects: list allocations, balances, IDs, Ocean usage
- my_quotas: check $HOME and $PROJECT quotas
- -A <allocation-id>: charge a job to a specific allocation
- newgrp <group>: temporary Unix group switch
- change_primary_group <account-id>: permanent default switch

Service Unit (SU) charging:
- projects is always negative (e.g. -168,249 / 0 SU); this is normal and the allocation remains usable

| Variable | Path | Quota | Backup | Notes |
|---|---|---|---|---|
| $HOME | /jet/home/<user> | 25 GB | Daily | Scripts, source |
| $PROJECT | /ocean/projects/<group>/<PSC-user> | Per-allocation | None | Ocean; 6,070 inodes/GB |
| $LOCAL | Node-local scratch | Node-dependent | n/a | Fast I/O, wiped at job end |
| $RAMDISK | RAM-backed | Depends on memory req | n/a | Lost on abnormal exit |
Exceeding $PROJECT quota blocks job submission.
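Because $LOCAL is wiped at job end, a common pattern is to stage inputs in, run there, and copy results back before the job exits. The following is a minimal sketch of that pattern, not part of the bundled scripts; the function name `run_in_local` and its argument layout are hypothetical.

```shell
# Stage inputs into node-local scratch, run a command there, and copy results
# back to the workspace before $LOCAL is wiped at job end.
# Usage: run_in_local <workspace_dir> <input_subdir> <output_subdir> <command...>
# (hypothetical helper; names and layout are assumptions)
run_in_local() (
    workspace=$1; input=$2; output=$3
    shift 3
    # Fail early when LOCAL is unset, i.e. when not running inside a job.
    scratch="${LOCAL:?LOCAL is not set (run inside a job)}/run.$$"
    mkdir -p "$scratch/$output" || exit 1
    cp -R "$workspace/$input" "$scratch/" || exit 1
    # Run the command from inside scratch and remember its exit status.
    ( cd "$scratch" && "$@" )
    status=$?
    # Copy results back even if the command failed, so partial output survives.
    mkdir -p "$workspace/$output"
    cp -R "$scratch/$output/." "$workspace/$output/"
    rm -rf "$scratch"
    exit "$status"
)
```

Inside a job script this might be called as `run_in_local "$PSC_WORKSPACE" data results python train.py`, where the command reads from `data/` and writes to `results/` relative to its working directory.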
Ask the user to indicate the workspace — an absolute Ocean path under /ocean/projects/<group>/<PSC-user>/... — before doing anything on PSC. Do not guess. If .psc-config exists locally, read PSC_WORKSPACE and confirm with the user.
Allowed paths:
- Avoid ~ / $HOME whenever possible (use it only for things that belong there, e.g. SSH keys, dotfiles, module config).
- Compute nodes (interact / sbatch): the pinned Ocean workspace and $LOCAL. Nothing else.
- /tmp is never allowed, on any node. If a tool defaults there, redirect it (TMPDIR, APPTAINER_TMPDIR, cache flags).

If the workspace has not been specified yet, stop and ask: don't cd, mkdir, download, singularity pull, or sbatch first.
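One way to honor the /tmp rule is to set the temp and cache variables once, before any tool runs. This is a sketch under assumptions: `redirect_tmp` is a hypothetical helper, and the `tmp` and `apptainer-cache` subdirectory names are illustrative choices, not something the skill prescribes.

```shell
# Point temp and cache directories away from /tmp.
# Usage: redirect_tmp <absolute_base_dir>   (e.g. "$LOCAL" or the pinned workspace)
# Hypothetical helper; the subdirectory names below are assumptions.
redirect_tmp() {
    case "$1" in
        /tmp|/tmp/*)
            echo "redirect_tmp: refusing base under /tmp: '$1'" >&2
            return 1 ;;
        /*) ;;  # absolute path outside /tmp: accepted
        *)
            echo "redirect_tmp: base must be an absolute path: '$1'" >&2
            return 1 ;;
    esac
    TMPDIR="$1/tmp"
    APPTAINER_TMPDIR="$1/tmp"
    APPTAINER_CACHEDIR="$1/apptainer-cache"
    export TMPDIR APPTAINER_TMPDIR APPTAINER_CACHEDIR
}
```

The function only sets variables; create the directories before first use, for example with `mkdir -p "$TMPDIR" "$APPTAINER_CACHEDIR"`.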
- module avail: list available
- module load <name> / module unload <name> / module list
- module spider <name>: search
- bioinformatics lists bio software

| Partition | Node type | Cores/Node | Max nodes | Max time | Notes |
|---|---|---|---|---|---|
| RM | 256GB RM | 128 | 64 | 72h | Full-node; charges all 128 cores |
| RM-shared | 256GB RM | 1–64 | 1 | 72h | 2 GB/core |
| RM-512 | 512GB RM | 128 | 2 | 72h | Large memory |
| EM | 4TB EM | 96 | 1 | 120h | Request 24/48/72/96 cores; no interactive |
| GPU | h100-80/l40s-48/v100-32/v100-16 | 8 or 16 GPUs | 4 | 48h | Full-node |
| GPU-shared | same GPU types | ≤4 GPUs | 1 | 48h | Partial node |
See partitions.md for full specs.
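The difference between RM and RM-shared in the table is easiest to see as arithmetic. The sketch below estimates charged core-hours for whole-hour jobs on those two partitions only; `rm_core_hours` is a hypothetical helper, and converting core-hours to SUs is left to the allocation's actual rate, which this section does not state.

```shell
# Estimate core-hours charged for an RM or RM-shared job (whole hours only).
# Usage: rm_core_hours <partition> <nodes> <cores_per_node> <hours>
# Hypothetical helper; based only on the "charges all 128 cores" note above.
rm_core_hours() (
    partition=$1; nodes=$2; cores=$3; hours=$4
    case "$partition" in
        RM)        cores=128 ;;  # full-node: all 128 cores are charged
        RM-shared) ;;            # charged for the cores actually requested
        *)
            echo "rm_core_hours: unsupported partition '$partition'" >&2
            exit 1 ;;
    esac
    echo $(( nodes * cores * hours ))
)
```

For example, a 4-core, 5-hour job comes to `rm_core_hours RM 1 4 5` = 640 core-hours on RM but `rm_core_hours RM-shared 1 4 5` = 20 on RM-shared.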
interact # RM-shared default: 1 core, 60 min
interact -p RM-shared --ntasks-per-node=32 -t 5:00:00
interact -p GPU-shared --gres=gpu:v100-32:1 -t 2:00:00
Options: -p, -t, -N, --ntasks-per-node, --gres=gpu:<type>:<n>, -A.
Important constraints when using interact:
- Maximum walltime is 8 hours (-t 8:00:00), regardless of the partition's batch limit (RM/RM-shared 72h and GPU/GPU-shared 48h apply only to sbatch). Default is 60 min, and idle sessions are auto-logged-out after 30 min, so always pass -t explicitly.
- EM has no interactive access; use sbatch for EM.
- One interact session per user (observed: a second interact while one is already running will queue behind it; not explicitly documented in the PSC user guide). Before launching, check squeue -u $USER for an existing interactive job and reuse it (re-attach via the original terminal / tmux) instead of spawning another.
- interact (up to 8h) is often the right tool, but if the work needs >8h or must survive disconnects, use sbatch.
- Run interact in tmux/screen on the login node so SSH drops don't kill the session.

sbatch -p RM -t 5:00:00 -N 1 script.job
squeue -u $USER
scancel <jobid>
Output: slurm-<jobid>.out. States: PD (pending), R (running), CA (cancelled), F (failed).
Interactive uses --gres=gpu:<type>:<n>; batch uses --gpus=<type>:<n> (multiple of 8 on GPU full-node).
See job-scripts.md for ready-to-adapt templates (RM, RM-shared, EM, GPU, GPU-shared, MPI, OpenMP).
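The 8-hour interact cap described earlier can be checked before choosing between interact and sbatch. A minimal sketch, assuming the walltime is written as HH:MM:SS (other SLURM time formats are deliberately not handled); `interact_time_ok` is a hypothetical name.

```shell
# Succeed when an HH:MM:SS walltime fits within the 8-hour interact limit.
# Exit status: 0 = fits, 1 = too long, 2 = not in HH:MM:SS form.
# Hypothetical helper; only the HH:MM:SS format is supported.
interact_time_ok() {
    printf '%s\n' "$1" | awk -F: '
        NF != 3 { exit 2 }                               # not HH:MM:SS
        { exit ($1 * 3600 + $2 * 60 + $3 > 8 * 3600) }   # 1 when over 8 hours
    '
}
```

Typical use: `if interact_time_ok "$T"; then interact -p RM-shared -t "$T"; else sbatch -p RM-shared -t "$T" script.job; fi`.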
Queues on Bridges-2 vary dramatically by allocation and partition. A single -A <alloc> submission can sit in PD for hours while another allocation/partition would start in minutes. When wait time matters, submit the same job to multiple allocations or compatible partitions in parallel and cancel the losers once one starts.
When to use:
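- You have access to multiple allocations (check with projects), or the job can run on multiple partitions (e.g. GPU-shared vs ROBO for H100 work).

How to do it:

Submit the job once per allocation/partition, each with a distinct --job-name so you can find them: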
sbatch -A <alloc-A> -p GPU-shared --gres=gpu:h100-80:1 -J race_h100_A job.sh
sbatch -A <alloc-B> -p GPU-shared --gres=gpu:h100-80:1 -J race_h100_B job.sh
sbatch -A <alloc-B> -p ROBO --gres=h100:1 -J race_h100_R job.sh
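Watch the siblings with squeue -u $USER -n race_h100_A,race_h100_B,race_h100_R.

As soon as one is in state R, cancel the rest:
# Only cancel the pending siblings once one job is actually running
RUNNING=$(squeue -u $USER -n race_h100_A,race_h100_B,race_h100_R -h -t R -o %i | head -1)
[ -n "$RUNNING" ] && squeue -u $USER -n race_h100_A,race_h100_B,race_h100_R -h -t PD -o %i | xargs -r scancel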
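The watch-then-cancel steps above can be wrapped in a polling loop so nothing is cancelled until a sibling is actually running. This is a sketch rather than a bundled script: `settle_race` is a hypothetical name, the 60-second default poll interval is an arbitrary choice, and unlike the one-liner above it cancels every sibling other than the winner, including a second sibling that happened to start at the same time.

```shell
# Wait until one of the named sibling jobs is running, then cancel the others.
# Usage: settle_race <comma-separated job names> [poll_seconds]
# Prints the winning job ID. Hypothetical helper; poll interval is an assumption.
settle_race() (
    names=$1
    interval=${2:-60}
    user=${USER:-$(id -un)}
    while :; do
        # First job ID among the siblings that is in state R, if any.
        winner=$(squeue -u "$user" -n "$names" -h -t R -o %i | head -n 1)
        [ -n "$winner" ] && break
        # Give up if every sibling has left the queue without starting.
        remaining=$(squeue -u "$user" -n "$names" -h -o %i)
        if [ -z "$remaining" ]; then
            echo "settle_race: no sibling jobs left in the queue" >&2
            exit 1
        fi
        sleep "$interval"
    done
    # Cancel every sibling except the winner (pending or duplicate running).
    squeue -u "$user" -n "$names" -h -o %i | while read -r id; do
        [ "$id" = "$winner" ] || scancel "$id"
    done
    echo "$winner"
)
```

Run it inside tmux on the login node, e.g. `settle_race race_h100_A,race_h100_B,race_h100_R`, so the loop survives an SSH drop.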
Rules / cautions:
- The only trigger for cancelling is "one sibling is R," not "I've been waiting a few minutes." Racing only works if every sibling stays alive until exactly one starts; killing PD jobs early to "settle" on one defeats the purpose, since the one you kept may itself be the slowest. If no sibling has started yet, keep waiting.
- Prefer interact for small / short / exploratory tasks. A single interact -p RM-shared ... or interact -p GPU-shared --gres=gpu:<type>:<n> ... session is usually better than batch racing for debug work, quick checks, or anything under ~15–30 min: lower overhead, no duplicate-job SU risk, and the user can drive it directly. Reach for racing only when the job is long enough that queue wait actually dominates.
- interact instead.

| Compiler | Module | C / C++ / Fortran | OpenMP flag |
|---|---|---|---|
| Intel Classic | intel | icc / icpc / ifort | -qopenmp |
| Intel LLVM | intel-oneapi | icx / icpx / ifx | -fopenmp |
| GNU | gcc | gcc / g++ / gfortran | -fopenmp |
| AMD | aocc | clang / clang++ / flang | -fopenmp |
| NVIDIA | nvhpc | nvc / nvc++ / nvfortran | -mp |
MPI implementations: MVAPICH2, OpenMPI, Intel MPI. Load compiler module + MPI module, then use mpicc / mpicxx / mpifort.
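Since the OpenMP flag differs per compiler family, build scripts can look it up from the loaded compiler module rather than hard-coding it. A small sketch that mirrors the table above; `openmp_flag` is a hypothetical helper.

```shell
# Print the OpenMP flag for a compiler module named in the table above.
# Hypothetical helper; module names are taken from the table's Module column.
openmp_flag() {
    case "$1" in
        intel)                 printf '%s\n' -qopenmp ;;
        intel-oneapi|gcc|aocc) printf '%s\n' -fopenmp ;;
        nvhpc)                 printf '%s\n' -mp ;;
        *)
            echo "openmp_flag: unknown compiler module '$1'" >&2
            return 1 ;;
    esac
}
```

A hybrid MPI + OpenMP build might then look like `module load gcc openmpi && mpicc $(openmp_flag gcc) hybrid.c -o hybrid`, where the `openmpi` module name and `hybrid.c` are placeholders to check against `module avail`.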
# rsync (recommended: faster MAC)
rsync -rltDvp [email protected] source/ [email protected]:/ocean/projects/<group>/<PSC-user>/dest/
# scp
scp file [email protected]:/ocean/projects/<group>/<PSC-user>/
# sftp
sftp [email protected]
Globus: endpoint PSC Bridges-2 /ocean and /jet filesystems. Best for large/resumable transfers.
Pinned-workspace rsync (recommended for project code): use the bundled scripts/sync_local_to_psc.sh / scripts/sync_psc_to_local.sh. Both read .psc-config in the project root, which fixes the remote path ONCE (PSC_WORKSPACE=...); the loader rejects unsafe destinations ($HOME roots, allocation roots, other users' trees). Agents must NEVER construct a remote path by hand — use these scripts, never pass a remote arg, and fix .psc-config if a check fails. Full rules and exclude list in remote-workspace.md.
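To illustrate the kind of check the loader performs, here is a rough sketch of reading PSC_WORKSPACE and rejecting unsafe destinations. It is not the bundled loader: the function names are hypothetical, the config is assumed to hold a plain unquoted `PSC_WORKSPACE=/path` line, and requiring at least one directory level below the user's own tree is an assumption drawn from the workspace rule above.

```shell
# Read PSC_WORKSPACE from a config file (assumes a plain, unquoted KEY=value line).
# Usage: read_psc_workspace [config_file]
read_psc_workspace() {
    sed -n 's/^PSC_WORKSPACE=//p' "${1:-.psc-config}" | head -n 1
}

# Succeed only for a path strictly below /ocean/projects/<group>/<user>/.
# Usage: is_safe_workspace <path> <psc_user>
# Hypothetical check; the real loader's rules are in remote-workspace.md.
is_safe_workspace() (
    path=${1%/}   # ignore one trailing slash
    user=$2
    case "$path" in
        *..*|*//*) exit 1 ;;   # no traversal, no empty path segments
    esac
    rest=${path#/ocean/projects/}
    [ "$rest" != "$path" ] || exit 1           # must live under /ocean/projects/
    after_group=${rest#*/}
    [ "$after_group" != "$rest" ] || exit 1    # allocation root: rejected
    owner=${after_group%%/*}
    subdir=${after_group#*/}
    # Must be this user's tree, and at least one level below its root.
    [ "$owner" = "$user" ] && [ "$subdir" != "$after_group" ] && [ -n "$subdir" ]
)
```

A caller would stop on failure, e.g. `is_safe_workspace "$(read_psc_workspace)" "$PSC_USER" || exit 1`, where `PSC_USER` is a placeholder for however the PSC username is supplied.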
- singularity exec --nv -B /ocean:/ocean <img.sif> <cmd>
- NGC containers: /ocean/containers/ngc
- Set APPTAINER_CACHEDIR / APPTAINER_TMPDIR under $PROJECT before pulling/building
- Bundled helpers: scripts/singularity_pull_docker_local.sh (pull a Docker image on an allocated node using $LOCAL scratch, then save the .sif to the working dir) and scripts/start_sif.sh (fuzzy-match a .sif by keyword and start/exec it with --nv + /local bind). Full usage in docker-to-singularity.md. For a bare Docker-on-PSC workflow, copy those two scripts to the cluster; nothing else is required.
- [email protected]

For lab-standard container workflows, tmux-pinning to a login node, job arrays, and Rerun visualization, see airlab-fast-setup.md.
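The keyword lookup that start_sif.sh is described as doing could look roughly like the sketch below. This is not the bundled script: `find_sif` is a hypothetical function, it matches case-insensitively on the file name only, and it relies on the widely available (but non-POSIX) `-maxdepth` and `-iname` options of find.

```shell
# Print the single .sif in <dir> whose file name contains <keyword>
# (case-insensitive); fail when there is no match or more than one.
# Usage: find_sif <dir> <keyword>
# Hypothetical sketch, not the bundled scripts/start_sif.sh.
find_sif() (
    dir=$1; keyword=$2
    matches=$(find "$dir" -maxdepth 1 -type f -iname "*${keyword}*.sif")
    # Count non-empty lines; an empty result counts as zero matches.
    count=$(printf '%s\n' "$matches" | grep -c .)
    if [ "$count" -ne 1 ]; then
        echo "find_sif: $count matches for '$keyword' in $dir" >&2
        exit 1
    fi
    printf '%s\n' "$matches"
)
```

Combined with the exec line above, a call might read `singularity exec --nv -B /ocean:/ocean "$(find_sif . pytorch)" python train.py`, with `pytorch` and `train.py` as placeholders.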
- Transfers: use data.bridges2.psc.edu.
- RM: you pay for all 128 cores. Use RM-shared.
- --gres=gpu:... in batch jobs: batch uses --gpus=<type>:<n>.
- Pass -A <allocation> when multiple allocations exist.
- $LOCAL/$RAMDISK: wiped at job end.
- [email protected]

Official PSC Bridges-2
- bridges2.psc.edu
- data.bridges2.psc.edu

Access / allocations

Tooling

AirLab
- airlab-storage.andrew.cmu.edu:5001
npx claudepluginhub jakoerror/claude-research-skills --plugin research-skills