From snap-skills
Use this skill when the user needs to run code, submit jobs, or access data on the SNAP/ILC cluster at Stanford. Covers SLURM job submission (single `il` partition with QoS-based scheduling, correct account, GPU types), the three-tier storage system, and uv/pixi virtual environment management on compute nodes. Do NOT use this for Sherlock — ILC has a different account, storage layout, and cluster configuration.
Slash command: `/snap-skills:ilc-access`
This skill teaches the agent how to interact with the SNAP (Stanford Networking and Analysis Platform) cluster, also known as ILC. The login node is reachable via `ssh ilc`.
Before this skill can be used, the user must have:
- An `infolab` SLURM account.
- An `ilc` host entry in `~/.ssh/config` (may require Stanford VPN when off-campus).

If the user cannot `ssh ilc` or gets `Invalid account or account/partition combination specified` errors, ask them to check with the SNAP admins or Jure's lab manager.
- Never run compute work on the login node; submit it as a SLURM job (`sbatch` or `srun`).
- Every job must pass `--account=infolab` or it will fail.
- Interactive jobs must pass `--qos=il-interactive`; otherwise `srun` fails with `Interactive jobs must request --qos=il-interactive`.
- For debugging, use `srun --qos=il-interactive` so stdout/stderr streams directly. For production, use `sbatch` (defaults to the `il` QoS, 7-day wall).
- Check current node and GPU usage with `sinfo -p il -O NodeHost:.20,Gres:.40,GresUsed:.60,CPUsState:.15,AllocMem:.10,Memory:.10`. (The user has this aliased as `gpu_util`.)

ILC consolidated to a single production partition `il` (the default). Scheduling priority and wall-time limits are now controlled by QoS, not by partition.
| QoS | Max wall | Priority | Use case |
|---|---|---|---|
| `il` (default) | 7 days | 1000 | Standard production jobs |
| `il-interactive` | 12 hours | 1500 | Debugging, testing, short iterations; highest priority |
| `il-lo` | 21 days | 100 | Long runs (lowest priority, easily preempted) |
Rules of thumb:
- Use `--qos=il-interactive` for any iteration or debugging work: it has the highest priority and forces you to keep jobs short.
- Omit the `--qos` flag (or set `--qos=il`) for production runs that need more than 12 hours.
- Use `--qos=il-lo` only for very long, low-urgency runs the user explicitly asked for.

(The legacy `il-interactive` and `il-lo` partitions no longer exist; passing them via `--partition=` will fail.)
The dev and dev-interactive partitions still exist for the moogle and ilcx nodes but should not be used for real work.
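
The wall-time limits in the QoS table can be turned into a small lookup. The sketch below is a hypothetical helper (`pick_qos` is not part of SLURM or of this skill) that returns the first QoS whose limit covers a requested number of hours. It looks only at wall time and ignores both priority and the rule that `il-lo` is for runs the user explicitly asked for, so treat its answer as a starting point.

```shell
# Hypothetical helper: choose a QoS from the requested wall time in whole hours.
# Limits come from the QoS table: 12 h, 7 days (168 h), 21 days (504 h).
pick_qos() {
    hours=$1
    if [ "$hours" -le 12 ]; then
        echo "il-interactive"
    elif [ "$hours" -le 168 ]; then
        echo "il"
    elif [ "$hours" -le 504 ]; then
        echo "il-lo"
    else
        echo "no QoS allows ${hours}h (the longest limit is 21 days)" >&2
        return 1
    fi
}

pick_qos 4     # prints: il-interactive
pick_qos 48    # prints: il
pick_qos 240   # prints: il-lo
```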
| Node | GPU Type | GPUs | Mem (GB) | CPUs | Notes |
|---|---|---|---|---|---|
| `blackwell1` | B200 | 7 | 3090 | 288 | Currently in repair and not in any partition. Will return; once back, this is the top performance node. |
| `ampere{1,4,8,9}` | A100 80G | 8 | 2048 | 128 | Default production nodes while `blackwell1` is down |
| `hyperturing2` | RTX8000 | 10 | 2050 | 252 | Preferred for testing/debugging |
| `turing3` | 2080Ti | 10 | 1480 | 80 | Lighter workloads; no `/lfs/local/0` (uses `/lfs/turing3/0`) |
Node recommendation for agents:
- Production: B200 (`blackwell1`) when available, otherwise A100 (`ampere{1,4,8,9}`) via `--gres=gpu:a100:N`.
- Testing/debugging: RTX8000 (`hyperturing2`) via `--gres=gpu:rtx8000:N`.
- Avoid `turing3` unless explicitly instructed: its local SSD lives at `/lfs/turing3/0` instead of `/lfs/local/0`, which breaks most venv/scratch setups.

If `--gres=gpu:b200:*` or `--nodelist=blackwell1` fails with a partition error, `blackwell1` is still in repair; fall back to A100.
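
The node recommendation reduces to choosing a `--gres` value. As a minimal sketch, the hypothetical `pick_gres` helper below maps a purpose (`prod` or `debug`, names chosen here for illustration) and a GPU count to the flag. It assumes `blackwell1` is still in repair, so production maps to A100.

```shell
# Hypothetical helper: print the --gres flag for a given purpose and GPU count.
# Assumes blackwell1 (B200) is unavailable, so "prod" falls back to A100.
pick_gres() {
    purpose=$1
    ngpu=$2
    case "$purpose" in
        prod)  gpu_type="a100" ;;      # ampere{1,4,8,9}
        debug) gpu_type="rtx8000" ;;   # hyperturing2
        *)
            echo "unknown purpose: $purpose (expected prod or debug)" >&2
            return 1
            ;;
    esac
    echo "--gres=gpu:${gpu_type}:${ngpu}"
}

pick_gres prod 2    # prints: --gres=gpu:a100:2
pick_gres debug 1   # prints: --gres=gpu:rtx8000:1
```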
ILC has a three-tier storage system. This is the single most common source of errors.
| Location | Quota | Speed | Visibility | Purpose |
|---|---|---|---|---|
| `/sailhome/$USER` | 20 GB | NFS, medium | All nodes | Code, configs, small files. Cannot fit venvs. |
| `/dfs/user/$USER` | 2 TB | NFS, slow | All nodes | Large datasets, results, archives. |
| `/lfs/local/0/$USER` | Large | Node-local SSD, fast | Per compute node | Venvs, caches, temp computation. |
| `/lfs/turing3/0/$USER` | 14 TB | Node-local SSD, fast | `turing3` only | Alternative node-local SSD on the one node that lacks `/lfs/local/0`. |
Critical constraints:
- `/sailhome` is 20 GB. A single vLLM venv is ~10 GB, so venvs do not go here.
- `/lfs/local/0` is per compute node: files written on `ampere1` are NOT visible on `ampere4`. The login node has no access to `/lfs/local/0` at all.
- `/tmp` is also node-local and ephemeral.

SLURM jobs do not source `~/.bashrc`, so every job script (or `--wrap` body) must set its own env. The required variables on ILC are:
export UV_PROJECT_ENVIRONMENT=/lfs/local/0/$USER/uv-envs/$(basename $PWD) # venv (~GBs)
export XDG_CACHE_HOME=/lfs/local/0/$USER/.cache # uv wheel/build cache
export XDG_DATA_HOME=/lfs/local/0/$USER/.local/share # uv-managed Python interpreters
All three keep large, write-heavy state off the 20 GB /sailhome quota and onto fast node-local SSD. SLURM inherits the submitter's PATH (so uv itself is already reachable); only set XDG_BIN_HOME and prepend /lfs/local/0/$USER/.local/bin to PATH if the job uses tools installed via uv tool install.
The user's interactive ~/.bashrc does the equivalent via PROMPT_COMMAND; SLURM jobs must replicate it.
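
One way to avoid repeating the three exports in every job is a small function placed at the top of the job script. The sketch below is an assumption, not the user's actual `~/.bashrc` logic: `ilc_env_setup` and the `ILC_LOCAL_ROOT` override are hypothetical names, and the override exists only so the root can be changed (for example to `/lfs/turing3/0`). It fails early when the node-local directory is not writable, which is what happens on the login node.

```shell
# Hypothetical helper for the top of a job script or --wrap body.
# ILC_LOCAL_ROOT is an assumed override; it defaults to /lfs/local/0.
ilc_env_setup() {
    root="${ILC_LOCAL_ROOT:-/lfs/local/0}/${USER:-$(id -un)}"
    # Fail early if the node-local SSD is missing or not writable.
    if ! mkdir -p "$root" 2>/dev/null; then
        echo "cannot write to $root (login node, or a node without it?)" >&2
        return 1
    fi
    export UV_PROJECT_ENVIRONMENT="$root/uv-envs/$(basename "$PWD")"  # venv
    export XDG_CACHE_HOME="$root/.cache"                              # uv cache
    export XDG_DATA_HOME="$root/.local/share"                         # uv-managed Pythons
}
```

In a job script this would run as `ilc_env_setup || exit 1` before the first `uv run`.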
Use uv for all Python dependency management. The recommended pattern lets uv run auto-create a venv on each compute node:
The venv is created at `/lfs/local/0/$USER/uv-envs/<project>` on first use (~5–10 min for large deps like vLLM/torch).

Because `/lfs/local/0` is node-local, each compute node maintains its own venv. There is no NFS congestion and no `/sailhome` quota pressure.
Pixi is configured (via ~/.pixi/config.toml) to use XDG_CACHE_HOME for environments, so the same env-var setup above applies. After pixi install, if you need a CUDA build of PyTorch (conda-forge's pytorch is CPU-only), force-install from a compute node:
pixi run pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 \
--index-url https://download.pytorch.org/whl/cu124
pixi run pip install 'numpy<2' # avoid breakage with scvi-tools, numba, etc.
ssh ilc "srun --account=infolab --partition=il --qos=il-interactive \
--gres=gpu:a100:1 --cpus-per-task=8 --mem=100G --time=04:00:00 \
bash -c 'export UV_PROJECT_ENVIRONMENT=/lfs/local/0/\$USER/uv-envs/my_project \
&& cd /sailhome/\$USER/my_project \
&& uv run --no-progress python train.py'"
ssh ilc "sbatch --account=infolab --partition=il \
--gres=gpu:a100:1 --cpus-per-task=8 --mem=100G --time=2-00:00:00 \
--output=/dfs/user/\$USER/logs/job_%j.out \
--error=/dfs/user/\$USER/logs/job_%j.err \
my_script.sh"
(Omitting `--qos=` defaults to `il`, with a 7-day cap. Use `--qos=il-lo` for runs longer than 7 days.)
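
The `sbatch` example takes a `my_script.sh` whose contents are not shown. A minimal sketch of what it could contain follows, wrapped in a hypothetical `write_job_script` function so the file can be generated and inspected. The project path `/sailhome/$USER/my_project` and the `train.py` entry point are borrowed from the interactive example and are assumptions here. The script sets the node-local env itself because SLURM jobs do not source `~/.bashrc`.

```shell
# Hypothetical generator: writes a job script to the path given as $1.
# The quoted 'EOF' keeps $USER unexpanded so it resolves on the compute node.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
set -euo pipefail

# Keep the venv, uv cache, and uv-managed Pythons on node-local SSD.
export UV_PROJECT_ENVIRONMENT=/lfs/local/0/$USER/uv-envs/my_project
export XDG_CACHE_HOME=/lfs/local/0/$USER/.cache
export XDG_DATA_HOME=/lfs/local/0/$USER/.local/share

cd /sailhome/$USER/my_project
uv run --no-progress python train.py
EOF
    chmod +x "$1"
}
```

Calling `write_job_script my_script.sh` produces the file locally; it still has to be placed on the cluster where `sbatch` can find it.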
ssh ilc "sbatch --account=infolab --partition=il --qos=il-interactive \
--cpus-per-task=4 --mem=16G --time=02:00:00 \
--wrap='cd /sailhome/\$USER/my_project && uv run --no-progress python analysis.py'"
`--wrap` with explicit log files:
ssh ilc "sbatch --account=infolab --partition=il --qos=il-interactive \
--mem=80G --time=01:00:00 \
--output=/dfs/user/\$USER/logs/jobname_%j.out \
--error=/dfs/user/\$USER/logs/jobname_%j.err \
--wrap='<command>'"
Keys must be injected ephemerally from the local machine via pass (or equivalent). Never write them to ~/.bashrc, .env files, or any file on the server.
ssh ilc "ANTHROPIC_API_KEY=$(pass api_keys/anthropic) sbatch \
--account=infolab --partition=il --qos=il-interactive \
--export=ALL --wrap='uv run --no-progress python my_script.py'"
| Error | Fix |
|---|---|
| `Invalid account or account/partition combination specified` | Add `--account=infolab`; if it still fails, the user is not in the `infolab` account, so contact the SNAP admins. |
| `Interactive jobs must request --qos=il-interactive` | Add `--qos=il-interactive` to your `srun`. |
| `Invalid partition name specified: il-interactive` (or `il-lo`) | These are no longer partitions. Use `--partition=il --qos=il-interactive` (or `il-lo`). |
| `Disk quota exceeded` while installing a venv | `UV_PROJECT_ENVIRONMENT` is unset or pointing at `/sailhome`. Set it to `/lfs/local/0/$USER/uv-envs/...`. |
| `Permission denied` writing to `/lfs/local/0` from the login node | You're trying to do compute work on the login node. Submit a SLURM job. |
| `gpu:b200` not found / `blackwell1` not in partition | `blackwell1` is in repair. Fall back to `--gres=gpu:a100:N`. |
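
For an agent that captures job output, the error table can be encoded as a pattern lookup. The sketch below is illustrative only: `suggest_fix` is a hypothetical name, and the patterns match the message fragments listed in the table rather than any guaranteed SLURM wording.

```shell
# Hypothetical lookup: map a captured error message to the fix from the table.
suggest_fix() {
    case "$1" in
        *"Invalid account or account/partition combination"*)
            echo "Add --account=infolab" ;;
        *"Interactive jobs must request --qos=il-interactive"*)
            echo "Add --qos=il-interactive to srun" ;;
        *"Invalid partition name specified"*)
            echo "Use --partition=il with --qos=il-interactive or --qos=il-lo" ;;
        *"Disk quota exceeded"*)
            echo 'Set UV_PROJECT_ENVIRONMENT under /lfs/local/0/$USER/uv-envs/' ;;
        *"Permission denied"*"/lfs/local/0"* | *"/lfs/local/0"*"Permission denied"*)
            echo "Submit a SLURM job instead of working on the login node" ;;
        *b200* | *blackwell1*)
            echo "Fall back to --gres=gpu:a100:N" ;;
        *)
            echo "No known fix" >&2
            return 1 ;;
    esac
}

suggest_fix "srun: error: Interactive jobs must request --qos=il-interactive"
# prints: Add --qos=il-interactive to srun
```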

Out of scope for this skill:
- Sherlock: different account (`zinaida`), different storage, conda-based env strategy. Use a Sherlock-specific skill.
- `lsyncd` configs in project directories (see the user's toolchain notes).