Skill

ilc-access

Use this skill when the user needs to run code, submit jobs, or access data on the SNAP/ILC cluster at Stanford. Covers SLURM job submission (single `il` partition with QoS-based scheduling, correct account, GPU types), the three-tier storage system, and uv/pixi virtual environment management on compute nodes. Do NOT use this for Sherlock — ILC has a different account, storage layout, and cluster configuration.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/snap-skills:ilc-access

SKILL.md

186 lines · ~2.5k tokens

Stats

Language: Python

Stars: 1

Forks: 2

Maintenance: Excellent

Last commit: May 22, 2026

SNAP/ILC Cluster Access

This skill teaches the agent how to interact with the SNAP (Stanford Network Analysis Project) cluster, also known as ILC. The login node is reachable via `ssh ilc`.

Setup (one-time, user-side)

Before this skill can be used, the user must have:

- An active SNAP/ILC account with membership in the `infolab` SLURM account.
- SSH access configured as `ilc` in `~/.ssh/config` (may require the Stanford VPN when off campus).

If the user cannot `ssh ilc` or gets `Invalid account or account/partition combination specified` errors, ask them to check with the SNAP admins or Jure's lab manager.

Core rules

- Never run code on the login node. Always submit via SLURM (`sbatch` or `srun`).
- Every job must specify `--account=infolab` or it will fail.
- Interactive jobs require `--qos=il-interactive`; otherwise `srun` fails with `Interactive jobs must request --qos=il-interactive`.
- For debugging and testing, use `srun --qos=il-interactive` so stdout/stderr stream directly. For production, use `sbatch` (defaults to the `il` QoS, 7-day wall time).
- Check resource availability before submitting big jobs:
  ```
  sinfo -p il -O NodeHost:.20,Gres:.40,GresUsed:.60,CPUsState:.15,AllocMem:.10,Memory:.10
  ```
  (The user has this aliased as `gpu_util`.)
- Be generous with time and memory allocation: a job that crashes from OOM wastes more queue time than one that over-allocates.
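
The account and QoS rules above can be checked mechanically before anything reaches the scheduler. The following is a minimal sketch, assuming a hypothetical helper named `ilc_lint` (an illustrative name, not a SLURM or cluster tool) that receives the command name followed by its flags. It only recognizes the `--flag=value` spelling used throughout this document.

```bash
# Hypothetical pre-submit check; ilc_lint is an illustrative name.
# Usage: ilc_lint srun|sbatch [flags...]
# Only the --account=... / --qos=... spellings are recognized.
ilc_lint() {
    local cmd="$1" has_account=no has_iqos=no arg
    shift
    for arg in "$@"; do
        case "$arg" in
            --account=infolab)    has_account=yes ;;
            --qos=il-interactive) has_iqos=yes ;;
        esac
    done
    if [ "$has_account" = no ]; then
        echo "missing --account=infolab"
        return 1
    fi
    # srun is the interactive path, so it also needs the interactive QoS.
    if [ "$cmd" = srun ] && [ "$has_iqos" = no ]; then
        echo "srun needs --qos=il-interactive"
        return 1
    fi
    echo "ok"
}
```

For example, `ilc_lint srun --account=infolab --gres=gpu:rtx8000:1` reports the missing QoS flag and returns a non-zero status, so it can gate the real `srun` call.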

Partition and QoS structure

ILC has been consolidated into a single production partition, `il` (the default). Scheduling priority and wall-time limits are now controlled by QoS, not by partition.

| QoS | Max wall time | Priority | Use case |
| --- | --- | --- | --- |
| `il` (default) | 7 days | 1000 | Standard production jobs |
| `il-interactive` | 12 hours | 1500 | Debugging, testing, short iterations (highest priority) |
| `il-lo` | 21 days | 100 | Long runs (lowest priority, easily preempted) |

Rules of thumb:

- Default to `--qos=il-interactive` for any iteration or debugging work: it has the highest priority and its 12-hour cap forces you to keep jobs short.
- Drop the `--qos` flag (or set `--qos=il`) for production runs that need more than 12 hours.
- Use `--qos=il-lo` only for very long, low-urgency runs the user explicitly asked for.

The legacy `il-interactive` and `il-lo` partitions no longer exist, so passing either name via `--partition=` will fail.

The `dev` and `dev-interactive` partitions still exist for the `moogle` and `ilcx` nodes but should not be used for real work.
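
The wall-time limits in the QoS table translate directly into a lookup. This is a sketch only, assuming a hypothetical function `pick_qos` that takes the requested wall time in whole hours; it looks at wall time alone, so an `il-lo` result should still be confirmed with the user as described in the rules of thumb.

```bash
# Map a requested wall time (whole hours) to the smallest QoS that allows it.
# pick_qos is an illustrative name, not a cluster tool.
pick_qos() {
    local hours="$1"
    if [ "$hours" -le 12 ]; then
        echo "il-interactive"   # 12-hour cap, highest priority
    elif [ "$hours" -le 168 ]; then
        echo "il"               # 7 days = 168 hours
    elif [ "$hours" -le 504 ]; then
        echo "il-lo"            # 21 days = 504 hours
    else
        echo "no QoS allows ${hours}h" >&2
        return 1
    fi
}
```

A caller might then build the flag as `--qos="$(pick_qos 48)"`, which yields `--qos=il`.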

Nodes (GPU inventory)

| Node | GPU type | GPUs | Mem (GB) | CPUs | Notes |
| --- | --- | --- | --- | --- | --- |
| `blackwell1` | B200 | 7 | 3090 | 288 | Currently in repair and not in any partition. Once it returns, this is the top-performance node. |
| `ampere{1,4,8,9}` | A100 80G | 8 | 2048 | 128 | Default production nodes while `blackwell1` is down |
| `hyperturing2` | RTX8000 | 10 | 2050 | 252 | Preferred for testing/debugging |
| `turing3` | 2080Ti | 10 | 1480 | 80 | Lighter workloads; no `/lfs/local/0` (uses `/lfs/turing3/0`) |

Node recommendation for agents:

- Heavy production: B200 (`blackwell1`) when available, otherwise A100 (`ampere{1,4,8,9}`) via `--gres=gpu:a100:N`.
- Testing/debugging: RTX8000 (`hyperturing2`) via `--gres=gpu:rtx8000:N`.
- Do NOT use `turing3` unless explicitly instructed. Its local SSD lives at `/lfs/turing3/0` instead of `/lfs/local/0`, which breaks most venv/scratch setups.

If `--gres=gpu:b200:*` or `--nodelist=blackwell1` fails with a partition error, `blackwell1` is still in repair; fall back to A100.
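
The node recommendation reduces to a small decision on workload type and whether `blackwell1` is back. A minimal sketch, assuming a hypothetical function `pick_gres`; the third argument is something the caller has to determine (for example from the `sinfo` output), and it defaults to `no` because the node is currently in repair.

```bash
# pick_gres WORKLOAD COUNT [BLACKWELL_UP]
# WORKLOAD is "production" or "debug"; BLACKWELL_UP is "yes" or "no" (default).
# pick_gres is an illustrative name, not a cluster tool.
pick_gres() {
    local workload="$1" count="$2" blackwell_up="${3:-no}"
    case "$workload" in
        production)
            if [ "$blackwell_up" = yes ]; then
                echo "--gres=gpu:b200:${count}"
            else
                echo "--gres=gpu:a100:${count}"   # fallback while blackwell1 is down
            fi ;;
        debug)
            echo "--gres=gpu:rtx8000:${count}" ;;
        *)
            echo "unknown workload: ${workload}" >&2
            return 1 ;;
    esac
}
```

`turing3` is deliberately absent from the choices, matching the rule that it is only used when explicitly requested.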

Storage architecture

ILC has a three-tier storage system. This is the single most common source of errors.

| Location | Quota | Speed | Visibility | Purpose |
| --- | --- | --- | --- | --- |
| `/sailhome/$USER` | 20 GB | NFS, medium | All nodes | Code, configs, small files. Cannot fit venvs. |
| `/dfs/user/$USER` | 2 TB | NFS, slow | All nodes | Large datasets, results, archives. |
| `/lfs/local/0/$USER` | Large | Fast SSD, node-local | Per compute node | Venvs, caches, temp computation. |
| `/lfs/turing3/0/$USER` | 14 TB | Fast SSD, node-local | `turing3` only | Alternative node-local SSD on the one node that lacks `/lfs/local/0`. |

Critical constraints:

- `/sailhome` is 20 GB. A single vLLM venv is ~10 GB, so venvs do not go here.
- `/lfs/local/0` is per compute node: files written on `ampere1` are NOT visible on `ampere4`. The login node has no access to `/lfs/local/0` at all.
- `/tmp` is also node-local and ephemeral.
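
Because misplaced venvs are the usual failure, it can help to classify a path before writing to it. The following sketch assumes two hypothetical helpers, `storage_tier` and `venv_path_ok`, which only pattern-match on the path prefixes from the table above and do not touch the filesystem.

```bash
# Classify a path by the storage tier its prefix belongs to.
# storage_tier and venv_path_ok are illustrative names, not cluster tools.
storage_tier() {
    case "$1" in
        /sailhome/*)                      echo home ;;
        /dfs/user/*)                      echo dfs ;;
        /lfs/local/0/*|/lfs/turing3/0/*)  echo local-ssd ;;
        /tmp/*)                           echo tmp ;;
        *)                                echo unknown ;;
    esac
}

# A venv location is acceptable only on the node-local SSD tier.
venv_path_ok() {
    [ "$(storage_tier "$1")" = local-ssd ]
}
```

A job script could guard its setup with `venv_path_ok "$UV_PROJECT_ENVIRONMENT" || exit 1` to fail early instead of filling the home quota.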

Environment-variable setup for compute jobs

SLURM jobs do not source `~/.bashrc`, so every job script (or `--wrap` body) must set its own environment. The required variables on ILC are:

export UV_PROJECT_ENVIRONMENT="/lfs/local/0/$USER/uv-envs/$(basename "$PWD")"   # venv (~GBs)
export XDG_CACHE_HOME="/lfs/local/0/$USER/.cache"                                # uv wheel/build cache
export XDG_DATA_HOME="/lfs/local/0/$USER/.local/share"                           # uv-managed Python interpreters

All three keep large, write-heavy state off the 20 GB `/sailhome` quota and on fast node-local SSD. SLURM inherits the submitter's `PATH`, so `uv` itself is already reachable; set `XDG_BIN_HOME` and prepend `/lfs/local/0/$USER/.local/bin` to `PATH` only if the job uses tools installed via `uv tool install`.

The user's interactive `~/.bashrc` does the equivalent via `PROMPT_COMMAND`; SLURM jobs must replicate it.
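
Since every job has to repeat these exports, one option is to wrap them in a function that a job script defines or sources. This is a sketch under assumptions: `ilc_env` and the `ILC_SCRATCH` override are hypothetical names introduced here, not something the cluster provides.

```bash
# Set the three variables for a given project name.
# ilc_env and ILC_SCRATCH are illustrative names, not cluster-provided.
ilc_env() {
    local project="$1"
    local user="${USER:-$(id -un)}"
    # turing3 has no /lfs/local/0; set ILC_SCRATCH=/lfs/turing3/0 on that node.
    local scratch="${ILC_SCRATCH:-/lfs/local/0}/${user}"
    export UV_PROJECT_ENVIRONMENT="${scratch}/uv-envs/${project}"   # venv
    export XDG_CACHE_HOME="${scratch}/.cache"                       # uv cache
    export XDG_DATA_HOME="${scratch}/.local/share"                  # uv-managed Pythons
}
```

Calling `ilc_env my_project` at the top of a job script then replaces the three `export` lines, and taking the project name as an argument avoids depending on the job's working directory.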

uv virtual environment strategy

Use `uv` for all Python dependency management. The recommended pattern lets `uv run` auto-create a venv on each compute node:

1. Set the env vars above in the job script.
2. The first job on a given node auto-installs the venv into `/lfs/local/0/$USER/uv-envs/<project>` (~5–10 min for large deps like vLLM/torch).
3. Subsequent jobs on the same node reuse that venv and start without the install step.

Because `/lfs/local/0` is node-local, each compute node maintains its own venv. There is no NFS congestion and no `/sailhome` quota pressure.

pixi virtual environments

Pixi is configured (via `~/.pixi/config.toml`) to use `XDG_CACHE_HOME` for environments, so the same env-var setup above applies. After `pixi install`, if the environment ended up with a CPU-only PyTorch build from conda-forge and you need CUDA, force-install the CUDA wheels from a compute node:

pixi run pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 \
  --index-url https://download.pytorch.org/whl/cu124
pixi run pip install 'numpy<2'   # avoid breakage with scvi-tools, numba, etc.

Job submission patterns

Interactive GPU job (debugging)

ssh ilc "srun --account=infolab --partition=il --qos=il-interactive \
  --gres=gpu:rtx8000:1 --cpus-per-task=8 --mem=100G --time=04:00:00 \
  bash -c 'export UV_PROJECT_ENVIRONMENT=/lfs/local/0/\$USER/uv-envs/my_project \
    XDG_CACHE_HOME=/lfs/local/0/\$USER/.cache \
    XDG_DATA_HOME=/lfs/local/0/\$USER/.local/share \
    && cd /sailhome/\$USER/my_project \
    && uv run --no-progress python train.py'"

Batch GPU job (production)

ssh ilc "sbatch --account=infolab --partition=il \
  --gres=gpu:a100:1 --cpus-per-task=8 --mem=100G --time=2-00:00:00 \
  --output=/dfs/user/\$USER/logs/job_%j.out \
  --error=/dfs/user/\$USER/logs/job_%j.err \
  my_script.sh"

(Omitting `--qos=` defaults to `il` with its 7-day cap. Use `--qos=il-lo` for runs longer than 7 days.)
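
The batch example submits `my_script.sh` without showing its contents. Below is a minimal sketch of what such a script might look like, assuming the `my_project` directory and `train.py` placeholders from the interactive example. `write_job_script` is a hypothetical helper that only writes the file so it can be inspected or copied to the cluster.

```bash
# Write a sketch of a batch script to the given path.
# write_job_script is an illustrative name; my_project and train.py are
# placeholders, not real files.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --account=infolab
#SBATCH --partition=il
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=100G
#SBATCH --time=2-00:00:00
set -euo pipefail

# Job scripts do not pick up the interactive shell setup, so set the env here.
export UV_PROJECT_ENVIRONMENT="/lfs/local/0/$USER/uv-envs/my_project"
export XDG_CACHE_HOME="/lfs/local/0/$USER/.cache"
export XDG_DATA_HOME="/lfs/local/0/$USER/.local/share"

cd "/sailhome/$USER/my_project"
uv run --no-progress python train.py
EOF
}
```

Flags passed on the `sbatch` command line take precedence over matching `#SBATCH` lines, so the command above works unchanged with this script. SLURM does not create the directory named in `--output` or `--error`, so `/dfs/user/$USER/logs` needs to exist before the job starts.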

CPU-only job

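ssh ilc "sbatch --account=infolab --partition=il --qos=il-interactive \
  --cpus-per-task=4 --mem=16G --time=02:00:00 \
  --wrap='export UV_PROJECT_ENVIRONMENT=/lfs/local/0/\$USER/uv-envs/my_project \
    XDG_CACHE_HOME=/lfs/local/0/\$USER/.cache \
    XDG_DATA_HOME=/lfs/local/0/\$USER/.local/share \
    && cd /sailhome/\$USER/my_project && uv run --no-progress python analysis.py'"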

One-shot inline job via `--wrap`

ssh ilc "sbatch --account=infolab --partition=il --qos=il-interactive \
  --mem=80G --time=01:00:00 \
  --output=/dfs/user/\$USER/logs/jobname_%j.out \
  --error=/dfs/user/\$USER/logs/jobname_%j.err \
  --wrap='<command>'"

API key injection (never write keys to disk)

Keys must be injected ephemerally from the local machine via `pass` (or an equivalent secret manager). Never write them to `~/.bashrc`, `.env` files, or any file on the server.

ssh ilc "ANTHROPIC_API_KEY=$(pass api_keys/anthropic) sbatch \
  --account=infolab --partition=il --qos=il-interactive \
  --export=ALL --wrap='uv run --no-progress python my_script.py'"

Common errors and fixes

| Error | Fix |
| --- | --- |
| `Invalid account or account/partition combination specified` | Add `--account=infolab`; if it still fails, the user is not in the `infolab` account and should contact the SNAP admins. |
| `Interactive jobs must request --qos=il-interactive` | Add `--qos=il-interactive` to your `srun`. |
| `Invalid partition name specified: il-interactive` (or `il-lo`) | These are no longer partitions. Use `--partition=il --qos=il-interactive` (or `--qos=il-lo`). |
| `Disk quota exceeded` while installing a venv | `UV_PROJECT_ENVIRONMENT` is unset or pointing at `/sailhome`. Set it to `/lfs/local/0/$USER/uv-envs/...`. |
| `Permission denied` writing to `/lfs/local/0` from the login node | You're trying to do compute work on the login node. Submit a SLURM job. |
| `gpu:b200` not found / `blackwell1 not in partition` | `blackwell1` is in repair. Fall back to `--gres=gpu:a100:N`. |
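
The table above can also be expressed as a lookup that an agent runs on captured stderr. This is a sketch, assuming a hypothetical function `ilc_hint`; it matches on substrings of the messages listed in the table, and the `Permission denied` pattern is deliberately broad, so its hint only applies when the path involved is `/lfs/local/0`.

```bash
# ilc_hint MESSAGE -> prints a one-line fix for the errors in the table above.
# ilc_hint is an illustrative name, not a cluster tool.
ilc_hint() {
    case "$1" in
        *"Invalid account or account/partition"*)
            echo "add --account=infolab" ;;
        *"Interactive jobs must request"*)
            echo "add --qos=il-interactive" ;;
        *"Invalid partition name"*)
            echo "use --partition=il and pick the QoS with --qos" ;;
        *"Disk quota exceeded"*)
            echo "point UV_PROJECT_ENVIRONMENT at /lfs/local/0" ;;
        *"Permission denied"*)
            echo "submit a SLURM job instead of working on the login node" ;;
        *b200*|*blackwell1*)
            echo "fall back to --gres=gpu:a100:N" ;;
        *)
            echo "no known fix"
            return 1 ;;
    esac
}
```

The non-zero return for unknown messages lets a caller distinguish a recognized cluster error from something that needs a human to look at it.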

What this skill does NOT cover

- Sherlock cluster: different account (`zinaida`), different storage, conda-based env strategy. Use a Sherlock-specific skill.
- Project-specific Snakemake profiles: refer to the individual project repos.
- Cross-machine code syncing: handled by `lsyncd` configs in project directories (see the user's toolchain notes).
