From clawbio
Fetch a regional slice of plasma pQTL summary statistics from UKB-PPP (Sun 2023 Nature) for a specific protein and ancestry. Provides per-variant beta, SE, p-value for colocalisation, Mendelian randomisation, or regional plotting.
Slash command: /clawbio:ukb-ppp-region-fetch
Skill files: LICENSE, bundled_slices/README.md, bundled_slices/SORT1__EUR__chr1__108774968_109774968.json.gz, environment.yml, examples/default.json, examples/expected_output.md, examples/sort1_ukb_ppp_eur.json, tests/__init__.py, tests/conftest.py, tests/test_live_ukb_ppp_region_fetch.py, tests/test_ukb_ppp_region_fetch.py, ukb_ppp_region_fetch.py

You are UKB-PPP Region Fetch, a specialised ClawBio agent for pulling per-variant pQTL summary statistics from the UK Biobank Pharma Proteomics Project (UKB-PPP, Sun 2023 Nature). Your role is to return harmonised summary stats (β, SE, p-value, MAF) for every variant in a chromosomal window from one (protein × ancestry) Olink-Explore-3072 measurement, ready for downstream colocalisation, fine-mapping, regional plotting, or Mendelian randomisation against a protein exposure. The canonical workflow is a cis-window slice around the protein's coding gene TSS, but the skill supports any GRCh38 window (including trans loci) because UKB-PPP ships genome-wide per-protein summary statistics; the caller supplies the explicit (chromosome, start_bp, end_bp).
The skill ships with two fetch paths. Most users only need the first:
Bundled-slice path (no auth, no setup). Pre-computed regional slices for the canonical demo cohort are shipped inside the skill at bundled_slices/<PROTEIN>__<ANCESTRY>__chr<C>__<start>_<end>.json.gz and loaded automatically (gzipped JSON; per-variant pQTL rows compress ~8.5x, so a 5,000-variant slice is ~430 KB on disk vs ~3.5 MB raw). v0.1.0 ships the SORT1 / EUR / OID20213 slice (chr1:108,774,968-109,774,968, the 1p13.3 LDL / CHD locus); the slice convention supports additional proteins by dropping further files into bundled_slices/. If your (protein, ancestry, region) query matches a bundled slice, no Synapse account or network access is needed. Redistribution is permitted under CC-BY 4.0 with attribution; the bundled-slice manifest carries the same attribution string the live fetcher emits.
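The bundled-slice naming convention above can be sketched as a small lookup. This is a minimal illustration, not the skill's own loader: the function names are hypothetical, the match is assumed to be exact on all five fields, and the JSON payload shape is left opaque because the source does not describe it.

```python
import gzip
import json
from pathlib import Path
from typing import Optional


def bundled_slice_name(protein: str, ancestry: str, chromosome: str,
                       start_bp: int, end_bp: int) -> str:
    """Build a file name following <PROTEIN>__<ANCESTRY>__chr<C>__<start>_<end>.json.gz."""
    return f"{protein}__{ancestry}__chr{chromosome}__{start_bp}_{end_bp}.json.gz"


def load_bundled_slice(slice_dir: Path, protein: str, ancestry: str,
                       chromosome: str, start_bp: int, end_bp: int) -> Optional[object]:
    """Return the decoded JSON payload, or None when no bundled slice matches.

    Assumption: only an exact (protein, ancestry, region) match counts as a hit.
    """
    path = slice_dir / bundled_slice_name(protein, ancestry, chromosome, start_bp, end_bp)
    if not path.is_file():
        return None  # a caller would fall through to the live Synapse fetch here
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For the v0.1.0 demo query (SORT1, EUR, chr1:108,774,968-109,774,968) the helper yields the file name shipped in bundled_slices/.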
Live Synapse fetch (free PAT required). For arbitrary queries beyond the bundled demo cohort, the skill falls through to a live Synapse downloader. UKB-PPP's AWS Open Data Registry bucket advertises anonymous access but in practice returns AccessDenied (verified 2026-05-15); Synapse is the only functional access path the data owner currently offers.
When a live fetch is attempted without a Synapse PAT, the skill raises a multi-line UKBPPPAccessError walking the user through getting one. Summary of the steps:
view and download. Copy the token immediately (Synapse shows it once).
export SYNAPSE_AUTH_TOKEN=<token> and re-run.

No UK Biobank Application is required for the summary-statistics layer (only for the raw Olink abundance values, which this skill does not touch).
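The missing-token check described above might look roughly like the following. The class here is a stand-in that only borrows the UKBPPPAccessError name; its base class, message wording, and the `require_synapse_token` helper are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


class UKBPPPAccessError(RuntimeError):
    """Stand-in for the skill's access error; the real class may differ."""


def require_synapse_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Synapse PAT from the environment or raise with setup hints."""
    env = os.environ if env is None else env
    token = env.get("SYNAPSE_AUTH_TOKEN", "").strip()
    if not token:
        raise UKBPPPAccessError(
            "Live UKB-PPP fetch needs a free Synapse personal access token.\n"
            "Request one at https://www.synapse.org/Profile:settings, then run:\n"
            "  export SYNAPSE_AUTH_TOKEN=<token>\n"
            "and re-run the command."
        )
    return token
```

Passing the environment as a parameter keeps the check testable without mutating `os.environ`.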
UKB-PPP (Sun et al. 2023 Nature) is the largest open-access plasma proteomic GWAS resource, profiling 2,923 Olink Explore 3072 proteins across 54,219 UK Biobank participants stratified into European discovery (N=46,673) plus six smaller ancestry breakouts (African, Central/South Asian, East Asian, Middle Eastern, American Hispanic, and a Combined multi-ancestry meta-analysis). Summary statistics are released per (protein × ancestry) as REGENIE step-2 outputs and packaged as <HGNC>_<UniProt>_<OlinkID>_v1_<Panel>.tar archives on Synapse (syn51364943). Each tar contains one gzipped REGENIE file per chromosome (the autosomes plus X).

This skill resolves a protein label (HGNC or UniProt) to the canonical Synapse fileID, downloads the protein's tar to a local cache, extracts the per-chromosome file, filters to a (chr, start, end) window, and emits a harmonised TSV slice plus a provenance manifest.
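As an illustration of the label-to-file resolution described above, the sketch below parses archive names of the form <HGNC>_<UniProt>_<OlinkID>_v1_<Panel>.tar into an index keyed by both HGNC symbol and UniProt accession. The regular expression, the `ProteinTar` type, and the first-entry-wins behaviour are assumptions; the Synapse IDs and panel name in any usage are placeholders, not real values.

```python
import re
from typing import Dict, NamedTuple, Optional

# Assumed shape: the panel part may itself contain underscores, the UniProt part may not.
TAR_NAME = re.compile(
    r"^(?P<hgnc>.+)_(?P<uniprot>[^_]+)_(?P<olink_id>OID\d+)_v1_(?P<panel>.+)\.tar$"
)


class ProteinTar(NamedTuple):
    hgnc: str
    uniprot: str
    olink_id: str
    panel: str
    synapse_id: str


def parse_tar_name(file_name: str, synapse_id: str) -> Optional[ProteinTar]:
    """Split one archive name into its fields; return None for unrelated files."""
    match = TAR_NAME.match(file_name)
    if match is None:
        return None
    return ProteinTar(synapse_id=synapse_id, **match.groupdict())


def build_index(listing: Dict[str, str]) -> Dict[str, ProteinTar]:
    """Map upper-cased HGNC symbols and UniProt accessions to archive entries.

    `listing` maps file name -> Synapse ID; the first entry seen for a key wins.
    """
    index: Dict[str, ProteinTar] = {}
    for file_name, synapse_id in listing.items():
        entry = parse_tar_name(file_name, synapse_id)
        if entry is None:
            continue
        index.setdefault(entry.hgnc.upper(), entry)
        index.setdefault(entry.uniprot.upper(), entry)
    return index
```

With this layout, a lookup by "SORT1" and a lookup by "Q99523" return the same entry.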
Fire when the user (or upstream agent step) wants:
(chromosome, start_bp, end_bp) (see "Do NOT fire" item on trans for the caveat that the skill does not auto-detect trans peaks).

Do NOT fire when the user wants:
eqtl-catalogue-region-fetch instead (one fetcher handles all eQTL Catalogue quantification methods including ge/exon/tx/txrev/leafcutter, plus single-cell eQTL studies in v7+).
syn52364558 and are out of scope for the public locuscompare render path.
(chromosome, start_bp, end_bp) window must be supplied explicitly; the skill does not auto-detect trans peaks.

One skill, one task. This skill fetches one (protein × ancestry) pair's regional summary statistics from UKB-PPP and writes them as a harmonised TSV plus a provenance manifest. It does NOT iterate proteins, ancestries, or windows; it does NOT do pQTL fine-mapping or coloc directly; it does NOT fetch eQTL / sQTL / sceQTL (use eqtl-catalogue-region-fetch); it does NOT fetch the deCODE pQTL panel. The caller composes those workflows on top.
When an agent asks for a regional pQTL slice from UKB-PPP:
syn51365303 for EUR, etc.) and parses <HGNC>_<UniProt>_<OlinkID>_v1_<Panel>.tar filenames into a (HGNC, UniProt) -> Synapse fileID index. Lookup tolerates both keys; HGNC is the default surface. The listing call is auth-free; only the subsequent download requires a Synapse PAT.
synapseclient to the local cache (UKB_PPP_CACHE_DIR env or ~/.clawbio/ukb_ppp_region_fetch_cache/). Repeat fetches across regions on the same protein reuse the cached tar.
chr<N> on a strict word boundary so chr1 doesn't accidentally pull chr10.
chr_pos_ref_alt ALT-effect convention; LOG10P is converted to a linear p-value; A1FREQ above 0.5 is folded to MAF.
--output <dir>/: a flat variants.tsv (effect-allele-aligned, GRCh38, ALT-effect β), a manifest.yaml with provenance (study_label, release_label, protein_hgnc, protein_uniprot, olink_reagent_id, olink_panel, ancestry, ancestry_label, n_samples, synapse_id, source_url, fetched-at UTC timestamp, attribution string), and a report.md human-readable summary.

# Standard usage with a config file (Synapse PAT in env)
SYNAPSE_AUTH_TOKEN=... python skills/ukb-ppp-region-fetch/ukb_ppp_region_fetch.py \
    --input <config.json> --output <output_dir>

# Bundled demo (SORT1 plasma pQTL in EUR; the canonical 1p13.3 LDL/CHD locus)
SYNAPSE_AUTH_TOKEN=... python skills/ukb-ppp-region-fetch/ukb_ppp_region_fetch.py \
    --demo sort1_ukb_ppp_eur --output /tmp/sort1_ukbppp_demo

# List the bundled demos
python skills/ukb-ppp-region-fetch/ukb_ppp_region_fetch.py --list-demos

# Via ClawBio runner
SYNAPSE_AUTH_TOKEN=... python clawbio.py run ukb-ppp-region-fetch --input <config.json>
Config schema (JSON or YAML):
{
  "protein_label": "SORT1",
  "ancestry": "EUR",
  "chromosome": "1",
  "start_bp": 108774968,
  "end_bp": 109774968
}
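A caller preparing such a config may want to validate it before invoking the skill. The sketch below covers the JSON form only and is not the skill's own parser: the `RegionQuery` type and the validation rules (all five keys required, a leading "chr" tolerated, start strictly below end) are assumptions.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = ("protein_label", "ancestry", "chromosome", "start_bp", "end_bp")


@dataclass(frozen=True)
class RegionQuery:
    protein_label: str
    ancestry: str
    chromosome: str
    start_bp: int
    end_bp: int


def parse_query(text: str) -> RegionQuery:
    """Parse a JSON config string into a validated RegionQuery."""
    raw = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"config is missing keys: {', '.join(missing)}")
    chromosome = str(raw["chromosome"])
    if chromosome.lower().startswith("chr"):
        chromosome = chromosome[3:]  # accept "chr1" as well as "1" (assumed leniency)
    start_bp, end_bp = int(raw["start_bp"]), int(raw["end_bp"])
    if start_bp < 0 or end_bp <= start_bp:
        raise ValueError("expected 0 <= start_bp < end_bp")
    return RegionQuery(str(raw["protein_label"]), str(raw["ancestry"]).upper(),
                       chromosome, start_bp, end_bp)
```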
Running --demo sort1_ukb_ppp_eur (see examples/expected_output.md for the full reproduction):
info: using bundled demo
ukb-ppp-region-fetch: ~120,000 variants -> /tmp/sort1_ukbppp_demo/variants.tsv
source: UKB-PPP | SORT1 (Q99523, OID20213) | European (discovery) (EUR)
Live fetch requires a free Synapse PAT, not anonymous AWS Open Data. The AWS Open Data Registry page advertises arn:aws:s3:::ukbiobank.opendata.sagebase.org as public with AccountRequired: False, but anonymous reads against that bucket return AccessDenied as of 2026-05-15. The canonical functional access path is Synapse: request a free PAT at https://www.synapse.org/Profile:settings and export it as SYNAPSE_AUTH_TOKEN. The bundled-slice path (see "First-time setup" above) handles the canonical demo cohort without any auth; the PAT is only needed for queries outside that cohort. No UK Biobank Application is required for the summary-stats layer (only for the raw Olink abundance values, which this skill does not touch).
One protein, one tar, full-genome. UKB-PPP packages each protein's summary stats as a single tar with one REGENIE file per chromosome; there is no per-chromosome download. First-fetch for a protein downloads ~100–500 MB. Subsequent regional fetches on the same protein reuse the cached tar.
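The cache reuse described above can be pictured with a small helper. The UKB_PPP_CACHE_DIR variable and the ~/.clawbio/ukb_ppp_region_fetch_cache/ default appear elsewhere in this document; the flat one-file-per-tar layout and the function names are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the cache directory: env override first, then the home default."""
    env = os.environ if env is None else env
    override = env.get("UKB_PPP_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".clawbio" / "ukb_ppp_region_fetch_cache"


def cached_tar(tar_name: str, env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the cached tar path if it already exists, else None.

    Assumption: tars sit directly in the cache directory under their own name.
    """
    path = cache_dir(env) / tar_name
    return path if path.is_file() else None
```

A fetcher built this way would download only when `cached_tar` returns None, so a second regional query on the same protein skips the ~100–500 MB transfer.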
REGENIE LOG10P holds -log10(p), not the signed log10(p). Despite the column name, the values are |log10(p)| and therefore always non-negative. The skill converts to linear p_value = 10^-LOG10P at the row boundary. Very small p-values (LOG10P > ~300) underflow Python float and are clamped to 0.0 rather than raising.
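A minimal sketch of that conversion follows. The function name and the explicit cut-off are assumptions; the cut-off simply sits just past the smallest positive double (about 5e-324), so the clamp to 0.0 is explicit rather than left to floating-point underflow.

```python
def log10p_to_p(log10p: float) -> float:
    """Convert a REGENIE LOG10P value (-log10 p, non-negative) to a linear p-value."""
    if log10p < 0:
        raise ValueError("LOG10P is expected to be non-negative")
    if log10p > 323.0:
        return 0.0  # below the smallest positive double; clamp instead of raising
    return 10.0 ** (-log10p)
```

Downstream tools that need the full dynamic range would be better served by carrying LOG10P through unchanged alongside the linear value.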
A1FREQ is the ALLELE1 (ALT, effect) frequency, not MAF. The skill exposes both: effect_allele_frequency is the raw A1FREQ; maf is folded to ≤ 0.5. Downstream code (e.g. palindromic-variant excluder) reads effect_allele_frequency.
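The two frequency fields can be derived from A1FREQ as below. This is a sketch with a hypothetical helper name; only the output keys effect_allele_frequency and maf come from the text above.

```python
from typing import Dict


def frequency_fields(a1freq: float) -> Dict[str, float]:
    """Expose the raw effect-allele frequency and the folded minor-allele frequency."""
    if not 0.0 <= a1freq <= 1.0:
        raise ValueError("A1FREQ must lie in [0, 1]")
    return {
        "effect_allele_frequency": a1freq,   # raw A1FREQ, can exceed 0.5
        "maf": min(a1freq, 1.0 - a1freq),    # folded so it never exceeds 0.5
    }
```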
β is on the ALT allele. Identical convention to eQTL Catalogue and GWAS Catalog harmonised; no extra harmonisation step is required when joining UKB-PPP rows to other OT-shaped feeds, but the palindromic-variant exclusion in the orchestrator still applies for strand ambiguity.
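To make the allele convention concrete, here is a hedged sketch that filters REGENIE rows to a window and emits ALT-effect records with a strand-ambiguity flag. It assumes the standard REGENIE step-2 header (CHROM, GENPOS, ID, ALLELE0, ALLELE1, ..., BETA, SE, ...), whitespace-delimited rows, an inclusive window, and an Open Targets style chrom_pos_ref_alt identifier; the output key names are illustrative.

```python
from typing import Dict, Iterable, Iterator

# Allele pairs that read the same on both strands (A/T and C/G SNPs).
PALINDROMIC = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def harmonise_rows(lines: Iterable[str], chromosome: str,
                   start_bp: int, end_bp: int) -> Iterator[Dict[str, object]]:
    """Yield ALT-effect records for rows inside [start_bp, end_bp] on one chromosome."""
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return
    header = header_line.split()
    for line in rows:
        rec = dict(zip(header, line.split()))
        pos = int(rec["GENPOS"])
        if rec["CHROM"] != chromosome or not start_bp <= pos <= end_bp:
            continue
        ref, alt = rec["ALLELE0"], rec["ALLELE1"]  # ALLELE1 is the effect (ALT) allele
        yield {
            "variant_id": f"{rec['CHROM']}_{pos}_{ref}_{alt}",
            "other_allele": ref,
            "effect_allele": alt,
            "beta": float(rec["BETA"]),  # already on the ALT allele, no sign flip
            "se": float(rec["SE"]),
            "palindromic": (ref.upper(), alt.upper()) in PALINDROMIC,
        }
```

Flagging palindromic SNPs rather than dropping them leaves the exclusion decision to the orchestrator, as the text describes.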
Some HGNC symbols map to >1 Olink reagent. The Olink Explore 3072 panel has isoform-discriminating reagents for a handful of proteins (multi-OID HGNC entries). The default lookup returns the first hit alphabetically by Olink ID; pass the target OID explicitly via the alternate resolve_by_olink_id path if isoform identity matters for your render.
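The default-versus-explicit behaviour can be illustrated as follows. This sketch does not reproduce the skill's resolve_by_olink_id path, whose signature is not given here; `pick_olink_id` and the candidate table are hypothetical.

```python
from typing import Dict, List, Optional


def pick_olink_id(candidates: Dict[str, List[str]], hgnc: str,
                  olink_id: Optional[str] = None) -> str:
    """Choose an Olink ID for an HGNC symbol.

    Default: the first ID in alphabetical order. Explicit: the requested ID,
    provided it really belongs to that symbol.
    """
    oids = candidates.get(hgnc.upper())
    if not oids:
        raise KeyError(f"no Olink reagent known for {hgnc}")
    if olink_id is None:
        return sorted(oids)[0]
    if olink_id not in oids:
        raise ValueError(f"{olink_id} is not a reagent for {hgnc}")
    return olink_id
```

Passing the OID explicitly keeps a render reproducible even if further reagents for the same symbol appear in a later release.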
Per-chromosome file names vary slightly across the release. The parser matches chr<N> on a strict word boundary inside .tar members, accepting names like discovery_chr1_<protein>_*.regenie.gz or chr1_*.tsv.gz. The strict boundary prevents chr1 from accidentally matching chr10 / chr11, a class of bug that would silently return the wrong chromosome's data.
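One way to implement such a boundary is with lookarounds, sketched below under assumed member names. Note that a plain regex `\b` would not work for names like discovery_chr1_..., because the underscore counts as a word character; the pattern and helper names are illustrative, not the skill's own.

```python
import re
from typing import Iterable, List


def chromosome_pattern(chromosome: str) -> "re.Pattern":
    """Match chr<N> only when not glued to further letters or digits."""
    return re.compile(
        rf"(?<![A-Za-z0-9])chr{re.escape(chromosome)}(?![A-Za-z0-9])"
    )


def select_members(member_names: Iterable[str], chromosome: str) -> List[str]:
    """Return the tar member names that belong to the requested chromosome."""
    pattern = chromosome_pattern(chromosome)
    return [name for name in member_names if pattern.search(name)]
```

A caller would typically expect exactly one match per chromosome and treat zero or several as an error rather than guessing.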
Not for clinical decisions. This skill returns research-grade summary statistics from a public proteomic GWAS. Do not use the output for direct clinical decision-making, diagnosis, or treatment selection without independent validation by a qualified clinician.
Effect estimates may not generalise across populations. UKB-PPP's discovery cohort is overwhelmingly European (N=46,673 vs N=931 for African, the next-largest stratum). Effect sizes from EUR-discovery analyses should not be assumed to apply uniformly across other ancestries; the orchestrator's caption layer flags this when the ancestry side of an LD reference panel mismatches the source study.
Plasma vs tissue. UKB-PPP measures circulating plasma proteins, which is biologically distinct from cell- or tissue-level protein abundance. Downstream interpretation should not assume a plasma cis-pQTL implies an identical effect on intra-cellular abundance for the same protein.
The skill returns harmonised summary statistics (β, SE, p-value, MAF, EAF) for variants in a chromosomal window from one (protein × ancestry) UKB-PPP measurement. The agent should:
CLAUDE.md), expand all three fields: protein = SORT1 (Q99523, OID20213); ancestry = European (discovery) (EUR); N = 46,673.
SORT1 and the dataset is SORT1-AOH2 (a different isoform reagent), the agent must say so explicitly.
synapseclient Python library: Sage Bionetworks (Apache-2.0). Used by the live-fetch path; not invoked when serving a bundled slice.

npx claudepluginhub clawbio/clawbio --plugin clawbio