Build and install new genome references on FGCZ infrastructure following ezRun conventions. Use when adding new species/assemblies to /srv/GT/reference/, creating CellRanger-compatible indices, or updating existing references with new Ensembl/GENCODE releases. CRITICAL - must run on fgcz-r-029 due to rtracklayer segfault issues on compute nodes.
Slash command: /sequencing-pipelines:genome-reference-build
Build and deploy genome references for the FGCZ genomics infrastructure using the ezRun R package. This skill covers:
- Installing references under /srv/GT/reference/
- Building CellRanger indices with cellranger mkref
Use this skill when adding a new species or assembly, creating CellRanger-compatible indices, or updating an existing reference to a new Ensembl/GENCODE release.
MUST run builds on fgcz-r-029 (the development server)
The R packages rtracklayer and restfulr have segfault issues on compute nodes (fgcz-c-053, fgcz-c-055, etc.) due to LD_LIBRARY_PATH not being properly inherited. Only fgcz-r-029 has a working environment.
# Connect to development server
ssh fgcz-r-029
# Set up environment
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
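Because a misconfigured shell only shows up later as a segfault inside R, a quick pre-flight check can save a failed build. The following is a minimal sketch, not part of ezRun: the function names `path_list_contains` and `check_build_env` and the variables `R_LIB_DIR` and `BUILD_HOST` are hypothetical, and the values simply mirror the exports above.

```shell
# Sketch: pre-flight check for the build shell. Defaults mirror the exports above.
R_LIB_DIR="${R_LIB_DIR:-/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib}"
BUILD_HOST="${BUILD_HOST:-fgcz-r-029}"

# Succeed if directory $2 is an entry of the colon-separated list $1.
path_list_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one line per problem and return non-zero if anything looks wrong.
check_build_env() {
  local host status=0
  host=$(uname -n)
  host=${host%%.*}   # drop any domain suffix
  if [ "$host" != "$BUILD_HOST" ]; then
    echo "wrong host: $host (expected $BUILD_HOST)"
    status=1
  fi
  if ! path_list_contains "${LD_LIBRARY_PATH:-}" "$R_LIB_DIR"; then
    echo "LD_LIBRARY_PATH does not include $R_LIB_DIR"
    status=1
  fi
  if [ ! -e "$R_LIB_DIR/libR.so" ]; then
    echo "libR.so not found in $R_LIB_DIR"
    status=1
  fi
  return "$status"
}
```

Running `check_build_env && Rscript ...` would then refuse to start a build from a compute node or from a shell where the exports were skipped.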
| Tool | Path | Purpose |
|---|---|---|
| R 4.5.0 | /misc/ngseq12/packages/Dev/R/4.5.0/ | ezRun functions |
| Picard 3.2.0 | /usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar | Sequence dictionary |
| samtools 1.20 | /usr/local/ngseq/packages/Tools/samtools/1.20/bin/ | FASTA indexing |
| IGVTools 2.8.10 | /usr/local/ngseq/packages/Tools/IGVTools/2.8.10/ | GTF sorting/indexing |
| JDK 21 | /usr/local/ngseq/packages/Dev/jdk/21/bin/ | Java runtime |
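The tools in this table have to be reachable before the R script runs. A small helper along these lines can report which ones are not on PATH; `missing_tools` is a hypothetical name, and the executable names in the usage note (in particular `igvtools`) are assumptions about what each package installs.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Prints nothing when all of them are found.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools Rscript java samtools igvtools` prints nothing when the environment is complete. Picard is a jar rather than an executable, so it would be checked with `[ -f /usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar ]` instead.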
~/git/reference_files/
├── README.md                              # Conventions documentation
├── Organism_Name/                         # One directory per species
│   └── Database_ReleaseXX_BuildName.R     # Installation script(s)
└── ...
GitLab: https://gitlab.bfabric.org/Genomics/reference_files
Priority order for databases:
Finding URLs:
# Ensembl FTP (most organisms)
https://ftp.ensembl.org/pub/release-XXX/gtf/organism_name/
https://ftp.ensembl.org/pub/release-XXX/fasta/organism_name/dna/
# GENCODE (human/mouse only)
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_XX/
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_MXX/
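For Ensembl, the two download URLs can usually be derived from the species name, assembly, and release number. The sketch below infers the file-name pattern from the canary example used in this document (`Serinus_canaria.SCA1.115.gtf.gz` and `Serinus_canaria.SCA1.dna_sm.toplevel.fa.gz`); the function names are hypothetical, and some species deviate from the pattern, so the result should be checked against the FTP directory listing.

```shell
# Build Ensembl FTP URLs from species (Genus_species), assembly, and release number.
# Assumption: file names follow Species.Assembly.Release.gtf.gz and
# Species.Assembly.dna_sm.toplevel.fa.gz, as in the canary example.
ensembl_gtf_url() {
  local species="$1" assembly="$2" release="$3" lower
  lower=$(printf '%s' "$species" | tr '[:upper:]' '[:lower:]')
  printf 'https://ftp.ensembl.org/pub/release-%s/gtf/%s/%s.%s.%s.gtf.gz\n' \
    "$release" "$lower" "$species" "$assembly" "$release"
}

ensembl_genome_url() {
  local species="$1" assembly="$2" release="$3" lower
  lower=$(printf '%s' "$species" | tr '[:upper:]' '[:lower:]')
  printf 'https://ftp.ensembl.org/pub/release-%s/fasta/%s/dna/%s.%s.dna_sm.toplevel.fa.gz\n' \
    "$release" "$lower" "$species" "$assembly"
}
```

`ensembl_gtf_url Serinus_canaria SCA1 115` reproduces the GTF URL used in the canary script.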
File selection:
- GTF: Species.Assembly.XXX.gtf.gz
- Genome FASTA: dna_sm.toplevel.fa.gz (soft-masked, all sequences)
Naming convention: Database_ReleaseXX_BuildName.R
Examples:
- Ensembl_Release115_SCA1.R (Canary)
- GENCODE_Release42_GRCh38.p13.R (Human)
- Ensembl_Release112_bGalGal1.mat.broiler.GRCg7b.R (Chicken)
Template script:
library(ezRun)
rm(list=ls())
options(timeout=300)
## Load tool environments
Sys.setenv(Picard_jar = '/usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar')
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/IGVTools/2.8.10", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/samtools/1.20/bin/", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Dev/jdk/21/bin", Sys.getenv("PATH"), sep = ":"))
## Config
setwdNew("/srv/GT/analysis/USERNAME/references") # Change USERNAME
organism <- "Genus_species"
db <- "Ensembl" # or "GENCODE", "NCBI"
build <- "AssemblyName"
release <- "Release_XXX"
## Download files
gtfURL <- "https://ftp.ensembl.org/pub/release-XXX/gtf/organism/file.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "https://ftp.ensembl.org/pub/release-XXX/fasta/organism/dna/file.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)
## Build reference (date = download date)
refBuild <- file.path(organism, db, build, "Annotation",
                      str_c(release, Sys.Date(), sep="-"))
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))
## Core build functions
buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)
## Build annotation tables (optional - may fail for some organisms)
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "features.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="organism_gene_ensembl")  # BioMart dataset name
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "genes.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="organism_gene_ensembl")
For makeFeatAnnoEnsembl(), you need the BioMart dataset name:
# In R, find available datasets:
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes")
datasets <- listDatasets(ensembl)
datasets[grep("canaria", datasets$dataset, ignore.case = TRUE), ]
Common patterns:
- hsapiens_gene_ensembl (human)
- mmusculus_gene_ensembl (mouse)
- ggallus_gene_ensembl (chicken)
- scanaria_gene_ensembl (canary)
# SSH to development server
ssh fgcz-r-029
# Set up environment
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
# Run script
cd ~/git/reference_files/Organism_Name/
Rscript Database_ReleaseXX_BuildName.R
Expected runtime: 10-30 minutes depending on genome size
Required files in /srv/GT/analysis/USERNAME/references/Organism/Database/Build/:
Organism/Database/Build/
├── igv_Build.genome
├── Annotation/
│   ├── Genes -> Release_XXX-YYYY-MM-DD/Genes   (symlink)
│   └── Release_XXX-YYYY-MM-DD/
│       └── Genes/
│           ├── genes.gtf
│           ├── genes.sorted.gtf
│           ├── genes.sorted.gtf.idx
│           ├── features.gtf
│           ├── transcripts.only.gtf
│           ├── genes_annotation_byGene.txt         (optional)
│           └── genes_annotation_byTranscript.txt   (optional)
└── Sequence/
    └── WholeGenomeFasta/
        ├── genome.fa
        ├── genome.fa.fai
        ├── genome.dict
        └── genome-chromsizes.txt
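The required files listed above can be checked mechanically before anything is copied to production. This is a minimal sketch with a hypothetical function name, `verify_reference`. It assumes the IGV file is named `igv_<Build>.genome` after the build directory, reads the annotation files through the `Annotation/Genes` symlink, and skips the two optional annotation tables.

```shell
# Check that a build directory (Organism/Database/Build) contains the required files.
# Prints one line per missing or empty file; returns non-zero if any is absent.
verify_reference() {
  local build_dir="${1%/}" rel status=0 build_name
  build_name=$(basename "$build_dir")   # assumption: igv file is named after the build
  for rel in \
    "igv_${build_name}.genome" \
    Annotation/Genes/genes.gtf \
    Annotation/Genes/genes.sorted.gtf \
    Annotation/Genes/genes.sorted.gtf.idx \
    Annotation/Genes/features.gtf \
    Annotation/Genes/transcripts.only.gtf \
    Sequence/WholeGenomeFasta/genome.fa \
    Sequence/WholeGenomeFasta/genome.fa.fai \
    Sequence/WholeGenomeFasta/genome.dict \
    Sequence/WholeGenomeFasta/genome-chromsizes.txt
  do
    if [ ! -s "$build_dir/$rel" ]; then
      echo "missing or empty: $rel"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `verify_reference /srv/GT/analysis/USERNAME/references/Organism/Database/Build` also catches a missing `Genes` symlink, because every annotation path is resolved through it.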
If the symlink wasn't created automatically:
cd /srv/GT/analysis/USERNAME/references/Organism/Database/Build/Annotation/
ln -sf Release_XXX-YYYY-MM-DD/Genes Genes
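One caveat with plain `ln -sf`: if `Genes` already exists as a symlink to an older release directory, the new link is created inside that target rather than replacing it. The sketch below uses `ln -sfn` to avoid this; `link_genes` is a hypothetical helper, not an ezRun function.

```shell
# Point Annotation/Genes at the Genes directory of the given release.
# Usage: link_genes ANNOTATION_DIR RELEASE_DIR_NAME
link_genes() {
  local annotation_dir="$1" release_dir="$2"
  if [ ! -d "$annotation_dir/$release_dir/Genes" ]; then
    echo "no Genes directory under $annotation_dir/$release_dir" >&2
    return 1
  fi
  if [ -e "$annotation_dir/Genes" ] && [ ! -L "$annotation_dir/Genes" ]; then
    echo "$annotation_dir/Genes exists and is not a symlink" >&2
    return 1
  fi
  # -n replaces an existing symlink instead of descending into its target
  ( cd "$annotation_dir" && ln -sfn "$release_dir/Genes" Genes )
}
```

The link stays relative, so it survives copying the whole tree to another root.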
# Copy to production location
cp -r /srv/GT/analysis/USERNAME/references/Organism /srv/GT/reference/
# Verify
ls -la /srv/GT/reference/Organism/Database/Build/
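When the organism directory already exists in production (for example when adding a second build), a recursive copy of the whole organism merges into it silently. A more cautious variant copies a single build and refuses to overwrite. This is a sketch under that assumption, with a hypothetical function name, `deploy_reference`.

```shell
# Copy one build from the working area to the production root without overwriting.
# Usage: deploy_reference SRC_ROOT DEST_ROOT Organism/Database/Build
deploy_reference() {
  local src_root="$1" dest_root="$2" build_path="$3" parent
  if [ ! -d "$src_root/$build_path" ]; then
    echo "source build not found: $src_root/$build_path" >&2
    return 1
  fi
  if [ -e "$dest_root/$build_path" ]; then
    echo "refusing to overwrite existing $dest_root/$build_path" >&2
    return 1
  fi
  parent=$(dirname "$dest_root/$build_path")
  # -a keeps the relative Genes symlink and file timestamps intact
  mkdir -p "$parent" && cp -a "$src_root/$build_path" "$parent/"
}
```

For example: `deploy_reference /srv/GT/analysis/USERNAME/references /srv/GT/reference Organism/Database/Build`.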
For 10x Genomics compatibility:
# Load CellRanger
module load Tools/cellranger/9.0.0 # or current version
# Build index (takes 1-2 hours, ~32GB RAM)
cellranger mkref \
--genome=Organism_Build \
--fasta=/srv/GT/reference/Organism/Database/Build/Sequence/WholeGenomeFasta/genome.fa \
--genes=/srv/GT/reference/Organism/Database/Build/Annotation/Genes/genes.gtf \
--nthreads=16 \
--memgb=64
# Move to reference location
mv Organism_Build /srv/GT/reference/Organism/Database/Build/Annotation/Release_XXX-YYYY-MM-DD/Genes/genes_10XGEX_Index/
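Since `cellranger mkref` runs for a long time, it may be worth confirming that the output directory looks complete before or after moving it. The sketch below checks for `reference.json`, `fasta/genome.fa`, and the `star/` directory, which is the layout Cell Ranger references normally have. Treat that layout as an assumption and compare it with a known-good index for the version in use; `check_10x_index` is a hypothetical name.

```shell
# Report obviously incomplete CellRanger reference directories.
# Assumption: a finished mkref output has reference.json, fasta/genome.fa and star/.
check_10x_index() {
  local index_dir="${1%/}" status=0
  if [ ! -f "$index_dir/reference.json" ]; then
    echo "missing: reference.json"
    status=1
  fi
  if [ ! -f "$index_dir/fasta/genome.fa" ]; then
    echo "missing: fasta/genome.fa"
    status=1
  fi
  if [ ! -d "$index_dir/star" ]; then
    echo "missing: star/"
    status=1
  fi
  return "$status"
}
```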
cd ~/git/reference_files
git add Organism_Name/Database_ReleaseXX_BuildName.R
git commit -m "Add Organism (common name) Database Release XXX reference
- Assembly: BuildName
- Source: Database Release XXX
- Build date: YYYY-MM-DD
Co-Authored-By: Claude Opus 4.5 <[email protected]>"
git push
Ensembl_Release115_SCA1.R:
library(ezRun)
rm(list=ls())
options(timeout=300)
## Load environments
Sys.setenv(Picard_jar = '/usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar')
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/IGVTools/2.8.10", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/samtools/1.20/bin/", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Dev/jdk/21/bin", Sys.getenv("PATH"), sep = ":"))
## Config
setwdNew("/srv/GT/analysis/pgueguen/references")
organism <- "Serinus_canaria"
db <- "Ensembl"
build <- "SCA1"
release <- "Release_115"
## Download
gtfURL <- "https://ftp.ensembl.org/pub/release-115/gtf/serinus_canaria/Serinus_canaria.SCA1.115.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "https://ftp.ensembl.org/pub/release-115/fasta/serinus_canaria/dna/Serinus_canaria.SCA1.dna_sm.toplevel.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)
## Build reference
refBuild <- file.path(organism, db, build, "Annotation",
                      str_c(release, Sys.Date(), sep="-"))
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))
buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)
## Annotation tables (BioMart dataset: scanaria_gene_ensembl)
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "features.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="scanaria_gene_ensembl")
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "genes.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="scanaria_gene_ensembl")
ssh fgcz-r-029
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
cd ~/git/reference_files/Serinus_canaria/
Rscript Ensembl_Release115_SCA1.R
cp -r /srv/GT/analysis/pgueguen/references/Serinus_canaria /srv/GT/reference/
refBuild = Serinus_canaria/Ensembl/SCA1
Symptom: caught segfault, address 0x..., cause 'invalid permissions'
Cause: LD_LIBRARY_PATH not set correctly, libR.so not found
Solution: run the build on fgcz-r-029 and make sure LD_LIBRARY_PATH includes the R library path:
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
Symptom: "Biotypes configuration file not found" or BioMart connection errors
Cause: BioMart dataset name incorrect or server unreachable
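Solution: check the dataset name against the list of available BioMart datasets, and retry later if the server is unreachable: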
library(biomaRt)
listDatasets(useEnsembl("genes"))
Symptom: sh: java: not found
Cause: JDK not in PATH
Solution: Add JDK to PATH before running:
export PATH="/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
Symptom: Download fails or times out
Solution: Increase timeout in script:
options(timeout=600) # 10 minutes
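A download that timed out can leave a truncated `.gz` file behind, which then fails much later in the build. A quick integrity test on the downloaded files catches this early. The sketch uses `gzip -t`; `check_gzip` is a hypothetical helper name.

```shell
# Test each gzip file for truncation or corruption.
# Prints a status line per file; returns non-zero if any file fails.
check_gzip() {
  local f status=0
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "ok: $f"
    else
      echo "corrupt or truncated: $f"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_gzip *.gtf.gz *.fa.gz` in the working directory before re-running the script.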
| Type | Path |
|---|---|
| Production references | /srv/GT/reference/ |
| Working directory | /srv/GT/analysis/USERNAME/references/ |
| Script repository | ~/git/reference_files/ |
| GitLab | https://gitlab.bfabric.org/Genomics/reference_files |
| Function | Purpose |
|---|---|
| setwdNew(path) | Create and change to working directory |
| ezParam(list) | Create parameter object with refBuild |
| buildRefDir(ezRef, genomeFile, genesFile) | Build core reference structure |
| buildIgvGenome(ezRef) | Create IGV genome file |
| makeFeatAnnoEnsembl(featureFile, genomeFile, organism) | Generate annotation tables |
Check existing references:
ls /srv/GT/reference/
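Because references follow the Organism/Database/Build layout, the installed builds and any 10x indices can be listed with `find`. This is a sketch with hypothetical function names. It assumes every third-level directory is a build and that 10x indices use the `genes_10XGEX_Index` directory name shown earlier.

```shell
# List installed builds as Organism/Database/Build, one per line.
list_builds() {
  local root="${1:-/srv/GT/reference}"
  ( cd "$root" && find . -mindepth 3 -maxdepth 3 -type d | sed 's|^\./||' | LC_ALL=C sort )
}

# List the 10x index directories found below the reference root.
list_10x_indices() {
  local root="${1:-/srv/GT/reference}"
  ( cd "$root" && find . -type d -name genes_10XGEX_Index | sed 's|^\./||' | LC_ALL=C sort )
}
```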
Common organisms with 10x indices: