Skill

genome-reference-build

Build and install new genome references on FGCZ infrastructure following ezRun conventions. Use when adding new species/assemblies to /srv/GT/reference/, creating CellRanger-compatible indices, or updating existing references with new Ensembl/GENCODE releases. CRITICAL - must run on fgcz-r-029 due to rtracklayer segfault issues on compute nodes.

Invocation

Slash command

/sequencing-pipelines:genome-reference-build

Last commit: May 27, 2026

Genome Reference Build Skill

Overview

Build and deploy genome references for the FGCZ genomics infrastructure using the ezRun R package. This skill covers:

Source identification - Finding genomes from Ensembl, GENCODE, or NCBI
Script creation - Following FGCZ naming and structure conventions
Build execution - Running on the correct server (fgcz-r-029)
Deployment - Copying to production /srv/GT/reference/
CellRanger compatibility - Building 10x indices with cellranger mkref
Version control - Committing scripts to GitLab

When to Use

Use this skill when:

Adding a new species/organism to the reference collection
Updating an existing reference with a new Ensembl/GENCODE release
Building CellRanger-compatible indices for 10x workflows
Troubleshooting reference build failures

Critical Requirements

Server Requirement

MUST run builds on fgcz-r-029 (the development server)

The R packages rtracklayer and restfulr segfault on compute nodes (fgcz-c-053, fgcz-c-055, etc.) because LD_LIBRARY_PATH is not inherited correctly there. Only fgcz-r-029 has a working environment.

# Connect to development server
ssh fgcz-r-029

# Set up environment
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
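
A small guard can catch a wrong host or an incomplete environment before a build starts. The following is a minimal sketch, not part of ezRun: the function names `is_build_host` and `has_r_lib_path` and the variables `BUILD_HOST` and `R_LIB_DIR` are illustrative, and only the host name and library path are taken from the text above.

```shell
# Hypothetical pre-flight helpers; names are illustrative, values come from this document.
BUILD_HOST="fgcz-r-029"
R_LIB_DIR="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib"

# Succeeds only when the given short host name is the build server.
is_build_host() {
    [ "$1" = "$BUILD_HOST" ]
}

# Succeeds when the colon-separated list in $1 contains the R library directory.
has_r_lib_path() {
    case ":$1:" in
        *":$R_LIB_DIR:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use before calling Rscript:
#   is_build_host "$(hostname -s)" || { echo "Run this on $BUILD_HOST" >&2; exit 1; }
#   has_r_lib_path "$LD_LIBRARY_PATH" || { echo "LD_LIBRARY_PATH lacks $R_LIB_DIR" >&2; exit 1; }
```

Taking the host name and path list as arguments keeps both checks easy to try out on any machine.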

Tool Dependencies

| Tool | Path | Purpose |
|---|---|---|
| R 4.5.0 | `/misc/ngseq12/packages/Dev/R/4.5.0/` | ezRun functions |
| Picard 3.2.0 | `/usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar` | Sequence dictionary |
| samtools 1.20 | `/usr/local/ngseq/packages/Tools/samtools/1.20/bin/` | FASTA indexing |
| IGVTools 2.8.10 | `/usr/local/ngseq/packages/Tools/IGVTools/2.8.10/` | GTF sorting/indexing |
| JDK 21 | `/usr/local/ngseq/packages/Dev/jdk/21/bin/` | Java runtime |
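
Once the tool directories are on PATH, a quick check that the executables resolve may save a failed build. This sketch uses only the POSIX `command -v` builtin. The helper name `require_tools` is hypothetical, and the executable names in the usage comment (`Rscript`, `java`, `samtools`, `igvtools`) are assumptions about what the directories above provide.

```shell
# Hypothetical helper: report every named command that is not on PATH.
# Returns non-zero if at least one is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use (executable names are assumptions):
#   require_tools Rscript java samtools igvtools || exit 1
```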

Repository Structure

~/git/reference_files/
├── README.md                              # Conventions documentation
├── Organism_Name/                         # One directory per species
│   └── Database_ReleaseXX_BuildName.R    # Installation script(s)
└── ...

GitLab: https://gitlab.bfabric.org/Genomics/reference_files

Step-by-Step Workflow

Step 1: Identify Genome Source

Priority order for databases:

1. GENCODE - Preferred for human/mouse (comprehensive annotations)
2. Ensembl - Available for most organisms
3. NCBI/RefSeq - Fallback for rare organisms
4. Species-specific - CGD (Candida), WBPS (parasites), etc.

Finding URLs:

# Ensembl FTP (most organisms)
https://ftp.ensembl.org/pub/release-XXX/gtf/organism_name/
https://ftp.ensembl.org/pub/release-XXX/fasta/organism_name/dna/

# GENCODE (human/mouse only)
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_XX/
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_MXX/

File selection:

GTF: Use the main annotation file (e.g., Species.Assembly.XXX.gtf.gz)
FASTA: Use dna_sm.toplevel.fa.gz (soft-masked, all sequences)
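
The Ensembl URL patterns above can be assembled mechanically. The sketch below is an illustration only: the function names are hypothetical, and the file-name layout is inferred from the canary example in this document, so it may not hold for every species. Check the FTP directory listing before relying on a generated URL.

```shell
# Hypothetical URL builders; the file-name pattern is inferred from the
# Serinus_canaria example and is an assumption for other species.
ENSEMBL_FTP="https://ftp.ensembl.org/pub"

# Args: release number, species directory (lowercase), file prefix (capitalised), assembly
ensembl_gtf_url() {
    printf '%s/release-%s/gtf/%s/%s.%s.%s.gtf.gz\n' \
        "$ENSEMBL_FTP" "$1" "$2" "$3" "$4" "$1"
}

# Same arguments; points at the soft-masked toplevel FASTA
ensembl_fasta_url() {
    printf '%s/release-%s/fasta/%s/dna/%s.%s.dna_sm.toplevel.fa.gz\n' \
        "$ENSEMBL_FTP" "$1" "$2" "$3" "$4"
}

# Possible use:
#   ensembl_gtf_url 115 serinus_canaria Serinus_canaria SCA1
#   ensembl_fasta_url 115 serinus_canaria Serinus_canaria SCA1
```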

Step 2: Create Installation Script

Naming convention: Database_ReleaseXX_BuildName.R

Examples:

Ensembl_Release115_SCA1.R (Canary)
GENCODE_Release42_GRCh38.p13.R (Human)
Ensembl_Release112_bGalGal1.mat.broiler.GRCg7b.R (Chicken)
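
To keep script names consistent with the convention, the name can be composed from its three parts. This is a minimal sketch with a hypothetical helper name, `ref_script_name`. It does no validation beyond rejecting empty arguments.

```shell
# Hypothetical helper: compose Database_ReleaseXX_BuildName.R from its parts.
# Args: database, release, build name
ref_script_name() {
    if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
        echo "usage: ref_script_name DATABASE RELEASE BUILD" >&2
        return 1
    fi
    printf '%s_Release%s_%s.R\n' "$1" "$2" "$3"
}

# Possible use:
#   ref_script_name Ensembl 115 SCA1
```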

Template script:

library(ezRun)
rm(list=ls())
options(timeout=300)

## Load tool environments
Sys.setenv(Picard_jar = '/usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar')
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/IGVTools/2.8.10", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/samtools/1.20/bin/", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Dev/jdk/21/bin", Sys.getenv("PATH"), sep = ":"))

## Config
setwdNew("/srv/GT/analysis/USERNAME/references")  # Change USERNAME
organism <- "Genus_species"
db <- "Ensembl"  # or "GENCODE", "NCBI"
build <- "AssemblyName"
release <- "Release_XXX"

## Download files
gtfURL <- "https://ftp.ensembl.org/pub/release-XXX/gtf/organism/file.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "https://ftp.ensembl.org/pub/release-XXX/fasta/organism/dna/file.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)

## Build reference (date = download date)
refBuild <- file.path(organism, db, build, "Annotation",
                      paste(release, Sys.Date(), sep="-"))  # base R; no dependence on stringr being attached
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))

## Core build functions
buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)

## Build annotation tables (optional - may fail for some organisms)
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "features.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="organism_gene_ensembl")  # BioMart dataset name

featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "genes.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="organism_gene_ensembl")

Step 3: Find BioMart Organism Name

For makeFeatAnnoEnsembl(), you need the BioMart dataset name:

# In R, find available datasets:
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes")
datasets <- listDatasets(ensembl)
datasets[grep("canaria", datasets$dataset, ignore.case = TRUE), ]

Common patterns:

Human: hsapiens_gene_ensembl
Mouse: mmusculus_gene_ensembl
Chicken: ggallus_gene_ensembl
Canary: scanaria_gene_ensembl
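
The four names above share a pattern: first letter of the genus, then the species, then `_gene_ensembl`. The sketch below produces a first guess from a `Genus_species` directory name. The pattern is inferred from these examples only, and the helper name `biomart_dataset_guess` is hypothetical. Names with more than two parts are rejected because they do not fit this simple rule, and any guess should still be confirmed with `listDatasets()`.

```shell
# Hypothetical helper: guess a BioMart dataset name from "Genus_species".
# The pattern is inferred from the examples in this document, not guaranteed.
biomart_dataset_guess() {
    case $1 in
        *_*_*|*[!A-Za-z_]*) return 1 ;;   # three or more parts, or unexpected characters
        ?*_?*) ;;                          # plain Genus_species
        *) return 1 ;;                     # no underscore or an empty part
    esac
    genus=${1%%_*}
    species=${1#*_}
    initial=$(printf '%s' "$genus" | cut -c1)
    printf '%s%s_gene_ensembl\n' "$initial" "$species" | tr '[:upper:]' '[:lower:]'
}

# Possible use:
#   biomart_dataset_guess Serinus_canaria
```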

Step 4: Run the Build

# SSH to development server
ssh fgcz-r-029

# Set up environment
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"

# Run script
cd ~/git/reference_files/Organism_Name/
Rscript Database_ReleaseXX_BuildName.R

Expected runtime: 10-30 minutes depending on genome size

Step 5: Verify Build Output

Required files in /srv/GT/analysis/USERNAME/references/Organism/Database/Build/:

Organism/Database/Build/
├── igv_Build.genome
├── Annotation/
│   ├── Genes -> Release_XXX-YYYY-MM-DD/Genes  (symlink)
│   └── Release_XXX-YYYY-MM-DD/
│       └── Genes/
│           ├── genes.gtf
│           ├── genes.sorted.gtf
│           ├── genes.sorted.gtf.idx
│           ├── features.gtf
│           ├── transcripts.only.gtf
│           ├── genes_annotation_byGene.txt      (optional)
│           └── genes_annotation_byTranscript.txt (optional)
└── Sequence/
    └── WholeGenomeFasta/
        ├── genome.fa
        ├── genome.fa.fai
        ├── genome.dict
        └── genome-chromsizes.txt
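
The required files in the tree above can be checked in one pass. This is a sketch with a hypothetical helper name, `verify_ref_build`. It goes through the `Annotation/Genes` symlink, so a missing link also shows up as missing files. It leaves out the optional annotation tables and the IGV genome file, whose exact name depends on the build.

```shell
# Hypothetical helper: check the required files under a build directory
# (Organism/Database/Build). Prints each missing or empty file and returns
# non-zero if any check fails.
verify_ref_build() {
    build_dir=$1
    status=0
    for rel in \
        Annotation/Genes/genes.gtf \
        Annotation/Genes/genes.sorted.gtf \
        Annotation/Genes/genes.sorted.gtf.idx \
        Annotation/Genes/features.gtf \
        Annotation/Genes/transcripts.only.gtf \
        Sequence/WholeGenomeFasta/genome.fa \
        Sequence/WholeGenomeFasta/genome.fa.fai \
        Sequence/WholeGenomeFasta/genome.dict \
        Sequence/WholeGenomeFasta/genome-chromsizes.txt
    do
        if [ ! -s "$build_dir/$rel" ]; then
            echo "missing or empty: $rel"
            status=1
        fi
    done
    return "$status"
}

# Possible use:
#   verify_ref_build /srv/GT/analysis/USERNAME/references/Organism/Database/Build
```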

Step 6: Create Genes Symlink

If the symlink wasn't created automatically:

cd /srv/GT/analysis/USERNAME/references/Organism/Database/Build/Annotation/
ln -sfn Release_XXX-YYYY-MM-DD/Genes Genes  # -n replaces an existing Genes link instead of writing inside its target
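
When a build adds a new release next to an older one, the link has to be repointed rather than created. The sketch below wraps that in a hypothetical helper, `link_genes`, which refuses to create a dangling link. It relies on the `-n` option of `ln`, available in GNU and BSD implementations, so that an existing `Genes` link is replaced rather than followed.

```shell
# Hypothetical helper: point <annotation_dir>/Genes at <release_name>/Genes
# using a relative link, replacing any previous link.
link_genes() {
    annotation_dir=$1
    release_name=$2
    if [ ! -d "$annotation_dir/$release_name/Genes" ]; then
        echo "no Genes directory under $annotation_dir/$release_name" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$annotation_dir" && ln -sfn "$release_name/Genes" Genes )
}

# Possible use:
#   link_genes /srv/GT/analysis/USERNAME/references/Organism/Database/Build/Annotation Release_XXX-YYYY-MM-DD
```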

Step 7: Deploy to Production

# Copy to production location
cp -r /srv/GT/analysis/USERNAME/references/Organism /srv/GT/reference/

# Verify
ls -la /srv/GT/reference/Organism/Database/Build/
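
A directory listing confirms that the copy happened, but not that every file arrived intact. As a stronger check, the sketch below lists regular files from the working copy that are missing or different in the destination, and ignores anything else that already lives in production. The helper name `missing_in_dest` is hypothetical. Comparing whole genomes byte by byte takes a while on large references.

```shell
# Hypothetical helper: print files under <src> that are absent from, or differ
# in, <dst>. Prints nothing when the deployed copy is complete.
missing_in_dest() {
    src=$1
    dst=$2
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        # cmp -s is silent and fails for both differing and missing files
        cmp -s "$src/$f" "$dst/$f" 2>/dev/null || echo "$f"
    done
}

# Possible use:
#   [ -z "$(missing_in_dest /srv/GT/analysis/USERNAME/references/Organism /srv/GT/reference/Organism)" ] \
#       && echo "deployment complete"
```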

Step 8: Build CellRanger Index (Optional)

For 10x Genomics compatibility:

# Load CellRanger
module load Tools/cellranger/9.0.0  # or current version

# Build index (takes 1-2 hours; needs ~32 GB RAM, with --memgb capping usage at 64 GB)
cellranger mkref \
    --genome=Organism_Build \
    --fasta=/srv/GT/reference/Organism/Database/Build/Sequence/WholeGenomeFasta/genome.fa \
    --genes=/srv/GT/reference/Organism/Database/Build/Annotation/Genes/genes.gtf \
    --nthreads=16 \
    --memgb=64

# Move to reference location
mv Organism_Build /srv/GT/reference/Organism/Database/Build/Annotation/Release_XXX-YYYY-MM-DD/Genes/genes_10XGEX_Index/
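
Because the index build is slow, it can be worth confirming beforehand that the annotation and the sequence use the same contig names, which matters most when GTF and FASTA come from different sources. The sketch below compares column 1 of `genes.gtf` with the names in `genome.fa.fai`. The helper name is hypothetical, and how `cellranger mkref` itself reacts to a mismatch is not covered here.

```shell
# Hypothetical helper: print contig names used in the GTF ($1) that have no
# entry in the FASTA index ($2). Prints nothing when all contigs are known.
gtf_contigs_missing_from_fai() {
    if [ ! -s "$2" ]; then
        echo "FASTA index is missing or empty: $2" >&2
        return 1
    fi
    awk -F '\t' '
        NR == FNR { known[$1] = 1; next }   # first file: .fai, column 1 is the sequence name
        /^#/      { next }                  # skip GTF header comments
        !($1 in known) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$2" "$1"
}

# Possible use:
#   gtf_contigs_missing_from_fai Annotation/Genes/genes.gtf Sequence/WholeGenomeFasta/genome.fa.fai
```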

Step 9: Commit to GitLab

cd ~/git/reference_files
git add Organism_Name/Database_ReleaseXX_BuildName.R
git commit -m "Add Organism (common name) Database Release XXX reference

- Assembly: BuildName
- Source: Database Release XXX
- Build date: YYYY-MM-DD

Co-Authored-By: Claude Opus 4.5 <[email protected]>"
git push

Complete Example: Canary (Serinus canaria)

Script: `Ensembl_Release115_SCA1.R`

library(ezRun)
rm(list=ls())
options(timeout=300)

## Load environments
Sys.setenv(Picard_jar = '/usr/local/ngseq/packages/Tools/Picard/3.2.0/picard.jar')
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/IGVTools/2.8.10", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Tools/samtools/1.20/bin/", Sys.getenv("PATH"), sep = ":"))
Sys.setenv(PATH = paste("/usr/local/ngseq/packages/Dev/jdk/21/bin", Sys.getenv("PATH"), sep = ":"))

## Config
setwdNew("/srv/GT/analysis/pgueguen/references")
organism <- "Serinus_canaria"
db <- "Ensembl"
build <- "SCA1"
release <- "Release_115"

## Download
gtfURL <- "https://ftp.ensembl.org/pub/release-115/gtf/serinus_canaria/Serinus_canaria.SCA1.115.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "https://ftp.ensembl.org/pub/release-115/fasta/serinus_canaria/dna/Serinus_canaria.SCA1.dna_sm.toplevel.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)

## Build reference
refBuild <- file.path(organism, db, build, "Annotation",
                      paste(release, Sys.Date(), sep="-"))  # base R; no dependence on stringr being attached
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))

buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)

## Annotation tables (BioMart dataset: scanaria_gene_ensembl)
featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "features.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="scanaria_gene_ensembl")

featureFile <- file.path(dirname(param$ezRef@refFeatureFile), "genes.gtf")
makeFeatAnnoEnsembl(featureFile=featureFile,
                    genomeFile=param$ezRef@refFastaFile,
                    organism="scanaria_gene_ensembl")

Execution

ssh fgcz-r-029
export PATH="/misc/ngseq12/packages/Dev/R/4.5.0/bin:/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"
export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"
cd ~/git/reference_files/Serinus_canaria/
Rscript Ensembl_Release115_SCA1.R

Deployment

cp -r /srv/GT/analysis/pgueguen/references/Serinus_canaria /srv/GT/reference/

SUSHI Usage

refBuild = Serinus_canaria/Ensembl/SCA1

Troubleshooting

Segfault when loading rtracklayer

Symptom: caught segfault, address 0x..., cause 'invalid permissions'

Cause: LD_LIBRARY_PATH not set correctly, libR.so not found

Solution:

Run on fgcz-r-029, NOT compute nodes

Ensure LD_LIBRARY_PATH includes R library path:

export LD_LIBRARY_PATH="/misc/ngseq12/packages/Dev/R/4.5.0/lib/R/lib:$LD_LIBRARY_PATH"

makeFeatAnnoEnsembl fails

Symptom: "Biotypes configuration file not found" or BioMart connection errors

Cause: BioMart dataset name incorrect or server unreachable

Solution:

Skip annotation tables - core reference still works

Verify BioMart dataset name in R:

library(biomaRt)
listDatasets(useEnsembl("genes"))

Try again later if BioMart server is down

Java not found

Symptom: sh: java: not found

Cause: JDK not in PATH

Solution: Add JDK to PATH before running:

export PATH="/usr/local/ngseq/packages/Dev/jdk/21/bin:$PATH"

Download timeout

Symptom: Download fails or times out

Solution: Increase timeout in script:

options(timeout=600)  # 10 minutes
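
An interrupted download can leave a truncated `.gz` file that only fails later in the build. A quick integrity test from the shell may help here. This sketch uses the standard `gzip -t` option, and the helper name `check_gz` is hypothetical.

```shell
# Hypothetical helper: succeed only if every given file passes gzip's
# integrity test. Stops at the first corrupt or incomplete file.
check_gz() {
    for f in "$@"; do
        if ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt or incomplete: $f" >&2
            return 1
        fi
    done
}

# Possible use, in the working directory after the download step:
#   check_gz Serinus_canaria.SCA1.115.gtf.gz Serinus_canaria.SCA1.dna_sm.toplevel.fa.gz
```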

Reference Paths

| Type | Path |
|---|---|
| Production references | `/srv/GT/reference/` |
| Working directory | `/srv/GT/analysis/USERNAME/references/` |
| Script repository | `~/git/reference_files/` |
| GitLab | https://gitlab.bfabric.org/Genomics/reference_files |

ezRun Functions Reference

| Function | Purpose |
|---|---|
| `setwdNew(path)` | Create and change to working directory |
| `ezParam(list)` | Create parameter object with refBuild |
| `buildRefDir(ezRef, genomeFile, genesFile)` | Build core reference structure |
| `buildIgvGenome(ezRef)` | Create IGV genome file |
| `makeFeatAnnoEnsembl(featureFile, genomeFile, organism)` | Generate annotation tables |

Available References

Check existing references:

ls /srv/GT/reference/

Common organisms with 10x indices:

Homo_sapiens/GENCODE/GRCh38.p13
Mus_musculus/GENCODE/GRCm39
Gallus_gallus/Ensembl/bGalGal1.mat.broiler.GRCg7b
Serinus_canaria/Ensembl/SCA1
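
The list above can drift out of date, so it may be easier to ask the file system which builds carry a 10x index. The sketch below assumes indices sit in a `genes_10XGEX_Index` directory under `Annotation/Release_.../Genes/`, as in Step 8. The helper name `list_10x_refs` is hypothetical, and `-maxdepth` is a GNU/BSD `find` extension.

```shell
# Hypothetical helper: print Organism/Database/Build for every build under the
# given root that contains a genes_10XGEX_Index directory.
list_10x_refs() {
    root=${1%/}
    # Depth 7 = Organism/Database/Build/Annotation/Release/Genes/genes_10XGEX_Index;
    # -prune avoids descending into the (large) index directories.
    find "$root" -maxdepth 7 -type d -name genes_10XGEX_Index -prune |
    while IFS= read -r path; do
        rel=${path#"$root"/}
        echo "${rel%%/Annotation/*}"
    done | sort -u
}

# Possible use:
#   list_10x_refs /srv/GT/reference
```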
