From sciagent-skills
Counts RNA-seq reads overlapping GTF gene features from sorted STAR BAMs, outputting per-gene count matrices across samples. Handles strandedness, paired-end, multi-sample batching, and assignment stats.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sciagent-skills:featurecounts-rna-countingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
featureCounts (part of the Subread package) assigns sequencing reads in BAM files to genomic features defined in a GTF/GFF annotation. It counts how many reads overlap each gene (or exon, intron, or custom feature), producing a gene × sample count matrix suitable for differential expression analysis with DESeq2 or edgeR. featureCounts processes multiple BAM files in a single command, reporting ...
featureCounts (part of the Subread package) assigns sequencing reads in BAM files to genomic features defined in a GTF/GFF annotation. It counts how many reads overlap each gene (or exon, intron, or custom feature), producing a gene × sample count matrix suitable for differential expression analysis with DESeq2 or edgeR. featureCounts processes multiple BAM files in a single command, reporting read assignment statistics (assigned, unassigned by category) alongside the count matrix. It is the standard counting step after STAR alignment in RNA-seq pipelines.
featureCounts)Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v featureCountsfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run featureCountsrather than barefeatureCounts.
# Install with conda (recommended)
conda install -c bioconda subread
# Verify
featureCounts -v
# featureCounts v2.0.6
# Alternative: install via apt (Ubuntu/Debian)
sudo apt-get install subread
# Count reads for multiple samples (unstranded paired-end RNA-seq)
featureCounts \
-a gencode.v47.annotation.gtf \
-o counts/gene_counts.txt \
-T 8 \
-p --countReadPairs \
results/sample1/Aligned.sortedByCoord.out.bam \
results/sample2/Aligned.sortedByCoord.out.bam
echo "Count matrix: counts/gene_counts.txt"
head -3 counts/gene_counts.txt
Ensure BAM files are sorted and indexed, and the GTF matches the genome assembly.
# Verify BAM files are sorted
samtools view -H results/sample1/Aligned.sortedByCoord.out.bam | grep "SO:"
# Expected: SO:coordinate
# List BAMs to count
ls results/*/Aligned.sortedByCoord.out.bam | head -5
# Download GENCODE GTF (same version used for STAR indexing)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz
gunzip gencode.v47.primary_assembly.annotation.gtf.gz
echo "GTF lines: $(wc -l < gencode.v47.primary_assembly.annotation.gtf)"
Test strandedness using a small read count to set the -s parameter correctly.
# Quick strandedness check: count 1 sample with all 3 modes
# Compare assigned rates: highest = correct mode
for strand in 0 1 2; do
echo "=== Strandedness -s $strand ==="
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o /tmp/test_s${strand}.txt \
-T 4 \
-p --countReadPairs \
-s $strand \
results/sample1/Aligned.sortedByCoord.out.bam 2>&1 \
| grep "Successfully assigned"
done
# Rule: 0=unstranded if similar rates; 1 or 2 if one is much higher
Standard configuration for unstranded libraries (most polyA-selected RNA-seq).
mkdir -p counts
# Multi-sample batch counting: pass all BAMs as positional arguments
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o counts/gene_counts.txt \
-T 8 \
-p \
--countReadPairs \
-s 0 \
-t exon \
-g gene_id \
results/ctrl_1/Aligned.sortedByCoord.out.bam \
results/ctrl_2/Aligned.sortedByCoord.out.bam \
results/treat_1/Aligned.sortedByCoord.out.bam \
results/treat_2/Aligned.sortedByCoord.out.bam
echo "Count matrix: $(wc -l < counts/gene_counts.txt) genes"
# Also generates: counts/gene_counts.txt.summary (assignment stats)
cat counts/gene_counts.txt.summary
For strand-specific libraries (TruSeq Stranded, QuantSeq), set the correct strandedness.
# Reverse-stranded library (most TruSeq Stranded protocols): -s 2
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o counts/gene_counts_stranded.txt \
-T 8 \
-p --countReadPairs \
-s 2 \
results/*/Aligned.sortedByCoord.out.bam
# Forward-stranded (e.g., Lexogen QuantSeq, Takara SMARTer): -s 1
# featureCounts ... -s 1 ...
echo "Stranded count complete."
head -2 counts/gene_counts_stranded.txt
Parse the featureCounts output file and prepare for differential expression.
import pandas as pd
# featureCounts output has 6 metadata columns before count columns
counts_raw = pd.read_csv("counts/gene_counts.txt", sep="\t", comment="#")
print(f"Columns: {list(counts_raw.columns)}")
# Metadata columns: Geneid, Chr, Start, End, Strand, Length
# Count columns start at index 6
count_cols = counts_raw.columns[6:] # BAM file paths as column names
counts = counts_raw.set_index("Geneid")[count_cols].copy()
# Rename columns to sample names (strip path and file extension)
import re
counts.columns = [re.sub(r".*/|Aligned\.sortedByCoord\.out\.bam", "", col)
for col in counts.columns]
print(f"Count matrix shape: {counts.shape}") # (genes × samples)
print(f"Samples: {list(counts.columns)}")
print(f"Genes with counts > 0: {(counts.sum(axis=1) > 0).sum()}")
counts.to_csv("gene_count_matrix.tsv", sep="\t")
print("Saved: gene_count_matrix.tsv")
Use the count matrix directly in pydeseq2 for differential expression.
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# Load count matrix (genes × samples)
counts = pd.read_csv("gene_count_matrix.tsv", sep="\t", index_col=0).T
print(f"Count matrix: {counts.shape} (samples × genes)")
# Sample metadata
metadata = pd.DataFrame({
"condition": ["control", "control", "treated", "treated"]
}, index=counts.index)
# Filter low-count genes (recommended before DESeq2)
counts_filtered = counts.loc[:, counts.sum() > 10]
print(f"Genes after low-count filter: {counts_filtered.shape[1]}")
# Run DESeq2
dds = DeseqDataSet(counts=counts_filtered, metadata=metadata,
design_factors="condition",
inference=DefaultInference(n_cpus=8))
dds.deseq2()
stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"],
inference=DefaultInference())
stat_res.summary()
results = stat_res.results_df
sig = results[results["padj"] < 0.05]
print(f"DE genes (padj < 0.05): {len(sig)}")
print(sig.sort_values("log2FoldChange", ascending=False).head())
| Parameter | Default | Range/Options | Effect |
|---|---|---|---|
-a | required | GTF/GFF3 path | Annotation file; must match genome assembly used for alignment |
-o | required | file path | Output count table path (also creates <output>.summary) |
-T | 1 | 1–64 | CPU threads; 8–16 is typical |
-s | 0 | 0 (unstranded), 1 (stranded), 2 (reverse-stranded) | Library strandedness; wrong value causes major undercounting |
-p | off | flag | Paired-end mode; reads counted as fragments not individual reads |
--countReadPairs | off | flag | For PE: count pairs not reads (use with -p) |
-t | exon | feature type string | Feature type to count from GTF column 3 |
-g | gene_id | attribute string | GTF attribute to group features (use gene_id for genes) |
--minOverlap | 1 | 1–100 | Minimum bases a read must overlap a feature to be counted |
--fracOverlap | 0 | 0–1 | Fraction of read that must overlap; 0.2 for stricter counting |
-O | off | flag | Allow reads to be assigned to multiple overlapping features |
-M | off | flag | Count multi-mapping reads (default: only uniquely mapped) |
import subprocess
import re
from pathlib import Path
def run_featurecounts(bam_files: list, gtf: str, outfile: str,
threads: int = 8, strandedness: int = 0,
paired_end: bool = True) -> dict:
"""Run featureCounts and return assignment statistics."""
cmd = [
"featureCounts",
"-a", gtf,
"-o", outfile,
"-T", str(threads),
"-s", str(strandedness),
"-t", "exon",
"-g", "gene_id",
]
if paired_end:
cmd += ["-p", "--countReadPairs"]
cmd += bam_files
result = subprocess.run(cmd, capture_output=True, text=True)
# Parse summary from stderr
stats = {}
for line in result.stderr.splitlines():
if "Assigned" in line:
stats["assigned_pct"] = float(re.search(r"(\d+\.\d+)%", line).group(1))
return stats
bams = list(Path("results").glob("*/Aligned.sortedByCoord.out.bam"))
bam_list = [str(b) for b in sorted(bams)]
stats = run_featurecounts(bam_list, "gencode.v47.primary_assembly.annotation.gtf",
"counts/gene_counts.txt")
print(f"Assigned reads: {stats.get('assigned_pct', 'N/A')}%")
# Snakefile — featureCounts rule after STAR alignment
configfile: "config.yaml"
SAMPLES = config["samples"]
rule featurecounts:
input:
bams = expand("results/{sample}/Aligned.sortedByCoord.out.bam", sample=SAMPLES),
gtf = config["gtf"]
output:
counts = "counts/gene_counts.txt",
summary = "counts/gene_counts.txt.summary"
params:
strandedness = config.get("strandedness", 0)
threads: 8
shell:
"""
featureCounts \
-a {input.gtf} \
-o {output.counts} \
-T {threads} \
-p --countReadPairs \
-s {params.strandedness} \
-t exon -g gene_id \
{input.bams}
"""
| Output | Format | Description |
|---|---|---|
gene_counts.txt | TSV | Count matrix: gene metadata + one count column per BAM |
gene_counts.txt.summary | TSV | Read assignment statistics per sample (Assigned, Unassigned_*) |
| stderr log | Text | Per-sample assignment percentages and warnings |
| Problem | Cause | Solution |
|---|---|---|
| Very low assigned rate (< 40%) | Wrong strandedness -s value | Test all 3 -s modes; match to library prep protocol |
| GTF not matching genome | Different assembly or annotation version | Verify genome + GTF are same version (e.g., both GRCh38/GENCODE v47) |
Error: Failed to open the annotation file | GTF file path wrong or compressed | Decompress GTF; use absolute path |
| Count matrix has 0 for all genes | Wrong -t feature type | Check GTF column 3 with awk '{print $3}' file.gtf | sort -u | head |
| Multi-mapping reads not counted | -M not set | Add -M to count multi-mappers; may inflate counts for repetitive regions |
| Paired-end reads counted as single | -p flag missing | Add -p --countReadPairs for paired-end BAMs |
| Very slow on large BAM files | Low thread count | Increase -T to 8–16; ensure BAMs are sorted by coordinate |
gene_id attribute missing | GFF3 file uses different attribute | Use -g ID for GFF3; check attributes with grep -v "^#" file.gff3 | head -5 |
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsAligns RNA-seq reads to genome using splice-aware STAR aligner. Builds index, runs two-pass mode for junctions. Outputs sorted BAM, SJ tables, stats, gene counts for IGV, variant calling, ENCODE.
Wraps nf-core/rnaseq bulk RNA-seq preprocessing from FASTQ or BAM inputs with preflight checks, reproducibility tracking, and downstream handoff to differential expression skills.
Differential gene expression analysis using PyDESeq2 (Python DESeq2). Identify DE genes from bulk RNA-seq counts with Wald tests, FDR correction, and volcano/MA plots.