From metabolomics-data-analysis
Comprehensive toolkit for curating untargeted metabolomics features from LC-MS data. Use when analyzing metabolomics feature tables from FGCZ or other platforms, reducing misannotation rates through automated QC metrics (CV%, blank ratios), duplicate resolution, and semi-automated flagging. Supports Excel/CSV inputs with ~100-300 features and generates interactive HTML reports following FGCZ standards.
Slash command: /metabolomics-data-analysis:metabolomics-curation
PYTHON_HTML_REPORT.md
assets/fgcz_curation_report_template.Rmd
references/curation_workflow.md
references/quality_thresholds.md
scripts/calculate_qc_metrics.py
scripts/generate_html_report.py
scripts/generate_html_report_v2.py
scripts/generate_report_with_check.py
scripts/parse_metabolomics_table.py
scripts/resolve_duplicates.py
Assist with curating metabolomics features from untargeted LC-MS data analysis. Untargeted metabolomics typically produces 100-300 putative metabolite features, of which 60-70% are low-quality (poor reproducibility, background noise, or duplicate annotations). This skill provides a semi-automated workflow that computes QC metrics (CV%, blank ratios), resolves duplicate annotations, and flags features for review.
The workflow reduces manual curation time from hours to minutes while improving consistency and reproducibility.
Invoke this skill when users request:
Typical file structure:
When a user requests metabolomics curation, follow this decision tree:
Is the input file a metabolomics feature table?
│
├─ YES → Does it have Area/Intensity columns for samples?
│ │
│ ├─ YES → Proceed with Step 1: Parse Metadata
│ │
│ └─ NO → Ask user to provide raw feature table (not processed/filtered)
│
└─ NO → Ask user for clarification or correct file type
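The second branch of the decision tree above can be sketched as a simple column check. This is a minimal illustration only: the function name and the keyword rule (column names containing "area" or "intensity") are assumptions, not the logic of the bundled parser.

```python
import pandas as pd

# Hypothetical keywords; real exports may label sample columns differently.
SAMPLE_KEYWORDS = ("area", "intensity")


def has_sample_area_columns(table: pd.DataFrame) -> bool:
    """Return True if any column name contains one of the sample keywords."""
    return any(
        keyword in str(column).lower()
        for column in table.columns
        for keyword in SAMPLE_KEYWORDS
    )


# Toy tables: a raw feature table and an already processed one.
raw = pd.DataFrame({"Name": ["Taurine"], "Area: Sample_01": [1.2e6]})
processed = pd.DataFrame({"Name": ["Taurine"], "Log2 Fold Change": [0.8]})

print(has_sample_area_columns(raw))        # True
print(has_sample_area_columns(processed))  # False
```

A table that fails this check corresponds to the "ask the user for the raw feature table" branch.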
Workflow Steps:
Goal: Load metabolomics feature table and auto-detect sample types.
Script: scripts/parse_metabolomics_table.py
What it does:
Usage:
# Basic usage (auto-detect all samples)
python scripts/parse_metabolomics_table.py input_features.xlsx --output metadata.json
# Interactive mode (review and adjust classifications)
python scripts/parse_metabolomics_table.py input_features.xlsx --output metadata.json --interactive
Expected output:
✅ Loaded 261 features × 141 columns
✅ Technical information detected:
Ionization: ESI-
Instrument: Orbitrap (Thermo Fisher)
Chromatography: HILIC (Hydrophilic Interaction)
MS Level: MS1, MS2
Dataset IDs: o38905, o39174
Fragmentation: Stepped NCE (HCD)
✅ Detected column structure:
identification: 10 columns
annotation: 15 columns
group_statistics: 16 columns
differential_analysis: 28 columns
area_columns: 33 columns
normalized_area_columns: 33 columns
✅ Sample classification:
Blank samples: 7
QC samples: 9
Standard samples: 5
Biological groups: 3
Groups: Starvation, Untreated, Washout
✅ Metadata saved to metadata.json
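To illustrate what a sample classification like the one reported above could look like, here is a minimal sketch that assigns a sample type from keywords in the column name. The patterns and the `classify_sample` helper are hypothetical; the rules used by scripts/parse_metabolomics_table.py may differ, which is why reviewing the result in interactive mode remains useful.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, checked in order; not the bundled script's rules.
PATTERNS = {
    "blank": re.compile(r"blank", re.IGNORECASE),
    "qc": re.compile(r"(^|[_\W])qc([_\W]|$)|pooled", re.IGNORECASE),
    "standard": re.compile(r"(^|[_\W])(std|standard)", re.IGNORECASE),
}


def classify_sample(column_name: str) -> str:
    """Return the first matching sample type, or 'biological' if none match."""
    for sample_type, pattern in PATTERNS.items():
        if pattern.search(column_name):
            return sample_type
    return "biological"


columns = [
    "Area: Blank_01",
    "Area: QC_pool_03",
    "Area: Std_mix_1",
    "Area: Untreated_2",
]
labels = {column: classify_sample(column) for column in columns}
counts = Counter(labels.values())
print(labels["Area: Untreated_2"])  # biological
print(counts["qc"])                 # 1
```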
User interaction (if needed):
Edit metadata.json if classifications are wrong
Tips:
Use --interactive mode for first-time analysis of a new dataset
Goal: Compute quality metrics and flag features that fail QC thresholds.
Script: scripts/calculate_qc_metrics.py
What it does:
flag_high_cv: CV% > 30% (poor reproducibility)
flag_low_signal: Biological/Blank ratio < 5× (low signal)
flag_for_review: Fails any criterion
FGCZ Quality Thresholds:
CV% threshold: 30% (adjust with --cv-threshold)
Blank ratio threshold: 5× (adjust with --blank-ratio)
Usage:
# Standard FGCZ thresholds
python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json --output qc_results.xlsx
# Save only flagged features (for quick review)
python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json --output flagged_features.xlsx --flagged-only
# Custom thresholds (e.g., exploratory study)
python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json --output qc_results.xlsx --cv-threshold 40 --blank-ratio 3
Expected output:
✅ Loaded 261 features
✅ Loaded metadata with 9 QC samples
Calculating CV% across 9 QC samples...
✅ Calculated 3 biological/blank ratios
============================================================
QC METRICS SUMMARY
============================================================
Total features: 261
Flagged for review: 172 (65.9%)
- High CV% (>30%): 44
- Low signal (<5× blank): 165
Passing QC: 89 (34.1%)
✅ Saved 261 features with QC metrics to qc_results.xlsx
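The two QC criteria summarized above (CV% across pooled QC samples and the biological-to-blank ratio) can be sketched in pandas as follows. This is an illustration under stated assumptions, not the code of scripts/calculate_qc_metrics.py: the `add_qc_flags` function, the use of group means for the ratio, and the sample standard deviation (`ddof=1`) are choices made for the example.

```python
import pandas as pd


def add_qc_flags(df, qc_cols, blank_cols, group_cols,
                 cv_threshold=30.0, blank_ratio=5.0):
    """Add CV%, blank ratios, and boolean QC flags to a copy of df."""
    out = df.copy()
    qc = out[qc_cols]
    # CV% = 100 * standard deviation / mean across the pooled QC injections
    out["Pooled_QC_CV_percent"] = 100 * qc.std(axis=1, ddof=1) / qc.mean(axis=1)
    blank_mean = out[blank_cols].mean(axis=1)
    # One ratio column per biological group (assumption: mean area / mean blank)
    ratios = pd.DataFrame({
        f"{group}_to_Blank_ratio": out[cols].mean(axis=1) / blank_mean
        for group, cols in group_cols.items()
    })
    out = pd.concat([out, ratios], axis=1)
    out["min_blank_ratio"] = ratios.min(axis=1)
    out["flag_high_cv"] = out["Pooled_QC_CV_percent"] > cv_threshold
    out["flag_low_signal"] = out["min_blank_ratio"] < blank_ratio
    out["flag_for_review"] = out["flag_high_cv"] | out["flag_low_signal"]
    return out


# Toy data: A passes, B has a high CV%, C is close to the blank level.
features = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "QC_1": [100, 100, 100], "QC_2": [110, 200, 105], "QC_3": [90, 50, 95],
    "Blank_1": [10, 10, 50], "Blank_2": [10, 10, 50],
    "Untreated_1": [200, 200, 100], "Untreated_2": [200, 200, 100],
})
flagged = add_qc_flags(
    features,
    qc_cols=["QC_1", "QC_2", "QC_3"],
    blank_cols=["Blank_1", "Blank_2"],
    group_cols={"Untreated": ["Untreated_1", "Untreated_2"]},
)
print(flagged["flag_for_review"].tolist())  # [False, True, True]
```

Taking the minimum ratio across groups means a feature is flagged as low signal if any biological group falls below the blank threshold, which matches the "Any ratio < threshold" description of flag_low_signal.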
Output file structure:
New columns added to original data:
Pooled_QC_CV_percent - CV% across pooled QC samples
[Group]_to_Blank_ratio - Ratio for each biological group (e.g., Untreated_to_Blank_ratio)
min_blank_ratio - Minimum ratio across all biological groups
flag_high_cv - Boolean: CV% > threshold
flag_low_signal - Boolean: Any ratio < threshold
flag_for_review - Boolean: Fails any criterion
flag_reason - Text description of why flagged
Tips:
See references/quality_thresholds.md
Goal: Identify features with duplicate compound names and recommend which to keep.
Script: scripts/resolve_duplicates.py
What it does:
KEEP (best): Highest quality score
ALTERNATIVE (close score): Within 10% of best (manual review)
REMOVE: All others
Usage:
# Standard duplicate resolution
python scripts/resolve_duplicates.py qc_results.xlsx metadata.json --output duplicates_resolved.xlsx
Expected output:
✅ Loaded 261 features with QC metrics
✅ Found 38 compounds with duplicates
Total duplicate features: 179
✅ Ranked 179 duplicate features
============================================================
DUPLICATE RESOLUTION SUMMARY
============================================================
Compounds with duplicates: 38
Total duplicate features: 179
Recommendations:
KEEP (best): 38
ALTERNATIVE (close score): 12
REMOVE: 129
Expected reduction: 129 features removed
Example resolutions (top 5 compounds):
D-(+)-Glucose: 9 entries
RT=5.73, CV=8.5%, Score=0.842 → KEEP (best)
RT=5.89, CV=12.3%, Score=0.798 → ALTERNATIVE (close score)
RT=4.21, CV=45.2%, Score=0.321 → REMOVE
RT=6.12, CV=38.9%, Score=0.298 → REMOVE
...
✅ Saved duplicate resolution to duplicates_resolved.xlsx
Output file structure:
All duplicate features with added columns:
composite_score - Weighted quality score (0-1, higher = better)
rank - Rank within duplicate group (1 = best)
recommendation - Action to take (KEEP, ALTERNATIVE, REMOVE)
Tips:
Detailed documentation: See references/curation_workflow.md for composite score formula and rationale
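Given a precomputed composite_score, the KEEP / ALTERNATIVE / REMOVE assignment can be sketched as below. The composite score formula itself is not reproduced here. The `recommend` function is hypothetical, and "within 10% of best" is interpreted as a score of at least 90% of the group's best score, which is an assumption that happens to agree with the glucose example above.

```python
import pandas as pd


def recommend(duplicates, name_col="Name", score_col="composite_score",
              close_fraction=0.10):
    """Rank features within each compound name and assign a recommendation."""
    out = duplicates.copy()
    grouped = out.groupby(name_col)[score_col]
    # Compute rank and best score per group before adding any columns
    ranks = grouped.rank(ascending=False, method="first").astype(int)
    best = grouped.transform("max")
    out["rank"] = ranks
    out["recommendation"] = "REMOVE"
    close = out[score_col] >= (1 - close_fraction) * best
    out.loc[close, "recommendation"] = "ALTERNATIVE (close score)"
    out.loc[out["rank"] == 1, "recommendation"] = "KEEP (best)"
    return out


# Scores taken from the example resolution shown above.
glucose = pd.DataFrame({
    "Name": ["D-(+)-Glucose"] * 4,
    "RT": [5.73, 5.89, 4.21, 6.12],
    "composite_score": [0.842, 0.798, 0.321, 0.298],
})
result = recommend(glucose)
print(result["recommendation"].tolist())
# ['KEEP (best)', 'ALTERNATIVE (close score)', 'REMOVE', 'REMOVE']
```

Ties for the best score are broken by row order here (`method="first"`), so exactly one feature per compound receives KEEP.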
Goal: Create HTML report with QC visualizations and interactive tables.
Template: assets/fgcz_curation_report_template.Rmd
What it does:
Usage:
# Option 1: Copy template to working directory and customize
cp assets/fgcz_curation_report_template.Rmd ./curation_report.Rmd
# Edit paths in curation_report.Rmd:
# qc_results_path <- "qc_results.xlsx"
# duplicates_path <- "duplicates_resolved.xlsx"
# Render report
module load Dev/R/4.5.0
Rscript -e "rmarkdown::render('curation_report.Rmd')"
# Output: curation_report.html
Expected output:
✅ Loaded 261 features with QC metrics
✅ Loaded 179 duplicate features
Rendering curation_report.Rmd...
Output created: curation_report.html
Report sections:
Tips:
Customization:
Goal: User reviews flagged features and duplicate recommendations, makes final curation decisions.
Process (semi-automated):
Review flagged features in interactive HTML table:
Review duplicate recommendations:
Document decisions:
User workflow:
# In R console or RStudio (after rendering report)
# Load QC results
library(readxl)
library(tidyverse)
qc_results <- read_excel("qc_results.xlsx")
duplicates_ranked <- read_excel("duplicates_resolved.xlsx")
# Example: Review flagged features
flagged <- qc_results |> filter(flag_for_review)
View(flagged)
# Example: Check which duplicates to remove
to_remove <- duplicates_ranked |> filter(recommendation == "REMOVE")
View(to_remove)
# Example: Generate final curated table
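# Duplicates share the same Name, so joining on Name alone would also drop the KEEP row.
# feature_key is an assumption: use the columns that uniquely identify a feature in your table.
feature_key <- c("Name", "RT [min]")
curated_features <- qc_results |>
  # Remove flagged features (or apply custom filter)
  filter(!flag_for_review) |>
  # Remove only the duplicate rows marked REMOVE (KEEP and ALTERNATIVE stay)
  anti_join(to_remove, by = feature_key)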
# Save curated table
write_csv(curated_features, "curated_features.csv")
cat(sprintf("✅ Saved %d curated features (%.1f%% reduction)\n",
nrow(curated_features),
100 * (1 - nrow(curated_features) / nrow(qc_results))))
Expected reduction:
Quality improvement metrics:
Goal: Generate final curated feature table for downstream analysis.
User workflow:
# Duplicates share the same Name, so match on a key that identifies a single feature row.
# feature_key is an assumption: adjust it to the columns that are unique per feature.
feature_key <- c("Name", "RT [min]")
to_remove <- duplicates_ranked |> filter(recommendation == "REMOVE")
# Option 1: Fully automated (remove all flagged features and duplicates marked REMOVE)
curated_features <- qc_results |>
  filter(!flag_for_review) |>
  anti_join(to_remove, by = feature_key)
write_csv(curated_features, "curated_features.csv")
# Option 2: Semi-automated (keep specific flagged features)
# Manually create list of features to keep despite flags
keep_despite_flags <- c("Glutathione", "Taurine", "Hypotaurine")
curated_features <- qc_results |>
  filter(!flag_for_review | Name %in% keep_despite_flags) |>
  anti_join(to_remove, by = feature_key)
write_csv(curated_features, "curated_features.csv")
# Option 3: Export separate tables for review
# Passing QC features
passing_qc <- qc_results |> filter(!flag_for_review)
write_csv(passing_qc, "features_passing_qc.csv")
# Flagged for manual review
flagged_features <- qc_results |> filter(flag_for_review)
write_csv(flagged_features, "features_flagged.csv")
# Best duplicate representatives
best_duplicates <- duplicates_ranked |> filter(recommendation == "KEEP (best)")
write_csv(best_duplicates, "best_duplicate_representatives.csv")
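For users working in Python rather than R, the same filtering can be sketched in pandas. This is a minimal example under assumptions: the `drop_removed_duplicates` helper is hypothetical, and the key columns (`Name` plus a retention-time column called `RT` here) stand in for whatever uniquely identifies a feature in the real table. Matching on name plus retention time ensures that only the rows marked REMOVE are dropped, not every feature sharing that compound name.

```python
import pandas as pd


def drop_removed_duplicates(qc_results, duplicates, key):
    """Anti-join: drop rows of qc_results whose key matches a REMOVE duplicate."""
    is_remove = duplicates["recommendation"] == "REMOVE"
    to_remove = duplicates.loc[is_remove, key].drop_duplicates()
    merged = qc_results.merge(to_remove, on=key, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy tables with hypothetical column names.
qc_results = pd.DataFrame({
    "Name": ["Glucose", "Glucose", "Taurine", "Noise"],
    "RT": [5.73, 4.21, 3.10, 9.99],
    "flag_for_review": [False, False, False, True],
})
duplicates = pd.DataFrame({
    "Name": ["Glucose", "Glucose"],
    "RT": [5.73, 4.21],
    "recommendation": ["KEEP (best)", "REMOVE"],
})

passing = qc_results[~qc_results["flag_for_review"]]
curated = drop_removed_duplicates(passing, duplicates, key=["Name", "RT"])
print(curated["Name"].tolist())  # ['Glucose', 'Taurine']
print(curated["RT"].tolist())    # [5.73, 3.1]
```

The `indicator=True` merge marks each row as `left_only` or `both`; keeping only `left_only` rows is the pandas counterpart of an anti-join.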
Final deliverables:
curated_features.csv - Final filtered feature table
curation_report.html - Interactive QC report with visualizations
metadata.json - Sample classifications
qc_results.xlsx - Full feature table with QC metrics
duplicates_resolved.xlsx - Ranked duplicates with recommendations
Share results:
g-req copynow curated_features.csv /srv/gstore/projects/pXXXXX/Analyses_Metabolomics/
Default FGCZ thresholds work for most untargeted metabolomics experiments, but can be adjusted for specific use cases.
When to adjust:
| Experiment Type | CV% Threshold | Blank Ratio | Rationale |
|---|---|---|---|
| Standard untargeted | 30% | 5× | Default FGCZ settings |
| Targeted metabolomics | 20% | 10× | Stricter (uses standards) |
| Exploratory/discovery | 40% | 3× | More lenient (hypothesis generation) |
| Plasma/serum | 30% | 5× | Standard |
| Urine | 35% | 4× | Higher variability |
| Tissue | 25% | 7× | More homogeneous |
| Plant extracts | 35% | 3× | Complex matrix |
How to adjust:
# Example: Exploratory study with relaxed thresholds
python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json \
--output qc_results.xlsx \
--cv-threshold 40 \
--blank-ratio 3
# Example: Targeted metabolomics with strict thresholds
python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json \
--output qc_results.xlsx \
--cv-threshold 20 \
--blank-ratio 10
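The threshold table above can also be kept as a small lookup that builds the command line for a given experiment type. This is only a convenience sketch: the preset keys and the `build_qc_command` helper are hypothetical names, while the script path and the `--cv-threshold` / `--blank-ratio` flags are the ones used in the examples above.

```python
# (CV% threshold, blank ratio) per experiment type, copied from the table above.
THRESHOLD_PRESETS = {
    "standard": (30, 5),
    "targeted": (20, 10),
    "exploratory": (40, 3),
    "plasma_serum": (30, 5),
    "urine": (35, 4),
    "tissue": (25, 7),
    "plant": (35, 3),
}


def build_qc_command(experiment_type,
                     features="input_features.xlsx",
                     metadata="metadata.json",
                     output="qc_results.xlsx"):
    """Return the argument list for calculate_qc_metrics.py with preset thresholds."""
    cv_threshold, blank_ratio = THRESHOLD_PRESETS[experiment_type]
    return [
        "python", "scripts/calculate_qc_metrics.py", features, metadata,
        "--output", output,
        "--cv-threshold", str(cv_threshold),
        "--blank-ratio", str(blank_ratio),
    ]


command = build_qc_command("urine")
print(" ".join(command))
# python scripts/calculate_qc_metrics.py input_features.xlsx metadata.json --output qc_results.xlsx --cv-threshold 35 --blank-ratio 4
```

The returned list can be passed to `subprocess.run(command, check=True)` if the script should be launched from Python.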
Detailed threshold guidelines: See references/quality_thresholds.md for:
Possible causes:
Solutions:
Possible causes:
Solutions:
Possible causes:
Solutions:
resolve_duplicates.py:
Possible causes:
Solutions:
devtools::install_github("uzh/ezRun")
tidyverse, readxl, ggplot2, DT, patchwork
This skill includes bundled resources to support the curation workflow:
Python scripts for automated QC calculations and duplicate resolution:
All scripts are executable standalone (no Claude context needed). Can be run via command line or SLURM batch jobs.
Dependencies:
pip install pandas numpy openpyxl
Detailed documentation for manual review and curation decisions:
Load these references when:
R Markdown template for generating interactive QC reports:
Copy this template to working directory and customize for specific datasets. Follows FGCZ reporting standards (ezRun integration, tabset structure, 300 DPI figures).
Always use interactive mode for new datasets
Review flagged features in context
Validate duplicate resolution
Preserve original data
Iterate and refine
This skill follows FGCZ standards and integrates with existing infrastructure:
Storage:
/srv/GT/analysis/pXXXXX/Metabolomics_Curation/
/srv/gstore/projects/pXXXXX/Analyses_Metabolomics/ (use g-req)
Reporting:
Workflow:
Example batch job:
#!/bin/bash
#SBATCH --job-name=metabolomics_curation
#SBATCH --output=curation_%j.log
#SBATCH --error=curation_%j.err
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=8G
#SBATCH --cpus-per-task=4
#SBATCH --partition=employee
module load Dev/R/4.5.0
# Run curation workflow
python scripts/parse_metabolomics_table.py input.xlsx --output metadata.json
python scripts/calculate_qc_metrics.py input.xlsx metadata.json --output qc_results.xlsx
python scripts/resolve_duplicates.py qc_results.xlsx metadata.json --output duplicates_resolved.xlsx
# Generate report
Rscript -e "rmarkdown::render('curation_report.Rmd')"
echo "Curation complete at $(date)"
Last Updated: 2025-01-10
npx claudepluginhub cpanse/skills --plugin metabolomics-data-analysis