From skymcp
Use when selecting a training framework, comparing NeMo vs Axolotl vs torchtune vs TRL vs DeepSpeed vs Megatron, choosing between FSDP and DeepSpeed, deciding how to fine-tune or pretrain a model, or configuring LoRA/QLoRA/full fine-tuning - the definitive framework selection guide for ML training at any scale
How this skill is triggered — by the user, by Claude, or both
Slash command
/skymcp:ml-training-frameworksThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
| Use Case | Framework | Why |
| Use Case | Framework | Why |
|---|---|---|
| Pretraining 100B+ params | NeMo 2.0 / Megatron-LM | Highest MFU, 5 parallelism axes (TP/PP/DP/EP/CP), FP8/FP4 |
| Pretraining 1B-70B params | torchtune / Axolotl + FSDP2 | Portable, good torch.compile support, no vendor lock |
| Fine-tuning (SFT) | Axolotl | YAML-driven, widest model support, LoRA/QLoRA/full |
| Post-training (DPO/GRPO/PPO) | TRL v0.28+ | Standard HF ecosystem, SFTTrainer/DPOTrainer/GRPOTrainer |
| Memory-constrained (huge model, small GPU) | DeepSpeed ZeRO-3 + CPU offload | Unbeatable memory savings for giant models on limited VRAM |
| Inference serving | vLLM | PagedAttention, 2-24x throughput vs naive, continuous batching |
| Research / custom architectures | torchtune | PyTorch-native, deep compile integration, clean recipe system |
| Multi-modal fine-tuning | Axolotl | Vision + language support, multipack, sample packing |
Use Axolotl:
# axolotl.yaml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_linear: true
dataset_prepared_path: last_run_prepared
datasets:
- path: dataset.jsonl
type: alpaca
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-4
bf16: auto
flash_attention: true
accelerate launch -m axolotl.cli.train axolotl.yaml
Use torchtune with FSDP2:
tune run --nproc_per_node 8 full_finetune_distributed \
--config recipes/llama3/7B_full.yaml \
model.compile=True \
training.enable_activation_checkpointing=True
Use TRL:
from trl import DPOTrainer, DPOConfig
config = DPOConfig(
model_name_or_path="meta-llama/Llama-3-8B-SFT",
learning_rate=5e-7,
beta=0.1,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
bf16=True,
)
trainer = DPOTrainer(config=config, train_dataset=dataset)
trainer.train()
Use NeMo 2.0:
from nemo.collections.llm import GPTModel, MegatronStrategy
from nemo.lightning import NeMoTrainer
strategy = MegatronStrategy(
tensor_model_parallel_size=8,
pipeline_model_parallel_size=4,
context_parallel_size=2,
expert_model_parallel_size=1,
sequence_parallel=True,
)
trainer = NeMoTrainer(strategy=strategy, max_steps=100000)
| Nodes | GPUs Total | Strategy |
|---|---|---|
| 1 | 1-2 | FSDP2 or DeepSpeed ZeRO-2 |
| 1 | 4-8 | FSDP2 with activation checkpointing |
| 2-8 | 16-64 | FSDP2 + gradient checkpointing |
| 8-32 | 64-256 | Megatron TP + FSDP DP |
| 32+ | 256+ | Megatron TP + PP + DP + CP |
Rule-of-thumb for model memory (bf16 training):
| Component | Per-Parameter Cost |
|---|---|
| Parameters | 2 bytes (bf16) |
| Gradients | 2 bytes (bf16) |
| Optimizer (AdamW) | 8 bytes (fp32 moments) |
| Activations | ~2 bytes (varies with seq length) |
| Total | ~14 bytes per parameter |
A 7B model requires ~98 GB for full fine-tuning. LoRA reduces this to ~16-24 GB (only adapter parameters need optimizer states).
| Method | Memory | Speed | Quality | When to Use |
|---|---|---|---|---|
| Full fine-tune | Highest (14B/param) | Slowest | Best | Unlimited compute budget |
| LoRA (r=32) | ~30% of full | 2-3x faster | 95-99% of full | Default for most tasks |
| QLoRA (4-bit + LoRA) | ~15% of full | 2x faster | 93-97% of full | Limited VRAM |
| QAT (quantize-aware) | ~40% of full | Slower | Best at inference | Deploying quantized |
| Feature | NeMo 2.0 | Axolotl | torchtune | TRL | DeepSpeed |
|---|---|---|---|---|---|
| LoRA | Yes | Yes | Yes | Yes | Yes |
| QLoRA | No | Yes | Yes | Yes | Yes |
| FSDP2 | No (Megatron) | Yes | Yes | Yes | No (own sharding) |
| torch.compile | Partial | Partial | Full | Partial | No |
| Flash Attention | Yes | Yes | Yes | Yes | Yes |
| Multi-node | Yes | Yes | Yes | Yes | Yes |
| Multimodal | Yes | Yes | Partial | Partial | Yes |
| RLHF/DPO | Yes | GRPO | DPO | Full suite | With TRL |
See references/framework-details.md for configuration deep dives. See references/axolotl-config.md for complete Axolotl YAML reference. See references/deepspeed-config.md for ZeRO configuration reference.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub slapglif/skymcp --plugin skymcp