From mint-lora-training-skills
Use when users need local CUDA GRPO fine-tuning for LLMs, including reward function design, TRL GRPOTrainer setup, LoRA training templates, dry runs, evaluation, training-curve interpretation, and troubleshooting.
How this skill is triggered — by the user, by Claude, or both
Slash command
/mint-lora-training-skills:grpo-finetuneThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this skill when the user wants to fine-tune an LLM using GRPO (Group Relative Policy Optimization) to learn a specific output format or improve task performance. This skill is a complete, self-contained pipeline that an agent can follow step-by-step from environment check to final evaluation.
Use this skill when the user wants to fine-tune an LLM using GRPO (Group Relative Policy Optimization) to learn a specific output format or improve task performance. This skill is a complete, self-contained pipeline that an agent can follow step-by-step from environment check to final evaluation.
HF_ENDPOINT)Don't use for: SFT (use SFTTrainer), DPO (use DPOTrainer), PPO (use PPOTrainer).
Before writing any code, confirm these with the user (use defaults if they don't care):
| Parameter | Example | Default |
|---|---|---|
| Model name | Qwen/Qwen2.5-0.5B-Instruct | Qwen/Qwen2.5-0.5B-Instruct |
| Dataset | gsm8k (HF hub name) | (must specify) |
| Output format | XML with ... and ... | XML with think + answer tags |
| Answer extraction | How to get the ground truth answer | text.split("####")[-1].strip() for GSM8K |
| Project directory | /data/my-grpo-project | /data/grpo-finetune |
| Training steps | 500 | 500 |
| GPU VRAM available | ~5GB free | Auto-detect |
If user only specifies dataset, use all defaults.
Run these checks first. Stop and fix any failures before proceeding.
STOP if macOS. GRPO training requires NVIDIA CUDA GPU. macOS (including Apple Silicon MPS) is NOT supported. Check OS first:
# Check OS — must be Linux
uname -s
# Expected: Linux
# If "Darwin" (macOS): STOP and tell user "GRPO training requires NVIDIA GPU with CUDA. macOS is not supported. Use a Linux server with NVIDIA GPU."
# Check GPU
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, Free: {(torch.cuda.mem_get_info()[0]/1e9):.1f}GB')"
# Check packages
python3 -c "import trl; print(f'TRL: {trl.__version__}')"
python3 -c "import transformers; print(f'Transformers: {transformers.__version__}')"
python3 -c "import peft; print(f'PEFT: {peft.__version__}')"
# Verify GRPO import (may take 30-120s due to lazy loading)
python3 -c "
import os; os.environ['VLLM_AVAILABLE']='0'
from trl import GRPOTrainer, GRPOConfig
print('GRPO OK')
"
# Check GRPOConfig API (version-sensitive!)
python3 -c "
import os; os.environ['VLLM_AVAILABLE']='0'
from trl import GRPOConfig
import inspect
sig = inspect.signature(GRPOConfig.__init__)
params = [n for n in sig.parameters if n not in ('self','kwargs')]
print('Available params:', params)
" 2>&1 | grep -v 'UserWarning\|deprecated\|Modular'
If TRL import hangs: vLLM version mismatch. Set VLLM_AVAILABLE=0 and retry with 120s timeout.
If GRPOConfig missing params: See references/troubleshooting.md for version-specific fixes.
mkdir -p /data/grpo-finetune
cd /data/grpo-finetune
Create these files in order (or copy from templates/):
rewards.py — Reward functions (format + correctness)train_grpo.py — Training scriptevaluate.py — Evaluation scriptCreate rewards.py. This is the most critical file — it defines what the model learns.
Use two reward functions passed to GRPOTrainer:
Copy templates/rewards.py and customize the three marked sections:
Section A — Output format tags: Change ... / ... / <answer> / </answer> to your desired format.
Section B — Answer extraction: Change extract_answer() to parse your format. Default extracts from <answer>...</answer>.
Section C — Answer normalization: Change normalize_answer() for your comparison logic. Default handles numbers (commas, $, %, fractions).
# Rule 1: First param is always 'completions: list[str]'
# Rule 2: Other params = dataset column names (TRL passes them as kwargs)
# Rule 3: Return list[float], one per completion
# Rule 4: **kwargs catches unused columns
def format_reward(completions: list[str], **kwargs) -> list[float]:
"""Returns 0.0–1.0 based on output structure compliance."""
...
def correctness_reward(completions: list[str], answer: list[str], **kwargs) -> list[float]:
"""Returns 0.0 or 1.0 based on answer accuracy.
'answer' param name MUST match dataset column name."""
...
python3 rewards.py
# Should show:
# Good XML: 1.00, No XML: 0.00, Correct: 1.0, Incorrect: 0.0
| Check | Score | Purpose |
|---|---|---|
| Opening + closing think tags | 0.15 | Tags exist |
| Think content non-empty | 0.15 | Has reasoning |
| Opening + closing answer tags | 0.20 | Tags exist |
| Answer content non-empty | 0.20 | Has final answer |
| Think before answer (order) | 0.15 | Correct structure |
| No extra text outside tags | 0.15 | Clean output |
| Total | 1.00 |
Create train_grpo.py. Copy from templates/train_grpo.py and customize:
A. Dataset loading — Change load_dataset("gsm8k", "main") to your dataset. Write a format_example() that:
prompt column using the model's chat templateanswer)B. System prompt — Change to describe your task and output format.
C. LoRA target modules — For Qwen/Llama models: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
D. Reward functions list — Change reward_funcs=[format_reward, correctness_reward] to your functions.
Before training, check free VRAM and select config:
| Free VRAM | batch | gen | accum | max_len | Notes |
|---|---|---|---|---|---|
| 4-8 GB | 1 | 2 | 1 | 256 | 0.5B model only |
| 8-20 GB | 2 | 4 | 2 | 384 | 0.5B-1.5B model |
| 20-40 GB | 4 | 8 | 4 | 512 | 1.5B-3B model |
| 40+ GB | 4 | 8 | 4 | 512 | 7B+ model with LoRA |
Always use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
Run 2 steps to verify the entire pipeline works before committing hours to training:
cd /data/grpo-finetune
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
HF_ENDPOINT=https://hf-mirror.com \
python3 train_grpo.py \
--max_steps 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--num_generations 4 \
--max_completion_length 256 \
--eval_strategy no \
--report_to none \
2>&1 | tee dry_run.log
Expected output:
100%|██████████| 2/2
{'rewards/format_reward/mean': '0.4...', 'rewards/correctness_reward/mean': '0.0', 'kl': '0'}
If it fails: See references/troubleshooting.md. Common first-run errors:
got an unexpected keyword argument 'max_prompt_length' → Remove it from GRPOConfiggeneration_batch_size must be divisible by num_generations → Set generation_batch_size=num_generationsreport_to=None is not supported → Use report_to="none" (string)CUDA out of memory → Reduce batch/gen/max_completion_lengthVLLM_AVAILABLE=0, wait 120sDo not proceed to Step 6 until dry-run completes with exit code 0.
cd /data/grpo-finetune
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
HF_ENDPOINT=https://hf-mirror.com \
python3 train_grpo.py \
--max_steps 500 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--num_generations 4 \
--max_completion_length 384 \
--learning_rate 5e-6 \
--beta 0.04 \
--temperature 0.9 \
--logging_steps 10 \
--save_steps 100 \
--eval_strategy no \
--output_dir ./outputs/run_1 \
--report_to none \
2>&1 | tee training.log
Run in background if the agent supports it. Monitor progress:
# Check current step
grep -oP "\d+%\|[^|]+\| \d+/\d+" training.log | tail -1
# Check latest metrics
grep "'reward'" training.log | tail -1
| Metric | Healthy | Unhealthy |
|---|---|---|
rewards/format_reward/mean | Rising 0.4→0.9+ | Stuck below 0.3 after 100 steps |
rewards/correctness_reward/mean | Gradual upward trend | Always 0 after 200 steps |
kl | Under 0.05 | Spiking above 0.1 |
entropy | Gradual decrease | Rapid drop to 0 (mode collapse) |
loss | Near zero | Spiking or NaN |
| Model | Steps | A100 | 4090 | V100 |
|---|---|---|---|---|
| 0.5B | 500 | ~1.5h | ~2h | ~4h |
| 1.5B | 500 | ~3h | ~5h | ~8h |
| 3B | 500 | ~6h | ~10h | N/A |
Create evaluate.py from templates/evaluate.py. Run after training completes:
python3 evaluate.py \
--model_path ./outputs/run_1/final_model \
--num_samples 200 \
--temperature 0.6 \
--output_file eval_results.json \
2>&1 | tee eval.log
| Metric | How to Calculate |
|---|---|
| Format Score | Average of format_reward() over test set |
| Perfect Format % | Fraction with format_score >= 0.99 |
| Accuracy | Fraction of correct answers |
| Sample outputs | Show 3-5 examples (correct and incorrect) |
beta (KL penalty).beta, increase temperature, or add entropy bonus.| File | Purpose |
|---|---|
templates/rewards.py | Reward function template (format + correctness) with self-test |
templates/train_grpo.py | Complete training script with LoRA, chat template, configurable args |
templates/evaluate.py | Evaluation script: load model, generate, score, save JSON results |
references/troubleshooting.md | All known errors and fixes organized by training phase |
references/training-curves.md | Real training data for expected behavior reference |
references/reward-patterns.md | Reward function design patterns and edge cases |
GRPOConfig(
output_dir="./outputs/run_1",
# --- Training ---
max_steps=500,
per_device_train_batch_size=2,
gradient_accumulation_steps=2,
learning_rate=5e-6,
lr_scheduler_type="cosine",
warmup_steps=10, # NOT warmup_ratio (deprecated)
bf16=True,
gradient_checkpointing=True,
# --- GRPO ---
num_generations=4, # Completions per prompt
num_generations_eval=1, # 1 for fast eval
generation_batch_size=4, # MUST == num_generations
max_completion_length=384,
beta=0.04, # KL penalty (higher = more conservative)
temperature=0.9, # Sampling temperature
use_vllm=False, # Avoid version conflicts
scale_rewards="group", # Normalize within group
loss_type="grpo",
generation_kwargs={"do_sample": True, "top_p": 0.95},
# --- Logging ---
logging_steps=10,
report_to="none", # STRING, not None
save_steps=100,
save_total_limit=3,
# --- Dataset ---
remove_unused_columns=False,
dataloader_num_workers=0,
)
trainer = GRPOTrainer(
model=model, # Pre-loaded model (NOT model name string if using peft_config)
reward_funcs=[fn1, fn2], # List of reward callables
args=GRPOConfig(...),
train_dataset=train_ds, # Must have 'prompt' column
eval_dataset=test_ds,
processing_class=tokenizer, # NOT 'tokenizer='
peft_config=lora_config, # Pass LoRA config directly, don't wrap model
)
After completing all steps, verify:
rewards.py self-test passes with correct scoresfinal_model/ directory exists with model weightsGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub taiyi-ai-lab/ai-workflow-skills --plugin grpo-finetune-skills