Routes user intent to the correct MindSpeed-MM skill based on model type (VLM, generative, omni, audio). Provides a pipeline overview for multimodal training on Huawei Ascend NPU.
Slash command: /external-gitcode-ascend-skills:mindspeed-mm-pipeline
This Skill is the **routing entry point** for all MindSpeed-MM Skills. It determines the model type based on user intent, routes to the corresponding Skill, and provides a complete pipeline overview.
User Intent → Model Type Detection → Target Skill
"Train understanding model / VLM" → mindspeed-mm-vlm
"Train generative model / video / image" → mindspeed-mm-generative
"Train omni model" → See examples/qwen2.5omni/README.md
"Train speech / TTS model" → See examples/whisper/ or examples/cosyvoice3/README.md
Routing Criteria:
| Keywords | Model Type | Target |
|---|---|---|
| VLM, vision-language, image-text understanding, OCR, Qwen2VL, InternVL, GLM4V | Understanding (VLM) | mindspeed-mm-vlm |
| Video generation, image generation, t2v, t2i, i2v, Wan, CogVideoX, FLUX | Generative | mindspeed-mm-generative |
| Omni, speech + vision + text | Omni | examples/qwen2.5omni/ |
| Speech recognition, TTS, ASR, Whisper, CosyVoice | Audio | examples/whisper/ or examples/cosyvoice3/ |
| DPO, GRPO, preference alignment, reinforcement learning | Post-training | See Post-training section |
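
The routing table above can be read as a keyword lookup. The following is a minimal sketch, not how the Skill itself is implemented: the `ROUTES` structure and the `route` function are hypothetical names, whole-word matching is an assumption, and all matching targets are returned because the table does not define a priority between rows.

```python
import re

# Keywords and targets copied from the routing table above.
# The data structure and the matching rule are illustrative assumptions.
ROUTES = [
    ("mindspeed-mm-vlm",
     ["vlm", "vision-language", "image-text understanding", "ocr",
      "qwen2vl", "internvl", "glm4v"]),
    ("mindspeed-mm-generative",
     ["video generation", "image generation", "t2v", "t2i", "i2v",
      "wan", "cogvideox", "flux"]),
    ("examples/qwen2.5omni/", ["omni"]),
    ("examples/whisper/ or examples/cosyvoice3/",
     ["speech recognition", "tts", "asr", "whisper", "cosyvoice"]),
    ("See Post-training section",
     ["dpo", "grpo", "preference alignment", "reinforcement learning"]),
]


def route(intent: str) -> list[str]:
    """Return every target whose keywords appear as whole words in the intent."""
    text = intent.lower()
    hits = []
    for target, keywords in ROUTES:
        for kw in keywords:
            # Reject matches embedded in a longer word (e.g. "wan" in "want").
            pattern = r"(?<![a-z0-9])" + re.escape(kw) + r"(?![a-z0-9])"
            if re.search(pattern, text):
                hits.append(target)
                break
    return hits
```

A request such as "DPO on FLUX" matches both the generative row and the post-training row, which is why the sketch returns a list rather than a single target.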
1. Environment Setup (mindspeed-mm-env-setup)
→ 2. Model Dependency Installation (mindspeed-mm-vlm Step 0)
→ 3. Weight Download + HF→MM Conversion (mindspeed-mm-weight-prep)
→ 4. Data Preprocessing (MLLM JSON)
→ 5. Training (pretrain_vlm.py)
→ 6. Inference Validation (inference_vlm.py)
→ 7. Evaluation (evaluate_vlm.py)
→ 8. Weight Export MM→HF (optional)
Inter-Stage Data Flow:
model_from_hf/Qwen2.5-VL-7B-Instruct/ ← Step 3 download
↓ mm-convert hf_to_mm
ckpt/mm_path/Qwen2.5-VL-7B-Instruct/ ← Step 3 output
↓ Used as the load path in model.json
↓
dataset/train.json + images/ ← Step 4 input (MLLM JSON format)
↓ Used directly, no binary preprocessing needed
↓
saved_ckpt/ ← Step 5 output
↓ mm-convert mm_to_hf (optional)
model_from_hf/.../converted/ ← Step 8 output
1. Environment Setup (mindspeed-mm-env-setup)
→ 2. Model Dependency Installation (mindspeed-mm-generative Step 0)
→ 3. Weight Download + HF→MM Conversion (mindspeed-mm-weight-prep)
→ 4. Data Preprocessing (video/image + caption JSON)
→ 5. Feature Extraction (VAE + TextEncoder) ← VLM does not have this step
→ 6. Training (pretrain_sora.py)
→ 7. Inference Generation (inference_sora.py)
→ 8. Weight Export MM→HF (optional)
Inter-Stage Data Flow:
weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/ ← Step 3 download
↓ mm-convert WanConverter hf_to_mm
weights/.../transformer/ ← Step 3 output (in-place conversion)
↓
dataset/videos/ + dataset/train.json ← Step 4 input
↓ Feature extraction script
dataset/features/ ← Step 5 output (VAE latents + text embeddings)
↓ Used as training data input
↓
saved_ckpt/ ← Step 6 output
↓ mm-convert WanConverter mm_to_hf (optional)
converted_weights/ ← Step 8 output
Key difference between VLM and generative models: Generative models require an additional feature extraction step before training (VAE encodes video/images into latents, TextEncoder encodes text into embeddings). VLM does not have this step.
| Model | Specs | Entry Script | Status |
|---|---|---|---|
| Qwen2VL | 2B/7B/72B | pretrain_vlm.py | Released |
| Qwen2.5VL | 3B/7B/32B/72B | pretrain_vlm.py | Released |
| Qwen3VL | 8B/30B/235B | pretrain_transformers.py | Released |
| InternVL2.5 | 4B/78B | pretrain_internvl.py | Released |
| InternVL3 | 8B/78B | pretrain_vlm.py | Released |
| InternVL3.5 | 30B | pretrain_transformers.py | Released |
| GLM4.1V | 9B | pretrain_vlm.py | Released |
| GLM4.5V | -- | pretrain_transformers.py | Prototype |
| DeepSeekVL2 | -- | pretrain_deepseekvl.py | Released |
| DeepSeekOCR | -- | finetune_ocr.py (custom) | Prototype |
| DeepSeekOCR2 | -- | finetune_ocr2.py (custom) | Prototype |
| JanusPro | -- | -- | -- |
| Ming | -- | finetune_vl.py (custom) | -- |
| Bagel | -- | pretrain_omni.py | -- |
| Model | Subtask | Entry Script | Status |
|---|---|---|---|
| Wan2.1 | t2v/i2v/v2v/flf2v | pretrain_sora.py | Released |
| Wan2.2 | t2v/i2v | pretrain_sora.py | Released |
| HunyuanVideo | t2v | pretrain_sora.py | Prototype |
| HunyuanVideo 1.5 | t2v | pretrain_sora.py | Prototype |
| CogVideoX | t2v | pretrain_sora.py | Released |
| FLUX | t2i | train_dreambooth_flux.py (diffusers) | Prototype |
| OpenSoraPlan 1.3 | t2v | pretrain_sora.py | Released |
| OpenSoraPlan 1.5 | t2v | pretrain_sora.py | Released |
| StepVideo | t2v | pretrain_sora.py | Prototype |
| LTX2 | t2v | mindspeed_mm/fsdp/train/trainer.py | -- |
| Lumina-mGPT | -- | pretrain_lumina.py | Released |
| Model | Entry Script | Status |
|---|---|---|
| Qwen2.5Omni | pretrain_vlm.py | Released |
| Qwen3Omni | pretrain_transformers.py | Released |
| Model | Entry Script | Status |
|---|---|---|
| Whisper | pretrain_whisper.py | -- |
| CosyVoice3 | mindspeed_mm/fsdp/tasks/cosyvoice3/train.py | -- |
| Qwen3TTS | mindspeed_mm/fsdp/train/trainer.py | -- |
| FunASR | mindspeed_mm/fsdp/tasks/funasr/trainer.py | -- |
| Task | Script | Applicable Models |
|---|---|---|
| DPO | posttrain_qwen2vl_dpo.py | Qwen2VL |
| DPO | posttrain_sora_dpo.py | Wan, Sora-like |
| GRPO | posttrain_flux_dancegrpo.py | FLUX |
| GRPO (verl) | verl_plugin/ | Qwen2.5VL |
MindSpeed-MM has three entry script patterns:
1. pretrain_vlm.py (VLM), pretrain_sora.py (generative): most models use these
2. pretrain_internvl.py, pretrain_deepseekvl.py, pretrain_whisper.py, pretrain_lumina.py: dedicated scripts for specific models
3. pretrain_transformers.py or mindspeed_mm/fsdp/train/trainer.py or mindspeed_mm/fsdp/tasks/<model>/train.py: newer models (Qwen3VL, Qwen3Omni, LTX2, CosyVoice3, Qwen3TTS, FunASR)

Always check the actual shell script in examples/<model_name>/; do not assume from the model name.
New models should use the unified entry. Legacy models still use model-specific entries and are being migrated gradually.
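
Checking the launch script can be partly automated. The sketch below scans the text of a shell script for `.py` paths and assigns each one to one of the three patterns described above. The function names and the category labels (`shared`, `model-specific`, `transformers/fsdp`) are hypothetical, and the regular expression is a heuristic that assumes the entry script appears literally in the launch command.

```python
import re

# Script names taken from the three entry-script patterns described above.
SHARED = {"pretrain_vlm.py", "pretrain_sora.py"}
MODEL_SPECIFIC = {
    "pretrain_internvl.py",
    "pretrain_deepseekvl.py",
    "pretrain_whisper.py",
    "pretrain_lumina.py",
}


def find_entry_scripts(shell_text: str) -> list[str]:
    """Return the .py paths mentioned in a launch script, in order of first appearance."""
    found = []
    for path in re.findall(r"[\w./-]+\.py\b", shell_text):
        if path not in found:
            found.append(path)
    return found


def classify_entry(path: str) -> str:
    """Map an entry script path to one of the three patterns (labels are illustrative)."""
    name = path.rsplit("/", 1)[-1]
    if name in SHARED:
        return "shared"
    if name in MODEL_SPECIFIC:
        return "model-specific"
    if name == "pretrain_transformers.py" or path.startswith("mindspeed_mm/fsdp/"):
        return "transformers/fsdp"
    return "unknown"
```

For example, `find_entry_scripts(open(script_path).read())` on a launch script under examples/<model_name>/ lists the candidates, and anything classified as `unknown` should be inspected by hand.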
The following parameters apply to all model types. For full parameter descriptions, see references/common-args.md.
| Parameter | Description | Typical Values |
|---|---|---|
| --tensor-model-parallel-size | Tensor parallelism degree (TP) | 1/2/4/8 |
| --pipeline-model-parallel-size | Pipeline parallelism degree (PP) | 1/2/4/8 |
| --context-parallel-size | Context parallelism degree (CP) | 1/2 |
| --expert-model-parallel-size | Expert parallelism degree (EP, for MoE models) | 1/2/4 |
| Parameter | Description |
|---|---|
| --micro-batch-size | Number of samples per device per step |
| --global-batch-size | Global batch size (= micro * DP * gradient_accum) |
| --seq-length | Training sequence length |
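
The relation global = micro * DP * gradient_accum can be checked before launching a job. The sketch below is illustrative only: the function names are hypothetical, and deriving DP as world size divided by TP * PP * CP is the usual Megatron-style convention, which this document does not state explicitly.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Assumed convention: DP = world_size / (TP * PP * CP)."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


def gradient_accumulation_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Solve global = micro * DP * gradient_accum for gradient_accum."""
    per_step = micro_batch * dp
    if global_batch % per_step != 0:
        raise ValueError(
            f"global_batch={global_batch} is not divisible by micro*DP={per_step}"
        )
    return global_batch // per_step


# Example: 8 devices with TP=2 and PP=2 leave DP=2.
dp = data_parallel_size(world_size=8, tp=2, pp=2)        # 2
steps = gradient_accumulation_steps(16, 1, dp)           # 8
```

A `ValueError` here signals a batch or parallelism setting that cannot be satisfied with whole accumulation steps.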
| Parameter | Description |
|---|---|
| --recompute-granularity | Recomputation granularity: full / selective |
| --recompute-method | Recomputation method: uniform / block |
| --use-distributed-optimizer | Use ZeRO-1 distributed optimizer |
| --sequence-parallel | Sequence parallelism (reduces activation memory) |
| Parameter | Description |
|---|---|
| --train-iters | Total training steps |
| --lr | Initial learning rate |
| --min-lr | Minimum learning rate |
| --lr-decay-style | Learning rate decay strategy: cosine / linear |
| --weight-decay | Weight decay |
| --bf16 | Use BF16 mixed precision |
| --use-flash-attn | Enable FlashAttention |
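
To see how --lr, --min-lr, --train-iters, and --lr-decay-style interact, the sketch below evaluates the textbook cosine and linear decay formulas. It is an approximation for intuition, assuming no warmup and decay over the full run; the scheduler used by the training scripts may differ in detail, and the function name is hypothetical.

```python
import math


def decayed_lr(step: int, train_iters: int, lr: float, min_lr: float,
               style: str = "cosine") -> float:
    """Learning rate at `step`, decaying from lr to min_lr over train_iters steps."""
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / train_iters, 0.0), 1.0)
    if style == "cosine":
        coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    elif style == "linear":
        coeff = 1.0 - progress
    else:
        raise ValueError(f"unsupported decay style: {style}")
    return min_lr + coeff * (lr - min_lr)
```

Both styles start at the initial learning rate and end at the minimum; cosine decays more slowly at the start and end of the run than linear does.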
| Setting | Recommendation |
|---|---|
| --ipc=host | Required for DataLoader shared memory |
| --privileged | Required for NPU device access |
| --num-workers | Set to 0 if Docker shm is insufficient |
| MASTER_PORT | Change if the port conflicts with a stale process |
MindSpeed-MM supports two distributed training backends:
| Feature | Megatron | FSDP2 |
|---|---|---|
| Maturity | Mature and stable | Newer |
| Parallelism | Fine-grained TP/PP/CP/EP control | Automatic sharding |
| Configuration | Command-line arguments | --fsdp2-config-path specifies YAML |
| Supported Models | All models | Select models (Qwen3.5, CosyVoice3, Kimi-K2.5, etc.) |
| Advantage | Flexible and tunable | Simple configuration, easy to get started |
Selection Guidelines:
FSDP2 uses --fsdp2-config-path to specify the configuration file, replacing Megatron's TP/PP/CP parameters.
The following parameters must be consistent between weight conversion and training:
| Parameter | Weight Conversion (mm-convert) | Training Script |
|---|---|---|
| TP (tensor-model-parallel-size / tp_size) | Set | Must match |
| PP (pipeline-model-parallel-size / pp_layers) | Set | Must match |
| Model architecture | Determined by HF config | Must match |
Inconsistent parameters will cause weight loading failures or shape mismatch errors.
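
A small pre-launch check can catch such mismatches before any weights are loaded. The sketch below compares two plain dictionaries. The keys `tp` and `pp` and the function name are hypothetical, and it assumes the PP degree has already been derived from the conversion settings; how to read that value from pp_layers is not covered here.

```python
def check_parallel_consistency(convert_cfg: dict, train_cfg: dict) -> list[str]:
    """Return a list of human-readable mismatches; an empty list means consistent."""
    problems = []
    for key in ("tp", "pp"):
        converted = convert_cfg.get(key)
        trained = train_cfg.get(key)
        if converted != trained:
            problems.append(
                f"{key.upper()} mismatch: conversion={converted}, training={trained}"
            )
    return problems


# Example: weights converted with TP=2 but training configured with TP=4.
issues = check_parallel_consistency({"tp": 2, "pp": 1}, {"tp": 4, "pp": 1})
```

Failing fast on a non-empty result is cheaper than waiting for a shape mismatch during checkpoint loading.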
Verify each item before starting deployment:
- Container started with --privileged --ipc=host (or --shm-size=16g)
- python -c "import torch; import torch_npu; print(torch.npu.is_available())"
- npu-smi info
- pip show mindspeed-mm
- ls MindSpeed-MM/megatron/

Q: How do I determine which Skill to use?
Choose based on model type: use mindspeed-mm-vlm for VLM models, mindspeed-mm-generative for generative models. When in doubt, refer to the model index table above.
Q: What if different models have conflicting dependency versions?
MindSpeed-MM models have vastly different version requirements for transformers/diffusers/peft. It is strongly recommended to create a separate Docker container for each model. See the dependency conflict section in mindspeed-mm-env-setup.
Q: Where can I find training scripts and configurations for a specific model?
Example scripts and YAML configurations for each model are located in the MindSpeed-MM/examples/<model_name>/ directory.
Q: What is the difference between pretrain_vlm.py and pretrain_qwen2vl.py?
pretrain_vlm.py is the new unified entry point that differentiates models via YAML configuration. pretrain_qwen2vl.py is the legacy model-specific entry point. New models should use the unified entry; legacy models still use their dedicated entry points.
Q: Why do generative models need a feature extraction step?
Generative models (e.g., Wan, CogVideoX) do not directly ingest raw video/images during training. Instead, a VAE first encodes video into latent features, and a TextEncoder encodes text into embeddings. Training then loads these pre-extracted features directly. This avoids redundant encoding during training and significantly improves training efficiency.
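
The saving can be illustrated with a toy count of encoder invocations. The sketch below uses a stand-in `encode` function rather than a real VAE or TextEncoder, and all names are hypothetical; it only shows that pre-extraction encodes each sample once, while on-the-fly encoding repeats the work every epoch.

```python
def count_encoder_calls(num_samples: int, epochs: int, precompute: bool) -> int:
    """Count how often a stand-in encoder runs under each strategy."""
    calls = 0

    def encode(sample):
        # Stand-in for VAE / TextEncoder inference.
        nonlocal calls
        calls += 1
        return ("features", sample)

    samples = list(range(num_samples))
    if precompute:
        # Feature extraction step: encode every sample once, then reuse.
        cache = [encode(s) for s in samples]
        for _ in range(epochs):
            for _features in cache:
                pass  # training step would consume cached features here
    else:
        # On-the-fly: every epoch re-encodes every sample.
        for _ in range(epochs):
            for s in samples:
                encode(s)
    return calls


on_the_fly = count_encoder_calls(100, 3, precompute=False)   # 300
cached = count_encoder_calls(100, 3, precompute=True)        # 100
```

With pre-extraction the encoder cost stays fixed as the number of epochs grows.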
Q: Training fails with Communication_Error_Bind_IP_Port
A stale process from a previous run is still holding the port. Kill the leftover processes or change MASTER_PORT in the training script.
ps aux | grep torchrun | grep -v grep | awk '{print $2}' | xargs kill -9
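
Before changing MASTER_PORT, it can help to confirm whether the port is actually in use. The sketch below uses only the Python standard library; the function names are hypothetical, and it checks the local loopback address by default, which is an assumption about where the master process binds.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

If `is_port_free` returns False for the configured MASTER_PORT, either clean up the stale processes as shown above or use the value from `pick_free_port()`. Another process can still claim the port between the check and the launch, so treat the result as a hint.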