From external-gitcode-ascend-skills
Scans PyTorch codebases for CUDA/NVIDIA dependencies and assesses migration feasibility to Ascend NPUs, organized by 7 domains (device layer, attention, custom ops, distributed, precision, third-party, compilation).
Slash command
/external-gitcode-ascend-skills:ascend-migration-analysis
Performs a systematic Ascend NPU migration feasibility analysis for any PyTorch project. Scans the codebase for CUDA/NVIDIA dependencies, classifies them by domain, gives a replacement for each item, and estimates the migration effort.
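Run the following 7-step analysis on the target project; each step corresponds to one migration domain. See each domain's reference document for detailed methods and replacements.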
Read references/01-dependency-audit.md.
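Scan for:
- NVIDIA-specific dependencies in requirements.txt / setup.py / pyproject.toml
- flash-attn, triton, xformers, nvidia-ml-py, cuda-python, cupy, etc.

Output: dependency compatibility matrix (each dependency labeled compatible / needs replacement / unsupported / not applicable)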
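
The dependency scan above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the function names `normalize` and `audit_requirements` and the labels "needs review" / "not flagged" are assumptions made for the example, and it only handles requirements.txt-style text. Deciding between compatible / needs replacement / unsupported still requires the reference document.

```python
import re

# Packages named in the scan list above; extend as needed for a given project.
NVIDIA_ONLY = {"flash-attn", "triton", "xformers", "nvidia-ml-py", "cuda-python", "cupy"}


def normalize(name: str) -> str:
    """Normalize a package name: lowercase, runs of '-', '_' or '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def audit_requirements(text: str) -> dict[str, str]:
    """Map each requirement in requirements.txt-style text to a provisional label."""
    matrix = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if not match:
            continue
        name = normalize(match.group(0))
        # cupy is published as cupy-cuda11x, cupy-cuda12x, ..., so match by prefix too.
        flagged = name in NVIDIA_ONLY or name.startswith("cupy-")
        matrix[name] = "needs review" if flagged else "not flagged"
    return matrix
```

Every entry labeled "needs review" is a candidate row for the compatibility matrix; setup.py and pyproject.toml would need their own parsing.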
Read references/02-device-layer.md.
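Scan for (using grep/rg):
- The full torch.cuda.* API family
- "cuda" device strings
- backend="nccl" distributed backend
- amp.autocast(device_type='cuda') mixed precision
- torch.compile / @torch.compile usage
- torch.cuda.Stream / torch.cuda.Event
- init_device_mesh("cuda")

Output: device-layer replacement list (file:line → replacement)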
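
Where grep/rg is not available, the same device-layer scan can be approximated in Python. The sketch below is illustrative only: the pattern names, the regexes, and the function `scan_source` are assumptions for the example, and it covers a subset of the scan list above. It produces findings in the file:line shape the output calls for.

```python
import re

# Patterns drawn from the scan list above; the regexes themselves are illustrative.
DEVICE_PATTERNS = {
    "torch.cuda API": re.compile(r"\btorch\.cuda\.\w+"),
    "cuda device string": re.compile(r"""["']cuda(?::\d+)?["']"""),
    "nccl backend": re.compile(r"""backend\s*=\s*["']nccl["']"""),
    "torch.compile": re.compile(r"\btorch\.compile\b"),
    "device mesh": re.compile(r"\binit_device_mesh\s*\("),
}


def scan_source(path: str, source: str) -> list[tuple[str, str]]:
    """Return (location, pattern name) pairs, with location formatted as 'file:line'."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in DEVICE_PATTERNS.items():
            if pattern.search(line):
                findings.append((f"{path}:{lineno}", name))
    return findings
```

A line such as `init_device_mesh("cuda")` is reported twice, once as a device string and once as a device mesh call, which is intentional: each finding may need a different replacement.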
Read references/03-attention-mechanism.md.
Scan for:
- flash_attn (FA2) imports and calls
- flash_attn_interface (FA3) imports and calls
- xformers attention calls
- F.scaled_dot_product_attention calls

Output: attention backend replacement plan (each attention type → its NPU replacement)
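
Attention backends are often imported inside try/except guards, which plain text search can miss or confuse with comments. One possible approach, sketched here with Python's standard `ast` module, is to walk the syntax tree for imports. The function name `find_attention_imports` and the backend labels are assumptions for the example; it detects imports only, not F.scaled_dot_product_attention call sites.

```python
import ast

# Module roots from the scan list above, mapped to a short backend label.
ATTENTION_MODULES = {
    "flash_attn": "FA2",
    "flash_attn_interface": "FA3",
    "xformers": "xformers",
}


def find_attention_imports(source: str) -> list[tuple[int, str, str]]:
    """Return (line, module, backend label) for each attention-related import."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            root = module.split(".")[0]
            if root in ATTENTION_MODULES:
                hits.append((node.lineno, module, ATTENTION_MODULES[root]))
    return sorted(hits)  # ast.walk is breadth-first, so sort by line number
```

Each hit identifies an attention type that needs an entry in the replacement plan.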
Read references/04-custom-operators.md.
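Scan for:
- .cu / .cuh / .cpp custom CUDA kernel files
- Triton kernels decorated with @triton.jit
- torch.utils.cpp_extension.CUDAExtension builds
- Custom forward/backward passes in torch.autograd.Function subclasses
- csrc/ / ops/ / kernels/ directories

Output: custom operator migration plan (a functional description of each kernel + its replacement path)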
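
A first pass over the file tree can build the kernel inventory before anyone reads the kernels themselves. The following is a rough sketch using `pathlib`; the function name `inventory_kernels` and the report keys are hypothetical, and it does not look for CUDAExtension or torch.autograd.Function usage.

```python
from pathlib import Path

KERNEL_SUFFIXES = {".cu", ".cuh", ".cpp"}
KERNEL_DIRS = {"csrc", "ops", "kernels"}


def inventory_kernels(root: str) -> dict[str, list[str]]:
    """List candidate kernel sources, Triton files, and kernel directories under root."""
    report = {"native_sources": [], "triton_files": [], "kernel_dirs": []}
    for path in sorted(Path(root).rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.is_dir():
            if path.name in KERNEL_DIRS:
                report["kernel_dirs"].append(rel)
        elif path.suffix in KERNEL_SUFFIXES:
            report["native_sources"].append(rel)
        elif path.suffix == ".py" and "@triton.jit" in path.read_text(errors="ignore"):
            report["triton_files"].append(rel)
    return report
```

Note that a .cpp file is only a candidate: it may be plain C++ with no CUDA dependency, so each entry still needs a manual look before it goes into the migration plan.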
Read references/05-distributed.md.
Scan for:
- dist.init_process_group backend selection
- dist.all_to_all / dist.all_gather / dist.broadcast, etc.
- init_device_mesh / DeviceMesh usage
- torch.distributed.P2POp / batch_isend_irecv

Output: distributed communication adaptation plan
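
The most common distributed finding is a hard-coded backend string. One way to adapt it, shown here as a sketch rather than a prescribed fix, is to derive the backend from the device type. The helper name `select_backend` is hypothetical; the "hccl" value for NPU comes from this document's replacement table, and "gloo" is PyTorch's usual CPU backend.

```python
def select_backend(device_type: str) -> str:
    """Pick a torch.distributed backend name from the device type."""
    backends = {"cuda": "nccl", "npu": "hccl", "cpu": "gloo"}
    try:
        return backends[device_type]
    except KeyError:
        # Fail loudly instead of silently falling back to a wrong backend.
        raise ValueError(f"no backend known for device type {device_type!r}") from None
```

A call site would then read `dist.init_process_group(backend=select_backend(device_type))`, assuming torch_npu is installed and imported so that the hccl backend is available.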
Read references/06-precision-strategy.md.
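Scan for:
- .float() / .double() type casts
- torch.bfloat16 / torch.float16 dtype usage
- torch.complex128 / torch.float64 high-precision usage
- dtype settings inside autocast contexts
- assert statements of the form dtype == torch.float32

Output: precision strategy adjustment list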
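
Before listing individual adjustments, a per-category tally gives a quick sense of how precision-sensitive a file is. This sketch is an assumption-laden illustration: the category names, regexes, and the function `tally_precision` are invented for the example, and autocast dtype settings are not covered.

```python
import re
from collections import Counter

PRECISION_PATTERNS = {
    "upcast (.float/.double)": re.compile(r"\.(?:float|double)\(\)"),
    "half precision dtype": re.compile(r"\btorch\.(?:bfloat16|float16)\b"),
    "high precision dtype": re.compile(r"\btorch\.(?:complex128|float64)\b"),
    "float32 assert": re.compile(r"\bassert\b.*dtype\s*==\s*torch\.float32"),
}


def tally_precision(source: str) -> Counter:
    """Count occurrences of each precision-sensitive pattern in the source text."""
    counts = Counter()
    for name, pattern in PRECISION_PATTERNS.items():
        counts[name] = len(pattern.findall(source))
    return counts
```

Files with many upcasts or high-precision dtypes are the ones most likely to need entries in the adjustment list.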
Read references/07-task-phase-matrix.md.
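Split the findings from the first 6 steps by task execution phase, and identify:
Output: per-phase dependency matrix + minimal migration set + full migration effort estimate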
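
The per-phase matrix can be represented as a nested mapping. The sketch below assumes that the "minimal migration set" means the dependencies used by the one phase you want running first; that reading, the phase names, and both function names are assumptions for illustration, not definitions from the reference document.

```python
def build_phase_matrix(
    findings: list[tuple[str, set[str]]], phases: list[str]
) -> dict[str, dict[str, bool]]:
    """Turn (dependency, phases using it) pairs into a dependency x phase matrix."""
    return {dep: {phase: phase in used for phase in phases} for dep, used in findings}


def minimal_migration_set(matrix: dict[str, dict[str, bool]], target_phase: str) -> list[str]:
    """Dependencies that must be migrated to run only the target phase."""
    return sorted(dep for dep, row in matrix.items() if row[target_phase])


# Hypothetical findings: which phases touch which CUDA-side dependency.
findings = [
    ("flash_attn", {"train", "inference"}),
    ("nccl", {"train"}),
    ("triton", {"inference"}),
]
matrix = build_phase_matrix(findings, ["train", "inference"])
inference_only = minimal_migration_set(matrix, "inference")
```

Here `inference_only` contains flash_attn and triton but not nccl, so an inference-first migration could defer the distributed work.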
After the analysis is complete, output a structured report:
# {Project} Ascend NPU Migration Assessment
## Executive Summary
- Overall feasibility: [Feasible / Needs adaptation / Difficult / Infeasible]
- Estimated effort: [X weeks]
- Blockers: [list]
## Dependency Matrix (per task phase)
| Dependency | Phase A | Phase B | Replacement | Effort |
|------------|---------|---------|-------------|--------|
## Detailed Findings (per domain)
### Domain 1: Third-Party Dependencies
### Domain 2: Device Layer
### Domain 3: Attention Mechanism
### Domain 4: Custom Operators
### Domain 5: Distributed Communication
### Domain 6: Precision Strategy
## Migration Roadmap
## Risk Matrix
| CUDA Pattern | Ascend Replacement | Confidence |
|---|---|---|
| torch.cuda.empty_cache() | torch.npu.empty_cache() or transfer_to_npu | High |
| backend="nccl" | backend="hccl" | High |
| autocast('cuda') | autocast('npu') | High |
| device="cuda" | device="npu" | High |
| flash_attn.flash_attn_func | mindiesd.attention_forward(op_type="ascend_laser_attention") | High |
| flash_attn.flash_attn_varlen_func | mindiesd.attention_forward(op_type="fused_attn_score") | Medium |
| xformers.ops.memory_efficient_attention | mindiesd.attention_forward(op_type="fused_attn_score") | High |
| F.scaled_dot_product_attention | Natively supported by torch_npu | High |
| @triton.jit kernels | Analyze case by case; may be replaceable with mindiesd / RainFusion / native NPU operators | Low-Medium |
| torch.compile | Disable, or use the torch_npu backend | Medium |
| RMSNorm .float() | torch_npu.npu_rms_norm() | High |
| LayerNorm .float() | Remove .float(); the NPU handles BF16 natively | High |
| RoPE complex128 | Downcast to complex64, or use mindiesd.rotary_position_embedding() | High |
| init_device_mesh("cuda") | "npu" | Medium |
| .cu / CUDAExtension | Must be rewritten as Ascend operators | Low |
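
Some of the High-confidence rows are plain textual substitutions, so their effect can be previewed mechanically. The following sketch covers three such rows from the table; the names `HIGH_CONFIDENCE_REWRITES` and `preview_rewrites` are hypothetical, and literal string replacement is a deliberate simplification that ignores quoting variants such as single quotes.

```python
# (CUDA pattern, Ascend replacement) pairs copied from High-confidence table rows.
HIGH_CONFIDENCE_REWRITES = [
    ("torch.cuda.empty_cache()", "torch.npu.empty_cache()"),
    ('backend="nccl"', 'backend="hccl"'),
    ('device="cuda"', 'device="npu"'),
]


def preview_rewrites(source: str) -> str:
    """Apply the literal substitutions and return the rewritten source for review."""
    for old, new in HIGH_CONFIDENCE_REWRITES:
        source = source.replace(old, new)
    return source
```

The result is meant for a diff review, not for writing back blindly; Medium and Low rows need case-by-case analysis and are left out on purpose.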
npx claudepluginhub ascend-ai-coding/awesome-ascend-skills --plugin migration-ascend-torchnpu-skills