From external-gitcode-ascend-skills
Designs, implements, builds, tests, documents, and tunes AscendC custom operators for Ascend NPU within an ascend-kernel PyTorch custom-op project. Covers tiling design, code generation, compilation, precision/performance evaluation, and security review.
Slash command: /external-gitcode-ascend-skills:ascendc
This skill drives a new AscendC custom operator from a spec to a production-ready,
benchmarked operator inside an ascend-kernel project (a PyTorch custom-op project
that exposes operators as torch.ops.npu.<op> via csrc/ops/, csrc/register.cpp,
and build.sh). It is self-contained: every phase, template, and reference lives
under this skill directory.
Scope note: this skill targets the ascend-kernel / csrc/ops PyTorch custom-op workflow
(vector / row / index / sort / pool operators, FP16/BF16 up-cast, two-level tiling). It is not for the ops-transformer aclnn/genop flow.
- Read SKILL.md fully first (lifecycle, gates, anti-patterns).
- Read the matching references/NN-*.md before acting.
- Reuse templates/, examples/, and scripts/; do not invent project structure or APIs.
- Operator name: "<name>" (e.g. acosh, rms_norm).

Phase 0 Environment + requirements
Phase 1 Project init -> references/01-project-init.md
Phase 2 Design (design.md) -> references/02-design.md
Phase 3 Test cases -> references/03-testcase-gen.md
Phase 4 Code generation -> references/04-code-gen.md (+ 04a-kernel-api.md)
Phase 5 Compile / debug -> references/05-compile-debug.md
Phase 6 Interface docs -> references/06-doc-gen.md
Phase 7 Precision eval -> references/07-precision-eval.md (fail -> 07b-precision-debug.md)
Phase 8 Performance eval -> references/08-performance-eval.md
Phase 9 Performance optim -> references/09-performance-optim.md
Phase 10 Code review -> references/10-code-review.md
(opt) Memory check -> references/11-mssanitizer.md
Input: operator name (snake_case) + functional/math spec.
Output: built & installed operator, design.md, unified test-case doc, PyTorch-style
README, precision report, performance report (and optimization/review reports if run).
Confirm the build/run environment before any development action.
1. CANN: run echo $ASCEND_HOME_PATH. If set, use it as CANN_PATH. If unset, MUST
   ask the user for the CANN install path. Activate per shell with
   source ${CANN_PATH}/*/set_env.sh.
2. Conda: run echo $CONDA_DEFAULT_ENV. If non-empty and not base, use it. Otherwise
   MUST ask the user for the conda env name; activate with conda activate <env>.
3. Requirements: confirm the operator name, functional spec, supported dtypes
   (float16, float32, may add bfloat16), and SoC (optional, default ascend910b,
   obtained via platform API at runtime).

Details and the decision tree: references/00-environment.md.
Gate: CANN path resolved and activatable; conda env resolved and activatable; operator name and functional spec confirmed.
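
The two environment checks above can be sketched as a small decision function. This is a minimal illustration, not part of the skill: the function name resolve_environment and the returned dictionary keys are hypothetical, and only the two environment variables come from the text.

```python
from typing import Mapping, Optional


def resolve_environment(env: Mapping[str, str]) -> dict:
    """Apply the Phase 0 rules to a mapping of environment variables.

    Hypothetical helper: returns the resolved CANN path and conda env,
    with None for every value that must be requested from the user.
    """
    # Rule 1: ASCEND_HOME_PATH, when set, is used as CANN_PATH.
    cann_path: Optional[str] = env.get("ASCEND_HOME_PATH") or None

    # Rule 2: CONDA_DEFAULT_ENV is accepted only if non-empty and not "base".
    conda_env: Optional[str] = env.get("CONDA_DEFAULT_ENV") or None
    if conda_env == "base":
        conda_env = None

    return {
        "cann_path": cann_path,
        "conda_env": conda_env,
        # Anything unresolved here is a question for the user.
        "ask_user_for": [
            name
            for name, value in (("CANN path", cann_path), ("conda env", conda_env))
            if value is None
        ],
    }
```

Calling resolve_environment(os.environ) in a real session would show which of the two values still has to be asked for before the gate can pass.
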
Locate or create the ascend-kernel project, then scaffold csrc/ops/<op>/.
1. Detect the project with scripts/detect_ascend_kernel_project.sh. If none, copy the bundled
   template templates/ascend-kernel/ and chmod +x build.sh.
2. Create csrc/ops/<op>/{op_host/<op>.cpp, op_kernel/<op>.cpp, CMakeLists.txt, design.md}
   (placeholders).
3. Registration points: csrc/ops.h, csrc/register.cpp, csrc/CMakeLists.txt.

Read references/01-project-init.md.
Gate: project exists (build.sh, CMakeLists.txt, csrc/); csrc/ops/<op>/ skeleton
created with the four files.
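
As an illustration of the skeleton and its gate, the sketch below creates the four placeholder files and then checks the gate conditions. The helper names (skeleton_files, scaffold_operator, phase1_gate) are hypothetical; the bundled scripts and templates remain the intended way to do this.

```python
from pathlib import Path


def skeleton_files(project_root: Path, op: str) -> list[Path]:
    """Paths of the four skeleton files under csrc/ops/<op>/."""
    op_dir = project_root / "csrc" / "ops" / op
    return [
        op_dir / "op_host" / f"{op}.cpp",
        op_dir / "op_kernel" / f"{op}.cpp",
        op_dir / "CMakeLists.txt",
        op_dir / "design.md",
    ]


def scaffold_operator(project_root: Path, op: str) -> list[Path]:
    """Create empty placeholders for any skeleton file that is missing."""
    files = skeleton_files(project_root, op)
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_text("")  # real content arrives in later phases
    return files


def phase1_gate(project_root: Path, op: str) -> bool:
    """Gate: project markers exist and all four skeleton files are present."""
    markers = [
        project_root / "build.sh",
        project_root / "CMakeLists.txt",
        project_root / "csrc",
    ]
    return all(p.exists() for p in markers + skeleton_files(project_root, op))
```
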
Produce a complete design document (design.md); it is the direct input for code generation.
1. Build the UB allocation table with bufferCoefficient per dtype; describe the FP16/BF16 → FP32 up-cast path.
2. Fill templates/design-template.md → write to csrc/ops/<op>/design.md.

Read references/02-design.md (tiling-by-op-type, UB allocation, API map, hardware constraints).
Gate: design.md has function signature, supported dtypes, API pseudocode, UB allocation
table with bufferCoefficient per dtype, tiling struct, and up-cast path.
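
To make the role of bufferCoefficient in two-level tiling concrete, here is a rough sketch of the arithmetic. It rests on assumptions not stated in this document: that bufferCoefficient is the total UB bytes consumed per element across all buffers, and that UB data is aligned to 32 bytes. The names Tiling and compute_tiling and all numeric inputs are hypothetical; references/02-design.md holds the actual rules.

```python
from dataclasses import dataclass

UB_ALIGN_BYTES = 32  # assumed UB alignment unit


@dataclass
class Tiling:
    block_length: int  # elements assigned to each core (aligned)
    used_cores: int    # cores that actually receive data
    tile_length: int   # elements processed per UB tile


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def compute_tiling(total_elems: int, core_num: int, ub_size: int,
                   buffer_coefficient: int, dtype_bytes: int) -> Tiling:
    """Two-level split: across cores first, then into UB-sized tiles."""
    align_elems = UB_ALIGN_BYTES // dtype_bytes

    # Level 1: split across cores, rounding each share up to the alignment unit.
    per_core = ceil_div(total_elems, core_num)
    block_length = ceil_div(per_core, align_elems) * align_elems
    used_cores = ceil_div(total_elems, block_length)

    # Level 2: largest aligned tile whose buffers fit in UB
    # (assumes buffer_coefficient = UB bytes needed per element).
    tile_length = (ub_size // buffer_coefficient) // align_elems * align_elems
    if tile_length == 0:
        raise ValueError("UB too small for one aligned tile")
    tile_length = min(tile_length, block_length)

    return Tiling(block_length, used_cores, tile_length)
```

In this sketch the last used core handles fewer elements than block_length whenever the total is not an exact multiple, which is the tail case the design has to describe. coreNum and ubSize would come from the platform API at runtime rather than from constants.
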
Generate one unified test-case document reused by precision and performance later.
1. Read design.md; produce SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES,
   BOUNDARY_VALUES, and the operator baseline (CPU reference + NPU call).
2. Scale: (TEST_SHAPES + GENERAL_SHAPES) x SUPPORTED_DTYPES >= 30; keep single-shape
   element count reasonable (<= ~200K for regular cases).
3. Fill templates/test-cases-template.md → write
   csrc/ops/<op>/test/<op>-test-cases.md.

Read references/03-testcase-gen.md.
Gate: test-case doc exists with dtypes, shapes, boundary values, and baseline; values
respect design.md constraints.
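
The case-count rule can be checked mechanically. The sketch below is illustrative only: check_case_matrix and the example shapes are hypothetical, and the 200K limit is taken as a hard number even though the text gives it as approximate.

```python
from math import prod

MAX_REGULAR_ELEMS = 200_000  # the "<= ~200K" guideline for regular cases


def check_case_matrix(test_shapes, general_shapes, dtypes) -> dict:
    """Check (TEST_SHAPES + GENERAL_SHAPES) x SUPPORTED_DTYPES >= 30."""
    shapes = list(test_shapes) + list(general_shapes)
    total_cases = len(shapes) * len(dtypes)
    # Shapes whose element count exceeds the regular-case guideline.
    oversized = [s for s in shapes if prod(s) > MAX_REGULAR_ELEMS]
    return {
        "total_cases": total_cases,
        "enough_cases": total_cases >= 30,
        "oversized_shapes": oversized,
    }
```

With 10 shapes and 3 dtypes the matrix reaches exactly 30 cases; any shape listed under oversized_shapes would need to be justified as a deliberate large-shape case.
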
Generate op_host and op_kernel, then wire them into the framework.
1. Pick a template from templates/code-gen/ by operator type (elementwise / row /
   index / index-per-elem / sort / pool); copy into csrc/ops/<op>/ and adapt.
2. op_host: coreNum/ubSize (never hardcode), bufferCoefficient, left-value EXEC_KERNEL_CMD args.
3. op_kernel: BUFFER_NUM=2, Init core offsets, InitBuffer sizes, Compute logic, tail-tile
   alignment, FP16/BF16 up-cast to FP32, DataCopyPad for GM↔UB, backup before Reduce.
4. Register: declaration to csrc/ops.h, m.def+m.impl to csrc/register.cpp,
   host+kernel sources to csrc/CMakeLists.txt.

Read references/04-code-gen.md and the API essentials in references/04a-kernel-api.md.
Gate: both sources generated; three registration points updated; checklist in the reference satisfied.
Build the project, install the wheel, generate a basic test, run it, and debug.
1. Build: chmod +x build.sh && bash build.sh; confirm output/ascend_kernel*.whl.
2. Install: pip install output/ascend_kernel*.whl --force-reinstall --no-deps.
3. Generate tests/test_<op>.py; run functional test (python ...) then precision test
   (pytest -v). Source the environment before every shell command.

Read references/05-compile-debug.md.
Gate: wheel built and installed; functional test exits 0; precision pytest green.
Extract interface facts from source and emit a PyTorch-style README.
1. Extract the schema from register.cpp (m.def), C++ signature from ops.h, algorithm/dtype/
   constraints from design.md, TORCH_CHECK from op_host, example from the test file.
2. Write csrc/ops/<op>/README.md. Default language is
   English; switch to Chinese on request.

Read references/06-doc-gen.md.
Gate: README has signature, params, dtypes, shape, constraints, example, returns; matches
register.cpp schema; displayed in chat.
Run a comprehensive precision suite and produce a report.
1. Case count: (shapes + boundary) x dtypes >= 30.
2. Generate test_<op>_precision.py and run_<op>_precision_report.py from
   templates/precision/; run pytest then the report generator.
3. Pass criteria: MERE < Threshold and
   MARE < 10 x Threshold. Standards table in
   references/07a-precision-standards.md.

Read references/07-precision-eval.md.
Gate: pytest green; JSON + Markdown report written; results displayed in chat.
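
The pass criteria can be illustrated in a few lines of plain Python. This sketch assumes MERE is the mean relative error and MARE the maximum relative error, which this document does not define; the eps guard against division by zero and the threshold used in the usage note are also assumptions. The authoritative thresholds are in references/07a-precision-standards.md.

```python
def relative_errors(actual, expected, eps=1e-7):
    """Element-wise relative error; eps (assumed) guards against zero references."""
    return [abs(a - e) / (abs(e) + eps) for a, e in zip(actual, expected)]


def precision_pass(actual, expected, threshold):
    """Apply MERE < Threshold and MARE < 10 x Threshold."""
    errs = relative_errors(actual, expected)
    mere = sum(errs) / len(errs)  # assumed meaning: mean relative error
    mare = max(errs)              # assumed meaning: maximum relative error
    return {
        "MERE": mere,
        "MARE": mare,
        "passed": mere < threshold and mare < 10 * threshold,
    }
```

For example, precision_pass(npu_values, cpu_reference, threshold=2 ** -10) would compare flattened NPU output against the CPU reference, with 2 ** -10 standing in as a purely illustrative threshold.
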
Benchmark the custom operator against a baseline with torch_npu.profiler.
1. Build the perf case set (JSONL, >= 8 cases) from the test-case doc + design.md; always run
   a dual-path comparison (custom vs baseline; baseline must run on NPU; use a small-op
   composition when no equivalent API exists).
2. Profile with warmup=5, active=5; aggregate Total Time(us) from
   ASCEND_PROFILER_OUTPUT/op_statistic.csv. Copy examples/layer_norm_profiler_reference/
   as the starting point.

Read references/08-performance-eval.md, references/08a-profiler-and-metrics.md, and references/08b-perf-case-jsonl.md.
Gate: dual-path report written; displayed in chat with table + summary + conclusions.
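
The aggregation step can be sketched with the standard csv module. Only the Total Time(us) column name comes from the text above; the OP Type column, the helper names, and the sample rows are assumptions for illustration, and the bundled layer_norm reference remains the intended starting point.

```python
import csv
import io


def total_time_us(csv_text: str, op_types=None) -> float:
    """Sum the 'Total Time(us)' column, optionally restricted to some op types.

    The 'OP Type' column name is an assumption about op_statistic.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    for row in reader:
        if op_types is not None and row.get("OP Type") not in op_types:
            continue
        total += float(row["Total Time(us)"])
    return total


def speedup(baseline_us: float, custom_us: float) -> float:
    """Dual-path comparison: values above 1.0 mean the custom path is faster."""
    return baseline_us / custom_us
```

In practice each path's csv_text would be read from its own ASCEND_PROFILER_OUTPUT/op_statistic.csv, and the two totals fed to speedup for the report table.
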
Investigate, modify, and verify — at most 3 rounds.
Read references/09-performance-optim.md.
Gate: precision still passes; performance compared to baseline; results displayed in chat.
Hypothesis-testing security review against the coding red lines (numeric, memory/pointer, resource, input validation, concurrency, operator interface, ABI compatibility).
Read references/10-code-review.md.
Run mssanitizer to detect illegal access / leaks / UB out-of-bounds. Read references/11-mssanitizer.md.
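Anti-patterns:
- […] design.md.
- Do not use DataCopy for GM↔UB; use DataCopyPad.
- Do not pass non-left-value (rvalue) arguments to EXEC_KERNEL_CMD.
- […] cmake/ or csrc/utils/.
- Do not reuse the source tensor after ReduceSum/ReduceMax without a backup (reduction may modify it).
- Do not call std::min/max/abs/sqrt/exp etc. inside a kernel.
- Do not pass repeatTime > 255 to high-dim split APIs (silent uint8 truncation).

| Detected state | Phase not done | Resume at |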
|---|---|---|
| csrc/ops/<op>/ missing | 1 | Phase 1 |
| design.md placeholder/empty | 2 | Phase 2 |
| <op>-test-cases.md missing | 3 | Phase 3 |
| op_host still skeleton | 4 | Phase 4 |
| wheel not built / basic test failing | 5 | Phase 5 |
| README.md missing | 6 | Phase 6 |
| no precision report / precision failing | 7 | Phase 7 |
| precision report present, no perf report | 8 | Phase 8 |
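
The resume table reads as a first-match rule list: check the states in order and resume at the first phase whose artifact is missing. A minimal sketch follows; the flag names and the resume_phase helper are hypothetical, and how each flag is detected on disk is left out.

```python
# Ordered (flag, phase) pairs mirroring the table rows; flag names are hypothetical.
RESUME_RULES = [
    ("op_dir_exists", 1),
    ("design_filled", 2),
    ("test_cases_exist", 3),
    ("op_host_implemented", 4),
    ("wheel_built_and_basic_test_passing", 5),
    ("readme_exists", 6),
    ("precision_report_passing", 7),
    ("perf_report_exists", 8),
]


def resume_phase(state: dict):
    """Return the first phase whose artifact is missing, or None if none is."""
    for flag, phase in RESUME_RULES:
        if not state.get(flag, False):
            return phase
    return None  # nothing in the table is pending
```
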
| Phase | Precondition | Reference | Key artifact |
|---|---|---|---|
| 0 Env + req | — | 00-environment | CANN + conda + name + spec |
| 1 Init | 0 | 01-project-init | csrc/ops/<op>/ skeleton |
| 2 Design | 1 | 02-design | design.md |
| 3 Test cases | 2 | 03-testcase-gen | <op>-test-cases.md |
| 4 Code-gen | 3 | 04-code-gen | op_host + op_kernel + registration |
| 5 Compile/debug | 4 | 05-compile-debug | installed wheel + green tests |
| 6 Docs | 5 | 06-doc-gen | README.md |
| 7 Precision | 6 | 07-precision-eval | precision report |
| 8 Performance | 7 | 08-performance-eval | performance report |
| 9 Optimize | 8 | 09-performance-optim | optim summary |
| 10 Review | 4+ | 10-code-review | review report |
This skill is self-contained (templates, references, examples, and scripts are bundled).
It only needs a working CANN toolkit and a PyTorch + torch_npu conda environment on a
host with Ascend NPUs to compile and run the operator.