From msmodelslim
Model quantization on Ascend NPUs using msmodelslim. Use whenever the user wants to quantize an LLM or VLM (W4A8, W8A8, W4A16, W4A4, or other dtypes), write a quantization YAML config, run sensitive layer analysis, compress model weights for NPU serving, or debug quantization accuracy. Covers one-click quantization, custom YAML authoring, mixed precision for MoE models, VLM calibration, and adding new model adapters.
Slash command
/msmodelslim:msmodelslim quantize / config / analyze / adapter
Model weight quantization on Ascend NPUs — from quick one-click runs to custom YAML configs, mixed precision, and accuracy recovery.
Before quantizing:
- Check NPU status with `npu-smi info`.
- Download the model with `/model-download` if needed. `--model_path` always points to a local directory, never an online model ID.
- Check the installed packages with `pip show msmodelslim torch_npu transformers`.

When a user asks to quantize a model, follow this decision tree:

    User states: model + target dtype (e.g., "quantize Qwen3-32B to W4A8")
                    │
                    ▼
    Is there a lab_practice YAML for this model + dtype?
                    │
               ┌────┴────┐
               │ YES     │ NO
               ▼         ▼
            Use          Build a custom YAML using the config guide.
            one-click    See [references/yaml-config-guide.md](references/yaml-config-guide.md)
            command      for templates and parameter selection.
               │         │
               ▼         ▼
    Quantize → Serve (with /vllm, quantization="ascend") → Evaluate (with /aisbench)
                    │
               ┌────┴────┐
               │ PASS    │ FAIL
               ▼         ▼
             Done        Run sensitive layer analysis → exclude problematic layers → retry
                         See [references/analysis.md](references/analysis.md)

When a pre-configured YAML exists in msmodelslim/lab_practice for the model + dtype:

    msmodelslim quant \
      --model_path ${MODEL_PATH} \
      --save_path ${SAVE_PATH} \
      --device npu \
      --model_type <ModelName> \
      --quant_type <TARGET_DTYPE> \
      --trust_remote_code True

- `--quant_type` auto-matches the best YAML from lab_practice. Values: w4a8, w4a8c8, w8a8, w8a8s, w8a8c8, w8a16, w16a16s
- `--config_path` can be used instead to point to a specific YAML file (takes priority over `--quant_type`)
- Pass `--trust_remote_code True` for models with custom architectures (Qwen3, DeepSeek, GLM, etc.)

When no pre-built config matches, or when the model needs mixed precision (MoE, VLM), write a YAML config and use `--config_path`.
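The choice between the two selection flags can be sketched as a small helper that prints the command rather than running it. This is a minimal sketch: `build_quant_cmd` is a hypothetical function, not part of msmodelslim, and deciding by file extension is an assumption made for illustration. It uses only the flags shown in the one-click command.

```shell
# build_quant_cmd MODEL_PATH SAVE_PATH MODEL_TYPE SELECTOR
# Hypothetical helper: prints an msmodelslim quant command that uses either
# --config_path or --quant_type, never both.
build_quant_cmd() {
  local model_path="$1" save_path="$2" model_type="$3" selector="$4"
  local select_flag
  # Assumption for this sketch: a selector ending in .yaml/.yml is a custom
  # config file; anything else is a --quant_type value such as w4a8.
  case "$selector" in
    *.yaml|*.yml) select_flag="--config_path" ;;
    *)            select_flag="--quant_type" ;;
  esac
  # The output is for review or pasting into a script; arguments are not
  # re-quoted, so paths containing spaces need manual quoting.
  printf 'msmodelslim quant --model_path %s --save_path %s --device npu --model_type %s %s %s --trust_remote_code True\n' \
    "$model_path" "$save_path" "$model_type" "$select_flag" "$selector"
}
```

For example, `build_quant_cmd ./Qwen3-32B ./out Qwen3-32B w4a8` prints a one-click command, while passing `./my_config.yaml` as the last argument prints the custom-config form. The model type value here is a placeholder; use whatever `--model_type` your installation expects.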
Use references/yaml-config-guide.md as the authoritative reference for all YAML structure, processor types, parameters, and templates.
These are the most common decisions. The full rationale and edge cases are in the config guide.
Activation scope + symmetric (hardware-constrained on Ascend NPU):
| scope | symmetric | type | use case |
|---|---|---|---|
| per_token | true | dynamic | Default for LLMs: one scale per token at runtime |
| per_tensor | false | static | Throughput-optimized attention layers |
| pd_mix | false | hybrid | Only with KV cache quantization (w8a8c8) |
Weight scope:
| scope | when |
|---|---|
| per_channel | Standard for W8A8 and all W4A8 |
| per_group + group_size | W4A4 only, when absolute minimum memory is needed |
Weight method (determined by dtype + scope):
| dtype | scope | method |
|---|---|---|
| int8 | per_channel | minmax |
| int4 | per_channel | ssz |
| int4 | per_group | autoround |
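
The weight-method table above can be encoded as a lookup, which is handy for sanity-checking a YAML before a long run. This is only a sketch: `weight_method` is a hypothetical helper, and it covers just the three combinations listed in the table.

```shell
# weight_method DTYPE SCOPE
# Prints the weight method for a dtype/scope pair taken from the table above.
# Returns non-zero for any combination the table does not list.
weight_method() {
  case "$1:$2" in
    int8:per_channel) echo minmax ;;
    int4:per_channel) echo ssz ;;
    int4:per_group)   echo autoround ;;
    *)
      echo "no method listed for dtype=$1 scope=$2" >&2
      return 1
      ;;
  esac
}
```

Calling `weight_method int4 per_group` prints `autoround`; an unlisted pair such as `int8 per_group` fails with a message on stderr instead of guessing.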
Outlier suppression (runs before quantization):
| dtype | preprocessor | subgraph types |
|---|---|---|
| W8A8 (dense) | iter_smooth (alpha=0.5) | norm-linear, linear-linear, ov, up-down |
| W4A8 (standard) | flex_smooth_quant | norm-linear (+ ov if cross-attention exists) |
| W4A4 / aggressive | quarot → flex_smooth_quant | quarot has no subgraph config |
| W4A16 (weight-only) | awq | norm-linear, linear-linear, ov, up-down |
The full loop is: quantize → serve → evaluate → (if fail) analyze → retry.
- Serve: `/vllm` with `quantization="ascend"` in the LLM config
- Evaluate: `/aisbench` (GSM8K; threshold: ≤1 pp drop vs FP16 baseline)
- Analyze (on failure): run sensitive layer analysis, add the problematic layers to `exclude`, and retry quantization

Built-in calibration datasets:

| dataset | use for |
|---|---|
| mix_calib.jsonl (default) | General-purpose text models |
| qwen3_cot.json | Reasoning/CoT models at W8A8 |
| qwen3_cot_w4a4.json | Reasoning models at W4A4 or aggressive W4A8 |
| autocodebench.jsonl | Code models |
For Vision-Language Models: built-in datasets are text-only and cannot calibrate vision components. Always supply a custom multimodal calibration dataset (64–256 samples, base64 image URIs in chat format). See the config guide for format details and vision component exclusion patterns.
Some layers are inherently sensitive to quantization:
- `*gate` routers: never quantize (universal across all lab_practice configs)
- `*lm_head*`: exclude or use higher bit width
- `*embed_tokens*`: exclude for INT4

Start without layer protection. Add exclusions only after evaluation shows accuracy degradation.
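
To see which layers the sensitive-layer patterns would select, they can be tried against layer names with shell globbing. This sketch assumes the patterns behave like shell globs, which may not match how msmodelslim interprets its `exclude` entries. `is_protected_layer` and the layer names are hypothetical.

```shell
# is_protected_layer NAME
# Succeeds if NAME matches one of the sensitive-layer patterns, using shell
# glob semantics (an assumption; msmodelslim's own matching may differ).
is_protected_layer() {
  case "$1" in
    *gate|*lm_head*|*embed_tokens*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical layer names, for illustration only.
for name in model.layers.0.mlp.gate model.layers.0.mlp.gate_proj lm_head model.embed_tokens; do
  if is_protected_layer "$name"; then
    echo "protect  $name"
  else
    echo "quantize $name"
  fi
done
```

Under glob semantics the trailing-anchored `*gate` matches a router named `...mlp.gate` but not a projection named `...mlp.gate_proj`, so ordinary MLP projections stay quantized.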
When quantizing a model with no existing lab_practice config, register it under third-party/msmodelslim/<model_family>/. See references/model-adapter.md for the directory layout, YAML template, and validation steps.
Save quantization commands to a shell script and execute it so output is captured in a timestamped log file. This makes debugging easier when quantization fails.
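
One way to get a timestamped log is a small wrapper function in the script. This is a sketch, not a prescribed layout: `run_logged`, the `quant_<timestamp>.log` naming, and the log directory are all illustrative choices.

```shell
# run_logged LOG_DIR CMD [ARGS...]
# Runs CMD, copies its stdout and stderr to LOG_DIR/quant_<timestamp>.log
# while still printing them, and returns CMD's exit status.
run_logged() {
  local log_dir="$1"
  shift
  mkdir -p "$log_dir" || return 1
  local stamp
  stamp="$(date +%Y%m%d_%H%M%S)"
  local log_file="$log_dir/quant_$stamp.log"
  local status_file="$log_file.status"
  echo "logging to $log_file" >&2
  # The pipeline's own status would be tee's, so the command's exit code is
  # passed out through a temporary file.
  {
    rc=0
    "$@" 2>&1 || rc=$?
    echo "$rc" > "$status_file"
  } | tee "$log_file"
  local status
  status="$(cat "$status_file")"
  rm -f "$status_file"
  return "$status"
}
```

A quantization script would then end with something like `run_logged ./logs msmodelslim quant --model_path "$MODEL_PATH" --save_path "$SAVE_PATH" ...`, so a failed run leaves both a non-zero exit status and a complete log to inspect.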
- msmodelslim is installed in editable mode. Run `pip show msmodelslim` to find the source directory.
- For multi-NPU runs, `export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3` and pass `--device npu:0,1,2,3`.
- `--quant_type` and `--config_path` are mutually exclusive; use one or the other.
- ssz with `per_group` scope: switch to autoround for per_group INT4.
- For INT4 weights, use ssz (per_channel) or autoround (per_group).
npx claudepluginhub starmountain1997/g-claude --plugin msmodelslim