From thinking-frameworks-skills
Decomposes ML constructs (attention, norm, convolution, etc.) into linear-algebra primitives and explains why they work. Includes ablation thought experiments.
Slash command: /thinking-frameworks-skills:ml-primitive-decoder
Most ML constructs that look opaque are actually a short stack of well-understood linear algebra primitives composed in a specific way. Attention is dot product + softmax + weighted sum. Layer norm is recenter + rescale + learnable affine. Diffusion is noise injection + denoising regression repeated. This skill names the primitives, then shows why their composition produces the behavior the construct is famous for.
The signature deliverable is: construct = primitive₁ + primitive₂ + primitive₃, and that's why it does X.
Quick example (Attention):
Construct: Self-attention layer, output = softmax(QKᵀ/√d) · V.
Primitives:
- Linear projections (Q, K, V each = a different learnable linear layer applied to the same input).
- Dot product (QKᵀ measures pairwise similarity between query and key vectors).
- Scaled softmax (divides scores by √d to keep variance bounded, then row-normalizes into mixing weights).
- Weighted sum (multiplying by V is a per-row weighted average over value vectors).
Why it works: Each token learns to advertise (K), ask (Q), and deliver (V). The dot product matches asks to advertisements; softmax turns raw matches into selection weights; the weighted sum is the selected content. The whole layer is content-addressable memory implemented in pure linear algebra.
Ablation:
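- Remove softmax → the mixing weights are raw dot products: unbounded, possibly negative, and not summing to 1. Keys no longer compete for a fixed attention budget, so there is no selectivity.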
- Remove √d → softmax saturates as d grows, gradients vanish.
- Tie Q = K → pre-softmax scores become symmetric (score(i, j) = score(j, i)) and each token tends to score itself highly; asymmetric "asker vs. advertiser" matching is lost.
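The four primitives in the quick example can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`matmul`, `softmax_rows`, `self_attention`), the 3-token input, and the identity projection matrices are all hypothetical choices made for readability.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, Wq, Wk, Wv):
    # Primitive 1: three linear projections of the same input.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    # Primitive 2: dot products between every query and every key.
    scores = matmul(Q, transpose(K))
    # Primitive 3: scale by sqrt(d), then row-normalize into mixing weights.
    weights = softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])
    # Primitive 4: each output row is a weighted sum of value vectors.
    return weights, matmul(weights, V)

# Hypothetical toy input: 3 tokens, 2 features, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
weights, out = self_attention(X, I2, I2, I2)
for row in weights:
    print([round(w, 3) for w in row])
```

With identity projections, token 0 scores itself and token 2 equally and token 1 lower, so its output row leans toward the values of tokens 0 and 2.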
Copy this checklist and track your progress:
Decoding Progress:
- [ ] Step 1: Identify the construct and write its formula
- [ ] Step 2: List the primitives it's composed of
- [ ] Step 3: Explain what each primitive contributes
- [ ] Step 4: Show how composition produces the famous behavior
- [ ] Step 5: Run the ablation thought experiment (what breaks if you remove each piece?)
- [ ] Step 6: Verify with a tiny worked example or invitation
Step 1: Identify the construct and write its formula
Pin down what you're decoding. Get the formula in front of you (and the user). If the user asks "why does attention work" without specifying, it's almost always self-attention, but check. Multi-head attention and cross-attention decompose differently, and flash attention is a memory-efficient implementation of the same math rather than a different formula.
For a catalog of common ML constructs and their formulas, see resources/constructs.md.
Step 2: List the primitives it's composed of
The primitives are a small fixed set (see The Primitive Catalog below). Decompose the formula into a list of these. Most constructs are 2-5 primitives.
The discipline: name only the load-bearing primitives. Trivial reshapes, broadcasts, and identity passes don't count. If a primitive doesn't change the behavior of the construct, skip it.
Step 3: Explain what each primitive contributes
For each primitive, write one sentence saying what it does in this construct specifically — not in general. "Dot product measures alignment" is too generic. "Dot product, applied to (Q, K) pairs, scores how well each token's question matches each other token's advertisement" is specific.
Step 4: Show how composition produces the famous behavior
This is the payoff. The construct is famous for some behavior: attention does content-addressable retrieval; layer norm stabilizes training; softmax produces a sharp-or-smooth distribution. Show how the combination of primitives produces that behavior. The composition is exactly where the apparent magic goes away.
A good Step 4 reads like: "Primitive A produces sub-effect 1. Primitive B turns sub-effect 1 into sub-effect 2. Primitive C turns sub-effect 2 into the famous behavior. Each step is mechanical; the famous behavior emerges from the chain."
Step 5: Run the ablation thought experiment
For each primitive, ask: "what would break if we removed this?" The answers are the highest-information part of the decoding — they tell the learner why each piece is there, which is much harder to forget than the formula.
See Ablation Thought Experiment below for the structure. For ablation tables per construct, see resources/ablations.md.
Step 6: Verify with a tiny worked example or invitation
Either compute one tiny instance of the construct end-to-end (a 3-token attention, a 2-feature layer norm) showing each primitive's intermediate output — or invite the user to do so. The worked example is what makes the decomposition feel real, not just declarative.
For tiny worked examples per construct, hand off to the worked-example-walkthrough skill.
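A 2-feature layer norm of the kind Step 6 mentions could look like the sketch below, which returns each primitive's intermediate output. The function name `layer_norm`, the input `[1.0, 3.0]`, and the values of `gamma`, `beta`, and `eps` are assumptions chosen for illustration; the biased variance (dividing by the feature count) follows the usual LayerNorm convention.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one example's features, returning every intermediate."""
    mean = sum(x) / len(x)
    # Primitive 1: center (subtract the per-example mean).
    centered = [v - mean for v in x]
    # Primitive 2: rescale by the standard deviation (biased variance plus eps).
    var = sum(c * c for c in centered) / len(x)
    rescaled = [c / math.sqrt(var + eps) for c in centered]
    # Primitive 3: learnable affine, gamma * x_hat + beta.
    out = [g * r + b for g, r, b in zip(gamma, rescaled, beta)]
    return centered, rescaled, out

# Hypothetical values for illustration.
centered, rescaled, out = layer_norm([1.0, 3.0], gamma=[2.0, 2.0], beta=[0.5, 0.5])
print(centered)
print(rescaled)
print(out)
```

Here the centered values are exactly -1 and 1, the rescaled values are very close to -1 and 1, and the output is close to -1.5 and 2.5. With only two features the rescaled values always land near ±1, so a 3- or 4-feature input makes a more informative exercise.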
Almost every ML construct is built from the small set of primitives in this catalog:
| Primitive | What it does | Famous uses |
|---|---|---|
| Linear projection (matrix multiply) | Maps input vector to a new space; learnable | Q/K/V projections, FFN layers, embeddings |
| Dot product | Measures alignment between two vectors | Attention scores, cosine similarity, score functions |
| Outer product | Forms a rank-1 matrix from two vectors | Hebbian updates, low-rank adaptations (LoRA) |
| Softmax | Normalizes scores into a probability distribution; sharpens at high contrast | Attention weights, classification heads, mixture-of-experts gating |
| Element-wise nonlinearity (ReLU, GELU, sigmoid, tanh) | Introduces nonlinearity, gates information | Hidden layers, gating mechanisms |
| Weighted sum | Combines vectors with given weights | Attention output, mixture models, EMA |
| Centering / normalization | Subtract mean, divide by std | LayerNorm, BatchNorm, GroupNorm |
| Affine transform (γx + β) | Learnable shift + scale | LayerNorm tail, BatchNorm tail |
| Residual / skip connection (x + f(x)) | Adds an identity path around a block | Transformers, ResNets |
| Dropout / noise injection | Randomly zeros or perturbs values | Regularization, diffusion forward process |
| Convolution | Weight-shared local linear map | CNNs |
| Pooling (max, mean, attention) | Aggregates many features into one | Read-out layers, set/graph reductions |
| Embedding lookup | Indexes into a learnable table | Token embeddings, position embeddings |
| Loss (cross-entropy, MSE, contrastive, KL) | Scalar measure of wrongness | Training objective |
| Gradient descent step | Subtract scaled gradient from parameters | Every training loop |
Whenever you decode a construct, the primitives list should come from this catalog. If you find yourself naming something that isn't here, either it's a more complex construct (decompose further) or it's a new primitive worth adding.
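As one illustration of reading a catalog row literally, the convolution entry ("weight-shared local linear map") can be checked directly: a 1-D convolution gives the same result as multiplying by a banded matrix whose rows all reuse the same kernel weights. The sketch below assumes a "valid" (no padding) cross-correlation, which is what most deep learning libraries call convolution; the function names and the difference-filter kernel are hypothetical.

```python
def conv1d(x, kernel):
    """'Valid' 1-D cross-correlation: the same kernel slides over every window."""
    k = len(kernel)
    return [sum(w * x[i + j] for j, w in enumerate(kernel))
            for i in range(len(x) - k + 1)]

def conv_as_matrix(n, kernel):
    """Equivalent linear projection: a banded matrix whose rows share the kernel."""
    k = len(kernel)
    return [[kernel[c - r] if 0 <= c - r < k else 0.0 for c in range(n)]
            for r in range(n - k + 1)]

x = [1.0, 2.0, 4.0, 7.0, 11.0]
kernel = [-1.0, 1.0]  # hypothetical difference filter

direct = conv1d(x, kernel)
W = conv_as_matrix(len(x), kernel)
via_matmul = [sum(w * v for w, v in zip(row, x)) for row in W]

print(direct)
print(via_matmul)
```

Both paths produce the same vector, which is the sense in which convolution is a linear projection with two constraints: locality (zeros outside the band) and weight sharing (every row holds the same kernel).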
Most constructs fit one of four composition patterns:
- The construct is a sequence of primitives applied in order. Decode as: input → P₁ → intermediate → P₂ → ... → output. Examples: attention block, transformer FFN, layer norm.
- The construct is a base primitive wrapped with normalization, residuals, or activations. Decode as: base + decorator₁ + decorator₂. Examples: a transformer block (attention + residual + LN + FFN + residual + LN), ResNet block.
- The construct is a loss function. Decode as: similarity measure + normalization + reduction. Examples: cross-entropy (log + sum), contrastive loss (similarity + softmax + cross-entropy).
- The construct is a single step repeated many times. Decode the step, then explain the dynamics of repetition. Examples: gradient descent (gradient + step), diffusion (noise + denoise step), RNN (state update step).
For one full decode per pattern, see resources/constructs.md.
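For the repeated-step pattern, a small sketch can separate "the step" from "the dynamics of repetition". The example below uses gradient descent on a toy quadratic loss; the loss L(w) = (w - 3)², the learning rate, and the step count are assumptions picked so the behavior is easy to reason about.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd_step(w, lr):
    # The single step: gradient, then scaled subtraction.
    return w - lr * grad(w)

w, lr = 0.0, 0.1
trajectory = [w]
for _ in range(50):
    # The dynamics: the same step applied over and over.
    w = sgd_step(w, lr)
    trajectory.append(w)

print(trajectory[:4], w)
```

On this toy loss each step multiplies the distance to the minimum by (1 - 2·lr), so the step is trivial and all of the interesting behavior (steady convergence for small lr, oscillation at lr = 1, divergence above it) lives in the repetition.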
For every decoded construct, run an ablation table. For each primitive, answer: "what would break if you removed it (or replaced it with the simplest possible thing)?"
This is the most pedagogically valuable part of the skill. It forces the user to see why each piece is there.
Format:
| Remove | What happens | What insight it reveals |
|---|---|---|
| Primitive 1 | Concrete failure mode | What primitive 1 is contributing |
| Primitive 2 | Concrete failure mode | What primitive 2 is contributing |
| ... | ... | ... |
Example (Attention):
| Remove | What happens | What insight it reveals |
|---|---|---|
| Softmax | Mixing weights become raw dot products: unbounded, possibly negative, not summing to 1; keys no longer compete, so there is no selectivity | Softmax is what turns similarity scores into a selective weighted average |
| √d scaling | At high d, dot products become large, softmax saturates, gradients vanish | √d controls the variance of scores so softmax stays in its useful regime |
| Separate Q vs K | Pre-softmax scores become symmetric (score(i, j) = score(j, i)) and each token tends to score itself highly | Splitting Q and K is what allows asymmetric matching |
| V (use K instead) | Model can only retrieve what tokens advertise about themselves; no separate "content" channel | V is the actual payload — separate from the matching key |
| Multi-head | Single semantic relation per layer; loses ability to attend along multiple axes simultaneously | Heads are parallel attention computations on different subspaces |
For ablation tables for many constructs, see resources/ablations.md.
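The √d row of the attention ablation can also be run as a numerical experiment rather than a thought experiment. The sketch below draws one random query and a handful of random keys with unit-variance entries, then compares the largest softmax weight with and without the √d scaling. The function names, the Gaussian inputs, the number of keys, and the seed are all assumptions for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(d, n_keys=8, scale=True, seed=0):
    """Largest softmax weight for one random query against n_keys random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d)]
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_keys)]
    # Dot products of unit-variance vectors have variance that grows with d.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d) for s in scores]
    return max(softmax(scores))

for d in (4, 64, 1024):
    print(d,
          round(max_attention_weight(d, scale=True), 3),
          round(max_attention_weight(d, scale=False), 3))
```

Because dividing the scores by √d acts like raising the softmax temperature, the unscaled maximum weight is never smaller than the scaled one, and the gap typically widens as d grows. A maximum weight near 1 is the saturated regime in which gradients through the softmax become very small.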
When the user wants real understanding, always run the full workflow. The "why" is in Step 4 (composition) and Step 5 (ablation), not in Step 2 (primitive list).
When the user just wants a quick read, Steps 1-3 may be enough; skip the ablation.
When you are unsure about part of a construct, don't fake it. Decode what you understand; flag what you don't. "I can decode the attention part but I'm not sure why this paper added the gating; let me look it up" is a much better response than confident-sounding nonsense.
When a construct looks novel, decompose it into the catalog of primitives anyway. Truly novel ML constructs are rare; most are reshufflings of the catalog. If something genuinely doesn't fit, name it as a new primitive and explain why.
| Construct | Primitives (in order) | Famous behavior emerges because… |
|---|---|---|
| Self-attention | Linear proj × 3 + dot product + scaled softmax + weighted sum | Each token learns to ask, advertise, deliver; the math implements soft database lookup |
| LayerNorm | Center + rescale + learnable affine | Standardizes per-example activations; γ, β preserve expressiveness |
| Softmax | Exp + normalize | Positivity + sum-to-1 + amplification of contrast |
| Cross-entropy | Log + weighted sum | Punishes low predicted prob on the true class |
| Conv layer | Weight-shared linear projection | Translation equivariance + parameter efficiency from sharing |
| ResNet block | f(x) + identity (residual) + nonlinearity | The identity path keeps gradients flowing, and the block only has to learn a small perturbation of its input |
| Dropout | Random binary mask + rescale | Forces redundancy; prevents co-adaptation |
| Diffusion (one step) | Noise injection (forward) + linear projection + nonlinearity (denoise) | Slow noise schedule lets simple regression learn complex distributions |
| SGD step | Gradient + scaled subtraction | Local descent on the loss surface |
| Adam step | Gradient + moving averages + per-coord rescale + scaled subtraction | Per-parameter adaptive learning rate from gradient statistics |
| Embedding | Index + lookup table | Discrete tokens get learnable continuous coordinates |
| Contrastive (InfoNCE) | Dot product + scaled softmax + cross-entropy | Pulls aligned pairs together, pushes others apart, all in one loss |
For the full decode of each construct, see resources/constructs.md. For ablation tables for each construct, see resources/ablations.md.
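Two rows of the quick-reference table, softmax (exp + normalize) and cross-entropy (log + weighted sum), are small enough to write out primitive by primitive. The sketch below is one possible rendering; the logits and the one-hot targets are hypothetical values, and the max-shift inside `softmax` is a standard numerical-stability detail rather than part of the decomposition.

```python
import math

def softmax(logits):
    # Softmax = exp + normalize (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # Cross-entropy = log + weighted sum, with the target distribution as weights.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits
loss_right = cross_entropy(probs, [1.0, 0.0, 0.0])  # true class is the favored one
loss_wrong = cross_entropy(probs, [0.0, 0.0, 1.0])  # true class got the lowest prob

print([round(p, 3) for p in probs], round(loss_right, 3), round(loss_wrong, 3))
```

The exp step guarantees positivity, the normalize step guarantees the outputs sum to 1, and the loss is larger when the true class received a low predicted probability, matching the behaviors listed in the table.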
npx claudepluginhub lyndonkl/claude --plugin thinking-frameworks-skills