Skill

compile-trace-aot

Debug the PyTorch AOT Autograd stage: functionalization, decompositions, IR transformations, the joint forward+backward graph (when requires_grad=True), partitioning/recomputation, and post-grad passes. Use it to trace the AOT stage and to understand how decompositions are applied.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/torch-compile:compile-trace-aot

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

How to trace and debug AOT Autograd: functionalization, joint graph creation, partitioning, and post-grad passes.

SKILL.md

649 lines · ~3.8k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: Jun 4, 2026


Tracing AOT Autograd Stage - Training Transformations

How to trace and debug AOT Autograd: functionalization, joint graph creation, partitioning, and post-grad passes.

Stage Overview
When AOT Runs
Logging Setup
Output Files and Interpretation
Tracing Functionalization
Tracing Joint Graph
Tracing Partitioning
Post-Grad FX Passes
Debugging Workflows
Common Issues

Stage Overview

AOT Autograd = Ahead-of-Time autograd lowering (training-specific transformations)

What it does:

Functionalization: Removes mutations and aliases
Joint Graph: Combines forward + backward in one FX graph
Partitioning: Splits joint graph into separate forward/backward
Post-Grad Passes: Optimizes both graphs after partitioning

Pipeline Position:

Dynamo → [Pre-Grad] → AOT Autograd → [Post-Grad] → Inductor
                      ↓
               Functionalization
               Joint Graph
               Partitioning

Key Location: torch/_functorch/aot_autograd.py

When AOT Runs

Training vs Inference

Training Path (needs_autograd=True):

Any output requires grad, OR
Any input requires grad with mutations
Creates both forward and backward graphs

Inference Path (needs_autograd=False):

No gradients needed
Skips joint graph, partitioning
Only forward compilation
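
The two paths above can be summarized as a small decision function. This is a sketch in plain Python, not AOT Autograd's actual code: `choose_aot_path` and its parameters are hypothetical names, and the `grad_enabled` check is an added assumption (the training path only matters while autograd is recording).

```python
def choose_aot_path(outputs_require_grad, mutated_inputs_require_grad, grad_enabled=True):
    """Return "training" or "inference" following the rules listed above.

    outputs_require_grad: list of bools, one per graph output.
    mutated_inputs_require_grad: list of bools, one per *mutated* input.
    grad_enabled: assumption, stands in for torch.is_grad_enabled().
    """
    if not grad_enabled:
        return "inference"
    if any(outputs_require_grad) or any(mutated_inputs_require_grad):
        # Joint graph + partitioning: forward and backward graphs are produced.
        return "training"
    # Forward graph only.
    return "inference"


print(choose_aot_path([True], []))        # an output requires grad
print(choose_aot_path([False], [True]))   # a mutated input requires grad
print(choose_aot_path([False], [False]))  # nothing requires grad
```

If a model unexpectedly lands on the inference path, checking these three conditions one by one is usually the quickest way to find out why.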

How to Know If AOT Ran

Check logs:

TORCH_LOGS="aot" python script.py

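Output resembles the following (illustrative; the exact message format varies across PyTorch versions):

[AOT] Compiling forward graph: model__0_forward_0
[AOT] Compiling backward graph: model__0_backward_0

On the inference path, no backward graph is compiled:

# No backward graph message; only a forward graph is handed to Inductor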

Logging Setup

Basic Logging

Minimal (AOT compilation info):

TORCH_LOGS="aot" python script.py

Standard (with graphs):

TORCH_LOGS="aot,aot_graphs" python script.py

Comprehensive (including joint graph):

TORCH_LOGS="aot,aot_graphs,aot_joint_graph,post_grad_graphs" python script.py

Available Loggers

Logger	What It Shows	When to Use
`aot`	Basic AOT compilation tracking	Verify AOT ran
`aot_graphs`	Forward/backward graphs after partitioning	Understanding graph structure
`aot_joint_graph`	Combined forward+backward before split	Debugging partitioning
`post_grad_graphs`	FX graphs before/after post-grad passes	Pattern matching effects
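
To avoid typos in logger names, the table can be turned into a small helper that assembles the TORCH_LOGS value and launches a script with it. This is a sketch: `LOGGERS`, `torch_logs_value`, `run_with_logs`, and the `aot_trace.log` file name are hypothetical, and it assumes the log output arrives on stderr.

```python
import os
import subprocess
import sys

# Logger names taken from the table above.
LOGGERS = ("aot", "aot_graphs", "aot_joint_graph", "post_grad_graphs")


def torch_logs_value(*names):
    """Join logger names into a TORCH_LOGS value, rejecting unknown names."""
    unknown = [name for name in names if name not in LOGGERS]
    if unknown:
        raise ValueError(f"unknown logger(s): {unknown}")
    return ",".join(names)


def run_with_logs(script, *names, log_path="aot_trace.log"):
    """Run `script` in a subprocess with TORCH_LOGS set; save stderr to log_path."""
    env = dict(os.environ, TORCH_LOGS=torch_logs_value(*names))
    with open(log_path, "w") as log_file:
        completed = subprocess.run([sys.executable, script], env=env, stderr=log_file)
    return completed.returncode


print(torch_logs_value("aot", "aot_graphs"))
```

Calling `run_with_logs("script.py", "aot", "aot_graphs")` would then be the programmatic counterpart of the shell commands above.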

Programmatic Setup

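import os
# TORCH_LOGS is read when torch is first imported, so set it before any torch import
os.environ['TORCH_LOGS'] = 'aot,aot_graphs,aot_joint_graph'

import torch._inductor.config as config
config.debug = True  # Also enable Inductor's debug output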

Output Files and Interpretation

File Naming Convention

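Format: {model_name}__{aot_id}__{graph_type}_{nth_graph}

Examples:

model__0__forward_0.py          # First forward graph
model__0__backward_0.py         # First backward graph
model__0__joint_0.py            # Joint graph (if logged)
model__0__forward_transformed_0.py   # After post-grad passes

Note: TORCH_LOGS writes graphs to stderr rather than to files, so redirect stderr (for example, 2> aot_trace.log) if no such files appear. Exact file names and locations vary across PyTorch versions; treat the names and globs in this guide as a pattern to adapt.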
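
When sorting through many dumped graphs, it can help to split names like the examples above into their parts. A minimal sketch, assuming file names follow exactly the pattern of those examples; `NAME_RE` and `parse_graph_filename` are hypothetical helpers, not part of PyTorch.

```python
import re

# Matches names such as model__0__forward_0.py or model__0__forward_transformed_0.py.
NAME_RE = re.compile(
    r"^(?P<model>.+?)__(?P<aot_id>\d+)__(?P<graph_type>[a-z_]+?)_(?P<nth>\d+)\.py$"
)


def parse_graph_filename(filename):
    """Return the name parts as a dict, or None if the name does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["aot_id"] = int(parts["aot_id"])
    parts["nth"] = int(parts["nth"])
    return parts


for name in ("model__0__forward_0.py", "model__0__forward_transformed_0.py"):
    print(name, "->", parse_graph_filename(name))
```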

Graph Structure

Joint Graph (before partitioning):

graph():
    # Forward inputs (primals)
    %arg0 : Tensor = placeholder[target=arg0]
    %arg1 : Tensor = placeholder[target=arg1]

    # Forward computation
    %mul : Tensor = call_function[target=aten.mul](args = (%arg0, 2))
    %add : Tensor = call_function[target=aten.add](args = (%mul, %arg1))

    # Backward inputs (tangents)
    %tangent : Tensor = placeholder[target=tangent]

    # Backward computation
    %mul_grad : Tensor = call_function[target=aten.mul](args = (%tangent, 2))

    # Outputs: forward results + gradients
    return (add, mul_grad)
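
The backward half of the joint graph above can be sanity-checked by hand. The sketch below re-implements the same forward and backward computation on plain Python floats (an assumption made for simplicity; the real graph operates on tensors) and compares the analytic gradient with a finite-difference estimate. All function names are hypothetical.

```python
def forward(arg0, arg1):
    """Forward computation from the joint graph: add = arg0 * 2 + arg1."""
    mul = arg0 * 2
    add = mul + arg1
    return add


def backward(tangent):
    """Backward computation from the joint graph: gradient w.r.t. arg0."""
    return tangent * 2


def finite_difference(arg0, arg1, eps=1e-6):
    """Central-difference estimate of d(forward)/d(arg0)."""
    return (forward(arg0 + eps, arg1) - forward(arg0 - eps, arg1)) / (2 * eps)


analytic = backward(1.0)
numeric = finite_difference(3.0, 5.0)
print("analytic:", analytic, "numeric:", numeric)
assert abs(analytic - numeric) < 1e-6
```

The same comparison is a useful habit when a backward node in a real joint graph looks suspicious.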

Forward Graph (after partitioning):

graph():
    %x : Tensor = placeholder[target=x]
    %weight : Tensor = placeholder[target=weight]
    %mul : Tensor = call_function[target=aten.mul](args = (%x, %weight))
    %add : Tensor = call_function[target=aten.add](args = (%mul, 1))
    return (add, x)  # Output + saved activation for backward

Backward Graph (after partitioning):

graph():
    %saved_x : Tensor = placeholder[target=saved_x]  # From forward
    %grad_output : Tensor = placeholder[target=grad_output]
    # d(add)/d(weight) = x, so grad_weight = grad_output * x (only weight requires grad here)
    %grad_weight : Tensor = call_function[target=aten.mul](args = (%grad_output, %saved_x))
    return (grad_weight,)

What to Look For

In Joint Graph:

Node metadata: meta["partitioner_tag"] = "is_forward" or "is_backward"
Forward vs backward separation
Which activations are saved

In Partitioned Graphs:

Forward outputs include saved activations
Backward inputs match saved activations
Gradient flow correctness

Tracing Functionalization

What Functionalization Does

Removes mutations and aliases to produce a functional graph; combined with decompositions, this yields Core ATen IR.

Before (Full ATen IR):

def f(x):
    x.mul_(2)      # In-place mutation
    return x.add(1)

After (Core ATen IR):

def f(x):
    x_new = x * 2  # Functional
    return x_new + 1
# x.mul_() mutation tracked in metadata, applied at runtime
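
The idea of "compute functionally, then apply the mutation at runtime" can be illustrated without PyTorch. In this sketch plain Python lists stand in for tensors, and `f_mutating`, `f_functional`, and `wrapper` are hypothetical names; it shows the concept only, not how AOT Autograd is implemented.

```python
def f_mutating(x):
    """Original program: mutates its input in place, then returns a new value."""
    for i in range(len(x)):
        x[i] *= 2
    return [v + 1 for v in x]


def f_functional(x):
    """Functionalized version: no mutation, the updated input is an extra output."""
    x_new = [v * 2 for v in x]
    out = [v + 1 for v in x_new]
    return out, x_new


def wrapper(x):
    """Runtime wrapper: runs the functional graph, then copies the mutation back."""
    out, x_new = f_functional(x)
    x[:] = x_new
    return out


a, b = [1, 2, 3], [1, 2, 3]
print(f_mutating(a), a)
print(wrapper(b), b)
```

Both calls leave the caller's list in the same state, which is the property the wrapper code is meant to preserve.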

How to Trace

Logging (captured in AOT graphs):

TORCH_LOGS="aot,aot_graphs" python script.py

What to check:

Graph has no in-place ops (no mul_, add_, etc.)
Mutations tracked in output metadata
Wrapper code copies mutations back

Verifying Mutation Handling

Check graph nodes:

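grep -E "aten\.[a-z0-9_]+_[].]" /tmp/torchinductor_$USER/model__*__forward_0.py
# Should find none (all converted to out-of-place)
# Matching on the op target avoids false hits on FX node names such as mul_1
# aten.copy_ may still appear where an input mutation is written back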
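
The same check can be done in Python on captured graph text. A plain search for `mul_` would also hit FX node names such as `%mul_1`, so this sketch matches only ATen op targets whose name ends in an underscore. `INPLACE_OP_RE` and `find_inplace_ops` are hypothetical helpers, and the sample text is made up for illustration.

```python
import re

# An in-place ATen target is an op name ending in "_" (e.g. aten.add_.Tensor).
# The lookahead rejects names where the underscore is followed by more letters.
INPLACE_OP_RE = re.compile(r"aten\.(\w+_)(?!\w)")


def find_inplace_ops(graph_text):
    """Return the sorted set of in-place ATen op names found in graph_text."""
    return sorted(set(INPLACE_OP_RE.findall(graph_text)))


sample = """
%mul_1 : Tensor = call_function[target=aten.mul.Tensor](args = (%x, 2))
%norm : Tensor = call_function[target=aten.native_layer_norm.default](args = (%mul_1,))
%add_ : Tensor = call_function[target=aten.add_.Tensor](args = (%mul_1, 1))
"""
print(find_inplace_ops(sample))
```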

Check metadata (in Python):

# Graph outputs include mutation info
# Look for: return (output, mutated_input)

Tracing Joint Graph

What Is Joint Graph

Joint Graph = Forward + Backward traced together in single FX graph

Purpose:

Trace backward pass via autograd.grad()
Identify what to save for backward
Enable cross-stage optimizations

How to Trace

TORCH_LOGS="aot_joint_graph" python script.py

Output: model__*__joint_*.py file

Interpreting Joint Graph

Node Tags (check metadata):

# Forward nodes:
%mul : Tensor = call_function[...]  # meta["partitioner_tag"] = "is_forward"

# Backward nodes:
%grad_mul : Tensor = call_function[...]  # meta["partitioner_tag"] = "is_backward"

Graph Flow:

Inputs (primals) → Forward computation → Outputs
                         ↓ (saved activations)
Tangents (grad outputs) → Backward computation → Gradients

What to look for:

Forward/backward separation
Activation saving decisions
Gradient flow correctness

Tracing Partitioning

What Partitioning Does

Input: Joint graph (forward + backward)
Output: Separate forward and backward graphs

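Strategies:

Default (default_partition): Simple forward/backward split
Min-Cut (min_cut_rematerialization_partition): Optimizes memory via recomputation; this is normally the one used when compiling with Inductor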

How to Trace

TORCH_LOGS="aot,aot_graphs,aot_joint_graph" python script.py

Compare:

Joint graph: model__*__joint_*.py
Forward graph: model__*__forward_*.py
Backward graph: model__*__backward_*.py

Verifying Partition

Check forward outputs:

# Forward should output:
# 1. User-visible outputs
# 2. Saved activations for backward
return (output, saved_activation_1, saved_activation_2, ...)

Check backward inputs:

# Backward should receive:
# 1. Saved activations from forward
# 2. Gradient w.r.t. outputs (tangents)
def backward(saved_act_1, saved_act_2, grad_output):
    ...

Verify correspondence:

# Forward outputs should match backward inputs
grep "return" model__*__forward_*.py
grep "placeholder" model__*__backward_*.py
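
The correspondence check can also be scripted. The sketch below counts the values returned by the forward graph and the placeholders of the backward graph, then compares the number of saved activations on each side. It assumes graph text in the simplified format used in this guide; the helper names and the default of one user output and one tangent are hypothetical.

```python
import re


def returned_names(graph_text):
    """Names listed in the last `return (...)` line of a printed graph."""
    matches = re.findall(r"return \((.*?)\)", graph_text)
    if not matches:
        return []
    return [name.strip() for name in matches[-1].split(",") if name.strip()]


def placeholder_names(graph_text):
    """Names of all placeholder (input) nodes in a printed graph."""
    return re.findall(r"%(\w+) : .*?placeholder", graph_text)


def check_partition(forward_text, backward_text, num_user_outputs=1, num_tangents=1):
    """Return (matches, saved_by_forward, received_by_backward)."""
    saved = len(returned_names(forward_text)) - num_user_outputs
    received = len(placeholder_names(backward_text)) - num_tangents
    return saved == received, saved, received


forward_text = """
%mul : Tensor = call_function[target=aten.mul](args = (%x, %weight))
%add : Tensor = call_function[target=aten.add](args = (%mul, 1))
return (add, x)
"""
backward_text = """
%saved_x : Tensor = placeholder[target=saved_x]
%grad_output : Tensor = placeholder[target=grad_output]
return (grad_weight,)
"""
print(check_partition(forward_text, backward_text))
```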

Recomputation Analysis

What is recomputed:

Operations recalculated in backward instead of saved
Trade memory for compute time
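
One way to spot recomputed operations is to look for nodes that are defined in both partitioned graphs. The sketch below does this by node name on printed graph text; it rests on the assumption that the partitioner keeps node names from the joint graph, so treat hits as candidates to inspect rather than proof. `defined_nodes` and `recomputed_nodes` are hypothetical helpers and the sample graphs are made up.

```python
import re


def defined_nodes(graph_text):
    """Names of call_function nodes defined in a printed FX graph."""
    return set(re.findall(r"%(\w+) : .*?call_function", graph_text))


def recomputed_nodes(forward_text, backward_text):
    """Nodes computed in the forward graph and computed again in backward."""
    return sorted(defined_nodes(forward_text) & defined_nodes(backward_text))


forward_text = """
%x : Tensor = placeholder[target=x]
%sin : Tensor = call_function[target=aten.sin](args = (%x,))
%cos : Tensor = call_function[target=aten.cos](args = (%sin,))
return (cos, x)
"""
backward_text = """
%x : Tensor = placeholder[target=x]
%tangent : Tensor = placeholder[target=tangent]
%sin : Tensor = call_function[target=aten.sin](args = (%x,))
%mul : Tensor = call_function[target=aten.mul](args = (%tangent, %sin))
return (mul,)
"""
print(recomputed_nodes(forward_text, backward_text))
```

Here `sin` is rebuilt from the saved input `x` in backward instead of being saved as an activation, which is the trade described above.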

How to identify:

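# Compare joint vs backward graph by node name
# A forward-tagged node that is also defined in the backward graph is recomputed
comm -12 <(grep "call_function" joint.py | grep "is_forward" | grep -o "%[A-Za-z0-9_]* :" | sort -u) \
         <(grep "call_function" backward.py | grep -o "%[A-Za-z0-9_]* :" | sort -u)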

Post-Grad FX Passes

When They Run

After: Partitioning
Before: Inductor lowering

On: Both forward and backward graphs separately

How to Trace

TORCH_LOGS="post_grad_graphs" python script.py

Output shows:

Graph before passes
Each pass applied
Graph after passes

Common Passes

Pass	What It Does	How to Verify
Group Batch Fusion	Batches operations together	Look for fused ops
B2B GEMM	Fuses back-to-back matrix multiplies	Check for combined mm ops
Remove Noop	Eliminates no-op operations	Count nodes before/after
Pattern Matching	Various graph rewrites	Compare transformed graph

Verifying Pass Effects

Before Post-Grad:

%mm1 : Tensor = call_function[target=aten.mm](args = (%x, %w1))
%mm2 : Tensor = call_function[target=aten.mm](args = (%mm1, %w2))

After Post-Grad (B2B GEMM fusion; fused_mm_template is an illustrative name, not a real op):

%fused_mm : Tensor = call_function[target=fused_mm_template](
    args = (%x, %w1, %w2)
)
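
Rather than comparing graphs by eye, the effect of a pass can be summarized as a change in op counts. A minimal sketch on printed graph text in the simplified format used here; `op_counts` and `diff_ops` are hypothetical helpers, and `fused_mm_template` is the illustrative target from the example above.

```python
import re
from collections import Counter


def op_counts(graph_text):
    """Count call_function nodes per target."""
    return Counter(re.findall(r"call_function\[target=([\w.]+)\]", graph_text))


def diff_ops(before_text, after_text):
    """Map each target whose count changed to (after - before)."""
    before, after = op_counts(before_text), op_counts(after_text)
    return {
        op: after[op] - before[op]
        for op in sorted(set(before) | set(after))
        if after[op] != before[op]
    }


before_text = """
%mm1 : Tensor = call_function[target=aten.mm](args = (%x, %w1))
%mm2 : Tensor = call_function[target=aten.mm](args = (%mm1, %w2))
"""
after_text = """
%fused_mm : Tensor = call_function[target=fused_mm_template](
    args = (%x, %w1, %w2)
)
"""
print(diff_ops(before_text, after_text))
```

An empty result means the pass left the op mix unchanged, which is a quick signal that the expected pattern did not match.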

Debugging Workflows

Workflow 1: Verify AOT Ran

Goal: Confirm AOT Autograd executed

Steps:

Enable logging:
```
TORCH_LOGS="aot" python script.py
```

Check for AOT messages:

[AOT] Compiling forward graph: ...
[AOT] Compiling backward graph: ...

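If missing:
- Check if model needs gradients
- Verify grad mode is enabled (the call is not inside torch.no_grad() or torch.inference_mode()); model.train() alone does not enable gradients
- Check inputs: x.requires_grad = True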

Workflow 2: Debug Incorrect Gradients

Symptom: Wrong gradients after compilation

Steps:

Compare with eager:

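import torch

# x must be a leaf tensor with requires_grad=True
# Eager mode (reduce to a scalar so backward() needs no explicit gradient)
loss = model(x).sum()
loss.backward()
grad_eager = x.grad.clone()

# Reset accumulated gradients; otherwise the second backward() adds to the first
x.grad = None
model.zero_grad()

# Compiled
model_compiled = torch.compile(model)
loss = model_compiled(x).sum()
loss.backward()
grad_compiled = x.grad.clone()

torch.testing.assert_close(grad_eager, grad_compiled)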

Check joint graph:

TORCH_LOGS="aot_joint_graph" python script.py
# Verify backward computation looks correct

Check partitioning:

TORCH_LOGS="aot_graphs" python script.py
# Verify forward saves correct activations
# Verify backward receives correct inputs

Isolate issue:
- Simplify model to minimal reproduction
- Check specific operation gradients

Workflow 3: Debug Memory Issues

Symptom: OOM during backward pass

Steps:

Check what's being saved:

TORCH_LOGS="aot_graphs" python script.py
# Look at forward graph outputs
# Count number of saved activations

Enable recomputation:

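from functorch.compile import aot_function, make_boxed_func, min_cut_rematerialization_partition

def print_graph(gm, example_inputs):
    # Minimal compiler: print the partitioned graph and run it unchanged
    print(gm.code)
    return make_boxed_func(gm.forward)

# Use min-cut partitioner for memory optimization
# (Usually automatic; to force it, call AOT Autograd directly. fn is the function under test)
aot_fn = aot_function(fn, fw_compiler=print_graph, bw_compiler=print_graph,
                      partition_fn=min_cut_rematerialization_partition)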

Analyze activation memory:

# Count tensors in forward output
grep "return" model__*__forward_*.py
# Each returned tensor (except user output) is saved
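
Once the saved activations are known, their memory footprint follows from shape and dtype. The sketch below does that arithmetic; the shapes are made-up examples, and `DTYPE_BYTES` and `saved_activation_bytes` are hypothetical helpers rather than PyTorch APIs.

```python
from math import prod

# Bytes per element for a few common dtypes.
DTYPE_BYTES = {"float64": 8, "float32": 4, "float16": 2, "bfloat16": 2}


def saved_activation_bytes(saved):
    """Total bytes for a list of (shape, dtype_name) saved activations."""
    return sum(prod(shape) * DTYPE_BYTES[dtype] for shape, dtype in saved)


# Hypothetical saved activations read off a forward graph's return statement.
saved = [
    ((32, 1024, 1024), "float32"),
    ((32, 1024), "float16"),
]
total = saved_activation_bytes(saved)
print(f"{total} bytes = {total / 2**20:.2f} MiB")
```

Sorting the per-tensor sizes in descending order usually shows that a handful of activations dominate, which tells you where recomputation or checkpointing would pay off.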

Workflow 4: Verify Post-Grad Optimization

Goal: Confirm expected fusion happened

Steps:

Enable logging:

TORCH_LOGS="post_grad_graphs" python script.py

Compare before/after:

# Count operations
grep "call_function" model__*__forward_0.py | wc -l
grep "call_function" model__*__forward_transformed_0.py | wc -l

Verify specific pattern:
- B2B GEMM: Look for mm → mm fusion
- Attention: Check for fused attention pattern

Common Issues

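Issue: AOT Not Running (Inference Mode)

Symptom: No backward graph in the AOT logs; only a forward graph reaches Inductor

Cause: Inference path taken (no gradients needed), either because nothing requires grad or because the call runs under torch.no_grad() / torch.inference_mode()

Debug:

# Check if gradients needed
print(any(p.requires_grad for p in model.parameters()))
print(x.requires_grad)
print(torch.is_grad_enabled())

Fix:

x.requires_grad = True  # Make the input require grad
# model.train() only switches layers such as dropout and batch norm; it does not enable gradients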

Issue: Saved Activations Too Large

Symptom: High memory usage, OOM

Debug:

TORCH_LOGS="aot_graphs" python script.py
# Check forward output size
grep "return" model__*__forward_*.py

Solutions:

Use gradient checkpointing: torch.utils.checkpoint.checkpoint()
Enable recomputation (usually automatic)
Reduce batch size

Issue: Backward Graph Missing Operations

Symptom: Incomplete backward computation

Debug:

TORCH_LOGS="aot_joint_graph" python script.py
# Check if backward nodes present in joint graph
grep "is_backward" model__*__joint_*.py

Common causes:

Operation doesn't require grad
Detached tensors breaking gradient flow
In-place operations disrupting autograd

Fix:

# Ensure gradient flow not broken
# Check for .detach() calls
# Verify requires_grad=True

Issue: Post-Grad Fusion Not Happening

Symptom: Expected fusion didn't occur

Debug:

TORCH_LOGS="post_grad_graphs" python script.py
# Compare before/after, verify pattern exists

Common causes:

Pattern not exactly matching expected form
Operations not adjacent in graph
Unsupported operation variants

Fix:

Verify pattern manually in before graph
Check for intermediate operations breaking pattern
Ensure using supported op variants

Quick Reference

Essential Commands

# Basic AOT tracing
TORCH_LOGS="aot,aot_graphs" python script.py

# With joint graph
TORCH_LOGS="aot,aot_graphs,aot_joint_graph" python script.py

# Include post-grad passes
TORCH_LOGS="aot,aot_graphs,post_grad_graphs" python script.py

# Full AOT debug
TORCH_LOGS="aot,aot_graphs,aot_joint_graph,post_grad_graphs" python script.py

Output Files

# View forward graph
cat /tmp/torchinductor_$USER/model__*__forward_0.py

# View backward graph
cat /tmp/torchinductor_$USER/model__*__backward_0.py

# View joint graph (if logged)
cat /tmp/torchinductor_$USER/model__*__joint_0.py

# Compare before/after post-grad
diff model__*__forward_{0,transformed_0}.py

Key Checks

# Verify functionalization (no in-place ops)
grep -E "aten\.[a-z0-9_]+_[].]" model__*__forward_0.py
# Should return nothing (matching on the op target avoids node names such as mul_1)

# Check partitioning (forward outputs match backward inputs)
grep "return" model__*__forward_0.py
grep "placeholder" model__*__backward_0.py

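# Count returned values (saved activations = this count minus user outputs)
grep "return" model__*__forward_0.py | sed -E 's/.*return \(([^)]*)\).*/\1/' | tr ',' '\n' | grep -c "[A-Za-z0-9_]"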

Next Stage

After AOT Stage: Load compile-trace-inductor skill - Tracing Inductor lowering through codegen

Reference: See compile-overview skill for complete pipeline context.
