Skill

xla

From xla

Comprehensive reference for the XLA (Accelerated Linear Algebra) compiler, covering architecture, operation semantics, HLO IR, the compilation pipeline, GPU/CPU/TPU backends, the PJRT API, MLIR integration, custom calls, autotuning, SPMD partitioning, debugging tools, and the build system.

Invocation

Slash command: /xla:xla

Context Preview

XLA is an open-source machine learning (ML) compiler for GPUs, CPUs, and ML accelerators. It takes models from popular ML frameworks such as PyTorch, TensorFlow, and JAX, and optimizes them for high-performance execution across different hardware platforms.

Supporting Files

SKILL.md

162 lines · ~1.9k tokens

Stats

Parent stars: 0
Maintenance: Good
Last Commit: May 6, 2026

XLA (Accelerated Linear Algebra)

Key Objectives

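Improve execution speed: Compile subgraphs to cut the execution time of short-lived ops and remove runtime overhead, fuse pipelined operations to reduce memory traffic, and specialize for known tensor shapes to allow more aggressive constant propagation
Improve memory usage: Analyze and schedule memory usage, eliminating many intermediate storage buffers
Reduce reliance on custom ops: Improve the performance of automatically fused low-level ops until they match hand-tuned custom ops
Improve portability: Make it relatively easy to write a new backend for novel hardware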
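
The fusion and intermediate-buffer objectives above can be illustrated with a toy model. This is only a sketch in plain Python, assuming a two-op expression (x * y) + x. The function names `unfused` and `fused` are hypothetical and say nothing about how XLA implements fusion internally.

```python
# Toy model of operator fusion; illustrative only, not XLA's implementation.

def unfused(xs, ys):
    """Compute (x * y) + x with a materialized intermediate list."""
    products = [x * y for x, y in zip(xs, ys)]    # intermediate buffer
    return [p + x for p, x in zip(products, xs)]  # second pass over the data


def fused(xs, ys):
    """Compute the same values in a single pass with no intermediate buffer."""
    return [x * y + x for x, y in zip(xs, ys)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [4.0, 5.0, 6.0]
    print(fused(a, b))                   # [5.0, 12.0, 21.0]
    print(unfused(a, b) == fused(a, b))  # True
```

Both functions return the same values. The fused form simply never stores the product list, which is the kind of saving the memory objective describes.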

Supported Frameworks and Hardware

Frameworks: JAX, TensorFlow, PyTorch
Hardware: NVIDIA GPUs (CUDA), AMD GPUs (ROCm), CPUs (x86/ARM), TPUs, custom accelerators
Project: Part of the OpenXLA ecosystem

Compilation Pipeline Overview

ML Framework → StableHLO → HLO → Target-Independent Optimizations → Target-Specific Optimizations → Code Generation

StableHLO Input: ML frameworks emit the model as StableHLO operations
HLO Conversion: StableHLO is converted to XLA's internal HLO representation
Optimizations: Target-independent passes such as common-subexpression elimination (CSE), operation fusion, buffer analysis, and layout assignment
Backend Processing: The selected backend runs target-specific HLO optimizations before code generation
Code Generation: LLVM IR is lowered to PTX (NVIDIA GPUs), native machine code (CPUs), or a device-specific binary
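
To make the CSE step in the list above concrete, here is a minimal sketch of common-subexpression elimination over a tiny, hypothetical instruction list. The tuple layout `(name, opcode, operands)` and the function name are assumptions made for illustration. XLA's real HLO data structures and pass interfaces are different and considerably richer.

```python
# Toy common-subexpression elimination over a hypothetical mini-IR.
# Each instruction is (name, opcode, operand_names); this is not XLA's data model.

def eliminate_common_subexpressions(instructions):
    """Return (deduplicated instructions, {removed name: surviving name})."""
    seen = {}      # (opcode, operands) -> first instruction that computes it
    replaced = {}  # removed instruction name -> surviving instruction name
    result = []
    for name, opcode, operands in instructions:
        # Redirect operands that pointed at instructions removed earlier.
        operands = tuple(replaced.get(op, op) for op in operands)
        key = (opcode, operands)
        # Parameters are distinct inputs, so they are never merged.
        if opcode != "parameter" and key in seen:
            replaced[name] = seen[key]
            continue
        seen.setdefault(key, name)
        result.append((name, opcode, operands))
    return result, replaced


if __name__ == "__main__":
    program = [
        ("p0", "parameter", ()),
        ("p1", "parameter", ()),
        ("a", "add", ("p0", "p1")),
        ("b", "add", ("p0", "p1")),   # duplicate of "a"
        ("c", "multiply", ("a", "b")),
    ]
    optimized, removed = eliminate_common_subexpressions(program)
    for instruction in optimized:
        print(instruction)
    print(removed)  # {'b': 'a'}
```

The duplicate add is dropped and the multiply is rewritten to read the surviving value twice.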

Quick Reference

Basic Computation Example

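#include "absl/status/statusor.h"
#include "xla/client/xla_builder.h"  // newer XLA trees: "xla/hlo/builder/xla_builder.h"
#include "xla/shape_util.h"

// Builds a computation that adds two f32[1024] vectors element-wise.
absl::StatusOr<xla::XlaComputation> BuildAddVectors() {
  xla::XlaBuilder builder("add_vectors");

  // Create parameters
  const xla::Shape shape = xla::ShapeUtil::MakeShape(xla::F32, {1024});
  xla::XlaOp x = xla::Parameter(&builder, 0, shape, "x");
  xla::XlaOp y = xla::Parameter(&builder, 1, shape, "y");

  // Build computation; the last op added becomes the root
  xla::XlaOp result = xla::Add(x, y);
  (void)result;

  // Build() only produces the XlaComputation; compiling it is a separate step
  return builder.Build();
}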

HLO Text Format Example

HloModule matmul_example

ENTRY main {
  %p0 = f32[1024,512]{1,0} parameter(0)
  %p1 = f32[512,2048]{1,0} parameter(1)
  ROOT %dot = f32[1024,2048]{1,0} dot(%p0, %p1),
         lhs_contracting_dims={1}, rhs_contracting_dims={0}
}
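
As a check on the shapes in the HLO text above, the following sketch derives the result shape of a dot from its contracting dimensions. It assumes no batch dimensions, and the helper names `dot_output_shape` and `multiply_add_count` are hypothetical, not part of any XLA API.

```python
from math import prod


def dot_output_shape(lhs_shape, rhs_shape, lhs_contracting, rhs_contracting):
    """Result shape of a dot without batch dimensions.

    The result keeps the non-contracted lhs dimensions followed by the
    non-contracted rhs dimensions.
    """
    if len(lhs_contracting) != len(rhs_contracting):
        raise ValueError("contracting dimension lists must have equal length")
    for l, r in zip(lhs_contracting, rhs_contracting):
        if lhs_shape[l] != rhs_shape[r]:
            raise ValueError(f"size mismatch: lhs dim {l} vs rhs dim {r}")
    lhs_free = [d for i, d in enumerate(lhs_shape) if i not in lhs_contracting]
    rhs_free = [d for i, d in enumerate(rhs_shape) if i not in rhs_contracting]
    return tuple(lhs_free + rhs_free)


def multiply_add_count(lhs_shape, rhs_shape, lhs_contracting, rhs_contracting):
    """Number of multiply-add pairs the dot performs."""
    out = dot_output_shape(lhs_shape, rhs_shape, lhs_contracting, rhs_contracting)
    contracted = prod(lhs_shape[i] for i in lhs_contracting)
    return prod(out) * contracted


if __name__ == "__main__":
    # Shapes taken from the matmul_example module above.
    print(dot_output_shape((1024, 512), (512, 2048), [1], [0]))  # (1024, 2048)
    print(multiply_add_count((1024, 512), (512, 2048), [1], [0]))
```

The computed shape agrees with the f32[1024,2048] result declared on the ROOT instruction.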

Common Operations

// Simplified signatures; optional trailing parameters are omitted.

// Element-wise operations
XlaOp Add(XlaOp lhs, XlaOp rhs);
XlaOp Mul(XlaOp lhs, XlaOp rhs);
XlaOp Sub(XlaOp lhs, XlaOp rhs);
XlaOp Div(XlaOp lhs, XlaOp rhs);

// Data manipulation
XlaOp Reshape(XlaOp operand, absl::Span<const int64_t> dimensions);
XlaOp Broadcast(XlaOp operand, absl::Span<const int64_t> broadcast_sizes);
XlaOp Slice(XlaOp operand, absl::Span<const int64_t> start_indices,
            absl::Span<const int64_t> limit_indices, absl::Span<const int64_t> strides);
XlaOp Transpose(XlaOp operand, absl::Span<const int64_t> permutation);
XlaOp ConcatInDim(XlaBuilder* builder, absl::Span<const XlaOp> operands, int64_t dimension);

// Linear algebra
XlaOp Dot(XlaOp lhs, XlaOp rhs);
XlaOp DotGeneral(XlaOp lhs, XlaOp rhs, const DotDimensionNumbers& dimension_numbers);
XlaOp Conv(XlaOp lhs, XlaOp rhs, absl::Span<const int64_t> window_strides, Padding padding);

// Collective operations
XlaOp AllReduce(XlaOp operand, const XlaComputation& computation,
                absl::Span<const ReplicaGroup> replica_groups);
XlaOp AllGather(XlaOp operand, int64_t all_gather_dimension, int64_t shard_count,
                absl::Span<const ReplicaGroup> replica_groups);

// Control flow
XlaOp While(const XlaComputation& condition, const XlaComputation& body, XlaOp init);
XlaOp Conditional(XlaOp predicate, XlaOp true_operand, const XlaComputation& true_computation,
                  XlaOp false_operand, const XlaComputation& false_computation);
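
The meaning of a few of these operations can be sketched with reference semantics in plain Python. The functions below are hypothetical stand-ins that operate on Python values rather than XlaOp handles, and `slice_1d` covers only the one-dimensional case. They are meant to show what Slice, While, and Conditional compute, not how XLA executes them.

```python
def slice_1d(operand, start, limit, stride=1):
    """Elements at start, start + stride, ... strictly below limit."""
    return [operand[i] for i in range(start, limit, stride)]


def while_loop(condition, body, init):
    """Apply body repeatedly while condition holds, threading one state value."""
    state = init
    while condition(state):
        state = body(state)
    return state


def conditional(pred, true_val, true_comp, false_val, false_comp):
    """Run exactly one branch, passing it its own operand."""
    return true_comp(true_val) if pred else false_comp(false_val)


if __name__ == "__main__":
    print(slice_1d([0, 1, 2, 3, 4, 5, 6, 7], 1, 7, 2))  # [1, 3, 5]

    # State is (counter, running_total); sums 0 + 1 + 2 + 3 + 4.
    final = while_loop(lambda s: s[0] < 5,
                       lambda s: (s[0] + 1, s[1] + s[0]),
                       (0, 0))
    print(final)  # (5, 10)

    print(conditional(True, 3, lambda v: v * 2, 4, lambda v: v + 1))  # 6
```

As in the signatures above, the loop carries a single state value, so multiple loop variables are packed into a tuple.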

Common Tools

# Dump HLO from JAX
XLA_FLAGS=--xla_dump_to=/tmp/hlo_dump python my_program.py

# Run HLO module
run_hlo_module --platform=CUDA --reference_platform=Interpreter computation.hlo

# Optimize and inspect HLO
hlo-opt --platform=CUDA --stage=hlo input.hlo
hlo-opt --passes=algebraic-simplifier input.hlo

# Deviceless GPU compilation
hlo-opt --platform=CUDA --stage=llvm \
  --xla_gpu_target_config_filename=gpu_specs/a100_pcie_80.txtpb input.hlo
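
When the program is launched from a script rather than a shell, the dump flag can be set programmatically. The sketch below assumes that XLA_FLAGS holds space-separated flags. The helper name `with_xla_dump` is hypothetical, and `my_program.py` is the placeholder script name used above.

```python
import os


def with_xla_dump(env, dump_dir):
    """Return a copy of env whose XLA_FLAGS requests an HLO dump to dump_dir.

    Flags already present are kept, except for an earlier --xla_dump_to value.
    """
    new_env = dict(env)
    flags = new_env.get("XLA_FLAGS", "").split()
    flags = [f for f in flags if not f.startswith("--xla_dump_to=")]
    flags.append(f"--xla_dump_to={dump_dir}")
    new_env["XLA_FLAGS"] = " ".join(flags)
    return new_env


if __name__ == "__main__":
    env = with_xla_dump(os.environ, "/tmp/hlo_dump")
    print(env["XLA_FLAGS"])
    # To launch the program with this environment, something like:
    #   subprocess.run([sys.executable, "my_program.py"], env=env, check=True)
```

The environment must be set before the process that initializes XLA starts, which is why the helper returns a fresh mapping to hand to the child process.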

Documentation Structure

Overview and Architecture

01-overview-and-architecture - XLA overview, objectives, and compiler architecture
02-shapes-and-layout - Shapes, layout, tiling, memory spaces, and indexing
03-broadcasting - Broadcasting semantics, rules, and composition

Operation Semantics

04-operation-semantics-elementwise - Element-wise unary operations (Abs, Sin, Cos, Exp, etc.)
05-operation-semantics-binary - Binary operations (Add, Mul, Div, And, Or, etc.)
06-operation-semantics-collective - Collective operations (AllReduce, AllGather, AllToAll, etc.)
07-operation-semantics-control-flow - Control flow (While, Conditional, Reduce, Sort, etc.)
08-operation-semantics-convolution - Convolutions, FFT, and TriangularSolve
09-operation-semantics-data-manipulation - Data manipulation (Reshape, Slice, Broadcast, Gather, etc.)
10-operation-semantics-linear-algebra - Linear algebra (Dot, Cholesky, BatchNorm)
11-operation-semantics-io-and-other - Custom calls, I/O, RNG, tokens, and misc operations

Compiler Infrastructure

12-hlo-ir - HLO IR: module structure, instruction set, text format, verification
13-compilation-pipeline - Compilation pipeline stages from StableHLO to native code
14-hlo-passes - HLO optimization and transformation passes
15-gpu-backend - GPU backend architecture, pipeline, and runtime
16-gpu-emitters - GPU code generation: emitters, partitioning, vectorization
17-cpu-backend - CPU backend architecture and code generation
18-tpu-backend - TPU backend, memory model, and SparseCore

Integration and Extension

19-developing-new-backend - How to develop a new XLA backend
20-pjrt-api - PJRT uniform device API and plugin mechanism
21-mlir-integration - MLIR-HLO dialect integration and TableGen
22-custom-calls - Custom calls and XLA FFI binding
23-async-operations - Async HLO instructions and syntax sugar
24-autotuning - Autotuning framework and persisted results
25-tools - XLA tools: run_hlo_module, hlo-opt, ptx-opt, isolate_hlo
26-build-system - Building XLA from source with Bazel
27-debugging - Debugging, HLO dumps, error codes, determinism
28-aliasing - Input/output buffer aliasing and donation
29-spmd-partitioner - SPMD partitioning, sharding, and GSPMD
30-symbolic-expression - Symbolic expressions, indexing analysis, and dynamic shapes
