From cppcheatsheet
Enforces readability rules for C, C++, Rust, and CUDA code: short focused functions, flat control flow, clear naming, and idiomatic patterns. Use when writing, reviewing, or refactoring.
Slash command: /cppcheatsheet:readable-cpp
Apply these rules when writing, reviewing, or refactoring C, C++, Rust, or CUDA code. Inspired by *The Art of Readable Code* by Dustin Boswell and Trevor Foucher.
Core principle: Code should be easy to understand. The time it takes someone else (or future you) to understand the code is the ultimate metric.
- When an `if` sits inside a loop inside another `if`, extract the inner block into a helper function with a descriptive name.
- Use `continue` or `break` to skip iterations rather than wrapping the body in a conditional.

// Bad: nested and hard to follow
for (auto& user : users) {
    if (user.is_active()) {
        for (auto& order : user.orders()) {
            if (order.is_pending()) {
                process(order);
            }
        }
    }
}

// Good: flat, each function name explains what it does
auto active_users = get_active_users(users);
for (auto& user : active_users) {
    process_pending_orders(user.orders());
}
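The `continue` rule can be sketched in the same spirit. This is only an illustration, not code from the original guide: `Order`, its fields, and `count_shippable` are hypothetical names invented for the example.

```cpp
#include <vector>

// Hypothetical order record used only for this illustration.
struct Order {
    bool is_pending;
    bool is_paid;
    int quantity;
};

// Sums the quantities of orders that are both pending and paid.
// Each guard skips an iteration with `continue`, so the main logic
// stays at a single indentation level inside the loop.
int count_shippable(const std::vector<Order>& orders) {
    int shippable = 0;
    for (const auto& order : orders) {
        if (!order.is_pending) continue;  // already handled elsewhere
        if (!order.is_paid) continue;     // unpaid orders are not shipped
        shippable += order.quantity;
    }
    return shippable;
}
```

The guards read as a checklist of reasons to skip an element, which tends to be easier to extend than a growing `if (a && b && ...)` wrapper around the body.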
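- Use specific names: `fetch_page` not `get`, `num_retries` not `n`.
- Avoid generic names such as `tmp`, `data`, `result`, `val`, `info`, `handle`, unless the scope is tiny (2-3 lines).
- Attach the important detail to the name: `max_items` not `limit`. If a boolean, use `is_`, `has_`, `should_`, `can_` prefixes.
- Use only widely understood abbreviations (`num`, `max`, `min`, `err` are fine; `svc_mgr_cfg` is not).
- Put the value being tested on the left of a comparison: `if (length > 10)` not `if (10 < length)`.
- Order `if`/`else` blocks deliberately: positive case first, simpler case first, or the more interesting case first.
- ... `if`/`else`.

// Bad
if (!(age >= 18 && has_id && !is_banned)) {
    deny();
}

// Good
bool is_eligible = age >= 18 && has_id && !is_banned;
if (!is_eligible) {
    deny();
}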
- Replace magic numbers with named constants: `if (retries > MAX_RETRIES)` not `if (retries > 3)`.
- Mark known issues with `// TODO:`, `// HACK:`, `// XXX:` with explanation.

// Bad: rustfmt expands this to 4 lines per field, which is noisy and repetitive
fn from_dict(cfg: &Bound<'_, PyDict>) -> PyResult<Self> {
    Ok(Self {
        rom: cfg.get_item("rom")?.ok_or_else(|| missing("rom"))?.extract()?,
        // ... each field becomes 4 lines after rustfmt
    })
}

// Good: extract a helper so each field stays one clean line
fn get_required<T: FromPyObject>(cfg: &Bound<'_, PyDict>, key: &str) -> PyResult<T> {
    cfg.get_item(key)?
        .ok_or_else(|| PyKeyError::new_err(key.to_string()))?
        .extract()
}

fn from_dict(cfg: &Bound<'_, PyDict>) -> PyResult<Self> {
    Ok(Self {
        rom: get_required(cfg, "rom")?,
        actions: get_required(cfg, "actions")?,
    })
}

// Bad: clang-format wraps this into a hard-to-scan block
auto result = container.find(key)->second.get_value().transform(func).value_or(default_val);

// Good: name the intermediate step
auto& entry = container.find(key)->second;
auto result = entry.get_value().transform(func).value_or(default_val);
- Avoid scattering `free()` calls across multiple return paths. A single cleanup section is easier to audit.
- Consider `__attribute__((cleanup))` (GCC/Clang) when available for automatic cleanup.

// Good: single cleanup path
int process_file(const char *path) {
    int ret = -1;

    FILE *fp = fopen(path, "r");
    if (!fp) return -1;  /* nothing acquired yet, so a plain return is safe */

    char *buf = malloc(BUF_SIZE);
    if (!buf) goto cleanup_file;

    // ... do work; on error, goto cleanup_buf ...
    ret = 0;

cleanup_buf:
    free(buf);
cleanup_file:
    fclose(fp);
    return ret;
}
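In C++ code the same guarantee is usually obtained with RAII instead of cleanup labels. The sketch below is a hypothetical C++ counterpart of `process_file`, not part of the original guide: `FileCloser`, `FilePtr`, and `kBufSize` (standing in for `BUF_SIZE`) are names assumed for the example, and the "work" is reduced to a single read.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// Deleter so std::unique_ptr can own a FILE* and call fclose automatically.
struct FileCloser {
    void operator()(std::FILE* fp) const { std::fclose(fp); }
};
using FilePtr = std::unique_ptr<std::FILE, FileCloser>;

// Returns 0 on success and -1 on failure, mirroring the C version.
int process_file(const char* path) {
    constexpr std::size_t kBufSize = 4096;  // assumed stand-in for BUF_SIZE

    FilePtr fp(std::fopen(path, "r"));
    if (!fp) return -1;  // a null unique_ptr never invokes the deleter

    std::vector<char> buf(kBufSize);  // released on every return path

    // Placeholder for the real work: attempt one read and check for errors.
    const std::size_t bytes_read = std::fread(buf.data(), 1, buf.size(), fp.get());
    (void)bytes_read;
    if (std::ferror(fp.get())) return -1;  // fp and buf are still cleaned up

    return 0;
}
```

Because the destructors run on every exit, early returns no longer risk leaking the file handle or the buffer.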
Use `const` liberally:
- Mark pointer parameters `const` when the function doesn't modify the pointed-to data: `const char *msg`.
- Declare local variables `const` when they don't change after initialization.

- Use `<stdint.h>` types (`uint32_t`, `int64_t`) for data that crosses boundaries (files, network, hardware).
- Use `size_t` for sizes and counts, `ptrdiff_t` for pointer differences.
- Use plain `int` and `unsigned` for simple loop counters and local arithmetic.

- Wrap statement-like macros in `do { ... } while(0)`.
- Parenthesize macro arguments and the full expansion: `#define SQUARE(x) ((x) * (x))`.
- Prefer `static inline` functions over macros when possible (type safety, debuggability).
- Use `_Generic` (C11) for type-safe "overloading" instead of macro tricks.

- Use `std::unique_ptr` for exclusive ownership, `std::shared_ptr` only when shared ownership is genuinely needed.
- Avoid raw `new`/`delete` in application code; let smart pointers and containers handle it.
- ... `const&`.
- Use `std::move` only when you truly want to transfer ownership; don't `std::move` from things you'll use again.
- Prefer `std::array` over C arrays, `std::string` over `char*`, `std::vector` over `malloc`/`realloc`.
- Prefer `std::optional` over sentinel values, `std::variant` over type-unsafe unions.
- Use range-based `for` loops: `for (const auto& item : container)`.
- Use structured bindings: `auto [key, value] = *map.begin();`.
- Prefer `std::format` (C++20) or `fmt::format` over `sprintf` / string concatenation.
- Prefer `if constexpr` over SFINAE when possible.

Use `constexpr` and `const` aggressively:
- Mark functions `constexpr` when they can be evaluated at compile time.
- Use `constexpr` variables instead of `#define` for constants.
- Put `const` on member functions that don't modify state.
- Use `consteval` (C++20) for functions that must be compile-time evaluated.

- Use `std::expected` (C++23) or `std::optional` for expected failures.
- ... `std::runtime_error`, not `std::exception`.
- Put `noexcept` on functions that cannot throw (destructors, move operations).

- Prefer borrowing (`&T`, `&mut T`) over cloning. Clone only when ownership transfer is genuinely needed.
- Prefer `&str` over `String` in function parameters when you don't need ownership.
- Prefer iterator chains (`.iter().filter().map().collect()`) over manual loops with indices.
- Use `for item in &collection` instead of `for i in 0..collection.len()`.
- Use `enumerate()`, `zip()`, `chain()`, `chunks()`: the iterator API is rich.
- Avoid `.unwrap()` in production code; use `?`, `unwrap_or`, `unwrap_or_else`, or pattern matching.

// Bad: manual indexing
let mut names = Vec::new();
for i in 0..users.len() {
    if users[i].is_active {
        names.push(users[i].name.clone());
    }
}

// Good: idiomatic iterator chain
let names: Vec<_> = users.iter()
    .filter(|u| u.is_active)
    .map(|u| u.name.clone())
    .collect();
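Several of the C++ rules listed earlier (smart pointers, `std::optional` over sentinels, structured bindings, `constexpr` constants) have no example of their own, so here is one possible sketch that combines them. Every name in it (`Config`, `kDefaultMaxItems`, `find_limit`, `total_limit`, `make_config`) is hypothetical and chosen only for illustration; it assumes C++17.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical configuration record, for illustration only.
struct Config {
    std::string name;
    int max_items;
};

constexpr int kDefaultMaxItems = 16;  // constexpr variable instead of #define

// std::optional instead of a sentinel such as -1 for "not found".
std::optional<int> find_limit(const std::map<std::string, int>& limits,
                              const std::string& key) {
    const auto it = limits.find(key);
    if (it == limits.end()) return std::nullopt;
    return it->second;
}

// Range-based for loop with structured bindings over const references.
int total_limit(const std::map<std::string, int>& limits) {
    int total = 0;
    for (const auto& [name, limit] : limits) {
        total += limit;
    }
    return total;
}

// std::unique_ptr expresses exclusive ownership; no new/delete at the call site.
std::unique_ptr<Config> make_config(const std::map<std::string, int>& limits,
                                    const std::string& name) {
    const int max_items = find_limit(limits, name).value_or(kDefaultMaxItems);
    return std::make_unique<Config>(Config{name, max_items});
}
```

The caller sees the "missing" case in the return type of `find_limit` and decides on a fallback with `value_or`, rather than comparing against a magic value.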
- Use `enum` with data variants instead of class hierarchies or tagged unions.
- Use `match` exhaustively; the compiler ensures you handle all cases.
- Use `if let` / `while let` for single-variant matching instead of full `match`.
- Prefer `Result<T, E>` over panicking; make errors part of the type signature.
- Use newtypes (`struct UserId(u64)`) to prevent mixing up same-typed values.
- Use `Option<T>` instead of sentinel values or null pointers.
- Put `#[must_use]` on functions whose return values shouldn't be ignored.
- Implement `From`/`Into` traits for type conversions over manual conversion functions.
- Keep `pub` surfaces small; expose only what's needed.
- Use `pub(crate)` for crate-internal visibility instead of full `pub`.

- Name kernels after what they compute: `reduce_sum` not `kernel1` or `myKernel`.
- Make the function kind visible in the name: `reduce_sum_kernel` for `__global__`, `warp_reduce` for `__device__` helpers.
- Name launch parameters: `threads_per_block`, `num_blocks` not `tpb`, `nb`, or bare `256`.

// Bad: opaque names, magic numbers
__global__ void k1(float *a, float *b, int n) {
    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) b[i] = a[i] * 2.0f;
}
k1<<<(n+255)/256, 256>>>(d_in, d_out, n);

// Good: clear intent, named constants
constexpr int THREADS_PER_BLOCK = 256;

__global__ void scale_kernel(const float *input, float *output,
                             float scale_factor, int num_elements) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_elements) {
        output[idx] = input[idx] * scale_factor;
    }
}

const int num_blocks = (num_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
scale_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(d_input, d_output, 2.0f, num_elements);
- Don't mix raw `cudaMalloc`/`cudaMemcpy` with application logic; wrap them in RAII classes or helper functions.
- Separate `.cu` kernel files from `.cpp` host logic files, or at minimum group host and device code into clearly labeled sections.

// Good: RAII wrapper hides allocation/deallocation
template <typename T>
class DeviceBuffer {
    T *ptr_ = nullptr;
    size_t size_ = 0;

public:
    explicit DeviceBuffer(size_t count) : size_(count) {
        check_cuda(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~DeviceBuffer() { cudaFree(ptr_); }  // cudaFree(nullptr) is a no-op

    // Non-copyable: exactly one owner per device allocation
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Movable: the moved-from buffer is left empty
    DeviceBuffer(DeviceBuffer&& o) noexcept : ptr_(o.ptr_), size_(o.size_) {
        o.ptr_ = nullptr;
        o.size_ = 0;
    }

    T *get() { return ptr_; }
    const T *get() const { return ptr_; }
    size_t size() const { return size_; }

    void copy_from_host(const T *host_data) {
        check_cuda(cudaMemcpy(ptr_, host_data, size_ * sizeof(T), cudaMemcpyHostToDevice));
    }
    void copy_to_host(T *host_data) const {
        check_cuda(cudaMemcpy(host_data, ptr_, size_ * sizeof(T), cudaMemcpyDeviceToHost));
    }
};
- Check every CUDA API call with a `check_cuda` macro or inline function, not raw `if` blocks after every call.
- After kernel launches, check `cudaGetLastError()` + `cudaDeviceSynchronize()` during development.

// Good: concise, catches file/line info
inline void check_cuda(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s:%d: %s\n",
                file, line, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
// The macro shares the function's name; it is not expanded recursively,
// so check_cuda(x) calls the three-argument function defined above.
#define check_cuda(err) check_cuda((err), __FILE__, __LINE__)

// Usage
check_cuda(cudaMalloc(&d_ptr, size));
my_kernel<<<grid, block>>>(d_ptr, n);
check_cuda(cudaGetLastError());
check_cuda(cudaDeviceSynchronize());
- Use an early-return bounds check instead of wrapping the whole kernel body in an `if`.
- Name indices by meaning: `row`, `col`, `depth`, not `x`, `y`, `z`.

// Bad: index computed inline, entire body wrapped
__global__ void process(float *data, int width, int height) {
    if (blockIdx.x * blockDim.x + threadIdx.x < width &&
        blockIdx.y * blockDim.y + threadIdx.y < height) {
        int idx = (blockIdx.y * blockDim.y + threadIdx.y) * width +
                  (blockIdx.x * blockDim.x + threadIdx.x);
        data[idx] = data[idx] * 2.0f;
    }
}

// Good: named indices, early return
__global__ void process(float *data, int width, int height) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    const int idx = row * width + col;
    data[idx] = data[idx] * 2.0f;
}
- Name shared memory by its role: `shared_tile` not `smem` or `s`.
- ... (`extern __shared__`).

// Good: clear phases, descriptive names
// Assumes N is a multiple of TILE_SIZE and the kernel is launched with
// TILE_SIZE x TILE_SIZE thread blocks; there are no bounds checks.
constexpr int TILE_SIZE = 16;  // example value, not from the original

__global__ void tiled_matmul_kernel(const float *A, const float *B,
                                    float *C, int N) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    const int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    const int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float accumulator = 0.0f;

    for (int tile_idx = 0; tile_idx < N / TILE_SIZE; ++tile_idx) {
        // Phase 1: Load tiles from global memory
        tile_A[threadIdx.y][threadIdx.x] = A[row * N + tile_idx * TILE_SIZE + threadIdx.x];
        tile_B[threadIdx.y][threadIdx.x] = B[(tile_idx * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // all threads have loaded their tile elements

        // Phase 2: Compute partial dot product from tiles
        for (int k = 0; k < TILE_SIZE; ++k) {
            accumulator += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }
        __syncthreads();  // all threads are done reading before the next load
    }

    C[row * N + col] = accumulator;
}
- Extract complex kernel logic into `__device__` helper functions.
- Use `__forceinline__ __device__` for small helpers that you want inlined without relying on compiler heuristics.

// Good: kernel reads like pseudocode, details in helpers
__forceinline__ __device__
float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // lane 0 holds the warp's sum
}

// Assumes a 1D block whose size is a multiple of warpSize, so every warp is
// full and the full mask above is valid. Only thread 0 ends up with the
// complete block sum.
__forceinline__ __device__
float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // one slot per warp (max 1024 threads / 32)
    const int lane = threadIdx.x % warpSize;
    const int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) warp_sums[warp_id] = val;
    __syncthreads();  // partial sums are written to shared memory

    val = (threadIdx.x < blockDim.x / warpSize) ? warp_sums[lane] : 0.0f;
    if (warp_id == 0) val = warp_reduce_sum(val);
    return val;
}

// *output must be zero-initialized before the launch.
__global__ void reduce_sum_kernel(const float *input, float *output, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float val = (idx < n) ? input[idx] : 0.0f;
    const float block_sum = block_reduce_sum(val);
    if (threadIdx.x == 0) atomicAdd(output, block_sum);
}
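To see what the shuffle loop in `warp_reduce_sum` computes, it can help to replay it on the host. The following is a plain C++ simulation, not CUDA code and not part of the original guide: `kWarpSize` and `simulate_warp_reduce_sum` are names assumed for the sketch, and a warp size of 32 is assumed. It models `__shfl_down_sync` as "read the value held by lane + offset before this step", with lanes whose source is out of range reading their own value.

```cpp
#include <array>

constexpr int kWarpSize = 32;  // assumption: 32 lanes per warp

// Host-side replay of the shuffle-down reduction. lanes[i] plays the role of
// `val` in lane i. After the last step, lane 0 holds the sum of all lanes;
// the other lanes hold partial values that the device code also ignores.
float simulate_warp_reduce_sum(std::array<float, kWarpSize> lanes) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        // All lanes read values from before this step, as the hardware does.
        const std::array<float, kWarpSize> previous = lanes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            const int source = lane + offset;
            const float incoming =
                (source < kWarpSize) ? previous[source] : previous[lane];
            lanes[lane] = previous[lane] + incoming;
        }
    }
    return lanes[0];
}
```

The offsets 16, 8, 4, 2, 1 halve the number of lanes carrying useful partial sums at each step, which is why five steps are enough for 32 lanes.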
- Use `const` on kernel parameters for read-only device pointers; it documents intent and can enable compiler optimizations.
- Use `__restrict__` when pointers don't alias, but add a comment explaining the non-aliasing guarantee.
- When using unified memory (`cudaMallocManaged`), comment the expected access pattern (host-only init, device-only compute, etc.); the implicit page migration behavior is not obvious.

// Good: const + restrict with clear intent
__global__ void vector_add_kernel(
    const float *__restrict__ a,   // read-only, no alias with output
    const float *__restrict__ b,   // read-only, no alias with output
    float *__restrict__ output,    // write-only
    int num_elements)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_elements) return;
    output[idx] = a[idx] + b[idx];
}
- Put `__syncthreads()` on its own line, never buried inside a conditional branch that not all threads take; a barrier reached by only some threads is undefined behavior and hard to spot.
- Comment each `__syncthreads()` stating what invariant it establishes: "all threads have loaded their tile", "partial sums are written to shared memory".
- Prefer warp-level primitives (`__shfl_sync`, `__ballot_sync`) with explicit masks over `__syncthreads()` when the communication stays within a single warp.

// Bad: magic numbers, unclear intent
foo<<<(n+127)/128, 128, 0, stream>>>(d_ptr, n);

// Good: named, computed, self-documenting
// Passes input and output separately to match scale_kernel's four parameters;
// pass the same pointer twice to scale in place.
void launch_scale_kernel(const float *d_input, float *d_output, float factor,
                         int n, cudaStream_t stream) {
    constexpr int BLOCK_SIZE = 256;
    const int grid_size = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scale_kernel<<<grid_size, BLOCK_SIZE, 0, stream>>>(d_input, d_output, factor, n);
    check_cuda(cudaGetLastError());
}
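The grid-size expression is a ceiling division, and it can be checked on the host without a GPU. A small sketch, assuming a hypothetical helper name `blocks_for` that is not part of the original guide:

```cpp
// Hypothetical helper: number of blocks needed so that
// blocks * threads_per_block >= num_elements (ceiling division).
constexpr int blocks_for(int num_elements, int threads_per_block) {
    return (num_elements + threads_per_block - 1) / threads_per_block;
}

// Compile-time checks of the rounding behavior.
static_assert(blocks_for(256, 256) == 1, "an exact multiple needs no extra block");
static_assert(blocks_for(257, 256) == 2, "one leftover element needs one more block");
static_assert(blocks_for(1000, 256) == 4, "1000 elements fit in 4 blocks of 256");
```

Because the last block is usually only partly used, the kernel still needs its own bounds check on the element index.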
- Name streams by purpose: `compute_stream`, `transfer_stream`, not `s1`, `s2`.

// Good: dependency graph documented, streams named by purpose
// Dependency graph:
//   upload (transfer_stream) --> compute (compute_stream) --> download (transfer_stream)
//   Event 'upload_done' gates compute start.
//   Event 'compute_done' gates download start.
cudaStream_t transfer_stream, compute_stream;
cudaEvent_t upload_done, compute_done;
cudaStreamCreate(&transfer_stream);
cudaStreamCreate(&compute_stream);
cudaEventCreate(&upload_done);
cudaEventCreate(&compute_done);

// Stage 1: async upload
// (h_input and h_output should be pinned host memory for the copies to overlap)
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, transfer_stream);
cudaEventRecord(upload_done, transfer_stream);

// Stage 2: compute waits for upload
cudaStreamWaitEvent(compute_stream, upload_done);
process_kernel<<<grid, block, 0, compute_stream>>>(d_input, d_output, n);
cudaEventRecord(compute_done, compute_stream);

// Stage 3: download waits for compute
cudaStreamWaitEvent(transfer_stream, compute_done);
cudaMemcpyAsync(h_output, d_output, size, cudaMemcpyDeviceToHost, transfer_stream);

// Sync before host reads the result
cudaStreamSynchronize(transfer_stream);
npx claudepluginhub crazyguitar/cppcheatsheet --plugin cppcheatsheet