PyTorch & CUDA ML Operations Cheatsheet: The Complete Reference

PyTorch is one of the most widely used frameworks for deep learning research and production. Getting good performance from it requires understanding tensor shapes and memory layout, minimizing data movement between CPU and GPU, and making effective use of NVIDIA CUDA acceleration.

This reference sheet covers tensor manipulations, device management, automatic mixed precision, and CUDA performance debugging.

- **Tensor Operations**: Initialize, reshape, and broadcast tensors while avoiding memory duplication.
- **CUDA Device Allocation**: Allocate tensors directly on GPU accelerators and handle asynchronous execution safely.
- **Mixed Precision**: Reduce memory bandwidth usage using PyTorch Automatic Mixed Precision (`torch.amp`).
- **Memory Profiling**: Monitor GPU memory allocations and prevent Out-Of-Memory (OOM) faults using `torch.cuda` diagnostic tools.


Tensor Initialization & Manipulation

Tensors are the multi-dimensional arrays at the heart of PyTorch. Knowing which operations return a view of existing memory and which allocate a copy is critical for speed.

import torch
import numpy as np

# 1. Device-aware initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros((3, 4), dtype=torch.float32, device=device)

# 2. Convert from NumPy (shares the underlying memory buffer; CPU only)
np_array = np.ones((5, 5))  # float64 by default, so the tensor is torch.float64
tensor_from_np = torch.from_numpy(np_array)  # In-place changes to either are visible in both

# 3. Reshaping and Dimension Manipulation
y = torch.randn(2, 3, 4)

# Reshape without copying data (returns a view; requires a compatible memory layout)
y_view = y.view(6, 4)

# Permute dimensions (returns a non-contiguous view with reordered strides)
y_permuted = y.permute(2, 0, 1)  # Shape becomes (4, 2, 3)

# Add/remove singleton dimensions
z = torch.randn(3, 1, 4)
z_squeezed = z.squeeze(1)   # Shape (3, 4)
z_unsqueezed = z.unsqueeze(0) # Shape (1, 3, 1, 4)
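
Two details are easy to miss here: `.view()` fails on a tensor whose strides do not allow the new shape, and broadcasting can be done without duplicating memory. The following is a minimal sketch of both; the names `flat`, `row`, and `expanded` are illustrative and not part of the original examples.

```python
import torch

y = torch.randn(2, 3, 4)
y_permuted = y.permute(2, 0, 1)  # shape (4, 2, 3), non-contiguous

# .view() cannot merge the first two dimensions of the permuted tensor,
# because their strides no longer describe one contiguous run of memory.
try:
    y_permuted.view(8, 3)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape() returns a view when it can and silently copies when it cannot.
flat = y_permuted.reshape(8, 3)
copied = flat.data_ptr() != y.data_ptr()

# .expand() broadcasts without allocating: the new dimension has stride 0,
# so every "row" points at the same four elements.
row = torch.arange(4.0)
expanded = row.expand(3, 4)
shares_memory = expanded.data_ptr() == row.data_ptr()

# Arithmetic broadcasting applies the same rule implicitly.
total = torch.zeros(3, 4) + row
```

If a view is required after `permute`, calling `.contiguous()` first makes the copy explicit rather than leaving it to `.reshape()`.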

Managing CUDA Devices

Moving tensors between CPU and GPU involves communication overhead. Keep data transfers to a minimum and allocate directly on the target device whenever possible.

# Check for CUDA availability
cuda_available = torch.cuda.is_available()
device_count = torch.cuda.device_count()

if cuda_available:
    # Set default active GPU device
    torch.cuda.set_device(0)
    current_device = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(current_device)
    print(f"Active GPU: {device_name} (ID: {current_device})")

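# Pin (page-lock) CPU memory so the host-to-device copy can run asynchronously.
# Both calls require a CUDA device, so keep them behind the availability check.
if cuda_available:
    cpu_tensor = torch.randn(1000, 1000).pin_memory()
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)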
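
Because CUDA kernels and non-blocking copies are queued asynchronously, host-side timing or error handling is only meaningful after a synchronization point. The sketch below shows one way to handle that while still running on a CPU-only machine; `transfer` and `timed_matmul` are hypothetical helper names introduced for illustration.

```python
import time
import torch

def transfer(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU tensor to `device`, using pinned memory when the target is a GPU."""
    if device.type == "cuda":
        # Pinned source memory lets the copy overlap with other host work.
        return batch.pin_memory().to(device, non_blocking=True)
    return batch.to(device)

def timed_matmul(a: torch.Tensor, b: torch.Tensor):
    """Multiply two matrices and return (result, seconds), synchronizing on GPU."""
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for pending copies and kernels
    start = time.perf_counter()
    out = a @ b
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for the matmul kernel itself
    return out, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = transfer(torch.randn(256, 256), device)
b = transfer(torch.randn(256, 256), device)
out, elapsed = timed_matmul(a, b)
print(f"matmul on {device.type}: {elapsed * 1e3:.3f} ms")
```

Without the second `torch.cuda.synchronize()`, the timer on a GPU would mostly measure the cost of launching the kernel, not of running it.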

Mixed Precision & Gradient Scaling

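Automatic Mixed Precision (AMP) runs eligible operations such as matrix multiplications and convolutions in half precision (FP16 or BF16), which speeds up execution and reduces GPU memory use, while keeping numerically sensitive operations (for example reductions and loss computation) and the model weights in full precision (FP32) to preserve accuracy.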

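import torch.nn as nn
import torch.optim as optim

# MyDeepLearningModel, criterion, and dataloader are placeholders for your own
# model class, loss function, and torch.utils.data.DataLoader.
model = MyDeepLearningModel().to("cuda")
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Initialize the gradient scaler to prevent underflow in FP16 gradients.
# torch.cuda.amp.GradScaler() is deprecated in recent releases in favor of this form.
scaler = torch.amp.GradScaler("cuda")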

for inputs, targets in dataloader:
    inputs, targets = inputs.to("cuda"), targets.to("cuda")
    optimizer.zero_grad()
    
    # Forward pass under autocast environment
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        
    # Backward pass with scaled loss
    scaler.scale(loss).backward()
    
    # Unscale gradients, then step the optimizer
    # (the step is skipped if the gradients contain inf or NaN values)
    scaler.step(optimizer)
    
    # Update scaler state for next iteration
    scaler.update()

Debugging GPU Memory & OOM Faults

Out-Of-Memory errors commonly halt training. These diagnostic commands help monitor allocations and safely reclaim unused blocks.

# Release unused cached blocks from PyTorch's caching allocator back to the driver.
# This does not free memory still referenced by live tensors.
torch.cuda.empty_cache()

# Monitor allocations
allocated_memory = torch.cuda.memory_allocated(device=None) # In bytes
reserved_memory = torch.cuda.memory_reserved(device=None)   # In bytes

print(f"Allocated: {allocated_memory / 1e6:.2f} MB")
print(f"Reserved (Cached): {reserved_memory / 1e6:.2f} MB")

# Generate a detailed structural report of current memory footprint
print(torch.cuda.memory_summary(device=None, abbreviated=False))
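
When hunting an OOM, the peak allocation is often more informative than the current one. The sketch below wraps the counters in a small helper that also behaves sensibly on machines without a GPU; `gpu_memory_report` is a hypothetical name, and the 4 MiB scratch tensor is only there to make the peak counter move.

```python
import torch

def gpu_memory_report(device=None) -> dict:
    """Return allocated, reserved, and peak-allocated memory in MiB.

    Returns an empty dict when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return {}
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }

if torch.cuda.is_available():
    # Start a fresh measurement window for the peak counter.
    torch.cuda.reset_peak_memory_stats()
    scratch = torch.empty(1024, 1024, device="cuda")  # 4 MiB of float32
    del scratch                # drops the tensor; the block stays cached
    torch.cuda.empty_cache()   # returns the cached block to the driver

report = gpu_memory_report()
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Calling `reset_peak_memory_stats()` before a suspect region and reading the peak afterwards narrows down which step of a training iteration drives the maximum footprint.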

Advanced Model Distributed Operations

When scaling training across multiple GPUs, use DistributedDataParallel (DDP). It runs one process per GPU, keeps a full model replica in each process, and averages gradients across processes during the backward pass.

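import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(rank, world_size):
    # Initialize the process group (one process per GPU)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        world_size=world_size,
        rank=rank
    )
    # Bind this process to its own GPU
    # (assumes a single node, where rank equals the local GPU index)
    torch.cuda.set_device(rank)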

def cleanup_distributed():
    dist.destroy_process_group()

# Wrap your model in DDP for automatic gradient synchronization
# (move it to this rank's GPU first)
# ddp_model = DDP(model.to(rank), device_ids=[rank])
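
To try the DDP wrapper without multiple GPUs, the same calls can be exercised in a single CPU process. This is a sketch under stated assumptions, not a production setup: it swaps the `nccl` backend for `gloo`, uses a world size of 1, and rendezvouses through a temporary file (Unix-style path) instead of a TCP port. The function name `run_single_process_ddp` is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run_single_process_ddp():
    """Initialize a one-rank gloo group, run one backward pass, and clean up."""
    store_path = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        backend="gloo",                      # CPU-capable backend (assumption: built in)
        init_method=f"file://{store_path}",  # file-based rendezvous, no port needed
        rank=0,
        world_size=1,
    )
    try:
        model = nn.Linear(8, 2)
        ddp_model = DDP(model)  # no device_ids for a CPU module
        ddp_model(torch.randn(4, 8)).sum().backward()
        # Gradients land on the wrapped module's parameters after the all-reduce.
        return tuple(model.weight.grad.shape)
    finally:
        dist.destroy_process_group()

grad_shape = run_single_process_ddp()
print(grad_shape)
```

With more than one rank, each process would run the same function with its own `rank` value, typically launched by `torchrun` or `torch.multiprocessing.spawn`.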