Kernel optimization
Kernel optimization is about making GPU kernels run faster and more efficiently by improving how they use compute, memory bandwidth, and on-chip resources. For LLM inference, this often means reducing data movement through global memory, increasing hardware utilization, and mapping workloads more precisely onto the GPU's execution and memory hierarchy.
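To make "reducing memory movement" concrete, here is a back-of-envelope sketch (illustrative assumptions, not measured values) of how fusing two elementwise kernels into one cuts global-memory traffic, since the intermediate result never round-trips through DRAM:

```python
# Back-of-envelope model of why kernel fusion reduces memory traffic.
# Element count, dtype, and the fused op are illustrative assumptions.

def elementwise_bytes(n, n_inputs, n_outputs, dtype_bytes=2):
    """Bytes moved through global memory by one elementwise kernel
    over n elements (fp16 by default)."""
    return n * (n_inputs + n_outputs) * dtype_bytes

n = 1 << 20  # 1M elements

# Unfused: y = relu(x + b) launched as two kernels.
unfused = (
    elementwise_bytes(n, n_inputs=2, n_outputs=1)    # add: read x, b; write tmp
    + elementwise_bytes(n, n_inputs=1, n_outputs=1)  # relu: read tmp; write y
)

# Fused: one kernel reads x and b once and writes y once;
# the intermediate stays in registers.
fused = elementwise_bytes(n, n_inputs=2, n_outputs=1)

print(f"unfused: {unfused} B, fused: {fused} B, "
      f"traffic ratio: {unfused / fused:.2f}x")  # → traffic ratio: 1.67x
```

Since elementwise ops are memory-bound, that ~1.67x reduction in bytes moved translates almost directly into speedup, which is why fusion is a first-line optimization in most of the tools covered below.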
📄️ Kernel optimization for LLM Inference
Kernel optimization for LLM inference improves GPU utilization and performance by writing or generating optimized kernels tailored to the compute patterns of LLMs.
📄️ GPU architecture fundamentals
Understand GPU architecture fundamentals for kernel optimization, including threads, warps, streaming multiprocessors, memory hierarchy, and tensor cores.
📄️ Choosing the right kernel optimization tool
Compare the main tools for kernel optimization in LLM inference, from cuBLAS and cuDNN to TVM, XLA, Triton, custom CUDA kernels, Mojo, and MAX.
📄️ FlashAttention
FlashAttention is a fast, memory-efficient attention algorithm for Transformers that accelerates LLM training and inference and enables longer context windows.