Kernel optimization
Kernel optimization is about making GPU kernels run faster and more efficiently by improving how they use compute, memory bandwidth, and on-chip resources. For LLM inference, this often means reducing data movement through global memory, increasing hardware utilization, and mapping workloads more precisely onto the GPU's execution and memory hierarchy.
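To make "reducing memory movement" concrete, here is a back-of-envelope sketch (illustrative assumptions, not measured values) of how fusing two elementwise kernels into one cuts global-memory traffic, since the intermediate result never round-trips through DRAM:

```python
# Back-of-envelope model of why kernel fusion reduces memory traffic.
# Element count, dtype, and the fused op are illustrative assumptions.

def elementwise_bytes(n, n_inputs, n_outputs, dtype_bytes=2):
    """Bytes moved through global memory by one elementwise kernel
    over n elements (fp16 by default)."""
    return n * (n_inputs + n_outputs) * dtype_bytes

n = 1 << 20  # 1M elements

# Unfused: y = relu(x + b) launched as two kernels.
unfused = (
    elementwise_bytes(n, n_inputs=2, n_outputs=1)    # add: read x, b; write tmp
    + elementwise_bytes(n, n_inputs=1, n_outputs=1)  # relu: read tmp; write y
)

# Fused: one kernel reads x and b once and writes y once;
# the intermediate stays in registers.
fused = elementwise_bytes(n, n_inputs=2, n_outputs=1)

print(f"unfused: {unfused} B, fused: {fused} B, "
      f"traffic ratio: {unfused / fused:.2f}x")  # → traffic ratio: 1.67x
```

Since elementwise ops are memory-bound, that ~1.67x reduction in bytes moved translates almost directly into speedup, which is why fusion is a first-line optimization in most of the tools covered below.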
📄️ Kernel optimization for LLM Inference
Kernel optimization for LLM inference improves GPU utilization and performance by writing or generating optimized kernels tailored to the compute patterns of LLMs.
📄️ GPU architecture fundamentals
Understand GPU architecture fundamentals for kernel optimization, including threads, warps, streaming multiprocessors, memory hierarchy, and tensor cores.
📄️ Choosing the right kernel optimization tool
Compare the main tools for kernel optimization in LLM inference, from cuBLAS and cuDNN to TVM, XLA, Triton, custom CUDA kernels, Mojo, and MAX.
📄️ FlashAttention
FlashAttention is a fast, memory-efficient attention algorithm for Transformers that accelerates LLM training and inference and enables longer context windows.