Paged Attention

Paged Attention is a memory management technique that reduces GPU memory waste during LLM inference by partitioning the key-value (KV) cache into small, fixed-size, non-contiguous blocks called pages. Because pages are allocated on demand as sequences grow and recycled when they finish, this approach minimizes memory fragmentation, lets sequence lengths and batch sizes vary dynamically, and increases throughput and GPU utilization.
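
The central bookkeeping structure is a per-sequence block table that maps logical token positions to physical cache blocks drawn from a shared pool. The sketch below illustrates that idea in minimal Python; it is not vLLM's actual implementation, and the names (`BlockAllocator`, `SequenceBlockTable`) and the block size of 16 tokens are illustrative assumptions.

```python
# Minimal sketch of paged KV-cache bookkeeping (hypothetical names).
# Logical token positions map to fixed-size physical blocks that need
# not be contiguous in GPU memory.

class BlockAllocator:
    """Hands out free physical block IDs from a fixed-size pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class SequenceBlockTable:
    """Maps a sequence's logical token index to (physical_block, offset)."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.blocks: list[int] = []  # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        # Allocate a new physical block only when the last one is full,
        # so at most block_size - 1 slots are wasted per sequence.
        if self.num_tokens % self.block_size == 0:
            self.blocks.append(self.allocator.allocate())
        block = self.blocks[self.num_tokens // self.block_size]
        offset = self.num_tokens % self.block_size
        self.num_tokens += 1
        return block, offset  # where this token's K/V vectors would be stored

    def release(self) -> None:
        for block_id in self.blocks:
            self.allocator.free(block_id)
        self.blocks.clear()
        self.num_tokens = 0


# Usage: two sequences share one pool; their blocks are non-contiguous.
pool = BlockAllocator(num_blocks=8)
seq_a, seq_b = SequenceBlockTable(pool), SequenceBlockTable(pool)
for _ in range(20):
    seq_a.append_token()   # 20 tokens -> 2 blocks
seq_b.append_token()       # interleaves with seq_a's blocks
seq_a.release()            # freed blocks become reusable by seq_b
```

Because blocks are fixed-size and recycled through a shared free list, external fragmentation disappears entirely, and internal waste is bounded by one partially filled block per sequence.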