KV Cache, FlashAttention, and Attention Variants

Let’s talk about the KV cache: why it exists, why it’s usually disabled during training, how it connects to FlashAttention, and where attention variants like MHA, MQA, and GQA fit into the picture.
