KV Cache, FlashAttention, and Attention Variants

KV Cache and Attention

Let’s talk about the KV cache: why it exists, why it’s usually disabled during training, how it relates to FlashAttention, and where attention variants such as multi-head (MHA), multi-query (MQA), and grouped-query attention (GQA) fit into the picture.