Abstract:
Large language models (LLMs) often concentrate their attention on a small set of tokens—referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. Although these tokens often lack inherent semantic meaning, their presence is critical for model performance, particularly under model compression and KV-caching. Yet, the function, semantic role, and origin of attention sinks—especially those beyond the first token—remain poorly understood.
In this talk, I’ll present a comprehensive investigation revealing that attention sinks catch a sequence of tokens, tag them with a shared perturbation, and release them back into the residual stream, where they are later retrieved based on the tags they carry. Probing experiments show that these tags encode semantically meaningful information, such as the truth value of a statement.
This mechanism persists in models with query-key normalization—where prompt-dependent, non-BOS sinks have become more common—and in DeepSeek-distilled models, where it spans more heads and accounts for a greater share of the variance in the embeddings. To support future theoretical work, we introduce a minimal task that is solvable via the catch, tag, release mechanism, and in which the mechanism naturally emerges through training.
Bio:
Vardan Papyan is an Assistant Professor in the Department of Mathematics at the University of Toronto, cross-appointed with the Department of Computer Science. He completed his postdoctoral studies at the Department of Statistics at Stanford University, under the guidance of David Donoho, and his PhD at the Department of Computer Science at the Technion – Israel Institute of Technology, under the supervision of Michael Elad.