Statistics and Data Science Seminar


  • Precise high-dimensional asymptotics for AdaBoost via max-margins & min-norm interpolants

    On November 19, 2021, from 11:00 am to 12:00 pm
    Pragya Sur (Harvard University)
    E18-304

    Abstract: This talk will introduce a precise high-dimensional asymptotic theory for AdaBoost on separable data, taking both statistical and computational perspectives. We will consider the common modern setting where the number of features p and the sample size n are both large and comparable, and in particular, look at scenarios where the data is asymptotically separable. Under a class of statistical models, we will provide an (asymptotically) exact analysis of the max-min-L1-margin and the min-L1-norm interpolant. In turn, this will characterize the generalization error of AdaBoost, when the algorithm interpolates the training data and maximizes an empirical L1 margin. On the computational front, we will provide a sharp analysis of the stopping time when boosting approximately maximizes the empirical L1 margin. Our theory provides several insights into properties of AdaBoost; for instance, the larger the dimensionality ratio p/n, the faster the optimization reaches interpolation. Our statistical and computational arguments can handle (1) finite-rank spiked covariance models for the feature distribution and (2) variants of AdaBoost corresponding to general Lq-geometry, for q in [1,2]. This is based on joint work with Tengyuan Liang.
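
    For reference, the two quantities named in the abstract can be written in a standard form (notation added here; it may differ from the speaker's): given training data (x_i, y_i), i = 1, ..., n, with labels y_i in {-1, +1},

    \[
    \kappa_n \;=\; \max_{\|\theta\|_1 \le 1}\; \min_{1 \le i \le n} y_i x_i^\top \theta
    \quad \text{(max-L1-margin)},
    \qquad
    \hat\theta \;\in\; \arg\min\bigl\{ \|\theta\|_1 \,:\, y_i x_i^\top \theta \ge 1 \ \text{for all } i \bigr\}
    \quad \text{(min-L1-norm interpolant)}.
    \]

    On separable data the two are linked by a standard rescaling argument: any minimizer satisfies \|\hat\theta\|_1 = 1/\kappa_n, which is why an exact analysis of one quantity delivers the other.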

    Bio: Pragya Sur is an Assistant Professor in the Statistics Department at Harvard University. Her research broadly spans high-dimensional statistics, statistical machine learning, and robust inference and prediction for multi-study/multi-environment heterogeneous data. She is simultaneously interested in applications of large-scale statistical methods to computational neuroscience and genetics. Her research is currently supported by a William F. Milton Fund Award and an NSF DMS award. Previously, she was a postdoctoral fellow at the Center for Research on Computation and Society, Harvard John A. Paulson School of Engineering and Applied Sciences. She received a Ph.D. in Statistics from Stanford University in 2019, where she was awarded the Theodore W. Anderson Theory of Statistics Dissertation Award and a Ric Weiland Graduate Fellowship in the Humanities and Sciences.

  • Causal Representation Learning – A Proposal

    On April 16, 2022, from 11:00 am to 12:00 pm
    Caroline Uhler, MIT
    E18-304

    Abstract: The development of CRISPR-based assays and small molecule screens holds the promise of engineering precise cell state transitions to move cells from one cell type to another or from a diseased state to a healthy state. The main bottleneck is the huge space of possible perturbations/interventions, where even with the breathtaking technological advances in single-cell biology it will never be possible to experimentally perturb all combinations of thousands of genes or compounds. This important biological problem calls for a framework that can integrate data from different modalities to identify causal representations, predict the effect of unseen interventions, and identify the optimal interventions to induce precise cell state transition. Traditional representation learning methods, although often highly successful in predictive tasks, do not generally elucidate causal relationships. In this talk, we will present initial ideas towards building a statistical and computational framework for causal representation learning and its application towards optimal intervention design.

    Bio: Caroline Uhler is the Henry L. and Grace Doherty Associate Professor in EECS and IDSS, a member of SDSC, LIDS, the ORC, and Machine Learning at MIT, and a core member of the Broad Institute, where she co-directs the Eric and Wendy Schmidt Center. She is an elected member of the International Statistical Institute and the recipient of a Simons Investigator Award, a Sloan Research Fellowship, an NSF CAREER Award, a Sofja Kovalevskaja Award from the Humboldt Foundation, and a START Award from the Austrian Science Fund.

  • Learning with Random Features and Kernels: Sharp Asymptotics and Universality Laws

    On April 22, 2022, from 11:00 am to 12:00 pm
    Yue M. Lu, Harvard University
    E18-304

    Abstract: Many new random matrix ensembles arise in learning and modern signal processing. As shown in recent studies, the spectral properties of these matrices help answer crucial questions regarding the training and generalization performance of neural networks, and the fundamental limits of high-dimensional signal recovery. As a result, there has been growing interest in precisely understanding the spectra and other asymptotic properties of these matrices. Unlike their classical counterparts, these new random matrices are often highly structured and are the result of nonlinear transformations. This combination of structure and nonlinearity leads to substantial technical challenges when applying existing tools from random matrix theory to these new random matrix ensembles. In this talk, we will consider learning by random feature models and the related problem of kernel ridge regression. In each case, a nonlinear random matrix plays a prominent role. We provide an exact characterization of the asymptotic training and generalization errors of these models. These results reveal the important roles played by the regularization, the loss function and the activation function in the mitigation of the double descent phenomenon in learning. The asymptotic analysis is made possible by a general universality theorem, which establishes the asymptotic equivalence between the nonlinear random matrices and a surrogate linear random matrix ensemble that is much easier to work with.
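
    As a concrete companion to the abstract, the sketch below fits a random feature model by ridge regression on synthetic data. It is a minimal NumPy illustration of the setup being analyzed, not code from the speaker; the feature dimension, tanh activation, and regularization strength are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data: y = <x, beta> + noise.
    n, d, p = 300, 50, 400                       # samples, input dim, random features
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    beta = rng.standard_normal(d)
    y = X @ beta + 0.1 * rng.standard_normal(n)

    # Random feature map z(x) = tanh(W x) with fixed random weights W.
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    Z = np.tanh(X @ W.T)                         # n x p nonlinear random feature matrix

    # Ridge regression in feature space: a_hat = (Z'Z + lam*n*I)^{-1} Z'y.
    lam = 1e-2
    a_hat = np.linalg.solve(Z.T @ Z + lam * n * np.eye(p), Z.T @ y)

    # Training error, and test error on fresh data from the same model.
    X_test = rng.standard_normal((n, d)) / np.sqrt(d)
    y_test = X_test @ beta + 0.1 * rng.standard_normal(n)
    Z_test = np.tanh(X_test @ W.T)
    print("train MSE:", np.mean((Z @ a_hat - y) ** 2))
    print("test MSE:", np.mean((Z_test @ a_hat - y_test) ** 2))

    The training and test errors printed here are the kinds of quantities whose behavior the talk characterizes exactly in the proportional regime where n, d, and p grow together.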

    Bio: Yue M. Lu attended the University of Illinois at Urbana-Champaign, where he received the M.Sc. degree in mathematics and the Ph.D. degree in electrical engineering, both in 2007. After his postdoctoral training at the Audiovisual Communications Laboratory at Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, he joined Harvard University, where he is currently Gordon McKay Professor of Electrical Engineering and of Applied Mathematics at the John A. Paulson School of Engineering and Applied Sciences. He is also fortunate to have held visiting appointments at Duke University in 2016 and at the École Normale Supérieure (ENS) in 2019. His research interests include theoretical and algorithmic aspects of high-dimensional signal and information processing. He is an IEEE Signal Processing Society Distinguished Lecturer and a recipient of the ECE Illinois Young Alumni Achievement Award.

    A full schedule for Spring 2022 Stochastics and Statistics Seminars can be found here: https://stat.mit.edu/seminars/upcoming/

  • Stein’s method for multivariate continuous distributions and applications

    On September 11, 2020, from 11:00 am to 12:00 pm
    Gesine Reinert, University of Oxford
    online

    Abstract: Stein’s method is a key method for assessing distributional distance, mainly for one-dimensional distributions. In this talk we provide a general approach to Stein’s method for multivariate continuous distributions. Among the applications we consider is the Wasserstein distance between two continuous probability distributions under the assumption of existence of a Poincaré constant.

    This is joint work with Guillaume Mijoule (INRIA Paris) and Yvik Swan (Liège).
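
    As classical background (standard facts, not the new material of the talk): for a standard Gaussian target Z ~ N(0, I_d) and sufficiently smooth f, the multivariate Stein identity reads

    \[
    \mathbb{E}\bigl[\Delta f(Z) - Z^\top \nabla f(Z)\bigr] = 0,
    \]

    and a random vector X satisfies a Poincaré inequality with constant C_P if \operatorname{Var}(f(X)) \le C_P\, \mathbb{E}\|\nabla f(X)\|^2 for all such f. Characterizations of this kind are the starting point for Stein-type bounds on distances such as the Wasserstein distance, with C_P the Poincaré constant referred to in the abstract.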

    Bio: Gesine Reinert is a Research Professor in the Department of Statistics and at Keble College, both at the University of Oxford. She is also a Fellow of the Alan Turing Institute and a Fellow of the IMS. Her research is centered around Stein’s method and the analysis of networks and other complex structures. She currently chairs the Applied Probability Section of the Royal Statistical Society and is the vice-chair of the European Cooperation for Statistics of Network Data Science.

  • Influence maximization in stochastic and adversarial settings

    On November 4, 2016, from 11:00 am to 12:00 pm
    Po-Ling Loh (University of Pennsylvania)
    E18-304

    Abstract:
    We consider the problem of influence maximization in fixed networks, for both stochastic and adversarial contagion models. In the stochastic setting, nodes are infected in waves according to linear threshold or independent cascade models. We establish upper and lower bounds for the influence of a subset of nodes in the network, where the influence is defined as the expected number of infected nodes at the conclusion of the epidemic. We quantify the gap between our upper and lower bounds in the case of the linear threshold model and illustrate the gains of our upper bounds for independent cascade models in relation to existing results. Importantly, our lower bounds are monotonic and submodular, implying that a greedy algorithm for influence maximization is guaranteed to produce a maximizer within a 1-1/e factor of the truth. In the adversarial setting, an adversary is allowed to specify the edges through which contagion may spread, and the player chooses sets of nodes to infect in successive rounds. We establish upper and lower bounds on the pseudo-regret for possibly stochastic strategies of the adversary and player. This is joint work with Justin Khim and Varun Jog.
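
    To make the role of the monotone and submodular lower bounds concrete, here is a minimal greedy sketch of the kind of algorithm the (1 - 1/e) guarantee applies to. The function passed in as spread_fn is a hypothetical stand-in for any monotone submodular influence (lower) bound; this is an illustrative sketch, not code from the speakers.

    def greedy_seed_selection(nodes, spread_fn, budget):
        """Greedily add the node with the largest marginal gain in spread_fn.

        If spread_fn is monotone and submodular with spread_fn(set()) = 0, the
        greedy seed set is within a (1 - 1/e) factor of the best seed set of
        the same size (Nemhauser, Wolsey, and Fisher).
        """
        seeds = set()
        for _ in range(budget):
            gains = {v: spread_fn(seeds | {v}) - spread_fn(seeds)
                     for v in nodes if v not in seeds}
            seeds.add(max(gains, key=gains.get))
        return seeds

    # Toy usage with a simple monotone submodular function: neighborhood coverage.
    neighbors = {1: {1, 2, 3}, 2: {2, 3}, 3: {3, 4}, 4: {4, 5}}

    def coverage(S):
        return len(set().union(*(neighbors[v] for v in S))) if S else 0

    print(greedy_seed_selection(neighbors, coverage, budget=2))  # picks {1, 4}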

    Biography:
    Po-Ling Loh is an assistant professor in the ECE department at the UW-Madison, with a secondary appointment in statistics, and an affiliate of the Grainger Institute and Wisconsin Institute for Discovery. From 2014-2016, Po-Ling was an assistant professor in the statistics department at the Wharton School at the University of Pennsylvania. Po-Ling received an MS in computer science and a PhD in statistics from Berkeley in 2013 and 2014, and a BS in math with a minor in English from Caltech in 2009. She was the recipient of the 2014 Erich L. Lehmann Citation from the Berkeley statistics department for an outstanding PhD dissertation in theoretical statistics, and a best student paper award at the NIPS conference in 2012.

  • Quantile and Probability Curves without Crossing

    On February 29, 2008, from 11:00 am to 12:00 pm
    Victor Chernozhukov (MIT Econ)

    The most common approach to estimating conditional quantile curves is to fit a curve, typically linear, pointwise for each quantile. Linear functional forms, coupled with pointwise fitting, are used for a number of reasons including parsimony of the resulting approximations and good computational properties. The resulting fits, however, may not respect a logical monotonicity requirement — that the quantile curve be increasing as a function of probability. This paper studies the natural monotonization of these empirical curves induced by sampling from the estimated non-monotone model, and then taking the resulting conditional quantile curves that by construction are monotone in the probability. This construction of monotone quantile curves may be seen as a bootstrap and also as a monotonic rearrangement of the original non-monotone function. It is shown that the monotonized curves are closer to the true curves in finite samples, for any sample size. Under correct specification, the rearranged conditional quantile curves have the same asymptotic distribution as the original non-monotone curves. Under misspecification, however, the asymptotics of the rearranged curves may partially differ from the asymptotics of the original non-monotone curves. An analogous procedure is developed to monotonize the estimates of conditional distribution functions. The results are derived by establishing the compact (Hadamard) differentiability of the monotonized quantile and probability curves with respect to the original curves in discontinuous directions, tangentially to a set of continuous functions. In doing so, the compact differentiability of the rearrangement-related operators is established.
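
    In symbols, one standard way to write the construction described above (notation added here): given a possibly non-monotone estimate u -> \hat{Q}(u \mid x) of the conditional quantile function, sampling from the estimated model and reading off quantiles yields the rearranged curve

    \[
    \hat{Q}^{*}(u \mid x) \;=\; \inf\Bigl\{ q \;:\; \int_0^1 \mathbf{1}\bigl\{\hat{Q}(v \mid x) \le q\bigr\}\, dv \;\ge\; u \Bigr\},
    \]

    i.e. \hat{Q}^{*}(\cdot \mid x) is the quantile function of \hat{Q}(U \mid x) with U uniform on (0,1). It is monotone in u by construction and coincides with the original curve whenever that curve is already increasing.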

  • Fragility of Asymptotic Agreement under Bayesian Learning

    On March 21, 2008, from 11:00 am to 12:00 pm
    Daron Acemoglu (MIT Econ)

    Under the assumption that individuals know the conditional distributions of signals given the payoff-relevant parameters, existing results conclude that, as individuals observe infinitely many signals, their beliefs about the parameters will eventually merge. We first show that these results are fragile when individuals are uncertain about the signal distributions: given any such model, a vanishingly small individual uncertainty about the signal distributions can lead to a substantial (non-vanishing) amount of differences between the asymptotic beliefs. We then characterize the conditions under which a small amount of uncertainty leads only to a small amount of asymptotic disagreement. According to our characterization, this is the case if the uncertainty about the signal distributions is generated by a family with rapidly-varying tails (such as the normal or the exponential distributions). However, when this family has regularly-varying tails (such as the Pareto, the log-normal, and the t-distributions), a small amount of uncertainty leads to a substantial amount of asymptotic disagreement.

  • Exponential Error Bounds for Random Codes on the BSC

    On November 16, 2007, from 11:00 am to 12:00 pm
    David Forney (MIT LIDS)

    Abstract:

    One of Shannon’s earliest results was his determination of the capacity of the binary symmetric channel (BSC). Shannon went on to show that, with randomly chosen codes and optimal decoding, the probability of decoding error decreases exponentially for any transmission rate less than capacity. Much of the important early work of Shannon, Elias, Fano and Gallager was devoted to determining bounds on the corresponding error exponent. A later approach to this problem, pioneered by Csiszar and Korner, and now adopted in many modern information theory courses such as 6.441, determines error exponents using large-deviation theory. This approach has several advantages: (a) It can be remarkably simple and intuitive; (b) It gives correct error exponents, not just bounds; (c) It gives insight into how decoding errors typically occur. We illustrate this approach by developing the correct error exponents at all rates for both random codes and random linear codes on the BSC, assuming optimal decoding. We discuss how decoding errors typically occur. In particular, we show why the classical algebraic coding approach never had any hope of approaching channel capacity, even on the BSC. This talk is intended to be pitched at an elementary and tutorial level.
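
    As classical background for the quantities discussed above (standard formulas included for reference, not part of the talk's large-deviation development): a BSC with crossover probability p has capacity C = 1 - H_2(p), where H_2(p) = -p log_2 p - (1-p) log_2 (1-p), and Gallager's random-coding bound states that the ensemble-average error probability of a randomly chosen code of rate R under optimal decoding obeys P_e <= 2^{-n E_r(R)}, with

    \[
    E_r(R) \;=\; \max_{0 \le \rho \le 1} \Bigl[\, \rho \;-\; (1+\rho)\log_2\!\bigl(p^{1/(1+\rho)} + (1-p)^{1/(1+\rho)}\bigr) \;-\; \rho R \Bigr],
    \]

    which is positive for every rate R < C. The talk develops the exact exponents for random and random linear codes; the classical random-coding bound above is quoted here only as a reference point.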
