Statistics and Data Science Seminar
-
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent
On October 3, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet our understanding of the mechanisms by which they acquire and extrapolate these capabilities remains limited. This talk presents a theoretical analysis of transformers trained via gradient descent on symbolic reasoning and state tracking tasks of increasing problem complexity. Our analysis reveals how multi-head attention coordinates to solve multiple subtasks in a single autoregressive pass, and how inherently sequential reasoning is bootstrapped through a recursive self-training curriculum. Our optimization-based guarantees demonstrate that even shallow multi-head transformers, with chain-of-thought, can be trained to effectively solve problems that would otherwise require deeper architectures.
Biography:
Dr. Yuejie Chi is the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science at Yale University, with a secondary appointment in Computer Science. She received her Ph.D. and M.A. from Princeton University, and her B.Eng. (Hon.) from Tsinghua University, all in Electrical Engineering. Her research interests lie in the theoretical and algorithmic foundations of data science, generative AI, reinforcement learning, and signal processing, motivated by applications in scientific and engineering domains. Among other honors, Dr. Chi received the Presidential Early Career Award for Scientists and Engineers (PECASE), the SIAM Activity Group on Imaging Science Best Paper Prize, the IEEE Signal Processing Society Young Author Best Paper Award, and the inaugural IEEE Signal Processing Society Early Career Technical Achievement Award for contributions to high-dimensional structured signal processing. She is an IEEE Fellow (Class of 2023) for contributions to statistical signal processing with low-dimensional structures.
-
Do Large Language Models (Really) Need Statistical Foundations?
On October 10, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
In this talk, we advocate for developing statistical foundations for large language models (LLMs). We begin by examining two key characteristics that necessitate statistical perspectives on LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the inherent complexity and black-box nature of Transformer architectures. To demonstrate how statistical insights can advance LLM development and applications, we present two examples. First, we demonstrate statistical inconsistencies and biases arising from the current approach to aligning LLMs with human preferences. We propose a regularization term for aligning LLMs that is both necessary and sufficient to ensure consistent alignment. Second, we introduce a novel statistical framework for analyzing the efficacy of watermarking schemes, with a focus on a watermarking scheme developed by OpenAI, for which we derive optimal detection rules that outperform existing ones. Time permitting, we will explore how statistical principles can inform rigorous evaluation of LLMs. Collectively, these findings demonstrate how statistical insights can effectively address several pressing challenges emerging from LLMs.
Biography:
Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department at the University of Pennsylvania. He is a co-director of the Penn Research in Machine Learning (PRiML) Center. Prior to joining Penn, he received his Ph.D. in statistics from Stanford University in 2016 and his bachelor’s degree in mathematics from Peking University in 2011. His research interests span the statistical foundations of generative AI, privacy-preserving machine learning, high-dimensional statistics, and optimization. He serves as an associate editor of the Journal of the American Statistical Association, Journal of Machine Learning Research, Annals of Applied Statistics, Harvard Data Science Review, Foundations and Trends in Statistics, Operations Research, and Journal of the Operations Research Society of China, and he is currently guest editing a special issue of Stat on Statistics for Large Language Models and Large Language Models for Statistics. His work has been recognized with several awards, including the Stanford Anderson Dissertation Award, an NSF CAREER Award, a Sloan Research Fellowship, the IMS Peter Hall Prize, the SIAM Early Career Prize in Data Science, the ASA Noether Early Career Award, the ICBS Frontiers of Science Award in Mathematics, an IMS Medallion Lectureship, and the Outstanding Young Talent Award in the 2025 China Annual Review of Mathematics. He is a Fellow of the IMS.
-
Hard-Constrained Neural Networks
On October 17, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
Incorporating prior knowledge and domain-specific input-output requirements, such as safety or stability, as hard constraints into neural networks is a key enabler for their deployment in high-stakes applications. However, existing methods often rely on soft penalties, which are insufficient, especially on out-of-distribution samples. In this talk, I will introduce hard-constrained neural networks (HardNet), a general framework for enforcing hard, input-dependent constraints by appending a differentiable enforcement layer to any neural network. This approach enables end-to-end training and, crucially, is proven to preserve the network’s universal approximation capabilities, ensuring model expressivity is not sacrificed. We demonstrate the versatility and effectiveness of HardNet across various applications: learning with piecewise constraints, learning optimization solvers with guaranteed feasibility, and optimizing control policies in safety-critical systems. This framework can be used even for problems where the constraints themselves are not fully known and must be learned from data in a parametric form, for which I will present two key applications: data-driven control with inherent Lyapunov stability and learning chaotic dynamical systems with guaranteed boundedness. Together, these results demonstrate a unified methodology for embedding formal constraints into deep learning, paving the way for more reliable AI.
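The pattern of appending a differentiable enforcement layer can be illustrated with a minimal sketch. This is an illustrative construction only, not the talk's actual HardNet layer: the toy one-parameter network, the box constraint, and the sigmoid squashing are assumptions made here for concreteness.

```python
import math

def sigmoid(z):
    # Smooth squashing function; differentiable everywhere.
    return 1.0 / (1.0 + math.exp(-z))

def enforcement_layer(raw, lb, ub):
    """Map an unconstrained output into the input-dependent box [lb, ub].

    Because the map is differentiable, it can be appended to any network
    and trained end-to-end, and the constraint holds by construction,
    even on out-of-distribution inputs where soft penalties tend to fail.
    """
    return lb + (ub - lb) * sigmoid(raw)

def constrained_net(x, w=1.7, b=-0.3):
    """Toy one-parameter 'network' followed by the enforcement layer.

    Illustrative constraint: for x >= 0, the output must lie in [0, x].
    """
    raw = w * x + b              # unconstrained prediction
    return enforcement_layer(raw, lb=0.0, ub=x)
```

No matter how the weights `w` and `b` move during training, `constrained_net(x)` stays inside `[0, x]`: feasibility by construction, which is the property the abstract contrasts with soft penalties.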
Bio:
Navid Azizan is the Alfred H. (1929) and Jean M. Hayes Assistant Professor at MIT, where he holds dual appointments in the Department of Mechanical Engineering (Control, Instrumentation & Robotics) and the Schwarzman College of Computing’s Institute for Data, Systems & Society (IDSS), and is a Principal Investigator in the Laboratory for Information & Decision Systems (LIDS). Previously, he held the Esther and Harold E. Edgerton (1927) Chair. His research interests broadly lie in machine learning, systems and control, mathematical optimization, and network science. His research lab focuses on various aspects of reliable intelligent systems, with an emphasis on principled learning and optimization algorithms applied to autonomy and sociotechnical systems. He obtained his PhD in Computing and Mathematical Sciences (CMS) from the California Institute of Technology (Caltech) in 2020, his MSc in electrical engineering from the University of Southern California in 2015, and his BSc in electrical engineering with a minor in physics from Sharif University of Technology in 2013. Prior to joining MIT, he completed a postdoc at Stanford University in 2021. Additionally, he was a research scientist intern at Google DeepMind in 2019. His work has been recognized by several awards, including research awards from Google, Amazon, MathWorks, and IBM, and Best Paper awards and nominations at conferences including ACM GreenMetrics and Learning for Dynamics & Control (L4DC). He was named to CDO Magazine’s list of Outstanding Academic Leaders in Data in 2023 and 2024, received the 2020 Information Theory and Applications (ITA) “Sun” (Gold) Graduation Award, and was named an Amazon Fellow in AI in 2017 and a PIMCO Fellow in Data Science in 2018. His mentorship has been recognized with the Frank E. Perkins Award for Excellence in Graduate Advising (MIT Institute Award) in 2025 and the UROP Outstanding Mentor Award in 2023.
Early in his academic journey, he was the first-place winner and a gold medalist at the 2008 National Physics Olympiad in Iran. He founded and co-organized the popular “Control Meets Learning” Virtual Seminar Series during the pandemic.
-
Learning to Price Electricity for Optimal Demand Response
On October 24, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
The time at which renewable (e.g., solar or wind) energy resources produce electricity cannot generally be controlled. In many settings, however, consumers have some flexibility in their energy consumption needs, and there is growing interest in demand-response programs that leverage this flexibility to shift energy consumption to better match renewable production, thus enabling more efficient utilization of these resources. We study optimal demand response in a setting where consumers use home energy management systems (HEMS) to autonomously adjust their electricity consumption. Our core assumption is that HEMS operationalize flexibility by querying the consumer for their preferences and computing the “indifference set” of all energy consumption profiles that can be used to satisfy these preferences. Then, given an indifference set, HEMS can respond to grid signals while guaranteeing user-defined comfort and functionality; e.g., if a consumer sets a temperature range, a HEMS can precool and preheat to align with peak renewable production, thus improving efficiency without sacrificing comfort. We show that while price-based mechanisms are not generally optimal for demand response, they become asymptotically optimal in large markets under a mean-field limit. Furthermore, we show that optimal dynamic prices can be efficiently computed in large markets by only querying HEMS about their planned consumption under different price signals. We leverage this result to build an online contextual pricing algorithm, and show that it enables a considerable reduction in peak system load in simulators calibrated to a number of major US cities.
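As a toy illustration of how a HEMS might respond to a price signal while staying inside an indifference set, here the set is simply "deliver a fixed total energy with a per-hour cap"; the function name, the greedy rule, and the parameters are assumptions made for illustration, not the speakers' mechanism.

```python
def hems_response(prices, total_energy, max_per_hour):
    """Toy HEMS: choose the cheapest consumption profile within the
    indifference set of all profiles that deliver `total_energy`
    with at most `max_per_hour` consumed in any single hour.

    Every profile in the set satisfies the consumer's stated
    preferences, so the HEMS is free to chase low-price (e.g.,
    renewable-heavy) hours without sacrificing comfort.
    """
    profile = [0.0] * len(prices)
    remaining = total_energy
    # Fill the cheapest hours first; optimal for this box-shaped set.
    for h in sorted(range(len(prices)), key=lambda h: prices[h]):
        use = min(max_per_hour, remaining)
        profile[h] = use
        remaining -= use
        if remaining <= 0:
            break
    return profile
```

With prices `[5, 1, 3, 2]`, a 3-unit energy requirement, and a 2-unit hourly cap, consumption shifts to the two cheapest hours, yielding the profile `[0, 2, 0, 1]`.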
Bio:
Stefan Wager is an associate professor of Operations, Information, and Technology at the Stanford Graduate School of Business, an associate professor of Statistics (by courtesy), and the Philip F. Maritz Faculty Scholar for 2025-26. His research lies at the intersection of causal inference, optimization, and statistical learning. He is particularly interested in developing new solutions to problems in statistics, economics and decision making that leverage recent advances in machine learning. He is currently serving as an associate editor for several publications including Biometrika, Management Science, Operations Research, and the Journal of the American Statistical Association. He has worked with or consulted for several Silicon Valley companies, including Dropbox, Facebook, Google, and Uber. -
Attention Sinks: A ‘Catch, Tag, Release’ Mechanism for Embeddings
On October 31, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
Large language models (LLMs) often concentrate their attention on a small set of tokens, referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. Although these tokens often lack inherent semantic meaning, their presence is critical for model performance, particularly under model compression and KV-caching. Yet the function, semantic role, and origin of attention sinks, especially those beyond the first token, remain poorly understood.
In this talk, I’ll present a comprehensive investigation revealing that attention sinks catch a sequence of tokens, tag them with a shared perturbation, and release them back into the residual stream, where they are later retrieved based on the tags they carry. Probing experiments show that these tags encode semantically meaningful information, such as the truth of a statement.
This mechanism persists in models with query-key normalization, where prompt-dependent, non-BOS sinks have become more common, and in DeepSeek-distilled models, where it spans more heads and accounts for greater variance in the embeddings. To support future theoretical work, we introduce a minimal task that is solvable via the catch, tag, release mechanism and in which the mechanism naturally emerges through training.
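The "catch, tag, release" picture can be mocked up in a few lines of plain Python. The dimensions, tag strength, and dot-product retrieval rule below are invented for illustration; in a real transformer the catching and retrieval happen through attention heads, not an explicit tag vector.

```python
import math
import random

random.seed(0)
d = 16                                    # toy residual-stream width

# Shared perturbation the sink writes into every token it "catches".
tag = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(t * t for t in tag))
tag = [t / norm for t in tag]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Residual-stream states for 6 tokens; the sink catches tokens 2..4
# and releases them back into the stream with the tag added.
resid = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]
caught = [2, 3, 4]
for i in caught:
    resid[i] = [r + 10.0 * t for r, t in zip(resid[i], tag)]

# A later component can retrieve the tagged tokens by projecting each
# residual state onto the tag direction.
scores = [dot(r, tag) for r in resid]
retrieved = sorted(sorted(range(6), key=lambda i: scores[i])[-3:])
```

The top-scoring positions recover exactly the caught tokens, because only they carry the shared perturbation.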
Bio:
Vardan Papyan is an Assistant Professor in the Department of Mathematics at the University of Toronto, cross-appointed with the Department of Computer Science. He completed his postdoctoral studies at the Department of Statistics at Stanford University, under the guidance of David Donoho, and his PhD at the Department of Computer Science at the Technion – Israel Institute of Technology, under the supervision of Michael Elad. -
Back to the Future – Data-Efficient Language Modeling
On November 7, 2025, from 11:00 am to 12:00 pm, in E18-304
Abstract:
Compute scaling has dominated the conversation around modern language models, leading to an impressive array of algorithms that optimize performance for a given training (and sometimes inference) compute budget. But as compute has grown cheaper and more abundant, data is starting to become a bottleneck, and our ability to exchange compute for data efficiency may be crucial to future model scaling. In this talk, I will discuss some of our recent work on synthetic data and algorithmic approaches to data efficiency, and show that in both cases, classical statistical perspectives based on nonparametric modeling and ensembling bring new insights and empirical benefits to modern questions of scaling and data efficiency.
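The classical ensembling perspective the abstract alludes to, variance reduction by averaging predictors fit on subsamples, can be sketched in a few lines. The toy data, the through-the-origin slope estimator, and the bagging parameters are invented for illustration, not the speaker's method.

```python
import random

random.seed(0)

def fit_slope(data):
    """Least-squares slope of y on x through the origin."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx

def bagged_slope(data, k=25, frac=0.5):
    """Average k slope estimates, each fit on a random subsample;
    the classical bagging/ensembling route to variance reduction."""
    n = int(frac * len(data))
    fits = [fit_slope(random.sample(data, n)) for _ in range(k)]
    return sum(fits) / k

# Toy data: y = 2x + noise.
xs = [random.uniform(1, 5) for _ in range(200)]
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in xs]

single = fit_slope(data)
ensemble = bagged_slope(data)
```

Both estimates land near the true slope of 2; the ensemble's averaging over subsamples is the statistical mechanism that becomes interesting when, as in the talk, data rather than compute is the scarce resource.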
Biography:
Tatsunori Hashimoto is an Assistant Professor in the Computer Science Department at Stanford University. Work from his group spans many areas within statistical machine learning and language models, including language model post-training, uncertainty quantification, and data selection. He received his Ph.D. at MIT under the supervision of Tommi Jaakkola and David Gifford, and is the recipient of an NSF CAREER Award, the Samsung AI Researcher of the Year Award, and a Kavli Fellowship, as well as best paper awards at ICML, ICLR, and CHI.