Statistics and Data Science Seminar


  • Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data

    On April 18, 2025, from 11:00 am to 12:00 pm
    Dennis Shen, University of Southern California
    E18-304

    Abstract: One dominant approach to evaluating the causal effect of a treatment is panel data analysis, whereby the behaviors of multiple units are observed over time. The information across time and units motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthetic controls), which exploits cross-sectional patterns. Conventional wisdom often considers the two approaches to be different. We establish this position to be partly false for estimation but generally true for inference. In the absence of any assumptions, we show that both approaches yield algebraically equivalent point estimates for several standard estimators. However, the source of randomness assumed by each approach leads to a distinct estimand and quantification of uncertainty, even for the same point estimate. This emphasizes that researchers should carefully consider where the randomness stems from in their data, as it has direct implications for the accuracy of inference.
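    The horizontal/vertical dichotomy is easy to see on a simulated panel. The sketch below (an illustrative toy with invented data, not the estimators analyzed in the talk) fits both regressions on the same pre-treatment observations and forms the two counterfactual predictions for a single treated unit:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, T0 = 10, 6, 8                  # periods, units, treatment start
Y = rng.normal(size=(T, N)) + np.outer(np.linspace(0, 1, T), rng.normal(size=N))
pre, post = Y[:T0], Y[T0:]
ctrl, trt = slice(0, N - 1), N - 1   # last unit is treated from T0 onward

# Horizontal (time-series) regression: learn how post-period outcomes depend
# on pre-period outcomes across the control units, then apply to the treated unit.
coef_h, *_ = np.linalg.lstsq(pre[:, ctrl].T, post[:, ctrl].T, rcond=None)
cf_h = pre[:, trt] @ coef_h          # counterfactual post-period outcomes

# Vertical (cross-sectional) regression: express the treated unit as a
# combination of control units over the pre-period, then extrapolate forward.
coef_v, *_ = np.linalg.lstsq(pre[:, ctrl], pre[:, trt], rcond=None)
cf_v = post[:, ctrl] @ coef_v

att_h = float(np.mean(post[:, trt] - cf_h))   # two estimates of the effect
att_v = float(np.mean(post[:, trt] - cf_v))
```

    With plain least squares the two point estimates generally differ; the algebraic equivalences established in the talk concern specific standard estimators.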

    Bio:
    Dennis Shen is an assistant professor in the Data Sciences and Operations Department at the USC Marshall School of Business. Before joining USC, he was a FODSI postdoctoral fellow at the Simons Institute at UC Berkeley. He also served as a technical consultant for Uber Technologies and TauRx Therapeutics. He has received several recognitions for his work, including the INFORMS George B. Dantzig Dissertation Award (2nd to his esteemed colleague, Somya) and the MIT George Sprowls PhD Thesis Award in Artificial Intelligence & Decision-making.

  • Causal Inference on Outcomes Learned from Text

    On April 11, 2025, from 11:00 am to 12:00 pm
    Jann Spiess, Stanford University
    E18-304

    Abstract:

    (with Iman Modarressi and Amar Venugopal; arxiv.org/abs/2503.00725)

    We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, which outcomes does the effect fall on? And third, how complete is our description of causal effects? To answer all three questions, our approach uses large language models (LLMs) that suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts.
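    The role of sample splitting can be sketched as follows. Here `label_fn` is a hypothetical stand-in for an LLM-suggested (and human-validated) document labeling, and the corpora are synthetic; only the split-then-test logic reflects the abstract:

```python
import random, statistics

def label_fn(doc):
    # Hypothetical stand-in for an LLM scoring a document on a suggested
    # dimension; in the procedure this labeling would be validated by humans.
    return doc.count("improve") - doc.count("decline")

random.seed(0)
treated = ["improve " * random.randint(1, 4) for _ in range(50)]
control = ["decline " * random.randint(1, 4) for _ in range(50)]

# Sample splitting: the discovery halves would be shown to the LLM to propose
# label_fn; statistical inference then uses only the held-out halves.
split = 25
test_t, test_c = treated[split:], control[split:]
s_t = [label_fn(d) for d in test_t]
s_c = [label_fn(d) for d in test_c]
effect = statistics.mean(s_t) - statistics.mean(s_c)  # held-out group difference
```

    Because the labeling rule is fixed before the held-out documents are scored, standard two-sample inference on `effect` remains valid.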

    Bio:
    Jann is an econometrician in the OIT group at Stanford GSB. His current research focuses broadly on two related themes: (1) high-dimensional and robust causal inference, including work on using machine learning to improve inferences from randomized trials, robust inference in panel data, synthetic control, matching estimation, highly over-parametrized models, and high-dimensional outcome data; and (2) data-driven decision-making with misaligned objectives, including work on algorithmic fairness, human–AI interaction, the regulation of algorithms, and the design of pre-analysis plans. He holds a PhD in economics from Harvard University.

  • The value of information in model-assisted decision-making

    On April 4, 2025, from 11:00 am to 12:00 pm
    Jessica Hullman, Northwestern University
    E18-304

    Abstract: The widespread adoption of AI and machine learning models in society has brought increased attention to how model predictions impact decision processes in a variety of domains. I will describe tools that apply statistical decision theory and information economics to address pressing questions at the human-AI interface. These include: how to evaluate when a decision-maker appropriately relies on model predictions, when a human or AI agent could better exploit available contextual information, and how to evaluate (and design) prediction explanations. I will also discuss some cases where statistical theory falls short of providing insight into how people may use predictions for decisions.
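    As a concrete instance of the decision-theoretic framing, the expected value of a model's prediction can be computed directly in a two-state, two-action toy problem (all numbers below are illustrative assumptions, not from the talk):

```python
prior = 0.3     # P(state = 1)
acc = 0.9       # P(prediction matches the state)
payoff = {(1, 1): 1.0, (1, 0): -0.5, (0, 0): 0.0, (0, 1): 0.0}  # (action, state)

def expected_payoff(p1, action):
    return p1 * payoff[(action, 1)] + (1 - p1) * payoff[(action, 0)]

# Without the prediction: act on the prior alone.
base = max(expected_payoff(prior, a) for a in (0, 1))

# With the prediction: update by Bayes' rule and act on each posterior.
value_with = 0.0
for signal in (0, 1):
    p_sig = (acc * prior + (1 - acc) * (1 - prior)) if signal \
        else (acc * (1 - prior) + (1 - acc) * prior)
    post = (acc * prior / p_sig) if signal else ((1 - acc) * prior / p_sig)
    value_with += p_sig * max(expected_payoff(post, a) for a in (0, 1))

voi = value_with - base   # value of the prediction's information, always >= 0
```

    Comparing `voi` under different accuracies or payoff structures is one way to quantify when relying on a model's prediction is worthwhile.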

    Bio: Jessica Hullman is the Ginni Rometty Professor of Computer Science and a Faculty Fellow at the Institute for Policy Research at Northwestern University. Her research develops theoretical frameworks and interface tools for helping people combine their knowledge with statistical models. Her work draws on foundational models of decision-making under uncertainty, such as Bayesian decision theory, while addressing real-world applied problems at the interface between humans and statistical models. Hullman’s current research pursues methods for designing explanations and quantifying uncertainty for AI-assisted decision-making, as well as evaluating AI-human team performance. Her work has led to multiple best paper and honorable mention awards at top visualization and HCI venues, a Microsoft Faculty Award, and an NSF CAREER award, among other honors.

  • Structured Topic Modeling: Leveraging Sparsity and Graphs for Improved Inference

    On March 21, 2025, from 11:00 am to 12:00 pm
    Claire Donnat, University of Chicago
    E18-304

    Abstract:
    Classical topic modeling approaches, such as Latent Dirichlet Allocation (LDA) and probabilistic Latent Semantic Indexing (pLSI), decompose a document-term matrix into a mixture of topics, offering a powerful tool for uncovering latent thematic structures from document corpora or compositional data at large. However, these methods generally assume document independence, overlooking potential relationships or additional structural information that could improve inference—especially in contexts with short documents or large vocabulary sizes.
    In this talk, we will consider two new structured approaches to topic modeling that enhance inference. The first extends pLSI by incorporating weak sparsity to manage large vocabularies effectively. The second leverages document-level relationships encoded as a graph, introducing a graph-aligned singular value decomposition of the empirical frequency matrix to improve the estimation of document-topic and topic-word matrices. This method is especially advantageous in applications where document similarities are well-defined, such as spatial transcriptomics, microbiome studies, and scientific abstract analysis.
    By establishing high-probability error bounds for estimating topic proportions and word distributions, our work attempts to begin bridging the gap between topic modeling and structured inference. Our examples demonstrate that this flexible, theoretically grounded framework can be effectively applied across diverse data modalities.
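    A minimal version of the SVD-based estimation step can be sketched on simulated pLSI data. This uses a plain truncated SVD; the sparsity-aware and graph-aligned variants in the talk replace exactly this step:

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, vocab, k = 60, 40, 3

# Simulate pLSI-style data: each document mixes k topics; each topic is a
# (sparse-ish) distribution over the vocabulary.
W = rng.dirichlet(np.ones(k), size=n_docs)          # document-topic weights
A = rng.dirichlet(np.full(vocab, 0.1), size=k)      # topic-word distributions
counts = np.array([rng.multinomial(200, w @ A) for w in W])
F = counts / counts.sum(axis=1, keepdims=True)      # empirical frequency matrix

# Rank-k truncated SVD of F recovers the dominant low-rank topic structure.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
F_hat = (U[:, :k] * s[:k]) @ Vt[:k]
err = float(np.linalg.norm(F - F_hat) / np.linalg.norm(F))
```

    The short documents (200 words) and moderate vocabulary make `F` noisy, which is the regime where the structured variants are designed to help.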

    Bio:
    Claire Donnat is an Assistant Professor of Statistics at the University of Chicago, specializing in high-dimensional data analysis, network-based methods, and structured inference. Her work sits at the intersection of theory and applications, focusing on statistical frameworks for multimodal data integration and unsupervised learning. She applies these methods to problems in neuroscience, spatial transcriptomics, and plant microbiology.

  • Feature Learning and Scaling Laws in Two-layer Neural Networks: A high dimensional analysis

    On March 14, 2025, from 11:00 am to 12:00 pm
    Murat A. Erdogdu, University of Toronto
    E18-304

    Abstract: This talk will focus on gradient-based optimization of two-layer neural networks. We consider a high-dimensional setting where the number of samples and the input dimension are both large and show that, under different model assumptions, neural networks learn useful features and adapt to the model more efficiently than classical methods. Further, we derive scaling laws of the learning dynamics for gradient descent, highlighting the power-law dependencies on the optimization time and the model width.
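    The setting can be mimicked in a few lines: a two-layer network trained by full-batch gradient descent on a single-index target, a case where feature learning is known to matter. Dimensions, step size, and the target are illustrative choices, not the scalings studied in the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, width, lr, steps = 200, 20, 64, 0.1, 300

theta = rng.normal(size=d) / np.sqrt(d)          # hidden direction of the target
X = rng.normal(size=(n, d))
y = np.tanh(X @ theta)                           # single-index labels

W = rng.normal(size=(d, width)) / np.sqrt(d)     # first-layer weights
a = rng.normal(size=width) / np.sqrt(width)      # second-layer weights

losses = []
for _ in range(steps):
    H = np.tanh(X @ W)                           # hidden-layer features
    r = H @ a - y                                # residuals
    losses.append(float(np.mean(r ** 2)))
    ga = H.T @ r * (2 / n)                       # gradient wrt second layer
    gW = X.T @ (np.outer(r, a) * (1 - H ** 2)) * (2 / n)  # wrt first layer
    a -= lr * ga
    W -= lr * gW
```

    Tracking `losses` against the step count (and repeating across widths) is the empirical analogue of the power-law scaling curves discussed in the talk.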

    Bio: Murat A. Erdogdu is currently an assistant professor in the Departments of Computer Science and Statistics at the University of Toronto. He is also a faculty member of the Vector Institute and a CIFAR Chair in AI. Before that, he was a postdoctoral researcher at Microsoft Research – New England. His research interests include machine learning theory, statistics, and optimization. He obtained his Ph.D. from the Department of Statistics at Stanford University and holds an M.S. degree in Computer Science, also from Stanford.

  • Finite-Particle Convergence Rates for Stein Variational Gradient Descent

    On March 7, 2025, from 11:00 am to 12:00 pm
    Krishna Balasubramanian, University of California – Davis
    E18-304

    Abstract:
    Stein Variational Gradient Descent (SVGD) is a deterministic, interacting particle-based algorithm for nonparametric variational inference, yet its theoretical properties remain challenging to fully understand. This talk presents two complementary perspectives on SVGD. First, we introduce Gaussian-SVGD, a framework that projects SVGD onto the family of Gaussian distributions using a bilinear kernel. We establish rigorous convergence results for both mean-field dynamics and finite-particle systems, proving linear convergence to equilibrium in strongly log-concave settings. This framework also unifies recent algorithms for Gaussian Variational Inference (GVI) under a single theoretical lens. Second, we examine the finite-particle convergence rates of nonparametric SVGD in Kernelized Stein Discrepancy (KSD) and Wasserstein-2 metrics. By decomposing the time derivative of relative entropy, we derive near-optimal convergence rates with polynomial dependence on dimensionality for certain kernel families. We also outline a framework to compare deterministic SVGD algorithms to the more standard randomized MCMC algorithms.
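    For reference, the nonparametric SVGD update itself is short. This is the standard algorithm with an RBF kernel on a 1-D standard normal target; the fixed bandwidth, step size, and particle count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-10.0, 10.0, size=50)   # particles, initialized far from target

def grad_log_p(x):                      # score of the N(0, 1) target
    return -x

h = 1.0                                 # fixed RBF bandwidth
for _ in range(1000):
    diff = x[:, None] - x[None, :]      # diff[i, j] = x_i - x_j
    K = np.exp(-diff ** 2 / (2 * h))    # kernel matrix
    # SVGD direction: kernel-weighted score (attraction toward the target)
    # plus the kernel gradient (repulsion that keeps particles spread out).
    drift = (K @ grad_log_p(x) + (diff / h * K).sum(axis=1)) / len(x)
    x = x + 0.5 * drift
```

    The deterministic particle system drifts toward the target; how fast its empirical distribution approaches N(0, 1) as a function of the number of particles is exactly the finite-particle question the talk addresses.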

    Bio:
    Krishna Balasubramanian is an Associate Professor in the Department of Statistics, University of California, Davis. He is also affiliated with the Graduate Group in Applied Mathematics, the Graduate Program in Electrical and Computer Engineering, the Center for Data Science and Artificial Intelligence Research (CeDAR), and the TETRAPODS Institute of Data Science at UC Davis. He was a visiting scientist at the Simons Institute for the Theory of Computing, UC Berkeley, in Fall 2021 and 2022. Previously, he completed his PhD in Computer Science at the Georgia Institute of Technology and was a postdoctoral researcher in the Department of Operations Research and Financial Engineering at Princeton University and the Department of Statistics at UW-Madison. Krishna’s research interests include stochastic optimization and sampling, deep learning, and nonparametric, geometric, and topological statistics. He has received a Facebook fellowship award, an ICML best paper runner-up award, and the INFORMS ICS Prize. He serves as an associate editor for the Annals of Statistics, IEEE Transactions on Information Theory, and the Journal of Machine Learning Research, and as a senior area chair for top machine learning conferences including the International Conference on Machine Learning (ICML), Advances in Neural Information Processing Systems (NeurIPS), the International Conference on Learning Representations (ICLR), and the Conference on Learning Theory (COLT).

  • Two Approaches Towards Adaptive Optimization

    On February 28, 2025, from 11:00 am to 12:00 pm
    Ashia Wilson, MIT
    E18-304

    Abstract:
    This talk will address two recent projects I am excited about. The first describes efficient methodologies for hyper-parameter estimation in optimization algorithms; I will describe two approaches for adaptively estimating these parameters that often lead to significant improvements in convergence. The second describes a new method, the Metropolis-Adjusted Preconditioned Langevin Algorithm, for sampling from a convex body. Taking an optimization perspective, I focus on the mixing time guarantees of these algorithms — an essential theoretical property for MCMC methods — under natural conditions on the target distribution and the geometry of the domain.
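    For the sampling part, the unpreconditioned building block is the Metropolis-adjusted Langevin algorithm (MALA). The sketch below runs plain MALA on a 1-D log-concave density; the talk's method adds a preconditioner adapted to the geometry of the convex body, which is not shown here:

```python
import math, random

random.seed(0)

def log_p(x):               # log-density of N(0, 1), a log-concave target
    return -0.5 * x * x

def grad_log_p(x):
    return -x

step = 0.5
x, samples, accepts = 3.0, [], 0
for _ in range(5000):
    # Langevin proposal: a half gradient step plus Gaussian noise.
    mean_fwd = x + 0.5 * step * grad_log_p(x)
    y = mean_fwd + math.sqrt(step) * random.gauss(0, 1)
    mean_rev = y + 0.5 * step * grad_log_p(y)
    # Metropolis correction makes the discretized dynamics exactly invariant.
    log_alpha = (log_p(y) - log_p(x)
                 - (x - mean_rev) ** 2 / (2 * step)
                 + (y - mean_fwd) ** 2 / (2 * step))
    if math.log(random.random()) < log_alpha:
        x, accepts = y, accepts + 1
    samples.append(x)

tail = samples[1000:]                         # discard burn-in
m = sum(tail) / len(tail)
v = sum((s - m) ** 2 for s in tail) / len(tail)
acc_rate = accepts / len(samples)
```

    Mixing-time guarantees of the kind discussed in the talk bound how long such a chain needs before `tail` is representative of the target.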

    Bio:
    Ashia Wilson is a Lister Brothers Career Development Assistant Professor at MIT. Her research focuses on designing scalable, reliable and socially responsible AI systems using tools from dynamical systems theory, statistics, and optimization.
    She obtained her B.A. from Harvard with a concentration in applied mathematics and a minor in philosophy, and her Ph.D. in statistics from UC Berkeley. Before joining MIT, she held a postdoctoral position in the machine learning and statistics group at Microsoft Research.
    Ashia has received best paper and spotlight awards for her work from conferences and workshops such as Fairness, Accountability, and Transparency (FAccT), NeurIPS, and OptML.

  • Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection

    On December 6, 2024, from 11:00 am to 12:00 pm
    Jing Lei, Carnegie Mellon University
    E18-304

    Abstract: 
    We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. By integrating concepts and tools from cross-validation and differential privacy, we develop a test statistic that is asymptotically normal even in high-dimensional settings, and allows for arbitrarily many ties in the population mean vector. The key technical ingredient is a central limit theorem for globally dependent data characterized by stability.  We also propose a practical method for selecting the tuning parameter that adapts to the signal landscape.
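    The flavor of the problem can be conveyed with a much simpler sample-splitting construction. This is an illustrative sketch only; the paper's test statistic uses cross-validation ideas and handles ties and high dimensions more carefully:

```python
import math, random

random.seed(1)
p, n = 5, 400
mu = [0.0, 0.0, 0.3, 0.5, 0.8]    # coordinates 0 and 1 tie for the minimum
data = [[mu[j] + random.gauss(0, 1) for j in range(p)] for _ in range(n)]

def plausible_argmin(j):
    """One-sided z-test of 'coordinate j is an argmin' via sample splitting:
    pick j's strongest competitor on one half of the data, then test the
    mean difference on the other half."""
    half = n // 2
    fit, held = data[:half], data[half:]
    comp = min((k for k in range(p) if k != j),
               key=lambda k: sum(row[k] for row in fit))
    diffs = [row[j] - row[comp] for row in held]
    m = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - m) ** 2 for d in diffs) / (len(diffs) - 1))
    z = m / (sd / math.sqrt(len(diffs)))
    return z < 1.645              # keep j if we cannot reject at the 5% level

conf_set = [j for j in range(p) if plausible_argmin(j)]   # confidence set
```

    In a model-selection application, `conf_set` is the set of models whose risk cannot be distinguished from the best; the talk's construction makes this valid even with many ties and high dimensions.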

    Bio:
    Jing Lei is Professor of Statistics & Data Science at Carnegie Mellon University. He received his Bachelor of Science degree from the School of Mathematical Sciences at Peking University in China, and obtained his PhD in statistics from UC Berkeley in 2010 before joining Carnegie Mellon in 2011. Jing’s research focuses on providing rigorous insights into popular algorithms in practical contexts. He has done pioneering and foundational work on predictive inference, including conformal prediction and cross-validation. He developed advanced theory and methods for high-dimensional matrix data, including sparse PCA and network data, with successful applications in single-cell multi-omics data analysis. He is also among the first researchers to study differential privacy in a statistical context. He is a fellow of the Institute of Mathematical Statistics (IMS) and the American Statistical Association (ASA). Jing received an NSF CAREER Award and the Gottfried E. Noether Young Researcher Award in 2016. In 2022, he was a recipient of the Leo Breiman Junior Award. In 2024, he was awarded the IMS Medallion Lectureship.

  • Statistical Inference with Limited Memory

    On November 22, 2024, from 11:00 am to 12:00 pm
    Ofer Shayevitz, Tel Aviv University
    E18-304

    Abstract: 

    In statistical inference problems, we are typically given a limited number of samples from some underlying distribution, and we wish to estimate some property of that distribution, under a given measure of risk. We are usually interested in characterizing and achieving the best possible risk as a function of the number of available samples. Thus, it is often implicitly assumed that samples are co-located, and that communication bandwidth as well as computational power are not a bottleneck, essentially making the number of samples the sole limiting factor. However, in modern applications such as wireless sensor networks, data may be distributed between multiple remotely-located agents who may be subject to stringent communication constraints, and have limited memory / computational capabilities. In such cases, the bottleneck for inference may become bits rather than samples — either the number of available communication bits, or the number of bits that can be stored in memory. In this talk, we will focus on the latter case, and ask: How does the risk behave as a function of the algorithm’s memory size, when the number of available samples is large? We will formalize this question in a finite-state machine setting, and discuss several techniques for algorithmic construction and lower bound derivation, highlighted through some of our recent work on memory-limited hypothesis testing, bias estimation, uniformity testing, and entropy estimation.

    Based on joint work with Tomer Berg and Or Ordentlich.
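    A toy version of the memory-limited setting: estimate a coin's bias using only S machine states. The randomized birth-death update below has Binomial(S-1, p) as its stationary law, so the long-run average state encodes p. This is a standard illustrative construction for the finite-state-machine setting, not necessarily a scheme from the talk:

```python
import random

random.seed(2)
S = 8            # the machine may occupy only S memory states
p = 0.3          # unknown coin bias to be estimated
s, total, n = 0, 0, 200_000

for _ in range(n):
    heads = random.random() < p
    # Move up on heads w.p. (S-1-s)/(S-1), down on tails w.p. s/(S-1);
    # detailed balance gives a Binomial(S-1, p) stationary distribution.
    if heads and random.random() < (S - 1 - s) / (S - 1):
        s += 1
    elif not heads and random.random() < s / (S - 1):
        s -= 1
    total += s

p_hat = total / (n * (S - 1))    # time-averaged state, rescaled to [0, 1]
```

    The state can only resolve p to within roughly 1/S, which is the memory-versus-risk trade-off the talk quantifies.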

    Bio: 

    Ofer Shayevitz received the B.Sc. degree from the Technion – Israel Institute of Technology, Haifa, Israel, and the M.Sc. and Ph.D. degrees from Tel Aviv University, Tel Aviv, Israel, all in electrical engineering. He is currently an Associate Professor in the Department of EE – Systems at Tel Aviv University. Ofer’s research spans a wide cross-section of problems in information theory, statistical inference, and discrete mathematics. He is the recipient of a European Research Council (ERC) Starting Grant, several Israel Science Foundation (ISF) grants, and a Marie Curie Grant. He is also actively involved in the hi-tech industry and regularly consults for various companies. Before joining Tel Aviv University, Ofer was a postdoctoral fellow in the Information Theory and Applications (ITA) Center at the University of California, San Diego, and worked as a quantitative analyst with the D.E. Shaw group in New York. Prior to his graduate studies, he held several R&D positions in the fields of digital communication and statistical signal processing.

  • Evaluating a black-box algorithm: stability, risk, and model comparisons

    On November 15, 2024, from 11:00 am to 12:00 pm
    Rina Foygel Barber, University of Chicago
    E18-304

    Abstract:
    When we run a complex algorithm on real data, it is standard to use a holdout set, or a cross-validation strategy, to evaluate its behavior and performance. When we do so, are we learning information about the algorithm itself, or only about the particular fitted model(s) that this particular data set produced? In this talk, we will establish fundamental hardness results on the problem of empirically evaluating properties of a black-box algorithm, such as its stability and its average risk, in the distribution-free setting.

    This work is joint with Yuetian Luo and Byol Kim.
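    The evaluation setup in question is the familiar one below: cross-validation applied to an arbitrary black-box fitting procedure (the data and the black box are invented for illustration). The subtlety the talk addresses is what such a number tells us about the algorithm itself versus the particular models fitted on this data set:

```python
import random, statistics

random.seed(4)
data = []
for _ in range(300):
    x = random.uniform(-1, 1)
    data.append((x, 2 * x + random.gauss(0, 0.5)))   # linear signal + noise

def algorithm(train):
    """A black box: here, least squares through the origin."""
    b = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: b * x

# K-fold cross-validation estimate of squared-error risk.
K = 5
folds = [data[i::K] for i in range(K)]
fold_risks = []
for i in range(K):
    train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
    model = algorithm(train)
    fold_risks.append(statistics.mean((y - model(x)) ** 2 for x, y in folds[i]))

cv_risk = statistics.mean(fold_risks)   # estimate of out-of-sample risk
```

    Because `algorithm` is treated as a black box, `cv_risk` is a distribution-free estimate; the talk's hardness results concern which properties of the black box (stability, average risk) such estimates can or cannot pin down.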

    Bio:
    Rina Foygel Barber is the Louis Block Professor in the Department of Statistics at the University of Chicago. Before starting at U of C, she was an NSF postdoctoral fellow in the Department of Statistics at Stanford University, and received her PhD in Statistics at the University of Chicago. Her research focuses on the theoretical foundations of statistical problems in estimation, prediction, and inference. In many modern settings, classical methods may not be reliable due to high dimensionality, failure of model assumptions, or other challenges. She works on distribution-free inference methods such as conformal prediction, and on developing hardness results to establish what types of questions can or cannot be solved with distribution-free methods. She is also interested in multiple testing methods, algorithmic stability, and shape-constrained inference. She also collaborates on modeling and optimization problems in image reconstruction for medical imaging.
