Statistics and Data Science Seminar


  • TBD

    On November 21, 2025, from 11:00 am to 12:00 pm
    Christos Thrampoulidis, University of British Columbia
    E18-304
  • TBD

    On December 5, 2025, from 11:00 am to 12:00 pm
    Michael Albergo, Harvard University
    E18-304
  • Trees and V’s: Inference for Ensemble Models

    On October 4, 2024, from 11:00 am to 12:00 pm
    Giles Hooker, Wharton School – UPenn
    E18-304

    Abstract: This talk discusses uncertainty quantification and inference using ensemble methods. Recent theoretical developments inspired by random forests have cast bagging-type methods as U-statistics when bootstrap samples are replaced by subsamples, resulting in a central limit theorem and hence the potential for inference. However, to carry this out requires estimating a variance for which all proposed estimators exhibit substantial upward bias. In this talk, we convert subsamples without replacement to subsamples with replacement, resulting in V-statistics for which we prove a novel central limit theorem. We also show that in this context, the asymptotic variance can be expressed as the variance of a conditional expectation which is approximated by sampling from the empirical distribution and allows for valid bias corrections. We finish by illustrating the use of these tools in combining or comparing statistical models.

    Bio: Giles Hooker is Professor of Statistics and Data Science at the University of Pennsylvania. His work has focused on statistical methods using dynamical systems models, functional data analysis, and statistical aspects of fair and interpretable machine learning. He is the author of Dynamic Data Analysis: Modeling Data with Differential Equations and Functional Data Analysis in R and Matlab. Much of his work has been inspired by collaborations particularly in ecology, human movement, and citizen science data. Professor Hooker earned a PhD in Statistics from Stanford University before doing a post-doctoral fellowship at McGill University. Prior to joining Penn, he served as Professor of Statistics and Data Science at Cornell University and Professor of Statistics at UC Berkeley. He also holds a visiting appointment at the Australian National University.
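
    To make the abstract's variance construction concrete, here is a minimal Monte Carlo sketch in Python (mine, not the speaker's code) of the "variance of a conditional expectation" idea for an ensemble built on subsamples drawn with replacement: fix one data point, average the predictions of base learners whose subsample is forced to contain it, and take the variance of these conditional means across data points. The helper name and tuning constants below are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    def conditional_variance(X, y, x_test, k=100, n_outer=50, n_inner=30):
        """Monte Carlo estimate of Var_z E[T(x_test) | first subsample point = z],
        where T is a tree fit on a size-k subsample drawn with replacement
        (a V-statistic kernel).  Illustrative sketch only."""
        n = len(y)
        cond_means = np.empty(n_outer)
        for i in range(n_outer):
            z = rng.integers(n)                    # outer draw from the empirical distribution
            preds = np.empty(n_inner)
            for j in range(n_inner):
                idx = rng.integers(n, size=k)      # with-replacement subsample ...
                idx[0] = z                         # ... forced to contain the outer point
                tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
                preds[j] = tree.predict(x_test.reshape(1, -1))[0]
            cond_means[i] = preds.mean()           # Monte Carlo conditional expectation
        return cond_means.var(ddof=1)

    # Toy data: the estimated quantity is what drives uncertainty statements
    # about the ensemble prediction at x_test.
    X = rng.normal(size=(500, 5))
    y = X[:, 0] + rng.normal(scale=0.5, size=500)
    print(conditional_variance(X, y, x_test=np.zeros(5)))
    ```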

    Find out more »: Trees and V’s: Inference for Ensemble Models
  • A Flexible Defense Against the Winner’s Curse

    On October 25, 2024, from 11:00 am to 12:00 pm
    Tijana Zrnic, Stanford University
    E18-304

    Abstract: Across science and policy, decision-makers often need to draw conclusions about the best candidate among competing alternatives. For instance, researchers may seek to infer the effectiveness of the most successful treatment or determine which demographic group benefits most from a specific treatment. Similarly, in machine learning, practitioners are often interested in the population performance of the model that empirically performs best. However, cherry-picking the best candidate leads to the winner’s curse: the observed performance for the winner is biased upwards, rendering conclusions based on standard measures of uncertainty invalid. We introduce a novel approach for valid inference on the winner. Our method is flexible: it handles arbitrary dependence between candidates and is entirely nonparametric. It automatically adapts to the level of selection bias; in particular, it recovers standard, uncorrected inference when the winner stands out and becomes increasingly conservative when there are multiple competitive candidates. The robust underpinnings of the method allow easy extensions to important related problems, such as inference on the top k winners, inference on the value and identity of the population winner, and inference on near-winners.
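
    As a quick illustration of the phenomenon the talk addresses, the following small Python simulation (my own sketch, not the speaker's) shows the upward bias of the empirically best candidate when all candidates are in fact equally good; naive intervals centered at the winner's observed score would under-cover its true mean.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, reps = 20, 50, 2000              # candidates, samples per candidate, repetitions
    true_means = np.zeros(m)               # every candidate has the same true mean

    bias = []
    for _ in range(reps):
        scores = rng.normal(true_means, 1.0, size=(n, m)).mean(axis=0)  # observed averages
        winner = scores.argmax()                                        # cherry-pick the best
        bias.append(scores[winner] - true_means[winner])                # observed minus true

    print(f"average upward bias of the winner: {np.mean(bias):.3f}")
    ```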

    Bio:

    Tijana Zrnic is a Ram and Vijay Shriram Data Science Postdoctoral Fellow at Stanford University, where she is hosted by Emmanuel Candès in the Department of Statistics. Her research establishes foundations to ensure data-driven technologies have a positive impact. Tijana earned her PhD in Electrical Engineering and Computer Sciences from UC Berkeley in 2023, where she was advised by Moritz Hardt and Michael Jordan. Her doctoral research explored prediction and statistical inference in feedback loops, including topics such as performative prediction, prediction-powered inference, and mitigating selection bias. Before her PhD, Tijana completed a BEng in Electrical and Computer Engineering at the University of Novi Sad in Serbia.

    Find out more »: A Flexible Defense Against the Winner’s Curse
  • Inference for ATE & GLM’s when p/n→δ∈(0,∞)

    On February 7, 2025, from 11:00 am to 12:00 pm
    Rajarshi Mukherjee, Harvard University
    E18-304

    Abstract
    In this talk we will discuss statistical inference for the average treatment effect under measured confounding, as well as the parallel questions of inferring linear and quadratic functionals in generalized linear models, in high-dimensional proportional asymptotic settings, i.e., when p/n→δ∈(0,∞), where p and n denote the dimension of the covariates and the sample size respectively. Our first set of results relies on knowledge of the variance-covariance matrix Σ of the covariates under study: we show that √n-consistent asymptotically normal inference is then possible for any δ, using method-of-moments-type estimators that do not rely on estimating high-dimensional nuisance parameters followed by a debiasing strategy. Without the knowledge of Σ, we first develop √n-consistent estimators by using simple estimators of Σ when δ < 1. Subsequently, for δ ≥ 1, we develop consistent estimators of the quantities of interest and argue that √n-consistent estimation might not be possible without further assumptions on Σ. Finally, we verify our results in numerical simulations. This talk is based on joint work with Xingyu Chen and Lin Liu from Shanghai Jiao Tong University.

    Bio
    Rajarshi Mukherjee is an Associate Professor in the Department of Biostatistics at Harvard T.H. Chan School of Public Health. Previously, he was an Assistant Professor in the Division of Biostatistics at UC Berkeley following his time as a Stein Fellow in the Department of Statistics at Stanford University. He obtained his PhD in Biostatistics from Harvard University, advised by Prof. Xihong Lin.

    He is generally interested in understanding broad aspects of causal inference in observational studies in modern data settings, with a focus on learning about fundamental challenges in the statistical analysis of environmental mixtures and their effects on the cognitive development of children and cognitive decline in aging populations. His research is also motivated by learning through applications in large-scale genetic association studies, developing statistical methods to quantify the effects of climate change on human health, and understanding the effects of homelessness on human health.

    Find out more »: Inference for ATE & GLM’s when p/n→δ∈(0,∞)
  • Towards a ‘Chemistry of AI’: Unveiling the Structure of Training Data for more Scalable and Robust Machine Learning

    On February 21, 2025, from 11:00 am to 12:00 pm
    David Alvarez-Melis, Harvard University
    E18-304

    Abstract: Recent advances in AI have underscored that data, rather than model size, is now the primary bottleneck in large-scale machine learning performance. Yet, despite this shift, systematic methods for dataset curation, augmentation, and optimization remain underdeveloped. In this talk, I will argue for the need for a ‘Chemistry of AI’: a paradigm that, like the emerging ‘Physics of AI’, embraces a principles-first, rigorous, empiricist approach but shifts the focus from models to data. This perspective treats datasets as structured, dynamic entities that can be transformed through optimization and seeks to characterize their fundamental properties, composition, and interactions. I will then highlight some of our recent work that takes initial steps toward establishing this framework, including principled methods for dataset synthesis and surprising recent findings in dataset distillation.

    Bio:

    David Alvarez-Melis is an Assistant Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences, where he leads the Data-Centric Machine Learning (DCML) group. He is also a Researcher at Microsoft Research New England and an Associate Faculty at the Kempner Institute for Natural and Artificial Intelligence. He holds a Ph.D. in Computer Science from MIT and degrees in Mathematics from NYU and ITAM. David’s research seeks to make machine learning more broadly applicable (especially to data-poor applications) and trustworthy (e.g., robust and interpretable) through a data-centric approach that draws on methods from statistics, optimization and applied mathematics, and which takes inspiration from problems arising in the application of machine learning to the natural sciences.

    Find out more »: Towards a ‘Chemistry of AI’: Unveiling the Structure of Training Data for more Scalable and Robust Machine Learning
  • Saddle-to-saddle dynamics in diagonal linear networks

    On December 8, 2023, from 11:00 am to 12:00 pm
    Nicolas Flammarion (EPFL)
    E18-304

    Abstract: When training neural networks with gradient methods and small weight initialisation, peculiar learning curves are observed: the training initially shows minimal progress, which is then followed by a sudden transition where a new feature is rapidly learned. This pattern is commonly known as incremental learning. In this talk, I will demonstrate that we can comprehensively understand this phenomenon within the context of a simplified network architecture. In this setting, we can establish that the gradient flow trajectory transitions from one saddle point of the training loss to another. The specific saddle points visited, as well as the timing of these transitions, can be determined using a recursive algorithm that is reminiscent of the Homotopy method used in computing the Lasso path.
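
    A toy simulation of this incremental-learning picture (my own sketch with made-up sizes, not the speaker's code): gradient descent on a diagonal linear network beta = u * v with a tiny initialisation on a sparse regression problem. The coordinates of beta switch on one at a time, with long plateaus between saddles.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.normal(size=(n, d))
    beta_star = np.zeros(d)
    beta_star[:3] = [3.0, 2.0, 1.0]          # sparse ground truth
    y = X @ beta_star

    alpha, lr, steps = 1e-4, 1e-4, 150_000   # small initialisation scale, small step size
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)

    for t in range(steps):
        grad = X.T @ (X @ (u * v) - y) / n               # gradient of the loss w.r.t. beta = u * v
        u, v = u - lr * grad * v, v - lr * grad * u      # chain rule through the factorisation
        if t % 15_000 == 0:
            print(t, np.round(u * v, 2))                 # coordinates activate sequentially over plateaus
    ```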

    Bio: Nicolas Flammarion is a tenure-track assistant professor in computer science at EPFL. Prior to that, he was a postdoctoral fellow at UC Berkeley, hosted by Michael I. Jordan. He received his PhD in 2017 from Ecole Normale Superieure in Paris, where he was advised by Alexandre d’Aspremont and Francis Bach. In 2018 he received the prize of the Fondation Mathematique Jacques Hadamard for the best PhD thesis in the field of optimization and in 2021, he was one of the recipients of the NeurIPS Outstanding Paper Award. His research focuses on learning problems at the intersection of machine learning, statistics, and optimization. He aims to develop algorithmic and theoretical tools that improve our understanding of machine learning and increase its robustness and usability.

    Find out more »: Saddle-to-saddle dynamics in diagonal linear networks
  • Efficient Algorithms for Semirandom Planted CSPs at the Refutation Threshold

    On February 16, 2024, from 11:00 am to 12:00 pm
    Pravesh Kothari, Princeton University
    E18-304

    Abstract: We present an efficient algorithm to solve semi-random planted instances of any Boolean constraint satisfaction problem (CSP). The semi-random model is a hybrid between worst-case and average-case input models, where the input is generated by (1) choosing an arbitrary planted assignment x∗, (2) choosing an arbitrary clause structure, and (3) choosing literal negations for each clause from an arbitrary distribution shifted by x∗ so that x∗ satisfies each constraint. For an n-variable semi-random planted instance of a k-arity CSP, our algorithm runs in polynomial time and outputs an assignment that satisfies all but an o(1) fraction of constraints, provided that the instance has at least Õ(n^{k/2}) constraints. This matches, up to polylog(n) factors, the clause threshold for algorithms that solve fully random planted CSPs [FPV15], as well as algorithms that refute random and semi-random CSPs. Our result shows that despite having a worst-case clause structure, the randomness in the literal patterns makes semi-random planted CSPs significantly easier than worst-case, where analogous results require O(n^k) constraints.

    Perhaps surprisingly, our algorithm follows a different conceptual framework compared to the recent resolution of semi-random CSP refutation. This turns out to be inherent and, at a technical level, can be attributed to the need for relative spectral approximation of certain random matrices – reminiscent of the classical spectral sparsification – which ensures that an SDP can certify the uniqueness of the planted assignment. In contrast, in the refutation setting, it suffices to obtain a weaker guarantee of absolute upper bounds on the spectral norm of related matrices.
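
    For concreteness, here is a hedged Python sketch (mine, not from the paper) of steps (1)-(3) above for k-SAT viewed as a Boolean CSP: the clause structure is an arbitrary fixed pattern standing in for a worst-case choice, and only the literal negations are random, drawn from a distribution conditioned so that the planted assignment satisfies every clause.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, k, m = 30, 3, 200
    x_star = rng.integers(0, 2, size=n)            # (1) planted assignment (random here; could be arbitrary)

    clauses = []
    for c in range(m):
        vars_ = [(c + j) % n for j in range(k)]    # (2) arbitrary, worst-case clause structure
        while True:                                # (3) negation pattern from a distribution shifted by x_star:
            negs = rng.integers(0, 2, size=k)      #     here, uniform conditioned on x_star satisfying the clause
            if any(x_star[v] ^ s for v, s in zip(vars_, negs)):
                break
        clauses.append(list(zip(vars_, negs)))

    def satisfied_fraction(x, clauses):
        # literal (v, neg) evaluates to x[v] XOR neg; a clause is an OR of its literals
        return np.mean([any(x[v] ^ neg for v, neg in cl) for cl in clauses])

    print(satisfied_fraction(x_star, clauses))                       # 1.0: the planted assignment satisfies everything
    print(satisfied_fraction(rng.integers(0, 2, size=n), clauses))   # a random assignment does not
    ```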

    Joint work with Venkatesan Guruswami, Jun-Ting Hsieh, and Peter Manohar.

    Bio: Pravesh Kothari is an Assistant Professor of Computer Science at Princeton University and a Visiting Professor in the School of Mathematics at the Institute for Advanced Study, Princeton. Earlier, he was an Assistant Professor at Carnegie Mellon University’s CS Department, a Postdoctoral Research Instructor at Princeton CS and the School of Math at the IAS, and obtained his Ph.D. from UT Austin in 2016. Kothari’s recent work has focused on algorithms for semi-random optimization problems via sum-of-squares semidefinite programs with connections to random matrices, extremal combinatorics, and coding theory. His research has been recognized with a Simons Award for graduate students in Theoretical Computer Science, a Google Research Scholar Award, an IIT Kanpur Young Alumnus Award, an NSF CAREER Award, and an Alfred P. Sloan Research Fellowship.

    Find out more »: Efficient Algorithms for Semirandom Planted CSPs at the Refutation Threshold
  • Geometric EDA for Random Objects

    On March 17, 2023, from 11:00 am to 12:00 pm
    Paromita Dubey, University of Southern California
    E18-304

    Abstract: In this talk I will propose new tools for the exploratory data analysis of data objects taking values in a general separable metric space. First, I will introduce depth profiles, where the depth profile of a point ω in the metric space refers to the distribution of the distances between ω and the data objects. I will describe how depth profiles can be harnessed to define transport ranks, which capture the centrality of each element in the metric space with respect to the data cloud. Next, I will discuss the properties of transport ranks and show how they can be an effective device for detecting and visualizing patterns in samples of random objects. Together with practical illustrations I will establish large sample properties of the estimators of the depth profiles and the transport ranks which hold for a wide class of metric spaces. Finally, I will describe a new two sample test geared towards populations of random objects by utilizing the depth profiles corresponding to the data objects. I will demonstrate the efficacy of this new approach on distributional data comprising a sample of age-at-death distributions for various countries, for compositional data through energy usage for the U.S. states and for neuroimaging network data. This talk is based on joint work with Yaqing Chen and Hans-Georg Müller.
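
    The depth profile defined above is straightforward to compute empirically; the sketch below (my own, with illustrative names) does so for Euclidean data, though the same function applies verbatim to any metric on data objects. Transport ranks, which the talk derives from these profiles, are deliberately left out.

    ```python
    import numpy as np

    def depth_profile(omega, data, metric):
        """Empirical depth profile of omega: the distribution (here, the sorted
        sample) of distances between omega and the observed data objects."""
        return np.sort([metric(omega, x) for x in data])

    # Toy example: data objects are points in R^2 with the Euclidean metric, but
    # `metric` could equally be a Wasserstein distance between distributions or
    # a distance between networks.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 2))
    euclid = lambda a, b: float(np.linalg.norm(a - b))

    central = depth_profile(np.zeros(2), data, euclid)
    outlying = depth_profile(np.array([4.0, 4.0]), data, euclid)
    print(central.mean(), outlying.mean())   # central points have profiles concentrated at small distances
    ```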

    Bio: Paromita Dubey has been an Assistant Professor in the Data Sciences and Operations department at the USC Marshall School of Business since 2021. Her research centers around developing novel statistical frameworks for non-Euclidean data, examples being distribution- and network-valued data. She is also working on addressing challenges in the analysis of dynamic time-evolving data, particularly in non-Euclidean and high dimensional settings. Aside from theoretical challenges, she enjoys working collaboratively to develop statistical frameworks arising in application-oriented challenges in population genetics, environmental sciences and social sciences.

    Find out more »: Geometric EDA for Random Objects
  • Sampling from the SK measure via algorithmic stochastic localization

    On October 28, 2022, from 11:00 am to 12:00 pm
    Ahmed El Alaoui, Cornell University
    E18-304

    Abstract: I will present an algorithm which efficiently samples from the Sherrington-Kirkpatrick (SK) measure with no external field at high temperature. The approach is based on the stochastic localization process of Eldan, together with a subroutine for computing the mean vectors of a family of SK measures tilted by an appropriate external field. This approach is general and can potentially be applied to other discrete or continuous non-log-concave problems.

    We show that the algorithm outputs a sample within vanishing rescaled Wasserstein distance to the SK measure, for all inverse temperatures beta < 1/2. In a recent development, Celentano (2022) shows that our algorithm succeeds for all beta < 1, i.e., in the entire high temperature phase. Conversely, we show that in the low temperature phase beta > 1, no ‘stable’ algorithm can approximately sample from the SK measure. In this case we show that the SK measure is unstable to perturbations in a certain sense. This settles the computational tractability of sampling from SK for all temperatures except the critical one. This is based on a joint work with Andrea Montanari and Mark Sellke.
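
    The following hedged Python sketch (mine, not the paper's implementation) mimics the structure of the sampler on a toy scale: it runs a discretized stochastic localization process, in its simplest form dy_t = m_t dt + dB_t, where m_t is the mean of the SK measure tilted by the external field y_t. Here that tilted mean is computed by brute-force enumeration over {-1,+1}^n, which only works for tiny n; the point of the talk is that this step can be replaced by an efficient mean-estimation subroutine at high temperature.

    ```python
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n, beta = 10, 0.3
    G = rng.normal(size=(n, n))
    J = (G + G.T) / (2 * np.sqrt(n))                           # symmetric Gaussian couplings
    X = np.array(list(itertools.product([-1, 1], repeat=n)))   # all 2^n spin configurations
    H = beta * np.einsum('ki,ij,kj->k', X, J, X) / 2           # SK energy of each configuration

    def tilted_mean(y):
        """Mean of the SK measure tilted by external field y (exact; tiny n only)."""
        logw = H + X @ y
        w = np.exp(logw - logw.max())
        return (w[:, None] * X).sum(axis=0) / w.sum()

    y, dt, T = np.zeros(n), 0.01, 20.0
    for _ in range(int(T / dt)):
        m = tilted_mean(y)
        y += m * dt + np.sqrt(dt) * rng.normal(size=n)         # Euler step of the localization dynamics
    print(np.sign(tilted_mean(y)))                             # the tilted measure localizes on one configuration
    ```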

    Bio: Ahmed El Alaoui joined the Statistics and Data Science faculty at Cornell University as an assistant professor in January 2021. He received his PhD in 2018 in Electrical Engineering and Computer Sciences from UC Berkeley, advised by Michael I. Jordan. He was afterwards a postdoctoral researcher at Stanford University, hosted by Andrea Montanari. He is currently a Simons-Berkeley research fellow at the Simons Institute for the Theory of Computing at UC Berkeley. His research interests revolve around high-dimensional phenomena in statistics and probability theory, statistical physics, algorithms, and problems where these areas meet.

    Find out more »: Sampling from the SK measure via algorithmic stochastic localization