Statistics and Data Science Seminar


  • Quantile and Probability Curves without Crossing

    No dates for this event
    Victor Chernozhukov (MIT Econ)

    The most common approach to estimating conditional quantile curves is to fit a curve, typically linear, pointwise for each quantile. Linear functional forms, coupled with pointwise fitting, are used for a number of reasons including parsimony of the resulting approximations and good computational properties. The resulting fits, however, may not respect a logical monotonicity requirement — that the quantile curve be increasing as a function of probability. This paper studies the natural monotonization of these empirical curves induced by sampling from the estimated non-monotone model, and then taking the resulting conditional quantile curves that by construction are monotone in the probability. This construction of monotone quantile curves may be seen as a bootstrap and also as a monotonic rearrangement of the original non-monotone function. It is shown that the monotonized curves are closer to the true curves in finite samples, for any sample size. Under correct specification, the rearranged conditional quantile curves have the same asymptotic distribution as the original non-monotone curves. Under misspecification, however, the asymptotics of the rearranged curves may partially differ from the asymptotics of the original non-monotone curves. An analogous procedure is developed to monotonize the estimates of conditional distribution functions. The results are derived by establishing the compact (Hadamard) differentiability of the monotonized quantile and probability curves with respect to the original curves in discontinuous directions, tangentially to a set of continuous functions. In doing so, the compact differentiability of the rearrangement-related operators is established.
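
    The rearrangement step described above is simple to sketch: on a grid of probabilities, sorting the values of a non-monotone estimated quantile curve yields its monotone rearrangement. A minimal illustration (the curve below is a hypothetical non-monotone fit, not an estimator from the paper):

```python
import numpy as np

def rearrange(q_vals):
    """Monotone rearrangement on an equally spaced probability grid:
    sorting the fitted values makes the curve increasing in u."""
    return np.sort(q_vals)

u = np.linspace(0.05, 0.95, 19)            # probability grid
q = 1.0 + 2.0 * u + 0.4 * np.sin(12 * u)   # hypothetical non-monotone fit
q_mono = rearrange(q)

assert np.all(np.diff(q_mono) >= 0)        # monotone by construction
assert np.isclose(q_mono.sum(), q.sum())   # sorting only permutes the values
```

    Sorting changes nothing when the original fit is already monotone, which is consistent with the asymptotic equivalence under correct specification noted above.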

  • Automated Data Summarization for Scalability in Bayesian Inference

    On November 22, 2019, from 11:00 am to 12:00 pm
    Tamara Broderick (MIT)
    E18-304

    Abstract:

    Many algorithms take prohibitively long to run on modern, large data sets. But even in complex data sets, many data points may be at least partially redundant for some task of interest. So one might instead construct and use a weighted subset of the data (called a coreset) that is much smaller than the original dataset. Typically running algorithms on a much smaller data set will take much less computing time, but it remains to understand whether the output can be widely useful. (1) In particular, can running an analysis on a smaller coreset yield answers close to those from running on the full data set? (2) And can useful coresets be constructed automatically for new analyses, with minimal extra work from the user? We answer in the affirmative for a wide variety of problems in Bayesian inference. We demonstrate how to construct Bayesian coresets as an automatic, practical pre-processing step. We prove that our method provides geometric decay in relevant approximation error as a function of coreset size. Empirical analysis shows that our method reduces approximation error by orders of magnitude relative to uniform random subsampling of data. Though we focus on Bayesian methods here, we also show that our construction can be applied in other domains.
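
    The weighted-subset idea can be illustrated with a toy Gaussian-mean model and uniform subsampling, which is the baseline the talk improves on; the paper's coresets are built greedily and are not reproduced here, and all data and sizes below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=10_000)        # hypothetical full data set

# Toy "data summary": a small uniformly drawn subset, weighted so that its
# log-likelihood is an unbiased stand-in for the full-data log-likelihood.
idx = rng.choice(len(x), size=100, replace=False)
w = np.full(100, len(x) / 100)               # importance weights

def log_lik(mu, data, weights=None):
    """Gaussian log-likelihood in mu (unit variance, constants dropped)."""
    ll = -0.5 * (data - mu) ** 2
    return ll.sum() if weights is None else (weights * ll).sum()

mu_grid = np.linspace(1.5, 2.5, 201)
full = np.array([log_lik(m, x) for m in mu_grid])
core = np.array([log_lik(m, x[idx], w) for m in mu_grid])

mle_full = mu_grid[full.argmax()]
mle_core = mu_grid[core.argmax()]
assert abs(mle_full - mle_core) < 0.5        # the summary tracks the full data
```

    Any downstream routine that consumes a (data, weights) pair can run unchanged on the summary, which is what makes the construction a pre-processing step.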

    Biography:

    Tamara Broderick is an Associate Professor in the Department of Electrical Engineering and Computer Science at MIT. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). She completed her Ph.D. in Statistics at the University of California, Berkeley in 2014. Previously, she received an AB in Mathematics from Princeton University (2007), a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge (2008), an MPhil by research in Physics from the University of Cambridge (2009), and an MS in Computer Science from the University of California, Berkeley (2013). Her recent research has focused on developing and analyzing models for scalable Bayesian machine learning. She has been awarded an AISTATS Notable Paper Award (2019), an NSF CAREER Award (2018), a Sloan Research Fellowship (2018), an Army Research Office Young Investigator Program award (2017), Google Faculty Research Awards, an Amazon Research Award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the Ph.D. student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average).

    The MIT Statistics and Data Science Center hosts guest lecturers from around the world in this weekly seminar.

  • TAP free energy, spin glasses, and variational inference

    On February 15, 2019, from 11:00 am to 12:00 pm
    Zhou Fan (Yale University)

    Abstract:
    We consider the Sherrington-Kirkpatrick model of spin glasses with ferromagnetically biased couplings. For a specific choice of the mean of the couplings, the resulting Gibbs measure is equivalent to the Bayesian posterior for a high-dimensional estimation problem known as Z2 synchronization. Statistical physics suggests computing the expectation with respect to this Gibbs measure (the posterior mean in the synchronization problem) by minimizing the so-called Thouless-Anderson-Palmer (TAP) free energy, instead of the mean field (MF) free energy. We prove that this identification is correct, provided the ferromagnetic bias is larger than a constant (i.e. the noise level is small enough in synchronization). Namely, we prove that the scaled l_2 distance between any low-energy local minimizer of the TAP free energy and the mean of the Gibbs measure vanishes in the large size limit. Our proof technique is based on upper bounding the expected number of critical points of the TAP free energy using the Kac-Rice formula. This is joint work with Song Mei and Andrea Montanari.
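
    For orientation, one common textbook form of the TAP free energy adds an Onsager correction to the naive mean-field energy and entropy. The sketch below merely evaluates that functional at a magnetization vector m in (-1, 1)^n; the signs and scaling follow one standard convention and are an assumption here, not taken from the talk:

```python
import numpy as np

def tap_free_energy(m, J, beta):
    """Mean-field energy + binary entropy + Onsager correction
    (one standard convention for SK-type models; a sketch only)."""
    n = len(m)
    q = np.mean(m ** 2)
    energy = -0.5 * beta * m @ J @ m
    entropy = np.sum((1 + m) / 2 * np.log((1 + m) / 2)
                     + (1 - m) / 2 * np.log((1 - m) / 2))
    onsager = -0.25 * beta ** 2 * n * (1 - q) ** 2
    return energy + entropy + onsager

rng = np.random.default_rng(0)
n = 50
J = rng.normal(size=(n, n)) / np.sqrt(n)
J = (J + J.T) / 2                    # symmetric random couplings
m = np.tanh(rng.normal(size=n))      # magnetizations strictly inside (-1, 1)
val = tap_free_energy(m, J, beta=0.5)
assert np.isfinite(val)
```

    The talk's results concern the local minimizers of such a functional, not its value at an arbitrary point; this sketch only fixes notation.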

    Biography:
    Zhou Fan is an Assistant Professor in the Department of Statistics and Data Science at Yale University. His research interests include random matrix theory, high dimensional and multivariate statistics, inference in random graphs and networks, discrete algorithms, and applications in genetics and computational biology. Zhou received his Ph.D. in Statistics at Stanford University, working with Iain M. Johnstone and Andrea Montanari. Prior to this, Zhou developed statistical and software tools for molecular dynamics simulations at D. E. Shaw Research.

  • Why Aren’t Network Statistics Accompanied By Uncertainty Statements?

    On March 1, 2019, from 11:00 am to 12:00 pm
    Eric Kolaczyk (Boston University)
    E18-304

    Abstract:
    Over 500K scientific articles have been published since 1999 with the word "network" in the title. And the vast majority of these report network summary statistics of one type or another. However, these numbers are rarely accompanied by any quantification of uncertainty. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, must necessarily propagate to any summary statistics reported. Perhaps surprisingly, there is little in the way of formal statistical methodology for this problem. I summarize results from our recent work, for the case of estimating the density of low-order subgraphs. Under a simple model of network error, we show that consistent estimation of such densities is impossible when the rates of error are unknown and only a single network is observed. We then develop method-of-moment estimators of subgraph density and error rates for the case where a minimal number of network replicates are available (i.e., just 2 or 3). These estimators are shown to be asymptotically normal as the number of vertices increases to infinity. We also provide confidence intervals for quantifying the uncertainty in these estimates, implemented through a novel bootstrap algorithm. We illustrate the use of our estimators in the context of gene coexpression networks — the correction for measurement error is found to have potentially substantial impact on standard summary statistics. This is joint work with Qiwei Yao and Jinyuan Chang.
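
    The simplest version of the error-propagation point is the known-rate case: if absent edges appear with probability alpha and present edges vanish with probability beta, the observed edge density is a linear blend of the true one and can be inverted directly. (The talk's harder setting, where unknown rates are recovered from 2-3 replicates by method of moments, is not shown.) A sketch with made-up rates:

```python
def corrected_edge_density(p_obs, alpha, beta):
    """Invert E[p_obs] = p*(1 - beta) + (1 - p)*alpha for the true density p,
    assuming a known false-positive rate alpha and false-negative rate beta."""
    return (p_obs - alpha) / (1 - alpha - beta)

# Round trip on a hypothetical true edge density
p_true, alpha, beta = 0.10, 0.02, 0.20
p_obs = p_true * (1 - beta) + (1 - p_true) * alpha
assert abs(corrected_edge_density(p_obs, alpha, beta) - p_true) < 1e-12
```

    Even in this idealized case the naive estimate p_obs is biased whenever alpha or beta is nonzero, which is the phenomenon the talk quantifies for general subgraph densities.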

    Biography:
    Eric Kolaczyk is a Professor of Statistics and Director of the Program in Statistics in the Department of Mathematics & Statistics at Boston University. He is also a university Data Science Faculty Fellow, and affiliated with the Division of Systems Engineering and the Programs in Bioinformatics and in Computational Neuroscience. His current research interests revolve mainly around the statistical analysis of network-indexed data, including both theory/methods development and collaborative research. He has published several books on the topic of network analysis. As an associate editor, he has served on the boards of JASA and JRSS-B in statistics, IEEE IP and TNSE in engineering, and SIMODS in mathematics. Currently he is the co-chair of the NAS Roundtable on Data Science Education. He is an elected fellow of the AAAS, ASA, and IMS, an elected senior member of the IEEE, and an elected member of the ISI.


  • Influence maximization in stochastic and adversarial settings

    On September 18, 2015, from 11:00 am to 12:00 pm

    We consider the problem of influence maximization in fixed networks, for both stochastic and adversarial contagion models. In the stochastic setting, nodes are infected in waves according to linear threshold or independent cascade models. We establish upper and lower bounds for the influence of a subset of nodes in the network, where the influence is defined as the expected number of infected nodes at the conclusion of the epidemic. We quantify the gap between our upper and lower bounds in the case of the linear threshold model and illustrate the gains of our upper bounds for independent cascade models in relation to existing results. Importantly, our lower bounds are monotonic and submodular, implying that a greedy algorithm for influence maximization is guaranteed to produce a maximizer within a 1-1/e factor of the truth. In the adversarial setting, an adversary is allowed to specify the edges through which contagion may spread, and the player chooses sets of nodes to infect in successive rounds. We establish upper and lower bounds on the pseudo-regret for possibly stochastic strategies of the adversary and player. This is joint work with Justin Khim and Varun Jog.
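
    The (1 - 1/e) guarantee invoked above is the classical result for greedy maximization of a monotone submodular set function. A generic greedy sketch, run here on a toy coverage function rather than the influence bounds from the talk:

```python
def greedy_max(f, universe, k):
    """Greedy maximization of a set function: add the element with the
    largest marginal gain, k times."""
    S = set()
    for _ in range(k):
        best = max(universe - S, key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S

# Toy monotone submodular function: coverage (size of the union)
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"a"}}
f = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
S = greedy_max(f, set(cover), k=2)
assert f(S) == 3   # the best two-element solution covers 3 items; greedy attains it
```

    Monotonicity and submodularity of the lower bounds are exactly what lets this generic routine inherit the approximation guarantee when f is replaced by an influence bound.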

  • Fundamental statistical limits in causal inference

    On May 9, 2025, from 11:00 am to 12:00 pm
    Sivaraman Balakrishnan, Carnegie Mellon University
    E18-304

    Abstract: Despite tremendous methodological advances in causal inference, there remain significant gaps in our understanding of the fundamental statistical limits of estimating various causal estimands from observational data. In this talk I will survey some recent work that aims to make some progress towards closing these gaps. Particularly, I will discuss the fundamental limits of estimating various important causal estimands under classical smoothness assumptions, under new “structure-agnostic” assumptions, in a discrete setup, and under partial smoothness assumptions. Via these fundamental limits we will also attempt to understand the optimality/sub-optimality of simple, practical procedures for estimating causal effects from observational data.

    The talk will be based on joint works with Yanjun Han, Edward Kennedy, James Robins, Larry Wasserman and Tiger Zeng.

    Bio: Sivaraman is a Professor with a joint appointment in the Department of Statistics and Data Science, and the Machine Learning Department at Carnegie Mellon. This year he is a long-term visitor at the Simons Institute, and a visiting scholar in the Department of Statistics at Berkeley. Prior to Carnegie Mellon he was a postdoctoral researcher at UC Berkeley working with Martin Wainwright and Bin Yu. His research interests are broadly in statistical machine learning and algorithmic statistics. Particular areas that he is currently most fascinated by include optimal transport, causal inference, robust statistics and minimax hypothesis testing.

  • Tractable Agreement Protocols

    On May 2, 2025, from 11:00 am to 12:00 pm
    Aaron Roth, University of Pennsylvania
    E18-304

    Abstract: As ML models become increasingly powerful, it is an attractive proposition to use them in important decision making pipelines, in collaboration with human decision makers. But how should a human being and a machine learning model collaborate to reach decisions that are better than either of them could achieve on their own? If the human and the ML model were perfect Bayesians, operating in a setting with a commonly known and correctly specified prior, Aumann’s classical agreement theorem would give us one answer: they could engage in conversation about the task at hand, and their conversation would be guaranteed to converge to (accuracy-improving) agreement. This classical result, however, would require making many implausible assumptions, both about the knowledge and computational power of both parties. We show how to recover similar (and more general) results using only computationally and statistically tractable assumptions, which substantially relax full Bayesian rationality. We further give weak-learning conditions under which this collaboration will result in “information aggregation” — i.e. predictions that are as accurate as could have been made by a model that had access to both parties’ observations, even though neither party in the interaction actually has access to these pooled observations.

    Joint work with Natalie Collina, Varun Gupta, and Surbhi Goel, based on a paper that will appear in STOC 2025, and with Natalie Collina, Ira Globus-Harris, Varun Gupta, Surbhi Goel, and Mirah Shi based on a new preprint.

    Bio: Aaron Roth is the Henry Salvatori Professor of Computer and Cognitive Science, in the Computer and Information Sciences department at the University of Pennsylvania, with a secondary appointment in the Wharton statistics department. He is affiliated with the Warren Center for Network and Data Science, and co-director of the Networked and Social Systems Engineering (NETS) program.  He is also an Amazon Scholar at Amazon AWS. He is the recipient of the Hans Sigrist Prize, a Presidential Early Career Award for Scientists and Engineers (PECASE), an Alfred P. Sloan Research Fellowship, an NSF CAREER award, and research awards from Yahoo, Amazon, and Google.  His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory, learning theory, and machine learning.  Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.” Together with Michael Kearns, he is the author of “The Ethical Algorithm”.

  • How should we do linear regression?

    On April 25, 2025, from 11:00 am to 12:00 pm
    Richard Samworth, University of Cambridge
    E18-304

    Abstract: In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution. At the population level, this fitting process is a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. The procedure is computationally efficient, and we prove that it attains the minimal asymptotic covariance among all convex M-estimators. As an example of a non-log-concave setting, for Cauchy errors, the optimal convex loss function is Huber-like, and our procedure yields an asymptotic efficiency greater than 0.87 relative to the oracle maximum likelihood estimator of the regression coefficients that uses knowledge of this error distribution; in this sense, we obtain robustness without sacrificing much efficiency.
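
    The population-level object can be sketched directly for the Cauchy example: the derivative of the log-density is non-monotone, and its best decreasing approximation under a density-weighted L2 norm (computed below with a hand-rolled pool-adjacent-violators routine; the grid, weighting, and implementation are illustrative assumptions, not the paper's procedure) is flattened, Huber-like, in the tails:

```python
import numpy as np

def pava_increasing(y, w):
    """Weighted least-squares isotonic (increasing) fit via
    pool-adjacent-violators."""
    vals, wts, cnts = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / (wts[-2] + wts[-1])
            wts[-2:] = [wts[-2] + wts[-1]]
            cnts[-2:] = [cnts[-2] + cnts[-1]]
            vals[-2:] = [v]
    return np.repeat(vals, cnts)

t = np.linspace(-8, 8, 1601)
f = 1 / (np.pi * (1 + t ** 2))    # Cauchy density
dlogf = -2 * t / (1 + t ** 2)     # derivative of the log-density: non-monotone

# Best decreasing approximation under the density-weighted L2 norm:
# flip signs, fit increasing, flip back.
psi = -pava_increasing(-dlogf, f)

assert np.all(np.diff(psi) <= 1e-12)   # decreasing, hence an M-estimator
                                       # with a convex loss (derivative -psi)
```

    A decreasing psi is exactly what makes the induced loss (whose derivative is -psi) convex, which is the property the optimality result is stated over.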

    Bio: Richard Samworth obtained his PhD in Statistics from the University of Cambridge in 2004, and has remained in Cambridge since, becoming a full professor in 2013 and the Professor of Statistical Science in 2017.  His main research interests are in high-dimensional and nonparametric statistics; he has developed methods and theory for shape-constrained inference, missing data, subgroup selection, data perturbation techniques (subsampling, the bootstrap, random projections, knockoffs), changepoint estimation and independence testing, amongst others. Richard currently holds a European Research Council Advanced Grant.  He received the COPSS Presidents’ Award in 2018, was elected a Fellow of the Royal Society in 2021 and served as co-editor of the Annals of Statistics (2019-2021).

  • Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data

    On April 18, 2025, from 11:00 am to 12:00 pm
    Dennis Shen, University of Southern California
    E18-304

    Abstract: One dominant approach to evaluate the causal effect of a treatment is through panel data analysis, whereby the behaviors of multiple units are observed over time. The information across time and units motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthetic controls), which exploits cross-sectional patterns. Conventional wisdom often considers the two approaches to be different. We establish this position to be partly false for estimation but generally true for inference. In the absence of any assumptions, we show that both approaches yield algebraically equivalent point estimates for several standard estimators. However, the source of randomness assumed by each approach leads to a distinct estimand and quantification of uncertainty even for the same point estimate. This emphasizes that researchers should carefully consider where the randomness stems from in their data, as it has direct implications for the accuracy of inference.
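
    The estimation-equivalence claim can be seen in a toy example with minimum-norm least squares: with a single post-period, the vertical (synthetic-control-style) and horizontal (unconfoundedness-style) regressions return the same point estimate, because pinv(Z).T = pinv(Z.T). Everything below is simulated and illustrates one instance of the claim, not the paper's full set of estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
T0, N = 8, 5                       # pre-periods, control units (made up)
Z = rng.normal(size=(T0, N))       # control-unit outcomes, pre-period
z_post = rng.normal(size=N)        # control-unit outcomes, one post-period
y_pre = rng.normal(size=T0)        # treated unit's pre-period outcomes

# Vertical regression: weights over control units
w = np.linalg.pinv(Z) @ y_pre
y_hat_vertical = z_post @ w

# Horizontal regression: coefficients over pre-periods
b = np.linalg.pinv(Z.T) @ z_post
y_hat_horizontal = y_pre @ b

assert np.isclose(y_hat_vertical, y_hat_horizontal)  # identical point estimates
```

    The two fits nonetheless treat different quantities as random (rows vs. columns of the panel), which is why their uncertainty quantification diverges even as the point estimates coincide.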

    Bio:
    Dennis Shen is an assistant professor in the Data Sciences and Operations Department at the USC Marshall School of Business. Before joining USC, he was a FODSI postdoctoral fellow at the Simons Institute at UC Berkeley. He also served as a technical consultant for Uber Technologies and TauRx Therapeutics. He has received several recognitions for his work, including the INFORMS George B. Dantzig Dissertation Award (2nd to his esteemed colleague, Somya) and the MIT George Sprowls PhD Thesis Award in Artificial Intelligence & Decision-making.

  • Causal Inference on Outcomes Learned from Text

    On April 11, 2025, from 11:00 am to 12:00 pm
    Jann Spiess, Stanford University
    E18-304

    Abstract:

    (with Iman Modarressi and Amar Venugopal; arxiv.org/abs/2503.00725)

    We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, on which outcomes does the effect operate? And third, how complete is our description of causal effects? To answer all three questions, our approach uses large language models (LLMs) that suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts.
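
    The "costly validation" step can be sketched as a two-sample test on held-out, human-labeled documents: after an LLM proposes a candidate difference on one split, validators label whether the feature appears in each held-out document, and a standard two-proportion z-test yields valid inference. The labels and rates below are simulated stand-ins, not from the paper:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
# 1 = a human validator judges the LLM-proposed feature present (simulated)
treated = rng.binomial(1, 0.6, size=500)   # held-out treated-arm documents
control = rng.binomial(1, 0.4, size=500)   # held-out control-arm documents

p1, p0 = treated.mean(), control.mean()
se = sqrt(p1 * (1 - p1) / len(treated) + p0 * (1 - p0) / len(control))
z = (p1 - p0) / se

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

p_value = 2 * (1 - norm_cdf(abs(z)))
assert p_value < 0.05   # a 20-point gap at n = 500 per arm is easily detected
```

    Because the hypothesis was generated on a disjoint split, the test above is an ordinary pre-registered comparison, which is the role sample splitting plays in the framework.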

    Bio:
    Jann is an econometrician in the OIT group at Stanford GSB. His current research focuses broadly on two related themes: (1) high-dimensional and robust causal inference, including work on using machine learning to improve inferences from randomized trials, robust inference in panel data, synthetic control, matching estimation, highly over-parametrized models, and high-dimensional outcome data; and (2) data-driven decision-making with misaligned objectives, including work on algorithmic fairness, human–AI interaction, the regulation of algorithms, and the design of pre-analysis plans. He holds a PhD in economics from Harvard University.
