Automated Data Summarization for Scalability in Bayesian Inference
November 22 @ 11:00 am - 12:00 pm
Tamara Broderick (MIT)
Many algorithms take prohibitively long to run on modern, large data sets. But even in complex data sets, many data points may be at least partially redundant for some task of interest. So one might instead construct and use a weighted subset of the data (called a “coreset”) that is much smaller than the original data set. Typically, running algorithms on a much smaller data set takes much less computing time, but it remains to be understood whether the output can be widely useful. (1) In particular, can running an analysis on a smaller coreset yield answers close to those from running on the full data set? (2) And can useful coresets be constructed automatically for new analyses, with minimal extra work from the user? We answer in the affirmative for a wide variety of problems in Bayesian inference. We demonstrate how to construct “Bayesian coresets” as an automatic, practical pre-processing step. We prove that our method provides geometric decay in relevant approximation error as a function of coreset size. Empirical analysis shows that our method reduces approximation error by orders of magnitude relative to uniform random subsampling of data. Though we focus on Bayesian methods here, we also show that our construction can be applied in other domains.
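As a rough illustration of what a weighted subset means in practice, the sketch below approximates a full-data log-likelihood with a small weighted sample. It shows only the uniform random subsampling baseline mentioned in the abstract, not the Bayesian coreset construction presented in the talk. The Gaussian model with unit variance, the function names, and the sizes used are all assumptions made for this example.

```python
import numpy as np


def gaussian_log_lik(theta, x):
    """Per-point log-likelihood of x under N(theta, 1) (illustrative model)."""
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)


def weighted_log_lik(theta, x, w):
    """Weighted sum of per-point log-likelihoods."""
    return float(np.sum(w * gaussian_log_lik(theta, x)))


def uniform_coreset(x, m, rng):
    """Baseline only: m points drawn uniformly without replacement,
    each given weight n/m so the weights sum to the full data size n."""
    n = x.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    return x[idx], np.full(m, n / m)


rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=10_000)  # synthetic data, assumed
xc, wc = uniform_coreset(x, 100, rng)

theta = 0.5
full = weighted_log_lik(theta, x, np.ones_like(x))  # all weights 1: full data
approx = weighted_log_lik(theta, xc, wc)            # 100 weighted points
print(full, approx)
```

Any coreset method can be evaluated through the same weighted sum; only the choice of points and weights changes, which is where a construction more careful than uniform sampling would plug in.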
Tamara Broderick is an Associate Professor in the Department of Electrical Engineering and Computer Science at MIT. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). She completed her Ph.D. in Statistics at the University of California, Berkeley in 2014. Previously, she received an AB in Mathematics from Princeton University (2007), a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge (2008), an MPhil by research in Physics from the University of Cambridge (2009), and an MS in Computer Science from the University of California, Berkeley (2013). Her recent research has focused on developing and analyzing models for scalable Bayesian machine learning. She has been awarded an AISTATS Notable Paper Award (2019), an NSF CAREER Award (2018), a Sloan Research Fellowship (2018), an Army Research Office Young Investigator Program award (2017), Google Faculty Research Awards, an Amazon Research Award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the Ph.D. student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average).
The MIT Statistics and Data Science Center hosts guest lecturers from around the world in this weekly seminar.