Stochastics and Statistics Seminar

Variable selection using presence-only data with applications to biochemistry

February 9, 2018 @ 11:00 am - 12:00 pm

Garvesh Raskutti (University of Wisconsin)


Abstract:  In a number of problems, we are presented with positive and unlabelled data, referred to as presence-only responses. The application I present today involves studying the relationship between protein sequence and function; presence-only data arises because, for many experiments, it is impossible to obtain a large set of negative (non-functional) sequences. Furthermore, if the number of variables is large and the goal is variable selection (as in this case), a number of statistical and computational challenges arise due to the non-convexity of the objective. In this talk, I present an algorithm (PUlasso) with provable guarantees for variable selection and classification with presence-only data. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make the algorithm scalable, we introduce two computational speed-ups to the standard EM algorithm. I provide a theoretical guarantee in which we first show that our algorithm converges to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of s log p / n, where s is the sparsity of the true parameter, p is the number of variables, and n is the sample size. I also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate-p settings in terms of classification performance. Finally, I demonstrate that our PUlasso algorithm performs well on a biochemistry example.
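
For readers unfamiliar with the MM framework mentioned in the abstract, the generic principle can be sketched as follows. The symbols f (the objective), Q (the surrogate), and theta are illustrative notation assumed here and are not taken from the talk; the final line restates the abstract's rate, read as a squared l2 error, which is also an assumption about the intended norm.

```latex
% Majorization: the surrogate Q upper-bounds the objective f
% everywhere and touches it at the current iterate \theta_t.
Q(\theta \mid \theta_t) \ge f(\theta) \quad \text{for all } \theta,
\qquad
Q(\theta_t \mid \theta_t) = f(\theta_t)

% Minimization: the next iterate minimizes the surrogate.
\theta_{t+1} \in \arg\min_{\theta} \, Q(\theta \mid \theta_t)

% Consequence: the objective never increases along the iterates.
f(\theta_{t+1}) \le Q(\theta_{t+1} \mid \theta_t)
               \le Q(\theta_t \mid \theta_t) = f(\theta_t)

% Rate stated in the abstract (up to constants), with s the sparsity
% of the true parameter \theta^*.
\| \hat{\theta} - \theta^* \|_2^2 \lesssim \frac{s \log p}{n}
```

EM is the special case in which the surrogate comes from the expected complete-data log-likelihood; the descent chain above is what makes convergence to a stationary point plausible even when f is non-convex.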

Biography:  Since Fall 2013, Garvesh Raskutti has been an Assistant Professor in the Department of Statistics at the University of Wisconsin-Madison. He is also an affiliate of the Departments of Computer Science and of Electrical and Computer Engineering, and of the Wisconsin Institute for Discovery Optimization Group. His research interests include statistical machine learning, optimization, graphical and network modeling, and information theory, with applications to systems biology and neuroscience. Prior to starting at UW, he completed a Master of Engineering at the University of Melbourne in 2008 under the joint supervision of Rodney S. Tucker and Kerry Hinton, and a PhD at UC Berkeley in 2012 under the joint supervision of Martin Wainwright and Bin Yu. He was also a post-doctoral fellow at SAMSI.

MIT Statistics + Data Science Center
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA 02139-4307