Active learning with seed examples and search queries
Abstract: Active learning is a framework for supervised learning that explicitly models, and permits one to control and optimize, the costs of labeling data. The hope is that by carefully selecting which examples to label in an adaptive manner, the number of labels required to learn an accurate classifier is substantially reduced. However, in many learning settings (e.g., when some classes are rare), it is difficult to identify which examples are most informative to label, and existing active learning algorithms are prone to labeling uninformative examples. Based…
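The label-efficiency idea in this abstract can be made concrete with a standard pool-based uncertainty-sampling loop. This is a generic sketch, not the algorithm the talk proposes; `uncertainty_sampling`, `fit_proba`, and the 1-D threshold task are illustrative inventions. On a well-behaved task like this one, querying the most uncertain point behaves like binary search, so few labels suffice.

```python
import math

def uncertainty_sampling(pool, fit_proba, oracle, budget):
    """Generic pool-based uncertainty sampling: refit a model on the
    labels gathered so far, then query the example whose predicted
    class probability is closest to 1/2."""
    labeled, pool = [], sorted(pool)
    for _ in range(budget):
        proba = fit_proba(labeled)                        # refit on current labels
        x = min(pool, key=lambda v: abs(proba(v) - 0.5))  # most uncertain point
        pool.remove(x)
        labeled.append((x, oracle(x)))                    # pay one label
    return labeled

# Toy 1-D threshold task: the label is 1 iff x >= theta, theta unknown.
# The "model" is a sigmoid centered between the largest known 0 and the
# smallest known 1, so uncertainty sampling localizes the threshold fast.
def fit_proba(labeled):
    lo = max([x for x, y in labeled if y == 0], default=0.0)
    hi = min([x for x, y in labeled if y == 1], default=100.0)
    mid, scale = (lo + hi) / 2.0, max(hi - lo, 1e-9)
    return lambda v: 1.0 / (1.0 + math.exp(-(v - mid) / scale))

theta = 37.0
labels = uncertainty_sampling(range(100), fit_proba, lambda x: int(x >= theta), budget=8)
```

With 100 candidate points, 8 adaptively chosen labels pin the threshold down to a unit-length interval, whereas uniform labeling would need far more.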
SDSCon 2017 – Statistics and Data Science Center Conference
As part of the MIT Institute for Data, Systems, and Society (IDSS), the Statistics and Data Science Center (SDSC) is an MIT-wide focal point for advancing academic programs and research activities in statistics and data science. SDSC Day will be a celebration and community-building event for those interested in statistics. Discussions will cover applications of statistics and data science across a wide range of fields and approaches.
Testing properties of distributions over big domains
Abstract: We describe an emerging research direction regarding the complexity of testing global properties of discrete distributions, when given access to only a few samples from the distribution. Such properties might include testing if two distributions have small statistical distance, testing various independence properties, testing whether a distribution has a specific shape (such as monotone decreasing, k-modal, k-histogram, monotone hazard rate,...), and approximating the entropy. We describe bounds for such testing problems whose sample complexities are sublinear in the size…
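The flavor of sublinear-sample testing can be illustrated with the classic collision-based uniformity tester: the collision rate of a sample estimates the collision probability of the distribution, which is minimized (at 1/n) by the uniform distribution. This is a textbook sketch with illustrative thresholds, not a result from the talk.

```python
import random
from itertools import combinations
from math import comb

def collision_rate(samples):
    """Fraction of sample pairs that agree; an unbiased estimate of
    the collision probability sum_i p_i^2."""
    m = len(samples)
    return sum(a == b for a, b in combinations(samples, 2)) / comb(m, 2)

def looks_uniform(samples, n, slack=2.0):
    """Accept iff the collision rate is within a constant factor of the
    uniform value 1/n. The slack factor here is illustrative, not tuned."""
    return collision_rate(samples) <= slack / n

random.seed(0)
n = 1000
uniform = [random.randrange(n) for _ in range(300)]       # 300 << n samples
skewed = [random.randrange(n // 20) for _ in range(300)]  # mass on 5% of the domain
```

Note that 300 samples over a domain of size 1000 already separates the two cases, reflecting the sublinear (here roughly sqrt(n)) sample complexity the abstract alludes to.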
Some related phase transitions in phylogenetics and social network analysis
Abstract: Spin systems on trees have found applications ranging from the reconstruction of phylogenies to the analysis of networks with community structure. A key feature of such processes is the interplay between the growth of the tree and the decay of correlations along it. How the resulting threshold phenomena impact estimation depends on the problem considered. I will illustrate this on two recent results: 1) the critical threshold of ancestral sequence reconstruction by maximum parsimony on general phylogenies and 2)…
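Ancestral sequence reconstruction by maximum parsimony, the subject of the first result, is classically computed with Fitch's small-parsimony algorithm. Below is a minimal single-character sketch on a rooted binary tree; the tree, node names, and `fitch` helper are illustrative, and the talk concerns when such reconstruction succeeds, not how to compute it.

```python
def fitch(tree, root, leaf_states):
    """Fitch's small-parsimony bottom-up pass for a single character:
    intersect the children's candidate-state sets when possible (no
    mutation needed), otherwise take the union at a cost of one mutation.
    Returns (candidate ancestral states at `root`, minimal mutation count)."""
    if root in leaf_states:
        return {leaf_states[root]}, 0
    left, right = tree[root]
    s_l, c_l = fitch(tree, left, leaf_states)
    s_r, c_r = fitch(tree, right, leaf_states)
    common = s_l & s_r
    if common:
        return common, c_l + c_r
    return s_l | s_r, c_l + c_r + 1

# Four leaves, three carrying state "A": parsimony infers "A" at the root
# with a single mutation (e.g. on the branch leading to leaf w).
tree = {"root": ("a", "b"), "a": ("x", "y"), "b": ("z", "w")}
leaves = {"x": "A", "y": "A", "z": "A", "w": "C"}
states, cost = fitch(tree, "root", leaves)
```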
Invariance and Causality
Abstract: Why are we interested in the causal structure of a process? In classical prediction tasks, for example, it seems that no causal knowledge is required. In many situations, however, we are interested in a system's behavior after parts of this system have been changed. Here, causal models become important because they are usually considered invariant under those changes. A causal prediction (which uses only direct causes of the target variable as predictors) remains valid even if we intervene on…
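The invariance idea can be sketched numerically: a predictor set containing the direct causes yields residuals whose distribution does not shift across environments, while a non-causal predictor set (here, the empty set) does. This is a toy in the spirit of invariant causal prediction, not the speaker's procedure; the function names and the data-generating model are illustrative.

```python
import random
import statistics

def residual_means(data, use_x):
    """Per-environment mean residual of a pooled least-squares fit of Y
    on X (or on nothing, if use_x is False). A causal predictor set is
    expected to make these means agree across environments."""
    xs = [x for env in data for x, y in env]
    ys = [y for env in data for x, y in env]
    if use_x:
        mx, my = statistics.mean(xs), statistics.mean(ys)
        a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
        b = my - a * mx
    else:
        a, b = 0.0, statistics.mean(ys)
    return [statistics.mean(y - (a * x + b) for x, y in env) for env in data]

random.seed(1)
def env(x_shift, n=500):
    """Y's mechanism (Y = 2X + noise) is invariant; only X is intervened on."""
    return [(x, 2.0 * x + random.gauss(0.0, 0.1))
            for x in (random.gauss(x_shift, 1.0) for _ in range(n))]

data = [env(0.0), env(3.0)]           # the intervention shifts X between environments
with_x = residual_means(data, use_x=True)
without_x = residual_means(data, use_x=False)
```

Using the direct cause X as predictor leaves near-zero residual means in both environments; dropping it produces residual means that differ sharply, flagging the non-invariant model.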
Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe
Abstract: We consider the problem of bandit optimization, inspired by stochastic optimization and online learning with bandit feedback. In this problem, the objective is to minimize a global, not necessarily cumulative, convex loss function. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques ranging from bandits to convex optimization. We identify slow and fast rates of…
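The deterministic Frank-Wolfe step that the algorithm builds on can be sketched as follows: each iteration moves toward the simplex vertex that minimizes the linearized objective. The bandit version replaces the exact gradient with optimistic confidence-bound estimates, which this illustrative sketch (with exact gradients and an invented toy objective) omits.

```python
def frank_wolfe_simplex(grad, d, T):
    """Frank-Wolfe on the probability simplex: step toward the vertex e_i
    minimizing the linearized objective <grad, e_i>, with the standard
    2/(t+2) step size. The Upper-Confidence variant would replace grad
    with a confidence-bound estimate built from bandit feedback."""
    p = [1.0 / d] * d
    for t in range(1, T + 1):
        g = grad(p)
        i = min(range(d), key=lambda j: g[j])  # best vertex for the linearization
        step = 2.0 / (t + 2)
        p = [(1 - step) * pj for pj in p]      # convex combination with e_i
        p[i] += step
    return p

# Toy objective: f(p) = sum_j (p_j - q_j)^2, whose optimum q lies on the simplex.
q = [0.5, 0.3, 0.2]
grad = lambda p: [2 * (pj - qj) for pj, qj in zip(p, q)]
p_hat = frank_wolfe_simplex(grad, 3, T=2000)
```

The iterate stays on the simplex by construction and converges at the usual O(1/T) rate in objective value; the slow versus fast rates in the abstract concern how bandit gradient noise degrades or preserves this.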
New provable techniques for learning and inference in probabilistic graphical models
Abstract: A common theme in machine learning is succinct modeling of distributions over large domains. Probabilistic graphical models are one of the most expressive frameworks for doing this. The two major tasks involving graphical models are learning and inference. Learning is the task of calculating the "best fit" model parameters from raw data, while inference is the task of answering probabilistic queries for a model with known parameters (e.g. what is the marginal distribution of a subset of variables, after…
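The inference task the abstract defines can be written down as a brute-force sum over all assignments, whose exponential cost is exactly what efficient inference algorithms aim to avoid. A minimal illustrative sketch follows; the factor encoding is an assumption for this example, not a standard API.

```python
from itertools import product

def marginal(factors, n_vars, query_var):
    """Brute-force marginal of one binary variable in a factor graph:
    sum the product of all factor values over every assignment, then
    normalize. Exponential in n_vars -- feasible only for tiny models."""
    scores = [0.0, 0.0]
    for assignment in product([0, 1], repeat=n_vars):
        w = 1.0
        for scope, table in factors:
            w *= table[tuple(assignment[i] for i in scope)]
        scores[assignment[query_var]] += w
    z = sum(scores)                      # partition function
    return [s / z for s in scores]

# A 3-variable chain X0 - X1 - X2 with attractive pairwise potentials.
agree = {(0, 0): 2.0, (1, 1): 2.0, (0, 1): 1.0, (1, 0): 1.0}
factors = [((0, 1), agree), ((1, 2), agree)]
m = marginal(factors, 3, 0)
```

By the flip symmetry of this chain, the marginal of X0 is exactly (1/2, 1/2); belief propagation would compute the same answer in time linear in the chain length.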
Sample complexity of population recovery
Abstract: In this talk we will first consider the general question of estimating a linear functional of a distribution based on noisy samples from it. We discover that the (two-point) LeCam lower bound is in fact achievable by optimizing the bias-variance tradeoff of an empirical-mean-type estimator. Next, we apply this general framework to the specific problem of population recovery. Namely, consider a random poll of sample size n conducted on a population of individuals, where each pollee is asked to…
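A minimal instance of the "linear functional from noisy samples" setup: estimating the fraction of 1s in a population when each reported bit is flipped with a known probability. The debiased empirical mean below is an illustrative special case under that noise model, not the estimator analyzed in the talk.

```python
import random

def debiased_mean(noisy_bits, flip_prob):
    """Empirical-mean estimator of the true fraction p of 1s when each
    reported bit was flipped independently with probability flip_prob:
    E[observed] = p*(1 - eps) + (1 - p)*eps, so invert that affine map."""
    obs = sum(noisy_bits) / len(noisy_bits)
    return (obs - flip_prob) / (1.0 - 2.0 * flip_prob)

random.seed(2)
p_true, eps, n = 0.3, 0.2, 20000
truth = [random.random() < p_true for _ in range(n)]
noisy = [b ^ (random.random() < eps) for b in truth]   # flip each bit w.p. eps
estimate = debiased_mean(noisy, eps)
```

Debiasing inflates the variance by a factor 1/(1 - 2*eps)^2, which is exactly the kind of bias-variance accounting the abstract's LeCam analysis makes precise.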
Walking within growing domains: recurrence versus transience
Abstract: When is simple random walk recurrent on d-dimensional domains that grow in time? For domain growth that is independent of the walk, we review recent progress and related universality conjectures about a sharp recurrence-versus-transience criterion in terms of the growth rate. We compare this with the question of recurrence/transience for time-varying conductance models, where Gaussian heat kernel estimates and evolving sets play an important role. We also briefly contrast such expected universality with examples of the rich…
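As a toy illustration of the model only: a one-dimensional walk confined to an interval growing like sqrt(t), counting returns to the origin. The talk's sharp criteria concern d-dimensional domains and growth exponents, which this sketch does not capture; the function and the boundary rule are illustrative assumptions.

```python
import random

def confined_walk(growth, steps, seed=0):
    """Simple random walk on Z confined to the growing interval
    [-growth(t), growth(t)]: a step that would exit the current domain
    is suppressed (the walk waits in place). Returns the trajectory and
    the number of returns to the origin."""
    rng = random.Random(seed)
    x, returns, path = 0, 0, [0]
    for t in range(1, steps + 1):
        move = rng.choice([-1, 1])
        if abs(x + move) <= growth(t):   # only move if still inside the domain
            x += move
        path.append(x)
        if x == 0:
            returns += 1
    return path, returns

g = lambda t: int(t ** 0.5) + 1          # domain radius growing like sqrt(t)
path, returns = confined_walk(g, 5000)
```

Varying the growth exponent in `g` is the experiment suggested by the abstract's criterion: slow growth keeps the walk near the origin, while fast growth lets it escape like a free walk.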