Separating Estimation from Decision Making in Contextual Bandits
Abstract: The contextual bandit is a sequential decision making problem in which a learner repeatedly selects an action (e.g., a news article to display) in response to a context (e.g., a user’s profile) and receives a reward, but only for the action they selected. Beyond the classic explore-exploit tradeoff, a fundamental challenge in contextual bandits is to develop algorithms that can leverage flexible function approximation to model similarity between contexts, yet have computational requirements comparable to classical supervised learning tasks…