Beyond UCB: statistical complexity and optimal algorithm for non-linear ridge bandits
Abstract: Many existing literature on bandits and reinforcement learning assume a linear reward/value function, but what happens if the reward is non-linear? Two curious phenomena arise for non-linear bandits: first, in addition to the "learning phase" with a standard \Theta(\sqrt(T)) regret, there is an "initialization phase" with a fixed cost determined by the reward function; second, achieving the smallest cost of the initialization phase requires new learning algorithms other than traditional ones such as UCB. For a special family of…