We study the problem of transfer learning in the setting of stochastic linear
bandit tasks. We assume that a low-dimensional linear representation is
shared across the tasks, and study the benefit of learning this representation
in the multi-task learning setting. Following recent results on the design of
stochastic bandit policies, we propose an efficient greedy policy based on
trace norm regularization. It implicitly learns a low-dimensional
representation by encouraging the matrix formed by the task regression vectors
to be of low rank. Unlike previous work in the literature, our policy does not
need to know the rank of the underlying matrix. We derive an upper bound on the
multi-task regret of our policy, which is, up to logarithmic factors, of order
$\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the
number of variables and $N$ the number of rounds per task. We show the benefit
of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each
task independently. We also provide a lower bound on the multi-task regret.
Finally, we corroborate our theoretical findings with preliminary experiments
on synthetic data.
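
A minimal sketch of the idea described above: a greedy multi-task linear bandit policy whose per-task regression vectors form the columns of a matrix estimated with trace-norm (nuclear-norm) regularization, which encourages low rank without requiring knowledge of the rank. This is not the authors' implementation; the proximal-gradient solver, the function names (nuclear_prox, estimate_B, greedy_action), and the regularization weight lam are illustrative assumptions.

import numpy as np

def nuclear_prox(B, tau):
    # Proximal operator of tau * ||.||_* : soft-threshold the singular values of B.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def estimate_B(X_list, y_list, d, T, lam, n_iters=200, step=None):
    # Proximal-gradient solver for
    #   min_B  sum_t ||X_t B[:, t] - y_t||^2 + lam * ||B||_*
    # where column t of the d x T matrix B is the regression vector of task t.
    B = np.zeros((d, T))
    if step is None:
        lips = [np.linalg.norm(X, 2) ** 2 for X in X_list if len(X) > 0]
        step = 1.0 / (2.0 * max(lips)) if lips else 1e-2
    for _ in range(n_iters):
        G = np.zeros_like(B)
        for t, (X, y) in enumerate(zip(X_list, y_list)):
            if len(X) > 0:
                G[:, t] = 2.0 * X.T @ (X @ B[:, t] - y)
        B = nuclear_prox(B - step * G, step * lam)
    return B

def greedy_action(arms, B, t):
    # Greedy choice for task t: the arm with the largest estimated reward.
    return int(np.argmax(arms @ B[:, t]))

# Illustrative run on synthetic data with a rank-r matrix of task vectors
# (hypothetical problem sizes, chosen only for the example).
rng = np.random.default_rng(0)
d, T, r, K, lam = 8, 6, 2, 20, 1.0
B_true = rng.normal(size=(d, r)) @ rng.normal(size=(r, T))      # low-rank tasks
X_list = [np.empty((0, d)) for _ in range(T)]
y_list = [np.empty(0) for _ in range(T)]
for n in range(30):                                             # rounds per task
    B_hat = estimate_B(X_list, y_list, d, T, lam)
    for t in range(T):
        arms = rng.normal(size=(K, d))                          # fresh arm set
        a = greedy_action(arms, B_hat, t)
        reward = arms[a] @ B_true[:, t] + 0.1 * rng.normal()
        X_list[t] = np.vstack([X_list[t], arms[a]])
        y_list[t] = np.append(y_list[t], reward)

Jointly penalizing the nuclear norm of the stacked task matrix couples the tasks, so the estimated regression vectors share a low-dimensional column space even though no rank is specified, which is the mechanism the abstract attributes to the proposed policy.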