While distributional reinforcement learning (DistRL) has been empirically
effective, the question of when and why it is better than vanilla,
non-distributional RL has remained unanswered. This paper explains the benefits
of DistRL through the lens of small-loss bounds, which are instance-dependent
bounds that scale with the optimal achievable cost. In particular, our bounds
converge much faster than those of non-distributional approaches when the
optimal cost is small. As a warmup, we propose a distributional contextual
bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and
empirically outperforms the state-of-the-art on three real-world tasks. In
online RL, we propose a DistRL algorithm that constructs confidence sets using
maximum likelihood estimation. We prove that our algorithm enjoys novel
small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce
the $\ell_1$ distributional eluder dimension, which may be of independent
interest. Finally, in offline RL, we show that pessimistic DistRL enjoys
small-loss PAC bounds that are novel to the offline setting and are more robust
to poor single-policy coverage.
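To make the "small-loss" notion concrete, here is a minimal sketch of the typical shape such bounds take in the literature; the notation is assumed for illustration and is not fixed by the abstract ($L^\star$ denotes the optimal expected cumulative cost, $K$ the number of episodes, and $d$ a complexity measure such as an eluder dimension):
$$
\mathrm{Regret}(K) \;\le\; \tilde{O}\!\left(\sqrt{d\, L^\star\, K} \;+\; d\right).
$$
When $L^\star$ is small, this improves on the worst-case $\tilde{O}(\sqrt{dK})$ rate, degrading toward a nearly constant $\tilde{O}(d)$ term as $L^\star \to 0$.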