Reinforcement learning (RL) has proven to be highly effective in tackling
complex decision-making and control tasks. However, prevalent model-free RL
methods often face severe performance degradation due to the well-known
overestimation issue. In response to this problem, we recently introduced an
off-policy RL algorithm, called distributional soft actor-critic (DSAC or
DSAC-v1), which can effectively improve the value estimation accuracy by
learning a continuous Gaussian value distribution. Nonetheless, standard DSAC
has its own shortcomings, including an occasionally unstable learning process
and a need for task-specific reward scaling, which may hinder its overall
performance and adaptability in certain tasks. This paper further
introduces three important refinements to standard DSAC in order to address
these shortcomings. These refinements consist of critic gradient adjusting,
twin value distribution learning, and variance-based target return clipping.
The modified RL algorithm is named DSAC with three refinements (DSAC-T or
DSAC-v2), and its performance is systematically evaluated on a diverse set of
benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T
surpasses many mainstream model-free RL algorithms, including SAC, TD3,
DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T, unlike
its standard version, ensures a highly stable learning process and delivers
similar performance across varying reward scales.
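To give a concrete sense of one refinement, variance-based target return clipping can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the target return is bounded to within a few standard deviations of the current mean value estimate, and the function name and `bound` parameter are hypothetical.

```python
import numpy as np

def clip_target_return(target_return, q_mean, q_std, bound=3.0):
    """Clip a sampled target return to a variance-based interval.

    Hypothetical sketch: the target is restricted to within `bound`
    standard deviations of the current mean value estimate, which
    limits the influence of outlier returns on the value update.
    The exact clipping rule in DSAC-T may differ.
    """
    lower = q_mean - bound * q_std
    upper = q_mean + bound * q_std
    return float(np.clip(target_return, lower, upper))
```

Intuitively, a clipping interval that scales with the learned return variance adapts to the magnitude of rewards, which is one way such a mechanism could reduce sensitivity to reward scaling.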