Alleviating the estimation bias of deep deterministic policy gradient via co-regularization

Authors:

Highlights:

• Our method dynamically alleviates estimation bias based on the difference between an overestimated and an underestimated learner (a rough illustrative sketch follows this list).

• Theoretical analysis shows that the estimation biases are reduced relative to the baselines.

• Our method achieves the most stable average-reward performance compared with the baselines.
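The highlights only summarize the idea; the exact update rule is defined in the paper itself. As a minimal sketch of the kind of co-regularized bootstrap target such an approach might use, the following PyTorch snippet blends an overestimation-prone and an underestimation-prone value estimate, with the mixing weight driven by their current difference. The function and parameter names (compute_target, beta) are hypothetical and not taken from the paper.

```python
# Minimal, illustrative sketch (not the authors' exact algorithm): one learner
# tends to overestimate (e.g., a DDPG-style single-critic target) and the other
# to underestimate (e.g., a TD3-style clipped double-Q target); the TD target is
# a convex combination whose weight depends on their gap.

import torch


def compute_target(q_over, q_under, reward, not_done, gamma=0.99, beta=1.0):
    """Blend an overestimated and an underestimated bootstrap value.

    q_over, q_under: next-state action values from the two learners, shape (B, 1).
    beta: hypothetical sensitivity of the mixing weight to the estimation gap.
    """
    with torch.no_grad():
        gap = (q_over - q_under).clamp(min=0.0)      # how far the two learners disagree
        w = torch.sigmoid(-beta * gap)               # larger gap -> lean toward the lower estimate
        q_next = w * q_over + (1.0 - w) * q_under    # co-regularized bootstrap value
        return reward + gamma * not_done * q_next


if __name__ == "__main__":
    batch = 4
    q_over = torch.randn(batch, 1) + 0.5             # stand-in for overestimated values
    q_under = q_over - torch.rand(batch, 1)          # stand-in for underestimated values
    reward = torch.randn(batch, 1)
    not_done = torch.ones(batch, 1)
    print(compute_target(q_over, q_under, reward, not_done))
```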

Keywords: Reinforcement learning, Overestimation, Underestimation, Co-training, Deterministic policy gradient

Article history: Received 25 December 2021, Revised 14 May 2022, Accepted 22 June 2022, Available online 28 June 2022, Version of Record 10 July 2022.

DOI: https://doi.org/10.1016/j.patcog.2022.108872