Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance

Authors: W. Bradley Knox, Peter Stone

Abstract

Several studies have demonstrated that reward from a human trainer can be a powerful feedback signal for control-learning algorithms. However, the space of algorithms for learning from such human reward has not yet been explored systematically. Using model-based reinforcement learning from human reward, this article investigates the problem through six experiments, focusing on the relationships between reward positivity, how generally positive a trainer's reward values are; temporal discounting, the extent to which future reward is discounted in value; episodicity, whether task learning occurs in discrete learning episodes rather than in one continuing session; and task performance, the agent's performance on the task the trainer intends to teach. This investigation is motivated by the observation that an agent can pursue different learning objectives, leading to different resulting behaviors. We search for learning objectives that lead the agent to behave as the trainer intends.
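To make the temporal-discounting dimension concrete, the sketch below (illustrative code, not taken from the article) shows how a discount factor gamma changes the value an agent assigns to a fixed sequence of human-delivered rewards; the function name discounted_return and the example reward sequence are assumptions made for illustration.

```python
# Illustrative sketch: how a discount factor gamma changes the value an
# agent assigns to a sequence of human-delivered rewards.
# gamma = 0 is fully myopic (only immediate reward counts); as gamma
# approaches 1, future reward is weighted almost as heavily as
# immediate reward.

def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical trainer feedback: one negative reward now, positive after.
rewards = [-1.0, 1.0, 1.0, 1.0]
for gamma in (0.0, 0.7, 0.99):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.2f}")
```

Under myopic discounting (gamma = 0) this sequence is valued at -1.0, while near-undiscounted valuation (gamma = 0.99) values it at roughly +1.94, which is one way the choice of learning objective can lead the same feedback to produce different behavior.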

Keywords: Reinforcement learning, Modeling user behavior, End-user programming, Human–agent interaction, Interactive machine learning, Human teachers

Article history: Received 19 August 2013, Revised 2 October 2014, Accepted 29 March 2015, Available online 2 April 2015.

Article URL: https://doi.org/10.1016/j.artint.2015.03.009