Online fitted policy iteration based on extreme learning machines

Abstract:

Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains limited for several reasons, most notably the large amount of data the agent needs to learn useful policies and the poor scalability to high-dimensional problems caused by the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that makes progress on both fronts. OFPI is based on a semi-batch scheme that increases convergence speed by reusing data and enables the use of global approximators by reformulating value function approximation as a standard supervised learning problem. The proposed method has been empirically evaluated on three benchmark problems. In the experiments, OFPI employed a neural network trained with the extreme learning machine algorithm to approximate the value functions. The results demonstrate that OFPI remains stable with a global function approximator and outperforms two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network.
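The abstract names two ingredients without detailing them: an extreme learning machine (ELM) as a global value-function approximator, and a fitted, semi-batch scheme that turns the value-function update into a standard supervised regression problem. The sketch below illustrates those two ideas only under generic assumptions; it is not the paper's OFPI algorithm, and the names (`ELMRegressor`, `fitted_value_update`, the hidden-layer size, the ridge term) are illustrative choices, not taken from the source.

```python
import numpy as np

class ELMRegressor:
    """Single-hidden-layer network trained in the extreme learning machine style:
    input-to-hidden weights are random and fixed; only the linear output weights
    are solved in closed form (ridge-regularized least squares)."""

    def __init__(self, n_hidden=100, reg=1e-3, seed=None):
        self.n_hidden = n_hidden
        self.reg = reg
        self.rng = np.random.default_rng(seed)
        self.W = None      # random input-to-hidden weights
        self.b = None      # random hidden biases
        self.beta = None   # learned hidden-to-output weights

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_features = X.shape[1]
        self.W = self.rng.normal(size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Closed-form ridge regression for the output layer: cheap per sweep.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


def fitted_value_update(transitions, q_model, n_actions, gamma=0.99):
    """One fitted (semi-batch style) sweep: build Bellman targets from stored
    transitions and refit the global approximator on them as ordinary
    supervised regression."""
    s, a, r, s_next, done = transitions            # arrays of equal length
    X = np.hstack([s, np.eye(n_actions)[a]])       # (state, one-hot action) inputs
    if q_model.beta is None:
        targets = r                                # first sweep: no bootstrap yet
    else:
        # Greedy bootstrap: max over actions of the current Q estimate.
        q_next = np.column_stack([
            q_model.predict(
                np.hstack([s_next, np.tile(np.eye(n_actions)[i], (len(r), 1))])
            )
            for i in range(n_actions)
        ])
        targets = r + gamma * (1.0 - done) * q_next.max(axis=1)
    return q_model.fit(X, targets)
```

Because the ELM only solves a linear system for its output weights, refitting on each reuse of the stored transitions stays inexpensive, which is the kind of property a semi-batch, data-reusing scheme relies on; how OFPI actually organizes these sweeps is described in the paper itself.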

Keywords: Reinforcement learning, Sequential decision-making, Fitted policy iteration, Extreme learning machine

Article history: Received 2 July 2015, Revised 29 December 2015, Accepted 8 March 2016, Available online 14 March 2016, Version of Record 2 April 2016.

DOI: https://doi.org/10.1016/j.knosys.2016.03.007