TD(λ) converges with probability 1

Authors: Peter Dayan, Terrence J. Sejnowski

Abstract

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.
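The abstract describes temporal-difference learning as stochastic approximation of future outcomes from sampled experience. As a rough illustration only (not the paper's construction), the sketch below shows tabular TD(λ) prediction with accumulating eligibility traces; the environment interface `env_step`, the episode count, and the step-size/discount/trace parameters are all hypothetical choices for the example.

```python
import numpy as np

def td_lambda_prediction(env_step, n_states, n_episodes=500,
                         alpha=0.1, gamma=0.95, lam=0.9, seed=0):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    `env_step(state, rng)` is a hypothetical interface returning
    (reward, next_state, done) under a fixed (stationary) policy.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                 # value estimates for each state
    for _ in range(n_episodes):
        e = np.zeros(n_states)             # eligibility traces, reset per episode
        s, done = 0, False
        while not done:
            r, s_next, done = env_step(s, rng)
            # TD error: sampled one-step return minus the current estimate
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]
            e[s] += 1.0                    # accumulate trace for the visited state
            V += alpha * delta * e         # update every state in proportion to its trace
            e *= gamma * lam               # decay traces toward zero
            s = s_next
    return V
```

With λ = 0 this reduces to one-step TD(0); with λ = 1 it behaves like a Monte Carlo estimate of the return. Convergence guarantees such as the one proved in this paper typically also require appropriately decreasing step sizes rather than the fixed `alpha` used in this sketch.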

Keywords: reinforcement learning, temporal differences, Q-learning


Paper URL: https://doi.org/10.1007/BF00993978