TD(λ) converges with probability 1
Authors: Peter Dayan, Terrence J. Sejnowski
Abstract
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.
Keywords: reinforcement learning, temporal differences, Q-learning
Paper URL: https://doi.org/10.1007/BF00993978
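
To illustrate the kind of update the abstract refers to, below is a minimal sketch of tabular TD(λ) prediction with accumulating eligibility traces and a decaying, Robbins-Monro style step size (the sort of stochastic-approximation condition a convergence result of this type relies on). The random-walk chain, reward placement, and step-size schedule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def td_lambda(transitions, terminal_rewards, n_states, lam=0.9, gamma=1.0,
              episodes=500, start_state=0, seed=0):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    transitions:      dict state -> (list of next states, list of probabilities);
                      states absent from the dict are treated as absorbing.
    terminal_rewards: dict absorbing state -> reward (0.0 elsewhere).
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                      # value estimates
    for k in range(episodes):
        alpha = 1.0 / (1.0 + k)                 # decaying step size (illustrative schedule)
        e = np.zeros(n_states)                  # eligibility traces
        s = start_state
        while s in transitions:
            nxt, probs = transitions[s]
            s_next = int(rng.choice(nxt, p=probs))
            absorbing = s_next not in transitions
            r = terminal_rewards.get(s_next, 0.0) if absorbing else 0.0
            v_next = 0.0 if absorbing else V[s_next]
            delta = r + gamma * v_next - V[s]   # TD error
            e[s] += 1.0                         # accumulate trace for the visited state
            V += alpha * delta * e              # update every state in proportion to its trace
            e *= gamma * lam                    # decay traces
            s = s_next
    return V

# Example (assumed setup): a 5-state random walk on states 0..4, bounded by
# absorbing states -1 and 5, with reward 1 for exiting on the right.
chain = {s: ([s - 1, s + 1], [0.5, 0.5]) for s in range(5)}
values = td_lambda(chain, {5: 1.0}, n_states=5, lam=0.9, start_state=2)
print(values)   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]
```

The decaying step size is one simple choice satisfying the usual stochastic-approximation conditions (square-summable but not summable); the paper's analysis concerns convergence of updates of this general form, not this particular schedule.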