TD(λ) converges with probability 1

Authors: Peter Dayan, Terrence J. Sejnowski

Abstract

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.
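The abstract describes temporal-difference learning as stochastic approximation of future outcomes from sampled experience. As a rough illustration only (not the paper's construction), the sketch below shows tabular TD(λ) prediction with accumulating eligibility traces; the environment interface `env_step`, the episode count, and the step-size/discount/trace parameters are all hypothetical choices for the example.

```python
import numpy as np

def td_lambda_prediction(env_step, n_states, n_episodes=500,
                         alpha=0.1, gamma=0.95, lam=0.9, seed=0):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    `env_step(state, rng)` is a hypothetical interface returning
    (reward, next_state, done) under a fixed (stationary) policy.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                 # value estimates for each state
    for _ in range(n_episodes):
        e = np.zeros(n_states)             # eligibility traces, reset per episode
        s, done = 0, False
        while not done:
            r, s_next, done = env_step(s, rng)
            # TD error: sampled one-step return minus the current estimate
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]
            e[s] += 1.0                    # accumulate trace for the visited state
            V += alpha * delta * e         # update every state in proportion to its trace
            e *= gamma * lam               # decay traces toward zero
            s = s_next
    return V
```

With λ = 0 this reduces to one-step TD(0); with λ = 1 it behaves like a Monte Carlo estimate of the return. Convergence guarantees such as the one proved in this paper typically also require appropriately decreasing step sizes rather than the fixed `alpha` used in this sketch.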

Keywords: reinforcement learning, temporal differences, Q-learning


Paper URL: https://doi.org/10.1007/BF00993978