Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
Authors: Nicolas Meuleau, Paul Bourgine
Abstract
This paper presents an action-selection technique for reinforcement learning in stationary Markovian environments. The technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks; in particular, applying it directly leads to low-quality algorithms that are easily misled by particular configurations of the environment. The second principle is introduced to eliminate this drawback. It consists of treating the local measures of uncertainty as rewards and back-propagating them with dynamic programming or temporal-difference mechanisms. This makes it possible to reproduce global-scale reasoning about uncertainty using only local measures of it. Numerical simulations clearly show the effectiveness of these proposals.
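To make the second principle concrete, the following is a minimal sketch (not the paper's actual algorithm) of a Q-learning agent that maintains a second value table into which a local uncertainty bonus is back-propagated by the same temporal-difference rule, and that acts greedily on the combined value. The count-based bonus, the `kappa` weighting, and all class and parameter names are illustrative assumptions, not taken from the paper, where the local measure is derived from bandit theory.

```python
import random
from collections import defaultdict

class ExploringQLearner:
    """Sketch: Q holds exploitation values; E back-propagates local uncertainty bonuses."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, kappa=1.0):
        self.actions = actions
        self.alpha, self.gamma, self.kappa = alpha, gamma, kappa
        self.Q = defaultdict(float)   # action values learned from real rewards
        self.E = defaultdict(float)   # back-propagated uncertainty values
        self.n = defaultdict(int)     # visit counts per (state, action)

    def local_uncertainty(self, s, a):
        # Local measure of uncertainty: a simple count-based bonus is used here
        # as a stand-in for the bandit-based measure developed in the paper.
        return 1.0 / (1 + self.n[(s, a)]) ** 0.5

    def select_action(self, s):
        # Act greedily on the combined value Q + kappa * E.
        return max(self.actions,
                   key=lambda a: self.Q[(s, a)] + self.kappa * self.E[(s, a)])

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        # Standard Q-learning update on the real reward.
        best_next_q = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next_q - self.Q[(s, a)])
        # The local uncertainty measure is treated as a reward and back-propagated
        # with the same temporal-difference mechanism.
        bonus = self.local_uncertainty(s, a)
        best_next_e = max(self.E[(s_next, b)] for b in self.actions)
        self.E[(s, a)] += self.alpha * (bonus + self.gamma * best_next_e - self.E[(s, a)])
```

Because the bonuses propagate through the value function, states whose successors are still poorly explored remain attractive even when their own immediate uncertainty has vanished, which is the global-scale effect the abstract describes.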
Keywords: reinforcement learning, exploration vs. exploitation dilemma, Markov decision processes, bandit problems
DOI: https://doi.org/10.1023/A:1007541107674