Probably Approximately Correct (PAC) Exploration in Reinforcement Learning
by Alexander L. Strehl
Abstract:
Reinforcement Learning (RL) is a powerful paradigm within the Artificial Intelligence and Machine Learning communities. The general problem is how to enable an agent (computer program, robot, etc…) to maximize an external reward signal by acting in an unknown environment. In this thesis, we study the problem as mathematically modeled by a finite state and finite action Markov Decision Process (MDP). In particular, our focus is on the problem of exploration: how does an agent determine whether to act to gain new information (explore) or to act consistently with past experience to maximize reward (exploit). We study both fundamental types of RL algorithms: those that explicitly model the dynamics of their environment (model-based) and those that learn only a value or utility function over the states of the environment (model-free). We develop and analyze algorithms of both types that are Probably Approximately Correct for MDPs (PAC-MDP), meaning that, with arbitrarily high probability, they (provably) act near-optimally almost all of the time. We also develop lower bounds and provide a general framework that applies to most of the results in this thesis and to other results that generalize the finite MDP assumption.