Combining Exploration and Control in Reinforcement Learning: The Convergence of SARSA

Michael L. Littman

ABSTRACT

This paper describes a recently developed reinforcement-learning algorithm known as SARSA (also ``modified Q-learning'' or ``Q-bar''), which learns the value of its own behavior. The algorithm combines a reward-seeking component (to try to maximize its performance) and a primitive form of exploration (to ensure that the space of possible behaviors is searched adequately). The paper shows, for the first time, that SARSA converges to a well-defined value function, and that the asymptotic performance of the algorithm is optimal over the space of allowed behaviors.
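
As an illustration of what "learning the value of its own behavior" can look like in practice, here is a minimal tabular sketch of the standard SARSA update, Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a)), where a' is the action the learner actually takes next. Several pieces are assumptions made for this example rather than details taken from the paper: epsilon-greedy selection stands in for the "primitive form of exploration", and the five-state corridor environment, the function names (`epsilon_greedy`, `sarsa_update`, `chain_step`, `run_sarsa`), and all parameter values are hypothetical.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one.

    Assumption: epsilon-greedy is used here as one simple exploration rule.
    """
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(q[state])
    # Break ties at random so the untrained learner does not get stuck.
    return rng.choice([a for a in range(n_actions) if q[state][a] == best])


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, terminal):
    """Apply one SARSA update using the action actually chosen in s_next."""
    target = r if terminal else r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def chain_step(state, action, n_states):
    """Hypothetical toy environment: a corridor of n_states cells.

    Action 1 moves right, action 0 moves left (staying put at cell 0).
    Reaching the rightmost cell yields reward 1 and ends the episode.
    """
    if action == 1:
        nxt = min(state + 1, n_states - 1)
    else:
        nxt = max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def run_sarsa(n_states=5, n_actions=2, episodes=500,
              alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run tabular SARSA on the toy corridor and return the Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q, s, n_actions, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = chain_step(s, a, n_states)
            # On-policy: choose the next action first, then learn from it.
            a_next = epsilon_greedy(q, s_next, n_actions, epsilon, rng)
            sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, done)
            s, a = s_next, a_next
    return q


if __name__ == "__main__":
    for state, values in enumerate(run_sarsa()):
        print(state, [round(v, 3) for v in values])
```

In this sketch the same epsilon-greedy rule both selects actions and supplies the next action used in the update target, so the learned values describe the exploring behavior itself rather than a purely greedy policy. The constant step size and fixed epsilon are simplifications chosen for brevity and are not meant to reflect the conditions analyzed in the paper.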