Learning Policies for Partially Observable Environments: Scaling Up

Michael L. Littman (mlittman@cs.brown.edu)
Anthony R. Cassandra (arc@cs.brown.edu)
Leslie Pack Kaelbling (lpk@cs.brown.edu)

Department of Computer Science
Brown University
Providence, RI 02912-1910

Abstract

Partially observable Markov decision processes (POMDPs) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of POMDPs is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of POMDPs, this paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small POMDPs taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains.