Planning Under Uncertainty
I will make a number of presentations to introduce students to the background material necessary to read the research papers---only undergraduate-level computer science knowledge and basic probability theory and calculus will be assumed. I think there are useful contributions to be made by researchers in AI, algorithms, complexity, numerical methods, and systems, and I think people in all these areas would find some useful information in the seminar. Everyone's welcome!
This seminar will introduce students to an exciting area of research that draws upon results from operations research (Markov decision processes), machine learning (reinforcement learning), and traditional AI (planning) to attack problems with great scientific and commercial potential. We will read and discuss a handful of recent papers that will give students an appreciation for the state of the art. Students will undertake a group research project to solidify their understanding of the basic concepts and perhaps to contribute to the field.
Michael L. Littman
The purpose of this seminar is to explore some middle ground between these two well-studied extremes with the hope of understanding how we might create systems that can reason efficiently about plans in complex, uncertain worlds. We will review the foundational results from AI and OR and read a series of papers written over the last few years that have begun to bridge the gap.
My approach to organizing the seminar will be to try to keep the assigned reading to a minimum and to ask students to concentrate on understanding the state of the field and on identifying the important open research questions.
Nonetheless, we need some common ground to begin. I will assume that students are familiar with programming (any language), algorithm analysis (big-O notation), calculus (derivatives of multivariate functions), and probability theory (conditional probabilities).
Michael Lederman Littman. Algorithms for Sequential Decision Making. Ph.D. dissertation and Technical Report CS-96-09, Brown University, Department of Computer Science, Providence, RI, March 1996. Chapter 1: Introduction. (local postscript, local bibliography postscript)
In a sense, these algorithms completely solve the problem of planning under uncertainty. The rest of the seminar is concerned with solving MDPs more efficiently by exploiting additional structure present in some instances.
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI--95), Montreal, Quebec, Canada, 1995. (postscript)
Homework: Represent a complex domain as an MDP.
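As a concrete starting point, here is a minimal sketch of the kind of representation the homework asks for, together with value iteration, one of the classical dynamic-programming algorithms mentioned above. The two-state machine-repair domain, the dictionary encoding, and all names are invented for illustration.

```python
# A toy MDP: transitions[s][a] = list of (next_state, probability),
# rewards[s][a] = immediate reward. The domain itself is made up.
transitions = {
    "working": {"use":  [("working", 0.9), ("broken", 0.1)],
                "idle": [("working", 1.0)]},
    "broken":  {"fix":  [("working", 0.8), ("broken", 0.2)],
                "idle": [("broken", 1.0)]},
}
rewards = {
    "working": {"use": 10.0, "idle": 0.0},
    "broken":  {"fix": -5.0, "idle": 0.0},
}

def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            best = max(
                rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
                for a in transitions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(V, transitions, rewards, gamma=0.9):
    """Extract the policy that is greedy with respect to a value function."""
    return {s: max(transitions[s],
                   key=lambda a: rewards[s][a]
                   + gamma * sum(p * V[s2] for s2, p in transitions[s][a]))
            for s in transitions}

V = value_iteration(transitions, rewards)
policy = greedy_policy(V, transitions, rewards)
```

In this domain the greedy policy is the intuitive one: use the machine while it works, fix it when it breaks.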
Prioritized sweeping uses a heuristic to measure when updating the value of a state is likely to be important for quickly computing an approximately optimal solution.
Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103--130, 1993. (compressed postscript)
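The idea can be sketched in the planning setting, where the model is known (the paper itself is concerned with learning an estimated model online). In this illustrative sketch, the priority heuristic is simply the magnitude of a state's Bellman error, and the function names and MDP encoding (transitions[s][a] as a list of (next_state, probability) pairs) are invented.

```python
import heapq

def prioritized_sweeping(transitions, rewards, gamma=0.9, theta=1e-6, max_updates=10000):
    """Back up the state with the largest Bellman error first; after each
    update, re-examine that state's predecessors, whose errors may have grown."""
    V = {s: 0.0 for s in transitions}

    # predecessors[s] = states that can transition into s with positive probability
    preds = {s: set() for s in transitions}
    for s in transitions:
        for a in transitions[s]:
            for s2, p in transitions[s][a]:
                if p > 0:
                    preds[s2].add(s)

    def backup(s):
        return max(rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
                   for a in transitions[s])

    heap = []  # max-priority queue via negated priorities
    for s in transitions:
        err = abs(backup(s) - V[s])
        if err > theta:
            heapq.heappush(heap, (-err, s))

    for _ in range(max_updates):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        new_v = backup(s)
        if abs(new_v - V[s]) <= theta:
            continue  # stale queue entry; this state is already accurate
        V[s] = new_v
        for sp in preds[s]:
            err = abs(backup(sp) - V[sp])
            if err > theta:
                heapq.heappush(heap, (-err, sp))
    return V
```

On goal-directed problems, value changes propagate backward from the states where reward first appears, which is exactly where the priority queue directs the computation.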
Real-time dynamic programming attempts to find a good approximate policy quickly by focusing updates on states that are likely to be visited.
Andrew G. Barto, S. J. Bradtke and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81--138, 1995. (compressed postscript)
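A rough sketch of the core loop, under the same assumptions as above: the model is known, transitions[s][a] is a list of (next_state, probability) pairs, and the function name and parameter choices are invented. Only states actually reached by greedy trials from the start state ever get backed up.

```python
import random

def rtdp(transitions, rewards, start, gamma=0.9, episodes=200, horizon=50, seed=0):
    """Real-time dynamic programming sketch: repeatedly simulate greedy
    trajectories from the start state, applying a Bellman backup at each
    state as it is visited."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in transitions}

    def q(s, a):
        return rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])

    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            a = max(transitions[s], key=lambda act: q(s, act))  # greedy action
            V[s] = q(s, a)                                      # backup at s only
            # sample the next state from the chosen action's outcome distribution
            outcomes = [s2 for s2, p in transitions[s][a]]
            weights = [p for s2, p in transitions[s][a]]
            s = rng.choices(outcomes, weights=weights)[0]
    return V
```

States that the greedy policy never reaches keep their initial values, which is the source of RTDP's savings over a full sweep.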
Another approach is to explicitly produce a good partial policy by identifying states that are likely to be visited and solving a smaller MDP.
Jonathan Tash and Stuart Russell. Control strategies for a stochastic planner. In Proceedings of the 12th National Conference on Artificial Intelligence, 1079--1085, 1994. (postscript)
This insight can be exploited by exchanging the classical table-based method for representing value functions for one that uses a function approximator (for example, a neural net) to map state description vectors to values. A wildly successful example of this is TD-Gammon; this work makes use of several important background ideas, including gradient descent and temporal difference learning, that we will need to look at as well.
Richard S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1):9--44, 1988. (postscript)
Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58--68, 1995. (html)
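As a small illustration of these two ingredients together, here is TD(0) with a linear function approximator on a five-state random walk, a standard test problem from Sutton's paper. The function name, step size, and episode count are arbitrary choices; with one-hot features the linear case reduces to the tabular one, but the update rule shown is the general gradient-descent form.

```python
import random

def td0_linear(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) prediction with a linear approximator on a random walk over
    states 0..6, where 0 and 6 are terminal and entering 6 pays reward 1.
    The true value of interior state i is i/6."""
    rng = random.Random(seed)
    n = 7

    def features(s):
        x = [0.0] * n  # one-hot feature vector for state s
        x[s] = 1.0
        return x

    w = [0.0] * n      # weights of the linear value function

    def v(s):
        return sum(wi * xi for wi, xi in zip(w, features(s)))

    for _ in range(episodes):
        s = 3          # every episode starts in the middle
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == 6 else 0.0
            target = r + (0.0 if s2 in (0, 6) else gamma * v(s2))
            delta = target - v(s)             # TD error
            x = features(s)
            for i in range(n):                # gradient step: w += alpha*delta*grad
                w[i] += alpha * delta * x[i]
            s = s2
    return [v(s) for s in range(n)]
```

Swapping the one-hot features for overlapping or learned features gives the genuinely approximate setting that TD-Gammon and the papers below operate in.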
Another interesting application of TD lambda and neural networks applied to MDP-like problems is Crites and Barto's elevator controller.
Robert H. Crites and Andrew G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996. The MIT Press. (compressed postscript)
An even simpler and similarly successful example for cellular phone channel assignments is based on TD(0) and a linear function approximator.
Satinder Singh and Dimitri Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in Advances in Neural Information Processing Systems 9, 1997. The MIT Press. (citeseer page)
Tesauro's work is difficult to generalize because it simultaneously addresses many unsolved problems. More recent work has begun to tease apart the effect of using function approximation in dynamic programming from the use of the temporal difference algorithm.
Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, 1995. The MIT Press. (compressed postscript, Recent workshop on value function approximation)
These results were not convincing to everyone.
Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, 1996. The MIT Press. (postscript)
Some of the most interesting recent work has concerned theoretical results on when function approximation will and will not result in a convergent algorithm. Results exist concerning gradient descent methods and averaging methods.
Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, 30--37, 1995. Morgan Kaufmann. (compressed postscript, html)
Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, 261--268, 1995. Morgan Kaufmann. (compressed postscript)
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1/2/3): 59--94, 1996. (local postscript)
Homework: Propose a research project.
David McAllester and David Rosenblitt. Systematic nonlinear planning. In Proceedings of the 9th National Conference on Artificial Intelligence, 1991. (postscript, LISP code)
In the last two years, planning algorithms have been proposed that differ substantially from the classic planners. Although unconventional, these planners have been shown empirically to run much faster, by up to several orders of magnitude. Blum and Furst's algorithm views planning as a type of graph search, while Kautz and Selman reduce planning to a Boolean satisfiability problem.
Avrim Blum and Merrick Furst. Fast planning through planning graph analysis. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 1636--1642, 1995. (extended version in compressed postscript)
Henry Kautz and Bart Selman. Pushing the envelope: Planning, propositional logic, and stochastic search. In Proceedings of the 13th National Conference on Artificial Intelligence, 1996. (postscript)
In spite of recent algorithmic advances, the traditional view of planning ignores uncertainty. Uncertainty can be introduced gently by assuming a deterministic domain with some randomness added in.
Jim Blythe. Planning with external events. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994. (author's page)
The Buridan system introduces a more general representation for stochastic STRIPS operators and extends partial order planning to stochastic domains. Its representation is equivalent in expressiveness to MDPs.
Nicholas Kushmerick, Steve Hanks and Daniel S. Weld. An algorithm for probabilistic planning. Artificial Intelligence, 76(1-2): 239--286, 1995. (compressed postscript)
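To make the representational idea concrete, here is a hypothetical encoding of a stochastic STRIPS-style operator: an operator has preconditions and a set of probabilistic outcomes, each with its own add and delete lists over propositions. This is an illustrative sketch, not Buridan's actual syntax; the gripper domain and all names are made up.

```python
import random
from dataclasses import dataclass

@dataclass
class Outcome:
    prob: float          # probability this outcome occurs
    add: frozenset       # propositions made true
    delete: frozenset    # propositions made false

@dataclass
class StochasticOperator:
    name: str
    preconditions: frozenset
    outcomes: list       # Outcome objects whose probs sum to 1

    def applicable(self, state):
        return self.preconditions <= state

    def sample(self, state, rng):
        """Apply the operator to a state (a frozenset of propositions),
        sampling one outcome according to its probability."""
        r, acc = rng.random(), 0.0
        for o in self.outcomes:
            acc += o.prob
            if r <= acc:
                return (state - o.delete) | o.add
        return state

# Example: a gripper that picks up a block but slips 10% of the time.
pickup = StochasticOperator(
    name="pickup",
    preconditions=frozenset({"hand-empty", "block-on-table"}),
    outcomes=[
        Outcome(0.9, add=frozenset({"holding-block"}),
                     delete=frozenset({"hand-empty", "block-on-table"})),
        Outcome(0.1, add=frozenset(), delete=frozenset()),  # slip: no change
    ],
)
```

Since every outcome is a deterministic STRIPS effect chosen by a die roll, a set of such operators induces exactly the transition function of an MDP over truth assignments, which is the expressiveness equivalence noted above.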
The Buridan system has been expanded so that the plan representation is more powerful (though less powerful than a policy-like representation).
Denise Draper, Steve Hanks and Dan Weld. Probabilistic planning with information gathering and contingent execution. Technical Report 93-12-04, University of Washington, Department of Computer Science, Seattle, WA, December, 1993. (compressed postscript)
An area of intense interest (but remarkably little work!) is combining direct manipulation of STRIPS-type actions with a dynamic-programming-based algorithm. Several papers adopt the view that function approximation is a form of "abstraction," the form of which can be derived automatically from a propositional representation of the planning problem.
Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. To appear in Artificial Intelligence. (postscript)
Craig Boutilier, Richard Dearden and Moises Goldszmidt. Exploiting structure in policy construction. To appear in Proceedings of the International Joint Conference on Artificial Intelligence, 1995. (postscript)
Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. To appear in Proceedings of the Thirteenth International Conference on Machine Learning, 1996. (postscript) Slides: postscript
Summary Slides: postscript
We'll deal with partially observable Markov decision processes and how to solve them.
Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Efficient dynamic-programming updates in partially observable Markov decision processes. Submitted to Operations Research. Also available as Brown University Technical Report CS-95-19. (abstract, local postscript)
Slides (on POMDPs from recent talk): postscript
Slides (on algorithms): postscript
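A core subroutine in any POMDP method is tracking a belief state, a probability distribution over the underlying states, by Bayes' rule: b'(s') is proportional to O(o | s', a) times the sum over s of T(s' | s, a) b(s). Here is a minimal sketch; the dict-of-dicts probability tables and the tiger-style example are illustrative conventions, not any particular paper's formulation.

```python
def belief_update(belief, action, observation, T, O):
    """One step of Bayesian belief tracking for a POMDP.
    T[a][s][s2] = transition probability, O[a][s2][o] = observation probability."""
    states = list(belief)
    unnorm = {}
    for s2 in states:
        predicted = sum(T[action][s].get(s2, 0.0) * belief[s] for s in states)
        unnorm[s2] = O[action][s2].get(observation, 0.0) * predicted
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("observation impossible under this belief")
    return {s2: p / total for s2, p in unnorm.items()}

# A tiger-behind-a-door example: listening does not move the tiger, and the
# correct observation is heard 85% of the time.
T = {"listen": {"tiger-left":  {"tiger-left": 1.0},
                "tiger-right": {"tiger-right": 1.0}}}
O = {"listen": {"tiger-left":  {"hear-left": 0.85, "hear-right": 0.15},
                "tiger-right": {"hear-left": 0.15, "hear-right": 0.85}}}
b = belief_update({"tiger-left": 0.5, "tiger-right": 0.5}, "listen", "hear-left", T, O)
```

Because the belief is a sufficient statistic for the history, a POMDP can be viewed as an MDP over belief states, which is the starting point for the dynamic-programming algorithms in the paper above.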
Markov games are closely related to MDPs.
Anne Condon. On algorithms for simple stochastic games. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 13:51--71, 1993.
Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning, Proceedings of the Eleventh International Conference on Machine Learning, pages 157--163, 1994. (local postscript)
Lastly, here's a recent paper that combines ideas from the GraphPlan work with ideas on abstraction.
Craig Boutilier, Ronen Brafman and Chris Geib. Structured reachability analysis for solving Markov decision processes, November 1996. (local postscript (draft, do not circulate))
Daphne Koller and Avi Pfeffer. Representations and solutions for game-theoretic problems. Preliminary version appeared in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, August 1995, pages 1185--1192. (postscript)
Michael Trick and Stanley Zin. A linear programming approach to solving stochastic dynamic programs, 1993. (postscript)
We've talked about using the GALA system as a basis for a general declarative language for specifying MDPs. Another option is a standard proposed by Rich Sutton (no longer available, replaced by RLFramework at Alberta in 2005).
We wrote several papers about the projects we worked on.
Michael S. Fulkerson, Michael L. Littman, and Greg A. Keim. Speeding Safely: Multi-criteria optimization in probabilistic planning. Submitted Student Abstract for AAAI-97. (postscript)
Stephen M. Majercik and Michael L. Littman. Reinforcement learning for selfish load balancing in a distributed memory environment. In Proceedings of the Second International Conference on Computational Intelligence and Neuroscience, 1997 (forthcoming). (abstract, postscript)
We're working on one more group paper on the load balancing stuff.
Last modified: Fri Jan 10 08:46:21 EST 1997 by Michael Littman, email@example.com