Fall 1996

Planning Under Uncertainty

[ Background | Grading | Outline | Projects ]

Class is over... come see the papers we wrote!

Evolving Syllabus...


This seminar will introduce students to an exciting area of research that draws upon results from operations research (Markov decision processes), machine learning (reinforcement learning), and traditional AI (planning) to attack problems with great scientific and commercial potential. We will read and discuss a handful of recent papers that will give students an appreciation for the state of the art. Students will undertake a group research project to solidify their understanding of the basic concepts and perhaps to contribute to the field.

I will make a number of presentations to introduce students to the background material necessary to read the research papers---only undergraduate-level computer science knowledge and basic probability theory and calculus will be assumed. I think there are useful contributions to be made by researchers in AI, algorithms, complexity, numerical methods, and systems, and I think people in all these areas would find some useful information in the seminar. Everyone's welcome!


Michael L. Littman


Research in planning, making a sequence of choices to achieve some goal, has been a mainstay of artificial intelligence (AI) for many years. Traditionally, the decision-making models that have been studied admit no uncertainty whatsoever---every aspect of the world that is relevant to the generation and execution of a plan is known in advance. In contrast, work in operations research (OR) has focussed on the uncertainty of actions but uses an impoverished representation for specifying planning problems.

The purpose of this seminar is to explore some middle ground between these two well-studied extremes with the hope of understanding how we might create systems that can reason efficiently about plans in complex, uncertain worlds. We will review the foundational results from AI and OR and read a series of papers written over the last few years that have begun to bridge the gap.


There are basically two or three papers on different approaches to the same basic problem that I'd like people to read and understand. These papers are quite recent and represent active areas of research that have been maturing quite quickly over the last few years. As a result, to get a deep appreciation for this work, we will need to read a number of papers that introduce the necessary background.

My approach to organizing the seminar will be to try to keep the assigned reading to a minimum and to ask students to concentrate on understanding the state of the field and on identifying the important open research questions.


The seminar should be accessible to any advanced computer science student. My goal is to introduce critical background material as the need arises.

Nonetheless, we need some common ground to begin. I will assume that students are familiar with programming (any language), algorithm analysis (big O notation), calculus (derivatives of multivariate functions), and probability theory (conditional probabilities).


In addition to exploring the question of how to create plans that are effective in uncertain environments, there are a number of other important topics that students will learn about in this seminar: students will be exposed to Markov decision processes, dynamic programming, linear programming, temporal difference learning, supervised function approximation, gradient descent and neural networks, STRIPS rules, and partial order planning.


The grading policy is designed to stimulate students to think about some of the important issues in this area. Class grade will be based on:


Organization Meeting

On Thursday, September 5th (D344, 2:30-3:00), we'll get together to discuss the best time to schedule the class. If you can't make the meeting, please send me email.


What is planning? What is uncertainty? What are some applications of planning under uncertainty? I'll lay out the space of issues and describe the part of the space we'll explore.

Michael Lederman Littman. Algorithms for Sequential Decision Making. Ph.D. dissertation and Technical Report CS-96-09, Brown University, Department of Computer Science, Providence, RI, March 1996. Chapter 1: Introduction. (local postscript, local bibliography postscript)

Markov Decision Processes

I'll introduce the MDP model, which is a formal specification for the particular problem we will be examining. I'll describe the fundamental concepts (states, actions, transitions, rewards, discounting, horizons), results (existence and dominance of optimal value function, optimal greedy policies), and the algorithms (value iteration, policy iteration, modified policy iteration, linear programming).

In a sense, these algorithms completely solve the problem of planning under uncertainty. The rest of the seminar is concerned with solving MDPs more efficiently by exploiting additional structure present in some instances.
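To make the dynamic-programming machinery concrete, here is a minimal sketch of value iteration and the greedy-policy extraction described above. The two-state "wet/dry" domain, its actions, and all the numbers are invented for illustration; they do not come from any of the assigned readings.

```python
GAMMA = 0.9  # discount factor

# A tiny hypothetical MDP: transitions[s][a] = list of
# (next_state, probability, reward) triples.
transitions = {
    "wet": {"wait": [("wet", 0.9, 0.0), ("dry", 0.1, 0.0)],
            "wipe": [("dry", 1.0, -1.0)]},
    "dry": {"wait": [("dry", 1.0, 1.0)],
            "wipe": [("dry", 1.0, 0.0)]},
}

def value_iteration(transitions, gamma=GAMMA, tol=1e-6):
    """Iterate Bellman backups until the largest change drops below tol."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman backup: best one-step lookahead value over actions.
            best = max(sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                       for outcomes in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(V, transitions, gamma=GAMMA):
    # An optimal policy is greedy with respect to the optimal value function.
    return {s: max(actions,
                   key=lambda a: sum(p * (r + gamma * V[s2])
                                     for s2, p, r in actions[a]))
            for s, actions in transitions.items()}
```

In this toy domain, waiting in "dry" earns reward 1 forever (value 1/(1-0.9) = 10), so the greedy policy wipes immediately in "wet" to reach it, despite the one-time cost of -1.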

Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI--95), Montreal, Quebec, Canada, 1995. (postscript)

Homework: Represent a complex domain as an MDP.

Slides: postscript

Accelerating Solutions to MDPs

One class of algorithms for solving MDPs more quickly restricts value-iteration updates to states that are likely to benefit from additional computational resources.

Prioritized sweeping uses a heuristic to estimate which states' value updates are most likely to matter for computing an approximately optimal solution quickly.

Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103, 1993. (compressed postscript)
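The core queue-driven idea can be sketched in a few lines: back up the state at the front of a priority queue, then push its predecessors with priority proportional to how much the backup changed the value. This is only the skeleton of the idea, on a made-up deterministic five-state chain, not the full algorithm from the paper.

```python
import heapq

GAMMA = 0.9
N = 5  # states 0..4; a single action moves right; state 4 is absorbing

def step(s):
    """Deterministic dynamics: reward 1 is earned on entering state N-1."""
    if s == N - 1:
        return s, 0.0  # absorbing, no further reward
    return s + 1, (1.0 if s + 1 == N - 1 else 0.0)

# Precompute predecessor lists so value changes can be propagated backward.
preds = {s: [] for s in range(N)}
for s in range(N):
    s2, _ = step(s)
    if s2 != s:
        preds[s2].append(s)

def prioritized_sweeping(max_backups=100):
    V = [0.0] * N
    # Seed the queue with every state; heapq is a min-heap, so negate
    # priorities to pop the largest first.
    queue = [(-1.0, s) for s in range(N)]
    heapq.heapify(queue)
    backups = 0
    while queue and backups < max_backups:
        _, s = heapq.heappop(queue)
        s2, r = step(s)
        new_v = r + GAMMA * V[s2]  # Bellman backup for the single action
        change = abs(new_v - V[s])
        V[s] = new_v
        backups += 1
        if change > 1e-9:
            for p in preds[s]:
                # A predecessor's value stands to change by about this much.
                heapq.heappush(queue, (-change, p))
    return V
```

On this chain the queue drives the updates backward from the rewarding state, so the exact values (0.9 to the appropriate power) emerge in a single pass rather than after repeated full sweeps.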

Real-time dynamic programming attempts to find a good approximate policy quickly by focussing updates on states that are likely to be visited.

Andrew G. Barto, S. J. Bradtke and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81--138, 1995. (compressed postscript)

Another approach is to explicitly produce a good partial policy by identifying states that are likely to be visited and solving a smaller MDP.

Jonathan Tash and Stuart Russell. Control strategies for a stochastic planner. In Proceedings of the 12th National Conference on Artificial Intelligence, 1079--1085, 1994. (postscript)

Slides: postscript

Value Function Approximation

The above approaches represent states as being completely unrelated objects. In many domains, states can be described in such a way that similar states (from a policy or value standpoint) have similar representations. For example, any attempt to create a transition function for the game of backgammon would likely make use of a board-based representation of states.

This insight can be exploited by exchanging the classical table-based method for representing value functions for one that uses a function approximator (for example, a neural net) to map state-description vectors to values. A wildly successful example of this is TD-Gammon; this work makes use of several important background ideas, including gradient descent and temporal difference learning, that we will need to look at as well.
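The combination of temporal difference learning and gradient descent on a parameterized value function can be sketched compactly. The example below runs TD(0) with a linear approximator on Sutton's classic five-state random walk; the one-hot features (which make the linear case reduce to a table) and all constants are illustrative choices, not anything from TD-Gammon.

```python
import random

random.seed(0)
N = 5                      # nonterminal states 1..N; 0 and N+1 are terminal
ALPHA, GAMMA = 0.05, 1.0   # step size and (undiscounted) discount factor

def features(s):
    # One-hot features: with these the linear learner recovers exact TD(0);
    # richer feature vectors would generalize across similar states.
    x = [0.0] * N
    if 1 <= s <= N:
        x[s - 1] = 1.0
    return x

def td0(episodes=10000):
    w = [0.0] * N  # weight vector; value(s) = dot(w, features(s))
    for _ in range(episodes):
        s = 3                                # start in the middle
        while 1 <= s <= N:
            s2 = s + random.choice((-1, 1))  # unbiased random walk
            r = 1.0 if s2 == N + 1 else 0.0  # reward only for exiting right
            v = sum(wi * xi for wi, xi in zip(w, features(s)))
            v2 = sum(wi * xi for wi, xi in zip(w, features(s2)))
            delta = r + GAMMA * v2 - v       # TD error
            for i, xi in enumerate(features(s)):
                w[i] += ALPHA * delta * xi   # gradient-style weight update
            s = s2
    return w
```

The learned weights approach the true exit probabilities 1/6, 2/6, ..., 5/6, with some residual noise from the constant step size.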

Richard S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44, 1988. (postscript)

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58--68, 1995. (html)

Slides: postscript

Another interesting application of TD(lambda) and neural networks to MDP-like problems is Crites and Barto's elevator controller.

Robert H. Crites and Andrew G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996. The MIT Press. (compressed postscript)

An even simpler and similarly successful example for cellular phone channel assignments is based on TD(0) and a linear function approximator.

Satinder Singh and Dimitri Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in Advances in Neural Information Processing Systems 9, 1997. The MIT Press. (citeseer page)

Tesauro's work is difficult to generalize because it simultaneously addresses many unsolved problems. More recent work has begun to tease apart the effect of using function approximation in dynamic programming from the use of the temporal difference algorithm.

Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, 1995. The MIT Press. (compressed postscript, Recent workshop on value function approximation)

These results were not convincing to everyone.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, 1996. The MIT Press. (postscript)

Slides: postscript

Some of the most interesting recent work has concerned theoretical results on when function approximation will and will not result in a convergent algorithm. Results exist concerning gradient descent methods and averaging methods.

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, 30--37, 1995. Morgan Kaufmann. (compressed postscript, html)

Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, 261--268. 1995. Morgan Kaufmann. (compressed postscript)

John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1/2/3): 59--94, 1996. (local postscript)

Homework: Propose a research project.

Stochastic Planning

The work on function approximation attempts to exploit structure in the state space but treats actions as black-box transformations from states to distributions over states. A promising alternative is to use symbolic descriptions of the actions to reason about entire classes of state-to-state transitions all at once. This is the approach taken in AI planning.
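A STRIPS-style action description captures this symbolic view: a precondition list, an add list, and a delete list describe a whole family of state-to-state transitions with one rule. The sketch below uses a blocks-world-flavored operator of my own invention purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class StripsOp:
    """A STRIPS operator over sets of ground propositions."""
    name: str
    pre: frozenset     # propositions that must hold to apply the operator
    add: frozenset     # propositions made true by the operator
    delete: frozenset  # propositions made false by the operator

    def applicable(self, state):
        return self.pre <= state  # preconditions are a subset of the state

    def apply(self, state):
        assert self.applicable(state)
        return (state - self.delete) | self.add

# One rule covers every state in which block A sits clear on the table.
pickup = StripsOp("pickup-A",
                  pre=frozenset({"clear(A)", "ontable(A)", "handempty"}),
                  add=frozenset({"holding(A)"}),
                  delete=frozenset({"ontable(A)", "handempty"}))

state = frozenset({"clear(A)", "ontable(A)", "handempty"})
next_state = pickup.apply(state)
```

The point of the representation is that a planner can reason about the operator's add and delete lists directly, without enumerating the (possibly enormous) set of individual state transitions it induces.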

David McAllester and David Rosenblitt. Systematic nonlinear planning. In Proceedings of the 9th National Conference on Artificial Intelligence, 1991. (postscript, LISP code)

In the last two years, planning algorithms have been proposed that differ substantially from the classic planners. Although unconventional, these planners have been shown empirically to result in much shorter running times (up to several orders of magnitude faster). Blum and Furst's algorithm views planning as a type of graph search while Kautz and Selman reduce planning to a boolean satisfiability problem.

Avrim Blum and Merrick Furst. Fast planning through planning graph analysis. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 1636--1642, 1995. (extended version in compressed postscript)

Henry Kautz and Bart Selman. Pushing the envelope: Planning, propositional logic, and stochastic search. In Proceedings of the 13th National Conference on Artificial Intelligence, 1996. (postscript)

Slides: postscript

In spite of recent algorithmic advances, the traditional view of planning ignores uncertainty. Uncertainty can be introduced gently by assuming a deterministic domain with some randomness added in.

Jim Blythe. Planning with external events. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994. (author's page)

The Buridan system introduces a more general representation for stochastic STRIPS operators and extends partial order planning to stochastic domains. Its representation is equivalent in expressiveness to MDPs.

Slides: postscript

Nicholas Kushmerick, Steve Hanks and Daniel S. Weld. An algorithm for probabilistic planning. Artificial Intelligence, 76(1-2): 239--286, 1995. (compressed postscript)

The Buridan system has been expanded so that the plan representation is more powerful (though less powerful than a policy-like representation).

Denise Draper, Steve Hanks and Dan Weld. Probabilistic planning with information gathering and contingent execution. Technical Report 93-12-04, University of Washington, Department of Computer Science, Seattle, WA, December, 1993. (compressed postscript)

Slides: postscript

An area of intense interest (but remarkably little work!) is combining direct manipulation of STRIPS-type actions with a dynamic-programming-based algorithm. Several papers adopt the view that function approximation is a form of "abstraction," the form of which can be derived automatically from a propositional representation of the planning problem.

Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. To appear in Artificial Intelligence. (postscript)

Craig Boutilier, Richard Dearden and Moises Goldszmidt. Exploiting structure in policy construction. To appear in Proceedings of the International Joint Conference on Artificial Intelligence, 1995. (postscript)

Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. To appear in Proceedings of the Thirteenth International Conference on Machine Learning, 1996. (postscript) Slides: postscript

Summary Slides: postscript

Advanced Topics

If we make unexpectedly fast progress through the core topics, there are a number of interesting issues we could explore including: hierarchical solutions to MDPs, partially observable MDPs, solving games.

We'll deal with partially observable Markov decision processes and how to solve them.

Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Efficient dynamic-programming updates in partially observable Markov decision processes. Submitted to Operations Research. Also available as Brown University Technical Report CS-95-19. (abstract, local postscript)

Slides (on POMDPs from recent talk): postscript

Slides (on algorithms): postscript

Markov games are closely related to MDPs.

Anne Condon. On algorithms for simple stochastic games. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 13:51--71, 1993.

Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning, Proceedings of the Eleventh International Conference on Machine Learning, pages 157--163, 1994. (local postscript)

Slides: postscript

Lastly, here's a recent paper that combines ideas from the GraphPlan work with ideas on abstraction.

Craig Boutilier, Ronen Brafman and Chris Geib. Structured reachability analysis for solving Markov decision processes, November 1996. (local postscript (draft, do not circulate))

Slides: postscript

Other Stuff

Here are some papers I've recently found that might be useful to discuss. The first gives a notation for representing multi-player games of incomplete information that might be useful in defining a similar notation for MDPs. The second describes how to solve very large (and even continuous state) MDPs efficiently using linear programming.

Daphne Koller and Avi Pfeffer. Representations and solutions for game-theoretic problems. Preliminary version appeared in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, August 1995, pages 1185--1192. (postscript)

Michael Trick and Stanley Zin. A linear programming approach to solving stochastic dynamic programs, 1993. (postscript)

Project Ideas

The major thrust of the seminar will be to undertake a group project exploring some facet of the problem of planning under uncertainty. We're starting to make some progress on a domain for later project development: Stephen Majercik has written up his ideas on a load-balancing MDP, which are accessible from his home page.

We've talked about using the GALA system as a basis for a general declarative language for specifying MDPs. Another option is a standard proposed by Rich Sutton (no longer available, replaced by RLFramework at Alberta in 2005).

We wrote several papers about the projects we worked on.

Michael S. Fulkerson, Michael L. Littman, and Greg A. Keim. Speeding Safely: Multi-criteria optimization in probabilistic planning. Submitted Student Abstract for AAAI-97 (postscript)

Stephen M. Majercik and Michael L. Littman. Reinforcement learning for selfish load balancing in a distributed memory environment. In Proceedings of the Second International Conference on Computational Intelligence and Neuroscience, 1997 (forthcoming). (abstract, postscript)

We're working on one more group paper on the load balancing stuff.

Last modified: Fri Jan 10 08:46:21 EST 1997 by Michael Littman, mlittman@cs.duke.edu