## Fall 1996
## Planning Under Uncertainty

I will make a number of presentations to introduce students to the background material necessary to read the research papers---only undergraduate-level computer science knowledge and basic probability theory and calculus will be assumed. I think there are useful contributions to be made by researchers in AI, algorithms, complexity, numerical methods, and systems, and I think people in all these areas would find some useful information in the seminar. Everyone's welcome!

- Office: D209 LSRC
- Phone: 660-6537
- Email: mlittman@cs.duke.edu
- Office hours: TBA

The purpose of this seminar is to explore some middle ground between two well-studied extremes (classical AI planning and the Markov decision processes of operations research) with the hope of understanding how we might create systems that can reason efficiently about plans in complex, uncertain worlds. We will review the foundational results from AI and OR and read a series of papers written over the last few years that have begun to bridge the gap.

My approach to organizing the seminar will be to try to keep the assigned reading to a minimum and to ask students to concentrate on understanding the state of the field and on identifying the important open research questions.

Nonetheless, we need some common ground to begin. I will assume that students are familiar with programming (any language), algorithm analysis (big-O notation), calculus (derivatives of multivariate functions), and probability theory (conditional probabilities).

Grades will be based on:

- class participation (25%),
- short homework assignments (25%), and
- a final project (50%).

Michael Lederman Littman. Algorithms for Sequential Decision Making. Ph.D. dissertation and Technical Report CS-96-09, Brown University, Department of Computer Science, Providence, RI, March 1996. Chapter 1: Introduction. (local postscript, local bibliography postscript)

In a sense, these algorithms completely solve the problem of planning under uncertainty. The rest of the seminar is concerned with solving MDPs more efficiently by exploiting additional structure present in some instances.
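For concreteness, value iteration (one of the classical dynamic-programming algorithms for MDPs) can be sketched in a few lines; the two-state MDP below is a made-up illustration, not an example from the reading.

```python
# Value iteration on a made-up two-state MDP.
# States: 0, 1.  Actions: 0, 1.
# T[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward.
T = {
    0: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.9), (1, 0.1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
gamma = 0.9  # discount factor

def backup(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

V = {s: 0.0 for s in T}
for _ in range(1000):  # iterate the Bellman operator to (near) convergence
    V = {s: max(backup(V, s, a) for a in T[s]) for s in T}

# The greedy policy with respect to the converged values is optimal.
policy = {s: max(T[s], key=lambda a: backup(V, s, a)) for s in T}
print(policy, V)
```

Here state 1 pays off for staying put, so the optimal policy takes the risky action in state 0 to reach it.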

Michael L. Littman, Thomas L. Dean, and Leslie Pack
Kaelbling. On the complexity of solving Markov decision problems. In
*Proceedings of the Eleventh Annual Conference on Uncertainty in
Artificial Intelligence (UAI--95)*, Montreal, Quebec, Canada,
1995. (postscript)

Homework: Represent a complex domain as an MDP.

Slides: postscript

Prioritized sweeping uses a heuristic to estimate which state-value updates are most important for computing an approximately optimal solution quickly, and performs those updates first.
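A minimal sketch of the idea, assuming a known model and a made-up two-state MDP (the paper's setting also involves learning the model): keep a priority queue of states keyed by the magnitude of their Bellman error, always back up the highest-priority state, and then re-examine its predecessors.

```python
import heapq

# Prioritized sweeping, known-model sketch on a made-up two-state MDP.
# T[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward.
T = {
    0: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.9), (1, 0.1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
gamma, theta = 0.9, 1e-6  # discount factor, priority threshold

def backup(V, s):
    return max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
               for a in T[s])

# Predecessor sets: which states can transition into s?
pred = {s: set() for s in T}
for s in T:
    for a in T[s]:
        for s2, p in T[s][a]:
            if p > 0:
                pred[s2].add(s)

V = {s: 0.0 for s in T}
queue = []  # max-heap via negated priorities
for s in T:
    err = abs(backup(V, s) - V[s])
    if err > theta:
        heapq.heappush(queue, (-err, s))

while queue:
    _, s = heapq.heappop(queue)
    V[s] = backup(V, s)
    for sp in pred[s]:  # a change at s can only matter to predecessors of s
        err = abs(backup(V, sp) - V[sp])
        if err > theta:
            heapq.heappush(queue, (-err, sp))
print(V)
```

When the queue empties, every state's Bellman error is below the threshold, so the values are within theta/(1 - gamma) of optimal.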

Andrew W. Moore and Christopher G. Atkeson. Prioritized
sweeping: Reinforcement learning with less data and less real
time. *Machine Learning*, 13(1):103--130, 1993. (compressed
postscript)

Real-time dynamic programming attempts to find a good approximate policy quickly by focusing updates on states that are likely to be visited.
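A trial-based sketch, again on a made-up two-state MDP: values are initialized optimistically (an admissible heuristic, which the convergence results require), trajectories are simulated from the start state, and only the visited states are backed up.

```python
import random

# Real-time dynamic programming, trial-based sketch on a made-up MDP.
# T[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward.
T = {
    0: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.9), (1, 0.1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
gamma = 0.9
random.seed(0)

def q(V, s, a):
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

V = {s: 1.0 / (1.0 - gamma) for s in T}  # optimistic: Rmax / (1 - gamma)
for trial in range(200):
    s = 0  # start state of each trial
    for step in range(50):  # truncated trials
        a = max(T[s], key=lambda act: q(V, s, act))  # greedy action
        V[s] = q(V, s, a)                            # back up the visited state
        r, cum = random.random(), 0.0                # sample the successor
        for s2, p in T[s][a]:
            cum += p
            if r <= cum:
                s = s2
                break
print(V)
```

States the greedy policy never reaches are simply never updated, which is the source of RTDP's savings on large problems.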

Andrew G. Barto, S. J. Bradtke and Satinder P. Singh.
Learning to act using real-time dynamic programming. *Artificial
Intelligence*, 72(1):81--138, 1995. (compressed
postscript)

Another approach is to explicitly produce a good partial policy by identifying states that are likely to be visited and solving a smaller MDP.

Jonathan Tash and Stuart Russell. Control strategies for a
stochastic planner. In *Proceedings of the 12th National
Conference on Artificial Intelligence*, 1079--1085, 1994. (postscript)

Slides: postscript

This insight can be exploited by exchanging the classical table-based representation of value functions for one that uses a function approximator (for example, a neural network) to map state-description vectors to values. A wildly successful example of this is TD-Gammon; this work makes use of several important background ideas, including gradient descent and temporal difference learning, that we will need to look at as well.
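A minimal sketch of the two ingredients together (gradient-descent updates and temporal-difference learning), using a linear function approximator on a five-state bounded random walk, the kind of prediction problem studied in Sutton's paper below. The one-hot features make the linear approximator degenerate into a table, but the update rule is the general gradient form.

```python
import random

# TD(lambda) with a linear function approximator on a bounded random walk.
# States 1..5; walks start at 3 and terminate at 0 (payoff 0) or 6 (payoff 1).
# The true value of state i is i/6.
random.seed(0)
n, alpha, lam = 5, 0.02, 0.8

def features(s):
    x = [0.0] * n  # one-hot feature vector for state s
    x[s - 1] = 1.0
    return x

def value(w, s):
    return sum(wi * xi for wi, xi in zip(w, features(s)))

w = [0.0] * n  # weights of the linear value function
for episode in range(10000):
    s, e = 3, [0.0] * n  # start state, eligibility trace
    while 0 < s < n + 1:
        x = features(s)
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == n + 1 else 0.0
        v2 = value(w, s2) if 0 < s2 < n + 1 else 0.0
        delta = r + v2 - value(w, s)                  # TD error (undiscounted)
        e = [lam * ei + xi for ei, xi in zip(e, x)]   # decay traces, add gradient
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
        s = s2
print([round(wi, 2) for wi in w])
```

Swapping the one-hot features for overlapping or learned features gives genuine generalization, and also all of the stability questions taken up later in the seminar.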

Richard S. Sutton. Learning to predict by the method of
temporal differences. *Machine Learning*, 3(1):9--44, 1988. (postscript)

Gerald Tesauro. Temporal difference learning and TD-Gammon.
*Communications of the ACM*, 38(3):58--68, 1995. (html)

Slides: postscript

Another interesting application of TD(λ) and neural networks to MDP-like problems is Crites and Barto's elevator controller.

Robert H. Crites and Andrew G. Barto. Improving
elevator performance using reinforcement learning. In
D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, editors, *Advances
in Neural Information Processing Systems 8*, 1996. The MIT Press.
(compressed
postscript)

An even simpler and similarly successful example for cellular phone channel assignments is based on TD(0) and a linear function approximator.

Satinder Singh and Dimitri Bertsekas.
Reinforcement learning for dynamic channel allocation in cellular
telephone systems. To appear in *Advances in Neural Information
Processing Systems 9*, 1997. The MIT Press.
(
citeseer page)

Tesauro's work is difficult to generalize because it simultaneously addresses many unsolved problems. More recent work has begun to tease apart the effect of using function approximation in dynamic programming from the use of the temporal difference algorithm.

Justin A. Boyan and Andrew W. Moore. Generalization in
reinforcement learning: Safely approximating the value function. In
G. Tesauro, D. S. Touretzky and T. K. Leen, editors, *Advances in
Neural Information Processing Systems 7*, 1995. The MIT
Press. (compressed
postscript, Recent
workshop on value function approximation)

These results were not convincing to everyone.

Richard S. Sutton. Generalization in reinforcement learning:
Successful examples using sparse coarse coding. In *Advances in
Neural Information Processing Systems 8*, 1996. The MIT Press.
(postscript)

Slides: postscript

Some of the most interesting recent work has concerned theoretical results on when function approximation will and will not result in a convergent algorithm. Results exist concerning gradient descent methods and averaging methods.

Leemon Baird. Residual algorithms: Reinforcement
learning with function approximation. In Armand Prieditis and Stuart
Russell, editors, *Proceedings of the Twelfth International
Conference on Machine Learning*, 30--37, 1995. Morgan Kaufmann.
(compressed
postscript, html)

Geoffrey J. Gordon. Stable function approximation in dynamic
programming. In Armand Prieditis and Stuart Russell, editors,
*Proceedings of the Twelfth International Conference on Machine
Learning*, 261--268. 1995. Morgan Kaufmann. (compressed
postscript)

John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. *Machine Learning*, 22(1/2/3):59--94, 1996. (local postscript)

Homework: Propose a research project.

David McAllester and David Rosenblitt. Systematic nonlinear
planning. In *Proceedings of the 9th National Conference on
Artificial Intelligence*, 1991. (postscript, LISP
code)

In the last two years, planning algorithms have been proposed that differ substantially from the classic planners. Although unconventional, these planners have been shown empirically to run up to several orders of magnitude faster. Blum and Furst's algorithm views planning as a type of graph search, while Kautz and Selman reduce planning to a Boolean satisfiability problem.
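The satisfiability reduction can be illustrated with a toy one-step encoding (the domain and axioms here are made up for illustration, not taken from the paper): propositions are indexed by time, the initial state, goal, effect, and explanatory frame axioms become clauses, and the action variables set true in any satisfying assignment constitute a plan.

```python
from itertools import product

# A toy "planning as satisfiability" encoding.
# Variables: home0 (at home, time 0), drive0 (drive action, time 0),
# work1 (at work, time 1).  Each clause is a list of (var, polarity) pairs.
HOME0, DRIVE0, WORK1 = 0, 1, 2
clauses = [
    [(HOME0, True)],                    # initial state: at home at time 0
    [(WORK1, True)],                    # goal: at work at time 1
    [(DRIVE0, False), (WORK1, True)],   # effect axiom: drive0 -> work1
    [(WORK1, False), (DRIVE0, True)],   # explanatory frame: work1 -> drive0
]

def satisfies(assign, clauses):
    """True iff every clause has at least one literal matching the assignment."""
    return all(any(assign[v] == pol for v, pol in c) for c in clauses)

# Brute-force "SAT solver": every model of the CNF corresponds to a plan.
plans = [assign for assign in product([False, True], repeat=3)
         if satisfies(assign, clauses)]
print(plans)  # drive0 is true in every model, so the plan is to drive
```

A real system hands the clauses to a fast stochastic or systematic SAT solver instead of enumerating assignments, but the encoding is the point.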

Avrim Blum and Merrick Furst. Fast planning
through planning graph analysis. In *Proceedings of the 14th
International Joint Conference on Artificial Intelligence
(IJCAI)*, pages 1636--1642, 1995. (extended
version in compressed postscript)

Henry Kautz and Bart Selman. Pushing the
envelope: Planning, propositional logic, and stochastic search. In
*Proceedings of the 13th National Conference on Artificial
Intelligence*, 1996. (postscript)

Slides: postscript

In spite of recent algorithmic advances, the traditional view of planning ignores uncertainty. Uncertainty can be introduced gently by assuming a deterministic domain with some randomness added in.

Jim Blythe. Planning with external events. In
*Proceedings of the Tenth Conference on Uncertainty in Artificial
Intelligence*, 1994.
(author's
page)

The Buridan system introduces a more general representation for stochastic STRIPS operators and extends partial order planning to stochastic domains. Its representation is equivalent in expressiveness to MDPs.

Slides: postscript

Nicholas Kushmerick, Steve Hanks and Daniel S. Weld. An
algorithm for probabilistic planning. *Artificial
Intelligence*, 76(1-2): 239--286, 1995. (compressed
postscript)

The Buridan system has been extended with a more expressive plan representation (though one still less powerful than a policy-like representation).

Denise Draper, Steve Hanks and Dan Weld. Probabilistic planning with information gathering and contingent execution. Technical Report 93-12-04, University of Washington, Department of Computer Science, Seattle, WA, December, 1993. (compressed postscript)

Slides: postscript

An area of intense interest (but remarkably little work!) is combining direct manipulation of STRIPS-type actions with a dynamic-programming-based algorithm. Several papers adopt the view that function approximation is a form of "abstraction," the form of which can be derived automatically from a propositional representation of the planning problem.

Richard Dearden and Craig Boutilier. Abstraction
and approximate decision theoretic planning. To appear in
*Artificial Intelligence*. (postscript)

Craig Boutilier, Richard Dearden and Moises
Goldszmidt. Exploiting structure in policy construction. In
*Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence*, 1995. (postscript)

Craig Boutilier and Richard Dearden. Approximating value
trees in structured dynamic programming. In *Proceedings
of the Thirteenth International Conference on Machine Learning*,
1996. (postscript)

Slides: postscript

Summary Slides: postscript

We'll deal with partially observable Markov decision processes and how to solve them.

Michael L. Littman, Anthony R. Cassandra, and
Leslie Pack Kaelbling. Efficient dynamic-programming updates in
partially observable Markov decision processes. Submitted to
*Operations Research*. Also available as Brown University
Technical Report CS-95-19. (abstract,
local
postscript)

Slides (on POMDPs from recent talk): postscript

Slides (on algorithms): postscript

Markov games generalize MDPs to multiple agents with possibly competing goals; in the zero-sum case, many of the MDP algorithms and results carry over.
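Condon's simple stochastic games can be solved by a value-iteration sweep much like the one for MDPs. In this made-up three-vertex game, each vertex belongs to the max player, to the min player, or is an "average" (coin-flip) vertex, and a vertex's value is the probability of reaching the 1-sink under optimal play; iterating upward from all-zero values converges to the game value here.

```python
# Value iteration for a made-up simple stochastic game (Condon's model).
# Each internal vertex has two successors and is a max, min, or average
# vertex; play ends at the sinks "one" (payoff 1) and "zero" (payoff 0).
succ = {"a": ("b", "c"), "b": ("a", "one"), "c": ("zero", "one")}
kind = {"a": "max", "b": "min", "c": "avg"}

V = {v: 0.0 for v in succ}
V["one"], V["zero"] = 1.0, 0.0

for _ in range(200):  # in-place sweeps; values rise monotonically from zero
    for v in succ:
        x, y = V[succ[v][0]], V[succ[v][1]]
        if kind[v] == "max":
            V[v] = max(x, y)
        elif kind[v] == "min":
            V[v] = min(x, y)
        else:  # average vertex: a fair coin chooses the successor
            V[v] = 0.5 * (x + y)
print(V)
```

The max player cannot do better than moving to the coin-flip vertex: choosing b lets the min player trap play in the a--b cycle, which never reaches the 1-sink.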

Anne Condon. On algorithms for simple stochastic
games. *DIMACS Series in Discrete Mathematics and Theoretical
Computer Science*, 13:51--71, 1993.

Michael L. Littman. Markov games as a framework
for multi-agent reinforcement learning, *Proceedings of the
Eleventh International Conference on Machine Learning*, pages
157--163, 1994. (local
postscript)

Slides: postscript

Lastly, here's a recent paper that combines ideas from the GraphPlan work with ideas on abstraction.

Craig Boutilier, Ronen Brafman and Chris Geib. Structured reachability analysis for solving Markov decision processes, November 1996. (local postscript (draft, do not circulate))

Slides: postscript

Daphne Koller and Avi Pfeffer. Representations
and solutions for game-theoretic problems. Preliminary version
appeared in *Proceedings of the 14th International Joint Conference
on Artificial Intelligence (IJCAI)*, Montreal, Canada, August
1995, pages 1185--1192. (postscript)

Michael Trick and Stanley Zin. A linear programming approach to solving stochastic dynamic programs, 1993. (postscript)

- Are there methods of efficiently evaluating plans in complex domains?
- Do MDP aggregation methods have anything to offer?
- Can hierarchical methods be applied to propositional state spaces?
- Can Blum and Furst's Graphplan algorithm be extended to stochastic domains?
- How do the known results concerning the use of function approximation in dynamic programming relate to one another?
- How does Boutilier and Dearden's structured policy iteration algorithm perform on some simple structured MDPs?

We've talked about using the GALA system as a basis for a general declarative language for specifying MDPs. Another option is a standard proposed by Rich Sutton (no longer available, replaced by RLFramework at Alberta in 2005).

We wrote several papers about the projects we worked on.

Michael S. Fulkerson, Michael L. Littman, and Greg A. Keim.
*Speeding Safely*: Multi-criteria optimization in probabilistic
planning. Submitted student abstract for AAAI-97. (postscript)

Stephen M. Majercik and Michael L. Littman. Reinforcement learning
for selfish load balancing in a distributed memory environment. In
*Proceedings of the Second International Conference on
Computational Intelligence and Neuroscience*, 1997 (forthcoming).
(abstract,
postscript)

We're working on one more group paper on the load-balancing work.

Last modified: Fri Jan 10 08:46:21 EST 1997 by Michael Littman, mlittman@cs.duke.edu