Least Squares Policy Iteration
Lagoudakis and Parr

What is on-policy vs. off-policy?
Why is a state-value function of no use for control?
What did the authors have to do to extend LSTD to LSQ?
What does P^pi Phi represent? Is it constructible in a continuous state space? How about hat{P^pi Phi}?
How is hat{A} used differently given changes in the policy?
Is this method memory-based? What experience needs to be stored?
Problems solved: linear chain, inverted pendulum, bicycle task.
Would a direct comparison with other methods be meaningful? Is the number of episodes needed to master the task a useful comparison metric?
How was the training data collected? Is this realistic?
What is the Ormoneit and Sen paper about?
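
As a companion to the questions about hat{A} and about what experience needs to be stored, here is a minimal sketch of a sample-based least-squares Q estimate. It is not the authors' code. The two-state toy problem, the one-hot features `phi`, the helper `solve`, and all variable names are assumptions made for illustration. The sketch accumulates hat{A} and hat{b} from stored (s, a, r, s') samples, and the policy enters only through the action chosen at s'.

```python
# Hypothetical toy problem: 2 states, 2 actions (0 = stay, 1 = switch),
# reward 1 whenever the next state is state 1. Not taken from the paper.
N_STATES, N_ACTIONS = 2, 2
K = N_STATES * N_ACTIONS  # number of basis functions


def phi(s, a):
    """One-hot feature vector over (state, action) pairs (an assumption)."""
    v = [0.0] * K
    v[s * N_ACTIONS + a] = 1.0
    return v


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lstdq(samples, policy, gamma):
    """Estimate weights w with Q(s, a) ~ phi(s, a) . w for the given policy."""
    A = [[0.0] * K for _ in range(K)]
    b = [0.0] * K
    for s, a, r, s_next in samples:
        f = phi(s, a)
        # The policy is consulted only here, at the successor state.
        f_next = phi(s_next, policy(s_next))
        for i in range(K):
            for j in range(K):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += f[i] * r
    return solve(A, b)


# Stored experience: one (s, a, r, s') sample per state-action pair.
samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]


def go_to_state_1(s):
    """Switch when in state 0, stay when in state 1."""
    return 1 if s == 0 else 0


def always_stay(s):
    return 0


w = lstdq(samples, go_to_state_1, gamma=0.5)
w_stay = lstdq(samples, always_stay, gamma=0.5)
print(w, w_stay)
```

The same `samples` list is reused for both policies. Only the second feature vector in each update changes, so hat{A} has to be rebuilt for each policy while the stored experience stays fixed.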