Practical Reinforcement Learning in Continuous Spaces

Stated Goals:
* safe approximation of the value function
* learning from a small amount of data

Other issues:
* use of "teaching" in RL

Questions:
* What is the connection between discretization and hidden state?
* What is a "safe" approximation?
* Why would we expect value function approximations to overestimate (be too optimistic)?

Equations:
normal Q-learning:
    Q(s,a) <- Q(s,a) + alpha [r + gamma max_{a'} Q(s',a') - Q(s,a)]

Discussion:
* What's "hidden extrapolation"?
* What is the IVH?

Dimensions of instance-based algorithms:
* How are nearby points chosen?
  Distance threshold to the query vector; returns "don't know" if there is insufficient data. Also creates a hull and returns "don't know" if the query is outside the hull.
  Other options: k-NN.
* How are values from points combined?
  LWR with a width parameter.
  Other options: straight average, weighted average.
* What do instances look like?
  (s,a,r,s')
  Other options: keep sequences.
* How are values assigned to point (s,a,r,s')?
    q <- Q(s,a)
    q_{next} <- max_{a'} Q(s',a')
    q_{new} <- q + alpha (r + gamma q_{next} - q)
  Using LWR weights,
    Q(s_i,a_i) <- Q(s_i,a_i) + weight_i (q_{new} - Q(s_i,a_i))
  (Like training a linear approximator.)
  Training items are updated in reverse order to increase their effect.
  Other options: repeat until convergence a la Gordon; use multistep updates; learn a model.
* How to act during training?
  Given expert examples.
  Other options: random, directed exploration (a la Langford), optimistic initialization.

Unanswered question: What is the convergence point of learning updates given a fixed set of experience?
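
Sketch of the instance-based update:
The value-assignment step in the notes above can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the paper's implementation: a Gaussian kernel-weighted average stands in for LWR, the hull check is omitted (only the distance threshold and a minimum neighbor count trigger "don't know"), actions are assumed discrete, and a "don't know" answer is replaced by a default value of 0 inside the update. All names (Instance, InstanceQ, threshold, width, min_points, default_q, replay_reversed) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    s: tuple   # state vector
    a: int     # discrete action (assumption; the notes do not specify)
    q: float   # stored Q estimate for (s, a)


class InstanceQ:
    """Instance-based Q estimates with a "don't know" answer (hypothetical sketch)."""

    def __init__(self, actions, threshold=1.0, width=0.5, min_points=2,
                 alpha=0.5, gamma=0.9, default_q=0.0):
        self.actions = list(actions)
        self.threshold = threshold    # max distance from the query vector
        self.width = width            # kernel width parameter
        self.min_points = min_points  # fewer neighbors than this -> "don't know"
        self.alpha = alpha
        self.gamma = gamma
        self.default_q = default_q    # assumed stand-in for "don't know" in updates
        self.memory = []

    def _neighbors(self, s, a):
        """Return (instance, kernel weight) pairs within the distance threshold."""
        pairs = []
        for inst in self.memory:
            if inst.a != a:
                continue
            d = math.dist(inst.s, s)
            if d <= self.threshold:
                pairs.append((inst, math.exp(-(d / self.width) ** 2)))
        return pairs

    def predict(self, s, a):
        """Kernel-weighted average of nearby q values, or None for "don't know"."""
        pairs = self._neighbors(s, a)
        if len(pairs) < self.min_points:
            return None
        total = sum(w for _, w in pairs)
        return sum(w * inst.q for inst, w in pairs) / total

    def _q_or_default(self, s, a):
        q = self.predict(s, a)
        return self.default_q if q is None else q

    def update(self, s, a, r, s_next):
        """Apply the update from the notes to one experience (s, a, r, s')."""
        q = self._q_or_default(s, a)
        q_next = max(self._q_or_default(s_next, b) for b in self.actions)
        q_new = q + self.alpha * (r + self.gamma * q_next - q)
        # Move each nearby stored value toward q_new in proportion to its weight.
        for inst, w in self._neighbors(s, a):
            inst.q += w * (q_new - inst.q)
        self.memory.append(Instance(tuple(s), a, q_new))
        return q_new


def replay_reversed(agent, trajectory):
    """Present a recorded trajectory last transition first."""
    for s, a, r, s_next in reversed(trajectory):
        agent.update(s, a, r, s_next)
```

Calling predict on a state with fewer than min_points stored neighbors within threshold returns None, which plays the role of "don't know". replay_reversed feeds a recorded list of (s, a, r, s') tuples to update starting from the last transition, so values near the end of the trajectory are in place before earlier transitions bootstrap from them.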