MAXQ:

Says that DP for finding the optimal policy requires cubic time. I can't figure out how this could be right. In a deterministic environment, it can be done in linear time. In a stochastic environment, it could take a lot longer.

What is the main purpose of hierarchical RL? Is it to exploit a programmer-defined hierarchy? If so, I think it is doing a reasonable job. However, I'm not convinced this is the problem we really want to solve.

GLIE is "greedy in the limit *with* infinite exploration".

If we adopt the MAXQ mentality, but want to take a model-based approach, what does the algorithm look like? State and action spaces are the same. I suppose the policies simply need to be computed bottom up. But what does the model learning look like? Well, we want to know the effect of each action given the state. Does this also depend on the current subgoal? Perhaps I should wait to see how MAXQ deals with state abstraction before fleshing this out.

How can we resolve the issue of recursive vs. hierarchical optimality? Both seem artificial, but Dietterich expresses the contrast as (approximately) recursive = efficient, hierarchical = higher quality. How can we get the best of both? What is the hierarchy buying us anyway?

Does the model-based discussion on pg. 17 still seem negative if we are willing to memorize all experience sequences? What is the perceived cost of model learning?

McGovern and Barto:

Subgoals accelerate current and future problems, in principle.
* Searches for bottlenecks in observation space.
* Talks about episodes vs. breaking continuous tasks into blocks using Iba's peak-to-peak heuristic.
* Describes a connection between subgoals and exploration (I agree with this).
* Using diverse density and multiple-instance learning (surprising approach).
* Seems like there are a lot of tunable parameters for identifying bottlenecks (exclude states "close" to start, threshold for identifying new options, etc.)
* Nice use of follow-up experiments to tease apart the effect.
* Learned option acts as a very simple piece of knowledge that is powerful after the task changes (slightly).
* Effects were weaker with the 4-room example than with the 2-room example.

Smart and Kaelbling:

Q-learning: on Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]

Ideas:
* convex hull to estimate where accurate queries can be made.
* LWR to combine (stored) Q values
* combine state and action for neighborhood queries
* distance cutoff
* need at least kmin within the cutoff or give up
* hull computed using the nearby points, if query outside, give up
* Note that "give up" value could be viewed as part of the RMAX idea.
* uses a sampling and optimization approach for continuous actions
* presents training examples in reverse order
* uses guidance and passive learning to start the learning process
* dismisses optimistic initialization to encourage exploration

Ok, can I translate this into my terms? Here's what they say:
1. estimate for Q(s,a) obtained by averaging the neighbors of s,a in state-action space.
2. estimate for V(s') obtained by averaging s',a' neighbors in state-action space for each a' and then maximizing over a'.
3. Q'(s,a) = (1-alpha) Q(s,a) + alpha (r + gamma V(s')) or, Q(s,a) = sum_s' (r + gamma V(s')). (Only, s is not revisited.)
4. Q(si,ai) <- Q(si,ai) + ki (qnew - Q(si,ai)) or, Q(si,ai) is the kappa-weighted average of its neighbors

How can we formalize this? For simplicity, imagine a point x has a fixed set of neighbors. Some of these neighbors...

When the point x is first encountered, its value is set to be the average of its neighbors, blended with its new prediction with learning rate alpha. Thereafter, whenever a neighbor of x is seen, the neighbor's newly estimated value is blended into x's prediction, with learning rate kappa (kernel). (Should probably be kappa alpha, yes?)

What's the expected update here? I think we need to split into "first" and "later".

  Qfirst(s,a) = (1-alpha) sum_{si,ai neighbor} kappai Qlater(si,ai) + alpha (r + gamma max_a' Qlater(s',a'))
  Qlater(s,a) = sum_{si,ai neighbor} kappai Qfirst(si,ai)

(Also describes results with no Qlater update, works well in discrete action case.)

(I hypothesize that the IVH step won't be important in the model-based case.)

What is our suggested approach (and notation)? I'll double check this later.
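
To make the first/later split concrete, here is a minimal sketch of items 1-4 as I read them. It is not Smart and Kaelbling's implementation. Assumptions that are mine: a Gaussian kernel with a bandwidth parameter, Euclidean distance on the concatenated (s,a) vector, a plain kernel-weighted average standing in for LWR, no hull (IVH) check, a fixed give-up value, and the "kappa alpha" rate for the later step. The class and parameter names (NeighborQ, cutoff, kmin, bandwidth, q_default) are hypothetical.

```python
import math


class NeighborQ:
    """Memory-based Q estimate over stored (state-action point, value) pairs."""

    def __init__(self, cutoff=1.0, kmin=2, bandwidth=0.5, q_default=0.0):
        self.cutoff = cutoff        # distance cutoff for neighborhood queries
        self.kmin = kmin            # need at least kmin neighbors or give up
        self.bandwidth = bandwidth  # assumed Gaussian kernel width
        self.q_default = q_default  # assumed "give up" value
        self.points = []            # stored state-action vectors (tuples)
        self.values = []            # stored Q values, parallel to points

    def _neighbors(self, x):
        """Return (index, kernel weight) for stored points within the cutoff."""
        found = []
        for i, p in enumerate(self.points):
            d = math.dist(x, p)
            if d <= self.cutoff:
                found.append((i, math.exp(-((d / self.bandwidth) ** 2))))
        return found

    def query(self, s, a):
        """Item 1: kernel-weighted average of the neighbors of (s,a)."""
        nbrs = self._neighbors(tuple(s) + tuple(a))
        if len(nbrs) < self.kmin:
            return self.q_default  # give up
        total = sum(w for _, w in nbrs)
        return sum(w * self.values[i] for i, w in nbrs) / total

    def value(self, s, actions):
        """Item 2: estimate Q(s,a') for each candidate a', then maximize."""
        return max(self.query(s, a) for a in actions)

    def update(self, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
        x = tuple(s) + tuple(a)
        nbrs = self._neighbors(x)  # neighbors found before x itself is stored
        # "First" step (item 3): blend neighbor average with the new target.
        target = r + gamma * self.value(s_next, actions)
        q_new = (1 - alpha) * self.query(s, a) + alpha * target
        # "Later" step (item 4): pull each neighbor toward q_new,
        # here with rate kappa * alpha rather than kappa alone.
        for i, w in nbrs:
            self.values[i] += alpha * w * (q_new - self.values[i])
        self.points.append(x)
        self.values.append(q_new)
        return q_new
```

Usage would look like q = NeighborQ(kmin=1) followed by q.update(s, a, r, s_next, actions) once per observed transition, with states and actions given as tuples of floats. Because the give-up value is a plain constant, setting q_default high would act like the RMAX-style optimism mentioned in the Ideas list, while a low value matches the paper's dismissal of optimistic initialization.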