The N-armed bandit problem (slot machines): each machine has some probability of paying off, but we don't know the probabilities. n=4:
Exploration vs. exploitation: If we build a statistical model by trying the different machines, we get more accurate statistics (and therefore better odds of a big payoff later). But every pull spent exploring a poor machine is a pull not spent on the machine that currently looks best.
Error bound is generally proportional to 1/sqrt(T) (where T is the number of pulls).
This leads to "interval estimation" (IE), which is good in practice, but not necessarily optimal.
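A minimal sketch of interval estimation for machines that pay 0 or 1, to make the 1/sqrt(T) bound concrete. The function names, the z value of 1.96, and the use of 0.5 as a worst-case standard deviation are assumptions for illustration, not something fixed by these notes.

```python
import math
import random


def ie_score(wins, pulls, z=1.96):
    """Upper end of a confidence interval on one machine's payoff probability."""
    if pulls == 0:
        return float("inf")  # never tried: guarantee it gets pulled once
    mean = wins / pulls
    # A 0/1 payoff has standard deviation at most 0.5, so the half-width
    # of the interval shrinks like 1/sqrt(pulls).
    return mean + z * 0.5 / math.sqrt(pulls)


def run_ie(probs, total_pulls, seed=0):
    """Repeatedly pull the machine with the highest upper bound."""
    rng = random.Random(seed)
    wins = [0] * len(probs)
    pulls = [0] * len(probs)
    for _ in range(total_pulls):
        scores = [ie_score(w, n) for w, n in zip(wins, pulls)]
        arm = scores.index(max(scores))
        pulls[arm] += 1
        if rng.random() < probs[arm]:
            wins[arm] += 1
    return wins, pulls


if __name__ == "__main__":
    wins, pulls = run_ie([0.2, 0.4, 0.6, 0.8], total_pulls=1000)
    print("pulls per machine:", pulls)
```

A machine with few pulls keeps a wide interval, so it still gets tried occasionally even when its observed average is low.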
Gittins index: compute a score for each machine in isolation (like percentage plus error) and pull the one with the highest score. This resembles IE, but the score is calculated differently.
The value function for the trailer bot should be lowest close to the jackknife position and highest in the center. Chris' exploration indicates that it slopes down very slowly, then plummets at the edges.
For the robots, we can try averaging: for a state-action pair (s, a), keep counts of the number of times we went to each next state s':
For initial values, we can give each possible outcome an initial count of 1, 0.1, or zero. Experiences:
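One way to read the averaging idea is as a table of counts turned into transition probabilities, with the initial value acting as a pseudo-count added to every outcome. The sketch below assumes that reading. The class name `TransitionModel` and its methods are hypothetical.

```python
class TransitionModel:
    """Estimate T(s, a, s') from counts plus an initial pseudo-count."""

    def __init__(self, states, prior=0.1):
        self.states = list(states)
        self.prior = prior   # initial value given to every outcome (1, 0.1 or 0)
        self.counts = {}     # (s, a, s') -> number of times observed

    def record(self, s, a, s_next):
        key = (s, a, s_next)
        self.counts[key] = self.counts.get(key, 0) + 1

    def prob(self, s, a, s_next):
        total = sum(self.counts.get((s, a, s2), 0) + self.prior
                    for s2 in self.states)
        if total == 0:
            # prior of zero and no data yet: fall back to uniform
            return 1.0 / len(self.states)
        return (self.counts.get((s, a, s_next), 0) + self.prior) / total


if __name__ == "__main__":
    model = TransitionModel(states=[0, 1, 2], prior=1.0)
    for _ in range(3):
        model.record(0, "right", 1)
    print(model.prob(0, "right", 1))  # (3 + 1) / (3 + 3)
```

A larger prior keeps the estimate close to uniform for longer. A prior of zero trusts the first observation completely.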
Imagine a set of states arranged in a 1-D chain. However, we initialize the probabilities out of the final state so that it moves to all possible states with equal probability.
In the beginning, every state has equal probability of ending up in any other state. But many states will be avoided because their probability of ending up in a bad state is high: the robot stays in the few states it starts out in and doesn't "explore". How do we get past this?
What if we say that the unknown is good, so that the robot is tempted to try things out until it gets a good idea of reality? It will try the jackknife states, but quickly learn that they are evil.
To do this, we could give a bonus for unvisited states. We could make an imaginary happy state to which all unvisited states link, thus raising their values.
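A small sketch of the imaginary happy state, assuming value iteration over a learned model in which every untried state-action pair is treated as a link straight to the happy state. The function name, the per-state `reward` table, and the choice of the happy state's value as happy_reward / (1 - gamma) are assumptions made for this example.

```python
def optimistic_value_iteration(states, actions, model, reward,
                               gamma=0.9, happy_reward=1.0, sweeps=200):
    """model: dict (s, a) -> {s': prob}, holding only the pairs tried so far."""
    # Value of sitting in the happy state forever.
    happy_value = happy_reward / (1.0 - gamma)
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            q_values = []
            for a in actions:
                if (s, a) not in model:
                    # Untried pair: pretend it leads to the happy state.
                    q_values.append(reward[s] + gamma * happy_value)
                else:
                    expected = sum(p * V[s2]
                                   for s2, p in model[(s, a)].items())
                    q_values.append(reward[s] + gamma * expected)
            V[s] = max(q_values)
    return V


if __name__ == "__main__":
    states = [0, 1, 2]
    actions = ["left", "right"]
    # Only the left end of the chain has been explored so far.
    model = {
        (0, "left"): {0: 1.0},
        (0, "right"): {1: 1.0},
        (1, "left"): {0: 1.0},
    }
    reward = {0: 0.0, 1: 0.0, 2: 0.0}
    print(optimistic_value_iteration(states, actions, model, reward))
```

In this example the untried pairs at states 1 and 2 raise the value of state 0 as well, so the greedy choice from state 0 heads toward the unexplored end of the chain. Once every pair has been tried, the bonus disappears.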
SARSA: state-action-reward-state-action. Learn based on new state and action, not just new state.
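For reference, a sketch of the standard SARSA update, which backs up the value of the action actually taken in the new state. The learning rate `alpha` and the dictionary representation of Q are assumptions. These notes do not specify either.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward r + gamma * Q(s', a'), where a' was really chosen."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]


if __name__ == "__main__":
    Q = {("s1", "b"): 2.0}
    print(sarsa_update(Q, "s0", "a", r=1.0, s_next="s1", a_next="b", alpha=0.5))
```

Because the target uses a' rather than the best action in s', the learned values reflect whatever exploration the policy actually performs.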
Given a policy that is greedy with probability 1-epsilon and random with probability epsilon, our old update rule was: Q(s,a) = r + gamma sum{s'} T(s,a,s') max{a'} Q(s',a'). Now, we change the max{a'} Q(s',a') to (1-epsilon) max{a'} Q(s',a') + epsilon/m sum{a'} Q(s',a'). Here, m is the number of available actions.
Now, the value update equation is "aware" that it will choose a random action a fraction epsilon of the time, so it will try to avoid states from which a random action could land it in a jackknifed state.
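The modified backup can be sketched directly from the rule above. The function names and the dictionary layout for T and Q are assumptions made for this example.

```python
def epsilon_aware_backup(q_next, epsilon):
    """(1 - epsilon) * max over a' plus (epsilon / m) * sum over a'."""
    m = len(q_next)  # number of available actions
    return (1.0 - epsilon) * max(q_next) + (epsilon / m) * sum(q_next)


def q_update(r, gamma, transitions, Q, actions, epsilon):
    """transitions: dict s' -> T(s, a, s'); Q: dict (s', a') -> value."""
    return r + gamma * sum(
        p * epsilon_aware_backup([Q[(s2, a2)] for a2 in actions], epsilon)
        for s2, p in transitions.items()
    )


if __name__ == "__main__":
    # A state next to a jackknife: one action is fine, the other is terrible.
    q_next = [1.0, -10.0]
    print(epsilon_aware_backup(q_next, epsilon=0.0))  # pure max
    print(epsilon_aware_backup(q_next, epsilon=0.2))  # accounts for random moves
```

With epsilon = 0 the state looks as good as its best action. With epsilon = 0.2 the chance of randomly taking the bad action drags its value below zero, which is what steers the robot away from such states.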