We call an MDP positive bounded if both of the following hold:
- No policy has infinite expected total reward from any state, when the
total counts only positive rewards (each negative reward is replaced by zero).
- For each state, there is at least one action with
non-negative reward.
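In symbols, the two conditions might be written as follows. This is a sketch that assumes Puterman-style notation not used above: s is a state, a ∈ A_s an action available in s, r(s,a) the reward, π a policy, and X_t, Y_t the state and action at step t.

```latex
r^{+}(s,a) = \max\{\, r(s,a),\ 0 \,\}

\text{(1)}\quad v_{+}^{\pi}(s) = \mathbb{E}_{s}^{\pi}\Big[ \sum_{t=1}^{\infty} r^{+}(X_t, Y_t) \Big] < \infty
\quad \text{for every state } s \text{ and every policy } \pi

\text{(2)}\quad \forall s \ \exists a \in A_s : \ r(s,a) \ge 0
```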
Why is the first condition needed?
Why is the second condition needed?
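To make the two conditions concrete, here is a minimal Python sketch for a finite MDP. It assumes a hypothetical encoding: P[s][a][t] is the probability of moving from state s to state t under action a, and r[s][a] is the reward. The helper names has_nonnegative_action and n_step_positive_value are illustrative only. The first helper checks the second condition directly. The second helper computes, by backward induction, the best expected n-step total of positive rewards. Because those rewards are non-negative, this is a lower bound on the supremum over policies of the infinite-horizon total in the first condition. Values that keep growing with n suggest, but do not prove, that the first condition fails.

```python
from typing import List, Sequence


def has_nonnegative_action(r: Sequence[Sequence[float]]) -> bool:
    """Second condition: every state has at least one action with reward >= 0."""
    return all(any(reward >= 0 for reward in row) for row in r)


def n_step_positive_value(
    P: Sequence[Sequence[Sequence[float]]],
    r: Sequence[Sequence[float]],
    n: int,
) -> List[float]:
    """Best expected sum of max(r, 0) over n steps from each state.

    Finite-horizon value iteration starting from v = 0; negative rewards
    are replaced by zero, as in the first condition.
    """
    num_states = len(r)
    v = [0.0] * num_states
    for _ in range(n):
        v = [
            max(
                max(r[s][a], 0.0)
                + sum(P[s][a][t] * v[t] for t in range(num_states))
                for a in range(len(r[s]))
            )
            for s in range(num_states)
        ]
    return v


# Bounded case: from state 0, one action pays -3 and another pays 5,
# and both lead to the absorbing, zero-reward state 1.
P_ok = [
    [[0.0, 1.0], [0.0, 1.0]],  # state 0: both actions go to state 1
    [[0.0, 1.0]],              # state 1: absorbing
]
r_ok = [[-3.0, 5.0], [0.0]]
print(has_nonnegative_action(r_ok))           # True
print(n_step_positive_value(P_ok, r_ok, 50))  # [5.0, 0.0]

# Unbounded case: a single state whose only action loops back with reward 1,
# so the n-step value is n and grows without limit.
P_bad = [[[1.0]]]
r_bad = [[1.0]]
print(n_step_positive_value(P_bad, r_bad, 50))  # [50.0]
```

In the bounded case the n-step values settle at 5 and 0, while in the self-loop case they equal n, which is the kind of behavior the first condition rules out.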