CS530 homework 1, due 10/4

Because your professor will be out of the country and have limited email access between 9/25 and 10/1, we especially recommend that you start this assignment early, so that we have enough time to answer any questions you may have, and you have enough time to actually do the assignment! This homework is too long to do at the last minute. Please pay attention to the late policy and collaboration policy in the syllabus.

This homework has 4 parts. Each part carries equal weight in grading. To do each part, you need to have thought about the preceding parts, but you don't need to have written down or programmed your solution for the preceding parts. Along with your written responses, please submit your code, documented so that we can run it and find our way around it to make small adjustments if we need to. (You may use any programming language.)

You decide to open a grease truck that sells only scallion pancakes for 2 dollars each. Your experience during a grand-opening trial period tells you that, if you have enough pancakes to sell at all times, then every 10 minutes you can sell

It takes 30 minutes and 2+n dollars to make a batch of n pancakes, and you can only be making one batch at a time. To ensure quality, you can only sell pancakes that are less than one hour old. You can always throw away any unsold pancakes, even ones that are still being made.

For example, if you decide to make 3 pancakes now, it will cost you 5 dollars. You can sell these pancakes between 30 minutes from now and 90 minutes from now. You can also start to make more pancakes after 30 minutes from now and sell them later.
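The arithmetic in this example can be checked with a small sketch. The function names `batch_cost` and `sellable_window` are made up for illustration, and rejecting batches smaller than one pancake is an assumption, not something the problem statement says.

```python
def batch_cost(n: int) -> int:
    """Dollar cost of starting a batch of n pancakes: 2 dollars fixed plus 1 per pancake."""
    if n < 1:
        # Assumption: a batch contains at least one pancake.
        raise ValueError("a batch needs at least one pancake")
    return 2 + n


def sellable_window(start_minute: int) -> tuple[int, int]:
    """Return (ready, expires) in minutes for a batch started at start_minute."""
    ready = start_minute + 30  # a batch takes 30 minutes to make
    expires = ready + 60       # pancakes must be less than one hour old
    return ready, expires


print(batch_cost(3))       # 5
print(sellable_window(0))  # (30, 90)
```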

Space is so tight in your grease truck that it can only hold 5 pancakes at a time (including the ones that are still being made). For example, if you currently have 4 pancakes, then in order to start a new batch of 2 pancakes, you must first throw away at least 1 pancake, even if someone will buy 1 pancake in the next 10 minutes.
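One way to read the space constraint is as a check you run before starting a batch. The sketch below is only an illustration; `CAPACITY` and `must_discard` are hypothetical names, and `on_hand` is assumed to count every pancake currently in the truck.

```python
CAPACITY = 5  # the truck holds at most 5 pancakes, including ones still being made


def must_discard(on_hand: int, batch_size: int, capacity: int = CAPACITY) -> int:
    """Minimum number of pancakes to throw away before starting a batch of batch_size."""
    if batch_size > capacity:
        raise ValueError("the batch alone would not fit in the truck")
    # Anything beyond capacity after adding the new batch has to go first.
    return max(0, on_hand + batch_size - capacity)


print(must_discard(4, 2))  # 1, matching the example in the text
print(must_discard(2, 3))  # 0, the new batch fits as is
```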

Assume that you are risk-neutral, so each dollar is a unit of utility.

  1. Model your business as a Markov decision process in which each transition takes 10 minutes: Formally specify

    Build a simulator for your business which executes any given policy for any given number of transitions and reports the earnings (or loss).
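A simulator of this kind can be kept independent of the model by passing the dynamics in as a function. The following is a minimal sketch under that assumption: the names `simulate` and `toy_step`, the `step(state, action, rng)` signature, and the optional `discount` argument are all choices made here for illustration. The toy dynamics are a placeholder and are NOT the pancake model, which you still have to specify yourself.

```python
import random


def simulate(step, policy, start, n_transitions, seed=None, discount=1.0):
    """Run policy for n_transitions and return the (optionally discounted) total reward.

    step(state, action, rng) -> (next_state, reward)   # hypothetical interface
    policy(state, t) -> action                          # may depend on the time step t
    """
    rng = random.Random(seed)
    state, total, weight = start, 0.0, 1.0
    for t in range(n_transitions):
        action = policy(state, t)
        state, reward = step(state, action, rng)
        total += weight * reward
        weight *= discount
    return total


def toy_step(state, action, rng):
    # Placeholder dynamics, NOT the pancake model:
    # while stock remains, one item sells per transition for 2 dollars.
    if state > 0:
        return state - 1, 2.0
    return state, 0.0


print(simulate(toy_step, lambda s, t: "wait", start=3, n_transitions=10))  # 6.0
```

Averaging `simulate` over many seeds gives an estimate that can be compared against an expected value computed analytically.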

  2. You operate your grease truck from 11am to 2pm each day (so you don't have any pancakes to sell until 11:30am). How many transitions is your horizon?

    Implement the procedure in Stone's Section 3.2.2 to compute the optimal policy. Describe the optimal policy: How does it begin each day? How does it manage your inventory as demand fluctuates? How much money do you expect to make each day? Does the policy make intuitive sense?

    Use the simulator you built in part 1 to execute the optimal policy and confirm that you make roughly the expected amount of money each day.
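For a fixed horizon, one standard way to compute an optimal policy is backward induction: start from the last step and work back to the first. The sketch below assumes that this is the kind of procedure the cited section describes (the section is not reproduced here, so treat that as an assumption). The `model(s, a)` interface returning `(probability, next_state, reward)` triples and the two-state toy model are made up for illustration and are not the pancake MDP.

```python
def backward_induction(states, actions, model, horizon):
    """Finite-horizon dynamic programming.

    model(s, a) -> list of (probability, next_state, reward)   # hypothetical interface
    Returns (values, policy): values[t][s] is the expected total reward from
    step t onward, and policy[t][s] is an action achieving it.
    """
    values = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    values[horizon] = {s: 0.0 for s in states}  # nothing can be earned after the last step
    for t in range(horizon - 1, -1, -1):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions(s):
                q = sum(p * (r + values[t + 1][s2]) for p, s2, r in model(s, a))
                if q > best_q:
                    best_a, best_q = a, q
            values[t][s] = best_q
            policy[t][s] = best_a
    return values, policy


# Toy model (NOT the pancake MDP): state is 0 or 1 items in stock.
def toy_actions(s):
    return ["make", "wait"] if s == 0 else ["sell", "wait"]


def toy_model(s, a):
    if a == "make":
        return [(1.0, 1, -1.0)]  # pay 1 dollar, end up with one item
    if a == "sell":
        return [(1.0, 0, 2.0)]   # earn 2 dollars, stock drops to zero
    return [(1.0, s, 0.0)]       # waiting changes nothing


values, policy = backward_induction([0, 1], toy_actions, toy_model, horizon=6)
print(values[0][0])                # 3.0
print(policy[0][0], policy[5][0])  # make wait
```

Note that the policy depends on the time step: in the toy model, making an item on the very last step is a pure loss, so the last-step action differs from the first-step action.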

  3. The success of your business prompts you to operate your tiny grease truck 24 hours a day, instead of just 3 hours a day. Implement either value iteration or policy iteration (your choice) to compute the optimal policy. Assume that 1 dollar tomorrow is worth 90 cents today. Why does Stone's procedure for computing the optimal policy not apply here?

    Run value iteration or policy iteration several times, starting from different random initial values/policies. How many iterations does it usually take to converge to the optimal policy? How long does it usually take to converge to the correct values?

    Describe the optimal policy: How does it begin? How does it manage your inventory? What is the expected utility? Does the policy make intuitive sense?

    Use the simulator you built in part 1 to execute the optimal policy and confirm that the resulting utility is roughly as expected.

    Try changing the discount factor (90% per day above) and the size of your truck (5 pancakes above): How do these parameters affect the optimal policy and the convergence speed?
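Two pieces of part 3 lend themselves to a short sketch: converting the daily discount into a per-transition discount, and value iteration itself. The conversion below assumes discounting compounds evenly over the 10-minute transitions in a day. The function names, the `model(s, a)` interface, the stopping tolerance, and the toy model are all illustrative assumptions; the toy model is not the pancake MDP and uses a per-step discount of 0.9 only so that it converges quickly.

```python
STEPS_PER_DAY = 24 * 60 // 10  # number of 10-minute transitions in 24 hours


def per_step_discount(daily_discount: float, steps_per_day: int = STEPS_PER_DAY) -> float:
    """Per-transition discount g such that g ** steps_per_day == daily_discount."""
    return daily_discount ** (1.0 / steps_per_day)


def value_iteration(states, actions, model, gamma, tol=1e-9, max_iters=100_000):
    """model(s, a) -> list of (probability, next_state, reward). Returns (values, policy, iterations)."""
    def q(s, a, v):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in model(s, a))

    values = {s: 0.0 for s in states}
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_values = {s: max(q(s, a, values) for a in actions(s)) for s in states}
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:  # stop once a full sweep barely changes any value
            break
    policy = {s: max(actions(s), key=lambda a: q(s, a, values)) for s in states}
    return values, policy, iterations


# Toy model (NOT the pancake MDP): state is 0 or 1 items in stock.
def toy_actions(s):
    return ["make", "wait"] if s == 0 else ["sell", "wait"]


def toy_model(s, a):
    if a == "make":
        return [(1.0, 1, -1.0)]
    if a == "sell":
        return [(1.0, 0, 2.0)]
    return [(1.0, s, 0.0)]


print(per_step_discount(0.9))  # slightly below 1
values, policy, iterations = value_iteration([0, 1], toy_actions, toy_model, gamma=0.9)
print(policy)  # {0: 'make', 1: 'sell'}
```

Because the per-transition discount is very close to 1, expect many more sweeps on the real problem than on this toy; recording `iterations` across runs is one way to answer the convergence questions.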

  4. You decide to expand your 24-hour business to another campus location, where customers may not buy scallion pancakes the same way. You plunge in without a trial period. Implement Q-learning and a model-based learning algorithm to deal with this new uncertainty in the transition and reward probabilities. For each of the two algorithms, implement

    Using the simulator you built in part 1 or an extension of it, compare the performance of the two versions of the two algorithms:
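For the Q-learning half of part 4, the core of the algorithm is the tabular update applied after every observed transition. The sketch below shows one common form with epsilon-greedy exploration and a constant learning rate; these choices, the function names, the `step(state, action, rng)` interface, and the toy environment are assumptions made for illustration and are not prescribed by the assignment. The toy environment is not the pancake model.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start, n_steps, gamma, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration. Returns a dict keyed by (state, action)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # unseen (state, action) pairs start at 0
    state = start
    for _ in range(n_steps):
        acts = actions(state)
        if rng.random() < epsilon:
            action = rng.choice(acts)                         # explore
        else:
            action = max(acts, key=lambda a: q[(state, a)])   # exploit
        next_state, reward = step(state, action, rng)
        best_next = max(q[(next_state, a)] for a in actions(next_state))
        # Move the estimate toward the one-step target; no model of the dynamics is used.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


# Toy environment (NOT the pancake model): state is 0 or 1 items in stock.
def toy_actions(s):
    return ["make", "wait"] if s == 0 else ["sell", "wait"]


def toy_step(s, a, rng):
    if a == "make":
        return 1, -1.0
    if a == "sell":
        return 0, 2.0
    return s, 0.0


q = q_learning(toy_step, toy_actions, start=0, n_steps=20_000, gamma=0.9)
greedy = {s: max(toy_actions(s), key=lambda a: q[(s, a)]) for s in (0, 1)}
print(greedy)
```

A model-based learner would instead estimate transition and reward probabilities from the same stream of observations and then plan with them, for example by running value iteration on the estimated model.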

Happy frying!