Neural Information Processing Systems Workshop: RL Comparisons

NOTE: The event is now over (12/6/05) and results are being calculated for the workshop. Stay tuned on this website for summaries and other announcements! Here are the workshop proceedings.

As part of the NIPS 2005 workshop program, we are running a workshop on Reinforcement Learning comparisons.

Resources:

The event began Tuesday, November 8th, 11am East-Coast-US time with the following announcement (now updated):
The organizers of the NIPS-05 workshop "Reinforcement Learning Benchmarks and Bake-offs II" cordially invite you to participate in the first RL benchmarking event.

The document below describes how we will run the event. Please download the benchmark distribution itself and install it (see the Readme file). You will need to integrate your solver with the benchmarking framework and send us the output files for analysis. The current framework supports solvers in C or C++. The latest benchmarking code is at:

http://www.cs.rutgers.edu/~mlittman/topics/nips05-mdp/NIPSBenchMark1.0.3.zip

Output files and brief algorithm descriptions must be mailed to Michael Littman (mlittman@cs.rutgers.edu) by Midnight (East-Coast-US time), Monday December 5th to be included in the workshop report. Earlier submissions will be appreciated! We will analyze the data for presentation at the workshop on December 9th.

You will also have the opportunity to present your results and systems (as time allows) at the workshop itself. You may submit the results of multiple solvers in multiple categories.

We look forward to your contributions!

ORGANIZERS (alphabetically): Alain Dutech, Tim Edmunds, Jelle Kok, Michail G. Lagoudakis, Michael Littman, Martin Riedmiller, Brian Russell, Bruno Scherrer, Rich Sutton, Stephan Timmer, Nikos Vlassis, Adam White, Shimon Whiteson.

DOCUMENT DESCRIBING METRICS FOR THE NIPS 2005 BENCHMARKING EVENT

0. Preamble:

This document defines the benchmarks for the NIPS 2005 workshop. Participants will work through a modified version of the RL-Framework, distributed with this document. The instructions and makefiles should allow participants to attach their solvers to the different domains and generate the data files that will be needed for presentation at the workshop.

There are eight benchmarking categories for which we will be collecting data.

1. General setting:

Each domain is run for a certain maximum number of episodes, Nmax. We set Nmax = 10,000 for all problems.

For all of the problems, a set of 50 starting states is created randomly and held fixed for all participants. A total of Nmax episodes will be run, cycling through the 50 starting states sequentially.

The principal metric reported is the average summed reward over blocks of 50 episodes. Additional metrics recorded are elapsed wall-clock time and the number of steps per episode. We will compare the solvers on multiple metrics at the workshop.
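For concreteness, here is a minimal sketch (in C, the language the framework supports) of how the block-averaged metric could be computed from logged per-episode returns. The array name and reporting format are illustrative assumptions, not part of the benchmark distribution.

    /* Sketch: average summed reward over consecutive blocks of 50 episodes. */
    #include <stdio.h>

    #define NMAX  10000   /* total episodes */
    #define BLOCK 50      /* episodes per reporting block */

    /* episode_return[i] is the summed reward of episode i (hypothetical log). */
    void report_block_averages(const double episode_return[NMAX])
    {
        for (int block = 0; block < NMAX / BLOCK; block++) {
            double sum = 0.0;
            for (int i = 0; i < BLOCK; i++)
                sum += episode_return[block * BLOCK + i];
            printf("block %d: average summed reward %f\n", block, sum / BLOCK);
        }
    }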

Use of prior knowledge about a domain and/or offline computation is allowed, provided it is reported in the results. We also emphasize that we are interested in solvers regardless of their internal notions of performance metrics, and we encourage participants to describe the internal metrics used.

We have chosen not to explicitly separate training and testing phases. Each solver should strive to maximize the rewards it accumulates over the Nmax episodes.

An episode ends when either a terminal state is reached or a maximum number of cycles, Cmax, is reached. We set Cmax = 300 for all problems. Agents cannot terminate episodes prematurely.

2. Problem-specific definitions:

a. MountainCar (from the Sutton-Barto book).
state variables: x1 (position), x2 (velocity)
region for starting states: -1.1 <= x1 <= 0.49, x2 = 0
terminal states: x1 >= 0.5
actions: +1, 0, -1
Reward:
0, if x1 >= 0.5
-1, else
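A minimal C sketch of the MountainCar reward and termination test above; illustrative only, the distributed environment code is authoritative.

    /* MountainCar reward and termination, as specified above (sketch). */
    double mountaincar_reward(double x1)
    {
        return (x1 >= 0.5) ? 0.0 : -1.0;   /* 0 at the goal, -1 per step otherwise */
    }

    int mountaincar_terminal(double x1)
    {
        return x1 >= 0.5;                  /* episode ends once the car reaches the goal */
    }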
b. CartPole (classical definition; dt = 0.02, Runge-Kutta).
state variables: pole angle (radians from vertical), pole angular velocity (rad/s), cart position (meters from center), cart velocity (m/s)
region for starting states: -PI/18 <= pole angle <= +PI/18; pole angular velocity = 0; -0.5 <= cart position <= 0.5; cart velocity = 0;
terminal states: |pole angle| >= PI/6 or |pos| >= 2.4 (failure)
actions: -10, -9, ..., -1, 0, 1, ..., 9, 10
Reward:
0, if |pole angle| <= PI/60 and |pos| <= 0.05
-1000, if |pole angle| >= PI/6 or |pos| >= 2.4
-1, else
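The CartPole reward above, written out as a C sketch (angles in radians); the function name is illustrative and the benchmark's environment code is authoritative.

    /* CartPole reward, as specified above (sketch). */
    #include <math.h>
    #define PI 3.14159265358979323846

    double cartpole_reward(double angle, double pos)
    {
        if (fabs(angle) >= PI / 6.0 || fabs(pos) >= 2.4)
            return -1000.0;                               /* failure */
        if (fabs(angle) <= PI / 60.0 && fabs(pos) <= 0.05)
            return 0.0;                                   /* balanced near the center */
        return -1.0;
    }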
c. PuddleWorld (from "Generalization in RL: Successful Examples Using Sparse Coarse Coding", Richard Sutton, NIPS 1996).
state variables: x, y
region for starting states: uniformly random over [0,1]^2
terminal states: x>=0.95 and y>=0.95
actions: 4 discrete actions (0-UP, 1-DOWN, 2-RIGHT, 3-LEFT); each step moves 0.05 plus Gaussian noise with mean = 0, std = 0.01
Reward:
-1 for each step, plus -400 * (distance inside the puddle), where the distance lies in [0, 0.1]
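A C sketch of the PuddleWorld reward above. The puddle_depth() helper is hypothetical; it stands in for the puddle geometry defined in the benchmark's environment code and is assumed to return how far (x, y) lies inside a puddle, in [0, 0.1].

    /* PuddleWorld reward (sketch). */
    double puddle_depth(double x, double y);   /* hypothetical helper, provided elsewhere */

    double puddleworld_reward(double x, double y)
    {
        return -1.0 - 400.0 * puddle_depth(x, y);
    }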
d. Blackjack (Based on section 5.1 of the Sutton-Barto book).
Standard rules of blackjack, but modified to allow players to (mistakenly) hit on 21. This prevents episodes from starting in terminal states.
state space: 1-dimensional integer array, 3 elements,
element[0] - current value of player's hand (4-21)
element[1] - value of dealer's face-up card (2-11)
element[2] - whether the player has a usable ace (0 = no, 1 = yes)
region for starting states: player has any 2 cards (uniformly distributed), dealer has any 1 card (uniformly distributed)
terminal states: the episode terminates when the player "sticks"
actions: discrete integers 0-HIT, 1-STICK
Reward:
-1 for a loss, 0 for a draw and 1 for a win.
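A C sketch of reading the 3-element Blackjack observation described above and choosing an action with a naive fixed policy. The "hit below 17" threshold is purely illustrative and not part of the benchmark.

    /* Interpreting the Blackjack state array (sketch). */
    enum { HIT = 0, STICK = 1 };

    int blackjack_policy(const int state[3])
    {
        int player_total = state[0];   /* 4..21 */
        int dealer_card  = state[1];   /* 2..11 (unused by this naive policy) */
        int usable_ace   = state[2];   /* 0 or 1 (also unused here) */
        (void)dealer_card;
        (void)usable_ace;
        return (player_total < 17) ? HIT : STICK;
    }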
e. SensorNetwork (based on "Preprocessing Techniques for Accelerating the DCOP Algorithm ADOPT", S.M. Ali, S. Koenig, M. Tambe, AAMAS 2005). The problem represents two arrays of sensors bracketing an array of target cells:
S S S S
 C C C
S S S S
Two targets float in the three target locations. For example:
S S S S
 T T O
S S S S
Each sensor can focus on the cell to its left, on the cell to its right, or not at all. If, in a single step, a target is hit by 3 sensors:
R L O O
 T T O
R O O O
then the target loses one hit point. After each step, the targets can move; in turn (from left to right), a target that is adjacent to at least one empty cell uniformly randomly chooses to try to move left, to try to move right, or to stay in its current cell; if the cell it tries to move into is empty, the move succeeds.
state space: 3 cells, each with a target energy level from 0 to 3.
region for starting states: one of [3,3,0], [3,0,3], [0,3,3] (uniformly distributed)
terminal states: both targets have zero hit points, i.e., [0,0,0]
actions: the actions of each sensor (0-don't focus, 1-focus left, 2-focus right) are packed into a single integer in [0, 3^8); the ith sensor's action is (action/(3^(7-i)))%3 (see the decoding sketch after this domain's description)
Reward:
-1 for each sensor focus, +30 for killing a target.
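A C sketch of unpacking (and packing) the joint SensorNetwork action, matching the formula above: eight sensors, three choices each, with the ith sensor's action given by (action / 3^(7-i)) % 3. The function names are illustrative.

    /* Joint-action decoding/encoding for SensorNetwork (sketch). */
    enum { NO_FOCUS = 0, FOCUS_LEFT = 1, FOCUS_RIGHT = 2 };

    int sensor_action(int joint_action, int i)     /* i in 0..7 */
    {
        int divisor = 1;
        for (int k = 0; k < 7 - i; k++)
            divisor *= 3;
        return (joint_action / divisor) % 3;
    }

    int pack_actions(const int sensor[8])          /* inverse mapping */
    {
        int joint = 0;
        for (int i = 0; i < 8; i++)
            joint = joint * 3 + sensor[i];
        return joint;
    }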
f. Taxi (adapted from Dietterich's MAXQ work).
The agent controls a taxi on a 5x5 grid with walls (the wall layout is fixed). There are 4 locations on the grid that are possible passenger pick-up/drop-off locations (these locations are also fixed). In each episode, the passenger must be dropped off at a specific drop-off location.
state variables: taxiLocation [0,24], passengerLocation [0,4] (waiting at pick-up/drop-off #[0,3] or in the taxi), drop-offLocation [0,3]
region for starting states: the taxi starts uniformly at random in any of the grid squares, passengerLocation is chosen uniformly at random from the passenger states, and drop-offLocation is chosen uniformly at random from the drop-off locations
terminal states: the passenger is successfully dropped off
actions: 0-go north, 1-go south, 2-go west, 3-go east, 4-pick up passenger, 5-drop off passenger
Reward:
-1 for an attempted movement (whether it is successful or blocked by a wall), -1 for a successful pickup, 0 for a successful drop-off, -10 for an attempted drop-off with no passenger or at the wrong location, -10 for an attempted pick-up at the wrong location (or if the passenger is already in the taxi).
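A small C sketch of interpreting the Taxi variables above. The row-major mapping of taxiLocation to grid coordinates is an assumption made for illustration; the benchmark code defines the actual layout and wall positions.

    /* Taxi action names and a (hypothetical) location decoding (sketch). */
    enum { GO_NORTH = 0, GO_SOUTH = 1, GO_WEST = 2, GO_EAST = 3,
           PICK_UP = 4, DROP_OFF = 5 };

    void taxi_coords(int taxiLocation, int *row, int *col)
    {
        *row = taxiLocation / 5;   /* assumed row-major indexing on the 5x5 grid */
        *col = taxiLocation % 5;
    }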

3. Interaction protocol:

The code for this event is built around the RL-Framework developed at the University of Alberta. When running the executable compiled with the event's main() function, the function calls received by the agent will be as described below.

The agent will be initialized with the task-specification string for the problem being solved (agent_init()), and asked for its unique identifier (agent_get_name()) before the solving begins.

Nmax (10000) episodes will then be run. Each episode will start with exactly one call to agent_start(). The episode then continues with zero or more calls to agent_step() (zero if the episode reaches a terminal state after the first action). The episode ends when either the agent reaches a terminal state (in which case agent_end() is called exactly once), or the episode reaches Cmax (300) steps (in which case agent_end() is not called).

After all the episodes have been run, agent_cleanup() will be called to allow the agent to free any resources allocated in agent_init().
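A skeleton of the agent interface implied by the calls above might look like the following C sketch. The type names, placeholder typedefs, and exact signatures are assumptions for illustration only; the real definitions come from the header files in the benchmark distribution.

    /* Placeholder types so this sketch stands alone; the real definitions come
       from the framework headers in the benchmark distribution. */
    typedef int    Action;
    typedef double Reward;
    typedef struct { int placeholder; } Observation;   /* stands in for the state */

    void agent_init(const char *task_spec)
    {
        /* Parse the task-specification string and allocate learner state. */
    }

    const char *agent_get_name(void)
    {
        return "my-solver-v1";               /* unique identifier for the results */
    }

    Action agent_start(Observation o)
    {
        /* Called once at the start of each episode; return the first action. */
        return 0;
    }

    Action agent_step(Reward r, Observation o)
    {
        /* Called on each subsequent step; learn from r and return the next action. */
        return 0;
    }

    void agent_end(Reward r)
    {
        /* Called when a terminal state is reached (not when the Cmax cutoff hits). */
    }

    void agent_cleanup(void)
    {
        /* Free anything allocated in agent_init(). */
    }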

Benchmark contact: Michael Littman, mlittman@cs.rutgers.edu.