The goal in Flatland is simple:
We seek to minimize the time it takes to bring all the agents to their respective targets.
This raises a number of questions:
Actions: what can the agents do?
Observations: what can each agent “see”?
Rewards: what is the metric used to evaluate the agents?
The trains in Flatland have strongly limited movements, as you would expect from a railway simulation. This means that only a few actions are valid in most cases.
Here are the possible actions:
DO_NOTHING: If the agent is already moving, it continues moving. If it is stopped, it stays stopped. Special case: if the agent is at a dead-end, this action will result in the train turning around.
MOVE_LEFT: This action is only valid at cells where the agent can change direction towards the left. If chosen, the left transition and a rotation of the agent's orientation to the left are executed. If the agent is stopped, this action will cause it to start moving in any cell where forward or left is allowed!
MOVE_FORWARD: The agent will move forward. This action will start the agent when stopped. At switches, this will choose the forward direction.
MOVE_RIGHT: The same as MOVE_LEFT, but for right turns.
STOP_MOVING: This action causes the agent to stop.
The actions are defined in flatland.envs.rail_env.RailEnvActions.
You can refer to the actions in your code using, e.g., RailEnvActions.MOVE_FORWARD or RailEnvActions.STOP_MOVING.
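To make the moving/stopped semantics of the five actions concrete, here is a minimal sketch. It deliberately does not import Flatland: the `RailEnvActions` class below is a local stand-in that only mirrors the action names listed above, its numeric values are an assumption, and `is_moving_after` is a hypothetical helper that ignores track geometry, dead-ends and invalid transitions.

```python
from enum import IntEnum


class RailEnvActions(IntEnum):
    """Local stand-in mirroring the action names above (values are assumed)."""
    DO_NOTHING = 0
    MOVE_LEFT = 1
    MOVE_FORWARD = 2
    MOVE_RIGHT = 3
    STOP_MOVING = 4


def is_moving_after(action: RailEnvActions, moving: bool) -> bool:
    """Hypothetical helper: is the agent moving once the action is applied?

    Simplification: assumes the requested transition is allowed in the cell.
    """
    if action == RailEnvActions.DO_NOTHING:
        return moving   # a moving agent keeps moving, a stopped one stays stopped
    if action == RailEnvActions.STOP_MOVING:
        return False
    return True         # MOVE_LEFT, MOVE_FORWARD and MOVE_RIGHT start the agent


# One action per agent handle (the handles 0..2 are made up for illustration).
actions = {
    0: RailEnvActions.MOVE_FORWARD,
    1: RailEnvActions.STOP_MOVING,
    2: RailEnvActions.DO_NOTHING,
}
```

In real code you would import the enum from flatland.envs.rail_env instead of redefining it.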
In Flatland, you have full control over the observations that your agents will work with. Three observations are provided as a starting point. However, you are encouraged to implement your own.
The three provided observations are:
Global grid observation
Local grid observation
Tree observation
Figure: Global, local and tree. A visual summary of the three provided observations.
The provided observations are defined in envs/observations.py
Each of the provided observations has its strengths and weaknesses. However, it is unlikely that you will be able to solve the problem by using any single one of them directly. Instead, you will need to design your own observation, which can be a combination of the existing ones or something radically different.
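As an illustration of what "a combination of the existing ones" could look like, the sketch below concatenates a local window around the agent with one piece of global information (the offset to the target). It is plain Python and does not use Flatland's observation API: the grid encoding, `local_window` and `combined_observation` are all hypothetical names chosen for this example.

```python
def local_window(grid, position, radius):
    """Crop a (2*radius+1) x (2*radius+1) window centred on `position`.

    Cells outside the map are filled with 0.
    """
    row, col = position
    height, width = len(grid), len(grid[0])
    window = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            line.append(grid[r][c] if inside else 0)
        window.append(line)
    return window


def combined_observation(grid, position, target, radius=1):
    """Flattened local window followed by the (row, col) offset to the target."""
    flat = [cell for line in local_window(grid, position, radius) for cell in line]
    offset = [target[0] - position[0], target[1] - position[1]]
    return flat + offset


# Toy 3x3 map whose integers stand in for cell types (an assumed encoding).
toy_grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
observation = combined_observation(toy_grid, position=(0, 0), target=(2, 2))
```

A real custom observation would be built on the classes in envs/observations.py; this sketch only shows the idea of mixing local and global features in one vector.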
At each time step, each agent receives a combined reward which consists of a local and a global reward signal.
Locally, the agent receives \(r_l = -1\) for each time step until it reaches its target location, and \(r_l = 0\) for each time step after that. The global reward signal \(r_g\) stays at \(0\) until all agents have reached their targets, at which point it is worth \(r_g = 1\).
Every agent \(i\) receives, at time step \(t\), the reward:
\(r_i(t) = \alpha \, r_l(t) + \beta \, r_g(t)\)
\(\alpha\) and \(\beta\) are factors for tuning collaborative behavior. This reward creates an objective of finishing the episode as quickly as possible in a collaborative way.
In the NeurIPS 2020 challenge, the values used are: \(\alpha = 1.0\) and \(\beta = 1.0\).
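The reward description above can be sketched in a few lines. This is not the code from envs/rail_env.py: it assumes the combined reward is the weighted sum \(\alpha r_l + \beta r_g\), and the function names are hypothetical.

```python
def local_reward(reached_target: bool) -> float:
    """-1 per time step until the agent reaches its target, 0 afterwards."""
    return 0.0 if reached_target else -1.0


def global_reward(all_reached: bool) -> float:
    """1 once every agent has reached its target, 0 otherwise."""
    return 1.0 if all_reached else 0.0


def step_rewards(reached, alpha=1.0, beta=1.0):
    """Per-agent reward for one time step.

    `reached` holds one boolean per agent. The weighted sum is an assumption
    about how the local and global signals are combined.
    """
    r_g = global_reward(all(reached))
    return [alpha * local_reward(done) + beta * r_g for done in reached]
```

With \(\alpha = \beta = 1.0\), an agent that is still travelling gets -1 per step, an agent waiting at its target gets 0, and every agent gets 1 on a step where all of them have arrived.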
The reward is calculated in envs/rail_env.py
The episodes finish when all the trains have reached their target, or when the maximum number of time steps is reached.
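The two termination conditions can be illustrated with a toy loop. This is a sketch only: `run_toy_episode` is a hypothetical function, and each agent is reduced to a number of steps it still needs, with no interaction between agents.

```python
def run_toy_episode(steps_to_target, max_steps):
    """Return (steps_taken, all_reached) for a toy episode.

    The loop stops as soon as every agent has arrived or the step budget
    `max_steps` is exhausted, whichever comes first.
    """
    remaining = list(steps_to_target)
    step = 0
    while step < max_steps and any(r > 0 for r in remaining):
        remaining = [max(r - 1, 0) for r in remaining]  # every agent advances one step
        step += 1
    return step, all(r == 0 for r in remaining)
```

For example, `run_toy_episode([3, 5], max_steps=10)` ends after 5 steps with every agent arrived, while the same agents with `max_steps=4` hit the step limit first.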
🚉 Other concepts
Going further, you will want to run experiments using a variety of environments. You can create custom levels either by using one of the random generators or by designing them by hand.
An important aspect of these levels is their stochasticity, that is, how often and for how long trains malfunction. Malfunctions force the agents to reconsider their plans, which can be costly.
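To give a feel for the two knobs mentioned above (how often and for how long), here is a small stand-alone simulation. It is not Flatland's malfunction generator: the parameter names `rate`, `min_duration` and `max_duration` and the sampling scheme are assumptions made for this sketch.

```python
import random


def simulate_malfunctions(n_steps, rate, min_duration, max_duration, seed=0):
    """Return one boolean per time step: True while the train is broken down.

    Assumed model: a healthy train breaks down with probability `rate` at each
    step, for a duration drawn uniformly from [min_duration, max_duration].
    """
    rng = random.Random(seed)
    remaining = 0  # steps left in the current malfunction
    broken = []
    for _ in range(n_steps):
        if remaining == 0 and rng.random() < rate:
            remaining = rng.randint(min_duration, max_duration)
        broken.append(remaining > 0)
        if remaining > 0:
            remaining -= 1
    return broken


# Fraction of time steps lost to malfunctions for one (made-up) setting.
timeline = simulate_malfunctions(n_steps=1000, rate=0.05, min_duration=5, max_duration=20)
downtime = sum(timeline) / len(timeline)
```

Raising `rate` or the duration bounds increases `downtime`, which is one way to compare how disruptive two level configurations are before training on them.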
Finally, trains in real railway networks don’t all move at the same speed. A freight train will for example be slower than a passenger train. This is an important consideration, as you want to avoid scheduling a fast train behind a slow train!
Speed profiles are not used in the first round of the NeurIPS 2020 challenge.
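Why a fast train should not be scheduled behind a slow one can be shown with simple arithmetic. The sketch below assumes a single track without overtaking and speeds expressed as a fraction of a cell per time step; `arrival_steps` is a hypothetical helper, not part of Flatland, and it ignores the spacing needed between trains.

```python
import math


def arrival_steps(speeds, track_length):
    """Arrival step of each train on a single track with no overtaking.

    `speeds` lists the trains from front to back, in cells per time step
    (an assumed unit). A train cannot arrive before the one ahead of it.
    """
    arrivals = []
    for speed in speeds:
        own = math.ceil(track_length / speed)    # unobstructed travel time
        ahead = arrivals[-1] if arrivals else 0  # arrival of the train in front
        arrivals.append(max(own, ahead))
    return arrivals


slow_first = arrival_steps([0.5, 1.0], track_length=10)  # fast train stuck behind slow one
fast_first = arrival_steps([1.0, 0.5], track_length=10)  # fast train runs unobstructed
```

On a 10-cell track, putting the slow train first delays the fast train to step 20, whereas sending the fast train first lets it arrive at step 10 without changing the slow train's arrival.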