# Single agent

Goal

By the end of this tutorial, you will be able to train a single agent to navigate in Flatland using DQN!

We use the single_agent_training.py file to train a simple agent with the tree observation to solve the navigation task. This tutorial walks you through this file step by step.

## Setting up the environment

Before you get started with the training, you will need to have PyTorch installed. You can install everything you need using the requirements.txt file:

$ pip install -r requirements.txt


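We start by importing NumPy and the necessary Flatland modules:

import numpy as np

from flatland.envs.rail_env import RailEnv
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.schedule_generators import sparse_schedule_generator
# Needed for the malfunction_from_params(...) call in the environment constructor
from flatland.envs.malfunction_generators import malfunction_from_params
from utils.observation_utils import normalize_observation
from flatland.envs.observations import TreeObsForRailEnv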


For this simple example we want to train on randomly generated levels, built with the sparse_rail_generator and the sparse_schedule_generator. We use the following parameters for our first experiment:

# Parameters for the Environment
x_dim = 35
y_dim = 35
n_agents = 1


It is possible to use multiple speed profiles, which simulate different kinds of trains. In the NeurIPS 2020 challenge, we only consider trains with a speed of 1.0, so we set up the speed profiles accordingly:

# Different agent types (trains) with different speeds.
speed_ration_map = {
    1.: 1.0,  # 100% of fast passenger train
    1. / 2.: 0.0,  # 0% of fast freight train
    1. / 3.: 0.0,  # 0% of slow commuter train
    1. / 4.: 0.0  # 0% of slow freight train
}
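
To make the role of the fractions concrete, here is a minimal sketch of how such a map could be checked and sampled. The helper `sample_speed` is hypothetical and not part of Flatland; the schedule generator takes care of the actual speed assignment.

```python
import math
import random

# Same shape as the map above: speed -> fraction of agents with that speed.
speed_ration_map = {
    1.0: 1.0,
    1.0 / 2.0: 0.0,
    1.0 / 3.0: 0.0,
    1.0 / 4.0: 0.0,
}

# The fractions describe a distribution over speeds, so they should sum to 1.
assert math.isclose(sum(speed_ration_map.values()), 1.0)


def sample_speed(ratio_map, rng):
    """Hypothetical helper: draw one speed according to the given fractions."""
    speeds = list(ratio_map.keys())
    weights = list(ratio_map.values())
    return rng.choices(speeds, weights=weights, k=1)[0]


rng = random.Random(0)
sampled = [sample_speed(speed_ration_map, rng) for _ in range(10)]
print(sampled)  # every draw is 1.0, since all other fractions are 0
```

With all of the weight on speed 1.0, every sampled train is a fast passenger train, which matches the challenge setting described above.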


For this experiment we will use the tree observation:

# We are training an Agent using the Tree Observation with depth 2
observation_builder = TreeObsForRailEnv(max_depth=2)


We then pass it as an argument to the environment constructor:

env = RailEnv(
    width=x_dim,
    height=y_dim,
    rail_generator=sparse_rail_generator(
        max_num_cities=3,  # Number of cities in map (where train stations are)
        seed=1,  # Random seed
        grid_mode=False,
        max_rails_between_cities=2,
        max_rails_in_city=3
    ),
    schedule_generator=sparse_schedule_generator(speed_ration_map),
    number_of_agents=n_agents,
    # stochastic_data holds the malfunction parameters; it must be defined before this call
    malfunction_generator_and_process_data=malfunction_from_params(stochastic_data),
    # Pass the tree observation builder created above
    obs_builder_object=observation_builder
)


We have now successfully set up the environment for training!

## Setting up the agent

To set up an appropriate agent we need the sizes of the state and action spaces. We calculate the state size from the tree depth and the number of features per node:

# Given the depth of the tree observation and the number of features per node we get the following state_size
features_per_node = 9
tree_depth = 2
nr_nodes = 0
for i in range(tree_depth + 1):
    nr_nodes += np.power(4, i)
state_size = features_per_node * nr_nodes

# The action space of flatland is 5 discrete actions
action_size = 5
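
As a quick sanity check on these numbers, the node count is a geometric series and can be computed in closed form. The sketch below uses plain Python only. The `Action` enum is a local stand-in that mirrors the five values of Flatland's `RailEnvActions`; it is defined here only for illustration.

```python
from enum import IntEnum

features_per_node = 9
tree_depth = 2

# Sum of 4**i for i = 0..tree_depth, written as a geometric series.
nr_nodes = (4 ** (tree_depth + 1) - 1) // 3
state_size = features_per_node * nr_nodes

# The closed form agrees with the explicit loop used in the tutorial.
assert nr_nodes == sum(4 ** i for i in range(tree_depth + 1))

print(nr_nodes, state_size)  # 21 189


class Action(IntEnum):
    """Local stand-in mirroring the five discrete Flatland actions."""
    DO_NOTHING = 0
    MOVE_LEFT = 1
    MOVE_FORWARD = 2
    MOVE_RIGHT = 3
    STOP_MOVING = 4


action_size = len(Action)
print(action_size)  # 5
```

So a depth-2 tree with 9 features per node yields a flat input vector of 189 values for the network, and action `0` corresponds to doing nothing.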


In the training_navigation.py file you will find further bookkeeping variables that we initialize in order to keep track of the training progress. We omit them here for brevity.
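
For orientation, the bookkeeping that the training loop relies on could be initialized roughly as follows. This is only a sketch: the variable names are the ones that appear in the loop, but every value here is an illustrative assumption rather than the setting used in the training file.

```python
from collections import deque

n_agents = 1
action_size = 5

# Training hyperparameters (assumed values, chosen only for illustration).
n_episodes = 500
max_steps = 200
eps = 1.0          # initial exploration rate
eps_end = 0.005    # lower bound for the exploration rate
eps_decay = 0.998  # multiplicative decay applied once per episode

# Per-agent bookkeeping used inside the training loop.
action_dict = {}                        # agent handle -> chosen action
agent_obs = [None] * n_agents           # latest normalized observation
agent_prev_obs = [None] * n_agents      # observation at the previous decision
agent_prev_action = [0] * n_agents      # action taken at the previous decision
cumulated_reward = [0.0] * n_agents     # reward accumulated between decisions
action_prob = [0] * action_size         # how often each action was selected
update_values = False

# Hypothetical extra: a sliding window of recent scores for smoother reporting.
scores_window = deque(maxlen=100)
```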

Note that we reshape and normalize the tree observation provided by the environment to facilitate training. To do so, we use the utility function normalize_observation(observation: TreeObsForRailEnv.Node, tree_depth: int, observation_radius=0) defined in the utils folder.

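# Build agent specific observations
for a in range(env.get_num_agents()):
    if obs[a]:
        # Normalize the raw tree observation (observation_radius is left at its default here)
        agent_obs[a] = normalize_observation(obs[a], tree_depth)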
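
To give an intuition for what normalization can look like, here is a minimal sketch that scales values by a fixed radius and clips them to [-1, 1]. The function `clip_normalize` is hypothetical and is not the implementation of normalize_observation, which may differ in its details. The sketch assumes that missing or unreachable entries can show up as infinities.

```python
def clip_normalize(values, radius, clip_min=-1.0, clip_max=1.0):
    """Hypothetical helper: scale by a fixed radius, then clip to a range."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    return [min(clip_max, max(clip_min, v / radius)) for v in values]


raw_values = [0.0, 5.0, 25.0, float("inf"), float("-inf")]
normalized = clip_normalize(raw_values, radius=10.0)
print(normalized)  # [0.0, 0.5, 1.0, 1.0, -1.0]
```

Bounding the inputs this way keeps very large or infinite entries from dominating the network's input.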


We now use the normalized agent_obs in our training loop:

for episode_idx in range(1, n_episodes + 1):
    # Reset environment
    obs, info = env.reset(True, True)

    # Build agent specific observations
    for a in range(env.get_num_agents()):
        if obs[a]:
            # Normalize the raw tree observation (observation_radius is left at its default here)
            agent_obs[a] = normalize_observation(obs[a], tree_depth)
            agent_prev_obs[a] = agent_obs[a].copy()

    # Reset score and done
    score = 0
    env_done = 0

    # Run episode
    for step in range(max_steps):
        # Action
        for a in range(env.get_num_agents()):
            if info['action_required'][a]:
                # If an action is required, we want to store the obs at that step as well as the action
                update_values = True
                action = agent.act(agent_obs[a], eps=eps)
                action_prob[action] += 1
            else:
                update_values = False
                action = 0
            action_dict.update({a: action})

        # Environment step
        next_obs, all_rewards, done, info = env.step(action_dict)

        # Update replay buffer and train agent
        for a in range(env.get_num_agents()):
            # Only update the values when we are done or when an action was taken and thus relevant information is present
            if update_values or done[a]:
                agent.step(agent_prev_obs[a], agent_prev_action[a], all_rewards[a], agent_obs[a], done[a])
                cumulated_reward[a] = 0.

                agent_prev_obs[a] = agent_obs[a].copy()
                agent_prev_action[a] = action_dict[a]

            # Normalize the new observation for the next step
            if next_obs[a]:
                agent_obs[a] = normalize_observation(next_obs[a], tree_depth)

            score += all_rewards[a] / env.get_num_agents()

        # Stop the episode once all agents are done
        if done['__all__']:
            env_done = 1
            break

    # Epsilon decay
    eps = max(eps_end, eps_decay * eps)  # decrease epsilon
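
The epsilon update at the end of each episode controls how much the agent explores. The sketch below shows the decay schedule together with a generic epsilon-greedy action choice. The start, end and decay values are assumptions, and `epsilon_greedy` is a hypothetical stand-in for what `agent.act` presumably does with its `eps` argument.

```python
import random

eps_start, eps_end, eps_decay = 1.0, 0.005, 0.998  # assumed values


def epsilon_greedy(q_values, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Record epsilon at the start of each episode, then apply the same update as above.
eps = eps_start
schedule = []
for episode_idx in range(1, 3001):
    schedule.append(eps)
    eps = max(eps_end, eps_decay * eps)

print(schedule[0], schedule[-1])  # starts at eps_start and ends at the eps_end floor

rng = random.Random(0)
q_values = [0.1, 0.3, 0.9, 0.2, 0.0]
greedy_action = epsilon_greedy(q_values, eps=0.0, rng=rng)
print(greedy_action)  # 2, since eps=0 always selects the highest Q-value
```

Because the decay is multiplicative and bounded below by `eps_end`, exploration shrinks smoothly and never switches off completely.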


## Results

Running the training_navigation.py file trains a simple agent to navigate to any random target within the railway network. After running you should see a learning curve similar to this one:

and the agent behavior should look like this: