Sequential agent#

Goal

In this tutorial, you will discover the simplest Flatland policy: a sequential approach that moves trains one at a time following a shortest-path strategy.

Have a look at the sequential_agent.py file.

The sequential agent is a very simple baseline that moves agents one after the other, taking each one from its starting point to its target along the shortest possible path.

This approach is very inefficient, but it solves all the instances generated by the sparse_rail_generator. However, when scored in the AIcrowd challenge, this agent will receive a low score due to the time it needs to solve an episode.
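
To make the "one train at a time" idea concrete, here is a minimal sketch of such a control loop. This is not the actual sequential_agent.py implementation: choose_action is a hypothetical placeholder for the shortest-path action selection done in that file, and every agent except the currently active one is simply told to stop.

from flatland.envs.rail_env import RailEnvActions

def run_sequentially(env, choose_action, max_steps=1000):
    """Drive agents strictly one after the other: only the active agent moves."""
    obs, info = env.reset()
    done = {a: False for a in range(env.get_num_agents())}
    done['__all__'] = False
    active = 0  # handle of the agent currently being routed

    for _ in range(max_steps):
        # Hand control to the next agent once the current one has reached its target
        while active < env.get_num_agents() and done[active]:
            active += 1
        if active >= env.get_num_agents() or done['__all__']:
            break

        # Only the active agent gets a real action; everyone else is stopped
        action_dict = {a: RailEnvActions.STOP_MOVING for a in range(env.get_num_agents())}
        action_dict[active] = choose_action(env, active)  # e.g. follow the shortest path

        obs, rewards, done, info = env.step(action_dict)

    return done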

Here you see it in action:

Sequential_Agent

Setting up the environment#

Install everything you need with conda using the environment.yml file:

$ conda env create -f environment.yml

We start by importing NumPy and the necessary flatland modules:

import numpy as np

from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.predictions import ShortestPathPredictorForRailEnv
from flatland.envs.rail_env import RailEnv
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.line_generators import sparse_line_generator
from flatland.utils.rendertools import RenderTool

For this simple example we want to train on randomly generated levels, using the sparse_rail_generator for the rail network and the sparse_line_generator for the train lines. We use the following parameters for our first experiment:

# Parameters for the Environment
x_dim = np.random.randint(30, 35)
y_dim = np.random.randint(30, 35)
n_agents = np.random.randint(3, 8)

For this experiment we will use the tree observation. We pass it as an argument to the environment constructor:

env = RailEnv(
    width=x_dim,
    height=y_dim,
    rail_generator=sparse_rail_generator(),
    line_generator=sparse_line_generator(),
    obs_builder_object=TreeObsForRailEnv(max_depth=1, predictor=ShortestPathPredictorForRailEnv()),
    number_of_agents=n_agents)
env.reset(True, True)

We have now successfully set up the environment!
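
To have a look at the level that was just generated, you can use the RenderTool imported above (the exact render_env keyword arguments can vary slightly between flatland versions):

# Render the generated environment (optional, but handy for debugging)
env_renderer = RenderTool(env)
env_renderer.render_env(show=True, show_observations=False)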

Setting up the policy#

To set up an appropriate agent we need the sizes of the state and action spaces. We derive the state size from the depth of the tree observation and the number of features per node:

# Calculate the state size given the depth of the tree observation and the number of features per node
# The tree depth must match the max_depth passed to TreeObsForRailEnv above
tree_depth = 1
n_features_per_node = env.obs_builder.observation_dim
n_nodes = 0
for i in range(tree_depth + 1):
    # Each node of the tree observation has up to 4 children (left, forward, right, back)
    n_nodes += np.power(4, i)
state_size = n_features_per_node * n_nodes

# The action space of flatland is 5 discrete actions
action_size = 5
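
As a quick sanity check of the formula (the number of nodes grows as a power of 4 with the depth):

# A tree of depth 1 covers 1 + 4 = 5 nodes, a tree of depth 2 covers 1 + 4 + 16 = 21 nodes,
# so the state vector is 5x or 21x the per-node feature count respectively
assert sum(4 ** i for i in range(1 + 1)) == 5
assert sum(4 ** i for i in range(2 + 1)) == 21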

We train the agent with a Double Dueling DQN (DDDQN) policy, which combines Double Q-learning (van Hasselt et al., 2016) with the dueling network architecture (Wang et al., 2016):

# Double Dueling DQN policy
# DDDQNPolicy ships with the example training code (it is not part of flatland itself);
# Namespace comes from argparse, and training_parameters is the hyperparameter dict (not shown here)
policy = DDDQNPolicy(state_size, action_size, Namespace(**training_parameters))
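
As a rough, illustrative sketch (not the implementation behind DDDQNPolicy): the "dueling" part means the Q-network splits into a state-value stream and an advantage stream, and the "double" part means the online network selects the next action while the target network evaluates it. Assuming a PyTorch setup:

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Illustrative dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_size, hidden_size), nn.ReLU())
        self.value = nn.Linear(hidden_size, 1)                 # state-value stream V(s)
        self.advantage = nn.Linear(hidden_size, action_size)   # advantage stream A(s, a)

    def forward(self, state):
        x = self.feature(state)
        return self.value(x) + self.advantage(x) - self.advantage(x).mean(dim=-1, keepdim=True)

def double_dqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Double DQN: the online network picks the action, the target network scores it."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=-1, keepdim=True)
        next_q = target_net(next_state).gather(-1, best_action).squeeze(-1)
    return reward + gamma * next_q * (1.0 - done)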

In the single_agent_training.py file you will find further bookkeeping variables that we initialize in order to keep track of the training progress. We omit them here for brevity.

We reshape and normalize each tree observation provided by the environment to facilitate training. To do so, we use the utility function normalize_observation(observation, tree_depth: int, observation_radius=0) defined in the utils folder.

# Build agent specific observations
for a in range(env.get_num_agents()):
    if obs[a]:
        agent_obs[a] = normalize_observation(obs[a], tree_depth, observation_radius=10)

We now use the normalized agent_obs in our training loop:

for episode_idx in range(1, n_episodes + 1):
    # Reset environment
    obs, info = env.reset(True, True)
    # Build agent specific observations
    for a in range(env.get_num_agents()):
        if obs[a]:
            agent_obs[a] = normalize_observation(obs[a], tree_depth, observation_radius=10)
            agent_prev_obs[a] = agent_obs[a].copy()

    # Reset score and done
    score = 0
    env_done = 0

    # Run episode
    for step in range(max_steps):
        # Action
        update_values = [False] * env.get_num_agents()
        for a in range(env.get_num_agents()):
            if info['action_required'][a]:
                # If an action is required, we store the obs at this step as well as the action taken
                update_values[a] = True
                action = policy.act(agent_obs[a], eps=eps)
                action_prob[action] += 1
            else:
                update_values[a] = False
                action = 0
            action_dict.update({a: action})

        # Environment step
        next_obs, all_rewards, done, info = env.step(action_dict)
        # Update replay buffer and train agent
        for a in range(env.get_num_agents()):
            # Only update the values when we are done or when an action was taken and thus relevant information is present
            if update_values[a] or done[a]:
                policy.step(agent_prev_obs[a], agent_prev_action[a], all_rewards[a], agent_obs[a], done[a])
                cumulated_reward[a] = 0.

                agent_prev_obs[a] = agent_obs[a].copy()
                agent_prev_action[a] = action_dict[a]

            if next_obs[a]:
                agent_obs[a] = normalize_observation(next_obs[a], tree_depth, observation_radius=10)

            score += all_rewards[a] / env.get_num_agents()

        # Stop the episode once all agents are done
        if done['__all__']:
            env_done = 1
            break

    # Epsilon decay
    eps = max(eps_end, eps_decay * eps)  # decrease epsilon
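
After training, you can roll out the learned policy greedily by setting the exploration rate to zero. A minimal evaluation episode, reusing the names from the loop above (and assuming policy.act takes the same eps argument as during training), could look like this:

# Greedy evaluation episode: no exploration, just the learned Q-values
obs, info = env.reset(True, True)
for a in range(env.get_num_agents()):
    if obs[a]:
        agent_obs[a] = normalize_observation(obs[a], tree_depth, observation_radius=10)

for step in range(max_steps):
    action_dict = {}
    for a in range(env.get_num_agents()):
        if info['action_required'][a]:
            action_dict[a] = policy.act(agent_obs[a], eps=0.0)
        else:
            action_dict[a] = 0
    next_obs, all_rewards, done, info = env.step(action_dict)
    for a in range(env.get_num_agents()):
        if next_obs[a]:
            agent_obs[a] = normalize_observation(next_obs[a], tree_depth, observation_radius=10)
    if done['__all__']:
        break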

Results#

Running the single_agent_training.py file trains a simple agent to navigate to any random target within the railway network. After training, you should see a learning curve similar to this one:

Learning_curve

and the agent behavior should look like this:

Single_Agent_Navigation

This is a good start! On the next page, we will see how to train a policy that handles multiple agents more robustly, and how to train a policy that can be submitted to the Flatland challenge!