Centralized Critic PPO#
TL;DR
A PPO agent using a Centralized Critic. The implementation uses a multi-agent variation of Centralized Critic with a Set Transformer for preprocessing the critic's input.
💡 The idea#
The Flatland environment poses challenges to learning algorithms with respect to an agent's behavior in the presence of other agents, as well as with its sparse and delayed rewards. Centralized critic methods are one way to deal with such problematic multi-agent training situations.
The base architecture implemented here is a fully connected network trained with PPO. At execution time the agents step through the environment in the usual way. During training, however, a separate network is used to provide the gradients on which the agent's network is trained. This "central critic" is given a more holistic view of the simulation state in order to better estimate the value of the agent's actions in the current situation.
Different variations of centralized critics exist. The method contained in the baselines does not use global observations, as many other implementations do, but instead combines the observations of the agent being trained with the observations of the other agents. This approach was chosen in order to create a solution that remains feasible in large environments. It might not be the best-performing variant, and participants should therefore not be deterred from developing more specialized variations.
The model also contains a Set Transformer. This transformer sits between the agents' observations and the centralized critic, providing the critic with an embedding of the observations. This is useful for dealing with the sparsity of the observations and the temporal nature of the data, as well as for achieving permutation invariance when the subset of agents included in the observation changes.
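To make the data flow concrete, the sketch below shows the overall wiring: the actor only sees the agent's own observation, while the critic additionally receives a permutation-invariant embedding of the other agents' observations. A plain mean-pooling encoder stands in for the Set Transformer here, and all observation and layer sizes are illustrative assumptions rather than the values used in the baseline.

```python
import tensorflow as tf

OBS_SIZE, N_ACTIONS, EMBED_SIZE = 231, 5, 64  # illustrative sizes only

# Actor: used at execution time, conditioned only on the agent's own observation.
own_obs = tf.keras.layers.Input(shape=(OBS_SIZE,), name="own_obs")
h = tf.keras.layers.Dense(256, activation="relu")(own_obs)
logits = tf.keras.layers.Dense(N_ACTIONS, name="action_logits")(h)
actor = tf.keras.Model(own_obs, logits, name="actor")

# Critic: used only during training. It sees the agent's own observation plus
# an embedding of a variable-sized set of other agents' observations.
other_obs = tf.keras.layers.Input(shape=(None, OBS_SIZE), name="other_obs")
embedded = tf.keras.layers.Dense(EMBED_SIZE, activation="relu")(other_obs)
pooled = tf.keras.layers.GlobalAveragePooling1D()(embedded)  # permutation invariant
joint = tf.keras.layers.Concatenate()([own_obs, pooled])
value = tf.keras.layers.Dense(1, name="value")(
    tf.keras.layers.Dense(256, activation="relu")(joint))
critic = tf.keras.Model([own_obs, other_obs], value, name="central_critic")
```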
Files and usage#
The general structure of the code allows substitution of the separate parts, which will be discussed in the next section.
Training#
An example configuration file can be found in neurips2020-flatland-baselines/experiments/flatland_sparse/small_v0/tree_obs_fc_net/ccppo.yaml.
Run it with:
$ python ./train.py -f experiments/flatland_sparse/small_v0/tree_obs_fc_net/ccppo.yaml
Model#
The model is implemented in neurips2020-flatland-baselines/models/cc_transformer.py
Implementation Details#
The cc_transformer model file contains the code for all components needed for the central critic and transformer implementation. The three components (Transformer, Agent Model and Critic Model) are constructed and put together in the CentralizedCriticModel class.
Transformer#
The transformer is built from self-attention layers, which can be found in the MultiHeadAttentionLayer class. The layers compute the embedding with trained linear projections. The dimension of the resulting output can be chosen freely, as it is the product of the number of parallel heads (num_head) and the per-head output dimension (d_model).
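As an illustration of what such a layer computes, here is a minimal multi-head self-attention layer in TensorFlow 2 with the num_head and d_model parameters described above. It is a generic sketch and not a copy of the MultiHeadAttentionLayer class in cc_transformer.py.

```python
import tensorflow as tf


class MultiHeadSelfAttention(tf.keras.layers.Layer):
    """Scaled dot-product self-attention with num_head parallel heads."""

    def __init__(self, num_head: int, d_model: int):
        super().__init__()
        self.num_head = num_head
        self.d_model = d_model
        # Trained linear projections for queries, keys and values; the layer
        # output has size num_head * d_model.
        self.wq = tf.keras.layers.Dense(num_head * d_model)
        self.wk = tf.keras.layers.Dense(num_head * d_model)
        self.wv = tf.keras.layers.Dense(num_head * d_model)

    def _split_heads(self, x, batch_size):
        # (batch, set_size, num_head * d_model) -> (batch, num_head, set_size, d_model)
        x = tf.reshape(x, (batch_size, -1, self.num_head, self.d_model))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, x):
        batch_size = tf.shape(x)[0]
        q = self._split_heads(self.wq(x), batch_size)
        k = self._split_heads(self.wk(x), batch_size)
        v = self._split_heads(self.wv(x), batch_size)
        # Attention weights over the set dimension.
        scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(
            tf.cast(self.d_model, tf.float32))
        out = tf.matmul(tf.nn.softmax(scores, axis=-1), v)
        out = tf.transpose(out, perm=[0, 2, 1, 3])
        return tf.reshape(out, (batch_size, -1, self.num_head * self.d_model))


# Example: embed a set of 5 agent observations of size 32 each.
layer = MultiHeadSelfAttention(num_head=4, d_model=16)
embedding = layer(tf.random.normal((1, 5, 32)))  # -> shape (1, 5, 64)
```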
Central Critic#
The central critic is constructed with the build_fullyConnected function, which builds a fully connected network with the given input, output and hidden layer sizes.
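The exact signature of build_fullyConnected is best read from cc_transformer.py; as an illustration, a helper in the same spirit might look like the following (names, defaults and the critic sizes are assumptions).

```python
import tensorflow as tf


def build_fully_connected(input_size, hidden_sizes, output_size, name="fc"):
    """Build a fully connected Keras model from the given layer sizes."""
    inputs = tf.keras.layers.Input(shape=(input_size,), name=f"{name}_in")
    x = inputs
    for i, size in enumerate(hidden_sizes):
        x = tf.keras.layers.Dense(size, activation="relu",
                                  name=f"{name}_hidden_{i}")(x)
    outputs = tf.keras.layers.Dense(output_size, name=f"{name}_out")(x)
    return tf.keras.Model(inputs, outputs, name=name)


# The central critic maps the (embedded) joint observation to a single value.
critic = build_fully_connected(input_size=256, hidden_sizes=[512, 512],
                               output_size=1, name="central_critic")
```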
Agent Model#
The agent model is likewise built with fully connected layers, using build_fullyConnected to construct a network with the given input, output and hidden layer sizes.
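With the same assumed helper, the agent network differs only in its sizes, ending in one output logit per Flatland action:

```python
# Agent network built with the assumed helper above; the input size corresponds
# to the (flattened) observation and the output size to Flatland's 5 actions.
# Both numbers are illustrative, not the baseline's exact configuration.
agent_net = build_fully_connected(input_size=231, hidden_sizes=[512, 512],
                                  output_size=5, name="agent")
```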
Trainer#
For training the fully connected layers we use the standard PPO trainer implementation provided by RLlib, with the necessary updates to the post-processing. In centralized_critic_postprocessing we ensure that training_batches contain all the necessary observations of neighboring agents, and we also perform the advantage estimation there.
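The general shape of such a post-processing step, following RLlib's public centralized-critic example, is sketched below. The other_obs key and the compute_central_vf helper are assumed names for illustration and may differ from the baseline code, which also has to handle the case where no neighbor batches are available yet.

```python
import numpy as np
from ray.rllib.evaluation.postprocessing import compute_advantages
from ray.rllib.policy.sample_batch import SampleBatch

OTHER_OBS = "other_obs"  # assumed key name


def centralized_critic_postprocessing(policy, sample_batch,
                                      other_agent_batches=None, episode=None):
    if other_agent_batches:
        # Attach the neighboring agents' observations to the agent's batch.
        others = [batch[SampleBatch.CUR_OBS]
                  for _, batch in other_agent_batches.values()]
        sample_batch[OTHER_OBS] = np.concatenate(others, axis=-1)
        # Overwrite the value predictions with those of the central critic.
        sample_batch[SampleBatch.VF_PREDS] = policy.compute_central_vf(
            sample_batch[SampleBatch.CUR_OBS], sample_batch[OTHER_OBS])
    # Standard GAE advantage estimation on top of the central value predictions.
    last_r = 0.0 if sample_batch[SampleBatch.DONES][-1] \
        else sample_batch[SampleBatch.VF_PREDS][-1]
    return compute_advantages(
        sample_batch, last_r,
        policy.config["gamma"], policy.config["lambda"],
        use_gae=policy.config["use_gae"])
```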
The trainer registration is done at the end of the file, outside of any class or function. Here the CCPPO policy is defined, based on the standard PPOTrainer, and subsequently used to register the model to be used in Ray.
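Schematically, and assuming a ray 0.8-era RLlib API, this registration can look like the following; the exact names used in cc_transformer.py may differ.

```python
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy
from ray.rllib.models import ModelCatalog

# centralized_critic_postprocessing and CentralizedCriticModel are assumed to
# be the objects defined earlier in cc_transformer.py.
CCPPOTFPolicy = PPOTFPolicy.with_updates(
    name="CCPPOTFPolicy",
    postprocess_fn=centralized_critic_postprocessing)

CCPPOTrainer = PPOTrainer.with_updates(
    name="CCPPOTrainer",
    default_policy=CCPPOTFPolicy)

# Register the model under a name (assumed here) that the experiment yaml can
# reference via model/custom_model.
ModelCatalog.register_custom_model("cc_transformer", CentralizedCriticModel)
```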
Results#
The described approach was developed and evaluated in a project of Deutsche Bahn in cooperation with InstaDeep. In the context of that project, CCPPO was validated and directly compared to standard PPO on a previous version of Flatland. The original implementation, which used a custom sparse reward and a custom observation similar to Flatland's tree observation, outperformed standard PPO.

For the current baselines, the CCPPO model was run on the new Flatland version with its standard observation and reward. The graph below shows the results for these runs with and without the transformer. For comparison, a run of the current best method in the baselines (Ape-X) with comparable sizes and parameters is shown. The full configuration can be found in the corresponding CCPPO config files. The results show that CCPPO is able to learn in this setup, but with lower performance than Ape-X. We provide an example setup for experiments with CCPPO and encourage participants to further refine and develop this approach.