Imitation Learning Training#
TL;DR
We consider two broad Imitation Learning approaches:
Pure Imitation Learning
This approach trains purely offline from stored experiences.
It is implemented using the RLlib-supported MARWIL algorithm, which is an on-policy algorithm.
Hybrid or Mixed Learning
This approach trains both offline from stored experiences and online from actual environment simulation experience. It is implemented using the off-policy APE-X DQN algorithm in RLlib, with the stored expert experiences sampled from the replay buffer according to a user-configurable ratio of simulation to expert data.
The idea#
The Flatland environment presents challenges to learning algorithms with regard to organizing the agent's behavior in the presence of other agents, as well as coping with sparse and delayed rewards. Operations Research (OR) based solutions, which follow rule-based and planning approaches, have been shown to work well in this environment. We propose a hybrid approach that uses these successful OR solutions to bootstrap the reinforcement learning process and possibly improve on them further. The resulting trained reinforcement learning solution can then be used in larger environments where scalability is required.
Files and usage#
Parameters#
MARWIL can be configured with the following parameter:
beta
: Ranges from 0 to 1, where a value of 0 corresponds to vanilla imitation learning (a minimal sketch of this weighting follows below).
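For intuition, MARWIL weights a behaviour-cloning term by an exponential of the estimated advantage, so beta controls how strongly better-than-average transitions are emphasised; with beta = 0 every transition is weighted equally, which is the vanilla imitation learning case mentioned above. The snippet below is a minimal conceptual sketch of that weighting, not the RLlib implementation; the advantages and log_probs arrays are toy assumed values.

import numpy as np

# Toy per-transition advantage estimates and log pi(a|s) of the expert actions (assumed values).
advantages = np.array([0.5, -0.2, 1.0])
log_probs = np.array([-0.3, -1.2, -0.7])

def marwil_style_loss(beta):
    # Exponentially advantage-weighted behaviour cloning:
    # beta = 0 gives uniform weights, i.e. vanilla imitation learning.
    weights = np.exp(beta * advantages)
    return -np.mean(weights * log_probs)

print(marwil_style_loss(beta=0.0))  # plain behaviour cloning
print(marwil_style_loss(beta=1.0))  # advantage-weighted imitation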
APE-X based Imitation Learning can be configured with the following parameters:
"/tmp/flatland-out"
: Ranges from 0 to 1, where a value of 0 represents no imitation learning. This represents the proportion of imitation samples.
sampler
: Ranges from 0 to 1, where a value of 0 represents 100% Imitation Learning. This represents the proportion of simulation samples.
loss
: Possible values are dqfd, ce, and kl, which select the loss based on DQfD, cross entropy, and KL divergence respectively.
lambda1
: Weight applied to the APE-X DQN loss when calculating the total loss.
lambda2
: Weight applied to the Imitation loss when calculating the total loss (a conceptual sketch of the combined loss follows below).
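Roughly, these options combine into total_loss = lambda1 * dqn_loss + lambda2 * imitation_loss, with the imitation term selected by the loss setting. The snippet below is a conceptual, self-contained sketch of that combination and is not the repository's implementation (that lives in custom_loss_model.py); the Q-values, the TD loss value, and the DQfD margin are toy assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def imitation_loss(loss_type, q_values, expert_action, margin=0.8):
    # Toy per-state imitation losses mirroring the dqfd / ce / kl options.
    probs = softmax(q_values)
    if loss_type == "ce":    # cross entropy on the expert action
        return -np.log(probs[expert_action])
    if loss_type == "kl":    # KL divergence from a one-hot expert distribution
        expert = np.eye(len(q_values))[expert_action]
        return float(np.sum(expert * (np.log(expert + 1e-8) - np.log(probs + 1e-8))))
    if loss_type == "dqfd":  # DQfD large-margin classification loss
        margins = np.where(np.arange(len(q_values)) == expert_action, 0.0, margin)
        return float(np.max(q_values + margins) - q_values[expert_action])
    raise ValueError(loss_type)

q_values = np.array([1.2, 0.3, -0.5, 0.7, 0.1])  # toy Q-values for the 5 Flatland actions
dqn_loss = 0.4                                   # toy APE-X DQN TD loss
lambda1, lambda2 = 1.0, 1.0                      # loss weights as described above

for loss_type in ("dqfd", "ce", "kl"):
    total = lambda1 * dqn_loss + lambda2 * imitation_loss(loss_type, q_values, expert_action=2)
    print(loss_type, total)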
Training#
MARWIL#
Example configuration: neurips2020-flatland-baselines/baselines/imitation_learning_tree_obs/marwil_tree_obs_all_beta.yaml.
Run it with:
$ python ./train.py -f baselines/imitation_learning_tree_obs/marwil_tree_obs_all_beta.yaml
APE-X Imitation Learning (IL)#
Example configuration: neurips2020-flatland-baselines/baselines/imitation_learning_tree_obs/apex_il_tree_obs_all.yaml.
Run it with:
$ python ./train.py -f baselines/imitation_learning_tree_obs/apex_il_tree_obs_all.yaml
Model#
We use a standard fully connected neural network model for both MARWIL and APE-X IL.
The model with the custom loss used in APE-X IL is implemented in neurips2020-flatland-baselines/models/custom_loss_model.py.
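For reference, a "standard fully connected model" here means a small multilayer perceptron mapping the flattened tree observation to action logits / Q-values. The sketch below is purely illustrative and uses PyTorch; the observation size (which depends on the tree observation depth) and hidden layer sizes are assumptions rather than the baselines' actual settings, and the APE-X IL variant additionally adds the imitation term through the custom loss model mentioned above.

import torch.nn as nn

# Assumed sizes: a depth-2 tree observation flattens to roughly 231 features,
# and Flatland has 5 discrete actions. Both values are assumptions for illustration.
obs_size, num_actions = 231, 5

# A minimal fully connected network of the kind described above.
fc_model = nn.Sequential(
    nn.Linear(obs_size, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, num_actions),  # action logits / Q-values
)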
Implementation Details#
Generating Expert Demonstrations#
We have provided a set of expert demonstrations in this location. Some of the results presented below were based on the file expert-demonstrations.tgz. This expert demonstrations file contains the saved version of the environment and the corresponding expert actions.
Next, we convert them to an RLlib-compatible format. More details can be found here.
We follow the steps below to generate our list of expert demonstrations:
Download and extract the expert demonstrations file into the location neurips2020-flatland-baselines/imitation_learning/convert_demonstration
Run saving_experiences.py to generate the expert demonstrations in RLlib format (*.json); a sketch of this conversion pattern is shown after these steps.
Copy the generated experiences to the location /tmp/flatland-out
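For orientation, the snippet below shows the general pattern RLlib uses for writing experiences to its offline *.json format; the repository's saving_experiences.py is the authoritative script, the class paths below correspond to the Ray versions used around these baselines (roughly 0.8-1.x) and may differ in newer releases, and expert_episodes stands in for the loaded demonstration data.

from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

batch_builder = SampleBatchBuilder()
writer = JsonWriter("/tmp/flatland-out")  # directory later referenced by the input setting

# expert_episodes: assumed list of episodes, each a list of
# (obs, action, reward, done, new_obs) tuples loaded from the demonstrations file.
for eps_id, episode in enumerate(expert_episodes):
    for t, (obs, action, reward, done, new_obs) in enumerate(episode):
        batch_builder.add_values(
            t=t,
            eps_id=eps_id,
            obs=obs,
            actions=action,
            rewards=reward,
            dones=done,
            new_obs=new_obs,
        )
    writer.write(batch_builder.build_and_reset())  # one JSON line per episode batch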
Run Model with Expert Demonstrations#
The custom_loss_model file contains the code for all the concepts needed for the Imitation Loss implementation.
For the training process, refer to the training section.
Results#
Results for MARWIL and APE-X based Imitation Learning are presented below.
MARWIL Results#
Full metrics of the training runs can be found in the Weights & Biases report.
APE-X IL Results#
Full metrics of the training runs can be found in the Weights & Biases report.
The results show that pure Imitation Learning can help push the mean completion rate to more than 50% on the sparse, small Flatland environment. Combining the expert demonstrations with environment training using the fast APE-X achieves a mean completion rate comparable to the corresponding pure Reinforcement Learning (RL) runs. A notable observation is that the Imitation Learning algorithms show a better minimum completion rate.