Deep imitation reinforcement learning with expert demonstration data

In recent years, deep reinforcement learning (DRL) has made impressive achievements in many fields. However, existing DRL algorithms usually require a large amount of exploration to obtain a good action policy. In addition, in many complex situations, the reward function cannot be designed well enough to meet the task requirements. These two problems make it difficult for DRL to learn a good action policy within a relatively short period. The use of expert data can provide effective guidance and avoid unnecessary exploration. This study proposes a deep imitation reinforcement learning (DIRL) algorithm that uses a certain amount of expert demonstration data to speed up the training of DRL. In the proposed method, the learning agent first imitates the expert's action policy by learning from demonstration data. After imitation learning, DRL is used to optimise the action policy in a self-learning way. Experimental comparisons on a video game, the Mario racing game, show that the proposed DIRL algorithm with expert demonstration data obtains much better performance than previous DRL algorithms without expert guidance.


Introduction
For a Markov decision process (MDP), a reinforcement learning [1] agent can continuously adjust its action strategy through feedback from the environment so that an optimised policy can be obtained. When the input dimension is large, as with image inputs, a deep neural network [2] can be used to approximate the value function or policy. In recent years, various deep reinforcement learning (DRL) methods have been successfully developed and applied to playing games [3,4] and robot control tasks [5,6]. However, DRL usually requires a large amount of data and time to learn a good policy. On the other hand, in some complex situations the reward function is difficult to design. In many cases, we can only design relatively simple and fuzzy reward rules; it is difficult to design a precise and perfect reward function. Without a good reward function, it is difficult for the algorithm to converge to the optimal strategy, and learning may even fail.
In addition to DRL, supervised learning with expert data seems to be a solution to this problem. NVidia, for example, trained a convolutional neural network (CNN) to achieve end-to-end automatic driving [7]. However, the direct use of expert data for supervised learning requires a human to collect a large number of training samples, and it is necessary to ensure that the training samples are comprehensive and perfect, which is unrealistic. Moreover, the network cannot continue to evolve and improve through self-learning. There are also methods that let an agent imitate a human, but they share the same problems as end-to-end deep learning.
Therefore, we consider the possibility of using expert data to guide the training of DRL and let the agent learn human strategies. We propose a deep imitation reinforcement learning (DIRL) algorithm to improve DRL with expert demonstration data. In DIRL, when the agent almost reaches the expert level, the algorithm gradually transforms into an ordinary DRL algorithm and continues the self-learning process. To achieve this, we use the expert demonstration data to guide the training sample exploration of DRL and to influence the reward function (Fig. 1). Experiments were conducted with the DIRL method on a video game platform, the Mario racing game. Deep Q Network (DQN) was used as the baseline DRL algorithm for comparison with DIRL. We found that the proposed DIRL algorithm converges much faster and its final policy is better because of the expert guidance.
The rest of this paper is organised as follows. Section 2 introduces the related works and Section 3 describes the proposed DIRL algorithm. Finally, we show the experimental results and draw conclusions.

Related works
There are many works on speeding up the training of DRL or using expert data. Some improved the DRL algorithm by using a more effective exploration strategy or changing the reward function; others proposed imitation learning algorithms to imitate human behaviour or used demonstrations to help the training of DRL.

Change exploration strategy
Many successful DRL algorithms use less effective explorations [8,9]. In [10], a bootstrapped algorithm was proposed that uses deep neural networks (DNNs) for random sampling to improve the way of exploring the state space. Lee and Chung [11] proposed a method to explore high-dimensional state spaces more efficiently by assigning exploration bonuses. Schaul et al. [12] selected better samples for training from the sample space.

Reward function with expert data
Human preferences have been used to set reward values rather than absolute values, or human feedback has been used to train a reward function when it is difficult to define one [13][14][15]. The reward function is trained to imitate human preferences, so that the learning agent finally obtains a strategy similar to human behaviour.

Imitation learning
Imitation learning is currently the most popular way to use expert data and learn human behaviour. The simplest form of imitation learning is supervised learning. For example, NVidia realises unmanned driving with end-to-end control by collecting a large number of samples [7]. However, this method is very limited: it requires a huge number of samples, and the samples need to cover almost all possible situations, which is very difficult to accomplish. A generative adversarial nets (GAN) style algorithm was proposed for imitation, with a generator producing a sequence of actions and a discriminator distinguishing whether an action sequence comes from an expert [16]. Through adversarial training, the actions produced by the generator become closer and closer to the expert's. Some researchers applied imitation learning to robot control, allowing robots to perform very complex tasks [17][18][19][20][21], while other works let robots learn directly from humans by watching videos [22]. However, not all expert data are optimal, and it is often difficult to obtain sufficient expert data.

Use demonstrations to help DRL
The deep Q-learning from demonstrations (DQFD) algorithm was proposed to use pre-training on expert data to speed up DQN training [23,24]. The idea of DQFD is similar to our work; however, DQFD only uses the pre-trained network to replace the Q network of DQN. It has also been suggested that the policy can be shaped by a human teacher [25] and that a shaped reward function can be obtained from demonstration [26]. Other researchers used demonstrations to address the exploration problem of reinforcement learning (RL) algorithms [27].

Framework of the DIRL algorithm
In this section, we consider the possibility of using expert data to guide the training of DRL. We first let an agent imitate human strategies through human expert data. After imitation learning, when the agent reaches an almost expert level, the algorithm gradually becomes an ordinary DRL method and continues improving through self-learning.
In this paper, we propose a DIRL algorithm to improve DRL with expert demonstration data. On the one hand, when collecting training samples, the algorithm selects better samples according to the expert data as much as possible and explores the part of the state space that is close to the expert's behaviour. In this way, many unnecessary explorations are avoided. On the other hand, the reward function has additional items related to the expert data: the closer to the expert behaviour, the greater the gain. Over time, the algorithm gradually degenerates into an ordinary DRL algorithm and continues self-learning. Fig. 2 shows the structure of the algorithm.
When we have some expert samples, we hope to use supervised learning to guide the training of DRL so that the agent can reach the level of experts faster and then continue to improve through self-learning. At the beginning, the algorithm explores the state space under the guidance of the expert data and learns from the expert's behaviour. After a certain period of time, the algorithm gradually degenerates into an ordinary DRL algorithm. However, it still retains a certain probability of obtaining expert samples, so the training pool always contains expert data. The algorithm is realised mainly through two aspects:
i. The collection of training samples should be as close as possible to the expert behaviour. Expert data are generally good data. Using information from the expert data, we can collect better training samples for exploration and reduce unnecessary failed exploration. Compared with DQN's ε-greedy method of obtaining training samples, this method does not collect samples randomly with a certain probability, but rather collects samples that are close to the expert's state-action space. If we mix the expert strategy with greedy selection based on Q-values, the effect is better than the ε-greedy method alone.
ii. When the behaviour is the same as the expert behaviour, the reward function receives an extra incentive. To reach the expert level and learn human behaviour faster, when the actions obtained by the algorithm are consistent with the expert behaviour, we let the reward function receive an additional reward. In many cases, it is difficult to obtain a very accurate and perfect reward function; sometimes only simple and fuzzy reward rules can be used to judge whether an action is good or bad through the success or failure of the result. The additional item gives the agent a short-term and clear reward signal. It is wise to learn from experienced experts first.
Of course, in order to continue learning and go beyond the experts, the proportion of this additional item will gradually decrease.
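A minimal sketch of this mixed action selection in Python (the function name and interface are illustrative, not the paper's implementation; α is the probability of following the expert):

```python
import random

def select_action(q_values, expert_action, alpha, epsilon, n_actions, rng=random):
    """Mix expert guidance with ordinary epsilon-greedy selection.

    With probability alpha, follow the expert network's action;
    otherwise fall back to epsilon-greedy selection on the Q-values.
    """
    if rng.random() < alpha:
        return expert_action                  # follow the expert demonstration policy
    if rng.random() < epsilon:
        return rng.randrange(n_actions)       # random exploration
    return max(range(n_actions), key=lambda a: q_values[a])  # greedy on Q
```

As α decays, this selection smoothly reverts to plain ε-greedy exploration.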
What we propose is actually an algorithm framework, in which the specific algorithms can be selected according to the actual situation. We use the DQN algorithm and a CNN to perform the experiments, and we examine the performance of the algorithm by learning to play the Mario kart racing game: the computer uses the input image to learn how to drive the car along the track. In this paper, we use expert data to train a simple end-to-end neural network as an expert network, and then use the expert network to speed up the convergence of DQN by influencing the training sample generation and the reward function.
An MDP can be represented by the tuple (S, A, P, R), where S denotes the state, A denotes the action, and P(S_{t+1} | S_t, a) represents the state transition probability, i.e. the probability of reaching S_{t+1} from S_t after taking action a. R represents the reward from the environment. DQN uses a network to estimate the value of each action; by choosing actions through the Q network, we can obtain the strategy that collects the most reward. In Q-learning, Q values are updated as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (1)

where η is the learning rate and γ the discount factor. We can use a neural network with parameters θ to approximate the Q value function. The DQN training loss function is

L(θ) = E[ ( r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) )² ]    (2)

As mentioned before, we combine supervised learning and DRL with expert data mainly through two aspects: the training sample exploration method and the form of the reward function.
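Update rule (1) and loss (2) can be sketched in plain Python for the tabular case (illustrative only; in the paper, Q is approximated by a CNN and the loss is minimised by gradient descent):

```python
def q_learning_update(Q, s, a, r, s_next, eta=0.1, gamma=0.99):
    """Tabular form of update (1): move Q[s][a] toward the TD target."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += eta * (td_target - Q[s][a])
    return Q

def dqn_loss(q_pred, q_next, rewards, gamma=0.99):
    """Mean squared TD error of (2) over a batch.

    q_pred:  predicted Q(s, a) for each sampled transition
    q_next:  list of next-state Q-value vectors (max gives the bootstrap target)
    """
    targets = [r + gamma * max(qn) for r, qn in zip(rewards, q_next)]
    errors = [(t - p) ** 2 for t, p in zip(targets, q_pred)]
    return sum(errors) / len(errors)
```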
First of all, we need an expert evaluation system, that is, a system that can output expert action probabilities based on the input. Given the input S, the expert system selects the action most likely to be chosen by a human, using supervised learning. The training of this expert system is completed with the sampled expert data. In this paper, we train a simple neural network as the expert prediction system. The network structure does not need to be very complicated; it only needs to output a relatively accurate action label based on the expert data. We use the NVidia CNN architecture [7].
The network structure is as follows (Fig. 3): the expert network contains nine layers, including a normalisation layer, five convolutional layers and three fully connected layers. The first three convolutional layers use 5 × 5 kernels with a 2 × 2 stride, and the last two convolutional layers use 3 × 3 kernels without striding. The output of this CNN is a list of action probabilities, i.e. the likelihood of each action. We use this network as the source of expert behaviour. When exploring the state space and putting training samples into the experience pool, the system uses the expert network to select the action with a certain probability, and this probability continues to decrease over time. However, a relatively small probability of using the expert action is maintained in the end, so that expert data are kept in the pool. Therefore, when we train DQN, the agent's action selection combines the expert network selection method and the ε-greedy method of ordinary DQN, and the selection is governed by an attenuating probability coefficient α. In addition, because the expert data are not necessarily complete, whether to execute the output action of the expert network is determined by its output probability, i.e. the credibility of the output action. If the probability of the expert network's output action is very small, in other words the expert network is not sure how to act, we select the optimal action through the Q network instead.
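The credibility check described above might look like the following sketch (the 0.5 threshold is an assumed value, not specified by the paper):

```python
def gate_expert_action(expert_probs, threshold=0.5):
    """Return the expert action only when the expert net is confident.

    If the highest action probability is below the threshold, return None
    so the caller falls back to the Q network's greedy action instead.
    """
    best = max(range(len(expert_probs)), key=lambda a: expert_probs[a])
    return best if expert_probs[best] >= threshold else None
```

A caller would treat a `None` result as "expert unsure" and query the Q network.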
For the reward function, we add an additional compensation term to speed up learning from the experts. If the selected action is the same as the action chosen by the expert network, the agent receives a gain. The gain is also multiplied by the attenuation coefficient, so after a period of time the reward degrades to its normal value:

R = R_n + α · flagSame · R_g

where R_n denotes the normal reward r and α denotes the attenuation coefficient. When the finally selected action is the same as the expert action, flagSame = 1; otherwise it equals 0. R_g denotes the gain added to r, which can be set by the user. For most learning tasks, it is difficult to define the quality of each action accurately. In such situations, it is difficult for an agent to learn the optimal strategy quickly and easily, and a lot of exploration is usually required. Therefore, we believe that learning from experts will speed up the learning of DRL.
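The shaped reward follows directly from the formula above; a minimal sketch (the default gain of 4 follows the value used in the experimental setup):

```python
def shaped_reward(r_normal, flag_same, alpha, r_gain=4.0):
    """Shaped reward R = R_n + alpha * flagSame * R_g.

    r_normal:  the environment's normal reward R_n
    flag_same: 1 if the chosen action matches the expert action, else 0
    alpha:     the attenuation coefficient (decays over training)
    r_gain:    the expert-agreement bonus R_g
    """
    return r_normal + alpha * flag_same * r_gain
```

As α decays, the bonus vanishes and R reverts to the normal reward R_n.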

Detailed steps:
1. For each episode, we get the observation S and calculate the expert action A_e: we feed the image inputs into the end-to-end expert net to obtain the expert action (Fig. 4).
2. We explore the state space and sample the training data. We use the expert net to guide this process by changing the exploration method and the reward function. Then, to reduce sample correlation, we put the training data into the experience pool (Fig. 5).
3. We train the deep Q net. We sample a mini-batch of training data from the experience pool D and use it to train the net, with the loss function given by (2). To increase the stability of training, we use two networks: the present Q network and a target Q network, which is updated from the Q network at fixed intervals. The loss function then becomes

L(θ) = E[ ( r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ]

where θ⁻ denotes the parameters of the target network. We use the Adam optimiser to minimise this loss.
4. Finally, we decay the probability α from 0.6 to 0.1. At the start, there is a high probability of using the expert net to obtain the action when collecting training samples, and when the agent takes the expert action, the reward function receives an additional gain. As time goes by, this probability shrinks until it decays to 0.1. We decay α so that the algorithm changes from learning from the expert to normal DQN, and we keep α ≥ 0.1 so that the experience pool always contains expert data.
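The four steps above can be sketched as one training loop. This is a schematic under assumed interfaces for `env`, `q_net` and `expert_net` (hypothetical names, not the authors' code); the gain of 4.0 follows the experimental setup:

```python
import random
from collections import deque

def train_dirl(env, q_net, expert_net, episodes=10, alpha=0.6,
               alpha_min=0.1, alpha_decay=0.995, batch_size=32,
               pool_size=10000, rng=random):
    """Skeleton of steps 1-4: explore with expert guidance, store transitions
    in the experience pool, sample mini-batches, and decay alpha."""
    pool = deque(maxlen=pool_size)                       # experience pool D
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            expert_a = expert_net(s)                     # step 1: expert action A_e
            a = expert_a if rng.random() < alpha else q_net.greedy(s)
            s_next, r, done = env.step(a)
            if a == expert_a:                            # step 2: reward bonus
                r += alpha * 4.0
            pool.append((s, a, r, s_next, done))
            s = s_next
            if len(pool) >= batch_size:                  # step 3: train the Q net
                batch = rng.sample(list(pool), batch_size)
                q_net.train_step(batch)
        alpha = max(alpha_min, alpha * alpha_decay)      # step 4: decay alpha
    return alpha
```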
To evaluate the performance of the algorithm, we record the result (the number of checkpoints obtained) at regular intervals.
The detailed algorithm description can be found in Fig. 6.

Mario racing game environment
We registered the Mario racing game in the gym framework and learned the game through the gym interface (Fig. 7). The reward function: in this game, the status information we can obtain is very limited. We can get the position of the car in a mini-map, so the agent can receive a reward after arriving at specific locations (checkpoints). We divide the track roughly into 12 segments, and the agent receives a reward after passing through each segment; it also receives a reward after finishing a lap. When the car crashes, which is judged simply from the image information, the agent receives a negative reward. In addition, a small negative reward is given at every normal step on the track, so that the car does not linger and runs as fast as possible. The reward function is shown in Table 1.
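A sketch of such a reward rule (the structure follows the description above, but all numeric values are placeholders, not the values of Table 1):

```python
def mario_reward(passed_checkpoint, finished_lap, crashed,
                 r_checkpoint=1.0, r_lap=5.0, r_crash=-2.0, r_step=-0.1):
    """Per-step reward combining the rules described above.

    A small time penalty is applied every step; passing a checkpoint or
    finishing a lap adds a bonus, and crashing adds a negative reward.
    """
    r = r_step                      # small time penalty on every normal step
    if passed_checkpoint:
        r += r_checkpoint
    if finished_lap:
        r += r_lap
    if crashed:
        r += r_crash
    return r
```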

Experimental setup
First, we played the Mario racing game five times and recorded the images and the corresponding human actions as the expert data. The expert data are not fully sufficient or perfect, but they are enough to train a simple deep neural network with acceptable performance. Then, we used the expert data to train the expert network. The final training accuracy of the network is above 80%. The input of this network is an image and the output is the probability of each action.
After that, we sample the training data for DRL. When the agent explores the state-action space, it chooses the expert action with probability α instead of using the ε-greedy method. The probability α decays from 0.6 to 0.1. The reward function also receives an additional gain when the agent's action is the same as the expert network's; we finally set this gain item to 4. Finally, we sample the training data from the experience pool and train the Q net. We performed contrast experiments to evaluate our algorithm and the influence of each of its components. We also used DQN and the expert network (an end-to-end deep neural network) to learn to play the game. To see how the exploration method and the reward function each improve the DRL algorithm, we ran an experiment using DIRL without the additional reward item and compared it with full DIRL and DQN. In addition, we tested different decay schedules for α to evaluate their influence and to choose a better parameter.
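One possible decay schedule for α, assuming a linear decay over a fixed number of steps (the paper does not specify the exact form of the schedule, only the endpoints 0.6 and 0.1):

```python
def alpha_schedule(step, alpha0=0.6, alpha_min=0.1, decay_steps=10000):
    """Linearly decay the expert-guidance probability from alpha0 to alpha_min.

    After decay_steps, alpha stays clamped at alpha_min so that the
    experience pool always keeps receiving some expert samples.
    """
    frac = min(step / decay_steps, 1.0)
    return alpha0 + (alpha_min - alpha0) * frac
```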

Results
This paper uses the Mario racing game as the experimental platform. The most intuitive evaluation standard is how far the car can run after learning: the farther it runs, the better the learning result. We use the number of checkpoints the agent can reach as the evaluation standard.
To check the quality of training, we evaluate the trained network in four episodes every 2000 time steps and average the results of the four runs as one assessment. Each lap has 12 checkpoints; when the agent reaches 12 checkpoints or more, we consider the result relatively good. The middle section of the track is an underground passage, which is generally difficult to pass, so agents typically only reach about 6 checkpoints. The blue dots show the results of the proposed DIRL algorithm with the changed reward function and exploration method. The red dots show the results of the expert network, i.e. the end-to-end deep neural network, and the green dots show the results of normal DQN. Because of the underground passage, DQN and the expert network find it difficult to exceed six checkpoints. The expert network, trained on the expert data, reliably obtains more than five checkpoints and can regularly drive through the underground passage; its performance is almost stable at about six. The DQN algorithm is less stable, because the ambiguity of the reward function in the underground passage and the visual similarity of the inner wall and road surface make it difficult to pass the passage steadily; DQN also needs a long time to train. In contrast, the DIRL algorithm almost reaches the level of the expert after about 300-400 evaluations, and subsequently its performance is significantly better than both end-to-end supervised learning and DQN, thanks to self-learning. We also counted, for our algorithm and DQN, the number of runs with more than 12 and more than 20 checkpoints, the first time 12 checkpoints were reached, and the maximum number of checkpoints each algorithm reached. With the expert network's guidance, the learning speed of the DRL algorithm is significantly improved (Table 2).

Reward bonus experiment:
We conducted a comparative experiment on the reward function. After removing the bonus gain item, we ran an experiment and compared it with DQN and with DIRL including the gain item. Comparing DIRL with DIRL without the reward bonus shows the influence of the bonus; comparing DIRL without the reward bonus with DQN shows the influence of the exploration method.
It can be seen that when the bonus exists, the algorithm learns the expert's behaviour more quickly: the results are better than DIRL without the gain bonus in both the early and late stages. At the same time, comparing DIRL without the bonus gain with DQN, we find that improving the exploration method alone already achieves a very good performance improvement. According to these comparisons, changing the exploration method and the reward function both significantly improve the performance of the algorithm.

Decay coefficient experiment:
To observe the influence of the decay coefficient, we set up a comparative experiment. The decay coefficient decides the speed of degeneration from DIRL to the original DQN, and the amount of expert data in the training pool (Fig. 9).
We tested three settings: a decay coefficient from 1 to 0, a decay coefficient from 0.6 to 0.1, and a constant coefficient of 0.1. The experimental results are shown in Fig. 10.
We can see that when α decays from 1 to 0, the result (the number of checkpoints) is more variable. When α is kept constant at 0.1, the final result is quite good, even better than that of α decaying from 0.6 to 0.1, but its early learning speed is slower. Finally, we choose α decaying from 0.6 to 0.1, which is steadier and better overall.

Conclusion
To speed up the learning of DRL and improve learning efficiency, we propose to combine supervised learning and DRL when some expert demonstration data are available. The idea is to learn from expert behaviours by conducting effective explorations and changing the reward function. After reaching the level of the experts, the learning agent continues to improve through reinforcement learning and eventually exceeds the experts.
Based on the above idea, this paper proposed a DIRL algorithm with expert demonstration data. At the beginning, we sampled some expert data by recording human control strategies and used them to train a convolutional neural network that provides human-like decisions. Then, we collected the training samples with the expert net with a certain probability, to imitate human behaviours. This probability decreases gradually so that the algorithm gradually degenerates into normal DRL; it is worth mentioning that expert samples are always kept in the training pool. On the other hand, when the behaviour of the agent is consistent with the behaviour of the expert, the reward function becomes larger, prompting the agent to learn from the expert. Finally, we conducted experiments and compared the results with the DQN algorithm, and we also varied the parameters for comparative experiments. We found that our DIRL algorithm can significantly improve DRL performance.