Prioritised experience replay based on sample optimisation

Abstract: The prioritised experience replay based on sample optimisation proposed in this study addresses how samples are selected into the experience replay buffer, which improves training speed and increases the reward return. In the traditional deep Q-network (DQN), samples enter the experience replay buffer at random; however, each sample contributes differently to the training of the agent, and a better sampling method makes training more effective. Therefore, when selecting samples for the experience replay buffer, the authors first let the agent learn randomly through a sample optimisation network and take the average of the returns obtained over these runs, using this mean value as a threshold for admitting samples into the buffer. Second, on the basis of sample optimisation, the authors add a priority update and use the idea of reward-shaping to give additional reward values to the returns of certain samples, which speeds up agent training. Using OpenAI Gym as the experimental platform, this study improves agent learning efficiency compared with the traditional DQN and the prioritised experience replay DQN.


Introduction
Today, as the concept of artificial intelligence becomes more and more popular, the algorithm models of deep reinforcement learning, a research hotspot of artificial intelligence, have become increasingly mature. Beginning with the deep Q-network (DQN) algorithm proposed by Mnih et al. [1], deep reinforcement learning has been shown to match and even surpass human players on complex, near-realistic problems such as Atari 2600 games. In addition, deep reinforcement learning has achieved significant success in various tasks, such as robotic autonomous navigation and path planning (Kober and Peters [2], Mirowski et al. [3], Cui et al. [4]), unmanned technology [5], natural language processing [6, 7] and so on. In 2016, the appearance of AlphaGo [8] caused an uproar in the field of Go, defeating the Korean Go player Lee Sedol with a total record of 4:1. The improved AlphaGo Master then also defeated the Chinese Go player Ke Jie. At this point, deep reinforcement learning entered the public's field of vision, and it has become increasingly important in a wide range of practical applications.
Unfortunately, many challenges still prevent reinforcement learning from being applied more widely in practice, such as the trade-off between exploration and exploitation and the difficulty of learning in complex environments. Among these, one of the most important problems is the low sample efficiency of current deep reinforcement learning algorithms. For example, DQN requires hundreds of millions of interactions with the environment to learn a good strategy, which demands extremely long training times.
For traditional deep reinforcement learning models, excessive training time is often an obstacle to effective learning. In 2013, DeepMind first proposed the deep reinforcement learning algorithm DQN [9] in the Atari video game experiments. In 2016, with the proposal of the double DQN [10], Dueling DQN [11] and prioritised experience replay [12] algorithms, the training speed and reward of the agent gradually increased. Double DQN optimises the objective function of DQN so that the value function used to select the action differs from the value function used to evaluate it. This reduces overestimation, avoids the generation of sub-optimal strategies, and greatly improves the performance of DQN. Dueling DQN divides the action-value function into two parts: the state-value function and the advantage function. During training, when redundant or similar actions occur, this separation lets the agent quickly learn the value of the new actions without relearning the value of the state; training speed is thereby further improved and training performance is better. In the prioritised experience replay work, building on these improvements, the samples in the experience replay buffer are prioritised, which makes samples with larger TD error more likely to be selected for training, shortening training time and improving learning efficiency. In addition, Osband et al. [13] proposed a better exploration strategy based on DQN, and O'Donoghue et al. [14] proposed Q-learning based on the policy gradient. These works have further improved the efficiency of agent learning.
The algorithm structures described above accelerate the training of the agent and improve the learning effect. However, for complex models and learning environments, when the capacity of the experience replay buffer reaches 100,000 samples or more, the training time of the agent is still long.
In 2016, the prioritised experience replay of Google DeepMind posed two questions about the use of the experience replay buffer: first, which experiences to store; second, which experiences to replay. Prioritised experience replay mainly solves the second problem, i.e. how to prioritise the sample trajectories stored in the experience replay buffer. By prioritising on the temporal-difference (TD) error, the probability of selecting samples that are more effective for the agent's learning is increased and important experiences are replayed more often, so that training is more effective and a better return is obtained. This paper proposes prioritised experience replay based on sample optimisation, which focuses on the first question: which experiences are selected for storage. Samples are screened before being put into the experience replay buffer, so that the experiences that are more effective and important for training are stored for later replay. Since the samples in the buffer have been selected, the subsequent training process is more efficient. Secondly, this paper also updates the priority of the experience samples. In prioritised experience replay, the priority of a sample is not updated in real time; in this paper, whenever a transition is sampled, its priority and TD error are updated, so that during training the priority of a sample differs across environmental states. This helps the agent to train more realistically and improves training efficiency. In addition, drawing on the idea of reward-shaping [15], an additional reward is given to the last few state samples before the final state, so that the agent can learn the final state more quickly during training. Fig. 1 shows the structure of prioritised experience replay based on sample optimisation. In the traditional experience replay buffer, samples are drawn at random and there is no effective selection of which samples enter the buffer. This paper proposes that before a sample enters the experience replay buffer, a training process is performed in advance, and a reward threshold is obtained from this process: when the return of a sample is greater than this threshold, the sample is placed in the buffer; when it is smaller, the sample is discarded.
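The admission rule described above can be sketched as follows. This is a minimal illustration: the episode representation and function names are chosen for clarity here and are not taken from the paper.

```python
def collect_with_threshold(episodes, threshold):
    """Keep only trajectories whose total return exceeds the threshold.

    `episodes` is a list of (trajectory, total_return) pairs.
    """
    buffer = []
    for trajectory, total_return in episodes:
        if total_return > threshold:
            buffer.append(trajectory)  # admitted to the experience replay buffer
        # otherwise the trajectory is discarded
    return buffer
```

For example, with a threshold of 2.5, episodes whose return is 1.0 are discarded while those returning 5.0 or 3.0 are stored.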

Network structure diagram
In the prioritised experience replay work, a priority-based ranking of the samples in the experience replay buffer is proposed: the larger the absolute value of a sample's TD error, the higher its priority and the larger the probability that it will be selected. In this section, we add updates to the TD-error-based priorities and reward-shaping of some samples. When a sample is selected, its TD error and priority are updated at the same time, so that the samples that are more effective for training the agent are more likely to be selected, and agent training is accelerated.

Prioritised experience replay based on sample optimisation
2.2.1 Prioritised experience replay: This paper adopts the double DQN network structure. Different from the traditional single DQN network, it not only uses a convolutional neural network to approximate the action-value function, but also uses a target Q network to compute the update target. In the loss function, the max operation is no longer used directly to obtain the optimal value: the current network selects the action of the optimal strategy, and the target network evaluates it. Using different parameters to select and to evaluate the action, in contrast to the single max-Q operation in DQN, avoids the overly optimistic value estimates caused by DQN's overestimation. The loss function of this network is

L(ω) = (r + γ Q(s′, arg max_{a′} Q(s′, a′; ω); ω′) − Q(s, a; ω))²   (1)

By calculating the gradient of the loss function with respect to ω, we obtain the update

∇_ω L(ω) = −2 (r + γ Q(s′, arg max_{a′} Q(s′, a′; ω); ω′) − Q(s, a; ω)) ∇_ω Q(s, a; ω)   (2)

where the parameter ω, used to select the action, comes from the current Q network, and ω′, used to evaluate the action, comes from the target Q network: the current Q network selects the action, and the target Q network calculates its Q value.
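As a concrete illustration of (1), the sketch below computes the double-DQN target and the squared TD error for a single transition. The terminal-state handling (no bootstrap when `done` is true) is a standard convention added here; it is not spelled out in the text.

```python
import numpy as np

def double_dqn_target(reward, q_next_online, q_next_target, gamma, done=False):
    """Double DQN target: the online network (ω) selects arg max a′,
    the target network (ω′) evaluates the selected action."""
    a_star = int(np.argmax(q_next_online))   # selection with ω
    bootstrap = q_next_target[a_star]        # evaluation with ω′
    return reward if done else reward + gamma * bootstrap

def double_dqn_loss(q_sa, target):
    """Squared TD error of equation (1) for one transition."""
    return (target - q_sa) ** 2
```

Note that the action maximising the online network's estimate (index 1 above, say) may have a low value under the target network, which is exactly what curbs the overestimation.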
The main function of the experience replay in the double DQN network is to solve the non-stationary distribution problem. The specific method is to store the sample obtained at each time step of the agent's interaction with the environment into the experience replay buffer, and to randomly draw small batches of samples during training, which makes training more efficient.
The use of experience replay enables the agent to remember and reuse past experience. In the traditional DQN structure, past experience is drawn by uniform sampling, which gives every experience in the buffer the same probability of selection and ignores the differing importance of experiences. Therefore, prioritised experience replay uses the TD error to mark the importance of each experience, so that important experiences are sampled with greater probability, which greatly increases training efficiency.
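For contrast, the uniform replay described above can be sketched as a minimal buffer in which every stored transition is equally likely to be drawn; class and method names here are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay: the behaviour the paper improves on."""

    def __init__(self, capacity):
        # A bounded deque: when full, the oldest transition drops out.
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling: every stored transition has equal probability.
        return random.sample(list(self.buffer), batch_size)
```

The bounded deque also mirrors how older transitions are overwritten once the buffer reaches capacity.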
In order to ensure sampling efficiency, a stochastic sampling method is introduced that lies between the pure greedy prioritisation algorithm and the original uniform random sampling algorithm. It avoids both the problem of uniform sampling, in which every sample is drawn with equal probability, and the problem of greedy prioritisation, in which a transition first seen with a small TD error may not be sampled again for a very long time; it also mitigates over-fitting. The stochastic sampling method ensures that the probability of drawing a transition is monotonic in its priority, while guaranteeing a non-zero probability of extraction even for low-priority transitions. The probability of sampling the transition labelled i is defined as

P(i) = p_i^α / Σ_k p_k^α   (3)

where p_i is the priority of the transition labelled i, all p_i are greater than zero, and α determines how strongly the priority influences sampling. If α = 0, the priority is not used and sampling reduces to uniform random sampling.
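Formula (3) can be computed directly over a vector of priorities; a minimal sketch:

```python
import numpy as np

def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^α / Σ_k p_k^α, formula (3).

    alpha = 0 recovers uniform random sampling; larger alpha
    concentrates probability on high-priority transitions.
    """
    p = np.asarray(priorities, dtype=float) ** alpha
    return p / p.sum()
```

With α = 0 every transition receives probability 1/N; with α = 1 the probabilities are directly proportional to the priorities.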
In this paper, in order to make the ranking more robust, let p_i = 1/rank(i), where rank(i) is the rank of transition i when the buffer is sorted by the magnitude of δ_i. With this definition, the probability in (3) is monotonic in δ_i and, because only the ordering matters, it is not very sensitive to outliers, which enhances the stability of the system.
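The rank-based priority p_i = 1/rank(i) can be sketched as follows, with rank 1 assigned to the largest |δ_i|:

```python
import numpy as np

def rank_based_priorities(td_errors):
    """p_i = 1 / rank(i), where rank 1 is the transition with the largest |δ_i|."""
    order = np.argsort(-np.abs(np.asarray(td_errors, dtype=float)))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)  # rank of each transition
    return 1.0 / ranks
```

Because the priorities depend only on the ordering of the TD errors, doubling an already-largest |δ_i| leaves every priority unchanged, which is the outlier-insensitivity noted above.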

Sample optimisation:
The proposed algorithm uses the double DQN network to randomly select samples for pre-training before any sample enters the experience replay buffer, and records the corresponding returns. The average of the final return values of these pre-training samples is then used as the criterion for deciding whether a sample is placed into the buffer: samples with better return values are stored, and training then proceeds with prioritised experience replay. In this way, the final return of the algorithm is improved and the algorithm is optimised.
Compared with the original prioritised experience replay, this algorithm strengthens the selection of samples entering the experience replay buffer. Before samples are selected, the sample optimisation network uses the natural DQN algorithm to randomly select samples for training. The return value of each training run is the discounted sum of rewards

R_i = Σ_{t=0}^{T} γ^t r_t

In this paper, the mean-square deviation of the n sample returns is taken as the threshold for admitting samples into the experience replay buffer:

R_MSD = √( (1/n) Σ_{i=1}^{n} (R_i − R̄)² )

where R̄ is the average of the return values of the n samples. If the final return value of a sample is greater than R_MSD, the sample is placed in the experience replay buffer; if it is less than R_MSD, the sample is discarded.
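The exact formula for R_MSD is not reproduced in the extracted text; the sketch below assumes, from the surrounding description, that it is the root-mean-square deviation of the pre-training returns around their mean R̄:

```python
import math

def msd_threshold(returns):
    """Assumed reading of R_MSD: root-mean-square deviation of the
    pre-training returns around their mean R̄."""
    n = len(returns)
    r_bar = sum(returns) / n                                # R̄, the mean return
    return math.sqrt(sum((r - r_bar) ** 2 for r in returns) / n)
```

A sample would then be admitted to the buffer when its final return exceeds this value.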
The set of all samples entering the experience replay buffer is defined as

S_SAMPLE = { M_i ∣ R_i > R_MSD }

where S_SAMPLE represents all samples entering the experience replay buffer and M_i is a sample trajectory whose final return R_i exceeds R_MSD. However, in training the agent it is also necessary to explore situations where current performance is poor, so some samples with poor current returns should also be put into the buffer; an improvement in the style of the ξ-greedy strategy is therefore added when selecting samples. The improved sample set in the experience replay buffer is defined as

S′_SAMPLE = S_SAMPLE ∪ { M_j }, if b < ξ

where M_j is a randomly selected sample trajectory and b is a random number in the (0, 1) interval.
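The ξ-greedy admission rule can be sketched per trajectory as follows; the default value of ξ is illustrative, not taken from the paper:

```python
import random

def admit_sample(total_return, threshold, xi=0.1):
    """Admit a trajectory if its return beats the R_MSD threshold, or,
    with small probability ξ, admit a poorly performing one anyway
    so that the agent still explores bad situations."""
    if total_return > threshold:
        return True                  # good sample: always admitted
    return random.random() < xi      # poor sample: admitted with probability ξ
```

Setting ξ = 0 recovers the pure threshold rule; ξ = 1 admits everything.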

Priority update:
The samples are prioritised according to the TD error at the time of storage: each time a sample is stored, a priority based on the absolute value of its TD error is assigned, and the data pair is placed in the experience replay buffer. Once a sample has entered the buffer, its priority is fixed and does not change with the training process. However, as training deepens, the actual priority of each sample changes because the current TD error changes, so its probability of being selected should change as well. Therefore, this paper proposes a tracking update of the priority: whenever a sample is drawn, its corresponding priority is also updated, by comparing the current TD error with the average TD error of the previously stored samples. The average TD error is defined as

δ̄ = (1/n) Σ_{i=1}^{n} |δ_i|

and the sample priority and average TD error are updated as

p_i′ = λ p_i + (1 − λ)|δ_i|,   δ̄′ = γ δ̄ + (1 − γ)|δ_i|

where p_i′ is the updated sample priority, δ̄′ is the updated average TD error, and λ, γ are two positive parameters in the (0, 1) interval. The idea of this update is that a high TD error should increase the probability that the transition will be selected again, while a low TD error should reduce that probability.
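The precise update formula is not recoverable from the extracted text; the sketch below assumes an exponential blending of the old priority and running mean with the newly observed |δ_i|, consistent with the stated roles of λ and γ:

```python
def update_priority(p_i, td_error, td_mean, lam=0.5, gam=0.5):
    """Assumed tracking update: blend the old priority with the new |δ_i|
    (a high |δ_i| raises the priority, a low one lowers it), and track
    the running mean TD error the same way. lam, gam ∈ (0, 1)."""
    abs_td = abs(td_error)
    p_new = lam * p_i + (1.0 - lam) * abs_td            # updated priority p_i′
    td_mean_new = gam * td_mean + (1.0 - gam) * abs_td  # updated mean δ̄′
    return p_new, td_mean_new
```

With this form, a transition whose new |δ_i| exceeds its old priority sees its selection probability rise on the next draw, matching the stated intent.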
In order to raise the probability that high-priority samples are selected, this paper proposes a variant of the sample-selection probability and modifies formula (3) to

P(i) = p_i^{1/α} / Σ_k p_k^{1/α}

where α is still a positive parameter. When α is sufficiently large, the picking probability of every sample approaches a uniform random selection, i.e. the prioritisation is effectively not used.
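The exact modified formula is not shown in the extracted text; the sketch below uses the 1/α exponent, which is the reading consistent with the stated limit behaviour (large α flattens the distribution toward uniform):

```python
import numpy as np

def modified_probabilities(priorities, alpha):
    """Assumed variant of (3) with exponent 1/α: α = 1 matches the
    proportional rule, and a sufficiently large α approaches uniform
    random selection."""
    p = np.asarray(priorities, dtype=float) ** (1.0 / alpha)
    return p / p.sum()
```

Note the limit behaviour: even a 100:1 priority ratio yields nearly equal probabilities once α is very large.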
In this algorithm, we not only preserve the priority of the sample in the prioritised experience replay, but also improve the training efficiency through sample selection and priority update. Sample optimisation DQN with prioritised experience replay algorithm is shown in Fig. 2.

Partial reward-shaping:
In most deep reinforcement learning models, the external rewards an agent obtains from the environment are sparse: a non-zero return is received only when the final goal is reached, and the returns of other states are usually zero. Therefore, in order to speed up the training of the agent, this article adds an internal reward to the returns of the n states preceding the final goal, increasing their total return.
For the Q value of each step in training, the Bellman equation gives

Q^π(s, a) = E_π[ r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} ∣ s_t = s, a_t = a ] = Σ_{s′} P_{sa}(s′)[ r(s, a, s′) + γ Q_M^π(s′, a′) ]

After adding an internal reward, the Q value of each step becomes

Q^π(s, a) = Σ_{s′} P_{sa}(s′)[ r′(s, a, s′) + γ Q_M^π(s′, a′) ],   r′(s, a, s′) = r(s, a, s′) + F(s, a, s′)

where r′(s, a, s′) represents the total reward, r(s, a, s′) the external reward, and F(s, a, s′) the internal reward. In this article, the internal reward for the states in the n steps before the final goal is defined as

F(s, a, s′, n) = κ^n r_t   (15)

where κ is a constant and r_t is the internal return value of the final state. By increasing the internal reward, this algorithm increases the sampling probability of the states in the n steps before the final target, so that the agent learns the final goal faster. This method not only shortens the time for the agent to learn the final state, but also improves training efficiency. The algorithm of partial reward-shaping is shown in Fig. 3.
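The shaping of (15) can be sketched over the reward sequence of one episode. The reading assumed here is that the state k steps before the terminal state receives the internal reward κ^k · r_t, where r_t is the terminal reward; the text does not pin this interpretation down exactly.

```python
def shape_rewards(rewards, n, kappa):
    """Add the internal reward F = κ^k · r_t to the state k steps before
    the terminal state, for the last n steps of the episode (assumed
    reading of equation (15))."""
    shaped = list(rewards)
    r_t = rewards[-1]                            # reward of the final state
    for k in range(1, min(n, len(rewards) - 1) + 1):
        shaped[-1 - k] += (kappa ** k) * r_t     # internal reward κ^k · r_t
    return shaped
```

The geometric decay κ^k means states closest to the goal receive the largest internal reward, which is what steers the agent toward the final state.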

Results
We use Gym and Atari [16] as environment platforms for comparison experiments to demonstrate the advantage of the proposed prioritised experience replay based on sample optimisation. MountainCar-v0 and Acrobot-v1 in Gym and Breakout-v0 and Riverraid-v0 in Atari are selected as experimental environments, and the traditional DQN, the prioritised experience replay DQN, and the prioritised experience replay based on sample optimisation are compared. The good performance of the proposed algorithm is demonstrated by comparing total returns.
For the two Gym environments Acrobot-v1 and MountainCar-v0 in Fig. 4, the experimental setup is as follows: the training network is a three-layer fully connected network with 64 neurons per layer. The numbers of actions in the two environments are 2 and 3, which determine the output dimension of the network for the respective environment. The activation functions of the three layers are relu, relu, and softmax. In the prioritised experience replay based on sample optimisation, the other parameters and the experimental settings are consistent with those of the prioritised experience replay.
For the Riverraid-v0 and Breakout-v0 environments in Atari shown in Fig. 4, the experimental setup is as follows: the training network has five layers, the first three of which are convolutional. The first convolutional layer has an 8 × 8 kernel, 4 channels, and stride 4; the second a 4 × 4 kernel, 32 channels, and stride 2; the third a 3 × 3 kernel, 64 channels, and stride 1. The last two layers are fully connected: the first has 512 neurons, and the output layer has as many neurons as there are actions. The activation functions of the five layers are relu, relu, relu, relu, and softmax. In the prioritised experience replay based on sample optimisation, the other parameters and the experimental settings are consistent with those of the prioritised experience replay. Fig. 5 shows the return curves of the three algorithms in the four discrete action spaces. In Acrobot-v1 and MountainCar-v0, prioritised experience replay based on sample optimisation quickly learns the final goal and returns more than the original DQN; however, its return curve differs little from that of the prioritised experience replay DQN. The reason may be that Acrobot-v1 and MountainCar-v0 have relatively simple action spaces and few states. In Acrobot-v1, the agent obtains a reward only when the free end of the link swings above the black line; in MountainCar-v0, an effective return is obtained only when the car reaches the top of the mountain. In these two environments, therefore, the samples have few states and the reward is sparse.
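As a sanity check on the convolutional stack described above, the standard valid-convolution size formula can be applied layer by layer; the 84 × 84 input frame assumed here is the usual Atari preprocessing size and is not stated in the text.

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid convolution: ⌊(size − kernel)/stride⌋ + 1."""
    return (size - kernel) // stride + 1

# The three convolutional layers described in the text, applied to an
# assumed 84 × 84 input frame:
s = conv_out(84, 8, 4)   # first layer: 8 × 8 kernel, stride 4 → 20
s = conv_out(s, 4, 2)    # second layer: 4 × 4 kernel, stride 2 → 9
s = conv_out(s, 3, 1)    # third layer: 3 × 3 kernel, stride 1 → 7
```

The resulting 7 × 7 × 64 feature map is then flattened into the 512-neuron fully connected layer.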
However, since the prioritised experience replay based on sample optimisation mainly optimises the samples to improve the agent's reward, its return improves over the prioritised experience replay DQN, though not by much. In the two Atari game environments Riverraid-v0 and Breakout-v0, the prioritised experience replay based on sample optimisation outperforms both the traditional DQN and the prioritised experience replay DQN: its return curve is higher, and the final goal is learned faster. In Riverraid-v0, the agent gains score by controlling its own aircraft and killing or avoiding enemies; when hit by an enemy, the game ends. In Breakout-v0, the agent bounces the ball back and scores by hitting the blocks at the top; if the paddle controlled by the agent misses the ball five times, the game ends. A possible reason is that these two environments are more complicated and the samples have richer states. For example, in Riverraid-v0 the agent must not only avoid enemies but also hit them to score, and subsequent levels add terrain restrictions. Since these two environments are more complicated, agent learning takes longer; nevertheless, compared with the other two algorithms, the advantage of the proposed algorithm after sample optimisation is more obvious, and the reward is better.

Conclusion
By studying how to evaluate the sample priority for an agent, this paper selects higher-priority samples to enter the experience replay buffer for sampling and training, and analyses the impact of sample priority on agent training. A tracking update method for the samples is proposed in the priority update: the priority and TD error are updated each time a sample is drawn, and the probability formula for sample selection is optimised. In addition, drawing on the idea of reward-shaping, an additional internal reward is added to the n-step samples before the final state, which increases their sampling probability. On the Gym and Atari environment platforms, comparison with the traditional DQN and the prioritised experience replay DQN shows that the prioritised experience replay based on sample optimisation improves the algorithm's effect and accelerates the learning efficiency of the agent.

Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant 61976215 and Grant 61772532.