A centralised training algorithm with D3QN for scalable regular unmanned ground vehicle formation maintenance

The unmanned ground vehicle (UGV) has been widely used to accomplish various missions in civilian and military environments. Formation control of a UGV group is an important technique to support the broad application of multi-functional UGVs. This study proposes a scalable regular UGV formation maintenance (SRUFM) algorithm based on deep reinforcement learning (DRL), which uses a unified DRL framework to improve both the lateral and longitudinal control performance of UGVs in different formation situations. Based on a tailored design of the state and action spaces, the related inter-vehicle information, and the dueling double deep Q-network (D3QN), SRUFM outperforms the Nature deep Q-network and the double deep Q-network in exploration efficiency and convergence speed in the same CARLA training environments with a fixed-size formation. Furthermore, when the formation's scale is extended under similar initialisation conditions, SRUFM still achieves a nearly 90% success rate in all experimental formation maintenance missions after 4000 episodes of training. Each UGV in the formation can keep its distance within the upper and lower error threshold of 0.15 m. The simulation experiments show that the proposed centralised training framework with D3QN is suitable for solving scalable regular UGV formation maintenance missions.


INTRODUCTION
An unmanned ground vehicle (UGV) is a vehicle capable of autonomous driving that can replace humans in performing missions in different environments under autonomous or manually intervened modes [1]. Multi-UGV collaboration is an essential technique to support the broad application of UGVs [2]. The cooperation requires multiple UGVs to assist each other to achieve the best overall performance during the task. The challenge is to design an effective multi-agent framework for UGV formation control in various application environments. Many multi-agent frameworks have been proposed based on formation control algorithms, such as the virtual structure method [3], the behaviour-based method [4], the leader-follower method [5], the graph-based method [6], and the potential field method [7]. However, these methods require expert knowledge to adjust the agents' control parameters to meet the requirements of different working scenarios.
Recently, the deep reinforcement learning (DRL) method has advanced dramatically and shows great promise for solving a variety of problems in the UGV area, including dynamic control [8], path planning [9], and motion harmonisation [10]. Compared to classic autonomous driving systems with environment perception, path planning, and dynamics control modules [11], the DRL method combines deep neural networks with a reinforcement learning framework. This combination can help a UGV respond faster to more complex traffic scenarios by learning strategies from high-dimensional perceptual input through an end-to-end model [12]. Therefore, the application of DRL in the UGV field is attracting increasing research attention.
The formation control problem for multiple UGVs needs to consider the performance of both lateral and longitudinal control. Specifically, the formation maintenance mission can generally be divided into a lane-keeping mission in lateral control and a car-following mission in longitudinal control.
The lane-keeping mission is intended to improve path tracking performance and handling stability during the lane-keeping manoeuvre. The control methods can be classified into vehicle model-based methods [13][14][15][16][17] and vehicle model-free methods [18][19][20]. Among the vehicle model-based methods, Kim et al. proposed a torque overlay-based robust steering control technique that consisted of a non-linear damping controller [21]. Kang et al. proposed a novel multi-rate lane-keeping system using a kinematics-based model, in which a look-ahead output measurement matrix and a multi-rate Kalman filter were applied to increase the damping of the vehicle and resolve the asynchronous sampling times of multiple sensors [22]. Among the vehicle model-free methods, Feher et al. presented an application of Q-learning to the lane-keeping mission using track curvature, lateral position, and relative yaw angle as inputs; the lateral distance error could be controlled within −0.4 to 0.6 m after training [23]. Aradi et al. presented a policy gradient-based reinforcement learning method for vehicle lane-keeping in a simulated highway environment [24]. On the other hand, car-following models are mainly designed to control the vehicle's acceleration in platoon control. Traditional car-following models can be classified into stimulus-response models [25], safe distance models [26], psychophysical models [27], and desired-measure models [28]. Owing to the availability of high-fidelity traffic data, data-driven methods have been widely applied to car-following problems, as they incorporate additional parameters that influence agents' behaviour to train more complex models [29]. Hongfei et al. proposed a four-layer neural network to predict the follower's acceleration [30]. Khodayari et al. proposed a modified neural network approach with one hidden layer to simulate and predict car-following behaviour, taking the instantaneous reaction delay, relative speed, relative distance, and follower's speed as inputs [31].
Although previous works have made many contributions, several points still need improvement: (1) Most previous research focused on solving either the lateral or the longitudinal control problem individually. Therefore, to guarantee the UGV's dynamics control performance, additional longitudinal control strategies are needed for the lane-keeping mission, or separate lateral control strategies for the car-following mission. Obviously, such a separated control system vastly increases the complexity of the entire control system. (2) Most previous works are not efficient for the formation maintenance problem. For vehicle model-based methods, the more UGVs are considered, the more dynamics parameters need to be adjusted whenever the types of UGVs change. As for vehicle model-free methods, some multi-UGV reinforcement learning (RL) methods have been proposed, such as value decomposition networks [32] and the monotonic mixing network [33]. However, they need to train an individual network for each UGV, which significantly increases the training time.
To this end, this study attempts to design an efficient UGV formation maintenance training algorithm that considers lateral and longitudinal control at the same time. Scalable regular UGV formations, which can be divided into several similar UGV formation units according to the shape characteristics of the regular formation, are the main subject of this work. The study's main contributions are as follows: (1) Considering the complexity of a separated control system, a dueling double deep Q-network (D3QN) structure-based algorithm that comprehensively considers both lateral and longitudinal control is proposed. Its deep networks combine the inputs commonly used by model-free lane-keeping methods and data-driven car-following methods. Compared with other value-based RL methods (the Nature deep Q-network (Nature DQN) and double DQN (DDQN)), the algorithm designed to solve the scalable regular UGV formation maintenance (SRUFM) mission outperforms them in exploration efficiency and convergence speed. It also keeps a high success rate in finishing the experimental route.
(2) SRUFM offers an efficient way to solve large multi-UGV formation maintenance missions with less expert knowledge, computing power, and time cost. In the training part, only one smallest formation unit is needed, and each UGV in the unit is trained by the centralised SRUFM. After training, more UGVs with similar position information can be appended behind the unit to assemble a larger formation. Using the same action space, episode ending rules, and reward mechanism, SRUFM still achieves a high success rate (around 90%) in the larger formation maintenance mission for most formation types without additional training.
The remainder of this article is organised as follows. In Section 2, the experimental environment settings in the CARLA simulator are described. In Section 3, the SRUFM learning method for UGV formation maintenance problems is proposed. In Section 4, different training and testing formation maintenance simulation cases under SRUFM are carried out to compare with other DRL methods. In Section 5, conclusions and future works are presented.

EXPERIMENTAL ENVIRONMENT
The classic autonomous system of a UGV can be divided into three parts: environment perception and localisation, path planning, and dynamics control. The environment perception part analyses information about the surrounding environment and the UGV itself by processing data from lidar, camera, and other sensor components. The path planning part selects the optimal decision and generates the optimal path under careful consideration of the perception results. The dynamics control part tracks the theoretical optimal path and calculates the actual control signals of steering, drive force, and brake force to realise the movement of the UGV. In contrast, an end-to-end autonomous system integrates environment perception, path planning, and dynamics control into one learning framework. In this study, the CARLA simulator [34] is applied to train and test the performance of the SRUFM method; it can simulate the hardware configuration and autonomous system of a UGV. A section of straight road without other dynamic elements is chosen as the training route. A red-green-blue (RGB) camera, a collision sensor, and a lane invasion sensor are selected to provide perspective images, collision detection, and lane-line invasion detection, respectively.

Regular formation type
SRUFM is designed to solve the regular formation maintenance mission, in which the formation can be divided into several smallest training units. There are three regular formations, and their smallest training units (inside the dashed box) are shown in Figure 1. In the training part, one of the smallest units is selected as the training object: one UGV acts as the leader (L) and the other UGV as the follower (F). In the testing part, more UGVs can be added as followers to achieve a larger-scale formation.

Action space for the aim
According to the formation maintenance mission requirements, four actions are designed to form the action space shown in Table 1.

Episode ending rules
Four rules are designed to judge whether the current episode should end, as shown in Table 2. If one of the UGVs collides or drives out of the lane, or if both UGVs reach the end, the end of the episode is triggered. Additionally, to prevent the simulation from falling into a dead zone, the episode also ends once its duration exceeds the maximum time. Each time the current episode ends, the CARLA simulator reinitialises the UGVs' positions and starts a new episode.
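As a minimal sketch, the four ending rules can be combined into a single check (the 25 s timeout value is taken from the experiment settings later in the paper; the flag names are illustrative):

```python
def episode_done(collided, lane_invaded, both_reached_end, elapsed_s, max_time_s=25.0):
    """An episode ends when any UGV collides, any UGV leaves the lane,
    both UGVs reach the end of the route, or the duration exceeds the
    maximum time (to avoid dead-zone episodes)."""
    return collided or lane_invaded or both_reached_end or elapsed_s > max_time_s

# e.g. a collision at any time ends the episode immediately
done = episode_done(collided=True, lane_invaded=False,
                    both_reached_end=False, elapsed_s=3.2)
```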

Reward mechanism
The setting of the reward mechanism plays a vital role in the performance of the SRUFM method. After each step, the UGVs' rewards are calculated according to the status of each UGV. The reward mechanism differs slightly between the leader and the follower because each UGV has a different additional mission. The rewards for the aim (r_aim), expected velocity (r_ev), current velocity (r_cv), and leader reachability (r_lr) are designed for the leader. The rewards for the expected distance (r_ed) and follower reachability (r_fr) are designed for the follower. The reward for an accident (r_ac) is shared by both UGVs: when a UGV collides with any element or crosses a lane line, r_ac is set to −100; in other cases, r_ac is set to 0.5. The additional mission of the leader UGV is to reach the expected velocity as soon as possible (in this study, the expected velocity coincides with the maximum speed that the action space can achieve). The main role of r_aim is to urge the leader UGV to move as far as possible from the starting point. At each time step t, the longitudinal distance d(t) between the current location of the UGV and the starting point is calculated, and r_aim is set according to the value of d(t) − d(t − 1): if this value is larger than 0, r_aim = 2; otherwise, r_aim = −1.5. r_cv is set as v_c/v_0, where v_c is the current velocity and v_0 is the expected velocity. The expected velocity range is set as [v_0 − 1, v_0 + 1]; when the leader's speed is within this range, r_ev is set to 1, and otherwise to −1.5. If the leader UGV reaches the destination, r_lr is set to 1000; in other cases, r_lr is 0. At each step, the total reward of the leader UGV is r = r_ac + r_aim + r_ev + r_cv + r_lr. The additional mission of the follower UGV is to keep the expected distance. At each time step t, the relative distance D(t) between the leader UGV and the follower UGV is calculated.
The expected range of D(t) is set as [D_0 − 0.15, D_0 + 0.15]. When D(t) is within this range, r_ed is set to 1; in other cases, r_ed is set to −1.5. If the follower UGV reaches the destination, r_fr is set to 1000; in other cases, r_fr is 0. At each step, the total reward of the follower UGV is r_f = r_ac + r_ed + r_fr.
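The leader and follower rewards described above can be sketched directly; the reward values follow the text, while the function and argument names are illustrative:

```python
def leader_reward(accident, d_t, d_prev, v_c, v_0, reached):
    """Per-step leader reward: r = r_ac + r_aim + r_ev + r_cv + r_lr."""
    r_ac = -100.0 if accident else 0.5                  # collision / lane crossing
    r_aim = 2.0 if d_t - d_prev > 0 else -1.5           # progress from the start point
    r_ev = 1.0 if v_0 - 1 <= v_c <= v_0 + 1 else -1.5   # inside the expected velocity band
    r_cv = v_c / v_0                                    # current / expected velocity ratio
    r_lr = 1000.0 if reached else 0.0                   # destination bonus
    return r_ac + r_aim + r_ev + r_cv + r_lr

def follower_reward(accident, D_t, D_0, reached, tol=0.15):
    """Per-step follower reward: r_f = r_ac + r_ed + r_fr."""
    r_ac = -100.0 if accident else 0.5
    r_ed = 1.0 if D_0 - tol <= D_t <= D_0 + tol else -1.5  # expected-distance band
    r_fr = 1000.0 if reached else 0.0
    return r_ac + r_ed + r_fr
```

For example, a leader moving forward at exactly the expected velocity with no accident earns 0.5 + 2.0 + 1.0 + 1.0 = 4.5 per step.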

SRUFM FOR UGV FORMATION MAINTENANCE
To suppress the overestimation problem and enhance sensitivity to dynamic environmental information, SRUFM retains the structure of D3QN. The online network and the target network both exist, and both adopt the dueling structure shown in Figure 2. SRUFM solves the scalable regular UGV formation maintenance mission by training one of the smallest units. Unlike other RL-based multi-UGV methods, the state of each UGV in the unit is described separately. Each UGV has its own state information, including the front camera image, relative speed, and relative distance. Every UGV in the unit uses the unified SRUFM for its action output at every step, and SRUFM utilises all UGVs' experience to update and learn online, as shown in Figure 3.

The online and target neural network structure
In this study, the dueling network structure is used in both the online and target neural networks, as shown in Figure 2. Image sizes are given in the form (length, width, depth). When acquiring the current environment information of the UGVs in each frame, the RGB-alpha format image obtained from the CARLA camera is first converted to RGB format; it is then greyscaled and resized to (320, 124, 1) according to the region of interest. Considering the possible influence of changes between two consecutive frames on the UGV's decision-making, the greyscale images of two consecutive frames are stacked to form one of the network inputs, whose size is (320, 124, 2). The detailed layer parameters of this part are described below.
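The frame preprocessing just described can be sketched as follows; the crop/resize to the region of interest is omitted, and the simple channel-mean luminance is an assumption (CARLA's camera actually delivers a BGRA byte buffer, so channel handling may need adjusting in practice):

```python
import numpy as np

def preprocess(rgba, prev_gray=None):
    """RGBA camera frame -> greyscale; stacked with the previous
    greyscale frame to give the (320, 124, 2) network input."""
    rgb = rgba[..., :3].astype(np.float32)   # drop the alpha channel
    gray = rgb.mean(axis=-1) / 255.0         # crude luminance, scaled to [0, 1]
    if prev_gray is None:                    # first frame: duplicate it
        prev_gray = gray
    stacked = np.stack([prev_gray, gray], axis=-1)
    return stacked, gray                     # keep `gray` for the next step

frame = np.zeros((320, 124, 4), dtype=np.uint8)  # placeholder camera frame
x, g = preprocess(frame)
```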
The first hidden layer is a convolutional layer containing 32 filters of kernel size (8, 8) with a stride of 1, and its activation function is the rectified linear unit (ReLU) [35]. The second hidden layer is a max-pooling [36] layer of size (5, 5) with a stride of 4, which reduces the features' dimensions and extracts the main features. The third hidden layer is a convolutional layer containing 64 filters of kernel size (5, 5), and the fifth hidden layer contains 64 filters of kernel size (4, 4); their strides are both 1 and their activation functions are ReLU. The fourth and sixth hidden layers are max-pooling layers of size (3, 3), whose strides are both 2. The relative speed and relative distance information (∆V_it, ∆D_it) is concatenated with the flattened convolutional output. The network is then divided into a value function stream and an advantage function stream. The first layer in each stream is a fully connected layer with 64 units and ReLU activation. The second layer in the value function stream contains 1 linear unit and that in the advantage function stream contains 4 linear units. Finally, the value function stream and the advantage function stream are combined to obtain the Q-values of the actions.
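The two-stream head can be sketched with plain NumPy; the weight shapes are illustrative, and the feature vector stands in for the flattened convolutional output concatenated with (∆V, ∆D):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_actions = 66, 64, 4   # illustrative: conv features + (dV, dD)

# One 64-unit ReLU layer per stream, then linear output units
# (1 unit for the value stream, n_actions for the advantage stream).
w1_v = rng.normal(size=(n_feat, n_hidden))
w1_a = rng.normal(size=(n_feat, n_hidden))
w2_v = rng.normal(size=(n_hidden, 1))
w2_a = rng.normal(size=(n_hidden, n_actions))

def dueling_q(x):
    """Value stream + advantage stream, combined as
    Q = V + (A - mean(A)) for identifiability."""
    relu = lambda z: np.maximum(z, 0.0)
    V = relu(x @ w1_v) @ w2_v            # shape (1,)
    A = relu(x @ w1_a) @ w2_a            # shape (n_actions,)
    return V + (A - A.mean())            # broadcast to (n_actions,)

q = dueling_q(rng.normal(size=n_feat))
```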

Algorithm design
The formation maintenance mission of UGVs can be defined as a Markov decision process (MDP). Each MDP can be described by a tuple (S, A, P, R, γ), where S represents all possible states, A all possible actions, P the state transition function, R the immediate rewards, and γ the discount factor. At each time step t, while in state s_t, the agent takes an action a_t by following a policy π. The next state s_{t+1} is then determined by the transition function P, and an immediate reward r_t is received; the cumulative discounted return G_t is used to assess the value of the current state s_t:

G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}    (1)
where t is the time step, k indexes the steps from the current step t to the end of the episode, γ ∈ (0, 1] is the discount factor, and R_{t+k+1} is the immediate reward. Two value functions are formulated to represent the worth of the control action selection. They are named the state-value function v_π(s) and the state-action value function q_π(s, a):

v_π(s) = E_π[G_t | s_t = s]    (2)

q_π(s, a) = E_π[G_t | s_t = s, a_t = a]    (3)

where π is a specific control policy, s_t is the current state, and a_t is the current action. Substituting Equation (1) into (3) gives the Bellman equation

q_π(s_t, a_t) = E_π[R_{t+1} + γ q_π(s_{t+1}, a_{t+1}) | s_t, a_t]    (4)

where s_{t+1} is the next state and a_{t+1} is the next action. From Equation (4), it can be seen that the value of q_π(s_t, a_t) is determined by the immediate reward R_{t+1} and the Q-value of the next step, q_π(s_{t+1}, a_{t+1}). The MDP aims to find the policy that achieves the optimal value q*(s, a) = max_π q_π(s, a).
In the formation maintenance mission, the state transition function P needed to calculate the exact q_π(s_t, a_t) or v_π(s_t) is difficult to obtain explicitly. Therefore, methods that do not require P are needed. Based on temporal-difference (TD) methods, Q-learning [37] is widely used:

q(s_t, a_t) ← q(s_t, a_t) + α [R_{t+1} + γ max_{a_{t+1}} q(s_{t+1}, a_{t+1}) − q(s_t, a_t)]

Q-learning uses a greedy policy to obtain the next action a_{t+1} of the next state s_{t+1}, where q(s_t, a_t) is the online Q-value, q(s_{t+1}, a_{t+1}) is the target Q-value, and α is the learning rate. However, since the state space of the formation maintenance environment is uncountable, it is necessary to construct a mapping between the state and the Q-value of each action:

q(s, a) ≈ f(s, a; w)
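A minimal tabular sketch of the Q-learning update above (the learning rate α = 0.1 is illustrative; γ = 0.95 matches the paper's experiment settings):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """TD(0) Q-learning:
    q(s,a) <- q(s,a) + alpha * (r + gamma * max_a' q(s',a') - q(s,a))."""
    td_target = r + gamma * max(Q[s_next])   # greedy bootstrap on next state
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Toy table with two states and two actions.
Q = {"s0": [0.0, 0.0], "s1": [1.0, 0.0]}
q_update(Q, "s0", 0, 1.0, "s1")
# Q["s0"][0] = 0.1 * (1.0 + 0.95 * 1.0) = 0.195
```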
where f is the approximation function and w is its parameter. Deep learning with neural networks is good at fitting non-linear relationships in complex application scenarios. Therefore, combining deep learning and Q-learning, the DQN method [38] was proposed. It uses the mean-square error to define the loss function L(θ):

L(θ) = E[(R_{t+1} + γ max_a q(s_{t+1}, a; θ) − q(s_t, a_t; θ))²]
where θ is the parameter of the deep neural network, which is updated by gradient descent:

θ ← θ − α ∇_θ L(θ)

Using the same neural network q(s, a; θ) to calculate both the online Q-value and the target Q-value causes poor convergence. An optimised DQN, named Nature DQN [39], was therefore proposed: it owns an online network to calculate the online Q-value q(s_t, a_t; θ) and a target network to calculate the target Q-value q(s_{t+1}, a_{t+1}; θ′). The target y_{t+1} is defined as

y_{t+1} = R_{t+1} + γ max_a q(s_{t+1}, a; θ′)    (10)

According to Equation (10), the maximum of the target Q-value is always selected, which leads to over-estimation of the target value. The DDQN [40] network is designed to solve this over-estimation problem by choosing the action with the maximum Q-value from the online Q-network and taking that action's Q-value from the target Q-network as the target Q-value. In SRUFM, this idea is inherited, and the loss function is defined as

L(θ) = E[(R_{t+1} + γ q(s_{t+1}, argmax_a q(s_{t+1}, a; θ); θ′) − q(s_t, a_t; θ))²]    (11)

In the SRUFM algorithm, dueling DQN [41] is the main structure of both Q-networks; it separates the final Q-value into two parts, the state value V(s_t; θ, β) and the action advantage A(s_t, a_t; θ, α):

Q(s_t, a_t; θ, α, β) = V(s_t; θ, β) + A(s_t, a_t; θ, α)    (12)
where θ denotes the parameters of the convolutional layers, and α and β are the parameters of the two fully connected streams in the dueling network. However, various combinations of the values V and A can yield the same Q-value, which reduces the algorithm's stability. Therefore, in SRUFM, the Q-value of each UGV is designed as

Q(s_t, a_t; θ, α, β) = V(s_t; θ, β) + (A(s_t, a_t; θ, α) − Ā(s_t, a_t; θ, α))    (13)

where Ā(s_t, a_t; θ, α) is the average of the output values of the advantage function. According to Equations (11) and (13), the loss function used in SRUFM is designed as

L(θ, α, β) = E[(R_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ, α, β); θ′, α′, β′) − Q(s_t, a_t; θ, α, β))²]    (14)

where s_t is described as [m_t, g_t], m_t is the front camera image information, and g_t is a two-dimensional vector composed of the relative speed and relative distance of each UGV. The pseudo-code of the SRUFM process is shown in Algorithm 1.
Output: weights θ*, α*, β* for the SRUFM networks.
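The double-DQN target used in this loss can be sketched concretely: the online network selects the next action and the target network evaluates it (the NumPy arrays stand in for the two networks' outputs):

```python
import numpy as np

def srufm_target(r, done, q_online_next, q_target_next, gamma=0.95):
    """Double-DQN target: action selection by the online Q-values,
    evaluation by the target Q-values; terminal states bootstrap nothing."""
    if done:
        return r
    a_star = int(np.argmax(q_online_next))      # online network picks the action
    return r + gamma * float(q_target_next[a_star])  # target network evaluates it

y = srufm_target(1.0, False,
                 q_online_next=np.array([0.2, 0.9, 0.1, 0.4]),
                 q_target_next=np.array([0.5, 0.3, 0.8, 0.1]))
# online argmax is action 1, so y = 1.0 + 0.95 * 0.3 = 1.285,
# not 1.0 + 0.95 * 0.8 as a plain max over the target values would give
```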
Initialise the agents i (i = 1, 2); initialise the experience replay repository D to capacity N, the number of historical observations U = 0, the minimum training replay memory size U_min = 1000, the maximum replay memory size U_max = 6000, the exploration rate ε = 1, the Q-network and its parameters θ, α, β, and the target network Q_t with θ′ = θ. The Q-value of each action is obtained as Q = V + A.
for episode = 1 to M do
    for t = 1 to T do
        Get the observation information s_1t, s_2t.
        Each UGV selects its action a_it from the online Q-network under the ε-greedy policy.
        Get (s_it+1, R_it, done) of each UGV from the environment and append the tuple (s_it, a_it, s_it+1, R_it, done) to D.
        If U > U_min, sample a minibatch (s_t, a_t, s_t+1, R_t, done) from D and update θ, α, β by a gradient step on the loss L(θ, α, β).
    end for
    Reset the target network (θ′, α′, β′ = θ, α, β) every C episodes.
end for
Return θ*, α*, β* for the SRUFM network.

A concise illustration of the frame structure and data flow of the SRUFM algorithm is shown in Figure 3. SRUFM is divided into two parts: the control loop and the learning loop. In the control loop, at each step t, each UGV inputs its respective image and relative information into the online Q-network and chooses its current action under the ε-greedy policy. During training, all experience is stored in the experience replay memory. In the learning loop, a certain amount of experience is extracted to generate the TD error for updating the online Q-network. The target Q-network synchronises its parameters with the online Q-network every few episodes.
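The centralised replay memory shared by both UGVs can be sketched as follows; U_min = 1000, U_max = 6000, and the minibatch size of 20 follow the experiment settings, while the transition contents are placeholders:

```python
import random
from collections import deque

U_MIN, U_MAX, BATCH = 1000, 6000, 20

replay = deque(maxlen=U_MAX)   # once full, the oldest sample is dropped automatically

def store(s, a, s_next, r, done):
    """Both UGVs' transitions go into one centralised replay memory."""
    replay.append((s, a, s_next, r, done))

def sample_minibatch():
    """Learning loop: draw experience only once enough samples exist."""
    if len(replay) <= U_MIN:
        return None            # keep collecting before any gradient update
    return random.sample(list(replay), BATCH)

for i in range(1500):          # fake transitions for illustration
    store(i, 0, i + 1, 0.5, False)
batch = sample_minibatch()
```

Using a bounded `deque` gives the sliding-window replacement behaviour for free: appends beyond U_max silently evict the oldest transition.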

EXPERIMENTS
Choosing a suitable exploration-exploitation ratio makes it possible to obtain higher rewards efficiently. In SRUFM, the ε-greedy strategy [42] is used as the action selection policy: a random action is selected with probability ε, and the greedy action otherwise. The discount factor γ is set to 0.95, and the maximum time per episode is 25 s. With a hardware setup consisting of a single GTX 1050Ti GPU and 12 GB of memory, the maximum replay memory size U_max is set to 6000. During the experiments, this U_max was effective in preventing catastrophic forgetting while maintaining training efficiency. Compared to the minimum training replay memory size U_min used by Wang et al. [43], U_min is set to 1000 (a higher proportion of U_min to U_max), which ensures the richness of samples in the early stage of training. SRUFM generates a replay memory sample at each step, and training starts once the number of experience samples exceeds 1000. The SRUFM network is trained with 20 replay memory samples each time, and the oldest sample is replaced once the number of samples exceeds 6000. The target Q-network synchronises its parameters with the online Q-network every 25 episodes. To compare SRUFM with Nature DQN and DDQN in an effective way, the action space is chosen as shown in Table 1, and the total number of episodes is 4000. The training scenes are shown in Figures 7(a)-(c). Different RL methods are used to learn the three formation maintenance missions. The changes of the action Q-values are shown in Figures 4-6. It can be seen that the Q-values of ac(L0) and ac(F0) (where the throttle equals 0.6 and the steer and brake equal 0) are higher than the other actions' Q-values after training with all RL algorithms, which matches a human driver's decision-making in the same action space. It also indicates that centralised value-based reinforcement learning is effective in solving regular formation maintenance problems.
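An ε-greedy selector consistent with this description can be sketched as follows; the decay schedule is not specified in the text, so the linear decay here is an assumption for illustration:

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay(eps, eps_min=0.05, step=1e-4):
    """Hypothetical linear decay from eps = 1 towards a small floor."""
    return max(eps_min, eps - step)

# With eps = 0 the choice is purely greedy: the best of the four action Q-values.
a = epsilon_greedy([0.1, 0.9, 0.3, 0.2], eps=0.0)
```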
Compared to DDQN and Nature DQN, SRUFM showed higher efficiency in learning the effective action (L0, F0), as shown in Figures 4(a) and (e), 5(a) and (e), and 6(a) and (e). This is especially true in the horizontal and diagonal formation missions, where the Q-value curves reach a stable distribution more quickly. The UGV rewards and distance variations are shown in Figures 8 and 9. It can be seen that SRUFM outperforms the other algorithms in exploration efficiency and convergence speed. DDQN reaches comparable reward results around 2000 episodes in the vertical formation (Figures 8(a) and (b)) and around 3000 episodes in the horizontal formation (Figures 8(c) and (d)). In the diagonal formation, although DDQN learns a stable reward around 3200 episodes (Figures 8(e) and (f)), the distance variations are severe (Figure 9(c)) due to the suboptimal alternative policy it learned. DQN reaches comparable reward results around 3500 episodes in both the vertical formation (Figures 8(a) and (b)) and the horizontal formation (Figures 8(c) and (d)). In the diagonal formation, it fails to learn a high-reward policy (Figures 8(e) and (f)); the gentle distance variations indicate that most episodes finish within a short time (Figure 9(c)).
To verify the algorithm's generalisation ability, a similar CARLA environment is chosen to test the different RL methods' performance, as shown in Figures 7(d)-(f). Another follower UGV (UGV F1) is added to the formation. The maximum bias (Max) and minimum bias (Min) relative to the expected distance are calculated. The distance between the leader UGV and the follower UGV is D1, and the distance between the follower UGV and the new follower UGV is D2.
An episode is counted as successful if, without lane crossings or collisions, each UGV in the formation keeps its distance within the upper and lower error threshold of 0.15 m. The total number of tests is 100, and the success rate is defined as n_s/100, where n_s is the total number of successful episodes.
The results shown in Table 3 are collected from a similar testing road. It can be seen that value-based RL methods are suitable for solving the regular formation maintenance problem. Moreover, SRUFM achieved a small, stable bias and a high success rate, while DDQN obtained a lower success rate in the vertical formation and DQN failed to reach the destination in the diagonal formation (Figure 10).

CONCLUSION
This study presents a new algorithm, SRUFM, which uses DRL to solve scalable regular UGV formation maintenance problems. Adding the real-time relative speed, relative distance, and the front camera image as inputs, the proposed method can easily append other UGVs to perform the same formation maintenance mission well after the model is trained with only two UGVs. It offers a new idea for solving large-scale formation maintenance problems by decomposing the formation into several smallest training units. The simulation experiments also show that the centralised training framework with value-based DRL is suitable for solving scalable regular UGV formation maintenance problems. At the same time, SRUFM outperforms DQN and DDQN in exploration efficiency and convergence speed. Although the work in this study provides encouraging results, there are several limitations related to more complex applications. Future studies aim to extend SRUFM to regular UGV formation maintenance problems on routes with curvature and slopes. Different kinds of UGVs can be grouped into one regular formation, so heterogeneous formation maintenance problems will also be a focus of our future work.

FIGURE 10  Deep Q-network failed in diagonal formation testing