Deep tree neural network for multiple‐time‐step prediction of short‐term speed and confidence estimation

Funding information: National Natural Science Foundation of China, Grant/Award Number: 52002262; National Key Research and Development Program of China, Grant/Award Number: 2018YFB1600500

Abstract: To solve the multiple-time-step prediction of traffic speed and its confidence for the segment types of an expressway, a deep tree neural network (DTNN) with multitask learning is proposed. The DTNN contains a classification network, regression networks and a confidence network. These backbone sub-networks accomplish the tasks of distinguishing segment types, fitting the speed of segments and estimating the confidence of the predicted speed, respectively. Through multitask learning, the sub-networks in the DTNN share feature representations and complement each other. To further improve the accuracy of speed prediction under congestion, the mean absolute percentage error loss function (MAPE-loss) is applied in the DTNN; it biases the learning and the extracted features towards low-speed samples. The traffic speed dataset of the Shanghai expressway is used to test the DTNN and 12 comparison methods. The results show that the proposed DTNN with MAPE-loss efficiently improves the predictive accuracy on low-speed samples over the other methods. The trained DTNN also gives highly accurate low-speed predictions on the dataset of the Suzhou expressway. In addition, the smallest reduction in the R-squared value from the training stage to the testing stage illustrates the best generalization of the DTNN model.


INTRODUCTION
As a way to relieve traffic congestion and improve the efficiency of expressway networks, intelligent transportation systems and advanced traffic management systems have been actively developed. A necessary component of such systems is short-term traffic speed prediction, which aims to forecast and evaluate traffic patterns in a short period. If an accurate short-term prediction of congested segments could be derived, then corresponding traffic management could ease congestion and thereby improve the operation efficiency of expressway networks. The early research on short-term traffic speed prediction mainly focused on time series models and statistical models, including the autoregressive integrated moving average (ARIMA) model [1], exponential smoothing model [2], Kalman filter model [3] and Bayesian model [4]. These models can predict stable traffic conditions by using small traffic datasets, but their prediction tends to be inferior given non-linear variations with large-scale and high-dimensional traffic condition data. As traffic data volume increases, short-term prediction based on data-driven approaches offers a potential solution. Many experts have tried implementing machine learning methods to capture non-linear spatiotemporal relationships for traffic state prediction; these methods include k-nearest neighbour [5,6], support vector machine [7], artificial neural network [8], decision tree [9] and Gaussian maximum likelihood [10]. Meanwhile, deep learning shows promise in traffic condition prediction with the advancement of computational power. Numerous studies have applied neural network-based models, such as deep belief networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) and gated recurrent units (GRUs) [11][12][13], to capture the spatiotemporal correlations of traffic conditions and obtain favourable prediction accuracy. Gu Y. et al.
[14] combined a Bayesian model with deep learning for traffic flow prediction. Zhao Z. et al. [15] established a cascaded LSTM network that combines the interactions among road networks in the time and spatial domains. Yu H. et al. [16] proposed a spatiotemporal recurrent convolutional network for traffic forecasting; the proposed method inherits the advantages of deep CNNs and LSTM neural networks. Guo S. et al. [17] proposed a novel attention-based spatiotemporal graph convolutional network that considers recent, daily-periodic and weekly-periodic dependencies. Fukuda S. et al. [18] focused on traffic prediction under incident conditions through a graph convolutional RNN and reduced the effects of unusual conditions on forecasts. Chen X. M. et al. [19] integrated gradient boosting regression trees and the least absolute shrinkage and selection operator to present a multimodel ensemble for short-term traffic flow prediction under normal and abnormal conditions. However, most existing research has focused on single-time-step prediction or one particular segment type (e.g. normal segment, merge segment, diverge segment, weaving segment and ramp). Another issue is that deep learning can always generate a prediction regardless of the divergence between the current situation and the training dataset; in most cases, the confidence of a prediction cannot be obtained at the current time. By contrast, multi-time-step prediction for all expressway segment types is serviceable but sophisticated. Such prediction also helps traffic managers to not only predict traffic conditions but also obtain the prediction confidence. To address the existing issues, we aim to propose a novel approach that is applicable to all expressway segment types. This approach is designed to predict multi-time-step speeds and provide the prediction confidence. The deep tree neural network (DTNN) with multitask learning (MTL) is chosen to achieve the aforementioned objectives.
MTL [20][21][22] is an approach to learning several related problems simultaneously through a shared representation. Continuous short-term predictions can be seen as multiple related tasks.
Fortunately, several attempts have been made to apply MTL to short-term prediction. Jia H. et al. [23] established an attention-based deep spatiotemporal network with MTL, which can forecast steady trends and sudden changes in passenger flow. Kuang L. et al. [24] constructed an MTL component and captured the correlation between taxi pick-up and drop-off tasks with 3D ResNet for accurate prediction. Luo H. et al. [25] proposed an MTDL model to predict short-term taxi demand at multizone levels. Rago A. et al. [26] proposed an MTL model to evaluate the impacts of the observation windows of traffic profiles on classification accuracy, prediction loss, complexity and convergence. These MTL-based models have already shown that learning related tasks jointly results in a more accurate prediction than single-task learning. However, research into the application of MTL to multi-time-step prediction for all expressway segment types remains limited.
Confident forecasting is essential for the quantification of potential risks and uncertainties for delicate management [27], and existing research has explored the quantification of short-term prediction confidence. Lin et al. [28] provided a prediction interval with upper and lower bounds to quantify the uncertainty in short-term traffic volume prediction. Xing et al. [29] proposed a probabilistic forecasting model of traffic flow on the basis of a multikernel extreme learning machine and utilized a quantum-behaved particle swarm optimization algorithm to optimize the output weights. Ghosh et al. [30] established Bayesian support vector regression to obtain error bars as the measurement of uncertainty, along with the predicted duration of incidents. According to the existing research, an interval is normally given to represent confidence. Meanwhile, the confidence of a point traffic speed value in multi-time-step prediction has rarely been given attention.
Given the complex representations of optimized functions and the characteristics of multiple segment types, simultaneously predicting traffic speed and confidence for multiple time steps is a challenge. Dealing with this problem requires solutions to two issues. One is to make use of the correlation between the information at multiple terms, and the other is to fully extract the spatiotemporal features for different segment types. We observe that the forecasting tasks can be decomposed into classification tasks, several groups of related regression tasks and confidence tasks.
1) Classification tasks. In this type of task, the category label is predicted for each segment-type instance. We expect this subtask to be simpler than the regression tasks on the original data. It is also easily addressed by one convolution network.
2) Regression tasks. These subtasks learn from and improve one another by sharing informative representations. In one group of regression tasks, instances are predicted for the corresponding segment types.
3) Confidence tasks. These subtasks estimate the confidence of the speed prediction by calculating the probability of the predicted speed belonging to each level of service.
To promote the training and testing of the proposed model, as well as its prediction evaluation, we utilize the loop detector data of the Shanghai urban expressway. The short-term prediction of speed for congested segments every 5-15 min is then realized. The experimental results show that the prediction results of the proposed DTNN model are significantly better than those of the other 10 methods compared herein. We also demonstrate the effectiveness of the DTNN for prediction tasks. The contributions of this work are highlighted as follows:
1. A classification network can automatically recognize segment types, that is, normal, merging, diverging, weaving and ramp segments. This network provides the features of specific segments for speed prediction;
2. A confidence network is designed to simultaneously give the confidence value of the predicted speed. The result proves the credibility of the predicted speed;
3. MTL is adopted to train the DTNN for multistep speed prediction (i.e. the next 5, 10 and 15 min). MTL improves the generalization of the model by sharing the feature expressions of multiple tasks;
4. The mean absolute percentage error (MAPE) loss function is used in training the proposed DTNN to improve the model's predictive accuracy in low-speed cases.
The paper is organized as follows. Section 2 presents the problem formulation for the multi-time-step speed prediction of expressway segments. Section 3 describes the structure of

Speed dataset and feature preselection
The subject investigated in this study is the Shanghai urban expressway. The loop detector data from 1 March 2014 to 14 March 2014 are collected. A total of 3304 in-path loops are located on the expressway, and the related data are aggregated every 5 min; the daily data volume can reach 1 million records. Each record contains a loop code, a recording time, section speed, section flow rate, section occupancy rate and other related information. The distribution of the loop detectors on the Shanghai urban expressway is shown in Figure 1. The distance between adjacent detectors ranges from 150 to 500 m. Five segment types are considered in this study, namely, the normal segment, merging segment, diverging segment, weaving segment and ramp segment, denoted as S1-S5 in Figure 2. Herein, a triangular fundamental diagram [31,32] is chosen to fit the relationship between speed and density. The fundamental diagram shows the relationship of the three traffic flow elements (i.e. flow, speed and density). The triangular fundamental diagram, which is well known and often used in academia [33], represents the flow-density plane in a triangular shape. This diagram can describe the realistic situation in which drivers maintain the same speed over a range of low densities, possibly because of the current speed limit. The equation is as follows:

$$q = \begin{cases} v_f\,k, & 0 \le k \le k_c \\ \dfrac{k_c v_f (k_j - k)}{k_j - k_c}, & k_c < k \le k_j \end{cases}$$

where q is the flow rate; k is the density; v = q/k is the speed; k_c is the critical density; v_f is the free flow speed and k_j is the jam density, which is always set to 200 pcu km^-1 lane^-1. On the basis of the loop detector data, the flow rate, speed and density in each time interval (i.e. 5 min in this study) can be obtained. Then, the parameters k_c and v_f of the triangular fundamental diagram can be calibrated by the least squares method. Table 1 shows the calibration results of the traffic speed data derived from the five segment types.
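As a concrete illustration, the calibration step can be sketched as a least-squares grid search over (v_f, k_c) with k_j fixed at 200 pcu km^-1 lane^-1. The search ranges and grid resolution below are assumptions for illustration, not values from the paper, which uses a standard least squares fit.

```python
import numpy as np

K_JAM = 200.0  # jam density, pcu/km/lane, fixed as in the paper

def triangular_q(k, v_f, k_c, k_j=K_JAM):
    """Flow from the triangular fundamental diagram (two linear branches)."""
    k = np.asarray(k, dtype=float)
    free = v_f * k                              # uncongested branch, k <= k_c
    cong = k_c * v_f * (k_j - k) / (k_j - k_c)  # congested branch, k > k_c
    return np.where(k <= k_c, free, cong)

def calibrate(k_obs, q_obs):
    """Least-squares calibration of (v_f, k_c) by grid search; a simple
    stand-in for the paper's least squares method. Ranges are assumed."""
    best = None
    for v_f in np.arange(40.0, 100.0, 1.0):
        for k_c in np.arange(10.0, 60.0, 1.0):
            err = np.sum((triangular_q(k_obs, v_f, k_c) - q_obs) ** 2)
            if best is None or err < best[0]:
                best = (err, v_f, k_c)
    return best[1], best[2]
```

On synthetic flow-density points generated from known parameters, the grid search recovers them exactly when they lie on the grid.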
As shown in Table 1, the traffic flow characteristic parameters, that is, the critical density and free flow speed, show diversity among the segment types. For instance, the diverging segment has both the lowest critical density and the lowest free flow speed, whereas the normal segment has a low critical density but the highest free flow speed. The traffic speeds of an expressway are generally related to the segment types, and the traffic flow characteristics vary across segment types [34,35]. The existing research and the above test indicate that the training model should consider the impacts of different segment types.
To choose the appropriate spatiotemporal feature group, we compare combinations of features with different numbers of adjacent segments and recent previous time intervals. Ten congested segments of the Shanghai expressway are randomly selected, and the 5 min speed predictions for the segments are obtained by training a backpropagation neural network (BPNN) for each feature group. The BPNN is a simple neural network that is easy to train. To ensure fairness among the different segments, we apply the same number of training iterations, chosen to guarantee that the BPNN achieves convergence (i.e. 50 in this research). Table 2 shows the average MAPE and root-mean-squared error (RMSE) values of the 10 congested segments for speed prediction.
As shown in Table 2, different numbers of adjacent upstream/downstream segments and previous time intervals are chosen as the combinations of features. The traffic condition of the target segment reflects the spatial dependency of the adjacent segments and the temporal dependency of the recent time intervals. However, the use of more features cannot significantly improve the predictive accuracy and may even bring noise and reduce training efficiency. The experiment indicates that the third combination of features performs best in terms of MAPE and RMSE. Therefore, the speed of the target segment, the five adjacent upstream and downstream segments, and the five previous intervals of each segment are selected. These 55 variables are grouped into a feature vector for training and testing in this study.

Problem formulation
Let t be the current time, and let t+ξ_i (i = 1, 2, …, m) be the next time steps to be predicted. In this study, the interval between time steps is 5 min. Let D be the dataset of traffic speed for the whole time period, including the training and testing datasets. Given an input vector

$$X = [d_t, d_{t-1}, \ldots, d_{t-l}, r_t, r_{t-1}, \ldots, r_{t-l}],$$

where t indicates the current time, d_{t-l} ∈ D denotes the traffic speed of the target segment and r_{t-l} indicates the speed of the related segments at time t-l, the optimization problem for short-term speed prediction on the time steps (t+ξ_1, t+ξ_2, …, t+ξ_m) is defined as follows:

$$\min_{\theta}\,\big( Z_1(\theta), Z_2(\theta), \ldots, Z_m(\theta) \big), \qquad \theta = (\theta_1, \ldots, \theta_N), \qquad (2)$$

where θ denotes the parameter vector of the speed predictive model g and N is the number of parameters in g.
Z_i(θ) indicates the sub-objective function on the i-th time step and is calculated as the mean squared error (MSE) between g(X)_{t+ξ_i} and d_{t+ξ_i}, which are the predicted speed of the model g and the observed speed of the target segment on the ξ_i-th time step, respectively. To solve Equation (2), the multi-objective optimization problem can be converted into a single-objective optimization problem. This can be implemented by reducing the sum of Z_i over the m time steps during the training of a machine learning model. Here, each Z_i is a sub-loss function of the model.
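This scalarisation can be sketched as follows, assuming plain per-step MSE sub-losses Z_i summed over the m predicted steps (here m = 3: t+5, t+10 and t+15 min):

```python
import numpy as np

def sub_loss(pred, obs):
    """Z_i: MSE between predicted and observed speeds on one time step."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return np.mean((pred - obs) ** 2)

def total_loss(preds, obs):
    """Convert the multi-objective problem into a single objective by
    summing the sub-losses Z_i over all predicted time steps."""
    return sum(sub_loss(p, o) for p, o in zip(preds, obs))
```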

Input data
As discussed in Section 2.1, a sample contains 55 variables, covering the 5 latest time steps of the target segment and of the first 5 adjacent upstream and downstream segments. The spatiotemporal relationship between the future and previous speeds is included in the sample. In this study, a sample X with 55 features is arranged as the [t, l] input matrix of the DTNN (i.e. t = 5 and l = 11). Figure 3 shows the [t, l] input data, in which column X_l indicates the speed of traffic node l and row X_t indicates the time step t.
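The arrangement of one sample can be sketched as follows; the (time x segment) layout of the raw speed array is an assumption for illustration:

```python
import numpy as np

T_STEPS, N_SEGS = 5, 11  # 5 latest time steps x (5 upstream + target + 5 downstream)

def build_sample(speed, t, seg):
    """Arrange one sample as the [t, l] = [5, 11] matrix: rows are the 5
    latest 5-min intervals, columns are the 5 upstream segments, the
    target segment and the 5 downstream segments (assumed ordering)."""
    rows = list(range(t - T_STEPS + 1, t + 1))  # 5 latest intervals up to t
    cols = list(range(seg - 5, seg + 6))        # 11 spatially adjacent segments
    return speed[np.ix_(rows, cols)]
```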

DEEP TREE NEURAL NETWORK
We propose the DTNN with MTL to predict speeds on multiple time steps. As shown in Figure 4, the architecture of the DTNN contains three sub-networks, namely, the classification, regression and confidence networks. The classification network identifies the segment type automatically, and the regression network predicts the speed on multiple time steps. The input to the DTNN is the [t, l] matrix X shown in Figure 3. As observed in Figure 4, the three sub-models share the same feature extraction module, which extracts spatiotemporal features from the input X. ŷ_g and ŷ_e are the outputs of the generalist and expert networks, respectively; ŝ is the output of the classification network; ĥ is the dot product of the expert and classification network outputs; ĉ denotes the output vector of the confidence network. Ĥ and C are the two outputs of the DTNN: Ĥ is the predicted speed calculated by the regression and classification networks, and C is the confidence of Ĥ in speed levels calculated by the confidence network.

Classification network
The classification network recognizes the segment type of an input sample X (the [t, l] matrix) and outputs a probability vector ŝ = [ŝ_1, ŝ_2, …, ŝ_k] of X belonging to each segment type. It contains two convolution layers and one FC layer. The convolution takes the following form:

$$u_j^l = \sum_i y_i^{l-1} * I_{ij}^l + b_j^l, \qquad y_j^l = f\!\left(u_j^l\right),$$

where u_j^l indicates the sensitivity matrix of the j-th map at layer l, y_i^{l-1} is the i-th feature map at layer l-1 and y_j^l is the j-th output feature map at layer l. I indicates the convolution kernel, and the activation function f(·) used is ReLU.
The FC layer combines the feature maps to implement a non-linear mapping. Let ŝ = [ŝ_1, ŝ_2, …, ŝ_k] be the output vector of the classification network, where ŝ_i is the probability of the sample belonging to the i-th category. The functions in the FC and output layers take the following forms:

$$u_j^l = \sum_i W_{ij}^l\, y_i^{l-1} + b_j^l, \qquad y_j^l = \rho\, f\!\left(u_j^l\right), \qquad \hat{s}_i = \frac{e^{u_i}}{\sum_{k} e^{u_k}}.$$

Here, u_j^l is the sensitivity factor, and y_j^l is the j-th hidden node at layer l. W_{ij}^l denotes the weight connecting y_i^{l-1} to y_j^l and indicates the importance of the feature maps in generating different feature combinations. b_j^l denotes the bias of the node, and ρ is the dropout probability.
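The convolution and output equations above can be sketched in plain NumPy, as a didactic stand-in for the paper's TensorFlow implementation; the kernel sizes and shapes here are illustrative assumptions:

```python
import numpy as np

def relu(u):
    # activation f(.) used in the DTNN layers
    return np.maximum(u, 0.0)

def conv2d(y_prev, kernels, biases):
    """One convolution layer in the stated form:
    u_j = sum_i (y_i^(l-1) * I_ij) + b_j,  y_j = ReLU(u_j); valid padding."""
    n_in, H, W = y_prev.shape
    n_out, _, kh, kw = kernels.shape
    u = np.zeros((n_out, H - kh + 1, W - kw + 1))
    for j in range(n_out):
        for i in range(n_in):
            for r in range(u.shape[1]):
                for c in range(u.shape[2]):
                    u[j, r, c] += np.sum(y_prev[i, r:r + kh, c:c + kw] * kernels[j, i])
        u[j] += biases[j]
    return relu(u)

def softmax(u):
    # probability vector over the k segment types
    e = np.exp(u - u.max())
    return e / e.sum()
```

With a 5 x 11 input map and 3 x 3 kernels, one valid convolution yields 3 x 9 feature maps, matching the [t, l] input geometry.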

Regression network
The regression networks contain expert and generalist networks, which predict the speed on multiple time steps. The generalist network is trained without distinguishing segment types, and it is good at extracting common features among segment types, such as the correlations with the traffic speed of previous time intervals. By contrast, each expert network is a specialist on one segment type and can extract typical features; for example, the merging segment pays more attention to the adjacent upstream segments, while the diverging segment pays more attention to the adjacent downstream segments.
The input X to the regression network is the same as that to the classification network. In this study, the residual structure is adopted because a residual function, for example, Res(X) = f(X) - X, is easier to fit than the underlying mapping f(X) itself, especially when f(X) is close to the identity mapping f(X) = X. The residual structure takes the following form:

$$y = f\big( f(X * I_1) * I_2 + W_s X \big),$$

where I_1 and I_2 indicate the kernels of the two convolution layers, and W_s X denotes the shortcut across the convolution layers in the DTNN.
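A toy version of the residual computation, with the two convolutions replaced by small linear maps purely for illustration (the actual layers are 2-D convolutions):

```python
import numpy as np

def residual_block(x, w1, w2, w_s):
    """Sketch of the residual structure: two transformed layers plus a
    shortcut projection W_s, i.e. y = f(w2 f(w1 x)) + W_s x. The linear
    maps w1, w2 stand in for the convolution kernels I_1, I_2."""
    relu = lambda u: np.maximum(u, 0.0)
    return relu(w2 @ relu(w1 @ x)) + w_s @ x
```

When w1, w2 and w_s are identity maps and the input is non-negative, the block simply doubles the input, showing that the identity mapping is trivially representable, which is the motivation for the residual form.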
The output ŷ_g of the generalist network is as follows:

$$\hat{y}_g = \sum_j W_{jo}\, y_j^l + b_o.$$

Here, y_j^l is the j-th element at layer l, and W_{jo} is the weight connecting the FC layer to the output, which indicates the contribution of the feature combinations to the output.
The expert network predicts the speed together with the output vector of the classification network. ĥ is the dot product of the output vectors of the expert and classification networks, as shown in Equation (7):

$$\hat{h} = \hat{s} \cdot \hat{y}_e = \sum_{i=1}^{k} \hat{s}_i\, \hat{y}_i. \qquad (7)$$

Here, ŷ_e = [ŷ_1, ŷ_2, …, ŷ_k] is the output vector of the expert network, with ŷ_k indicating the k-th output, and ŝ = [ŝ_1, ŝ_2, …, ŝ_k] denotes the output of the classification network. For instance, if the classification network outputs ŝ = [0, 0.7, 0.3, 0, 0], then ĥ = 0.7ŷ_2 + 0.3ŷ_3. The predicted speed Ĥ is the weighted sum of ŷ_g and ĥ:

$$\hat{H} = w\,\hat{y}_g + (1 - w)\,\hat{h},$$

where w is the weight balancing the generalist and expert networks.
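The dot-product fusion and the weighted sum can be written out directly; the weight w = 0.5 below is an assumed value for illustration, not one reported in the paper:

```python
import numpy as np

def combine(y_expert, s_class, y_generalist, w=0.5):
    """Fuse the expert outputs with the classification probabilities
    (Equation (7)), then take the weighted sum with the generalist
    output to obtain the predicted speed H. w = 0.5 is assumed."""
    h = float(np.dot(s_class, y_expert))   # h = s . y_e
    return w * y_generalist + (1.0 - w) * h
```

With ŝ = [0, 0.7, 0.3, 0, 0], only the two segment types with non-zero probability contribute to ĥ, so the expert for the most likely segment type dominates the prediction.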
To observe the difference between the expert and generalist networks, we use the BPNN to train a "Generalist" model on the mixed segment data and to train "Expert-S1", "Expert-S2", "Expert-S3", "Expert-S4" and "Expert-S5" models on the data of the five segment types (S1-S5, respectively). Figure 5 presents the MSEs of these models on the S1, S2, S3, S4, S5 and mixed segment data. The Expert-S1 model outperforms the other models on the S1 data, and the Generalist model provides a more accurate prediction on the mixed segment data than on the S1 data. The expert and generalist networks thus have their own specific knowledge and improve each other.

Confidence network
Confidence is estimated by calculating the probability of the predicted speed belonging to each speed level of service. The five speed ranges for the expressway are based on the Code for Design of Urban Road Engineering (CJJ37-2012) [36]. In the confidence network, two convolution layers, a max pooling layer, an FC layer with dropout and a logistic output layer are used to accomplish the probability estimation. Herein, max pooling halves the size of each feature map.
The probabilities of the predicted speed belonging to different speed levels should be independent and non-competitive. Softmax gives competing probabilities over multiple outputs, yielding an extremely large probability for a large input and a minimal probability for a small input. Hence, we employ the logistic function, not softmax, to calculate the independent probabilities of the multiple outputs.
The output vector of the confidence network is ĉ = [ĉ_1, ĉ_2, …, ĉ_i]. If the predicted speed Ĥ falls at speed level j, the confidence is C = ĉ_j, which is denoted as follows:

$$C = \hat{c}_j = \frac{1}{1 + e^{-y_j}},$$

where ĉ_j is the j-th output of the confidence vector (the j-th service level) and y_j denotes the corresponding output of the FC layer.
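A sketch of the logistic output, showing why the per-level confidences are independent; unlike softmax, they need not sum to one:

```python
import numpy as np

def logistic_confidence(y_fc):
    """Independent per-level probabilities c_j = 1 / (1 + exp(-y_j));
    each level is scored on its own, so levels do not compete."""
    y_fc = np.asarray(y_fc, float)
    return 1.0 / (1.0 + np.exp(-y_fc))

def confidence_of(pred_level, y_fc):
    """If the predicted speed falls at level j, the confidence is C = c_j."""
    return logistic_confidence(y_fc)[pred_level]
```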

Multitask learning
Through the joint learning of multiple tasks, more informative representations can be achieved. Figure 6 shows the schematic of MTL in the DTNN. During MTL, subtasks belonging to one group are performed in parallel on multiple time steps. In each time step, three types of subtasks run in parallel, and they share feature expressions and promote one another. They are described as follows.
• The k+1 groups of regression tasks are carried out to learn the data of the k segment types and the mixed segment data. Each group of subtasks aims to predict the speed on multiple time steps;
• The classification task classifies samples into the k categories of segments;
• The confidence estimation task provides the confidence of the predicted speed and reduces the probability of an inauthentic predicted speed.
For the classification and confidence estimation tasks, a one-hot code, representing the k segment types and the speed levels, is used as the label of the samples. For the regression tasks, the speed values [v_5, v_10, v_15, …] of the target segment on the multiple steps (t+5, t+10, t+15, …) are used as the labels.
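A minimal sketch of the label construction for both task types; the array layout of the speed series (one value per 5-min interval) is an assumption:

```python
import numpy as np

def one_hot(index, k):
    """One-hot code for k segment types (or speed levels)."""
    v = np.zeros(k)
    v[index] = 1.0
    return v

def regression_labels(speed_series, t, steps=(1, 2, 3)):
    """Speeds of the target segment at t+5, t+10 and t+15 min, where
    each step offset is one 5-min interval in the series."""
    return np.array([speed_series[t + s] for s in steps])
```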
The inductive bias, provided by different tasks, makes the model adaptive to different segment types and even unknown ones. For instance, the classification of segment types shares the segment features with the speed prediction task. These features probably improve the accuracy of speed prediction on a specific segment; the speed prediction task on a 5 min step provides short-term information for prediction on 10 and 15 min steps. The DTNN with MTL acquires knowledge of the segment type, segment speed, and the correlation between multiple time steps. Once the speed prediction on a new segment is required, the DTNN with adequate knowledge is likely to discover the relationship between new segments and trained segments so as to improve speed prediction.

Loss function
Multiple tasks are realized through multiple loss functions in the training process. For the classification task, the cross-entropy loss function is used. The loss function E_net of the DTNN, which contains several sub-loss terms, takes the following form:

$$E_{net} = -\sum_{j=1}^{k} s_j \ln \hat{s}_j \;-\; \sum_{j=1}^{i} c_j \ln \hat{c}_j \;+\; \sum_{j=1}^{k+1} \sum_{\xi} \big( \hat{h}_{t+\xi} - d_{t+\xi} \big)^2 \;+\; \lVert W \rVert^2,$$

where ξ = 5, 10, 15 min, …. Here, the first loss term indicates the cross-entropy loss of the classification network (k segment types); ŝ_j and s_j are the output and label of the j-th segment class, respectively. The second loss term indicates the cross-entropy loss of the confidence network (i speed levels); ĉ_j and c_j are the output and label probability of the j-th speed level, respectively. W denotes the parameters in the DTNN.
• Square loss (L_2). The square loss (L_2) is a conventional loss function, which is sensitive to outlier data. The L_2 loss function is used to train the regression networks.
The third term is the L_2 loss of the j-th regression network (j = 1, 2, …, k+1), where ĥ is the output of the regression networks and d is an observed sample in D.
The fourth term, ‖W‖^2, is the L_2 regularization on all the parameters. It aims to obtain a network of low complexity.
• MAPE loss. The L_2 loss can reduce the error with respect to most samples, but it cannot solve the problem of data imbalance. In short-term prediction, low-speed samples are more difficult to predict accurately than high- and middle-speed samples. In actual settings, accurate prediction of low traffic speed is important because traffic management always pays close attention to the congestion state (i.e. low speed). The input traffic data usually contain a larger volume of high- and middle-speed samples than low-speed samples. These easy samples with high and middle speeds produce a large total loss that makes the learning process unaware of the valuable low-speed samples.
To solve this problem, we employ the MAPE loss. It penalizes the loss from high-speed samples while keeping the loss of low-speed samples relatively large. The third term is changed into the following form:

$$\sum_{j=1}^{k+1} \sum_{\xi} \mu\, \frac{\lvert \hat{h}_{t+\xi} - d_{t+\xi} \rvert}{d_{t+\xi}},$$

where μ is set to 10 to enlarge the weight of this loss term. Compared with the L_2 loss, the MAPE loss reduces the importance of high-speed samples and increases the relative loss of low-speed samples. By comparison, the focal loss (L_3) re-weights easy and hard samples.
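The contrast between the two loss terms can be seen on two samples with the same absolute error of 5 km/h at low and high observed speeds; the L_2 loss treats them identically, while the MAPE loss weights the low-speed sample more heavily:

```python
import numpy as np

MU = 10.0  # weight enlarging the MAPE loss term, as set in the paper

def l2_loss(pred, obs):
    """Conventional square loss; blind to the speed level of the sample."""
    return np.mean((np.asarray(pred, float) - np.asarray(obs, float)) ** 2)

def mape_loss(pred, obs, mu=MU):
    """mu * mean(|pred - obs| / obs): dividing by the observed speed
    shrinks the loss of high-speed samples relative to low-speed ones."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return mu * np.mean(np.abs(pred - obs) / obs)
```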
By combining square loss (L 2 ) and MAPE loss, we apply gradient descent to train the classification, regression and confidence networks.

Gradient descent in deep tree neural network
The parameters W and b are optimized by minimizing the loss function with iterative gradient descent:

$$W \leftarrow W - \eta \frac{\partial E_{net}}{\partial W}, \qquad b \leftarrow b - \eta \frac{\partial E_{net}}{\partial b},$$

where η is the learning rate, which is set to 0.01. We discuss the gradient descent on the parameters of the DTNN; because some parts are similar to the gradient descent of CNNs, herein we only note the distinguishing parts of the DTNN. The gradient on W_s in the residual structure flows through the shortcut connection in addition to the convolution path. The second convolution layer is shared by the regression, confidence and classification networks, so the feedback of all the loss terms is accumulated on its kernel parameters. The FC layers of the different components in the DTNN are independent across tasks, and the gradients on their weights are obtained from the respective loss terms of the generalist, expert, classification and confidence networks. In the regression network, the expert network has y = ŝ u_j, whereas the generalist network has y = u_j.
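The update rule itself can be sketched as follows; the toy quadratic objective is only there to show convergence under the stated learning rate η = 0.01:

```python
import numpy as np

ETA = 0.01  # learning rate, as in the paper

def gd_step(w, grad, eta=ETA):
    """One iterative gradient-descent update, W <- W - eta * dE/dW;
    the same rule is applied to the biases b."""
    return w - eta * np.asarray(grad)
```

For E(w) = (w - 3)^2 the gradient is 2(w - 3), and repeated updates shrink the error by a factor of (1 - 2η) per step, driving w to the minimizer.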

Data description
Loop detector data from the Shanghai urban expressway during 1 March-14 March 2014 are utilized in this study. There are 3304 in-path loops located on the expressway. All the locations of the loop detectors are known, and the spacing between adjacent detectors varies from 150 to 500 m. The data are aggregated every 5 min. Each record includes a loop code, recording time, speed, flow rate and occupancy rate. A total of 406 congested segments are collected, containing 171 segments of S1, 22 segments of S2, 21 segments of S3, 61 segments of S4 and 131 segments of S5. A target segment is randomly selected, and its spatial-temporal traffic characteristics are shown in Figure 7. In Figure 7(a), the temporal dependency of the speed is evident; for instance, there is always a speed drop at 7:00 AM, and the speed increases around 9:00 AM. In Figure 7(b), the traffic speed of adjacent segments shows spatial dependency. The congestion spreads to the upstream when the downstream suffers from a capacity drop, for example, one lane closed due to a traffic accident. Besides, a large traffic demand from upstream will also induce congestion at downstream segments. However, the cyclicality and randomness of the time-dependent speed coexist at the same target segment. Thus, as discussed in Section 2.1, the temporal and spatial features must be included when training the prediction model. Two tests are designed to analyse the predictive accuracy and compare the performance of the different models, respectively. One is to test the predictive accuracy of the models on each segment type; for this case, we select the samples derived from each segment type for 3 days as the test set and the data of the different segment types for the other 12 days as the training set. The other test is to show the performance of the models on samples of mixed segment types; in this case, we randomly select the data of the five segment types for 12 days as the training set and the data for 3 days as the test set.
This operation is conducted five times for five-fold cross-validation. The data amount for training is around 0.468 million records, and that for testing is about 0.117 million records.
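A sketch of the five-fold split; the per-record shuffling below is an assumption for illustration, since the paper holds out whole days (3 test days out of 15) rather than individual records:

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Yield five (train, test) index splits: each run holds out one
    fifth (~20%) for testing, matching the ~0.468M train / ~0.117M
    test proportion reported in the paper."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, test
```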

Running algorithm
In this study, we perform three short-term predictions on traffic speeds (i.e. the current time is 8:00 AM, and the predictions are 8:05 AM, 8:10 AM and 8:15 AM). We implement the DTNN in TensorFlow. The parameters of the DTNN are given in Table 3. All experiments are performed on a PC with an Intel i5 4.0 GHz CPU and an NVIDIA 1060i GPU.

Evaluation index
MSE, RMSE and mean absolute error (MAE) are three typical error measures. MAPE puts a penalty on errors with respect to low speed; therefore, a small MAPE value equates to an accurate prediction of the congestion state. This measure is a commonly used evaluation tool in short-term traffic speed prediction. R-squared (i.e. R^2) evaluates the goodness of fit, which represents the fitness of a model with respect to the observed samples. The larger R^2 is, the better the fitness of the model. All methods use the same number of variables to build their predictive models; hence, one can reasonably compare the different predictive models in terms of R^2. The decline of R^2 from the training set to the test set reflects the generalization of the built model: a small decline indicates good model generalization.
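The evaluation indices can be computed as follows, using their standard definitions (the paper names the measures but does not list the formulas):

```python
import numpy as np

def metrics(pred, obs):
    """MAE, RMSE, MAPE (%) and R-squared for a set of speed predictions."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    err = pred - obs
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = 100.0 * np.mean(np.abs(err) / obs)  # relative error penalizes low speeds
    r2 = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    return mae, rmse, mape, r2
```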

MODEL RESULT VALIDATION AND DISCUSSION

Predicted speed and confidence
The predicted speed curve, along with its confidence and absolute error (AE), is shown in Figure 8. The confidence network achieves 85% accuracy on the training set and 82% accuracy on the test set. For most of the samples, the confidence of the speed predicted by the DTNN ranges from 0.9 to 1, and the AEs are small. We select five samples (lines 1, 2, 4, 5 and 7) with large AEs as instances. A confidence of less than 0.8 is given for the three samples denoted by lines 2, 5 and 7. The result illustrates that the predicted speeds for these samples are questionable. Through the confidence calculation, false speed predictions can be flagged.
A large AE does not necessarily generate a low confidence. When the observed and predicted speeds both belong to level 1 or 5, the confidence may be high despite a large AE; specifically, the speed interval of levels 1 and 5 is longer than that of the other levels. When the observed speed belongs to level 2, 3 or 4, even a small AE of the predicted speed can lead to a low confidence. For example, the DTNN yields a low confidence for one sample (line 5) despite a small AE. In a few cases, the confidence misleads the speed prediction; for a given sample (line 6), the DTNN outputs a low confidence but an accurate prediction.
Residuals, that is, the differences between the observed data and predicted values, can assess the accuracy of predictive models. Well-behaved residuals reveal whether a model has made full use of the available information in the data, even if it is not the most accurate model in prediction. Figure 9 shows how three models at the 5 min step account for the available information. Autocorrelations exist in the residuals; as observed from the autocorrelation plots, the residuals of the DTNN are not highly correlated. The time plots show that the DTNN residuals have fewer patterns and more uniform and smaller variances than those of the LSTM and ARIMA models. The histograms demonstrate that the DTNN's residuals follow a normal distribution with a small variance. All three models are biased; the mean residual of the DTNN is −5.5, while those of the LSTM and ARIMA models are −5 and −7, respectively.

6.2 Comparison between predictive models

Figure 10 shows the predicted curves of the models for the ramp, diverging and merging segment types. DTNN-S builds three individual models for the three steps, whereas DTNN-M trains a single model across all three. As shown in Figure 10, at the 5 min step, DTNN-S yields a correct predictive trend at high speeds despite its low accuracy at low speeds. DTNN-M yields an accurate speed prediction between 9:30 AM and 10:00 AM and a correct trend at 8:30 AM (i.e. the congestion state); this result is marked with a circle on the graph. At the 10 min step, DTNN-M yields a predicted curve similar to those of the other models. At the 15 min step, the curve of DTNN-M fits the observed speed accurately at 12:00 AM. DTNN-S concentrates on speed prediction at a single step without considering the other steps, whereas DTNN-M can improve accuracy by sharing information across steps. Hence, DTNN-M performs best at the 15 min step but relatively poorly at the 5 min step. Tables S1, S2 and S3 present the MAPE and MAE values of the 15 models for speed prediction at the 5, 10 and 15 min steps on the normal (S1), merging (S2), diverging (S3), weaving area (S4) and ramp (S5) segments. In these tables, DTNN-S obtains the smallest MAPE for at least two segment types at the 5 and 15 min steps. DTNN-M obtains an incremental improvement in MAE from the 5 to the 15 min step. DTNN-S yields a smaller MAPE than most of the other models, though its MAPE is slightly larger than those of LSTM, RNN and GRU.
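The two evaluation metrics reported in Tables S1, S2 and S3 are standard; a minimal implementation makes their different emphases explicit. Note how the same absolute error costs far more MAPE at low (congested) speeds:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in km/h."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error (%). Relative errors weight
    low-speed samples more heavily than MAE does."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

obs  = [20.0, 40.0, 80.0]
pred = [25.0, 45.0, 85.0]         # a constant 5 km/h error on every sample
print(mae(obs, pred))             # 5.0
print(round(mape(obs, pred), 2))  # the 5 km/h error is 25% at 20 km/h but 6.25% at 80 km/h
```

This asymmetry is why a model can rank differently under MAPE and MAE, as the tables show.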
LSTM, RNN and GRU perform well in most cases. The LSTM curve is the most accurate at the 10 and 15 min steps, and RNN achieves an accurate prediction at the 5 min step. At the 5 min step, the three networks correctly predict the congestion state during rush hour; at the 15 min step, they generate a wrong prediction at 12:00 AM. They gradually yield a small increase in MAE from the 5 min step to the 15 min step, and their small MAPEs at all three steps suggest high accuracy in low-speed prediction. In the graphs, DCNN, a six-layer network, yields a much lower speed at 8:45 AM than the observed speed. In terms of MAPE and MAE, the six-layer DCNN is better than the conventional models but poorer than DTNN, LSTM, RNN and GRU.
BBRT performs well at the 10 and 15 min steps but poorly at the 5 min step. RF, CART, SVR, BPNN and RBF-NN do not perform well at any of the three steps. BPNN yields a correct predictive trend for the traffic congestion state, but its MAEs and MAPEs for the five segment types are large. RBF-NN fits the training samples perfectly but performs poorly on the test set; it incorrectly predicts the speed at the 5 min step during rush hour (2:00 PM). Three ARIMA models with parameters (p, d, q) = (1, 1, 0), (2, 1, 2) and (2, 2, 2) are tested. They provide well-fitting predicted curves at the 5 min step but poor-fitting ones at the 10 and 15 min steps.
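For reference, an ARIMA(1, 1, 0) model is simply an AR(1) process on the first differences of the series. The sketch below fits that coefficient by least squares and re-integrates the forecasts; it is illustrative only (the toy series and function name are hypothetical), and production code would use a library such as statsmodels instead.

```python
import numpy as np

def arima_110_forecast(series, steps):
    """Minimal ARIMA(1,1,0) sketch: d = 1 differencing, then an AR(1)
    fitted by least squares on the differences, then re-integration.
    Illustrative only, not a substitute for a full ARIMA library."""
    x = np.asarray(series, float)
    d = np.diff(x)                                        # first differences
    phi = np.dot(d[1:], d[:-1]) / np.dot(d[:-1], d[:-1])  # AR(1) coefficient
    last, last_diff, out = x[-1], d[-1], []
    for _ in range(steps):
        last_diff = phi * last_diff     # AR(1) step on the differences
        last = last + last_diff         # undo the differencing
        out.append(last)
    return out

speeds = [60, 58, 55, 50, 46, 43, 41]   # toy decelerating segment
print([round(v, 1) for v in arima_110_forecast(speeds, 3)])
```

Because the model only extrapolates the recent differenced trend, it cannot react to abrupt regime changes, which is consistent with the poor ARIMA behaviour on traffic emergencies noted below.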
Relative to DTNN, the ARIMA models cannot effectively handle traffic emergencies. Regardless of segment type, the poor prediction of the ARIMA models can also be observed from their large MAPEs and MAEs in some cases.

6.3 Improvement in low-speed prediction
Accurate prediction of the traffic congestion state is more important than that of the smooth traffic state, and such accuracy can greatly benefit traffic management. Herein, we examine five loss functions, namely L1 loss, L2 loss, L3 loss, MAPE loss and shrink loss, for low-speed prediction. As observed from Table 4, compared with the other loss functions, MAPE loss yielded only a 0.2% reduction in MAPE over the low- and high-speed samples together; in some cases, MAPE loss offers little advantage on high-speed samples. Owing to its sensitivity to low speeds (i.e. speed < 40 km h−1), however, the MAPE loss function predicts low-speed samples accurately: it obtained a MAPE at least 12.36% smaller than those of the other loss functions except L1 loss. A small increase in MAE is exchanged for this decrease in MAPE. On the S2 segment, L1 loss achieved the smallest MAPE on low-speed samples, while the MAPE loss function yielded only a 2.97% increase in MAE across the different segment types. Figure 11 shows the scatter plots of DTNN-M using the L2 loss and MAPE loss relative to LSTM, RNN and GRU at the 5 and 15 min steps. The isoline represents predicted speed equal to observed speed, and the low-speed samples are circled in the plots. The predicted points of DTNN-M using the MAPE loss at low speeds are nearer to the isoline than those using the L2 loss. Relative to LSTM, RNN and GRU, DTNN-M with MAPE loss generally predicts similar speeds at the 5 min step; at the 15 min step, DTNN-M with MAPE loss yields low-speed predictions closer to the isoline.
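The bias toward low speeds follows directly from the form of the MAPE loss: each sample's absolute error is divided by its observed speed, so the same absolute error contributes more loss, and hence more gradient, at low speed. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """Standard L2 (squared-error) loss, averaged over samples."""
    return np.mean((y_true - y_pred) ** 2)

def mape_loss(y_true, y_pred, eps=1e-6):
    """MAPE loss: each sample's absolute error is divided by its
    observed speed, so low-speed (congested) samples contribute
    more to the loss, and hence to the gradient, than the same
    absolute error at high speed. eps guards against division by 0."""
    return np.mean(np.abs(y_true - y_pred) / (y_true + eps))

y_true = np.array([20.0, 90.0])   # congested vs free-flow sample
y_pred = np.array([26.0, 96.0])   # the same 6 km/h error on both
# L2 treats both errors identically; MAPE penalizes the 20 km/h
# sample 4.5 times as heavily as the 90 km/h one.
print(np.abs(y_true - y_pred) / y_true)   # per-sample contributions
```

Training with this loss therefore pulls the learned features toward fitting the congested samples, which is exactly the effect Table 4 reports.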

6.4 Generalization of predictive models
We compare the generalization of the two DTNNs (using L2 loss) with that of the other 10 models. Figure 12 shows the evaluation bars of the models at the three time steps; all results are the mean values of five runs on the datasets. As shown in the figure, the two DTNNs, BBRT, RNN, LSTM and GRU obtain small MAPEs and large R2 values at all three time steps. The large R2 and the small decline of R2 from training to testing illustrate the satisfactory generalization of DTNN. Among the other models, BBRT, DCNN, RNN, LSTM and GRU also show a small decline in R2 across datasets, highlighting their good generalization. BPNN and RF obtain small MSEs and MAEs and a high R2, but their decline in R2 is unstable, so their generalization is uncertain. By contrast, CART, SVR and RBF-NN obtain large MSEs and MAEs and small R2; their generalization is certainly poor.
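The generalization proxy used here, the drop in R2 from the training set to the test set, is easy to compute. A minimal sketch on toy numbers (the arrays are illustrative only):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_decline(train_true, train_pred, test_true, test_pred):
    """Drop in R^2 from training to testing; a smaller decline
    indicates better generalization."""
    return r_squared(train_true, train_pred) - r_squared(test_true, test_pred)

# Toy data: near-perfect training fit, noisier test fit.
decline = r2_decline([1, 2, 3, 4], [1.1, 1.9, 3.0, 4.1],
                     [1, 2, 3, 4], [1.5, 1.5, 3.5, 3.8])
print(round(decline, 3))
```

A model that memorizes the training set shows a high training R2 but a large decline; a well-generalized model keeps both R2 values high and the decline small, which is the pattern Figure 12 reports for DTNN.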

DISCUSSION
Training set size. In theory, the larger the training set, the higher the accuracy of a predictive model, but this is not absolute. In practice, highly diverse data are more valuable for training a generalized model: if the diversity is insufficient and the samples are too similar, no amount of training data can produce a good model. In this study, we used 12 days of data to train the DTNN, and the predictive accuracy in terms of MAPE and MAE was satisfactory, which illustrates that 12 days of data are enough to train a DTNN. Table 5 shows that training with 7, 8, 9, 10 or 11 days of data leads to an obvious decrease in the accuracy of DTNN on the S5 segment but little or no decrease on the S1 segment. This illustrates that more training data are still necessary in some cases.
Knowledge acquisition. DTNN can build a highly generalized model and reduce the structural risk to overcome overfitting. Its abundant knowledge acquisition is attributed to MTL: the recognition of segment types provides knowledge of segment speed to the regression networks. In the DTNN, the expert networks combined with this segment knowledge perform accurately on the segment types on which the DTNN has been trained. For new or unknown segment types, the generalist network, trained on mixed segment types, can still predict a speed close to the observed value.
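One way to picture the expert/generalist division of labour is a confidence-gated routing rule. The sketch below is our simplification, not the paper's exact combination scheme: the classifier's segment-type probabilities select an expert when the type is recognized confidently, and fall back to the generalist otherwise.

```python
import numpy as np

def gated_prediction(type_probs, expert_speeds, generalist_speed, threshold=0.5):
    """Illustrative gating (details assumed, not the paper's exact
    scheme): if the classification network is confident about the
    segment type, return the matching expert's predicted speed;
    otherwise fall back to the generalist network, which was
    trained on mixed segment types."""
    type_probs = np.asarray(type_probs, float)
    best = int(np.argmax(type_probs))
    if type_probs[best] >= threshold:
        return expert_speeds[best]
    return generalist_speed

# Confident classification -> the expert for that segment type.
print(gated_prediction([0.05, 0.85, 0.10], [72.0, 38.5, 55.0], 50.0))  # 38.5
# Uncertain classification (e.g. a new segment type) -> generalist.
print(gated_prediction([0.34, 0.33, 0.33], [72.0, 38.5, 55.0], 50.0))  # 50.0
```

The threshold value and the hard routing are both simplifications; a soft, probability-weighted mixture of expert outputs would behave similarly.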
Tree-based ensemble regression models can perform well in some cases because of the diversity of their sub-models and the different feature combinations they generate. RNN, LSTM and GRU are models designed for sequential data; they have potential in predicting short-term speed if integrated with spatial feature networks. However, these comparison models can acquire only limited relationship information from the input data because of their single learning task, which is why they cannot outperform DTNN in most cases.
Stable prediction. It must be noted that no model can perform perfectly or outperform all the others in every case. As the Chinese proverb goes, "a specialist only masters his own field": DTNN with different loss functions plays a role in different fields.
• DTNN with MAPE loss performs well in low-speed prediction, as can be observed from Table 4.
• L2 loss cannot deal well with traffic emergencies because of its sensitivity to outliers; it reduces the error on outliers by sacrificing predictive accuracy on the other samples. On expressways, low-speed samples are usually the minority and high- and middle-speed samples the majority, so L2 loss can still achieve an error reduction on the whole dataset.
As observed from Tables S1, S2 and S3, DTNN cannot give stably accurate predictions on different segments because the actual traffic speeds of the segments vary: some segments are dominated by the low-speed state, while others remain in the middle- and high-speed states for long periods. DTNN with L2 loss achieves high predictive accuracy on segments in the middle- and high-speed states, while DTNN with MAPE loss achieves high predictive accuracy on low-speed segments.
External factors and improvement. Many studies have indicated that weather and weekdays/weekends are important factors influencing traffic prediction accuracy; the weather factors include weather status, temperature and wind. These factors can be added to the input as new features if DTNN is applied to prediction over more time steps. Meanwhile, construction works would also affect the state of segments.
In addition to the external factors, missing and incomplete data are also important factors affecting accuracy. Currently, ground loop detectors sometimes fail to detect the traffic flow, so new detector equipment could be employed to capture the traffic flow more accurately. With an accurate dataset, the trained model is bound to be more accurate in short-term speed prediction.

CONCLUSION
To solve multiple short-term (e.g. 5, 10 and 15 min) predictions of traffic speeds for different segment types, we propose DTNN with MTL. Owing to the different features of segment types and the strong correlation between traffic speeds in consecutive time periods, we assign the DTNN a classification task for distinguishing the segment types of samples and multiple groups of regression tasks for simultaneously fitting samples of consecutive times on each segment type. These classification and regression tasks improve and learn from one another. In the DTNN, the classification network shares informative representations with the generalist and expert networks, which carry out the groups of regression tasks. To further improve the forecast of the traffic congestion state, we propose the MAPE loss to bias the MTL toward low speeds. The traffic speed dataset of the Shanghai Expressway is used to test our method and 10 other methods. The results show that DTNN with L2 loss obtains the smallest MAE in most cases and that DTNN with MAPE loss efficiently improves the predictive accuracy for low-speed samples. In terms of R2, the DTNN also achieves the largest value in most cases. In addition, the smallest reduction of R2 from training to testing illustrates the best generalization of the DTNN model.
We identify three future research directions. The first is to analyze the sensitivity to different prediction time intervals (5 min in this research) to evaluate the time dimension generalization of the proposed model. The second is to apply the classification and regression processes to urban road segments. The final prospect is to further improve the accuracy and confidence, especially in low-speed predictions.