A two‐scaled fully convolutional learning network for road detection

Correspondence Xianliang Hu, School of Mathematical Sciences, Zhejiang University, 310027, Hangzhou, People’s Republic of China. Email: xlhu@zju.edu.cn Abstract This paper aims to detect road regions with a two-scaled deep neural network. Information from different scales helps to boost the performance of deep learning models, and it is a widely used strategy in various computer vision applications. In the two-scaled model, a skip architecture and fully convolutional layers are used to fuse low-level details and high-level semantic information, which enables the detection of road areas from multi-scale feature maps with different receptive fields. To avoid redundant scale information and the loss of features caused by pooling layers, the feature maps before the first pooling layer are adopted in our model. Through learned convolutional kernels, our model can balance the information of the two scales automatically. The loss function is also improved: an intersection over union (IoU) term is added to guide the model to learn more features over whole road regions. Comprehensive experiments on three benchmark datasets demonstrate that this approach reaches state-of-the-art performance.


INTRODUCTION
In recent years, self-driving has aroused wide public concern with the popularity of artificial intelligence. A self-driving car is a vehicle capable of traveling between destinations without a human operator. It is a crucial part of intelligent transportation systems and has the potential to reduce traffic accidents compared with traditional human-driven cars. Detection technologies such as road detection, pedestrian detection, and vehicle detection are fundamental to self-driving. Accurate road detection is the core of vehicle navigation and also the basis for the car to complete other tasks such as anomaly detection [1] and lane detection [2,3], so it is vital to detect the road accurately. The road detection problem is a type of semantic segmentation, which focuses on classifying each pixel of a digital image correctly. For road detection, each pixel of an image is divided into two types, road and non-road. Due to the presence of objects such as pedestrians and cars, the shape of the road surface is complicated and irregular, which makes the problem more challenging than general object recognition. There are many types of sensors for road detection, including monocular color cameras.

More recently, a novel multi-scale neural network was proposed in [15], which utilizes the hierarchical nested structure arising in the fast multipole method. On the other hand, the combination of CNNs and multigrid methods (MG) leads to a new model named MgNet [16]. He and Xu discovered the close connections between CNNs and MG, and then proposed a new model which achieved better performance on benchmark data.
Although multi-scale methods have been proved effective by many experiments, how to make better use of multi-scale information remains a challenging topic. There are two main difficulties in the mainstream approaches: the redundancy of features, and the information loss caused by pooling layers. In this paper, we study a two-scaled structure that collects information only from the shallowest and deepest layers, and our main contributions are as follows:
1. We propose a two-scale fully convolutional network (TFCN) for road detection, which uses information from the lowest and highest levels. The proposed model is evaluated on three public datasets and achieves state-of-the-art performance.
2. By using convolutional layers, the proposed model can balance the features between different scales by itself through learning.
3. An IoU loss term is added to the original cross-entropy loss function, so that the model is trained at both the whole-region and pixel levels. Experimental results show that this improves the performance of the network.
The rest of the paper is organized as follows. The related work about road detection and multi-scale methods is reviewed in Section 2. Section 3 describes our two-scaled model and related strategies in detail. Numerical experiments and results are illustrated in Section 4. Finally, we summarize our work in Section 5.

Road detection
Currently, we focus on monocular camera images [17,18]. In this field, one basic algorithmic framework for road detection can be divided into feature extraction and classification. Using techniques from machine learning and the wavelet transform, many novel methods have been proposed. Based on the illuminant-invariant image, Alvarez et al. [19] propose a shadow-invariant feature space combined with a likelihood-based classifier. A similarity measure and a threshold on this measure are designed for the classifier to decide whether a pixel is in the road region or not. In [20], Gabor filters are adopted to compute texture-orientation features, which are used in a locally adaptive soft-voting (LASV) scheme to estimate vanishing points. Combined with the vanishing-point information, orientation consistency ratio (OCR) features are used to detect road edges. Mei et al. use an RGB space as the feature space in [21], and an algorithmic framework containing inference and learning is proposed based on this space. With the development of deep learning, the powerful feature extraction and representation ability of deep neural networks has led to many methods. A semi-supervised learning (SSL) method using generative adversarial networks (GANs) and a weakly supervised learning (WSL) method based on conditional GANs are proposed by Han et al. [22]. Combining deep CNNs and a Bayesian network, Chen et al. [23] design a road and road boundary detection network (RBNet). Due to the advantages of FCN in image segmentation, some approaches also use similar structures. Muñoz-Bulnes et al. [24] adopt a ResNet with a fully convolutional architecture to segment roads, and data augmented by geometric transformations and pixel-value changes is used in training to improve accuracy.

Multi-scale strategies
In various applications, for example, signal processing [25] and computer vision, multi-scale information is often used to improve the performance of algorithms. For image segmentation, various multi-scale strategies [26], for example, scale spaces and multi-scale decompositions, are widely applied. A model using total variation (TV) regularization and an L1 fidelity term is presented by Yin et al. [27]. They show that the model can be decomposed into a series of subproblems and prove that the TV-L1 model is able to separate features of different scales. Using the TV-L1 model to extract features, Li et al. [28] design a novel multi-scale model for image segmentation. They transform the model into a constrained optimization problem and solve it with the split Bregman method.
To take advantage of the multi-scale features from convolutional layers, skip-architecture [29] and multi-loss function [30] are usually adopted in deep learning. Let us describe multi-scale neural networks in four categories: image pyramid network; skip-architecture network; feature pyramid network; top-down network.
1) Image pyramid network: As shown in Figure 1a, the image pyramid network directly uses different sizes of input images to obtain multi-scale features, as in [31]. For this kind of model, various features are obtained through multiple branches to improve the final results. A typical extension of this model is to increase the connections between the branches, as in other works [32]. The disadvantage of this model is that it usually requires manual selection of input sizes and costs a lot of memory. Considering that different convolutional layers of a single branch can extract features of various levels, many single-stream models are adopted in the following networks.
2) Skip-architecture network: The concept of this network is shown in Figure 1b. For this type of model, the skip layer that integrates different scales of information to get better results is very important. FCN-8s, which adds skip links between lower layers and the final layer to obtain better performance, is an example of this model. Besides, many similar models [11,33] are adopted for image segmentation.
3) Feature pyramid network: As illustrated in Figure 1c, different from the image pyramid network, this network uses multi-level feature maps and multiple outputs for information fusion, as adopted in [34]. One of the key points of the model is how to deal with the outputs. A typical feature pyramid network is the single shot multibox detector (SSD) [35], which uses non-maximum suppression to refine results.
4) Top-down network: Shown in Figure 1d, this is the mainstream model in recent years, and its structure is similar to the encoder-decoder architecture. By applying the top-down path, the features extracted by the CNNs are refined and merged. Naturally, the top-down branch is significant for the model, and many top-down structures are proposed in [36,37].
5) Our network: How to fully utilize the features from the lowest layer/original image is an important topic, as it is in scientific multi-scale applications. Since the pooling operation discards many details, the above models still lose many features even when multi-scale information is used. Moreover, the feature maps of two consecutive scales contain much repeated information. Based on these observations, a network using the shallowest and deepest information is shown in Figure 1e. By adding convolutional layers, the model can learn how to balance the features of different scales to get better results.

Architecture overview
In the last few years, deep convolutional networks have been widely used in image segmentation because of their powerful feature representation ability. It is well known that a deep neural network extracts a hierarchy of features, which has motivated researchers to propose various multi-scale models. Inspired by the feature hierarchy, we integrate multi-scale information for road detection. Considering the information lost by pooling layers, we propose a two-scale model in which one scale is the complete high-resolution feature map before the first pooling layer and the other is the final semantic information. The model brings two advantages: (i) multi-scale information is used, including the whole of the shallow details and the deepest features; (ii) a continuous multi-scale structure is avoided, which prevents repeated information and simplifies the model. The proposed model is a fully convolutional network and its architecture is depicted in Figure 2. We adopt the classical VGG structure [38] to extract features, and the semantic information map is obtained by fully convolutional and deconvolutional layers. To capture the other scale, the low-level features before the first pooling layer are integrated by a 1 × 1 convolution. Finally, the two-scale features are incorporated to get the final result. The advantage of this structure is that, by learning the weights of the 1 × 1 convolutional layers, the neural network can balance the integration of low-level and high-level information without manual adjustment.
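To make the fusion step concrete, here is a minimal numpy sketch of how two learned 1 × 1 convolutions merge a shallow feature map and an upsampled deep feature map into a per-pixel road probability. The shapes, channel counts, and random weights are purely illustrative, not the trained values of our network.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1 x 1) convolution: a per-pixel linear map over channels.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def fuse_two_scales(f_low, f_high, w_low, w_high):
    """Fuse shallow and deep feature maps with two learned 1 x 1 convolutions.
    Both maps must share the spatial size (the deep map is upsampled first)."""
    z = conv1x1(f_low, w_low) + conv1x1(f_high, w_high)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid -> per-pixel road probability

# Toy example: 4 x 4 maps with 8 shallow and 16 deep channels.
rng = np.random.default_rng(0)
f_low = rng.standard_normal((4, 4, 8))
f_high = rng.standard_normal((4, 4, 16))
w_low = 0.1 * rng.standard_normal((8, 1))
w_high = 0.1 * rng.standard_normal((16, 1))
prob = fuse_two_scales(f_low, f_high, w_low, w_high)
print(prob.shape)  # (4, 4, 1)
```

Because the weights of the two 1 × 1 layers are learned jointly, the network itself decides how much low-level detail and how much semantic information to keep at every pixel.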

Two-scale model
The road detection problem can be defined as follows. Given an image $I : \Omega \to \mathbb{R}^3$, where the domain $\Omega$ is a bounded subset of $\mathbb{R}^2$, the aim is to find the best partition of $\Omega$ into two disjoint regions $\Omega_1$ and $\Omega_2$ by certain measures. The general model can be formulated as
$$(\Omega_1^*, \Omega_2^*) = \arg\min_{\Omega_1, \Omega_2} E(\Omega_1, \Omega_2; \Theta), \quad (1)$$
where $\Theta$ represents the set of parameters for $E$, $\Omega_1 \cup \Omega_2 = \Omega$ and $\Omega_1 \cap \Omega_2 = \emptyset$. Based on different energies $E$, many models have been designed for image segmentation. Considering that road detection is a binary classification task and the image domain is a discrete region, we adopt the cross-entropy loss as the energy function $E$. For two distributions $p$ and $q$, the cross-entropy is defined as
$$H(p, q) = -\sum_{x} p(x) \log q(x). \quad (2)$$
If $p$ is a Bernoulli distribution with parameter $\hat{p}$ and $q$ is another with parameter $\hat{q}$, the cross-entropy can be transformed into
$$H(p, q) = -\big[\hat{p} \log \hat{q} + (1 - \hat{p}) \log(1 - \hat{q})\big].$$
In our model, the input $I$ is the image with three channels, and the output $Y$ is the predicted map of roads. For convenience, we denote the weights of the high-level branch as $W_h = \{w^h_{c_1}, \ldots, w^h_{c_n}, w^h_d\}$, where $w^h_{c_l}$ and $w^h_d$ are the weights of the $l$-th convolutional and the last deconvolutional layers, respectively; $c_1, \ldots, c_n$ and $d$ are sequential indices. As for the low-level branch, the filters are determined by the weights $W_l$. To fuse the two-scale output information, two layers with $1 \times 1$ kernels, indicated by $W_f = \{w^f_l, w^f_h\}$, are adopted to merge the feature maps. By using the energy function (2), our model can be expressed as
$$\min_{W_h, W_l, W_f} \; -\sum_{x \in \Omega} \big[\, l(x) \log g(I(x)) + (1 - l(x)) \log\big(1 - g(I(x))\big) \,\big], \quad (3)$$
where the label $l(x) = 1$ if $x$ lies in the road region and $l(x) = 0$ otherwise. The function $g(I(x)) = P(Y(I(x)) = 1; W_h, W_l, W_f)$ represents the neural network. The information from the two scales is obtained and merged by the network $g$ through learning the parameters $W_h, W_l, W_f$, which are trained by the stochastic gradient descent method. After training, we can use the model to predict the road regions of images.
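As a small numerical illustration of the energy, the per-pixel cross-entropy between the labels l(x) and the predicted probabilities g(I(x)) can be computed as below; the four-pixel arrays are invented purely for illustration.

```python
import numpy as np

def binary_cross_entropy(label, p, eps=1e-12):
    """Mean pixel-wise cross-entropy between labels l(x) in {0, 1} and
    predicted road probabilities p(x); eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))))

label = np.array([1.0, 1.0, 0.0, 0.0])  # two road and two non-road pixels
p = np.array([0.9, 0.8, 0.2, 0.1])      # confident, mostly correct predictions
print(round(binary_cross_entropy(label, p), 4))  # small loss, about 0.16
```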

Intersection over union
In deep learning, a model is learned by minimizing the loss function over the training set via some optimization method. However, a lower loss is not always equivalent to a better result. In the example of Figure 3, the cross-entropy loss of Figure 3c is 0.18, which is smaller than that of Figure 3d, yet Figure 3d gives the better segmentation, with higher accuracy.
To address this drawback, the intersection over union (IoU), which evaluates the coincidence of two areas, is introduced. IoU depicts shape similarity and is one of the most commonly used metrics in object detection. It is defined by
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \quad (4)$$
where $A, B \subset \mathbb{R}^n$. Compared with the cross-entropy measure, which focuses on the information of an image at the pixel level, IoU pays more attention to the properties of the whole image region. For the road detection problem, besides the pixel information, the area of the road shape is also important. As shown in Figure 3, IoU is more suitable than the cross-entropy loss in this example, where a higher IoU means better accuracy. Based on this idea, we adopt a new loss function
$$L = -\sum_{x \in \Omega} \big[\, l(x) \log p + (1 - l(x)) \log(1 - p) \,\big] + \big(1 - M(p)\big), \quad (5)$$
where the right term indicates the IoU loss, and we use $p$ to represent $P(Y(I(x)) = 1; W_h, W_l, W_f)$ for convenience. The function $M(p)$ is defined as
$$M(p) = \frac{\sum_{x \in \Omega} l(x) \cdot p}{\sum_{x \in \Omega} \big( l(x) + p - l(x) \cdot p \big)}, \quad (6)$$
where '$\cdot$' denotes multiplication.
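A differentiable version of the IoU term can be sketched as follows, with sums of pointwise products standing in for set intersection and union; the equal weighting of the two loss terms is an assumption of this sketch.

```python
import numpy as np

def soft_iou(label, p, eps=1e-12):
    """Soft IoU between ground-truth labels and predicted probabilities:
    intersection and union are replaced by sums of pointwise products."""
    inter = np.sum(label * p)
    union = np.sum(label) + np.sum(p) - inter
    return float(inter / (union + eps))

def mixed_loss(label, p, eps=1e-12):
    """Cross-entropy plus (1 - soft IoU), so the model is penalized both at
    the pixel level and on the overlap of the whole road region."""
    pc = np.clip(p, eps, 1.0 - eps)
    ce = np.mean(-(label * np.log(pc) + (1.0 - label) * np.log(1.0 - pc)))
    return float(ce + (1.0 - soft_iou(label, p)))

label = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.8, 0.2, 0.1])
print(soft_iou(label, p), mixed_loss(label, p))
```

Note that the soft IoU reaches 1 only when the whole predicted road region coincides with the ground truth, so its gradient pushes the network toward region-level agreement rather than independent per-pixel correctness.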

Datasets
To evaluate the proposed method, we select three widely used road detection datasets; their characteristics are summarized in Table 1.
1) KITTI [39]: The KITTI dataset is one of the most popular public datasets in road detection. It consists of 289 training and 290 test images, covering four categories of road scenes: urban unmarked (UU), urban marked (UM), urban multiple marked lanes (UMM), and the combination of the three (URBAN). For model selection during training, the training set is randomly divided into two subsets (269 images for training and 20 images for validation).
2) CamVid [40]: The Cambridge-driving labeled video database is the first collection of videos with object-class semantic labels, providing 32 classes. We use the image dataset based on CamVid, which consists of 701 road images with a resolution of 720 × 960 pixels. In the following experiments, the original dataset is randomly divided into two parts (401 images for training and 300 images for testing).
3) SC [41]: The single-class road detection dataset (SC) consists of 755 images of 640 × 480 pixels, covering various scenarios at different times such as noon and nightfall; these images cover most scenes in the real world. We divide the dataset into two parts (555 images for training and 200 images for testing) to evaluate the approach.

Metrics
For the binary classification problem, there are several measures to evaluate the methods. We use the following criteria for the approach.
• Precision (PRE): Precision measures the ability of the model to classify the positive class:
$$\mathrm{PRE} = \frac{TP}{TP + FP}.$$
• Recall (REC): Recall measures the ability to find the positive class:
$$\mathrm{REC} = \frac{TP}{TP + FN}.$$
• False positive rate (FPR): FPR is the proportion of negative samples wrongly classified into the positive class among all negative samples:
$$\mathrm{FPR} = \frac{FP}{FP + TN}.$$
• False negative rate (FNR): FNR is the ratio of positive samples predicted into the negative class among all positive samples:
$$\mathrm{FNR} = \frac{FN}{TP + FN}.$$
• Accuracy (ACC): Accuracy indicates the capability to classify all categories:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$
• F1-measure (F1): The F1-measure is the harmonic mean of recall and precision:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}.$$
In the above equations, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. Furthermore, the maximum F1-measure (maxF) and average precision (AP) are also used in the KITTI road benchmark. For the output confidence maps, maxF is computed by choosing the classification threshold $\tau$ that maximizes the F1-measure:
$$\mathrm{maxF} = \max_{\tau} \mathrm{F1}(\tau).$$
A series of precision and recall values can be calculated by choosing different thresholds, giving a precision-recall (PR) curve whose abscissa is recall and whose ordinate is precision. AP is the area enclosed by the PR curve and the coordinate axes:
$$\mathrm{AP} = \int_0^1 P(r)\, dr,$$
where $P(r)$ is the precision of the PR curve at recall $r$.
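The criteria above can be computed directly from the four confusion counts; the following numpy sketch uses an invented 8-pixel label/prediction pair purely as an illustration.

```python
import numpy as np

def metrics(label, pred):
    """Compute PRE, REC, FPR, FNR, ACC, and F1 from binary arrays."""
    tp = int(np.sum((label == 1) & (pred == 1)))
    tn = int(np.sum((label == 0) & (pred == 0)))
    fp = int(np.sum((label == 0) & (pred == 1)))
    fn = int(np.sum((label == 1) & (pred == 0)))
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    return {
        "PRE": pre,
        "REC": rec,
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * pre * rec / (pre + rec),
    }

label = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 3 road, 5 non-road pixels
pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])   # one miss, one false alarm
m = metrics(label, pred)
print(m["ACC"], m["F1"])  # 0.75 and 2/3
```

Sweeping the classification threshold over a confidence map and calling `metrics` at each threshold yields the PR curve from which maxF and AP are derived.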

Implementation details
In our model, we use five convolutional blocks inspired by VGG [38], in which the numbers of convolutional layers are 2, 2, 4, 4, and 3. All filters in these blocks are 3 × 3, while 7 × 7, 1 × 1, and 1 × 1 kernels are chosen in the last block. A feature map with the same size as the input image is then obtained by upsampling. At the same time, the corresponding low-level details are taken before the first pooling layer and fed into a 1 × 1 convolutional layer. In the end, the two feature maps are fused to get the final result.
In the experiments, a stochastic gradient descent algorithm with a momentum of 0.99 is used, and the learning rate is 10⁻⁴. Besides, we set a dropout rate of 0.75 in the last convolutional block. For the different datasets, the input size of the network is shown in the third column of Table 1, and the classification threshold is set to 0.5. For the SC and CamVid datasets, the model is trained on their training sets and evaluated on the corresponding test sets. For the KITTI dataset, we divide the training data into a training set of 269 images and a validation set of 20 images. The first five convolutional blocks are initialized from VGG19 trained on the ImageNet database. The whole network is implemented with TensorFlow 1.13.
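For reference, a single SGD-with-momentum update under the stated hyperparameters (momentum 0.99, learning rate 10⁻⁴) can be written out in plain numpy; this is only an illustration of the update rule, not the TensorFlow optimizer used in our experiments.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.99):
    """One SGD-with-momentum update: the velocity accumulates past
    gradients, so small per-step gradients still make steady progress."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Three updates with a constant toy gradient.
w = np.zeros(3)
v = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
for _ in range(3):
    w, v = sgd_momentum_step(w, g, v)
```

With a momentum as high as 0.99, the effective step size for a constant gradient grows toward lr/(1 − momentum) = 100·lr, which is why the small base learning rate of 10⁻⁴ is paired with it.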

Comparison results
To verify the effectiveness of our approach, we compare it with other models on the above three datasets. The number of layers in the first five convolutional blocks of FCN is the same as in our model. The models trained with the different loss functions (3) and (5) are named TFCN and TFCN+, respectively. Table 2 shows the results of different models on the KITTI validation set. From the quantitative results, it is observed that our model achieves the best result on all criteria, which verifies the validity of TFCN. Moreover, TFCN is simpler than FCN-8s, since only two-scale information is used. By adding the IoU term to the loss function, TFCN+ achieves an improved performance, which indicates the effectiveness of the IoU part. Figure 4 presents the training processes of the three models FCN, TFCN, and TFCN+ in Figures 4a, 4b, and 4c. The losses of TFCN and TFCN+ are smaller and more stable than that of FCN at the later stage, as shown in the right column of Figure 4, which reflects some good properties of our model. For further comparison between our model and other methods, we submitted the result of TFCN+ to the KITTI server, where the algorithms are evaluated in the bird's-eye-view (BEV) space. Our method ranks in the top 10 by average precision (AP) in all four categories of road scenes. The eight real-name submissions that only use left color images in the UM road category and some popular methods in the URBAN road type are listed in Table 3. The results for the UMM road type are listed in Table 4. For convenience, all models in the list are named after their names on the KITTI benchmark server.
In the list, s-FCN-loc [12], SSLGAN [22], RBNet [23], DEEP-DIG [24], StixelNet II [42], MultiNet [43], DDN [44], RoadNet3 [45], Up-Conv-Poly [46], and ALO-AVG-MM [47] all use deep neural networks to detect the road areas. An average F1 value of 94.75% is achieved by TFCN+ for the UM, URBAN, and UMM road images. It is noted that our model achieves competitive results on all criteria for the different road categories, which indicates its effectiveness. From Tables 3 and 4, we find that our model is competitive with these methods. The comparison results on the CamVid dataset against FCN32/16/8s, Unet, Segnet [48], and PSPNet [49] are presented in Table 5. FCN32/16/8s all use fully convolutional layers, and the last two adopt features from different layers to get better results. Based on FCN, Unet and Segnet both use the encoder-decoder architecture for image segmentation. By applying a pyramid pooling module, PSPNet combines features from various levels to segment images. From Table 5, we see that TFCN gets the best recall and FNR. TFCN+ achieves higher accuracy, precision, and F1-measure, but its recall is decreased. Table 6 lists the performance of different models on the SC dataset, which indicates that TFCN achieves the best results on all criteria. Results averaged over all criteria reveal a performance improvement of approximately 0.3% for TFCN+ compared to TFCN, which verifies the usefulness of the mixed loss function.
The experimental results on the three datasets indicate that our model is more competitive than other methods. Occasionally, since the CamVid and SC datasets contain various road scenes with different light intensities, weather conditions, and so on, the performance of TFCN fluctuates slightly. All in all, our model can achieve better results on different datasets. To further study robustness, we convert the raw road images to grey scale and also preprocess them with histogram equalization (HE) [50] first, and then the model TFCN+ is trained and tested on these processed images. HE is an effective image enhancement technique, typically applied to grey-scale images to produce a new image with more contrast. For the road images, we use the HE method in [50]: first, we apply HE to the R, G, and B channels of the images independently of one another; then, the preprocessed images are obtained by merging the new R, G, and B channels. The results on the validation images of KITTI and on the test sets of CamVid and SC are shown in Table 7, where Raw denotes the original images. Although the grey images contain less information, TFCN+ still achieves an ACC close to 99%. For the images preprocessed with HE, our model achieves the best average recall value of 98.09% on the three datasets. From Table 7, we find that the results for the differently preprocessed images are very close, with a maximum difference of only about 0.3%. Overall, our model shows good robustness and can detect roads accurately under various illumination conditions.
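The per-channel preprocessing can be sketched in numpy as below; this is a plain CDF-based histogram equalization and is only an approximation of the HE method of [50].

```python
import numpy as np

def equalize_channel(ch):
    """Histogram equalization of one 8-bit channel: map every grey level
    through the normalized cumulative histogram (assumes a non-constant channel)."""
    hist = np.bincount(ch.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    return np.rint(cdf[ch] * 255.0).astype(np.uint8)

def equalize_rgb(img):
    """Apply HE to the R, G, and B channels independently, then re-merge."""
    return np.stack([equalize_channel(img[..., c]) for c in range(3)], axis=-1)
```

Equalizing the channels independently stretches each channel's contrast to the full 0-255 range, which reduces the influence of illumination on the road appearance at the cost of possible color shifts.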

Visualization results
The results on the KITTI server are shown in Figure 5, where the green, blue, and red regions represent true positives, false positives, and false negatives, respectively. The pictures in the first row are the results in the normal view, and those in the second row are the results in the BEV space, which covers a 20 m × 46 m area of the real world. From Figure 5, we can see that our model can deal with various complicated road surfaces, and only some distant and irregular boundaries cannot be detected accurately. The results on the KITTI test data indicate the good generalization ability of our model. Figure 6 presents visualization results on some examples of the KITTI validation set. From the results of the first and second rows, it is obvious that the accuracy of our model for road detection is higher and the FP region is smaller. Meanwhile, the images of the last two rows show that the results predicted by TFCN have fewer false positives. In general, TFCN is more effective than the other models, and TFCN+ is better than TFCN; it is easy to see that the results from TFCN have smaller red and blue areas. Figures 7 and 8 show that TFCN has good generalization ability to handle a variety of road images.
Some examples of visualization results on the CamVid test set are shown in Figure 7. Our model achieves good results even for irregular road surfaces such as the one in Figure 7b, and it can be applied to many situations. As for the SC dataset, we present some examples in different scenes in Figure 8, for example, urban road, tunnel, slippery road, and highway. From Figure 6 to Figure 8, we find that our model achieves better performance than FCN on road boundaries, but it still does not achieve the best results there; the accurate detection of very complex and irregular road boundaries remains a challenge for our model. Based on the above experimental results, some characteristics are summarized as follows: (1) TFCN has good detection ability for different road conditions. (2) After adding the IoU term to the loss function, the results predicted by TFCN+ are better, and the areas of FP and FN are more concentrated than those of the other models. Generally, our model achieves good performance, and the main reasons are as follows. The road images have two features: first, the road boundaries are complex and fuzzy, so much high-resolution information is required to detect boundaries accurately; second, the features of roads are obvious, which means the semantic information of the road images is clear. The proposed model combines low-level and high-level features, where the former contain much high-resolution information to detect edges and the latter provide semantic features. Consequently, our model is suitable and effective for road detection.

CONCLUSION
In this paper, we propose a two-scaled learning network for road detection, which makes full use of original image details and high-level semantic information. The model exploits the advantages of multi-scale strategies and fully convolutional layers to detect road regions. Our two-scaled model achieves more competitive results on three open datasets than other state-of-the-art methods, which verifies its effectiveness. Our work demonstrates that low-level information is vital for road detection. Compared with FCN-8s, our model achieves better results with fewer parameters by making proper use of low-level features. This suggests that the performance of road detection may continue to improve if we can make better use of low-level information, which we will explore in future work.