Dense spatio-temporal stereo matching for intelligent driving systems

This paper addresses the problem of matching stereo images acquired by a stereo system mounted aboard an intelligent vehicle. The main idea behind the new method is to involve temporal matching between the current stereo pair and the preceding one, so that the spatial matching of the current pair benefits from the matching results obtained at the previous frame. The proposed method proceeds in three main steps. First, an edge-based disparity map, which we call the assisting disparity map (ADM), is derived from the disparity map of the preceding frame. Second, for each scan-line, a set of local ranges and global ranges is deduced from the ADM to keep only potential matching candidates. Third, the matching is done with a dynamic programming algorithm constrained by the local and global ranges obtained in the previous step. The proposed approach has been tested on both real and synthetic stereo sequences, and the results demonstrate its effectiveness.


INTRODUCTION
Stereo vision is a measurement method that finds correspondences between two or more input images in order to obtain a detailed 3D representation of a scene. It has seen a notable increase in research effort due to numerous applications such as 3D navigation, obstacle detection and 3D reconstruction. One of the main applications of stereo vision methods is advanced driver assistance systems (ADAS) [1-3], which aim at improving driving comfort and the safety of people and goods on the road. Some works propose to solve the problem of matching stereo images using local algorithms based on block matching or pixel correlation due to their simplicity [4-6]. Other works suggest using global methods such as graph-cut (GC) [7,8], belief propagation [9-11] or dynamic programming (DP) [2,12,13]. These approaches have become very popular due to their accurate and fast results.
This work tackles the problem of stereo matching in ADAS. The stereo sensor providing the stereo images is fixed on an intelligent vehicle. The proposed stereo matching approach should deal with dynamic scenes, since the intelligent vehicle and the other vehicles and pedestrians around it are moving. We believe that integrating temporal information in a stereo matching approach can improve the results by considering the consistency between temporally adjacent frames [2,13-15]. Therefore, we present in this paper a dense stereo matching method based on spatial and temporal information. The main novelties of the proposed method are: the computation of the assisting disparity map, from which global and local ranges are deduced; the usage of a novel cost function that takes advantage of different image components such as gray intensity, gradient magnitude, gradient orientation and the census transform; and the utilisation of a two-way dynamic programming algorithm. To initialise the system, the first disparity map is generated without using the temporal information (i.e. by applying dynamic programming directly on the images). In the next frames, we start by detecting edge curves in the stereo pair; the spatial matching between the preceding stereo pair is deduced from the preceding disparity map, and the temporal matching between temporally adjacent images is obtained using optical flow [16,17]. From these correspondences we can generate an edge-based disparity map called the assisting disparity map (ADM), from which global and local ranges of disparity are deduced.
These ranges are used as control points to drive the dynamic programming in the search process for matching the current stereo pair, with the aim of obtaining a dense disparity map.
(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.)
The paper is organised as follows: Section 2 reviews state-of-the-art stereo matching algorithms using temporal information. Section 3 details the proposed matching algorithm. Section 4 presents the experimental results and comparisons. Finally, Section 5 concludes the paper.

RELATED WORKS
Stereo matching methods can be categorised into two main categories: local and global methods. Local methods match every pixel in the reference image independently, typically using a pixel-based matching cost such as block matching or pixel correlation [4-6,18]. Almost all these methods use the winner-take-all strategy, in which the disparity of a pixel is chosen based on the lowest aggregated cost. The matching cost is aggregated using an aggregation function that basically assumes that pixels in homogeneous areas share the same disparity. Aggregation functions can be classified into window-based [19], filter-based [20,21] and segment-tree-based [22,23] methods. On the other hand, global methods try to minimise an energy function either on a single scan-line or over the entire image. Algorithms such as graph-cut (GC) [7,8], dynamic programming (DP) [2,12,13], semi-global matching (SGM) [24] and belief propagation (BP) [9-11] have become very popular due to their accurate results. A taxonomy and evaluation of these global methods is presented in [25]. It was demonstrated that GC and BP approaches give the best results at the expense of computational time, while DP and SGM give good results but are much faster due to their one-dimensional optimisation. As mentioned in [25], the disparity range is a major key to the quality of the disparity map for all stereo matching approaches: incorrect disparity ranges can affect the performance of a stereo method in both complexity and accuracy. Generally, the developed methods assume that the appropriate disparity search range is already established [26]. The subject of using temporal consistency [27-29] to obtain more accurate disparity maps has attracted a lot of attention in recent years. Some works depend on optical flow, while others use a spatio-temporal window to match stereo sequences.
In [30], Zhang et al. extended the spatial window used for cost computation to a spatio-temporal window: the spatial window is used to compute the sum of squared differences (SSD), while the spatio-temporal window is used to compute the sum of SSDs. This method showed promising results in static scenes but did not perform well in dynamic scenes. In [31], the authors presented a novel algorithm that takes a video sequence as input and computes a depth map for each frame. Davis et al. [32] developed a method called space-time stereo, similar to the one in [30]; however, it was designed to operate in geometrically static scenes.
In [33], the authors presented a method for estimating disparity ranges based on a disparity histogram generated with the sparse feature matching algorithm SURF. Cai [34] proposed an approach that integrates optical flow estimation into a block stereo matching algorithm using dynamic programming. In [14], Jiang et al. used feature detection, edge motion estimation and motion detection to predict the disparity map in consecutive frames. Dobias et al. [35] proposed a method for predicting the disparity map of the current frame from the one computed at the preceding frame, using the estimated motion of the calibrated stereo rig. Vedula et al. [36] computed three-dimensional scene flow from two-dimensional optical flow using a linear algorithm, from which three-dimensional structure information is deduced.
In [2], the authors presented a method to match edge points of stereo images, in which the temporal information is integrated by finding the relationship between consecutive frames using an algorithm called association. A disparity range is deduced from both the association and the disparity of the previous frame to compute the current one. The work in [37] expanded on the same idea by computing the initial disparity map by matching edge curves; this map is used to obtain disparity ranges, together with matching control edge points that guide the dynamic programming algorithm in matching the current frame. In [29], the authors presented a fast edge-based stereo matching approach devoted to road applications, in which two passes of the dynamic programming algorithm are applied to estimate the final disparity map. Some works propose to solve the stereo matching problem using convolutional neural networks. Yang et al. [38] suggested a unified model that employs semantic features from segmentation and introduces a semantic softmax loss, which helps improve the prediction accuracy of disparity maps. In [39], the authors proposed to encapsulate all convolutional features into a unary feature descriptor using multi-level context ultra-aggregation.

PROPOSED STEREO MATCHING ALGORITHM
This section details the steps of the proposed stereo matching algorithm, which is depicted in Figure 1. The main contribution of the proposed method consists in involving temporal information in the matching process: the matching results of the preceding frame are exploited in the matching of the current one. For clarity, we use the following notations in the rest of the paper: 1. I_k^L and I_k^R indicate the left and right stereo images acquired at time k, and f_k = (I_k^L, I_k^R) is the current stereo pair. 2. d_k (resp. d_{k-1}) indicates the disparity map computed at time k (resp. k - 1).
The computation of the disparity map d_k illustrated in Figure 1 can be described as follows. First, the so-called ADM is calculated based on the disparity map computed at the preceding frame; the ADM allows us to generate the possible disparities in each scan-line of the current frame f_k. Second, a dynamic programming based algorithm is applied on the stereo images of the current frame f_k to derive its disparity map.
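The per-frame loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper functions match_dp, compute_adm and ranges_from_adm are hypothetical placeholders for the steps detailed in the following sections.

```python
def match_sequence(frames, match_dp, compute_adm, ranges_from_adm):
    """frames: list of (left, right) stereo pairs f_k.
    match_dp(left, right, ranges): dynamic-programming matcher.
    compute_adm(prev_pair, cur_pair, prev_disp): assisting disparity map.
    ranges_from_adm(adm): per-scan-line disparity ranges."""
    disparity_maps = []
    prev_disp = None
    for i, (left, right) in enumerate(frames):
        if prev_disp is None:
            # Initialisation: the first frame is matched without
            # temporal information.
            disp = match_dp(left, right, None)
        else:
            # Later frames: derive the ADM from the preceding result,
            # then constrain the matcher with the deduced ranges.
            adm = compute_adm(frames[i - 1], (left, right), prev_disp)
            disp = match_dp(left, right, ranges_from_adm(adm))
        disparity_maps.append(disp)
        prev_disp = disp
    return disparity_maps
```

The loop makes the temporal dependency explicit: each frame's matcher only ever sees ranges derived from the previous frame's disparity map.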

The assisting disparity map computation
This section clarifies the steps we follow to obtain the assisting disparity map (ADM). It is an edge-based disparity map deduced from the preceding disparity map d_{k-1}. The ADM is used to determine the disparity ranges at each image scan-line, which guide the dynamic programming search through the disparities obtained at the edge points. The quality of the ADM is an essential element in the process of getting an accurate disparity map.

Edge point detection
The first step of computing the ADM is detecting edge curves in the stereo images, where depth discontinuities are most likely to be located. In this work, we use the Canny edge detector [40], which provides continuous edge curves and produces significant edge points, which are crucial to the proposed matching method. Let us denote by S_f^m = {C_f^{m,i}}, i ∈ [1, N], the set of edge curves of the image I_f^m, where f ∈ {k - 1, k} is the frame index and m ∈ {L, R} is the index of the stereo image (L for the left image and R for the right image).
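Grouping the detected edge pixels into curves can be sketched as a connected-component traversal over the binary edge map; this is an illustrative sketch (the paper does not specify the grouping procedure), assuming 8-connectivity.

```python
from collections import deque

def edge_curves(edge_map):
    """Group the edge pixels of a binary edge map (e.g. the output of a
    Canny detector) into 8-connected components; each component stands
    for one edge curve C^{m,i}. Returns a list of pixel-coordinate lists."""
    h, w = len(edge_map), len(edge_map[0])
    seen = [[False] * w for _ in range(h)]
    curves = []
    for y in range(h):
        for x in range(w):
            if edge_map[y][x] and not seen[y][x]:
                # Breadth-first traversal of one connected edge curve.
                curve, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    curve.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and edge_map[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                curves.append(curve)
    return curves
```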

Spatio-temporal matching of edge curves
The procedure of deducing the ADM is based on correspondences between edge curves in the consecutive frames f_{k-1} and f_k and on the disparity map d_{k-1} of the frame f_{k-1}. This is achieved through four main steps, as shown in Figure 2. 1. Edge curves S_k^L in the image I_k^L are matched temporally with edge curves S_{k-1}^L in the image I_{k-1}^L using optical flow. 2. The spatial correspondences between S_{k-1}^L and S_{k-1}^R are deduced from the preceding disparity map d_{k-1}. 3. Edge curves S_{k-1}^R in the image I_{k-1}^R are matched temporally with edge curves S_k^R in the image I_k^R using the same principle as the first step. 4. The last step deduces the correspondences between edge curves S_k^L and S_k^R. To locate the spatial match of C_k^{L,i}, we first find its match in I_{k-1}^L, which we call C_{k-1}^{L,j}. Second, the edge curve C_{k-1}^{R,m} in the image I_{k-1}^R corresponding to C_{k-1}^{L,j} is found. Third, we search the image I_k^R for the match of C_{k-1}^{R,m}, which we call C_k^{R,n}. Therefore, the edge curve C_k^{L,i} is the match of C_k^{R,n}. We apply the same steps to all edge curves in the image I_k^L in order to deduce their matches in I_k^R. From these correspondences, we can generate the ADM for the current frame f_k. An example of the ADM is depicted in Figure 3.
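The chaining of correspondences in the last step can be sketched as three dictionary look-ups; the dictionary representation of each correspondence map is an illustrative assumption.

```python
def spatial_match(temporal_L, spatial_prev, temporal_R):
    """Chain three correspondence maps to get the left-right edge-curve
    match at frame k. Each argument maps a curve id to its correspondent:
    temporal_L:   curve in I_k^L     -> curve in I_{k-1}^L (optical flow)
    spatial_prev: curve in I_{k-1}^L -> curve in I_{k-1}^R (from d_{k-1})
    temporal_R:   curve in I_{k-1}^R -> curve in I_k^R     (optical flow)"""
    matches = {}
    for c_L_k, c_L_prev in temporal_L.items():
        c_R_prev = spatial_prev.get(c_L_prev)
        if c_R_prev is None:
            continue  # no spatial match in the preceding frame
        c_R_k = temporal_R.get(c_R_prev)
        if c_R_k is not None:
            matches[c_L_k] = c_R_k
    return matches
```

A curve drops out of the ADM as soon as any link in the chain is missing, which is why the quality of the preceding disparity map matters.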

Stereo matching algorithm
We detail in this section the proposed algorithm, based on dynamic programming, adopted to obtain the disparity map. The ADM is used to generate the global and local disparity ranges in each image scan-line. The pairs of matched edge points that create the ADM also allow guiding the dynamic programming algorithm towards the best matches.

Disparity range
The choice of the minimal and maximal disparity values is crucial for the quality of the disparity map. We utilise the same idea as in [2], where the authors compute the v-disparity map [41] and determine the range of the disparities based on its analysis. The main idea of the v-disparity map consists in accumulating, on each scan-line, the pixels having the same disparity. The v-disparity map of a road scene can thus be divided into two main parts: the top part containing the obstacles and the bottom one containing the road. The obstacles appear as vertical lines, while the road appears as an oblique line. In the part representing the road, the disparity at each scan-line y_i has the value a × y_i + b, where a and b are the parameters of the oblique line equation. Hence, to take into account the uncertainty caused by the computation, the disparity at scan-line y_i should lie between d_min = (a × y_i + b) - ε and d_max = (a × y_i + b) + ε. In the top part representing the obstacles, the disparity value should lie between d_min = d_1 - ε and d_max = d_2 + ε, where d_1 is the disparity of the farthest obstacle, d_2 is the disparity of the closest obstacle and ε is a tolerance value to select. Figure 4 illustrates the process of extracting the disparity ranges.
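The accumulation step and the road-range formula above can be sketched as follows. The sketch assumes the oblique-line parameters a and b have already been fitted on the v-disparity map (the fitting itself, e.g. by a Hough transform, is omitted).

```python
def v_disparity(disp_map, d_max):
    """Build the v-disparity map: for each scan-line, count how many
    pixels share each disparity value in [0, d_max]."""
    vmap = []
    for row in disp_map:
        hist = [0] * (d_max + 1)
        for d in row:
            if 0 <= d <= d_max:
                hist[d] += 1
        vmap.append(hist)
    return vmap

def road_range(y, a, b, eps):
    """Global disparity range on a road scan-line y, given the
    oblique-line parameters (a, b) and a tolerance eps:
    [ (a*y + b) - eps, (a*y + b) + eps ]."""
    d = a * y + b
    return max(0, int(d - eps)), int(d + eps)
```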

Local disparity ranges
Since the depth discontinuities are usually located on the edge points, we use the ADM to determine the disparity values of the edge points in the image and use them as local disparity ranges. In each scan-line, a set of local disparity ranges is determined: for each edge point e_i, let d_i = ADM(e_i) be the disparity determined by the ADM; the interval [d_i - ε, d_i + ε] is then defined as the local range deduced at that point, where ε is a tolerance value to select. These ranges are used to guide the dynamic programming algorithm to make the best matches, as shown in Figure 5.
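The local-range construction and the pruning test it induces can be sketched as follows; representing a scan-line's ADM entries as a column-to-disparity dictionary is an illustrative assumption.

```python
def local_ranges(adm_row, eps):
    """Local disparity ranges on one scan-line: for every edge point at
    column x whose disparity d is known from the ADM, admit only the
    interval [d - eps, d + eps]. Non-edge columns are absent."""
    return {x: (max(0, d - eps), d + eps) for x, d in adm_row.items()}

def is_valid(x, d, ranges):
    """A node (x, d) of the search plane survives the pruning if x is
    not an edge point, or if d lies inside the local range at x."""
    if x not in ranges:
        return True
    lo, hi = ranges[x]
    return lo <= d <= hi
```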

Cost function
We use a cost function based on gray-scale, texture, gradient magnitude and gradient orientation information to reflect the dissimilarity between two pixels p(x, y) and q(x, y + d) on the same scan-line of the stereo images. It is defined by the following equation:
C(p, q) = w_1 |I_L(x, y) - I_R(x, y + d)| + w_2 Ham(I_lbp_L(x, y), I_lbp_R(x, y + d)) + w_3 |I_grad_L(x, y) - I_grad_R(x, y + d)| + w_4 |I_ori_L(x, y) - I_ori_R(x, y + d)|
where I(x, y) is the gray intensity of the image, I_lbp(x, y) is the census transform (compared through the Hamming distance Ham), I_grad(x, y) is the gradient magnitude, I_ori(x, y) is the gradient orientation and the w_i are the weights associated with each component. We believe that the use of these components better reflects the dissimilarity between pixels, even in areas that contain little texture and few distinct features.
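A minimal sketch of such a weighted cost follows; the particular weight values and the Hamming comparison of the census bit strings are illustrative assumptions, not the paper's tuned parameters.

```python
def hamming(a, b):
    """Hamming distance between two census bit strings stored as ints."""
    return bin(a ^ b).count("1")

def cost(p, q, w=(0.3, 0.3, 0.2, 0.2)):
    """Dissimilarity between pixels p and q, each given as a tuple
    (intensity, census, grad_magnitude, grad_orientation).
    The four terms mirror the components of the cost function:
    intensity difference, census Hamming distance, gradient magnitude
    difference and gradient orientation difference."""
    w1, w2, w3, w4 = w
    return (w1 * abs(p[0] - q[0])
            + w2 * hamming(p[1], q[1])
            + w3 * abs(p[2] - q[2])
            + w4 * abs(p[3] - q[3]))
```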

Dynamic programming
To solve the problem of finding the best match between pixels on the left and right images of the same scan-line, we present it as a path-finding problem on a 2D plane, in which the pixels of the left and right scan-lines define horizontal and vertical lines. The intersections of these lines are called nodes, and each node indicates a candidate match between two pixels p_i(x, y) and q_j(x, y + d_k). The cost function of Section 3.2.3 is used to fill the search plane. For each pixel p_i, two paths are computed, one extending from the left to p_i and the other from the right to p_i. The disparity of the path with the minimum cumulative cost is selected as the disparity value of p_i. Each path is forced to pass through the local ranges deduced from the ADM described in Section 3.2.2, and the nodes outside the local ranges are discarded and marked as invalid. This corrects failures of the dynamic programming algorithm at any stage. The algorithm is applied to each scan-line of the frame f_k. The dynamic programming algorithm is known to cause a streaking effect near object boundaries. To overcome this problem, we use a disparity map filter based on the weighted least squares (WLS) filter [42,43], which removes the errors related to the streaking effect thanks to its good performance and edge-preserving smoothing.
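A single-pass scan-line optimisation with local-range pruning can be sketched as below. This is a simplified sketch, not the paper's two-way algorithm: one left-to-right pass, a constant jump penalty, and no WLS post-filtering.

```python
def dp_scanline(costs, valid, penalty=1.0):
    """Scan-line dynamic programming. costs[x][d] is the matching cost
    of left pixel x at disparity d; valid[x][d] says whether the node
    survives the local-range pruning; the penalty term discourages
    disparity jumps between neighbouring pixels. Returns one disparity
    per pixel."""
    INF = float("inf")
    n, m = len(costs), len(costs[0])
    acc = [[INF] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for d in range(m):
        acc[0][d] = costs[0][d] if valid[0][d] else INF
    for x in range(1, n):
        for d in range(m):
            if not valid[x][d]:
                continue  # node discarded by the local ranges
            best, arg = INF, 0
            for dp in range(m):
                if acc[x - 1][dp] < INF:
                    c = acc[x - 1][dp] + penalty * abs(d - dp)
                    if c < best:
                        best, arg = c, dp
            acc[x][d] = best + costs[x][d]
            back[x][d] = arg
    # Backtrack from the cheapest final node.
    d = min(range(m), key=lambda k: acc[n - 1][k])
    disp = [0] * n
    for x in range(n - 1, -1, -1):
        disp[x] = d
        d = back[x][d]
    return disp
```

Because pruned nodes keep an infinite accumulated cost, every surviving path is automatically forced through the local ranges deduced from the ADM.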

EXPERIMENTAL RESULTS
The proposed algorithm has been tested on synthetic and real stereo sequences for evaluation purposes and has been compared to recent temporal methods. We also compare it to the same algorithm presented in this paper without using the temporal information (i.e. disparity ranges) to highlight its importance. For the rest of this section, we will refer to the proposed algorithm as DSTM (Dense Spatio-Temporal Matching), and the method without the temporal information as SM (spatial matching).

Synthetic stereo image sequences
We evaluate the proposed algorithm on the MARS/PRESCAN synthetic stereo dataset, publicly available at [44], containing stereo sequences and their ground truths. The images are 512×512 in size and have a disparity range of 48 pixels. From Figure 6, we clearly notice that the disparity map computed using the SM method, shown in Figure 6(c), is noisier, which means that it contains more false matches. On the other hand, the disparity map computed using DSTM, illustrated in Figure 6(b), appears more homogeneous, meaning that it contains fewer false matches. For clarity, we represent the disparity maps with false colours ranging from red (small disparity values) to blue (large disparity values). Table 1 presents the results obtained by matching frames #293 to #295 using DSTM and SM. The results are expressed in terms of the number of matched pixels (NMP), the percentage of correct matches (PCM) and the number of correct matches (NCM). The improvement yielded by the temporal information is very noticeable: DSTM matches more pixels, which means that it computes a denser disparity map, and has a higher percentage of correct matches (PCM), outperforming SM by 6.25%, 3.99% and 4.96% on frames #293, #294 and #295, respectively. For further validation of the obtained results, DSTM has been compared to recent temporal matching methods. Figure 7 shows the left image of frame #294 alongside the disparity maps obtained using DSTM and the spatio-temporal methods proposed in [2,15,29,37]. We note that these methods generate sparse disparity maps, which means that they deal only with edge points, which have discriminative features, unlike DSTM, which generates a dense disparity map and deals with the entire image, including occluded and texture-less areas. Table 2 presents a comparison of the results obtained by the proposed method and the estimators in [2,15,29,37] on frame #294.
We remark that DSTM matches 160,671 pixels, of which 156,833 are correct; this is due to the fact that DSTM is a dense matching approach. Moreover, DSTM has the highest percentage of correct matches (97.61%) compared to the other methods [2,15,29,37], which reach 97.35%, 96.15%, 93.99% and 88.03%, respectively, despite the fact that DSTM deals with more difficult, large untextured regions.
(Figure 7, partial caption: (c) [29], (d) ISTM [15], (e) [37] and (f) SM.)
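The metrics reported in the tables can be computed as follows; the pixel-keyed dictionary representation and the one-pixel correctness tolerance are illustrative assumptions, as the paper does not state its error threshold.

```python
def evaluate(disp, truth, tol=1):
    """Compute NMP, NCM and PCM. disp and truth map pixel -> disparity;
    None in disp marks an unmatched pixel. A match counts as correct
    when it lies within tol of the ground truth."""
    nmp = sum(1 for d in disp.values() if d is not None)
    ncm = sum(1 for p, d in disp.items()
              if d is not None and p in truth and abs(d - truth[p]) <= tol)
    pcm = 100.0 * ncm / nmp if nmp else 0.0
    return nmp, ncm, pcm
```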

Real stereo sequences
We also conducted our experiments on real stereo sequences. We utilised the colour images of the KITTI 2015 dataset [45], a real-world dataset with street views from a driving car. It contains stereo sequences with sparse ground-truth disparities obtained using a LIDAR sensor. The image size is 376×1240. Figure 8 shows examples of left images from this dataset, and Figure 9 their corresponding disparity maps generated using DSTM and SM. We notice that the disparity map computed using SM contains many mismatches, contrary to the disparity map computed using DSTM, which is smoother, meaning that it contains fewer false matches. We adopt the same metrics described in Section 4.1 to evaluate the performance of the proposed algorithm on the real stereo sequences. Table 3 shows the results obtained using DSTM compared to SM; as depicted in the table, DSTM provides better results. Table 4 presents a comparison of the proposed approach with other methods evaluated on the KITTI dataset. The results are reported in terms of the percentage of stereo disparity outliers. We remark that the proposed method outperforms both SM and TW-SMNet [46]. We also note that the method SegStereo [38] provides the best results; however, the cost these approaches pay in terms of computation and hardware is very demanding.

Running time
The proposed algorithm has been implemented using C++ on an HP Intel Core i7-6700 CPU @3.40 GHz with 4GB RAM. Since the matching approach deals with each scan-line individually, it can be implemented using several threads. Table 5 illustrates the running time of the proposed method on 1, 4 and 8 CPUs for both the virtual images (frame #294) and the real ones (frame #9). In the virtual image, the ADM computation takes 835 ms, and the stereo matching step takes 13.76 ms for each scan line, which translates to 7.892 s, 2.749 s and 2.030 s when running on 1, 4 and 8 CPUs respectively. For the real image, due to its larger size, 2076 ms is needed to generate the ADM, and about 36.21 ms for computing the disparities of each scan-line. This means that the computation of the entire disparity map takes 15.632 s on 1 CPU, 6.389 s on 4 CPUs and 4.882 s when running on 8 CPUs. The current form of the proposed method is not applicable in real time since the shutter of the camera is triggered at 10 fps. Nonetheless, this issue can be resolved using a GPU with more processors to run both steps.
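Since each scan-line is matched independently, the parallelisation described above can be sketched with a worker pool; match_line stands for any per-scan-line matcher (e.g. the dynamic programming step) and is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def match_all_scanlines(scanlines, match_line, workers=8):
    """Dispatch the independent scan-lines to a pool of workers and
    collect their disparities in order. The worker count mirrors the
    1/4/8-CPU configurations reported in the running-time table."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(match_line, scanlines))
```

In CPython, threads mainly help when the per-line matcher releases the GIL (e.g. native code); a process pool or GPU offload would be the analogue for pure-Python workloads.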

CONCLUSION
The aim of this paper was to present a dense spatio-temporal stereo matching method. The proposed approach takes into account the temporal consistency between adjacent frames, which allows using the matching results obtained at the previous frame in the computation of the disparity map of the current frame. The proposed method is achieved in three major steps. First, an edge-based disparity map called the assisting disparity map (ADM) is computed. Second, for each scan-line, a global range and a set of local ranges are deduced from the ADM. Third, the obtained ranges are used to guide a two-way dynamic programming algorithm to search for the best matches and therefore improve the results. The proposed method has been tested on synthetic and real stereo image sequences, and the obtained results show its effectiveness.