Augmented Lagrangian-based approach for dense three-dimensional structure and motion estimation from binocular image sequences
Abstract
In this study, the authors propose a framework for stereo–motion integration for dense depth estimation. They formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem. A sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, has been implemented to address the resulting constrained optimisation problem. ALM has been chosen because of its relative insensitivity to whether the initial design points for a pseudo-objective function are feasible or not. The development of the method and results from solving the stereo–motion integration problem are presented. Although the authors' work is not the only one adopting the ALM framework in the computer vision context, to their knowledge the presented algorithm is the first to use this mathematical framework in the context of stereo–motion integration. This study describes how the stereo–motion integration problem was cast in a mathematical context and solved using the presented ALM method. Results on benchmark and real visual input data show the validity of the approach.
1 Introduction
1.1 Problem statement
The integration of the stereo and motion depth cues offers the potential of a superior depth reconstruction, as the combination of temporal and spatial information makes it possible to reduce the uncertainty in the depth reconstruction result and to augment its precision. However, this requires the development of a data fusion methodology, which is able to combine the advantages of each method, without propagating errors induced by one of the depth reconstruction cues. Therefore the mathematical formulation of the problem of combining stereo and motion information must be carefully considered.
The dense depth reconstruction problem can be cast as a variational problem, as advocated by a number of researchers [[1], [2]]. The main problem in dense stereo–motion reconstruction is that the solution depends on the simultaneous evaluation of multiple ‘constraints’, which have to be balanced carefully. This is sketched in Fig. 1, which shows the different constraints to be imposed for a sequence acquired with a moving binocular camera. Consider a pair of rectified stereo images at time t = t0 and a second stereo pair at time t = t0 + tk, with tk being determined by the frame rate of the camera. A point in the reference frame can be related to a point in the other image of the same stereo pair via the stereo constraint, as well as to a point in the subsequent image of the same camera via the motion constraint. Using the stereo and motion constraints in combination, the point can even be related to a point in the subsequent image of the other camera, via a stereo + motion or a motion + stereo constraint. It is evident that, ideally, all these interrelations should be taken into consideration for all the pixels in all the frames in the sequence. In the following, we present such a methodology for addressing the stereo–motion integration problem for dense reconstruction.

Motion and stereo constraints on a binocular sequence
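The chaining of the stereo and motion constraints described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the sign convention (the match of a left-image pixel lies at `x - disparity` in the rectified right image) and the numeric values are hypothetical and are not taken from the paper.

```python
from typing import Tuple

Point = Tuple[float, float]


def stereo_match(p: Point, disparity: float) -> Point:
    """Rectified stereo: the match lies on the same row, shifted by the disparity.

    Assumed convention: left pixel (x, y) maps to right pixel (x - disparity, y).
    """
    x, y = p
    return (x - disparity, y)


def motion_match(p: Point, flow: Point) -> Point:
    """Motion constraint: the match in the next frame is displaced by the optical flow."""
    x, y = p
    u, v = flow
    return (x + u, y + v)


# Hypothetical values for a single pixel of the left image at t0.
p_left_t0 = (120.0, 80.0)
disparity_t0 = 14.0        # disparity at t0
disparity_t1 = 14.5        # disparity at t0 + tk
flow_left = (3.0, -1.0)    # left optical flow at the pixel
flow_right = (2.5, -1.0)   # right optical flow at the matched pixel

# Path 1: stereo constraint first, then motion constraint in the right camera.
via_stereo_motion = motion_match(stereo_match(p_left_t0, disparity_t0), flow_right)

# Path 2: motion constraint in the left camera first, then stereo constraint.
via_motion_stereo = stereo_match(motion_match(p_left_t0, flow_left), disparity_t1)

# For consistent estimates both paths reach the same right-image point at t0 + tk.
print(via_stereo_motion, via_motion_stereo)
```

When the disparity and flow estimates are mutually consistent, both paths end at the same pixel; a mismatch between the two end points is one way to quantify the inconsistency that an integrated method has to resolve.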
1.2 State-of-the-art
The early work on stereo–motion integration goes back to the approach of Richards [[3]], relating the stereo–motion integration problem to the human vision system. Based on this analysis, Waxman and Duncan [[4]] proposed a stereo–motion fusion algorithm. They define a ‘binocular difference flow’ as the difference between the left and right optical flow fields, where the right flow field is shifted by the current disparity field. In 1993, Li and Duncan [[5]] presented a method for recovering structure from stereo and motion. They assume that the cameras undergo translation, but no rotational motion. Tests on laboratory scenes gave good results; however, the constraint of having only translational motion is hard to fulfil in a real-world application.
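As an illustration of the ‘binocular difference flow’ definition quoted above, the following sketch evaluates it on a single scanline. It is a simplification under stated assumptions, not the algorithm of [[4]]: only the horizontal flow component is used, disparities are integers, and the right-image match of column `x` is assumed to be `x - disparity[x]`. The function name and data are hypothetical.

```python
from typing import List, Optional


def binocular_difference_flow(
    flow_left: List[float], flow_right: List[float], disparity: List[int]
) -> List[Optional[float]]:
    """Left flow minus the disparity-shifted right flow, for one scanline.

    Entries are None where the shifted column falls outside the image.
    """
    n = len(flow_left)
    diff: List[Optional[float]] = []
    for x in range(n):
        xr = x - disparity[x]  # matching column in the right image (assumed convention)
        if 0 <= xr < n:
            diff.append(flow_left[x] - flow_right[xr])
        else:
            diff.append(None)
    return diff


# Hypothetical horizontal flow values and a constant disparity of one pixel.
flow_left = [1.0, 1.0, 2.0, 2.0]
flow_right = [1.0, 1.5, 2.0, 2.0]
disparity = [1, 1, 1, 1]

difference = binocular_difference_flow(flow_left, flow_right, disparity)
print(difference)  # [None, 0.0, 0.5, 0.0]
```

A non-zero entry indicates that the corresponding scene point moves differently in the two views, which is the signal such a difference flow carries about motion in depth.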
The above-mentioned early work on stereo–motion integration generally considers only sparse features and uses three-dimensional (3D) tracking techniques [[6]] or direct methods [[7]] for reconstruction. Tracking techniques track 3D tokens from frame to frame and estimate their kinematics. The motion computation problem is formulated as a tracking problem and solved using an extended Kalman filter. Direct methods use a rigid-body motion model to estimate relative camera orientation and local ranges for both the stereo and motion components of the data. The obvious disadvantage of sparse reconstruction methodologies is that no densely reconstructed model can be obtained. To overcome this problem, other researchers have proposed model-based approaches [[8]]. The visible scene surface is represented with a parametrically deformable, spatially adaptive, wireframe model. The model parameters are iteratively estimated using the image intensity matching criterion. The disadvantage of this kind of approach is that it only works well for reconstructing objects that can be easily modelled (small objects, statues, …), and not for unstructured environments like outdoor natural scenes.
Recent approaches to stereo–motion-based reconstruction have concentrated more on dense reconstruction. The general idea of these approaches is to combine the left and right optical flows with the disparity field, for example, using space carving [[9]] or voxel carving [[10]]. Some researchers [[11]] emphasise the stereo constraint and only reinforce the stereo disparity estimates using optical flow information, whereas Isard and MacCormick [[12]] use more advanced belief propagation techniques to find the right balance between the stereo and optical flow constraints.
Sudhir et al. [[13]] model the visual processes as a sequence of coupled Markov random fields (MRFs). The MRF formulation makes it possible to define appropriate interactions between the stereo and motion processes and outlines a solution in terms of an appropriate energy function. The MRF property makes it possible to model the interactions between stereo and motion in terms of local probabilities, specified in terms of local energy functions. These local energy functions express constraints that help the stereo disambiguation by significantly reducing the search space. The integration algorithm as proposed by Sudhir et al. [[13]] makes the visual processes tightly constrained and reduces the possibility of an error. Moreover, it is able to detect stereo-occlusions and sharp object boundaries in both the disparity and the motion field. However, as this is a local method, it has difficulties when there are many regions with homogeneous intensities. In these regions, any local method of computation of stereo and motion is unreliable. Other researchers (e.g. Larsen et al. in [[14]]) later improved the MRF-based stereo–motion reconstruction methodology by making it able to operate on a 3D graph that includes both spatial and temporal neighbours and by introducing noise suppression methods.
As an alternative to the MRF-based approach, Strecha and Van Gool [[1], [15]] presented a partial differential equation (PDE)-based approach for 3D reconstruction from multi-view stereo. Their method builds upon the PDE-based approach for dense optical flow estimation by Proesmans et al. [[16]] and reasons on the occlusions between stereo and motion to estimate the quality or confidence of correspondences. The evolution of the confidence measures is driven by the difference between the forward and backward flows in the stereo and motion directions. Based on the above-estimated per-pixel and per-depth cue quality or confidence measures, their weighting scheme guides at every iteration and at every pixel the relative influences of both depth cues during the evolution towards the solution.
Other researchers [[17]-[20]] use scene-flow-based methods for stereo–motion integration. Like the optical flow, 3D scene flow is defined at every point in a reference image. The difference is that the velocity vector in a scene-flow field contains not only x and y, but also z velocities.
Zhang and Kambhamettu [[17]] formulated the problem as computing a 4D vector (u, v, w, d), where (u, v) are the components of the optical flow vector, d is the disparity and w is the disparity motion, at every point of the reference image, where the initial disparity is used as an initial guess. However, with serious occlusion and a limited number of cameras, this formulation is very difficult, because it implies solving for four unknowns at every point. At least four independent constraints are needed to make the algorithm stable. Therefore in [[17]], constraints on motion, disparity, smoothness and optical flow, as well as a confidence measurement on the disparity estimation, have been formulated. The major disadvantage of this approach is its limitation to slowly moving Lambertian scenes under constant illumination.
The method advocated by Pons et al. in [[18]] handles projective distortion without any approximation of shape and motion and can be made robust to appearance changes. The metric used in their framework is the ability to predict the other input views from one input view and the estimated shape or motion. Their method consists of maximising, with respect to shape and motion, the similarity between each input view and the predicted images coming from the other views. They warp the input images to compute the predicted images, which simultaneously removes projective distortion.
Huguet and Devernay [[19]] proposed a method to recover the scene flow by coupling the optical flow estimation in both cameras with dense stereo matching between the images, thus reducing the number of unknowns per image point. The main advantage of this method is that it handles occlusions both for optical flow and stereo. In [[20]], Sizintsev and Wildes extend the scene-flow reconstruction approach, by introducing a spatiotemporal quadric element, which encapsulates both spatial and temporal image structure for 3D estimation. These so-called ‘stequels’ are used for spatiotemporal view matching. Whereas Huguet and Devernay [[19]] apply a joint smoothness term to all displacement fields, Valgaerts et al. [[21]] propose a regularisation strategy that penalises discontinuities in the different displacement fields separately.
1.3 Related work
As can be noted from the overview of the previous section, most of the recent research on stereo–motion reconstruction uses scene-flow-based reconstruction methods. The main disadvantage of 3D scene flow is that it is computationally quite expensive, because of the 4D nature of the problem. Therefore we formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem and use a sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, for solving it. This approach was originally presented by De Cubber in [[22]]. The use of ALM has also been proposed recently by Del Bue et al. [[23]]; however, they apply the technique only to stereo reconstruction and structure from motion individually, whereas we propose the use of ALM for integrated stereo–motion reconstruction.

The proposed methodology poses the dense stereo–motion reconstruction problem as a constrained optimisation problem and uses the augmented Lagrangian (AL) to transform the estimation into an unconstrained optimisation problem, which can be solved with a classical method. Other researchers, by contrast, express the stereo–motion reconstruction problem as an MRF [[13], [14]] or a graph cut [[2]] optimisation problem. The approach we follow is very natural, as the stereo–motion reconstruction problem is by nature a highly constrained and tightly coupled optimisation problem, and the AL has been shown before [[23], [24]] to be an excellent method for this kind of problem.
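To make the idea of turning a constrained problem into a sequence of unconstrained ones concrete, the sketch below applies a textbook augmented Lagrangian iteration to a toy two-variable problem. It is not the stereo–motion functional of this paper: the objective, the constraint, the penalty weight `rho` and the plain gradient-descent inner solver are all assumptions chosen for brevity. The starting point (0, 0) is deliberately infeasible.

```python
def objective_grad(x: float, y: float):
    """Gradient of the toy objective f(x, y) = (x - 2)^2 + (y - 1)^2."""
    return 2.0 * (x - 2.0), 2.0 * (y - 1.0)


def constraint(x: float, y: float) -> float:
    """Toy equality constraint h(x, y) = x + y - 1 = 0."""
    return x + y - 1.0


def augmented_lagrangian_solve(x, y, rho=10.0, outer_iters=20, inner_iters=200):
    """Minimise f subject to h = 0 with L = f + lam * h + (rho / 2) * h^2."""
    lam = 0.0
    # Safe gradient step: 1 / (largest Hessian eigenvalue of L), here 2 + 2 * rho.
    step = 1.0 / (2.0 + 2.0 * rho)
    for _ in range(outer_iters):
        # Inner loop: unconstrained minimisation of L for the current multiplier.
        for _ in range(inner_iters):
            h = constraint(x, y)
            gx, gy = objective_grad(x, y)
            # dh/dx = dh/dy = 1, so both components receive the same constraint term.
            gx += lam + rho * h
            gy += lam + rho * h
            x -= step * gx
            y -= step * gy
        # Outer step: first-order multiplier update.
        lam += rho * constraint(x, y)
    return x, y, lam


x_opt, y_opt, lam_opt = augmented_lagrangian_solve(0.0, 0.0)
print(x_opt, y_opt, lam_opt)
```

The iterates approach the constrained minimiser (1, 0) although the initial guess violates the constraint, which is the property that motivates the choice of this family of methods. Any unconstrained solver can replace the gradient-descent inner loop.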

Processing strategy of a binocular sequence: from left and right image sequences, proximity maps are calculated through stereo and dense structure from motion. These maps are iteratively improved by constrained optimisation, using the AL method
2 Methodology
2.1 Depth reconstruction model
The stereo–motion integration problem for dense depth estimation can be regarded as a high-dimensional data fusion problem. In this paper, we formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem, with a suitable functional that minimises the error on the dense reconstruction. Fig. 2 illustrates the proposed methodology, where a pair of stereo images at time t is related to a consecutive pair at time t + 1.
Fig. 2 considers a binocular image stream consisting of left and right images of a stereo camera system. The left and right streams are processed individually, using the dense structure-from-motion algorithm proposed by De Cubber and Sahli in [[25]], resulting in left and right proximity maps dl and dr, respectively. In parallel, the left and right images are combined using a stereo algorithm [[26], [27]], embedded in the ‘Bumblebee’ stereo camera used. As a result of this stereo computation, a new proximity map from stereo dc can be defined. The reason for calling this proximity map dc lies in the fact that it is defined in the reference frame of a virtual central camera of the stereo vision system.
There exist strong interrelations between the different proximity maps dl, dc and dr, which need to be expressed to ensure consistency and to improve the reconstruction result. Therefore we adopt an approach where the left proximity map dl is optimised, subject to two constraints, relating it to dc and dr, respectively. In parallel, the right proximity map dr is optimised, also subject to two constraints, relating it to dc and dl. The compatibility of the left and right proximities is hereby automatically ensured, as both dl and dr are related to dc.



- A first constraint relates dl to the proximity map obtained from stereo dc.
- A second constraint relates dl to the proximity map of the right image dr.
- A third constraint relates dr to the proximity map obtained from stereo dc.
- A fourth constraint relates dr to the proximity map of the left image dl.
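For orientation, the generic textbook form of an augmented Lagrangian with two equality constraints on the left proximity map is sketched here. This is a schematic under assumed notation and not a reconstruction of the paper's own equations: E, the constraint functions, the multipliers and the penalty weight are hypothetical symbols, and the constraints are written as scalars although in practice they are evaluated per pixel.

```latex
% Hypothetical notation (not taken from the paper):
%   E(d^l)              energy (data plus regularisation) of the left proximity map
%   \theta_1, \theta_2  equality constraints linking d^l to d^c and to d^r
%   \lambda_1, \lambda_2 Lagrange multipliers, \rho > 0 penalty weight
\mathcal{L}_\rho(d^l, \lambda_1, \lambda_2)
  = E(d^l)
  + \sum_{i=1}^{2} \lambda_i \, \theta_i(d^l)
  + \frac{\rho}{2} \sum_{i=1}^{2} \theta_i(d^l)^2,
\qquad
\lambda_i \leftarrow \lambda_i + \rho \, \theta_i(d^l).
```

The first expression is minimised without constraints for fixed multipliers, after which the multipliers are updated with the second expression; an analogous pair would apply to the right proximity map dr.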

2.2 Numerical implementation

Constrained optimisation for binocular depth reconstruction using AL
An aspect which is not depicted by Algorithm 1 is the choice of the optimal frame rate. The underlying structure-from-motion algorithm uses the geometric robust information criterion scoring scheme introduced by Torr in [[35]] to assess the optimal frame rate. As a consequence, if the camera does not move (no translation and no rotation) between two consecutive time instants, no reconstruction will be performed.
3 Results and analysis
3.1 Qualitative analysis using a real-world binocular video sequence
3.1.1 Evaluation methodology
The validation and evaluation of a dense stereo–motion reconstruction algorithm requires the use of an image sequence with a moving stereo camera. Hence, we recorded, using a Bumblebee stereo head, an image sequence of an office environment as illustrated in Fig. 4, denoted hereafter as the ‘Desk’ sequence. The translation of the camera is mainly along its optical axis (Z-axis) and along the positive X-axis. The rotation of the camera is almost exclusively about the positive Y-axis.

Some frames of the binocular desk sequence
a Frame 1, left image
b Frame 1, right image
c Frame 10, left image
d Frame 10, right image
The sequence presents several difficulties for dense reconstruction:
- Cluttered environment with many objects at different scales of depth.
- Relatively large untextured areas (e.g. the wall in the upper left), making correspondence matching very difficult.
- Areas with specular reflection (e.g. on the poster in the upper right of the image), violating the Lambertian assumption traditionally made for stereo matching.
- Variable lighting and heavy reflections (in the window on the upper right), causing saturation effects and incoherent pixel colours across different frames.
The initialisation step of the iterative optimiser estimates an initial value for the left and right depth fields. This method consists of warping a stereo proximity image to the left and right camera reference frames. The initial values for the left and right proximity maps still contain many ‘blind spots’, or areas where no (reliable) proximity data are available. These areas are caused by unsuccessful correspondences in the stereo vision algorithm used, which performs an area-based correlation with sum of absolute differences on bandpassed images [[26], [27]]. This algorithm is fairly robust and it has a number of validation steps that reduce the level of noise. However, the method requires texture and contrast to work correctly. Effects like occlusions, repetitive features and specular reflections can cause problems leading to gaps in the proximity maps. In the following discussion, we will evaluate how well the proposed dense stereo–motion algorithm is able to cope with these blind spots and see whether it is capable of filling in the areas where depth data are missing.
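The principle of area-based matching with a sum of absolute differences (SAD) can be sketched on a single scanline as follows. This is a bare illustration, not the embedded camera algorithm of [[26], [27]]: there is no bandpass filtering and no validation step, and the function names, window size and pixel values are hypothetical. The right-image match of column `x` is assumed to lie at `x - d`.

```python
from typing import List, Optional


def sad(a: List[int], b: List[int]) -> int:
    """Sum of absolute differences between two equally sized windows."""
    return sum(abs(p - q) for p, q in zip(a, b))


def best_disparity(
    left_row: List[int], right_row: List[int], x: int,
    half_window: int, max_disparity: int
) -> Optional[int]:
    """Disparity with the lowest SAD cost for column x, or None if no window fits."""
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        return None
    reference = left_row[lo:hi]
    best, best_cost = None, None
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # candidate window would leave the image
        cost = sad(reference, right_row[lo - d:hi - d])
        if best_cost is None or cost < best_cost:
            best, best_cost = d, cost
    return best


# Hypothetical scanlines: the left row is the right row shifted by two pixels.
right_row = [10, 10, 50, 90, 40, 10, 10, 10, 10, 10]
left_row = [10, 10, 10, 10, 50, 90, 40, 10, 10, 10]

found = best_disparity(left_row, right_row, x=5, half_window=1, max_disparity=3)
print(found)  # 2
```

On a textured window the minimum is unique. On a uniform stretch of the row every candidate has the same cost, so the result is arbitrary, which is why untextured areas lead to the blind spots discussed above.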
To compare our method to the state-of-the-art, we implemented a more classical dense stereo–motion reconstruction approach. This approach defines classical stereo and motion constraints, based upon the constant image brightness assumption, alongside the Nagel–Enkelmann regularisation constraint. These constraints are integrated into one objective function, which is solved using a traditional trust-region method. As such, this approach presents a relatively simple and straightforward solution. This methodology is used to serve as a base benchmarking method for the AL-based stereo–motion reconstruction technique.
Applying this more classical technique to the ‘Desk’ sequence shown in Fig. 4 results in a depth reconstruction as shown in Fig. 5. Overall, the reconstruction of the proximity field correlates with the physical reality, as imaged in Fig. 4, but there are some serious errors in the reconstructed proximity fields, notably on the board in the middle of the image. This leads us to conclude that this method is not suitable for high-quality 3D modelling. In the following, we compare these results with the ones obtained by the proposed AL-based stereo–motion optimisation methodology, using the same input sequence.

Proximity maps for different frames of the desk sequence using the global optimisation algorithm
a Frame 1, left proximity d^l_1
b Frame 1, right proximity d^r_1
c Frame 10, left proximity d^l_10
d Frame 10, right proximity d^r_10
3.1.2 Reconstruction results
Fig. 6 shows the reconstructed left and right proximity maps using the algorithm shown in Fig. 3. The reconstructed proximity field correlates very well with the physical nature of the scene. Foreground and background objects are clearly distinguishable. The depth gradients on the left and back walls can be clearly identified, despite the fact that there is very little texture on these walls. The occurrence of specular reflection on the poster does not cause erroneous reconstruction results. The only remaining errors in the proximity field are due to border effects. Indeed, at the lower left of Fig. 6a and the lower right of Fig. 6b, one can note some areas where the regularisation has smoothed out the proximity field. The reason for this lies in the lack of initial proximity data in these areas. Owing to the total absence of proximity information in these areas, the algorithm used the solution from the neighbouring regions. In general, this was performed correctly, but because of the lack of information, the algorithm estimated the direction of regularisation wrongly at these two locations. This is quite a normal side-effect when using area-based optimisation techniques, which can be solved by extending the image canvas before the calculations. The result of Fig. 6 can be compared with Fig. 5, which shows the same output, but using the global optimisation approach. From this comparison, it is evident that the result of the AL-based reconstruction technique is far superior to the one using global optimisation. The global optimisation result features numerous problems: erroneous proximity values, under-regularised areas, over-regularised areas and erroneous estimation of discontinuities. None of those problems is present in the AL-based result, as shown in Fig. 6.

Proximity maps for different frames of the desk sequence using the AL algorithm
a Frame 1, left proximity d^l_1
b Frame 1, right proximity d^r_1
c Frame 10, left proximity d^l_10
d Frame 10, right proximity d^r_10
To show the applicability of the presented technique for 3D modelling, the individual reconstruction results were integrated to form one consistent 3D representation of the imaged environment. Fig. 7 shows four novel views of the 3D model. From the different novel viewpoints, the 3D structure of the office environment can be clearly deduced. There are no visible outliers and all items in the scene have been reconstructed, even those with very low texture. This illustrates the capabilities of the proposed AL-based stereo–motion reconstruction technique, which allows the reconstruction of a high-quality 3D model.

Reconstructed 3D model of the desk sequence
a Novel view 1
b Novel view 2
c Novel view 3
d Novel view 4
3.2 Quantitative analysis using standard benchmark sequences
For quantitative analysis, we compared the performance of the proposed approach with a traditional variational scene-flow-based method using standard benchmark sequences. The selected benchmarking sequences are the well-known ‘Cones’ and ‘Teddy’ sequences created by Scharstein and Szeliski [[36], [37]], shown in the top row of Fig. 8.

Quantitative analysis: input images and ground truth depth maps
Top row: left input image and bottom row: ground truth left depth image. Left column: Cones sequence and right column: Teddy sequence
a Cones sequence, left image at t0
b Teddy sequence, left image at t0
c Cones sequence, ground truth depth image at t0
d Teddy sequence, ground truth depth image at t0
As a baseline algorithm, the variational scene-flow reconstruction approach presented by Huguet and Devernay [[19]] was chosen, as the authors provided the algorithm online, which makes it possible to perform comparison tests. To be able to supply a correct comparison of the stereo–motion reconstruction capabilities of both algorithms, the same base stereo algorithm [[38]] was used to initialise both methods.
The results of any reconstruction algorithm depend largely on the correct initialisation of the algorithm and the selection of the parameters. In the initialisation phase of the proposed method, the estimation of the motion vectors τl, ωl and τr, ωr via sparse structure from motion plays an important role. To assess the validity of the motion vector estimation results, it is possible to compare the measured motion with the perceived motion between the subsequent images. For example, for the ‘Cones’ sequence, the main motion is a horizontal movement, which is correctly expressed by the estimated translation vectors: τl = [0.0800, 0.3151, 0.0988] and τr = [0.1101, 0.3131, 0.1255]. Ideally, both vectors should be identical (as both cameras follow an identical motion pattern), so the difference between them gives an indication of the error in the motion estimation process.
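The discrepancy between the two translation estimates quoted above can be quantified with a few lines of arithmetic. The choice of measures (Euclidean norm of the difference, its ratio to the left vector's norm, and the angle between the vectors) is an assumption made here for illustration; the paper does not prescribe a specific error measure.

```python
import math

# Translation estimates reported for the 'Cones' sequence.
t_left = [0.0800, 0.3151, 0.0988]
t_right = [0.1101, 0.3131, 0.1255]


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


# Absolute and relative difference between the two estimates.
difference = [a - b for a, b in zip(t_left, t_right)]
abs_discrepancy = norm(difference)
rel_discrepancy = abs_discrepancy / norm(t_left)

# Angle between the two estimated translation directions, in degrees.
cos_angle = sum(a * b for a, b in zip(t_left, t_right)) / (norm(t_left) * norm(t_right))
angle_deg = math.degrees(math.acos(min(1.0, cos_angle)))

print(abs_discrepancy, rel_discrepancy, angle_deg)
```

With these numbers the norm of the difference is roughly 0.04, i.e. a little over a tenth of the magnitude of either vector.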
Parameter-tuning is a process which affects many modern reconstruction algorithms, as the parameter selection makes comparison and application of the algorithms in real situations difficult. For the proposed approach, one parameter is of major importance: the parameter μ deciding on the balance between the data and the regularisation term. In our experiments, a value of μ = 0.5 was chosen, based on previous analysis [[25]]. A remaining parameter of lesser importance is the threshold on ε for stopping the iterative solver. This parameter is somewhat sequence-dependent, with typical values somewhere between 10 and 20. With regard to the benchmark algorithm by Huguet and Devernay, all parameters were chosen as provided by the authors in their original implementation.
Fig. 9 shows a qualitative comparison of the reconstruction results of both methods. The quantitative evaluation is done by computing the root-mean-square (RMS) error on the depth map measured in pixels, as presented in Table 1. The proposed AL-based binocular structure-from-motion (ALBDSFM) approach has the convenient property of decreasing the residual on the objective function dramatically in the first iteration, whereas convergence slows down in subsequent iterations. For this reason, we also included the results after one iteration in the tables.
RMS error | Cones | Teddy |
---|---|---|
ALBDSFM (1 iteration) | 2.411 | 5.002 |
ALBDSFM (convergence) | 2.381 | 4.961 |
variational scene flow [[19]] | 9.636 | 8.650 |
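The RMS error used for this kind of comparison can be computed as in the following sketch. The masking of pixels without ground truth (marked `None`) is an assumption, as the paper does not state how invalid pixels are handled, and the tiny depth maps are hypothetical data.

```python
import math
from typing import List, Optional


def rms_error(
    estimate: List[List[float]], ground_truth: List[List[Optional[float]]]
) -> float:
    """Root-mean-square error over all pixels that have a ground-truth value."""
    squared = []
    for row_est, row_gt in zip(estimate, ground_truth):
        for est, gt in zip(row_est, row_gt):
            if gt is not None:  # skip pixels without ground truth (assumed convention)
                squared.append((est - gt) ** 2)
    if not squared:
        raise ValueError("no pixel with valid ground truth")
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical 2 x 2 depth maps, one pixel without ground truth.
estimate = [[10.0, 12.0], [20.0, 22.0]]
ground_truth = [[10.0, 14.0], [None, 20.0]]

error = rms_error(estimate, ground_truth)
print(error)  # sqrt(8 / 3), about 1.633
```

Because the errors are squared before averaging, a few large outliers dominate the score, so the measure penalises gross mismatches more than small deviations spread over many pixels.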

Comparison of the reconstruction result using the traditional variational scene-flow method [[19]] and the proposed method
Top row: reconstructed left depth image using [[19]] and bottom row: reconstructed left depth image using proposed method. Left column: Cones sequence and right column: Teddy sequence
a Cones sequence, depth image at t0 using [[19]]
b Teddy sequence, depth image at t0 using [[19]]
c Cones sequence, depth image at t0 using proposed method
d Teddy sequence, depth image at t0 using proposed method
As can be noted from Fig. 9 and Table 1, the proposed ALBDSFM algorithm performs better on both the ‘Cones’ and ‘Teddy’ sequences. On the ‘Cones’ sequence, the ALBDSFM approach is better able to represent the structure of the lattice in the back, whereas this structure is completely smoothed by the ‘SceneFlow’ algorithm. On the ‘Teddy’ sequence, both reconstruction results are visually quite similar. It is clear that both reconstruction techniques suffer from over-segmentation. This is a typical problem of the Nagel–Enkelmann regularisation we used and can partly be remedied by fine-tuning the regularisation parameters; however, to keep the comparison fair, we did not perform such sequence-specific parameter-tuning. The quantitative analysis on the ‘Teddy’ sequence in Table 1 shows an advantage for the ALBDSFM approach.
Table 2 gives an overview of the total processing times required for both algorithms. It must be noted that all experiments were performed on a 1.6 GHz Intel Core i5 central processing unit (CPU). The ‘SceneFlow’ algorithm is a C++ application (available at http://devernay.free.fr/vision/varsceneflow), whereas the ALBDSFM is implemented in MATLAB. While neither algorithm can be called fast, the ALBDSFM approach is faster than the ‘SceneFlow’ implementation on both sequences (74 against 257 min on ‘Cones’ and 205 against 243 min on ‘Teddy’ at convergence). The processing time is mostly dependent on the computational cost for a single iteration (within the ‘while’-loop of Algorithm 1) and the number of iterations, as the initialisation and stereo computation steps only take a few seconds. As the iteration step consists of a double optimisation step using Brent's method, its computational complexity is of the order of O(2n^2), with n the number of image pixels. When high-resolution images are used, the computational cost quickly rises, which explains the relatively large processing times. However, neither implementation makes use of multi-threading or graphics processing unit (GPU) optimisations, so large speed gains could be obtained by applying such optimisations.
Processing time (min) | Cones | Teddy |
---|---|---|
ALBDSFM (1 iteration) | 24 | 24 |
ALBDSFM (convergence) | 74 | 205 |
variational scene flow [[19]] | 257 | 243 |
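The relative speed of the two methods follows directly from the processing times in the table above. The short sketch below computes the ratios; the dictionary layout and the `speedup` helper are hypothetical, and only the times themselves come from the table.

```python
# Processing times in minutes, copied from the table above.
times_min = {
    "ALBDSFM (1 iteration)": {"Cones": 24, "Teddy": 24},
    "ALBDSFM (convergence)": {"Cones": 74, "Teddy": 205},
    "variational scene flow": {"Cones": 257, "Teddy": 243},
}


def speedup(baseline: str, method: str, sequence: str) -> float:
    """How many times faster `method` is than `baseline` on one sequence."""
    return times_min[baseline][sequence] / times_min[method][sequence]


for sequence in ("Cones", "Teddy"):
    for method in ("ALBDSFM (1 iteration)", "ALBDSFM (convergence)"):
        ratio = speedup("variational scene flow", method, sequence)
        print(f"{sequence}, {method}: {ratio:.2f}x")
```

At convergence the ratio is about 3.5 on ‘Cones’ but only about 1.2 on ‘Teddy’, while a single iteration is roughly ten times faster than the baseline on both sequences. These ratios compare a MATLAB implementation with a C++ one, so they describe the implementations as tested rather than the algorithms in general.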
4 Conclusions
The combination of spatial and temporal visual information makes it possible to achieve high-quality dense depth reconstruction, but comes at the cost of a high computational complexity. To this end, we presented a novel solution to integrate the stereo and motion depth cues, by simultaneously optimising the left and right proximity fields, using the AL. The main advantage of our algorithm is the ability to exploit all the available constraints in one minimisation framework. Another advantage is that the framework is able to incorporate any given stereo reconstruction methodology. The algorithm has been implemented and applied on real imagery as well as benchmarks. A comparison of the proposed method to the variational scene-flow method shows that the quality of the obtained results far exceeds the quality of the results using the traditional method. The added quality with respect to normal stereo comes with a penalty of increased processing time, which is still considerable. Even though future optimisation of the implementation, by considering, for example, a GPU implementation, will certainly further reduce the processing time, we consider that the proposed approach can already be effectively used in an off-line production environment, where the proposed 3D reconstruction methodology presents an excellent reconstruction tool allowing high-quality 3D reconstruction from binocular video.
5 Acknowledgment
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 285417.