Volume 8, Issue 2, p. 98-109

Article

Augmented Lagrangian-based approach for dense three-dimensional structure and motion estimation from binocular image sequences

Geert De Cubber (corresponding author)
Electronics and Information Processing (ETRO), Vrije Universiteit Brussel, Brussels, 1040 Belgium
Mechanical Engineering, Royal Military Academy of Belgium, Brussels, 1000 Belgium

Hichem Sahli
Electronics and Information Processing (ETRO), Vrije Universiteit Brussel, Brussels, 1040 Belgium
Interuniversity Microelectronics Centre – IMEC, Heverlee, 3001 Belgium

First published: 01 April 2014

Abstract

In this study, the authors propose a framework for stereo–motion integration for dense depth estimation. They formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem. A sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, is implemented to address the resulting constrained optimisation problem. ALM was chosen because of its relative insensitivity to whether the initial design points for a pseudo-objective function are feasible. The development of the method and results from solving the stereo–motion integration problem are presented. Although the authors' work is not the only one adopting the ALM framework in the computer vision context, to their knowledge the presented algorithm is the first to use this mathematical framework in the context of stereo–motion integration. This study describes how the stereo–motion integration problem was cast in a mathematical context and solved using the presented ALM method. Results on benchmark and real visual input data show the validity of the approach.

1 Introduction

1.1 Problem statement

The integration of the stereo and motion depth cues offers the potential of a superior depth reconstruction, as the combination of temporal and spatial information makes it possible to reduce the uncertainty in the depth reconstruction result and to augment its precision. However, this requires the development of a data fusion methodology, which is able to combine the advantages of each method, without propagating errors induced by one of the depth reconstruction cues. Therefore the mathematical formulation of the problem of combining stereo and motion information must be carefully considered.

The dense depth reconstruction problem can be cast as a variational problem, as advocated by a number of researchers [[1], [2]]. The main problem in dense stereo–motion reconstruction is that the solution depends on the simultaneous evaluation of multiple ‘constraints’, which have to be balanced carefully. This is sketched in Fig. 1, which shows the different constraints to be imposed for a sequence acquired with a moving binocular camera. Consider a pair of rectified stereo images $(I_l^{t_0}, I_r^{t_0})$ at time t = t0 and a stereo pair $(I_l^{t_0+t_k}, I_r^{t_0+t_k})$ at time t = t0 + tk, with tk determined by the frame rate of the camera. A point $p_l^{t_0}$ in the reference frame $I_l^{t_0}$ can be related to a point $p_r^{t_0}$ via the stereo constraint, as well as to a point $p_l^{t_0+t_k}$ via the motion constraint. Using the stereo and motion constraints in combination, the point $p_l^{t_0}$ can even be related to a point $p_r^{t_0+t_k}$, via a stereo + motion or a motion + stereo constraint. Ideally, all these interrelations should be taken into consideration for all the pixels in all the frames of the sequence. In the following, we present such a methodology for addressing the stereo–motion integration problem for dense reconstruction.

Fig. 1 Motion and stereo constraints on a binocular sequence

1.2 State-of-the-art

The early work on stereo–motion integration goes back to the approach of Richards [[3]], relating the stereo–motion integration problem to the human vision system. Based on this analysis, Waxman and Duncan [[4]] proposed a stereo–motion fusion algorithm. They define a ‘binocular difference flow’ as the difference between the left and right optical flow fields, where the right flow field is shifted by the current disparity field. In 1993, Li and Duncan [[5]] presented a method for recovering structure from stereo and motion. They assume that the cameras undergo translation, but no rotational motion. Tests on laboratory scenes showed good results; however, the constraint of having only translational motion is hard to fulfil in a real-world application.

The above-mentioned early work on stereo–motion integration generally considers only sparse features and uses three-dimensional (3D) tracking techniques [[6]] or direct methods [[7]] for reconstruction. Tracking techniques track 3D tokens from frame to frame and estimate their kinematics. The motion computation problem is formulated as a tracking problem and solved using an extended Kalman filter. Direct methods use a rigid-body motion model to estimate relative camera orientation and local ranges for both the stereo and motion components of the data. The obvious disadvantage of sparse reconstruction methodologies is that no densely reconstructed model can be obtained. To overcome this problem, other researchers have proposed model-based approaches [[8]]. The visible scene surface is represented with a parametrically deformable, spatially adaptive, wireframe model. The model parameters are iteratively estimated using an image intensity matching criterion. The disadvantage of these approaches is that they only work well for reconstructing objects that can be easily modelled (small objects, statues, …), and not for unstructured environments like outdoor natural scenes.

Recent approaches to stereo–motion-based reconstruction have concentrated more on dense reconstruction. The general idea of these approaches is to combine the left and right optical flows with the disparity field, for example, using space carving [[9]] or voxel carving [[10]]. Some researchers [[11]] emphasise the stereo constraint and only reinforce the stereo disparity estimates using optical flow information, whereas Isard and MacCormick [[12]] use more advanced belief propagation techniques to find the right balance between the stereo and optical flow constraints.

Sudhir et al. [[13]] model the visual processes as a sequence of coupled Markov random fields (MRFs). The MRF formulation makes it possible to define appropriate interactions between the stereo and motion processes and outlines a solution in terms of an appropriate energy function. The MRF property allows the interactions between stereo and motion to be modelled in terms of local probabilities, specified in terms of local energy functions. These local energy functions express constraints that help stereo disambiguation by significantly reducing the search space. The integration algorithm proposed by Sudhir et al. [[13]] makes the visual processes tightly constrained and reduces the possibility of error. Moreover, it is able to detect stereo-occlusions and sharp object boundaries in both the disparity and the motion field. However, as this is a local method, it has difficulties when there are many regions with homogeneous intensities; in these regions, any local computation of stereo and motion is unreliable. Other researchers (e.g. Larsen et al. [[14]]) later improved the MRF-based stereo–motion reconstruction methodology by making it operate on a 3D graph that includes both spatial and temporal neighbours and by introducing noise suppression methods.

As an alternative to the MRF-based approach, Strecha and Van Gool [[1], [15]] presented a partial differential equation (PDE)-based approach for 3D reconstruction from multi-view stereo. Their method builds upon the PDE-based approach for dense optical flow estimation by Proesmans et al. [[16]] and reasons on the occlusions between stereo and motion to estimate the quality or confidence of correspondences. The evolution of the confidence measures is driven by the difference between the forward and backward flows in the stereo and motion directions. Based on the above-estimated per-pixel and per-depth cue quality or confidence measures, their weighting scheme guides at every iteration and at every pixel the relative influences of both depth cues during the evolution towards the solution.

Other researchers [[17]-[20]] use scene-flow-based methods for stereo–motion integration. Like the optical flow, the 3D scene flow is defined at every point in a reference image. The difference is that the velocity vector in the scene-flow field contains not only x and y, but also z velocities.

Zhang and Kambhamettu [[17]] formulated the problem as computing a 4D vector (u, v, w, d) at every point of the reference image, where (u, v) are the components of the optical flow vector, d is the disparity and w is the disparity motion, and where the initial disparity is used as an initial guess. However, with serious occlusion and a limited number of cameras, this formulation is very difficult, because it implies solving for four unknowns at every point. At least four independent constraints are needed to make the algorithm stable. Therefore in [[17]], constraints on motion, disparity, smoothness and optical flow, as well as a confidence measurement on the disparity estimation, have been formulated. The major disadvantage of this approach is that it is limited to slowly moving Lambertian scenes under constant illumination.

The method advocated by Pons et al. in [[18]] handles projective distortion without any approximation of shape and motion and can be made robust to appearance changes. The metric used in their framework is the ability to predict the other input views from one input view and the estimated shape or motion. Their method consists of maximising, with respect to shape and motion, the similarity between each input view and the predicted images coming from the other views. They warp the input images to compute the predicted images, which simultaneously removes projective distortion.

Huguet and Devernay [[19]] proposed a method to recover the scene flow by coupling the optical flow estimation in both cameras with dense stereo matching between the images, thus reducing the number of unknowns per image point. The main advantage of this method is that it handles occlusions both for optical flow and stereo. In [[20]], Sizintsev and Wildes extend the scene-flow reconstruction approach, by introducing a spatiotemporal quadric element, which encapsulates both spatial and temporal image structure for 3D estimation. These so-called ‘stequels’ are used for spatiotemporal view matching. Whereas Huguet and Devernay [[19]] apply a joint smoothness term to all displacement fields, Valgaerts et al. [[21]] propose a regularisation strategy that penalises discontinuities in the different displacement fields separately.

1.3 Related work

As can be noted from the overview in the previous section, most recent research on stereo–motion reconstruction uses scene-flow-based reconstruction methods. The main disadvantage of 3D scene flow is that it is computationally quite expensive, because of the 4D nature of the problem. Therefore we formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem and use a sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, for solving it. This approach was originally presented by De Cubber in [[22]]. The use of ALM has also been proposed recently by Del Bue et al. [[23]]; however, they apply the technique only to stereo reconstruction and structure from motion separately, whereas we propose using ALM for integrated stereo–motion reconstruction.

The augmented Lagrangian (AL)-based stereo–motion reconstruction methodology presented here differentiates itself from the current state-of-the-art in stereo–motion reconstruction by a number of key factors:
  • The processing strategy, depicted in Fig. 2, considers three sources of information for the structure estimation process: left and right proximity maps from motion, and a proximity map from stereo. During optimisation, information from the (central) proximity map from stereo is transferred to the left and right proximity maps, which are the ones actually being optimised simultaneously. During the optimisation process, data is constantly interchanged between both optimisers, as they are highly dependent. The advantage of this concurrent optimisation methodology is that it provides a symmetric processing cue. This makes it easier to handle the uncertainties induced by the unknown displacements between the different cameras, in comparison with other approaches [[13]], which consider only one reference image and warp all other images to this reference image for matching and depth estimation. Other researchers have noted this too and have used even more depth or proximity maps. In [[1]], Strecha and Van Gool combine four proximity maps (left and right, at two consecutive time instants), as displayed in Fig. 1. The problem with using so many proximity maps, however, is that the problem size increases drastically, and with it, the computation time.
  • The proposed methodology poses the dense stereo–motion reconstruction problem as a constrained optimisation problem and uses the AL to transform the estimation into an unconstrained optimisation problem, which can be solved with a classical method, whereas other researchers express the stereo–motion reconstruction problem as an MRF [[13], [14]] or a graph cut [[2]] optimisation problem. The approach we follow is very natural, as the stereo–motion reconstruction problem is by nature a highly constrained and tightly coupled optimisation problem, and the AL has been proven before [[23], [24]] to be an excellent method for this kind of problem.

Fig. 2 Processing strategy of a binocular sequence: from the left and right image sequences, proximity maps are calculated through stereo and dense structure from motion; these maps are iteratively improved by constrained optimisation, using the AL method

2 Methodology

2.1 Depth reconstruction model

The stereo–motion integration problem for dense depth estimation can be regarded as a high-dimensional data fusion problem. In this paper, we formulate the stereo–motion depth reconstruction problem as a constrained minimisation problem, with a suitable functional that minimises the error on the dense reconstruction. Fig. 2 illustrates the proposed methodology, where a pair of stereo images at time t is related to the consecutive pair at time t + 1.

Fig. 2 considers a binocular image stream consisting of the left and right images of a stereo camera system. The left and right streams are processed individually, using the dense structure-from-motion algorithm proposed by De Cubber and Sahli in [[25]], resulting in left and right proximity maps dl and dr, respectively. In parallel, the left and right images are combined using a stereo algorithm [[26], [27]], embedded in the ‘Bumblebee’ stereo camera used. As a result of this stereo computation, a new proximity map from stereo, dc, can be defined. The reason for calling this proximity map dc lies in the fact that it is defined in the reference frame of a virtual central camera of the stereo vision system.

There exist strong interrelations between the different proximity maps dl, dc and dr, which need to be expressed to ensure consistency and to improve the reconstruction result. Therefore we adopt an approach where the left proximity map dl is optimised, subject to two constraints, relating it to dc and dr, respectively. In parallel, the right proximity map dr is optimised, also subject to two constraints, relating it to dc and dl. The compatibility of the left and right proximities is hereby automatically ensured, as both dl and dr are related to dc.

The dense stereo–motion reconstruction problem can thus be stated as the following constrained optimisation problem
$\min_x E(x) \quad \text{subject to} \quad \theta_i(x) = 0, \quad i = 1, \ldots, m$ (1)
with E(x) as the objective functional and θi(x) expressing a number of constraint equations.
A traditional solving technique for constrained optimisation problems such as the one posed by (1) is the Lagrangian multiplier method, which converts a constrained minimisation problem into an unconstrained minimisation problem of a Lagrange function. In theory, the Lagrangian methodology can be used to solve the stereo–motion reconstruction problem; however, to improve the convergence characteristics of the optimisation scheme, it is better [[28]] to use the AL $\mathcal{L}_A(x, \lambda, \rho)$, with λ as the Lagrangian multiplier. The AL, which was presented by Powell and Hestenes in [[29], [30]], adds a quadratic penalty term to the original Lagrangian
$\mathcal{L}_A(x, \lambda, \rho) = E(x) + \sum_i \lambda_i\,\theta_i(x) + \frac{\rho}{2} \sum_i \theta_i(x)^2$ (2)
with a penalty parameter ρ > 0.
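The ALM mechanics described above can be illustrated on a toy problem. The following sketch is not the authors' implementation: the example problem, the inner gradient-descent solver, the step size and the iteration counts are all illustrative assumptions. It minimises the AL for a fixed multiplier, then applies the standard multiplier update after each outer iteration:

```python
import numpy as np

def augmented_lagrangian(E_grad, theta, theta_grad, x0, rho=10.0, outer=20):
    """Minimise E(x) subject to theta(x) = 0 with the Powell/Hestenes ALM."""
    x, lam = np.asarray(x0, dtype=float), 0.0
    for _ in range(outer):
        # Inner loop: unconstrained minimisation of the AL by gradient descent
        for _ in range(500):
            g = E_grad(x) + (lam + rho * theta(x)) * theta_grad(x)
            x -= 0.01 * g
        lam += rho * theta(x)  # multiplier update: lambda <- lambda + rho*theta
    return x, lam

# Toy problem: min (x1-1)^2 + (x2-2)^2  subject to  x1 + x2 = 1
E_grad = lambda x: 2.0 * (x - np.array([1.0, 2.0]))
theta = lambda x: x[0] + x[1] - 1.0
theta_grad = lambda x: np.array([1.0, 1.0])

x_opt, lam_opt = augmented_lagrangian(E_grad, theta, theta_grad, [0.0, 0.0])
```

Starting from the infeasible point (0, 0), the iterates converge to the constrained optimum (0, 1) with multiplier λ* = 2, illustrating the insensitivity to infeasible starting points that motivates the choice of ALM here.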
In the context of dense stereo–motion reconstruction, we seek to simultaneously minimise two energy functions: El(dl), for the left image, and Er(dr), for the right image, subject to four constraints
  1. $\theta_{lc}(d_l)$ relates dl to the proximity map obtained from stereo, dc.
  2. $\theta_{lr}(d_l)$ relates dl to the proximity map of the right image, dr.
  3. $\theta_{rc}(d_r)$ relates dr to the proximity map obtained from stereo, dc.
  4. $\theta_{rl}(d_r)$ relates dr to the proximity map of the left image, dl.

According to the AL theorem and the definition given by (2), we can write the AL for the left image as follows
$\mathcal{L}_A^l(d_l, \lambda) = E_l(d_l) + \lambda_{lc}\,\theta_{lc}(d_l) + \lambda_{lr}\,\theta_{lr}(d_l) + \frac{\rho}{2}\left(\theta_{lc}(d_l)^2 + \theta_{lr}(d_l)^2\right)$ (3)
For the right image, we have in a similar fashion
$\mathcal{L}_A^r(d_r, \lambda) = E_r(d_r) + \lambda_{rc}\,\theta_{rc}(d_r) + \lambda_{rl}\,\theta_{rl}(d_r) + \frac{\rho}{2}\left(\theta_{rc}(d_r)^2 + \theta_{rl}(d_r)^2\right)$ (4)
The energy functions in (3) and (4) express the relationship between structure and motion between successive images.
It has to be noted that the approach for solving the reconstruction problem is, in principle, not tied to the formulation of the dense structure-from-motion problem, so any formulation can be chosen. Here, we use the dense structure-from-motion approach originally presented by De Cubber in [[22]], which formulates dense structure from motion as the minimisation of the following energy functional [[25]]
$E(d) = \int_{\Omega} \left( E_{\mathrm{data}}(d) + \mu\, E_{\mathrm{reg}}(d) \right) \mathrm{d}\mathbf{x}$ (5)
The data term is based on the image-derivatives-based optical flow constraint
$I_x\, u(d) + I_y\, v(d) + I_t = 0$ (6)
where Ix and Iy denote the spatial gradient of the image in the x- and y-directions, It denotes the temporal gradient, d is a depth (proximity) parameter and the motion coefficients u(d) and v(d) are defined as a function of the camera focal length f and its translation τ = (τx, τy, τz) and rotation ω = (ωx, ωy, ωz)
$u(d) = d\,(x\,\tau_z - f\,\tau_x) + \frac{xy}{f}\,\omega_x - \left(f + \frac{x^2}{f}\right)\omega_y + y\,\omega_z, \qquad v(d) = d\,(y\,\tau_z - f\,\tau_y) + \left(f + \frac{y^2}{f}\right)\omega_x - \frac{xy}{f}\,\omega_y - x\,\omega_z$ (7)
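Since the equation images are missing from this version, the sketch below assumes the standard instantaneous perspective motion-field model with proximity d = 1/Z; the sign conventions of the paper's actual coefficients may differ, so treat this as a hypothetical stand-in rather than the authors' formula:

```python
def motion_flow(x, y, d, f, tau, omega):
    """Optical flow (u, v) at pixel (x, y) for proximity d = 1/Z, focal
    length f, camera translation tau and rotation omega (small-angle model)."""
    tx, ty, tz = tau
    wx, wy, wz = omega
    # Translational part scales with proximity; rotational part is depth-independent
    u = d * (x * tz - f * tx) + (x * y / f) * wx - (f + x * x / f) * wy + y * wz
    v = d * (y * tz - f * ty) + (f + y * y / f) * wx - (x * y / f) * wy - x * wz
    return u, v

# Pure horizontal translation (the rectified-stereo case of Section 2.1):
# tau = (b/2, 0, 0) with an assumed baseline b = 0.12 m and f = 500 pixels
u, v = motion_flow(0.0, 0.0, 0.5, 500.0, (0.06, 0.0, 0.0), (0.0, 0.0, 0.0))
```

For this pure X-translation the flow reduces to u = -f τx d, a disparity-like horizontal shift proportional to proximity, and v = 0, which is exactly why the stereo displacement can be handled by the same parameterisation.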
As expressed by (5), a regularisation is used to filter erroneous reconstruction results and to smooth and extrapolate the structure (depth) over related pixels. A key aspect here is of course to find out which pixels are related (e.g. belonging to the same object at the same distance), such that proximity information can be propagated, and which pixels are not related. Here, we make use of the Nagel–Enkelmann anisotropic regularisation model, as defined in [[31]]
$E_{\mathrm{reg}}(d) = \nabla d^{\mathrm{T}}\, D(\nabla I)\, \nabla d$ (8)
with D as a regularised projection matrix.
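The regularised projection matrix D can be made concrete. The following sketch uses one common formulation of the Nagel–Enkelmann matrix; the parameter name kappa and the exact normalisation are assumptions, not taken from the paper:

```python
import numpy as np

def nagel_enkelmann_D(Ix, Iy, kappa=1.0):
    """Regularised projection matrix of the Nagel-Enkelmann model:
    D = (grad_perp grad_perp^T + kappa^2 I) / (|grad I|^2 + 2 kappa^2),
    where grad_perp = (Iy, -Ix) points along the image edge."""
    denom = Ix * Ix + Iy * Iy + 2.0 * kappa ** 2
    return np.array([[Iy * Iy + kappa ** 2, -Ix * Iy],
                     [-Ix * Iy, Ix * Ix + kappa ** 2]]) / denom

D_flat = nagel_enkelmann_D(0.0, 0.0)  # flat region: isotropic diffusion I/2
D_edge = nagel_enkelmann_D(10.0, 0.0) # strong vertical edge
```

In flat regions (∇I ≈ 0) the matrix reduces to isotropic diffusion D = I/2, while at strong edges it approaches a projection onto the edge direction, so proximity is smoothed along object boundaries rather than across them; note that tr D = 1 in all cases.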
The energy functions El(dl) and Er(dr) can then be defined as
$E_l(d_l) = E_{\mathrm{data}}^{l}(d_l) + \mu\, E_{\mathrm{reg}}^{l}(d_l)$ (9)
$E_r(d_r) = E_{\mathrm{data}}^{r}(d_r) + \mu\, E_{\mathrm{reg}}^{r}(d_r)$ (10)
with $E_{\mathrm{data}}^{l}$, $E_{\mathrm{data}}^{r}$ as given by (6) for, respectively, the left and right images and $E_{\mathrm{reg}}^{l}$, $E_{\mathrm{reg}}^{r}$ the regularisation terms, according to (8). The diffusion parameter μ regulates the balance between the data and regularisation terms. In order to regulate this balance, μ is estimated iteratively, using the methodology described in [[22]].
The constraints $\theta_{ij}$, with i, j ∈ {left, centre, right}, express the similarity between an estimated proximity map di and another proximity map dj. In order to calculate this similarity measure, the second proximity map must be warped to the first one. This warping process can be expressed by introducing a warping function ψ = ψ(x, d, ω, τ), with d as the proximity, and ω and τ as the camera rotation and translation, respectively. ψ allows defining the constraint equations $\theta_{ij}$ as errors in the warping
$\theta_{ij}(x) = d_i(x) - d_j\left(\psi(x, d_i, \omega, \tau)\right)$ (11)
The first constraint, $\theta_{lc}$, expresses the similarity between the estimated left proximity map dl and the proximity map from stereo dc. The ‘motion’ that is considered in this case is in fact the displacement between the left camera and the virtual central camera, which is known a priori. Since we consider rectified stereo images, the rotational movement between the cameras is zero (ωstereo = 0) and the translational movement is along the X-axis over a distance of half the stereo baseline b, such that τcl = (b/2, 0, 0)T. For estimating the depth, an iterative procedure is proposed. Following this methodology, the current estimate of the proximity map dl is filled in in (11). As such, the warping process is integrated in the optimisation scheme and will gradually improve over time. Finally, $\theta_{lc}$ is given by
$\theta_{lc}(x) = d_l(x) - d_c\left(\psi(x, d_l, 0, \tau_{cl})\right)$ (12)
The second constraint, $\theta_{lr}$, on the left proximity map can be obtained in the same way
$\theta_{lr}(x) = d_l(x) - d_r\left(\psi(x, d_l, 0, \tau_{st})\right)$ (13)
Note that, in this case, we use the translation over the whole baseline τst = (b, 0, 0)T for warping the right proximity map to the left proximity map.
The constraints on the right proximity map are as follows
$\theta_{rc}(x) = d_r(x) - d_c\left(\psi(x, d_r, 0, -\tau_{cl})\right)$ (14)
$\theta_{rl}(x) = d_r(x) - d_l\left(\psi(x, d_r, 0, -\tau_{st})\right)$ (15)
By integrating the definitions of the energy functions of (9) and (10), and the constraints (12)–(15), into the formulation of the AL functions given by (3) and (4), the constrained minimisation problem stated in (1) is now completely defined. How this problem is numerically solved is discussed in the following section.

2.2 Numerical implementation

The discrete version of (3) is given by
$\mathcal{L}_A^l(d_l, \lambda) = \sum_{(x, y)} \left[ E_l(d_l)(x, y) + \lambda_{lc}\,\theta_{lc}(x, y) + \lambda_{lr}\,\theta_{lr}(x, y) + \frac{\rho}{2}\left(\theta_{lc}(x, y)^2 + \theta_{lr}(x, y)^2\right) \right]$ (16)
The constraints given by (12) and (13) measure the dissimilarity between the left proximity map and the (warped) central and right proximity maps, respectively. However, these proximity maps are discrete and possibly highly discontinuous, which makes them impractical to work with in an optimisation scheme. Therefore we use an interpolation function fI(d, x, y), which interpolates the discrete function d at a continuous location (x, y). In this work, we use a bi-cubic spline interpolation function [[32]], and formulate the discrete version of the constraint $\theta_{lc}$ of (12) as
$\theta_{lc}(x, y) = d_l(x, y) - f_I\left(d_c, \psi(x, y, d_l, 0, \tau_{cl})\right)$ (17)
Similarly, $\theta_{lr}$ is given by
$\theta_{lr}(x, y) = d_l(x, y) - f_I\left(d_r, \psi(x, y, d_l, 0, \tau_{st})\right)$ (18)
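A sketch of how such an interpolated warping constraint can be evaluated in practice is given below. This is illustrative only: the helper names and the simplified purely horizontal shift are assumptions, whereas the paper's ψ accounts for the full rigid motion and perspective projection.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def make_fI(d):
    """Bi-cubic spline interpolator f_I over a discrete proximity map d
    (rows indexed by y, columns by x)."""
    ny, nx = d.shape
    return RectBivariateSpline(np.arange(ny), np.arange(nx), d, kx=3, ky=3)

def warping_constraint(d_left, d_other, shift):
    """theta(x, y) = d_left(x, y) - f_I(d_other, warped location), with a
    simplified horizontal warp x -> x + shift * d_left(x, y)."""
    fI = make_fI(d_other)
    ny, nx = d_left.shape
    ys, xs = np.mgrid[0:ny, 0:nx].astype(float)
    xw = np.clip(xs + shift * d_left, 0, nx - 1)  # stay inside the map
    return d_left - fI.ev(ys, xw)

# Identical maps and zero shift yield a (numerically) zero constraint residual
d = 0.1 * np.arange(8)[:, None] + 0.05 * np.arange(8)[None, :]
residual = warping_constraint(d, d, 0.0)
```

Because the spline interpolates the grid values exactly, a satisfied constraint drives the residual to zero, which is precisely what the AL terms of (16) penalise.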
The update equations of the Lagrangian multipliers $\lambda_i$ are derived as follows. When the solution xk converges to a local minimum x∗, the λk must converge to the corresponding optimal Lagrange multipliers λ∗. This condition can be expressed by differentiating the AL of (2) with respect to x
$\nabla_x \mathcal{L}_A(x, \lambda, \rho) = \nabla E(x) + \sum_i \left(\lambda_i + \rho\,\theta_i(x)\right) \nabla \theta_i(x)$ (19)
In the local minimum, the first-order optimality condition $\nabla E(x^*) + \sum_i \lambda_i^{*} \nabla \theta_i(x^*) = 0$ holds, and the optimality conditions on the AL require that also $\nabla_x \mathcal{L}_A(x^*, \lambda, \rho) = 0$; hence, we can deduce
$\lambda_i^{*} = \lambda_i + \rho\,\theta_i(x^{*})$ (20)
which gives us an update scheme for the Lagrangian multipliers, such that they converge to $\lambda_i^{*}$
$\lambda_i^{k+1} = \lambda_i^{k} + \rho\,\theta_i(x^{k})$ (21)
The expression of the energy and the constraint equations completely defines the formulation of the AL of (16), governing the iterative optimisation of the left proximity map dl. As such, the constrained optimisation problem of (1) is transformed into an unconstrained optimisation problem. To solve this unconstrained optimisation problem, we use a classical numerical solving technique, proposed by Brent in [[33]]. Brent's method switches between inverse parabolic interpolation and golden section search. Golden section search [[34]] is a methodology for finding the minimum of a bounded function by successively narrowing the range of values inside which the minimum is known to exist. This range can also be updated using inverse parabolic interpolation, but only if the produced result is ‘acceptable’. If not, the algorithm falls back to an ordinary golden section step.
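To make the 1D solver concrete, here is a minimal golden-section search; this is only a sketch of the fallback step, since Brent's actual method additionally attempts inverse parabolic interpolation and uses more careful termination tests:

```python
def golden_section(f, a, b, tol=1e-8):
    """Locate the minimum of a unimodal function f on [a, b] by repeatedly
    shrinking the bracket by the inverse golden ratio."""
    phi = (5 ** 0.5 - 1) / 2  # inverse golden ratio, ~0.618
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):          # minimum lies in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                    # minimum lies in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

x_min = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Each iteration discards a fixed fraction of the bracket, so convergence is guaranteed for any unimodal objective, which is why Brent's method can always fall back on it when a parabolic step is rejected.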
This optimisation method converges to a minimum within the search interval. Therefore it is crucial that a good initial value is available for all status variables. To estimate this initial value for the proximity field, the dense disparity map from stereo is used. The reason for this is that the camera displacement between the left and right stereo frames is well known and fixed over time. As such, it is possible to warp the stereo data in the virtual central camera reference frame towards the left and the right image with high accuracy. Applying image warping following the perspective projection model, it is possible to define the equations providing initial values for the left and right proximity maps dl and dr, based on a stereo proximity map dst
$d_l^{0}\!\left(x + \frac{fb}{2}\, d_{st}(x, y),\, y\right) = d_{st}(x, y), \qquad d_r^{0}\!\left(x - \frac{fb}{2}\, d_{st}(x, y),\, y\right) = d_{st}(x, y)$ (22)
As can be noted, (22) contains no unknown data apart from the stereo proximity map dst.
The application of Brent's optimisation method also requires that the minimum and maximum boundaries within which the solution is to be found be known. In our case, this means that minimum and maximum proximity values must be available for each pixel of the left and right images. These minimum and maximum proximity maps are calculated based on the 3σ error interval of the initial value of the proximity maps
$d_{\min} = d^{0} - 3\sigma, \qquad d_{\max} = d^{0} + 3\sigma$ (23)
where the initial proximity maps $d_l^{0}$ and $d_r^{0}$ are calculated according to (22).
For the right proximity map, a set of similar expressions can be found, starting from the AL
$\mathcal{L}_A^r(d_r, \lambda) = \sum_{(x, y)} \left[ E_r(d_r)(x, y) + \lambda_{rc}\,\theta_{rc}(x, y) + \lambda_{rl}\,\theta_{rl}(x, y) + \frac{\rho}{2}\left(\theta_{rc}(x, y)^2 + \theta_{rl}(x, y)^2\right) \right]$ (24)
and with the constraints
$\theta_{rc}(x, y) = d_r(x, y) - f_I\left(d_c, \psi(x, y, d_r, 0, -\tau_{cl})\right)$ (25)
$\theta_{rl}(x, y) = d_r(x, y) - f_I\left(d_l, \psi(x, y, d_r, 0, -\tau_{st})\right)$ (26)
Algorithm 1 details the constrained optimisation methodology (Fig. 3). As shown in Fig. 3, there are, in fact, two functions that are optimised at the same time: one using $\mathcal{L}_A^l$, which optimises the left proximity map dl, and one using $\mathcal{L}_A^r$, which optimises the right proximity map dr. In the proposed algorithm, these functions are optimised alternately, always using the latest result for both proximity maps.
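The alternating scheme can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: the per-pixel objectives passed in and the crude bounded 1D minimiser are placeholders for the actual AL terms and Brent's method.

```python
import numpy as np

def bounded_min(f, lo, hi, iters=60):
    """Crude bounded 1D minimiser by interval thirds (stand-in for Brent)."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def alternate_optimise(dl, dr, dmin, dmax, al_left, al_right, sweeps=3):
    """Alternately re-estimate each pixel of dl and dr by minimising its
    per-pixel objective, always using the latest version of both maps."""
    for _ in range(sweeps):
        for idx in np.ndindex(dl.shape):
            dl[idx] = bounded_min(lambda v: al_left(v, idx, dl, dr),
                                  dmin[idx], dmax[idx])
        for idx in np.ndindex(dr.shape):
            dr[idx] = bounded_min(lambda v: al_right(v, idx, dl, dr),
                                  dmin[idx], dmax[idx])
    return dl, dr

# Hypothetical per-pixel objectives with known minima inside the [dmin, dmax] bounds
dl, dr = np.zeros((2, 2)), np.zeros((2, 2))
dmin, dmax = np.zeros((2, 2)), np.ones((2, 2))
al_l = lambda v, idx, dl, dr: (v - 0.5) ** 2
al_r = lambda v, idx, dl, dr: (v - 0.25) ** 2
dl, dr = alternate_optimise(dl, dr, dmin, dmax, al_l, al_r)
```

Because each pixel update sees the most recent state of both maps, information flows between the two optimisers at every sweep, mirroring the data interchange between the left and right problems described above.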
Fig. 3 Constrained optimisation for binocular depth reconstruction using AL

An aspect not depicted in Algorithm 1 is the choice of the optimal framerate. The underlying structure-from-motion algorithm uses the geometric robust information criterion scoring scheme introduced by Torr in [[35]] to assess the optimal framerate. As an effect, if the camera does not move (no translation and no rotation) between two consecutive time instants, no reconstruction will be performed.

3 Results and analysis

3.1 Qualitative analysis using a real-world binocular video sequence

3.1.1 Evaluation methodology

The validation and evaluation of a dense stereo–motion reconstruction algorithm requires the use of an image sequence recorded with a moving stereo camera. Hence, we recorded, using a Bumblebee stereo head, an image sequence of an office environment, as illustrated in Fig. 4, hereafter denoted the ‘Desk’ sequence. The translation of the camera is mainly along its optical axis (Z-axis) and along the positive X-axis. The rotation of the camera is almost exclusively about the positive Y-axis.

Fig. 4 Some frames of the binocular desk sequence
a Frame 1, left image
b Frame 1, right image
c Frame 10, left image
d Frame 10, right image

As can be seen from Fig. 4, the recorded sequence shows a cluttered environment, presenting serious challenges for any reconstruction algorithm:
  • Cluttered environment with many objects at different scales of depth.

  • Relatively large untextured areas (e.g. the wall in the upper left) making correspondence matching very difficult.

  • Areas with specular reflection (e.g. on the poster in the upper right of the image), violating the Lambertian assumption, traditionally made for stereo matching.

  • Variable lighting and heavy reflections (in the window on the upper right), causing saturation effects and incoherent pixel colours across different frames.

We will focus our evaluation on how the presented iterative optimisation methodology deals with these issues and how well it is able to reconstruct the structure of this scene. However, it must not be forgotten that this iterative optimiser is also dependent on an initialisation procedure, which can influence the reconstruction result.

The initialisation step of the iterative optimiser estimates an initial value for the left and right depth fields. This method consists of warping a stereo proximity image to the left and right camera reference frames. The initial values for the left and right proximity maps still contain a lot of ‘blind spots’, or areas where no (reliable) proximity data are available. These areas are caused by unsuccessful correspondences in the stereo vision algorithm used, which performs an area-based correlation with the sum of absolute differences on bandpassed images [[26], [27]]. This algorithm is fairly robust and has a number of validation steps that reduce the level of noise. However, the method requires texture and contrast to work correctly. Effects like occlusions, repetitive features and specular reflections can cause problems leading to gaps in the proximity maps. In the following discussion, we will evaluate how well the proposed dense stereo–motion algorithm is able to cope with these blind spots and see whether it is capable of filling in the areas where depth data are missing.

To compare our method with the state of the art, we implemented a more classical dense stereo–motion reconstruction approach. This approach defines classical stereo and motion constraints, based upon the constant image brightness assumption, alongside the Nagel–Enkelmann regularisation constraint. These constraints are integrated into one objective function, which is solved using a traditional trust-region method. As such, this approach presents a relatively simple and straightforward solution, and it serves as a baseline for benchmarking the AL-based stereo–motion reconstruction technique.
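The structure of such a combined objective can be sketched as follows. This is a heavily simplified Python illustration: a plain quadratic smoothness term stands in for the Nagel–Enkelmann regulariser, SciPy's trust-region reflective solver plays the role of the trust-region method, and the weighting and toy data are assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(d, left, right, mu=0.5):
    """Brightness-constancy data residuals plus a quadratic smoothness
    surrogate standing in for the Nagel-Enkelmann regulariser."""
    h, w = left.shape
    d = d.reshape(h, w)
    cols = np.arange(w)
    # warp the right image towards the left one using the disparity field
    warped = np.stack([np.interp(cols - d[y], cols, right[y])
                       for y in range(h)])
    data = (left[:, 4:] - warped[:, 4:]).ravel()   # skip the left border
    # smoothness: penalise disparity gradients in both directions
    smooth = np.concatenate([np.diff(d, axis=0).ravel(),
                             np.diff(d, axis=1).ravel()])
    return np.concatenate([data, mu * smooth])

# toy problem: the right view is the left view shifted by 3 pixels
rng = np.random.default_rng(1)
left = rng.random((12, 30))
right = np.zeros_like(left)
right[:, :-3] = left[:, 3:]

d0 = np.full(left.size, 2.0)              # deliberately wrong initial depth
sol = least_squares(residuals, d0, args=(left, right),
                    method='trf')          # trust-region solver
```

Both terms are stacked into one residual vector, so the solver minimises a single objective, mirroring the integration of stereo, motion and regularisation constraints into one function.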

Applying this more classical technique to the ‘Desk’ sequence shown in Fig. 4 results in a depth reconstruction as shown in Fig. 5. Overall, the reconstruction of the proximity field correlates with the physical reality, as imaged in Fig. 4, but there are some serious errors in the reconstructed proximity fields, notably on the board in the middle of the image. This leads us to conclude that this method is not suitable for high-quality 3D modelling. In the following, we compare these results with the ones obtained by the proposed AL-based stereo–motion optimisation methodology, using the same input sequence.

Proximity maps for different frames of the desk sequence using the global optimisation algorithm

a Frame 1, left proximity d1l

b Frame 1, right proximity d1r

c Frame 10, left proximity d10l

d Frame 10, right proximity d10r

3.1.2 Reconstruction results

Fig. 6 shows the reconstructed left and right proximity maps using the algorithm shown in Fig. 3. The reconstructed proximity field correlates very well with the physical nature of the scene. Foreground and background objects are clearly distinguishable. The depth gradients on the left and back walls can be clearly identified, despite the fact that there is very little texture on these walls. The occurrence of specular reflection on the poster does not cause erroneous reconstruction results. The only remaining errors in the proximity field are due to border effects. Indeed, at the lower left of Fig. 6a and the lower right of Fig. 6b, one can note some areas where the regularisation has smoothed out the proximity field. The reason lies in the total absence of initial proximity data in these areas, which forced the algorithm to propagate the solution from the neighbouring regions. In general, this was performed correctly, but because of the lack of information, the algorithm estimated the direction of regularisation wrongly at these two locations. This is a normal side-effect of area-based optimisation techniques and can be solved by extending the image canvas before the calculations. The result of Fig. 6 can be compared with Fig. 5, which shows the same output using the global optimisation approach. From this comparison, it is evident that the result of the AL-based reconstruction technique is far superior to the one obtained with global optimisation. The global optimisation result features numerous problems: erroneous proximity values, under-regularised areas, over-regularised areas and erroneous estimation of discontinuities. None of these problems are present in the AL result shown in Fig. 6.

Proximity maps for different frames of the desk sequence using the AL algorithm

a Frame 1, left proximity d1l

b Frame 1, right proximity d1r

c Frame 10, left proximity d10l

d Frame 10, right proximity d10r

To show the applicability of the presented technique for 3D modelling, the individual reconstruction results were integrated into one consistent 3D representation of the imaged environment. Fig. 7 shows four novel views of the 3D model. From the different novel viewpoints, the 3D structure of the office environment can be clearly deduced; there are no visible outliers, and all items in the scene have been reconstructed, even those with very low texture. This illustrates the capabilities of the proposed AL-based stereo–motion reconstruction technique, which allows the reconstruction of a high-quality 3D model.

Reconstructed 3D model of the desk sequence

a Novel view 1

b Novel view 2

c Novel view 3

d Novel view 4

3.2 Quantitative analysis using standard benchmark sequences

For quantitative analysis, we compared the performance of the proposed approach with a traditional variational scene-flow-based method on standard benchmark sequences. The selected benchmark sequences are the well-known ‘Cones’ and ‘Teddy’ sequences created by Scharstein and Szeliski [[36], [37]], shown on the top row of Fig. 8.

Quantitative analysis: input images and ground truth depth maps

Top row: left input image and bottom row: ground truth left depth image. Left column: Cones sequence and right column: Teddy sequence

a Cones sequence, left image at t0

b Teddy sequence, left image at t0

c Cones sequence, ground truth depth image at t0

d Teddy sequence, ground truth depth image at t0

As a baseline algorithm, the variational scene-flow reconstruction approach presented by Huguet and Devernay [[19]] was chosen, as the authors provide their implementation online, which makes comparison tests possible. To ensure a fair comparison of the stereo–motion reconstruction capabilities of both algorithms, the same base stereo algorithm [[38]] was used to initialise both methods.

The results of any reconstruction algorithm depend largely on the correct initialisation of the algorithm and the selection of its parameters. In the initialisation phase of the proposed method, the estimation of the motion vectors τl, ωl and τr, ωr via sparse structure from motion plays an important role. To assess the validity of the motion vector estimation results, the measured motion can be compared with the perceived motion between subsequent images. For example, for the ‘Cones’ sequence, the main motion is a horizontal movement, which is correctly expressed by the estimated translation vectors: tl = [0.0800, 0.3151, 0.0988] and tr = [0.1101, 0.3131, 0.1255]. Ideally, both vectors should be identical (as both cameras follow an identical motion pattern); the discrepancy between them therefore gives an idea of the error in the motion estimation process.
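The consistency check described above amounts to comparing the two estimated translation vectors; a small sketch, using the values quoted in the text:

```python
import numpy as np

# estimated translations for the 'Cones' sequence, as quoted in the text
t_l = np.array([0.0800, 0.3151, 0.0988])
t_r = np.array([0.1101, 0.3131, 0.1255])

# both cameras undergo the same motion, so the discrepancy between the
# two estimates is a direct indicator of the motion-estimation error
diff = np.linalg.norm(t_l - t_r)
rel_err = diff / np.linalg.norm(0.5 * (t_l + t_r))
print(f"absolute discrepancy: {diff:.4f}, relative: {rel_err:.1%}")
```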

Parameter-tuning affects many modern reconstruction algorithms, as the need for careful parameter selection makes comparison and real-world application of the algorithms difficult. For the proposed approach, one parameter is of major importance: the parameter μ, which balances the data and the regularisation terms. In our experiments, a value of μ = 0.5 was chosen, based on previous analysis [[25]]. A remaining parameter of lesser importance is the threshold ε for stopping the iterative solver. This parameter is somewhat sequence-dependent, with typical values between 10 and 20. With regard to the benchmark algorithm by Huguet and Devernay, all parameters were kept as provided by the authors in their original implementation.

Fig. 9 shows a qualitative comparison of the reconstruction results of both methods. The quantitative evaluation is done by computing the root-mean-square (RMS) error on the depth map, measured in pixels, as presented in Table 1. The proposed AL-based binocular dense structure-from-motion (ALBDSFM) approach has the convenient property of decreasing the residual on the objective function dramatically in the first iteration, whereas convergence slows down in subsequent iterations. For this reason, we also included the results after one iteration in the tables.
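The RMS metric used here can be computed as follows; a minimal sketch in which the convention that non-positive values mark missing ground truth is an assumption:

```python
import numpy as np

def rms_depth_error(estimate, ground_truth):
    """Root-mean-square error in pixels between an estimated and a
    ground-truth depth map, skipping pixels without ground truth."""
    valid = ground_truth > 0       # assumed: 0 marks missing ground truth
    diff = estimate[valid] - ground_truth[valid]
    return np.sqrt(np.mean(diff ** 2))
```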

Table 1. RMS error in pixels on the different sequences using both methods

RMS error                       Cones   Teddy
ALBDSFM (1 iteration)           2.411   5.002
ALBDSFM (convergence)           2.381   4.961
Variational scene flow [[19]]   9.636   8.650

Comparison of the reconstruction result using the traditional variational scene-flow method [[19]] and the proposed method

Top row: reconstructed left depth image using [[19]] and bottom row: reconstructed left depth image using proposed method. Left column: Cones sequence and right column: Teddy sequence

a Cones sequence, depth image at t0 using [[19]]

b Teddy sequence, depth image at t0 using [[19]]

c Cones sequence, depth image at t0 using proposed method

d Teddy sequence, depth image at t0 using proposed method

As can be noted from Fig. 9 and Table 1, the proposed ALBDSFM algorithm performs better on both the ‘Cones’ and ‘Teddy’ sequences. On the ‘Cones’ sequence, the ALBDSFM approach is better able to represent the structure of the lattice in the back, whereas this structure is completely smoothed by the ‘SceneFlow’ algorithm. On the ‘Teddy’ sequence, both reconstruction results are visually quite similar. Both reconstruction techniques clearly suffer from over-segmentation. This is a typical problem of the Nagel–Enkelmann regularisation scheme we used and can partly be remedied by fine-tuning the regularisation parameters; however, to keep the comparison fair, we did not perform such sequence-specific parameter-tuning. The quantitative analysis on the ‘Teddy’ sequence in Table 1 shows an advantage for the ALBDSFM approach.

Table 2 gives an overview of the total processing times required for both algorithms. All experiments were performed on a 1.6 GHz Intel Core i5 central processing unit (CPU). The ‘SceneFlow’ algorithm is a C++ application (available on http://devernay.free.fr/vision/varsceneflow), whereas the ALBDSFM is implemented in MATLAB. While neither algorithm can be called fast, it is clear that the ALBDSFM approach is much faster than the ‘SceneFlow’ implementation. The processing time is mostly determined by the computational cost of a single iteration (within the ‘while’-loop of Algorithm 1) and the number of iterations, as the initialisation and stereo computation steps take only a few seconds. As the iteration step consists of a double optimisation step using Brent's method, its computational complexity is of the order of O(2n²), with n the number of image pixels. When high-resolution images are used, the computational cost rises quickly, which explains the relatively large processing times. However, neither implementation makes use of multi-threading or graphics processing unit (GPU) optimisations, so large speed gains could be obtained by applying such optimisations.
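The per-pixel optimisation sweep at the heart of the iteration step can be illustrated with SciPy's implementation of Brent's method; the quadratic per-pixel cost below is a stand-in for the actual augmented Lagrangian terms, and the bracket width is an assumption:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def refine_field(depth, pixel_cost, bracket=2.0):
    """One sweep of the per-pixel 1-D optimisation: each depth value is
    refined with Brent's method; running such a sweep on the left and
    then the right field gives the double optimisation step noted in
    the text."""
    out = depth.copy()
    for idx in np.ndindex(depth.shape):
        res = minimize_scalar(
            lambda v: pixel_cost(idx, v),
            bracket=(depth[idx] - bracket, depth[idx] + bracket),
            method='brent')
        out[idx] = res.x
    return out

# toy cost: a quadratic pulling every pixel towards a known target field
target = np.full((4, 4), 3.0)
cost = lambda idx, v: (v - target[idx]) ** 2
refined = refine_field(np.zeros((4, 4)), cost)
```

Since every pixel of both fields triggers a full 1-D minimisation per sweep, the total cost grows rapidly with image resolution, consistent with the timings in Table 2.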

Table 2. Total processing time in minutes on the different sequences using both methods

Processing time (min)           Cones   Teddy
ALBDSFM (1 iteration)           24      24
ALBDSFM (convergence)           74      205
Variational scene flow [[19]]   257     243

4 Conclusions

The combination of spatial and temporal visual information makes it possible to achieve high-quality dense depth reconstruction, but comes at the cost of a high computational complexity. To this end, we presented a novel solution for integrating the stereo and motion depth cues by simultaneously optimising the left and right proximity fields using the AL. The main advantage of our algorithm is its ability to exploit all the available constraints in one minimisation framework. Another advantage is that the framework can incorporate any given stereo reconstruction methodology. The algorithm has been implemented and applied to real imagery as well as benchmarks. A comparison of the proposed method with the variational scene-flow method shows that the quality of the obtained results far exceeds that of the traditional method. The added quality with respect to normal stereo comes with the penalty of increased processing time, which remains considerable. Even though future optimisation of the implementation, for example a GPU implementation, will certainly further reduce the processing time, the proposed approach can already be used effectively in an off-line production environment, where it presents an excellent tool for high-quality 3D reconstruction from binocular video.

5 Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 285417.