Adaptive enhanced affine transformation for non-rigid registration of visible and infrared images

Non-rigid registration, which performs well in all-weather and all-day/night conditions, directly determines the reliability of visible (VIS) and infrared (IR) image fusion. On account of non-planar scenes and the differences between IR and VIS cameras, non-linear transformation models are more helpful for non-rigid image registration than the affine model. However, most non-linear models currently used for non-rigid registration are constructed from control points. Aiming at the limited adaptiveness and generalization of control-point-based models, adaptive enhanced affine transformation (AEAT) is proposed for image registration, generalizing the affine model from the linear to the non-linear case. Firstly, Gaussian weighted shape context, measuring the structural similarity between multimodal images, is designed to extract putative matches from the edge maps of IR and VIS images. Secondly, to implement global image registration, the optimal parameters of the AEAT model are estimated from the putative matches by a strategy of subsection optimization. Experiment results show that this approach is robust in different registration tasks and outperforms several competitive methods in registration precision and speed.


INTRODUCTION
Non-rigid image registration is a prerequisite for infrared (IR) and visible (VIS) image fusion [1], which is widely applied in areas such as night vision [2], medical imaging [3] and remote sensing [4]. The goal of registration is to estimate a spatial transformation model between two images to be aligned; without image registration, IR and VIS image fusion could not be implemented [5]. However, the differences between the acquisition systems and intensity distributions of multimodal images raise a tough challenge, making multimodal image registration a hot topic. On this account, non-rigid registration of IR and VIS images is the focus of this work. Image registration can be considered a fitting problem: mutual features between the VIS and IR images are the known data points, the spatial transformation model is a fitting function, and a fitting criterion quantifying the distance between the two images to be aligned is established from the mutual features and the transformation model. Image registration is then achieved by determining the transformation parameters that minimize the fitting criterion [11]. Three types of mutual features are introduced here: point, intensity and structure. Point features are prevalent, so many point matching approaches are used for mutual feature extraction [6]. Although some point-feature-based registration algorithms do not require an explicit set of point matches, the feature descriptors quantifying the degree of point correspondence still directly control the precision of registration.
Also, intensity-based feature descriptors are widely employed for image registration, such as scale invariant feature transform (SIFT) [7], corner feature [8], histogram of gradient (HOG) [9] and speeded up robust feature (SURF) [10] etc. Even so, for IR and VIS image registration, intensity-based features could not produce accurate results due to the great difference between intensity distributions of VIS and IR images. Compared to intensity-based feature descriptors, structural feature descriptors are affected slightly by intensity distributions. Moreover, owing to the similarity between the global structures of VIS and IR images to be aligned [11], mutual features could be extracted accurately by structural feature descriptors, one of which called shape context (SC) [12] has been successfully used to measure the similarity between the structure of point sets.
In recent years, coherent distance SC (CDSC) [13], inner distance SC (IDSC) [14], rotation invariant SC (RISC) [15] and normalized weighted SC (NWSC) [16] have evolved from SC. However, these feature descriptors were developed for point set registration or shape matching. Besides, structures present in IR images are likely to be missing or deformed in VIS images. Thus, a more powerful feature descriptor measuring the structural similarity between the VIS and IR images to be aligned is required for image registration.
For image registration, spatial transformation models are used to represent the pattern of deformation between images to be aligned. Among these models, the affine model [17] is a typical and widely used linear model. However, it cannot produce accurate alignment when there is deformation anisotropy between images, which arises in numerous applications, especially multimodal image fusion. To address this problem, many control-point-based transformation models have been proposed, such as the B-splines model [18], the thin-plate spline (TPS) model [19] and a model within a reproducing kernel Hilbert space (RKHS) [20]. The above-mentioned models can accurately describe the pattern of non-linear deformation between images to be aligned, but with the weakness of heavy reliance on control points. Because the transformation parameters of control-point-based models are optimized in the neighbourhoods of control points, the distribution and quantity of control points both influence the precision of global registration. Furthermore, control points, or the ranges from which they are selected, are generally regarded as preset parameters, but it is impractical to select different optimal control points for the various image scenes met in real applications. Therefore, control-point-based models are limited by low adaptability and robustness.
In this work, our contributions include following two aspects. Firstly, we develop Gaussian weighted shape context (GWSC) to quantify the structural similarity between IR and VIS images. Secondly, by generalizing the affine model from linear case to non-linear case, adaptive enhanced affine transformation (AEAT) model is proposed to adaptively determine the regular pattern of global deformation between VIS and IR images to be aligned. In brief, the key idea of our method (GWSC-AEAT) is to estimate the AEAT model from putative matches extracted by GWSC. The experiment results indicate that compared to the state-of-the-art methods, GWSC-AEAT has better performance on non-rigid VIS and IR image registration. Consequently, it could be employed to raise the performance of IR and VIS image fusion applications.
The remainder of this paper is structured as follows. Section 2 reviews the most important works related to our work. Section 3 describes how to extract putative matches from edge maps by GWSC. Section 4 presents the estimation of the AEAT model and how to apply it to global image registration. Section 5 shows comparison results of GWSC-AEAT and state-of-the-art methods on real datasets, followed by analyses and discussions of experiment results. Finally, concluding remarks and weaknesses of our work are presented in Section 6.

RELATED WORKS
If considering image registration as a fitting problem, the distance measure quantifying registration performance is a fitting criterion. According to various distance measures, registration methods could be classified into intensity-based methods, spectral methods and feature-based methods.
As for intensity-based techniques, the distance measures, such as mutual information (MI) [21] and normalized mutual information (NMI) [22], are defined directly on image intensity. MI-based measures can also be formulated from image gradients [23] and image patches [24]. The main assumption MI-based measures rely on is that the statistical regularities of intensity are similar between the images to be aligned. However, there are significant differences between the intensity distributions of multimodal images; therefore, intensity-based measures are not suitable for multimodal image registration [25]. What is more, intensity-based measures are generally formulated as non-convex functions, so it is difficult to minimize them quickly and efficiently [26].
Spectral methods, which represent image data structures by spectral decomposition, are an upgrade of intensity-based measures. Typically, the first embedding coordinates of diffusion maps and Laplacian eigenmaps were used for feature representation of multimodal images [27]. Then, the joint graph obtained by spectral decomposition [25] and Laplacian commutativity [11] were developed to measure the similarity between multimodal images in high-dimensional eigenspaces. Because spectral measures established from L1 or L2 distances are convex functions, their solution is more concise than that of MI-based measures. However, the derivative formulas of spectral measures are rather complex and difficult to derive. Thus, the development and real application of spectral methods are limited.
Concerning feature-based methods, they measure the performance of image registration by the distance between matched features of the images to be aligned [28]. Image features representing the correspondence between images, such as points, curves and surfaces [29], are typically used as matched features. In practice, point features are more popular because curves and surfaces can be discretized into point sets. Although feature-based image registration could be regarded as point set registration, there is a difference between the two: the transformation model estimated from two point sets is only used to align those points, whereas for image registration, the transformation model aligning two feature point sets is used to align all of the points in the images. Therefore, the transformation models used for image registration have to represent the pattern of global deformation between the images to be aligned.
Some typical models have been employed to construct feature-based measures, including the L2 loss criterion [30], the L2-minimizing estimator (L2E) [31], the regularized Gaussian field criterion (RGF) [20] and the Gaussian mixture model (GMM) [32]. As a result of the convexity of feature-based measures, these models can be minimized by gradient-based techniques, such as the gradient descent method and the quasi-Newton method. In addition, one more important issue for feature-based measures is how to find enough matched features from the multimodal images to be aligned, since the accuracy of point matching directly affects the precision of the feature-based measure.
Point matching used for feature-based measures can be divided into two types: soft-assignment and hard-assignment. Generally speaking, feature-based measures using soft-assignment are established from an attribute matrix indicating the degree of match between all pairs of points. During the estimation of transformation parameters, the explicit correspondence is iteratively solved by optimization, with the soft-assignment as the initial matching. Typical methods include iterated closest point (ICP) [33], robust point matching using TPS (TPS-RPM) [30], coherent point drift (CPD) [34] and face registration using RGF [20]. Because the explicit correspondence and transformation parameters are solved simultaneously, the optimization of feature-based measures with soft-assignment is very likely to fall into local convergence and incur a huge computational cost.
Unlike soft-assignment, hard-assignment determines explicit matches before transformation estimation. Thus, the likelihood and the speed of reaching the global minimum of feature-based measures with hard-assignment are increased. Hard-assignment is commonly used in feature-based registration methods, such as robust point matching using the L2E estimator (RPM-L2E) [31], robust point matching using manifold regularization (MR-RPM) [15], point registration using spatially constrained Gaussian fields (SCGF) [35] and image registration using an affine and contrast invariant descriptor [36]. In general, feature-based measures with hard-assignment require a highly accurate method of point matching, the keys of which are feature descriptors and bipartite graph matching. Feature descriptors representing the similarity between two point sets have been discussed in Section 1. In addition, there are many typical approaches for bipartite graph matching, such as the Hungarian algorithm [37], the deferred-acceptance algorithm [38] and the shortest augmenting path algorithm [39].

PUTATIVE MATCH EXTRACTION BY GWSC
This section mainly describes the definition of GWSC and how to extract putative matches from edge maps by GWSC. It should be noted that VIS images are considered reference images in this paper, and thus our purpose is to align IR images to VIS images.

Problem formulation
The set of putative matches is represented as F = {(r_k, s_k)}_{k=1}^{K}, where r_k is an edge point in the IR image and s_k is the putative matching edge point in the VIS image, r_k, s_k ∈ ℝ², k ∈ ℕ+. K denotes the number of putative matches. In this work, the Canny edge detector is employed to extract the edge maps from the IR and VIS images.
A transformation model with parameters c is represented as T_c, which indicates a mapping from the IR image space to the VIS image space. The parameters c are estimated by aligning the point set {r_k}_{k=1}^{K} to {s_k}_{k=1}^{K}. This process can be considered an optimization procedure; on this account, an objective function quantifying registration accuracy is essential.
In this work, a Gaussian-fields-based model is chosen to construct the objective function, since it is continuously differentiable and can converge to a globally optimal solution quickly [20]. Therefore, the objective function is written as:

E(c) = −Σ_{k=1}^{K} exp(−‖T_c(r_k) − s_k‖² / e²) + λ S(c),   (1)

where ‖⋅‖ represents the L2 norm and e is a range parameter. The first term measures the distance between the two point sets. The second term S(c) is a stabilizer that establishes control over the transformation, and λ ∈ ℝ is a normalized weight balancing the two terms. From Equation (1) we can see that the performance of the objective function is largely determined by the accuracy of the putative matches: incorrect point matches cannot produce an accurate measurement of registration performance. Moreover, compared with the estimation of transformation parameters, the computational cost of point matching is higher. Furthermore, IR and VIS image pairs with highly different intensity distributions and missing structures increase the difficulty of point matching. Therefore, in the next section, we introduce the feature descriptor GWSC to improve the speed and accuracy of point matching between VIS and IR images.
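To make the Gaussian-fields criterion of Equation (1) concrete, here is a minimal NumPy sketch; the pure-translation model, the quadratic stabilizer and all numeric values are illustrative placeholders rather than the paper's AEAT-specific choices:

```python
import numpy as np

def gaussian_fields_objective(transform, params, r, s, e, lam, stabilizer):
    """Gaussian-fields fitting criterion: a smooth, differentiable
    alternative to counting exact correspondences.

    transform(params, r) maps IR points r (K x 2) into VIS space; the
    first term rewards transformed points landing near their putative
    matches s (K x 2), and the stabilizer term penalizes transformations
    that drift out of control."""
    d2 = np.sum((transform(params, r) - s) ** 2, axis=1)
    return -np.sum(np.exp(-d2 / e ** 2)) + lam * stabilizer(params)

# Toy usage: a pure translation model on three synthetic matches.
def translate(params, r):
    return r + params  # params = [tx, ty]

r = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s = r + np.array([2.0, -1.0])  # ground-truth shift
perfect = gaussian_fields_objective(translate, np.array([2.0, -1.0]),
                                    r, s, e=6.0, lam=0.02,
                                    stabilizer=lambda p: np.sum(p ** 2))
```

At the true parameters every match contributes −1 to the first term, so the value reduces to the (small) stabilizer penalty; any mis-estimated translation raises the value smoothly, which is what makes gradient-based minimization applicable.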

GWSC
SC, which describes the neighbourhood structure of a point, is commonly used for point matching in edge maps. Let R′ = {r′_i}_{i=1}^{I} and S′ = {s′_j}_{j=1}^{J} denote the edge point sets of the IR and VIS images respectively, where I and J are the numbers of edge points. The SC-based similarity measure between two edge points of the IR and VIS images is defined as:

Cs(i, j) = (1/2) Σ_{t=1}^{T} [S_t(r′_i) − S_t(s′_j)]² / [S_t(r′_i) + S_t(s′_j)],   (2)

where S_t(⋅) represents the T-bin normalized histogram of an edge point, given by:

S_t(r′_i) = #{r′ ∈ R′, r′ ≠ r′_i : (r′ − r′_i) ∈ bin(t)} / (I − 1).   (3)

Cs is a cost matrix between the edge points of the IR and VIS images; two edge points with a lower SC-based cost are more akin to each other. Because there might be some missing and deformed edges between the VIS and IR images, SC often fails in point matching between VIS and IR images. This can be seen in Figure 1, which shows an illustration of point matching by SC. The points A and B are determined to be a putative match by SC, but A actually corresponds to C. On the one hand, the neighbourhood structures of the points A and B are similar within a certain range; on the other hand, in the VIS image, the edge feature corresponding to the point B in the IR image is distorted. A similar matching error occurs among the VIS edge points D, E and the IR edge point F, as shown clearly in the top of Figure 2. Hence, point matching in the edge maps of IR and VIS images suffers from these problems.
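For reference, a compact sketch of a standard shape-context pipeline of this kind follows; the log-polar bin edges and bin counts are illustrative choices, not necessarily the paper's exact configuration:

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """T-bin normalized log-polar histograms (T = n_r * n_theta),
    one per point: a sketch of the standard shape-context descriptor."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]       # (n, n, 2) offsets
    dist = np.linalg.norm(diff, axis=2)
    mean_d = dist[dist > 0].mean()                       # scale invariance
    r_bin = np.digitize(np.log10(np.where(dist > 0, dist / mean_d, 1e-12)),
                        np.linspace(-1.5, 0.5, n_r - 1))
    t_bin = np.digitize(np.arctan2(diff[..., 1], diff[..., 0]),
                        np.linspace(-np.pi, np.pi, n_theta + 1)[1:-1])
    hist = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j:
                hist[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return hist / (n - 1)                                # rows sum to 1

def chi2_cost(h1, h2):
    """Pairwise chi-squared cost matrix between two histogram sets."""
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :]
    den[den == 0] = 1.0                                  # 0/0 bins contribute 0
    return 0.5 * (num / den).sum(axis=2)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
hists = shape_context(pts)
C = chi2_cost(hists, hists)   # zero on the diagonal: identical neighbourhoods
```

Identical point sets yield a zero diagonal, which is the sanity check one would expect before feeding the cost matrix to any matcher.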
To address the above problems, the feature descriptor GWSC is proposed to improve the discrimination between points which are similar but not corresponding. GWSC is improved from SC, and the cost matrix of GWSC is written as:

Cg = (D_r^{−1} W_r Cs W_v D_v^{−1}) ∗ W_rv,   (4)

where W_r, W_v and W_rv are the Gaussian weight matrices, defined as follows:

W_r(i, a) = γ_r exp(−‖r′_i − r′_a‖² / σ_s²),
W_v(j, b) = γ_v exp(−‖s′_j − s′_b‖² / σ_s²),
W_rv(i, j) = γ_rv (1 − exp(−‖r′_i − s′_j‖² / σ_rv²)),   (5)

where σ_s and σ_rv control the ranges of interaction between points respectively, γ_r, γ_v and γ_rv are the normalization coefficients, D_r = diag(Σ_a W_r(i, a)) and D_v = diag(Σ_b W_v(j, b)) are the diagonal degree matrices of W_r and W_v respectively, and ∗ denotes the Hadamard product.
The original SC measures the similarity between two points via their neighbourhood structures, but the radius of the neighbourhood in the SC-based measure is too small to obtain correct point matching, as shown in the top of Figure 2. However, increasing the radius would result in high computational complexity, and moreover, the SC-based measure could easily be disturbed by outliers. In the proposed GWSC, the first factor in Equation (4) calculates the weighted average of the SC-based similarity over the neighbourhoods of r′_i and s′_j, as shown in the bottom of Figure 2. Accordingly, it is possible to enlarge the scope of SC without increasing the radius of the SC-based similarity measure Equation (2).
The second factor W_rv encodes the relative distances between all pairs of edge points in the IR and VIS edge maps. The assumption W_rv relies on is that the distance between the two points of an actual match in multimodal images to be aligned is not too large. In real applications, this can easily be satisfied by controlling the displacement between the optical axes of the VIS and IR cameras. In addition, the Gaussian weight matrices in GWSC reduce the negative impact of outliers on similarity measurement. Therefore, Cg indicates the degree of correspondence between all pairs of edge points in the VIS and IR images: the lower Cg(i, j) is, the more similar r′_i and s′_j are.
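As one possible reading of the GWSC cost (the SC cost averaged over Gaussian-weighted neighbourhoods on each side, then modulated elementwise by a displacement penalty), a NumPy sketch follows; the normalizations and the exact form of the W_rv factor here are our assumptions, not a verbatim transcription of the paper's definition:

```python
import numpy as np

def gwsc_cost(Cs, r_pts, s_pts, sigma_s=80.0, sigma_rv=300.0):
    """Hypothetical GWSC cost: neighbourhood-averaged SC cost,
    Hadamard-modulated by a penalty on the displacement between
    candidate matches (far-apart pairs become expensive)."""
    def gauss(a, b, sigma):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
        return np.exp(-d2 / sigma ** 2)
    Wr = gauss(r_pts, r_pts, sigma_s)          # IR-side neighbourhood weights
    Wv = gauss(s_pts, s_pts, sigma_s)          # VIS-side neighbourhood weights
    Dr_inv = np.diag(1.0 / Wr.sum(axis=1))     # degree normalization
    Dv_inv = np.diag(1.0 / Wv.sum(axis=1))
    avg = Dr_inv @ Wr @ Cs @ Wv @ Dv_inv       # weighted average of Cs
    Wrv = 1.0 - gauss(r_pts, s_pts, sigma_rv)  # displacement penalty in [0, 1)
    return avg * Wrv                           # Hadamard product

r = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
Cg = gwsc_cost(np.ones((3, 3)), r, r)          # co-located pairs are cheapest
```

With identical point sets and a uniform SC cost, the displacement factor alone drives the result, so each point's cheapest candidate is the point at the same location, which is the behaviour the displacement assumption is meant to encode.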

Extraction of putative matches
On the basis of the cost matrix, point matching can be treated as bipartite graph matching, so the Hungarian method and similar approaches could be employed to determine putative matches between point sets. However, these traditional approaches are not suitable for extracting putative matches from the edge maps of VIS and IR images. In most cases, the large number of points in the edge maps of real images sharply increases the computational cost of bipartite graph matching. Moreover, for implementing VIS and IR image registration, it is unnecessary to find the maximum matching between all pairs of edge points; the major purpose of point matching employed for image registration is to extract putative matches from all pairs of edge points. Thus, the GWSC-based approach is designed for the extraction of putative matches and outlined as follows.

ALGORITHM 1: Extraction of putative matches by GWSC
1. Input the edge point sets R′ and S′ of the IR and VIS images;
2. Compute the cost matrix Cg by Equation (4);
3. Find the minimum of each row in Cg, add the corresponding column indices into the set M_v, and generate the putative match set C_v;
4. Find the minimum of each column in Cg, add the corresponding row indices into the set M_r, and generate the putative match set C_r;
5. Finally, determine the putative match set F = C_v ∩ C_r.
As can be seen from Algorithm 1, the point pairs with the lowest GWSC-based cost are extracted as putative matches, so the set F indicates a potential correspondence between the VIS and IR images. In addition, the time complexity of extracting putative matches from the cost matrix (i.e. steps 3-5) is O(IJ). Thus, compared with bipartite graph matching, Algorithm 1 has an advantage in calculation speed.
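Steps 3-5 of Algorithm 1 amount to keeping the mutual minima of the cost matrix; assuming the final step intersects C_v and C_r, a direct O(IJ) sketch is:

```python
import numpy as np

def extract_putative_matches(Cg):
    """Steps 3-5 of Algorithm 1 (under a mutual-minimum reading):
    keep a pair (i, j) only if j is the cheapest column in row i AND
    i is the cheapest row in column j, in O(I*J) time instead of
    running a full bipartite matching."""
    row_best = Cg.argmin(axis=1)   # best VIS candidate for each IR point
    col_best = Cg.argmin(axis=0)   # best IR candidate for each VIS point
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]

Cg = np.array([[0.1, 0.9, 0.8],
               [0.7, 0.2, 0.6],
               [0.5, 0.4, 0.3]])
matches = extract_putative_matches(Cg)   # mutual minima of the cost matrix
```

Unlike the Hungarian algorithm, this keeps only pairs that prefer each other, silently dropping ambiguous points, which is acceptable here because registration only needs enough reliable matches, not a maximum matching.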

ESTIMATION OF THE AEAT MODEL
A transformation model T_c is a mapping function, which represents the pattern of deformation between two images to be aligned. This section presents the AEAT model and how to apply it to global image registration.

The AEAT model
The framework of parallel optical axes is widely employed in VIS and IR image fusion systems. The causes of the deformation between the VIS and IR images to be aligned thus include the displacement between the two cameras, the differences between the lenses of the VIS and IR cameras, the distinctions between the VIS and IR sensors, etc. Most of these causes can be formulated as a regular pattern. Accordingly, in this work we assume that the deformation pattern between a VIS and IR image pair consists of various regular patterns; in other words, the global deformation can be formulated as a mixed regular pattern.
Typically, the affine model is able to describe the global regularity of linear deformation. The smoothness of the affine model can enforce the consistency of spatial transformations for all points. Moreover, due to the regularity of the affine model, control points are not required. Hence, we believe that the mixed regular pattern of non-rigid deformation between VIS and IR images can be represented by generalizing the affine model from linear to non-linear case.
Let r_k = [x, y]^T, a 2 × 1 dimensional coordinate vector, and let T_c^{(n)}(r_k) = [x̂, ŷ]^T denote the mapped coordinate vector under the spatial transformation. The AEAT model is defined by:

T_c^{(n)}(r_k) = A [x, y, 1]^T + P^{(n)}(r_k),   (6)

where A = [s_x cos θ, −sin θ, t_x; sin θ, s_y cos θ, t_y] is the affine transformation matrix, s_x and s_y are scaling coefficients, t_x and t_y are translation coefficients, and θ is the angle of rotation. From Equation (6) we can see that the first term is the traditional affine model. The second term P^{(n)} is the enhanced part of the AEAT model, which is given by:

P^{(n)}(r_k) = [ Σ_{1 ≤ i+j ≤ n} α_{i,j} x^i y^j , Σ_{1 ≤ i+j ≤ n} β_{i,j} x^i y^j ]^T,   (7)

where P^{(n)} is the 2 × 1 dimensional vector of the mixed polynomial transformation, α^{(n)} and β^{(n)} are the 1 × n_p (n_p = n(n + 3)∕2) dimensional parameter vectors containing all α_{i,j} and β_{i,j} respectively, and n is the order of the AEAT model. The first term of Equation (6) describes the linear deformation pattern between the two point sets; the second term, the mixed polynomial model, represents the regular pattern of non-linear deformation. Hence, the AEAT model actually evolves from the affine transformation by generalizing the affine model from the linear to the non-linear case. Furthermore, the order of the AEAT model is easily adjusted via the parameter n; in other words, the AEAT model can be adapted to various degrees of non-linearity in the non-rigid deformation between two point sets. Meanwhile, owing to the smoothness and regularity of the affine and polynomial models, the AEAT model is smooth and requires no control points.
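Equations (6) and (7) can be sketched directly; the flat parameter ordering below follows the vector [θ, s_x, s_y, t_x, t_y | α^{(n)} | β^{(n)}] described in the matrix-form discussion, and the monomial ordering is our own convention:

```python
import numpy as np

def aeat_transform(params, pts, n):
    """Sketch of the AEAT model (Equations (6)-(7)): an affine transform
    plus an order-n mixed polynomial with n_p = n(n+3)/2 terms per axis.
    params = [theta, sx, sy, tx, ty, alpha (n_p values), beta (n_p values)]."""
    theta, sx, sy, tx, ty = params[:5]
    n_p = n * (n + 3) // 2
    alpha = params[5:5 + n_p]
    beta = params[5 + n_p:5 + 2 * n_p]
    x, y = pts[:, 0], pts[:, 1]
    # Affine part A [x, y, 1]^T.
    xh = sx * np.cos(theta) * x - np.sin(theta) * y + tx
    yh = np.sin(theta) * x + sy * np.cos(theta) * y + ty
    # Monomials x^i y^j with 1 <= i + j <= n (exactly n_p of them).
    mono = np.stack([x ** i * y ** (d - i)
                     for d in range(1, n + 1)
                     for i in range(d + 1)], axis=1)
    return np.stack([xh + mono @ alpha, yh + mono @ beta], axis=1)

pts = np.array([[1.0, 2.0], [3.0, -4.0]])
identity = np.array([0.0, 1.0, 1.0, 0.0, 0.0] + [0.0] * 4)  # n = 1, n_p = 2
out = aeat_transform(identity, pts, n=1)   # identity parameters leave pts fixed
```

Note that the identity transformation corresponds to θ = 0, s_x = s_y = 1 and all remaining parameters zero, which is exactly the role the vector z^{(n)} plays in the stabilizer.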
Substituting Equation (7) into (6), the AEAT model takes the following matrix form:

T_c^{(n)}(r_k) = H_c^{(n)} u_k^{(n)},   (8)

where H_c^{(n)} is a 2 × (n_p + 3) dimensional matrix assembled from the parameters in A, α^{(n)} and β^{(n)}, and u_k^{(n)} is the (n_p + 3) × 1 dimensional vector collecting [x, y, 1]^T and the n_p monomials x^i y^j (1 ≤ i + j ≤ n) of r_k. In this work, the parameter vector c^{(n)} of the AEAT model is represented as [θ, s_x, s_y, t_x, t_y | α^{(n)} | β^{(n)}]^T, which is a (2n_p + 5) × 1 dimensional vector. With Equation (8), the objective function Equation (1) becomes:

E(c^{(n)}) = −Σ_{k=1}^{K} exp(−‖T_c^{(n)}(r_k) − s_k‖² / e²) + λ tr((c^{(n)} − z^{(n)})(c^{(n)} − z^{(n)})^T),   (10)

where z^{(n)} = [0, 1, 1, 0, … , 0]^T is the (2n_p + 5) × 1 dimensional vector corresponding to the identity transformation and tr(⋅) denotes the trace. The stabilizer (i.e. the second term of Equation (10)), which describes the variation of the AEAT model during the optimization process, is used to prevent the spatial transformation from getting out of control.

Optimization
With Gaussian fields, Equation (10) is a continuously differentiable function with respect to the transformation parameters c^{(n)}. As a consequence, we can give the matrix form of the corresponding derivative:

∂E(c^{(n)})/∂c^{(n)} = (2/e²) Σ_{k=1}^{K} exp(−‖T_c^{(n)}(r_k) − s_k‖² / e²) (∂T_c^{(n)}(r_k)/∂c^{(n)})^T (T_c^{(n)}(r_k) − s_k) + 2λ (c^{(n)} − z^{(n)}),   (11)

where the matrix form of ∂T_c^{(n)}(r_k)/∂c^{(n)} = ∂(H_c^{(n)} u_k^{(n)})/∂c^{(n)} is written as:

∂T_c^{(n)}(r_k)/∂c^{(n)} = [ ∂(A [x, y, 1]^T)/∂[θ, s_x, s_y, t_x, t_y] , m_k^T, 0^T ; 0^T, m_k^T ],   (12)

where the first block is the 2 × 5 Jacobian of the affine part, 0 is the n_p × 1 dimensional zero vector and m_k is the n_p × 1 dimensional vector of the monomials x^i y^j (1 ≤ i + j ≤ n) of r_k. On the basis of the derivative Equation (11), the optimization of Equation (10) can be solved by the quasi-Newton method, before which we have to figure out how to determine the optimal order n of the AEAT model. The purpose of optimization is to estimate the optimal AEAT model from the putative matches, which can be considered a fitting problem in our work: the putative matches extracted by Algorithm 1 are the known data points, the AEAT model is a polynomial-based fitting function and the objective function is the evaluation criterion. Therefore, the order of the AEAT model directly impacts the performance of optimization. To be more specific, the AEAT model with a low order cannot represent the complex deformation pattern between multimodal images. On the other hand, due to the mixed polynomial model P^{(n)}, the AEAT model with a high order is sometimes more accurate but suffers from the Runge phenomenon. Figure 3 shows examples of the Runge phenomenon caused by the AEAT model with order 4; there are many distortions in the registration results produced by an AEAT model with an excessively high order.
Therefore, to determine the optimal order and parameters of the AEAT model, we develop a strategy of subsection optimization, shown in Algorithm 2.
As seen in Equation (6), the proposed AEAT model consists of two parts: the affine model and the mixed polynomial model. Since the affine model is a linear transformation, it is only used to coarsely describe the regular pattern of non-rigid deformation between the images to be aligned. The mixed polynomial model with a high order is able to finely represent the deformation pattern owing to its high non-linearity. Therefore, in the estimation of the AEAT model (Algorithm 2), we use the affine model for coarse registration and then mainly use the mixed polynomial model for fine registration.
Although the mixed polynomial model with order 1 seems similar to the affine model, there is a difference between the parameters of the two models. It is apparent that the parameters of the mixed polynomial model can be optimized without any constraint in the estimation of the AEAT model. If the affine model is considered a linear polynomial, its polynomial parameters form the affine transformation matrix A = [s_x cos θ, −sin θ, t_x; sin θ, s_y cos θ, t_y]. We can see that, except for t_x and t_y, the polynomial parameters of the affine model are constrained within reasonable ranges, which is very helpful for avoiding over-fitting during optimization. Thus, the estimation of the affine model can be considered a constrained optimization problem, while the estimation of the mixed polynomial model is an unconstrained optimization problem.
In the initial case, the distance between images to be aligned is large. The optimization with large initial error typically suffers from an over-fitting problem. Thus, as shown in Algorithm 2, we mainly use the affine model for constrained optimization in the first round, in order to avoid the over-fitting problem and achieve coarse registration. Then, by using the result of coarse registration as an initial value, the mixed polynomial model with the optimal order is estimated iteratively to achieve fine registration.
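Since Algorithm 2 itself is not reproduced here, the following schematic shows the two-round idea on synthetic data: a bound-constrained affine fit first, then unconstrained refinement of all parameters. The low-order stand-in model, the bound values and the random data are our own illustrative assumptions, not the paper's Algorithm 2:

```python
import numpy as np
from scipy.optimize import minimize

def model(p, r):
    """Affine + order-1 polynomial enhancement (a low-order stand-in
    for AEAT): p = [theta, sx, sy, tx, ty, a1, a2, b1, b2]."""
    th, sx, sy, tx, ty, a1, a2, b1, b2 = p
    x, y = r[:, 0], r[:, 1]
    xh = sx * np.cos(th) * x - np.sin(th) * y + tx + a1 * x + a2 * y
    yh = np.sin(th) * x + sy * np.cos(th) * y + ty + b1 * x + b2 * y
    return np.stack([xh, yh], axis=1)

def energy(p, r, s, e=6.0, lam=0.02):
    """Gaussian-fields criterion with a deviation-from-identity stabilizer."""
    d2 = np.sum((model(p, r) - s) ** 2, axis=1)
    z = np.array([0, 1, 1, 0, 0, 0, 0, 0, 0], float)  # identity parameters
    return -np.sum(np.exp(-d2 / e ** 2)) + lam * np.sum((p - z) ** 2)

rng = np.random.default_rng(0)
r = rng.uniform(0, 50, (40, 2))
s = model([0.1, 1.05, 0.95, 3.0, -2.0, 0, 0, 0, 0], r)  # synthetic truth

# Round 1: constrained affine only (polynomial part frozen at zero).
bounds = [(-np.pi / 4, np.pi / 4), (0.5, 2), (0.5, 2),
          (None, None), (None, None)] + [(0, 0)] * 4
p0 = np.array([0, 1, 1, 0, 0, 0, 0, 0, 0], float)
coarse = minimize(energy, p0, args=(r, s), method="L-BFGS-B", bounds=bounds)

# Round 2: unconstrained refinement of all parameters from the coarse fit.
fine = minimize(energy, coarse.x, args=(r, s), method="L-BFGS-B")
```

The bounds in round 1 play the role of the "reasonable ranges" on the affine parameters, keeping the large-initial-error phase from over-fitting; round 2 can only improve on (never worsen) the coarse energy, which is the point of seeding it with the coarse result.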
Additionally, an image interpolation approach such as bilinear interpolation has to be employed to fill the blank areas in the images transformed by AEAT.
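One common way to realize this (an implementation detail the paper leaves open, so the backward-warping formulation below is an assumption) is to sample the source image at each output pixel's transformed location with bilinear interpolation, filling out-of-range samples with zero:

```python
import numpy as np

def warp_bilinear(img, transform, out_shape):
    """Backward warping with bilinear interpolation: every output pixel
    samples the source image at its transformed (x, y) location; samples
    falling outside the source are filled with 0 (the 'blank areas')."""
    H, W = out_shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = transform(coords)                      # output pixel -> source point
    x, y = src[:, 0], src[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    valid = (x0 >= 0) & (y0 >= 0) & (x0 < img.shape[1] - 1) & (y0 < img.shape[0] - 1)
    xf, yf = x - x0, y - y0
    x0c = np.clip(x0, 0, img.shape[1] - 2)       # safe indices for masked-out pixels
    y0c = np.clip(y0, 0, img.shape[0] - 2)
    v = (img[y0c, x0c] * (1 - xf) * (1 - yf) + img[y0c, x0c + 1] * xf * (1 - yf)
         + img[y0c + 1, x0c] * (1 - xf) * yf + img[y0c + 1, x0c + 1] * xf * yf)
    out = np.zeros(H * W)
    out[valid] = v[valid]
    return out.reshape(H, W)

img = np.arange(16.0).reshape(4, 4)
out = warp_bilinear(img, lambda p: p, (4, 4))   # identity keeps the interior
```

Note that `transform` here maps output (VIS) coordinates back into the source (IR) image, i.e. the inverse of the estimated IR-to-VIS model would be used in practice.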

Computational complexity
Let N = n_p + 3. The time costs of evaluating the objective function Equation (10) and the derivative Equation (11) are both O(KN), i.e. linear in the number of putative matches K and the model dimension N.

EXPERIMENT
Firstly, GWSC was evaluated for point matching. An ablation study was then performed on the AEAT model. Finally, our method was tested on real IR and VIS images and compared with the state-of-the-art algorithms. In this work, the algorithms were implemented in Matlab and ran on a quad-core CPU (3.9 GHz) with 4 GB RAM.

Dataset
In our experiments, we used a dataset consisting of 21 pairs of real VIS and IR images. Examples of the dataset (selected from the CVC datasets [17]) can be seen in Figure 4. The difference between the intensity distributions of the VIS and IR images raises difficulties for putative match extraction. Moreover, to make the non-rigid registration harder, we increased the degree of deformation between the VIS and IR images. Thus, GWSC-AEAT can be effectively assessed on our dataset. In addition, we manually constructed a set of point matches as ground truth for each VIS and IR image pair. Using the ground truth, metrics such as recall can be employed for quantitative evaluation. Consider an actual match (r_g, s_g) in the ground truth, and denote the transformation result of r_g as r_g^t. If the Euclidean distance ‖r_g^t − s_g‖ is less than a given threshold (e.g. 5 pixels), the actual match (r_g, s_g) is considered to be aligned correctly. Hence, we define recall as the ratio of the number of actual matches aligned correctly to the total number of actual matches in the ground truth.
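This recall definition translates directly into code; the points and transform below are toy values for illustration:

```python
import numpy as np

def recall(transform, gt_ir, gt_vis, threshold=5.0):
    """Fraction of ground-truth matches whose IR point, once
    transformed, lands within `threshold` pixels of its VIS partner."""
    d = np.linalg.norm(transform(gt_ir) - gt_vis, axis=1)
    return np.mean(d < threshold)

gt_ir = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
gt_vis = np.array([[1.0, 1.0], [10.0, 0.0], [0.0, 30.0]])
rec = recall(lambda p: p, gt_ir, gt_vis)  # identity: 2 of 3 within 5 px
```

The threshold makes recall tolerant of small residuals, so it rewards globally consistent alignment rather than exact pixel coincidence.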

Parameter setting
In our method, there are four parameters to be set: σ_s, σ_rv, e and λ. The parameters σ_s and σ_rv are used in GWSC, while e and λ are employed in the objective function Equation (10). The parameter σ_s controls the neighbourhood size of the weighted average of the SC-based similarity between two point sets, and the range of the relative distances between the two point sets is adjusted by σ_rv. e is a range parameter controlling the scale of the objective function, and λ balances the two terms of the objective function Equation (10). In this work, we set σ_s = 80, σ_rv = 300, e = 6 and λ = 0.02. The influence of the parameter settings can be seen in Figure 5. These settings were determined to be optimal through multiple experiments and are kept constant throughout the experiments.
In addition, the code of the other approaches compared with the proposed method was provided by their authors. Following the original papers, the corresponding parameters were tuned to find their optimal settings.

Evaluation of GWSC
In our work, GWSC is used for point matching in the edge maps of VIS and IR images. Thus, we did not compare GWSC with SC-based descriptors developed for shape matching, such as CDSC [13], IDSC [14] and NWSC [16]. Here, GWSC was tested on point matching of the ground truth and compared with the original SC [12] and RISC [15]. In addition, to show the differences among the various feature descriptors clearly, the Hungarian method [37] is uniformly used with SC, RISC and GWSC to achieve bipartite graph matching. Illustrations of point matching by the original SC, RISC and GWSC are shown in Figure 6, where the coloured lines indicate the point matches. Meanwhile, the quantitative comparisons on our dataset are reported in Figure 7. The total average recalls of SC and RISC are 0.5411 and 0.4046 respectively, while that of GWSC is 0.8714. The qualitative and quantitative results both show that GWSC performs well on point matching between VIS and IR images. Furthermore, compared with the original SC, GWSC improves the accuracy of point matching by 61.04%. Hence, GWSC can successfully be used for point matching between multimodal images.
In addition, we also tested the runtimes of the Hungarian method with SC, RISC and GWSC. The total average runtimes of SC and RISC are 2.33 and 2.68 min respectively, while that of GWSC is 0.47 min. This shows that the bipartite graph constructed by GWSC requires fewer augmenting paths than those of SC and RISC, which again indicates that GWSC describes the similarity between two point sets more accurately and robustly. Figure 8 shows illustrations of putative match extraction by Algorithm 1, where the coloured lines indicate the putative matches. We can see that most of the putative matches are in line with the actual situation. Therefore, on the basis of the putative matches extracted by Algorithm 1, the objective function Equation (10) is capable of quantifying the precision of VIS and IR image registration accurately.

Ablation study of the AEAT model

The AEAT model with adaptive order was compared to that with fixed orders. Figure 10 shows the total average recalls and matching errors of the AEAT models with various orders, where AEAT-2, -3, -4 and -5 represent the AEAT models with orders 2, 3, 4 and 5 respectively. From the comparison results, we can see that the AEAT model with adaptive order raises the accuracy of non-rigid registration. This demonstrates that the strategy of subsection optimization with the AEAT model improves the possibility of reaching a globally optimal solution; it is also a critical factor in helping our method make robust alignments.

Comparison with state-of-the-art methods
The proposed GWSC-AEAT was further tested and compared with CPD [33], MR-RPM [15], RGF [20], RPM-L2E [31] and SC-TPS [12], which are state-of-the-art point-feature-based registration algorithms and thus similar to GWSC-AEAT in their basic framework. To make a fair comparison, the feature point sets extracted by Algorithm 1 are used as feature points in all of these methods. The qualitative comparison can be seen in Figure 11, and the quantitative results are reported in Figure 12.
Firstly, since all of the registration results in Figure 11 are produced with the feature point sets extracted by Algorithm 1, Algorithm 1 evidently finds enough point features for the estimation of the transformation. This also demonstrates again that the feature descriptor GWSC performs well on point matching between VIS and IR images.
Secondly, it is obvious in Figures 11 and 12 that there are fewer distortions and mismatches in the registration results of GWSC-AEAT. Moreover, our method is more accurate on global image registration and has better recall curves in most cases. Meanwhile, this proves that the AEAT model with adaptive order has higher generalization performance than the non-rigid transformation models used in the other approaches.
Thirdly, in [15], MR-RPM shows better performance on point set registration than SC-TPS, RGF and CPD, but it fails in image registration in this work. This is caused by the difference between point set registration and image registration: the transformation model aligning two point sets is estimated from, and evaluated on, those points alone, whereas the transformation model estimated from feature point sets must align all pixels of the images. Hence, the transformation model used for image registration needs to represent the global deformation pattern between the images to be aligned. Based on the above analyses, the failure of MR-RPM on image registration proves that the transformation model estimated by MR-RPM cannot describe the global deformation between the images to be aligned. Meanwhile, the comparison results demonstrate that the deformation pattern described by the AEAT model has the highest accuracy and can be used successfully for global image registration.

FIGURE 10 The quantitative results of the registration by the AEAT models with various orders. The numbers in the legend are the total average matching errors.
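The distinction drawn above can be made concrete: for image registration, a transformation should be scored not only on the sparse matches it was fitted from but on a dense pixel grid. A minimal sketch of such a check (the function name, grid step and affine test mapping are illustrative assumptions):

```python
import numpy as np

def dense_generalization_error(model, true_map, h, w, step=8):
    """Mean displacement error of `model` against the true mapping,
    evaluated on a dense pixel grid rather than on the fitted matches."""
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    return float(np.mean(np.linalg.norm(model(grid) - true_map(grid), axis=1)))
```

A model that merely interpolates the feature points can still score badly here, which is exactly the failure mode described for MR-RPM.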

Runtimes
The runtime of GWSC-AEAT was tested on real images. Image registration by GWSC-AEAT consists of three steps: putative match extraction by GWSC, estimation of the AEAT model and global image transformation; hence, the runtime of our method can be divided into three parts. When the Hungarian method with GWSC is used to extract putative matches from two point sets, each containing 1000 edge points extracted from the VIS and IR images, the average computation time is about 10 min. Under the same conditions, the average runtime of Algorithm 1 is about 1.31 min. Compared with the Hungarian method, steps 3-5 of Algorithm 1 reduce the runtime by 86.9%. Undoubtedly, the robustness of GWSC is the internal factor behind this improvement in extraction speed. Since RGF has the second-best registration accuracy, our method was compared with RGF on the runtimes of transformation estimation and image transformation. On our dataset, the total average runtime of the estimation of the AEAT model is about 5.91 s, while that of transformation estimation in RGF is about 36.81 s. These results agree well with the analysis of computational complexity in Section 4.3, proving that the set of putative matches greatly reduces the time cost of evaluating the objective function.
The total average runtime of image transformation by the AEAT model is about 0.0544 s, while that of global image transformation in RGF is about 0.0483 s. According to the analysis of computational complexity in Section 4.3 and [20], the time cost of image transformation is mainly determined by the length of the transformation parameter vector. In the AEAT model, the lengths of the parameter vectors c(2) and c(3) are 15 and 23 respectively. In this experiment, the number of control points used in RGF is set to 15, so the length of its transformation parameter vector is 15. Thus, the global image transformation of our method is slightly slower than that of RGF.

FIGURE 12 The quantitative results of SC-TPS, RPM-L2E, RGF, MR-RPM, CPD and our method (GWSC-AEAT). The numbers in the legend are the average matching errors.

FIGURE 13 IR images with low resolution from the TNO dataset and the work [40].
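The dependence of image-transformation cost on the length of the parameter vector can be seen in a simple backward-mapping warp: one basis column per parameter is evaluated at every pixel, so runtime grows with the parameter count. The following nearest-neighbour sketch uses a generic polynomial model, not the paper's exact AEAT implementation; the function name and basis ordering are assumptions.

```python
import numpy as np

def warp_image(img, C, order):
    """Backward-map every target pixel through a polynomial model and
    sample the source image with nearest-neighbour interpolation."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    # basis evaluation dominates the cost: one column per model parameter
    feats = np.column_stack([pts[:, 0]**i * pts[:, 1]**j
                             for i in range(order + 1)
                             for j in range(order + 1 - i)])
    src = np.rint(feats @ C).astype(int)  # mapped (x, y) source coordinates
    out = np.zeros_like(img)
    inside = ((src[:, 0] >= 0) & (src[:, 0] < w) &
              (src[:, 1] >= 0) & (src[:, 1] < h))
    out.reshape(-1)[inside] = img[src[inside, 1], src[inside, 0]]
    return out
```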
Overall, the average runtime of non-rigid image registration by GWSC-AEAT on our dataset is about 1.5 min.

CONCLUSION AND DISCUSSION
In this paper, GWSC-AEAT is proposed for VIS and IR image registration. First, the feature descriptor GWSC is developed to measure the structural similarity between multimodal images and to extract putative matches from their edge maps. Then, the AEAT model is designed to represent the deformation regularity between VIS and IR images. Finally, the strategy of subsection optimization is employed to efficiently determine the optimal AEAT model from the putative matches. The experimental results demonstrate that GWSC-AEAT outperforms the state-of-the-art approaches and has the advantages of high precision and low cost. Admittedly, although GWSC-AEAT performs excellently in our experiments evaluated from many aspects, it might still fail on some particular images since it relies on image edges as features. When low-resolution IR images have few and blurred detail textures, as in Figure 13, GWSC becomes invalid, making it difficult to find enough correct putative matches for the estimation of the AEAT model; this is a common problem of feature-based registration approaches [11].
One idea to improve this situation is to perform semantic segmentation on the IR and VIS images to be aligned, which would provide robust features for putative match extraction and improve the efficiency of mutual feature extraction from low-resolution images. Fortunately, many excellent methods [41][42][43] have been proposed for foreground object extraction in IR and VIS images. Hence, in the future, we will mainly focus on improving GWSC-AEAT via semantic segmentation techniques.