Multi-focus image fusion via morphological similarity-based dictionary construction and sparse representation

Abstract: Sparse representation has been widely applied to multi-focus image fusion in recent years. As a key step, the construction of an informative dictionary directly decides the performance of sparsity-based image fusion. To obtain sufficient bases for dictionary learning, different geometric information of the source images is extracted and analysed. The classified image bases are used to build the corresponding subdictionaries by principal component analysis, and all built subdictionaries are merged into one informative dictionary. Based on the constructed dictionary, the compressive sampling matched pursuit algorithm is used to extract the corresponding sparse coefficients for the representation of the source images. The obtained sparse coefficients are first fused by the Max-L1 fusion rule, and then inverted to form the final fused image. Multiple comparative experiments demonstrate that the proposed method is competitive with other state-of-the-art fusion methods.


Introduction
Cloud computing provides powerful computation resources to process various images [1][2][3][4]. Owing to the limited depth-of-focus of optical lenses, blurred objects often appear in captured images, and it is difficult to capture an all-in-focus image of a scene [5,6]. With the development of image processing techniques, multi-focus image fusion is widely used to combine complementary information from multiple out-of-focus images.
In the past decade, multi-focus image fusion has been a hot research topic, and many related methods have been proposed and implemented [5][6][7][8]. Multi-scale transform (MST)-based methods are widely used in multi-focus image fusion.
Among image pixel transformation methods, the wavelet transform [9,10], shearlet [11,12], curvelet [13], dual-tree complex wavelet transform [14,15], and non-subsampled contourlet transform (NSCT) [16] are commonly used to represent image features in MST-based methods. In the transformation process, image features are resolved into MST bases and coefficients. The fusion process of MST-based methods consists of two steps: the fusion of coefficients, and the inverse transformation of the fused coefficients. Different methods have their own focuses; thus, it is difficult to represent all features of the source images with a single method.
In recent years, sparse representation (SR)-based image fusion has addressed the limitations of MST-based methods. The fusion process of SR-based methods is similar to that of MST-based methods; however, SR-based methods usually use a trained dictionary to adaptively represent image features. Thus, SR-based methods can better describe the detailed information of images and reinforce the effect of the fused image. As the most commonly used approach, the KSVD algorithm (a generalisation of K-means clustering based on singular value decomposition) is applied to SR-based image fusion [17][18][19]. Yin et al. [17] proposed a KSVD-based dictionary learning method and a hybrid fusion rule to improve the quality of multi-focus image fusion. Nejati et al. [18] also proposed a multi-focus image fusion method based on KSVD; they optimised the learning process of KSVD to enhance the performance of multi-focus image fusion. According to image features, Kim et al. [20] proposed a clustering-based dictionary learning method to train a more informative dictionary; the trained dictionary better describes image features and improves the fusion performance. Zhang introduced a non-negative SR model to improve the detail performance in image fusion [21]. Li et al. [22] proposed a group dictionary learning method to extract different features from different image feature groups, which improves the accuracy of SR-based fusion and obtains a better fusion effect. Ibrahim et al. [23] used robust principal component analysis to build a compact dictionary for SR-based multi-focus image fusion. The SR-based methods introduced above have shown state-of-the-art performance in multi-focus image fusion in recent years. However, they do not consider the morphological information of image features in their dictionary learning processes.
This paper analyses the morphological information of source images for dictionary learning. Based on morphological similarity, different types of image information are processed separately to increase the accuracy of SR-based dictionary learning. Geometric information, such as edge and sharp line information, is extracted from source image blocks and classified into different image-block groups to construct the corresponding dictionaries by sparse coding.
There are two main contributions in this paper: (i) Morphological information of source images is classified into different image patch groups to train the corresponding dictionaries, respectively. Each classified image patch group contains more detailed morphological information of the source images.
(ii) It proposes a principal component analysis (PCA)-based method to construct an informative and compact dictionary. The PCA method is employed to reduce the dimension of each image patch group and obtain informative image bases. The informative trained dictionary not only ensures an accurate description of the source images, but also decreases the computation cost of SR.
2 SR-based image fusion framework

Dictionary learning in image fusion
For dictionary learning, it is important to build an over-complete dictionary that not only has a relatively small size, but also contains the key information of the source images. KSVD [19], online dictionary learning [24], and stochastic gradient descent [25] are popular dictionary learning methods. This paper applies PCA to dictionary learning, and the dictionary learned by PCA is compared with the corresponding dictionary of KSVD to show the advantages of the PCA-based solution. A good over-complete dictionary is important for SR-based image fusion; unfortunately, it is difficult to obtain one that is both small and informative. In Aharon's solution [26], KSVD was proposed to train on source image patches and adaptively update the corresponding dictionary by SVD operations to obtain an over-complete dictionary. During dictionary learning, KSVD extracts image patches globally and adaptively trains the dictionary.
The clustering-based dictionary learning solution was first introduced to image fusion by Kim et al. [20]. Based on local structure information, similar patches from different source images were clustered, and a subdictionary was built by analysing a few principal components of each cluster. To describe the structure information of the source images effectively, the method combines the learned subdictionaries to obtain a compact and informative dictionary.

Construction of geometric similarity-based dictionary
Smooth, stochastic, and dominant orientation patches, the three geometric types used in single image super-resolution (SISR) [27][28][29], are used to classify the source images, and describe structure, texture, and edge information, respectively. Three subdictionaries are learned from the corresponding image patches. The PCA method is used to extract only the important information from each cluster to obtain a compact and informative subdictionary. All learned subdictionaries are combined to form a compact and informative dictionary for image fusion [20,30,31]. Fig. 1 shows the proposed two-step geometric solution. First, the input source images $I_1$ to $I_k$ are split into several small image blocks $p_{i,n}$, $i \in \{1, 2, \dots, k\}$, $n \in \{1, 2, \dots, w\}$, where $i$ is the source image number, $n$ the patch number, and $w$ the total number of blocks in each input image. The obtained image blocks are classified into the smooth, stochastic, and dominant orientation patch groups on the basis of geometric similarity. Then, PCA is applied to each group to extract the corresponding bases and obtain a subdictionary. All obtained subdictionaries are combined to form a complete dictionary for instructing the image SR.

Geometric structure-based image patch classification
According to the classified smooth, stochastic, and dominant orientation image patches, more detailed image information can be further analysed. The out-of-focus areas are usually smooth and contain smooth image patches, while the focused areas usually have sharp edges and contain dominant orientation patches. Besides these, many stochastic image patches exist in the source images. More detailed information can be obtained when dictionary learning is applied separately to the three different types of image patches, which is an efficient way to enhance the accuracy of describing the source images.
In this paper, the source images are first classified into different image-patch groups by the proposed geometry-based method. Then the corresponding subdictionaries are obtained from the classified image-patch groups.
The input images are first divided into $\sqrt{w} \times \sqrt{w}$ small image blocks $P_I = (p_1, p_2, \dots, p_n)$. Each image patch $p_i$ is then reshaped into a $w \times 1$ vector $v_i$. Based on the obtained vectors, the variance $C_i$ of the pixels in each image vector can be calculated. The threshold $\delta$ is a key parameter used to evaluate whether an image block is smooth: if $C_i < \delta$, image block $p_i$ is smooth; otherwise, it is not [27].
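To make the variance test concrete, the following is a minimal Python/NumPy sketch. The patch size and the threshold value `delta` are illustrative assumptions; the paper treats $\delta$ as a tunable parameter and does not fix these values.

```python
import numpy as np

def split_smooth_patches(image, patch=8, delta=20.0):
    """Divide a grey-level image into non-overlapping patches and apply the
    variance test: a patch p_i is smooth when var(p_i) < delta.
    `patch` and `delta` are illustrative values, not taken from the paper."""
    h, w = image.shape
    smooth, non_smooth = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            v = image[r:r + patch, c:c + patch].astype(np.float64).ravel()
            (smooth if v.var() < delta else non_smooth).append(v)
    return smooth, non_smooth
```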
Generally, the classified smooth patches share similar structural information of the source images. Non-smooth patches usually differ from one another and need to be classified into further groups.
For multi-focus images, the smooth blocks include not only the originally smooth areas, but also the out-of-focus areas of the source images. As shown in Figs. 2a-c, the image regions and blocks in the orange and blue rectangle frames are the originally smooth and out-of-focus image regions, respectively. Originally smooth regions are smooth in both the focused and out-of-focus areas, and the pixels within their image blocks differ little. In the blue frames of Fig. 2c, the out-of-focus edge blocks and number blocks are smoothed, and the variance of these image patches is therefore small. In Figs. 2a, b, and d, the focused characters, numbers, and object edges are framed by the red rectangles. As shown in Fig. 2d, the focused blocks with sharp edges are clustered into the detail cluster of blocks.
Stochastic and dominant orientation patches both belong to the non-smooth patches and are distinguished by their geometric patterns. The proposed solution takes two steps to separate stochastic and dominant orientation patches from the source images. First, it calculates the gradient of each pixel. In every image vector $v_i$, $i \in \{1, 2, \dots, n\}$, the gradient of each pixel $k_{ij}$, $j \in \{1, 2, \dots, w\}$, is composed of its $x$ and $y$ coordinate gradients $g_{ij}(x)$ and $g_{ij}(y)$. The gradient value of each pixel $k_{ij}$ in image patch $v_i$ is $g_{ij} = (g_{ij}(x), g_{ij}(y))$, where $g_{ij}(x) = \partial k_{ij}(x, y)/\partial x$ and $g_{ij}(y) = \partial k_{ij}(x, y)/\partial y$. For each image vector $v_i$, the gradient matrix $G_i$ is

$G_i = [g_{i1}, g_{i2}, \dots, g_{iw}]^T = U_i S_i V_i^T$

where $U_i S_i V_i^T$ is the singular value decomposition of $G_i$. As a diagonal $2 \times 2$ matrix, $S_i = \mathrm{diag}(s_1, s_2)$ represents the energy in the dominant directions [32]. Based on the obtained $S_i$, the dominance measure $R$ can be calculated by

$R = \frac{s_1 - s_2}{s_1 + s_2}$

The smaller $R$ is, the more stochastic the corresponding image patch [33]. To distinguish stochastic and dominant orientation patches, a probability density function (PDF) of $R$ is estimated to obtain the corresponding threshold $R^*$ [34]. $P(R)$ converges to zero as the value of $R$ increases, and the value of $R$ at which $P(R)$ first reaches zero is used as the threshold $R^*$ that distinguishes stochastic from dominant orientation patches in a PDF significance test [34].
Those image patches whose $R$ is smaller than $R^*$ are treated as stochastic patches. In this way, the proposed method separates stochastic from dominant orientation patches. Texture and detailed information is contained in the stochastic image patches, while the dominant orientation image patches contain edge information.
According to the direction information, dominant orientation image patches can be further classified into horizontal and vertical patch groups. The dominant direction of the gradient field, given by the first right-singular vector $v_1$ of $G_i$, is used to estimate the direction $d$ of a dominant orientation image patch:

$d = \arctan\left(\frac{v_1(y)}{v_1(x)}\right)$ (4)

In (4), when $d$ is close to 0° or ±90°, the corresponding image patch is clustered into the horizontal or vertical patch group, respectively.
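The following sketch combines the two classification steps above. The SVD-based dominance measure follows the definitions just given, while the threshold value `r_star` and the 45° decision band for the horizontal/vertical split are illustrative assumptions (the paper derives $R^*$ from the PDF of $R$).

```python
import numpy as np

def classify_non_smooth(v, patch=8, r_star=0.35):
    """Classify a non-smooth patch vector v (length patch*patch) as
    'stochastic', 'horizontal', or 'vertical'."""
    block = v.reshape(patch, patch)
    gy, gx = np.gradient(block)                     # per-pixel gradients
    G = np.column_stack([gx.ravel(), gy.ravel()])   # w x 2 gradient matrix G_i
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    R = (s[0] - s[1]) / (s[0] + s[1] + 1e-12)       # dominance measure
    if R < r_star:                                  # small R -> stochastic
        return 'stochastic'
    d = np.degrees(np.arctan2(Vt[0, 1], Vt[0, 0]))  # direction of v_1, eq. (4)
    d = abs(d) % 180                                # fold into [0, 180)
    return 'vertical' if 45 <= d < 135 else 'horizontal'
```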

Dictionary construction by PCA
After the geometric similarity-based classification, the principal components of each group are used to train a compact and informative dictionary. Since a small number of PCA bases can represent the image patches in the same geometric group well, a subdictionary is obtained from the top $m$ most informative principal components [35]. All subdictionaries $D_1, D_2, \dots, D_n$ are combined to form a full dictionary $D = [D_1, D_2, \dots, D_n]$.
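A minimal sketch of this construction, assuming each group is given as a list of $w$-dimensional patch vectors and that `m` (the number of retained components) is an illustrative choice:

```python
import numpy as np

def pca_subdictionary(group, m=16):
    """Top-m principal components of one geometric patch group.
    Columns of the returned w x m matrix are the subdictionary atoms."""
    X = np.stack(group, axis=1).astype(np.float64)   # w x n patch matrix
    X -= X.mean(axis=1, keepdims=True)               # centre the patches
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]                                  # orthonormal PCA bases

def build_dictionary(groups, m=16):
    """D = [D_1, D_2, ..., D_n]: concatenate all subdictionaries."""
    return np.concatenate([pca_subdictionary(g, m) for g in groups], axis=1)
```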

Fig. 3 shows the proposed two-step fusion scheme. First, compressive sampling matched pursuit (CoSaMP) is employed for sparse coding. Each input image $I_i$ is split into $n$ image patches of size $\sqrt{w} \times \sqrt{w}$, and these image patches are reshaped into $w \times 1$ vectors $p_1^i, p_2^i, \dots, p_n^i$. According to the trained dictionary, CoSaMP sparsely codes the reshaped vectors into sparse coefficients $z_1^i, z_2^i, \dots, z_n^i$. CoSaMP improves on the orthogonal matching pursuit algorithm; since CoSaMP only performs matrix-vector multiplications for sparse coding, it is more efficient in practice.

Algorithm 1: CoSaMP
Input: the CS observation $y$, the sampling matrix $\Phi$, and the sparsity level $K$.
Output: a $K$-sparse approximation $x$ of the target signal.
1: Initialisation: $x^0 = 0$ ($x^J$ is the estimate of $x$ at the $J$th iteration) and $r = y$ (the current residual).
2: Iterate until convergence:
3: Compute the current error $e = \Phi^* r$ (note that for Gaussian $\Phi$, $\Phi^T \Phi$ is diagonal).
4: Compute the best $2K$ support set $\Omega$ of the error (index set).
5: Merge the strongest support sets, $T = \Omega \cup \mathrm{supp}(x^{J-1})$, and perform a least-squares signal estimation: $b|_T = \Phi_T^{\circledast} y$, $b|_{T^c} = 0$.
6: Prune: keep the $K$ largest entries of $b$ as $x^J$, and compute the residual for the next round: $r = y - \Phi x^J$.
In Algorithm 1, $\Phi^*$ is the Hermitian transpose of $\Phi$, $\Phi^{\circledast}$ represents the pseudo-inverse of $\Phi$, $T$ is the merged support set, and $T^c$ indicates the complement of set $T$.
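A Python/NumPy sketch of Algorithm 1, assuming the trained dictionary plays the role of the sampling matrix $\Phi$; the iteration cap and tolerance are illustrative choices.

```python
import numpy as np

def cosamp(Phi, y, K, max_iter=50, tol=1e-6):
    """Recover a K-sparse coefficient vector x with y ~ Phi @ x."""
    m, n = Phi.shape
    x = np.zeros(n)
    r = y.astype(np.float64).copy()                  # current residual
    for _ in range(max_iter):
        e = Phi.T @ r                                # signal proxy (step 3)
        omega = np.argsort(np.abs(e))[-2 * K:]       # best 2K support (step 4)
        T = np.union1d(omega, np.flatnonzero(x))     # merge supports (step 5)
        b = np.zeros(n)
        b[T] = np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]  # LS estimate
        x = np.zeros(n)
        top_k = np.argsort(np.abs(b))[-K:]           # prune to K entries (step 6)
        x[top_k] = b[top_k]
        r = y - Phi @ x                              # update residual
        if np.linalg.norm(r) < tol:
            break
    return x
```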
Second, the Max-L1 fusion rule is applied to the fusion of the sparse coefficients [36,37]. Equation (10) shows the Max-L1 rule for merging two sparse coefficient vectors $z_A^i$ and $z_B^i$:

$z_F^i = \begin{cases} z_A^i, & \|z_A^i\|_1 \geq \|z_B^i\|_1 \\ z_B^i, & \text{otherwise} \end{cases}$ (10)

Based on the trained dictionary, the fused image is obtained by the inversion of the corresponding fused coefficients.
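A sketch of the Max-L1 rule of (10), assuming the per-patch sparse codes of the two source images are stored as columns of `Z_A` and `Z_B`:

```python
import numpy as np

def max_l1_fuse(Z_A, Z_B):
    """Per patch (column), keep the code with the larger l1-norm."""
    keep_a = np.abs(Z_A).sum(axis=0) >= np.abs(Z_B).sum(axis=0)
    return np.where(keep_a, Z_A, Z_B)   # broadcast over rows

# The fused patches are then reconstructed as D @ Z_F and reshaped back
# into sqrt(w) x sqrt(w) blocks to form the fused image.
```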

Entropy:
The entropy of an image reflects its information content; a higher entropy value means the image is more informative. The entropy of an image is defined as

$E = -\sum_{l=0}^{L-1} P_l \log_2 P_l$

where $L$ is the number of grey levels and $P_l$ is the ratio between the number of pixels with grey value $l$ and the total number of pixels.
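A direct NumPy implementation of this definition, assuming an 8-bit grey-level image:

```python
import numpy as np

def entropy(image, levels=256):
    """E = -sum_l P_l * log2(P_l), with P_l the grey-level frequencies."""
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                 # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())
```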

Mutual information:
The MI metric measures the mutual information between the source images and the fused image. The MI between images $A$ and $F$ can be formalised as

$MI(A, F) = \sum_{i=1}^{L} \sum_{j=1}^{L} h_{A,F}(i, j) \log_2 \frac{h_{A,F}(i, j)}{h_A(i)\, h_F(j)}$

where $L$ is the number of grey levels, $h_{A,F}(i, j)$ is the normalised joint grey-level histogram of images $A$ and $F$, and $h_A(i)$ and $h_F(j)$ are the normalised marginal histograms of images $A$ and $F$. The overall metric is given by (13):

$MI = MI(A, F) + MI(B, F)$ (13)

where $MI(A, F)$ represents the MI value between input image $A$ and fused image $F$, and $MI(B, F)$ the MI value between input image $B$ and fused image $F$.
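A sketch of $MI(A, F)$ computed from the normalised joint histogram, assuming 8-bit images; $MI(B, F)$ is obtained the same way and the two are summed as in (13):

```python
import numpy as np

def mutual_information(a, f, levels=256):
    """MI(A, F) from the joint histogram h_AF and marginals h_A, h_F."""
    h_af, _, _ = np.histogram2d(a.ravel(), f.ravel(),
                                bins=levels, range=[[0, levels]] * 2)
    p_af = h_af / h_af.sum()                     # joint distribution
    p_a = p_af.sum(axis=1, keepdims=True)        # marginal of A
    p_f = p_af.sum(axis=0, keepdims=True)        # marginal of F
    nz = p_af > 0
    return float((p_af[nz] * np.log2(p_af[nz] / (p_a * p_f)[nz])).sum())

# fusion metric (13): mi = mutual_information(A, F) + mutual_information(B, F)
```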

$Q^{AB/F}$:
The $Q^{AB/F}$ metric, a gradient-based quality index, measures how well the edge information of the source images is preserved in the fused image [42], and can be calculated by

$Q^{AB/F} = \frac{\sum_{i,j} \left[ Q^{AF}(i, j)\, w_A(i, j) + Q^{BF}(i, j)\, w_B(i, j) \right]}{\sum_{i,j} \left[ w_A(i, j) + w_B(i, j) \right]}$

where $Q^{AF} = Q_g^{AF} Q_o^{AF}$, with $Q_g^{AF}$ and $Q_o^{AF}$ the edge strength and orientation preservation values at location $(i, j)$. $Q^{BF}$ can be computed similarly to $Q^{AF}$. $w_A(i, j)$ and $w_B(i, j)$ are the importance weights of $Q^{AF}$ and $Q^{BF}$, respectively.
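A compact sketch following the commonly cited Xydeas-Petrović formulation of this metric; the Sobel gradients, the sigmoid constants, and the choice of the edge strength itself as the weight $w_X$ are standard values from that literature rather than details given in this paper.

```python
import numpy as np
from scipy.ndimage import sobel

def _edge_preservation(src, fus):
    """Per-pixel Q^{XF} = Q_g * Q_o between a source and the fused image."""
    sx, sy = sobel(src, 1), sobel(src, 0)
    fx, fy = sobel(fus, 1), sobel(fus, 0)
    g_s, g_f = np.hypot(sx, sy), np.hypot(fx, fy)
    a_s = np.arctan(sy / (sx + 1e-12))
    a_f = np.arctan(fy / (fx + 1e-12))
    G = np.where(g_s > g_f, g_f / (g_s + 1e-12), g_s / (g_f + 1e-12))
    A = 1 - np.abs(a_s - a_f) / (np.pi / 2)
    Qg = 0.9994 / (1 + np.exp(-15 * (G - 0.5)))     # strength preservation
    Qo = 0.9879 / (1 + np.exp(-22 * (A - 0.8)))     # orientation preservation
    return Qg * Qo, g_s                             # weights w_X = edge strength

def q_abf(a, b, f):
    q_af, w_a = _edge_preservation(a.astype(float), f.astype(float))
    q_bf, w_b = _edge_preservation(b.astype(float), f.astype(float))
    return float((q_af * w_a + q_bf * w_b).sum() / (w_a + w_b).sum())
```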

Visual information fidelity:
To evaluate the VIF of the fused image, the average of the VIF values between each input image and the fused image is used [45]. Equation (16) shows the evaluation function of VIF for image fusion:

$VIF_F = \frac{1}{2}\left[ VIF(A, F) + VIF(B, F) \right]$ (16)

where $VIF(A, F)$ is the VIF value between input image $A$ and fused image $F$, and $VIF(B, F)$ the VIF value between input image $B$ and fused image $F$.

Grey-level image fusion
To test the efficiency of the proposed method, ten of the most commonly used multi-focus grey-level image pairs are used for testing. Two sample groups of fused grey-level images, with sizes of 256 × 256 and 320 × 240, are picked for demonstration in Figs. 5 and 6. The difference images of each fused image are also shown in Figs. 5 and 6; they show the difference between the fused images and the source images. In multi-focus image fusion, a clearer focused area in the difference images indicates better fusion performance. The difference image can be obtained by

$I_d = I_F - I_s$

where $I_d$ represents the difference image, $I_F$ is the fused image, and $I_s$ is a source multi-focus image. Figs. 5 and 6 show the fused images of similar comparison experiments; only Fig. 5 is analysed here. The source multi-focus images of 'two clocks' are shown in Figs. 5a and b, respectively. To show the details of the fused images, two specific image blocks, which show the numbers on the clocks, are highlighted and magnified in red and blue squares, respectively. In Fig. 5a, the highlighted image block in the red square is focused, and the highlighted part in the blue square is out of focus. In Fig. 5b, the corresponding image block in the red square is out of focus, and the corresponding image block in the blue square is focused. Figs. 5c-e are the fused images of KSVD, JCPD, and the proposed method, respectively. Figs. 5f-h are the difference images between the source image (Fig. 5a) and the fused images of KSVD, JCPD, and the proposed method, respectively. Similarly, Figs. 5i-k show the corresponding difference images between the source image (Fig. 5b) and the three fused images, respectively. The two specified image blocks are focused in all three fused images. However, it is difficult to differentiate the slight differences among the fused images and evaluate the corresponding fusion performance of each method visually.
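A sketch of the difference-image computation; shifting by mid-grey for display is a common visualisation convention and an assumption here, not a step specified by the paper.

```python
import numpy as np

def difference_image(fused, source):
    """I_d = I_F - I_s, offset to mid-grey so that zero difference
    renders as neutral grey when displayed as an 8-bit image."""
    d = fused.astype(np.int16) - source.astype(np.int16)
    return np.clip(d + 128, 0, 255).astype(np.uint8)
```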
To objectively evaluate the fusion performance, entropy, $Q^{AB/F}$, MI, and VIF are used as image fusion quality measures. Table 1 shows the average objective evaluation results of the ten grey-level multi-focus image pairs using the three different methods, with the best result for each metric highlighted in bold. The proposed method achieves the best performance in all four objective evaluations; the JCPD method only matches the proposed method in entropy.

Colour multi-focus image fusion
To compare the proposed method with KSVD and JCPD, 20 colour image pairs from the Lytro dataset are used to test the fusion results. For visual evaluation, two sample groups of fused colour images obtained by the three different methods are chosen for demonstration. It is difficult to distinguish the fused images of the three methods visually. Therefore, entropy, $Q^{AB/F}$, MI, and VIF are again used as quality measures to evaluate the fusion performance objectively. The average quantitative fusion results of the 20 colour multi-focus image pairs using the three different methods are shown in Table 2, with the best result for each evaluation metric highlighted in bold. The proposed method reaches the best performance in all four evaluation metrics. In particular, the proposed method achieves a higher $Q^{AB/F}$ result than the other two methods. As a gradient-based quality metric, $Q^{AB/F}$ measures the preservation of edge information in the fused image; this confirms that the proposed method can obtain fused images with better edge information.

Processing time comparison
The proposed method has the best overall processing-time performance among the three methods, and JCPD performs better than KSVD. Fig. 10a shows the time comparison for the 320 × 240 and 256 × 256 grey-level image fusion, and Fig. 10b shows the time comparison for the 520 × 520 colour image fusion. From Figs. 10a and b, it can easily be seen that the proposed method has lower computation costs than the two compared methods. Table 3 compares the processing times of the 320 × 240 and 256 × 256 grey-level image fusion and the 520 × 520 colour image fusion. The proposed solution has lower computation costs than KSVD and JCPD in the image fusion process, and as the size of the source images increases, the processing times verify that the proposed solution performs much better than the two compared methods. Compared with KSVD, the dictionary construction of the proposed solution is more efficient, since it does not use any iterative procedure to extract the underlying information of the images. Although JCPD and the proposed solution both cluster image pixels or patches based on geometric similarity, the proposed solution does not use steering kernel regression in dictionary construction as JCPD does, which is iterative and time-consuming.
Additionally, as the proposed method shows strong image fusion performance on both grey-level and colour multi-focus images of different sizes, it can be inferred that the proposed method is robust to image size and colour space.

Effectiveness discussion
The dictionary obtained by the proposed approach is more compact than those of existing methods, and the proposed solution takes less fusion time. Although the proposed method is only slightly better than the compared approaches in the objective evaluations, the results of the comparison experiments verify that the proposed solution obtains high-quality fused images with high efficiency.
For each objective evaluation, this paper compares the increasing rate of the proposed solution with those of four SR-based methods published in mainstream journals in 2015-2016. The experimental environment of each compared method is different and not all compared methods publish their source code, so it is difficult to compare each objective evaluation using the same standard. This paper can only compare the relative increasing rate among all approaches; the relative increasing rate is the difference between each proposed solution and the second-best solution in the same paper. Table 4 shows the comparison results. The analysis of each objective evaluation is as follows: (i) Entropy: The proposed solution increases entropy by 0.3%, and Yin's solution did not improve entropy. The other three solutions did not compare entropy.
(ii) $Q^{AB/F}$: The proposed solution increases $Q^{AB/F}$ by 3.4%. The increasing rate of the proposed solution is greater than that of Kim's solution, but less than that of Nejati's solution.
(iii) MI: The proposed solution has the best increasing rate of 4.2%.
(iv) VIF: The increasing rate of the proposed solution is 0.8%, which is greater than that of Nejati's solution, but less than that of Kim's solution.
Compared with the four existing solutions, the relative increasing rate of the proposed solution is convincing. The increasing rate varies from 0 to 4.3% across all compared solutions; it is normal and reasonable that most proposed solutions only slightly improve on existing solutions in image fusion.

Conclusion
Based on the geometric information of images, an SR-based image fusion framework is proposed in this paper. The geometric similarities of the source images, in the form of smooth, stochastic, and dominant orientation image patches, are analysed, and the corresponding image patches are classified into different image patch groups. PCA is applied to each image patch group to extract the key bases for constructing the corresponding compact and informative subdictionary, and all obtained subdictionaries are combined into a full trained dictionary. Based on the trained dictionary, the source image patches are sparsely coded into coefficients. During the image processing, the image block size is adaptively chosen and the optimal coefficients are selected, so more edge and corner details can be retained in the fused image. The Max-L1 rule is applied to fuse the sparsely coded coefficients, after which the fused coefficients are inverted to obtain the final fused image. Two existing mainstream SR-based methods, KSVD and JCPD, are compared with the proposed solution in comparative experiments. According to the subjective and objective assessments, the fused images of the proposed solution have better quality than those of the existing solutions in edge, corner, structure, and detailed information.