A probabilistic collaborative dictionary learning-based approach for face recognition

Although the Sparse Representation based Classifier (SRC), a non-parametric model, achieves interesting results in pattern recognition, a reasonable interpretation of its classification mechanism has been lacking. Moreover, SRC uses the training samples directly as an off-the-shelf dictionary, which makes the features hidden in the training samples hard to extract and increases the complexity of the algorithm because the dictionary contains too many atoms. The authors first explain in detail the classification mechanism of SRC from the view of the probabilistic collaborative subspace and describe how to improve the stability of the algorithm using the joint probability in the multi-subspace case. Then, the authors introduce dictionary learning (DL) and the Fisher criterion into the model to further enhance the discrimination of the coding coefficients. To ensure the convexity of the discrimination term and further enhance discrimination, the authors add an $L_{2,1}$-norm term to the Fisher discrimination term and prove its convexity. Finally, experimental results on a series of benchmark databases, such as AR, Extended Yale B, LFW3D-hassner, LFW3D-sdm and LFW3D-Dlib, show that PCDDL outperforms existing classical classification models.


INTRODUCTION
Recently, sparse representation techniques have led to interesting results in pattern classification tasks such as face recognition and handwritten digit recognition. The success of the sparse representation-based classifier should be attributed to the fact that a high-dimensional image can be represented, or coded, by some representative samples from the same class in a low-dimensional manifold, together with recent progress in $L_0$-norm and $L_1$-norm minimization techniques. In general, sparse representation schemes are divided into two categories. The first comprises non-parametric methods, such as SRC [10,16], which predict the query samples using the training samples directly and do not need to train a parametric model. The second, on the contrary, comprises parametric methods, such as dictionary learning (DL) [24], which aim to train a parametric model used for predicting the testing samples.
For the non-parametric model, distance-based classifiers are widely used in a series of vision recognition tasks (e.g. the nearest subspace classifier [25]). However, a distance-based non-parametric model depends heavily on the pre-defined distance or similarity metric. The commonly used metrics (e.g. Euclidean distance, manifold distance and principal angle-based correlation) can intuitively describe the variations among samples, but they cannot accurately reflect the intrinsic similarity among objects. To better characterize the similarity, one appropriate choice is to introduce indeterminacy into the output of the classifier, as in probabilistic Support Vector Machines (SVM) [17-19]. An alternative to the probabilistic SVM is the probabilistic subspace method, such as probabilistic principal component analysis [20,22] and probabilistic linear discriminant analysis [21], which reformulate the subspace method as a latent variable model and optimize the parameters by maximum likelihood estimation.
Besides, a critical problem for distance-based non-parametric classifiers is how to characterize the testing samples. In SRC, a linear combination of training samples is used to represent the testing sample, and an $L_1$-norm regularization term is employed to impose a sparsity constraint. Inspired by the probabilistic subspace methods [1], this paper analyses the intrinsic mechanism of SRC and proposes a probabilistic collaborative subspace-based sparse representation classifier model. Though SRC shows promising performance in face recognition tasks, the dictionary used in it may not be effective enough to characterize the testing samples because of the potential indeterminacy and noise in the training samples. What is more, using the original training samples as the dictionary makes it hard to extract the hidden discriminative information [39] in the training samples. Likewise, off-the-shelf bases [14] such as Haar wavelets and Gabor wavelets may be universal to all kinds of images but are not appropriate enough for images of specific types. Fortunately, all the problems mentioned above can be addressed by learning a proper parametric dictionary from the training samples. Hence, Yang et al. [6] propose that more promising performance can be obtained by learning a class sub-dictionary from the samples of each class. Such sub-dictionaries, however, might not capture the features shared among the various classes. Therefore, we introduce DL [26,35-37] into the probabilistic collaboration subspace-based SRC (PCSRC) to obtain the Probabilistic Collaboration subspace-based Dictionary Learning (PCDL) model. To improve the effectiveness of DL, Yang et al. [5] and Ramirez et al. [29] propose to enhance the discrimination and the reconstructive ability of the dictionary by learning a class sub-dictionary and adding incoherence-promoting terms. Yang et al. introduce the Fisher discrimination criterion into DL to obtain the Fisher Discrimination Dictionary Learning (FDDL) [2,13,15] model with promising results. However, due to an intrinsic deficiency in its residual term, FDDL cannot further improve the performance of the classifier. Inspired by this, we introduce the Fisher discrimination criterion into the PCDL model to obtain the Probabilistic Collaboration subspace-based Discrimination Dictionary Learning (PCDDL) model and apply it to the final classification tasks. In brief, the main contributions of our study are as follows:

(1) We analyse the classification mechanism of SRC in detail from the perspective of the probabilistic collaborative subspace, provide a procedure that uses the joint probability to improve the stability of the algorithm in the multi-subspace case, and propose the PCSRC model.

(2) We first introduce DL into the PCSRC model to obtain the PCDL model, then add the Fisher discrimination term to the PCDL model to obtain the PCDDL model used for the final classification tasks. At the same time, to ensure the convexity of the Fisher discrimination term, we introduce the term $\|X\|_{2,1}$ into it and prove its convexity.
The rest of the paper is organized as follows: Section 2 briefly introduces the concept of the probabilistic collaborative subspace in [1] and analyses the classification mechanism of SRC from this perspective. Section 3 provides the procedure that improves the stability of the algorithm in the multi-subspace case. Section 4 adds DL and the Fisher discrimination term to obtain the PCDDL model. Section 5 conducts experiments and analyses the results. Section 6 concludes the paper. The basic framework of the paper is illustrated in Figure 1.

Probabilistic collaboration subspace
Given a training sample set $X = [X_1, \ldots, X_K]$, where $X_k$, $k = 1, 2, \ldots, K$ is the sample matrix of class $k$ and each column in $X_k$ is a sample vector, let $l_X$ be the label set of all the classes in $X$ and denote by $S$ the linear collaborative subspace spanned by all the training samples. Every data point $x$ in subspace $S$ can then be represented as a linear combination of the samples in $X$, i.e. $x = X\alpha$, where $\alpha$ is the coding coefficient vector. In general, different norms correspond to different distances; the $L_1$-norm corresponds to the Manhattan (taxicab) distance. As shown in Figure 2, the shortest path $P$ between points A and B on a grid is not unique (there are paths P1, P2, …), but the shortest distance is unique (i.e. Length(P1) = Length(P2) = Length(P)). It is worth noting that this property persists even in non-orthogonal coordinate systems as long as all the axes have the same unit length. When this property is satisfied, the magnitude of the distance can be measured by the magnitude of the $L_1$-norm. Therefore, in a linear subspace (the probabilistic subspace in this paper is also a linear subspace), the distance between a sample and the centre of the subspace is positively correlated with the $L_1$-norm of its coding vector. As shown in Figure 3, the sample $x_1$, represented by the coding vector $\alpha_1$ with the smaller $L_1$-norm, is therefore closer to the centre of the subspace. As a result, the smaller the $L_1$-norm of the coding vector, the larger the probability that the sample belongs to the subspace. Therefore, we select the exponential function to define the probability:

$$P(l(x) \in l_X) = \exp\!\big(-c\,\|\alpha\|_1\big), \tag{1}$$

where $c$ is a constant. Based on Equation (1), we call the subspace $S$ a probabilistic collaborative subspace.
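As a concrete illustration, the sketch below (our own illustration, not code from the paper) computes a coding vector with an $L_1$-penalized least-squares solver and converts its $L_1$-norm into a subspace probability via Equation (1); the constant $c$, the lasso weight and the use of scikit-learn's Lasso are assumptions made only for demonstration.

```python
# Minimal sketch of Equation (1): probability of membership in the
# collaborative subspace from the L1-norm of the coding vector.
# Assumes scikit-learn; c and lasso_weight are illustrative values.
import numpy as np
from sklearn.linear_model import Lasso

def subspace_probability(x, X, c=1.0, lasso_weight=0.01):
    """x: (d,) query point; X: (d, n) training matrix (columns = samples)."""
    lasso = Lasso(alpha=lasso_weight, fit_intercept=False, max_iter=10000)
    lasso.fit(X, x)                       # solves min ||x - X a||^2 + weight * ||a||_1
    a = lasso.coef_                       # coding coefficient vector alpha
    return np.exp(-c * np.abs(a).sum())   # Equation (1): exp(-c * ||alpha||_1)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
x = X[:, 0] + 0.01 * rng.normal(size=50)
print(subspace_probability(x, X))
```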

Probabilistic representation of samples outside the collaborative subspace
The testing sample $y$ usually lies outside the subspace $S$. To measure the probability that the label of sample $y$ belongs to $l_X$, i.e. $P(l(y) \in l_X)$, Cai et al. proposed to compute the probability as

$$P(l(y) \in l_X) = P(l(x) \in l_X)\, P(l(x) = l(y) \mid l(x) \in l_X), \tag{2}$$

where $P(l(x) = l(y) \mid l(x) \in l_X)$ is measured by the similarity between $y$ and $x$, for which Cai et al. chose the Gaussian kernel $\exp(-b\,\|y - x\|_2^2)$. According to Equations (1)-(2), the task of maximizing the probability $P(l(y) \in l_X)$ converts to the following objective:

$$\hat{\alpha} = \arg\min_{\alpha}\ \|y - X\alpha\|_2^2 + \lambda\,\|\alpha\|_1, \tag{3}$$

where $\lambda$ is a constant. We can see that the equation above offers a probabilistic representation of $y$ lying in subspace $S$ and provides a reasonable interpretation from the view of probability.
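The step from Equations (1)-(2) to Equation (3) follows by taking the negative logarithm of the product; a short derivation (our reconstruction, with the kernel width $b$ absorbed into the constant $\lambda = c/b$) reads:

```latex
\begin{aligned}
\max_{\alpha}\ P(l(y)\in l_X)
  &= \max_{\alpha}\ \exp\!\big(-c\,\|\alpha\|_1\big)\,
                    \exp\!\big(-b\,\|y - X\alpha\|_2^2\big) \\
  &\iff \min_{\alpha}\ b\,\|y - X\alpha\|_2^2 + c\,\|\alpha\|_1 \\
  &\iff \min_{\alpha}\ \|y - X\alpha\|_2^2 + \lambda\,\|\alpha\|_1,
  \qquad \lambda = c/b .
\end{aligned}
```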

Probability of belonging to each class-specific subspace
A sample $x$ within the subspace $S$ can be collaboratively represented as $x = X\alpha = \sum_{k=1}^{K} X_k\alpha_k$, $\alpha = [\alpha_1; \alpha_2; \ldots; \alpha_K]$, where $\alpha_k$ is the coding vector related to $X_k$. Note that $x_k$ is a sample point in the subspace of class $k$, i.e. $x_k = X_k\alpha_k$. Using the Gaussian kernel, we define the probability that $x$ belongs to the specific class $k$:

$$P(l(x) = k) = \exp\!\big(-\delta\,\|X\alpha - X_k\alpha_k\|_2^2\big), \tag{4}$$

where $\delta$ is a constant. For the testing sample $y$ outside the subspace, the probability $P(l(y) = k)$ can be calculated as

$$P(l(y) = k) = P(l(y) \in l_X)\, P(l(x) = k) = \exp\!\Big(-\big(b\,\|y - X\alpha\|_2^2 + c\,\|\alpha\|_1 + \delta\,\|X\alpha - X_k\alpha_k\|_2^2\big)\Big), \tag{5}$$

where $\delta$ is a constant.
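To make Equations (4)-(5) concrete, the sketch below (an illustration under our reconstruction of those equations, not the authors' code) evaluates the unnormalized per-class probabilities from a coding vector that has already been computed; the constants `b`, `c` and `delta` are placeholders.

```python
# Sketch of Equations (4)-(5): unnormalized per-class probabilities for a
# test sample y, given its coding vector `a` over X = [X_1, ..., X_K].
# `labels[j]` gives the class of column j of X. Constants are illustrative.
import numpy as np

def class_probabilities(y, X, a, labels, b=1.0, c=1.0, delta=1.0):
    common = b * np.sum((y - X @ a) ** 2) + c * np.abs(a).sum()
    probs = {}
    for k in np.unique(labels):
        mask = (labels == k)
        x_k = X[:, mask] @ a[mask]           # class-k part of the representation
        residual = np.sum((X @ a - x_k) ** 2)
        probs[k] = np.exp(-(common + delta * residual))   # Equation (5)
    return probs
```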

PCSRC model
Determining the label of $y$ by checking which subspace $y$ belongs to would make the method unstable, because the data point $x$ in $S$ that corresponds to $y$ may change. Therefore, it is critical to find a common data point $x$ in $S$ and maximize the joint probability $P(l(y) = 1, \ldots, l(y) = K)$; the class label of $y$ can then be determined by checking the largest probability. By assuming that the events $l(y) = k$ are independent, we have

$$P(l(y) = 1, \ldots, l(y) = K) = \prod_{k=1}^{K} P(l(y) = k). \tag{6}$$

Applying the logarithmic operator to Equation (6) and ignoring the constant term, we have

$$\hat{\alpha} = \arg\min_{\alpha}\ \|y - X\alpha\|_2^2 + \lambda\,\|\alpha\|_1 + \gamma\sum_{k=1}^{K}\|X\alpha - X_k\alpha_k\|_2^2. \tag{7}$$

In Equation (7), the first two terms $\|y - X\alpha\|_2^2$ and $\|\alpha\|_1$ form the collaborative representation term, which aims to find a data point $x = X\alpha$ in the collaborative subspace $S$ as close to $y$ as possible. The last term $\sum_{k=1}^{K}\|X\alpha - X_k\alpha_k\|_2^2$ is used to find a data point $X_k\alpha_k$ in every class subspace $S_k$ as close to the common data point $x$ as possible. The parameters $\lambda$ and $\gamma$ balance the roles of the three terms and can be set based on prior knowledge of the problem.
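A minimal sketch of how Equation (7) can be solved, assuming an ISTA-style proximal gradient scheme (our choice; the paper's own solver may differ). The smooth part of the objective is quadratic, so its Hessian can be assembled once per problem; the `labels` array marking the class of each training column is an assumption of this sketch.

```python
# ISTA sketch for the PCSRC objective in Equation (7):
#   min_a ||y - X a||^2 + lam * ||a||_1 + gam * sum_k ||X a - X_k a_k||^2
# Note X_k a_k = X M_k a for the 0/1 mask M_k selecting class-k entries,
# so the k-th residual is X (I - M_k) a and the smooth part is quadratic.
import numpy as np

def pcsrc_code(y, X, labels, lam=0.01, gam=0.1, iters=500):
    n = X.shape[1]
    G = X.T @ X
    H = G.copy()                          # half the Hessian of the smooth part
    for k in np.unique(labels):
        P = np.eye(n)
        P[:, labels == k] = 0             # P = I - M_k (zero class-k columns)
        H += gam * (P.T @ G @ P)
    step = 1.0 / (2 * np.linalg.eigvalsh(H).max())    # 1 / Lipschitz constant
    Xty = X.T @ y
    a = np.zeros(n)
    for _ in range(iters):
        grad = 2 * (H @ a - Xty)          # gradient of the smooth part
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)   # soft threshold
    return a
```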

Probabilistic collaboration subspace-based DL (PCDL) model
As mentioned above, to ensure the stability of the classification results we had to maximize the joint probability $P(l(y) = 1, \ldots, l(y) = K)$ in the PCSRC model, because in its classification rule the testing samples, whose labels are unknown, are represented by a dictionary built directly from the training samples. PCDL, instead, learns a parametric dictionary, and the samples used to train the dictionary have known labels, so we no longer need the joint probability but can use the single probability $P(l(y) = k)$. The PCDL model is formulated as

$$\min_{D,X}\ \sum_{k=1}^{K}\Big(\|A_k - DX_k\|_F^2 + \gamma\,\|DX_k - D_kX_k^k\|_F^2\Big) + \lambda\,\|X\|_1, \tag{8}$$

where $D = [D_1, \ldots, D_K]$ is the dictionary trained from the training samples $A = [A_1, \ldots, A_K]$, $X$ is the coding coefficient matrix and $X_k^k$ is the block of the coding coefficients of $A_k$ related to the sub-dictionary $D_k$.

Probabilistic collaboration subspace-based discrimination dictionary learning (PCDDL)
To improve the effectiveness of DL, we introduce the Fisher discrimination criterion into the PCDL model to obtain the PCDDL model and use it for face recognition.

The PCDDL model
As mentioned above, the PCDDL model is formulated as

$$\min_{D,X}\ \sum_{k=1}^{K}\Big(\|A_k - DX_k\|_F^2 + \tau_1\,\|DX_k - D_kX_k^k\|_F^2\Big) + \lambda_1\,\|X\|_1 + \tau_2\, f(X), \tag{9}$$

where $\tau_1$ and $\tau_2$ are adaptive hyperparameters and

$$f(X) = \operatorname{tr}\!\big(S_W(X)\big) - \operatorname{tr}\!\big(S_B(X)\big) + \eta\,\|X\|_{2,1}, \tag{10}$$

with $S_W(X)$ and $S_B(X)$ the within-class and between-class scatter matrices of the coding coefficients. Equation (11) defines the hyperparameter of the residual term in the model in Equation (9); the values of the two terms in the residual term change dynamically during iteration. Since it is the first term that plays the leading role, in each iteration we take the ratio of the smaller term to the larger term as the balance factor, so as to dynamically adjust the balance of the two terms:

$$\tau_1 = T_1 \cdot \frac{\min\big(\|A - DX\|_F^2,\ \sum_{k}\|DX_k - D_kX_k^k\|_F^2\big)}{\max\big(\|A - DX\|_F^2,\ \sum_{k}\|DX_k - D_kX_k^k\|_F^2\big)}, \tag{11}$$

where $T_1$ is a constant. Equation (12), which defines $\tau_2$ with constant $T_2$, has the same form as Equation (11). In $f(X)$, the first term is the Fisher discrimination term and the second term is $\|X\|_{2,1}$. The term $\|X\|_{2,1}$ not only ensures the convexity of the first term but also enhances the clustering effect. The residual term has a more intuitive interpretation in the two-dimensional vector space. As shown in Figure 4 (representation of the residual term in the two-dimensional vector space), $A_i$ denotes the training samples; one component corresponds to the first term $\|A_i - DX_i\|_F$, which ensures that the samples $A_i$ can be well reconstructed by $D$, and the other component in Figure 4 corresponds to the second term $\|DX_i - D_kX_i^k\|_F$.
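A sketch of the discrimination term $f(X)$ as reconstructed in Equation (10) follows (the scatter definitions adopt the standard FDDL convention and the row-wise $\|X\|_{2,1}$ is one common convention; $\eta$ is a placeholder value, so this is an illustration rather than the authors' implementation):

```python
# Sketch of the Fisher discrimination term f(X) in Equation (10):
#   f(X) = tr(S_W(X)) - tr(S_B(X)) + eta * ||X||_{2,1}
# X: (p, n) coding coefficients (columns = samples); labels: (n,) classes.
import numpy as np

def fisher_term(X, labels, eta=0.005):
    m = X.mean(axis=1, keepdims=True)                  # global mean coefficient
    tr_sw, tr_sb = 0.0, 0.0
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        mk = Xk.mean(axis=1, keepdims=True)            # class-k mean coefficient
        tr_sw += np.sum((Xk - mk) ** 2)                # tr(S_W) contribution
        tr_sb += Xk.shape[1] * np.sum((mk - m) ** 2)   # tr(S_B) contribution
    l21 = np.sum(np.sqrt(np.sum(X ** 2, axis=1)))      # ||X||_{2,1}: sum of row norms
    return tr_sw - tr_sb + eta * l21
```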
The proof of the convexity of the clustering term is as follows, where $\operatorname{diag}(T)$ denotes the block diagonal matrix with each diagonal block equal to the matrix $T$. The convexity of $f_i(X_i)$ depends on whether its Hessian matrix $\nabla^2 f_i(X_i)$ is positive definite [29]. $\nabla^2 f_i(X_i)$ will be positive definite if a certain matrix $S$, obtained after some derivations, is positive definite. Because the maximal eigenvalue of the all-ones matrix $E_{n_i}$ is $n_i$, we should ensure that the weight of the $\|X\|_{2,1}$ term is chosen so that $S$ remains positive definite.

Optimization
Since the objective is convex with respect to $D$ and $X$ separately, an optimization process that alternately updates $D$ and $X$ can be designed. Here we employ the method in the literature [6,7,9,34]. Detailed algorithm steps are shown in Algorithm 1 (the alternating loop repeats until the maximum number of iterations is reached or convergence; output: $D$, $X$).
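The skeleton below illustrates the alternating scheme in spirit (our sketch, not Algorithm 1 itself): fix $D$ and update the codes column by column, then fix $X$ and refresh the dictionary atom by atom with unit-norm atoms. The code-update step is shown with a plain lasso, standing in for the full PCDDL sub-problem over $X_k$ from Equation (9).

```python
# Alternating-minimization skeleton for dictionary learning (sketch only;
# the code-update step is a plain lasso, a stand-in for the full PCDDL
# sub-problem over the coefficients from Equation (9)).
import numpy as np
from sklearn.linear_model import Lasso

def learn_dictionary(A, n_atoms, iters=10, lam=0.01, seed=0):
    rng = np.random.default_rng(seed)
    d, n = A.shape
    D = rng.normal(size=(d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
    X = np.zeros((n_atoms, n))
    for _ in range(iters):
        # -- code update: fix D, solve for X column by column --
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        for i in range(n):
            lasso.fit(D, A[:, i])
            X[:, i] = lasso.coef_
        # -- dictionary update: fix X, refresh each atom in turn --
        for j in range(n_atoms):
            used = np.abs(X[j]) > 1e-12
            if not used.any():
                continue
            # residual with atom j removed, restricted to columns that use it
            R = A[:, used] - D @ X[:, used] + np.outer(D[:, j], X[j, used])
            atom = R @ X[j, used]                 # least-squares direction
            norm = np.linalg.norm(atom)
            if norm > 1e-12:
                D[:, j] = atom / norm
    return D, X
```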

The classification approach of PCDDL
For a given testing sample $y$, we first code the sample over the dictionary $D$ learned from Equation (9) to obtain the coding vector

$$\hat{x} = \arg\min_{x}\ \|y - Dx\|_2^2 + \lambda\,\|x\|_1 + \gamma\sum_{k=1}^{K}\|Dx - D_kx^k\|_2^2,$$

where $\lambda$ and $\gamma$ are constants. Here we define the following classification rule:

$$\text{label}(y) = \arg\max_{k}\ P(l(y) = k) = \arg\min_{k}\ \|D\hat{x} - D_k\hat{x}^k\|_2^2.$$
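Putting the pieces together, a hedged sketch of this classification step follows. It reuses the `pcsrc_code` routine sketched earlier with the learned $D$ in place of the raw training matrix, and assumes an `atom_labels` array assigning each atom of $D$ to its sub-dictionary; both names are ours, not the paper's.

```python
# Sketch of the PCDDL classification rule: code y over the learned D,
# then pick the class whose sub-dictionary part best matches D @ x_hat.
import numpy as np

def classify(y, D, atom_labels, lam=0.01, gam=0.1):
    x_hat = pcsrc_code(y, D, atom_labels, lam=lam, gam=gam)   # coding over D
    best_k, best_res = None, np.inf
    for k in np.unique(atom_labels):
        mask = (atom_labels == k)
        res = np.sum((D @ x_hat - D[:, mask] @ x_hat[mask]) ** 2)
        if res < best_res:                 # arg max_k P(l(y) = k)
            best_k, best_res = k, res
    return best_k
```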

EXPERIMENTAL RESULTS AND ANALYSIS
In this part, our main task has three aspects. First, we discuss whether the $L_1$-norm or the $L_2$-norm is a better constraint on the coding coefficients, and we verify the convergence of the terms in the model. Second, we discuss the effect of the number of dictionary atoms on model performance. Finally, we further validate the model's effectiveness on the AR, Extended Yale B and LFW databases.
As can be seen from the definition of the $L_2$-norm, its sensitivity comes from the squaring operation, which allows elements with large values to dominate the final result. This means that the $L_2$-norm is more sensitive to image changes than the $L_1$-norm. In addition, the $L_1$-norm leads to a certain sparsity, which helps improve the discriminability of the model. At the same time, the sparsity induced by the $L_1$-norm on the coding coefficients $X$ does not affect their collaborative representation. As shown in Figure 5, comparing the two sub-graphs shows that under the $L_1$-norm constraint the energy of the coding vector of a sample in the dictionary space is more concentrated in the subspace of its own class, while under the $L_2$-norm constraint there is no such effect. As can be seen from Figure 6, the recognition rate of the model has a smaller error when the coding coefficients are constrained by the $L_1$-norm. To intuitively observe the convergence of the PCDDL model, we draw the curves of the objective function and each of its terms in Figures 8 and 9; both the objective function and each of its terms gradually converge with iteration. As shown in Figure 7, PCDDL has stronger convergence than the classical FDDL, that is, under the same circumstances, PCDDL attains a smaller convergence value than FDDL.
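A toy comparison of the two constraints (our illustration only, not an experiment from the paper): coding one sample against a two-class dictionary with lasso ($L_1$) versus ridge ($L_2$) and measuring how much coefficient energy lands on the correct class.

```python
# Toy demo: L1 vs L2 coding of a class-0 sample over a two-class dictionary.
# With L1 the coefficient energy should concentrate on the class-0 atoms.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
D0 = rng.normal(size=(100, 10))        # class-0 atoms
D1 = rng.normal(size=(100, 10))        # class-1 atoms
D = np.hstack([D0, D1])
y = D0 @ rng.normal(size=10)           # a sample from the class-0 subspace

for name, model in [("L1", Lasso(alpha=0.1, fit_intercept=False, max_iter=10000)),
                    ("L2", Ridge(alpha=0.1, fit_intercept=False))]:
    c = model.fit(D, y).coef_
    share = np.sum(c[:10] ** 2) / max(np.sum(c ** 2), 1e-12)
    print(f"{name}: fraction of energy on class-0 atoms = {share:.3f}")
```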
In order to test the performance of PCDDL for face recognition, we conduct experiments on AR, Extended Yale B and LFW, and compare the results with classical methods such as SRC [10,23], Collaborative Representation based Classification (CRC) [16], Nearest Neighbour (NN), SVM, Discriminative KSVD (DKSVD), Dictionary Learning with Structure Incoherence (DLSI), FDDL and Support Vector Guided Dictionary Learning (SVGDL) [4]. All the experiments are performed on a computer with an Intel Core i5-8500, 8 GB RAM and a 64-bit Windows 10 operating system. For parameter selection, unless otherwise noted, the hyperparameters $\lambda_1$, $T_1$, $T_2$ and $\eta$ are set to $\lambda_1 = 0.005$, $T_1 = 0.05$, $T_2 = 0.05$ and $\eta = 0.005$. We employ fivefold cross-validation to avoid overfitting.
Besides, there is a special and critical parameter in DL, i.e. the number $a_i$ ($i = 1, 2, \ldots, K$) of atoms in the sub-dictionary $D_i$. We use SRC as the baseline method and analyse the influence of $a_i$ on the performance of PCDDL, taking the face recognition task on LFW as an example. SRC directly uses the training samples as the dictionary, so we randomly select $a_i$ training samples to serve as the $i$th sub-dictionary. Since the recognition rate varies with the chosen sub-dictionary, we conduct 10 experiments and report the average recognition rate of SRC. As illustrated in Figure 10, in all cases PCDDL and FDDL show about a 2% improvement over SRC and SVDL. In particular, PCDDL with $a_i = 3$ still achieves a higher recognition rate than SRC with $a_i = 8$. Besides, as the number of atoms varies from 3 to 8, PCDDL's recognition rate fluctuates by around 2%, while that of SRC fluctuates by 13%, which demonstrates that PCDDL is able to compute a compact and representative dictionary and achieves a higher recognition rate than SRC.
I. AR: The AR database [8] consists of more than 4000 frontal-face images of 126 individuals; 26 photos of each person were taken in two separate sessions. As in [2], we selected a subset of 50 male and 50 female subjects for the experiment. For each subject, seven images from session 1 with changes in lighting and expression were used for training, and another seven images from session 2 under the same conditions were used for testing. An example from AR is shown in Figure 11. The size of the original face image is 60 × 43. A comparison of the competing methods is shown in Table 1. We can see that PCDDL is at least 0.5% better than the other methods.
II. Extended Yale B: The Extended Yale B database includes 2414 frontal facial images of 38 people (about 64 per subject), taken under a variety of laboratory-controlled lighting conditions. For each subject, we randomly selected 20 images for training and used the remaining images for testing (an experimental setup more difficult than that of [2]). All the images are normalized to 54 × 48. A comparison of the competing methods is shown in Table 2; PCDDL is at least 1% better than the other methods. As shown in Figure 12, we also provide the confusion matrix of PCDDL on the Extended Yale B database.

III. Labeled Faces in the Wild (LFW): The LFW face database [11], collected in unconstrained environments, is currently a popular database. It includes 13,233 face images of 5749 people, with a different number of samples per subject. Face recognition on it is challenging due to non-uniform lighting, inconsistent backgrounds, changes in pose and expression, and the varying number of faces in each image. The common practice is to pre-process the samples and then use the processed samples for face recognition. The pre-processing techniques [12] include face selection from images, alignment [28] (i.e. marking feature points), normalization [3,27,30,31], cropping and down-sampling, etc. Fortunately, in the past few years scholars have made great progress in face image alignment and normalization, so ready-made normalized databases are available, such as LFW3D-hassner [32,33,38], produced by Hassner, the developer of the original 3D normalization algorithm, and LFW3D-sdm, based on the Supervised Descent Method (SDM) feature point detection. The same samples from both of the above-mentioned databases are shown in Figure 13.
The specific experimental settings are as follows. We first down-sample each normalized face image from LFW3D-hassner and LFW3D-sdm to 90 × 90. As can be seen from the results above, SRC is sensitive to the number of atoms: the more atoms SRC uses, the higher its recognition rate. Hence, the number of atoms in SRC and SVDL is set to the maximum (i.e. the number of training samples for each class) in the following experiments. As for PCDDL and FDDL, they are less affected by the number of atoms, while the coding complexity is positively correlated with it, so we next look for the appropriate number of sub-dictionary atoms.
In this part, we conduct two experiments. First, as shown in Table 4, for each feature dimension we obtained the recognition rate of PCDDL with a varying number of atoms and selected the number of atoms that enables PCDDL to obtain the maximum accuracy on most feature dimensions. We can see that when the number of atoms is 5, PCDDL obtains the maximum recognition rate on most feature dimensions. In addition, we conduct another experiment to verify the stability of the algorithm. We randomly select eight samples per subject from the LFW3D-sdm database for training and use the remaining samples for testing. Varying the number of atoms from 3 to 8, we repeated the experiment 30 times. To observe the distribution of the recognition rate, we calculated its mean and variance, as listed in Table 3, and we also draw the boxplot of the recognition rates for the various numbers of atoms, as illustrated in Figure 14. As can be seen from Figure 14, when the number of atoms is 5 the box is relatively short, which shows that the results have a smaller variance than with other numbers of atoms; meanwhile, the red line in the box is relatively high, which shows that the results have a larger mean. In short, when the number of atoms is 5, the recognition rate has a larger average, smaller variance and fewer outliers than with other numbers of atoms. Therefore, the number of atoms is set to 5 for PCDDL; for a fair comparison, the counterpart in FDDL is also set to 5.
We next choose 50, 75 and 100 subjects that include at least 10 samples each from LFW3D-hassner and LFW3D-sdm, keeping only 10 samples per subject. Eight randomly selected samples from each individual are used for training, and the rest are used for testing. We repeated the experiment 10 times for each feature dimension and calculated the mean accuracy and variance. The experimental results are given in Tables 5-10, with the highest average recognition rate for each feature dimension shown in bold. We can see that our model obtains better recognition results in the higher feature dimensions, which should be a characteristic of DL. Moreover, the performance of our model improves as the feature dimension increases, which is a good trait. In addition to the average recognition rate, we also list the highest recognition rate that each method can achieve; our model almost always achieves the highest recognition rate, and its highest recognition rate is positively correlated with the feature dimension. To intuitively observe the change of the recognition rate, we also draw the average accuracy curves in Figures 15 and 16, which show the recognition rate curves of the various methods on LFW3D-sdm and LFW3D-hassner, respectively. The sub-figures (a), (b) and (c) correspond to the experimental results for 50, 75 and 100 subjects, respectively. The horizontal axis of each line graph represents the feature dimension and the vertical axis represents the recognition rate.

FIGURE 14 The recognition rate boxplot with various numbers of dictionary atoms
As we can see, the performance of all the methods gradually decreases as the number of classes increases. The recognition rates of the other methods fluctuate as the feature dimension increases, while that of PCDDL shows a basically upward tendency in the same case. To further evaluate the stability of the algorithm, we also carried out the following experiment. We tested every method at the same feature dimension (400 dimensions) on LFW3D-sdm and repeated the test 30 times per method; each time, we randomly selected eight samples for training and used the rest for testing. We then calculated the mean and variance of the recognition rates and listed them in Table 11. To intuitively observe the distribution of the data, we also draw the boxplots of the various algorithms, as illustrated in Figure 17. We can see that the box of PCDDL is relatively shorter than those of the other methods and its outliers are fewer, which shows that the results of PCDDL have less variance, i.e. stronger stability. Meanwhile, the red line in the box of PCDDL is relatively higher than those of the others, which shows that the results of PCDDL have a larger median, i.e. better performance. Hence, compared with the other algorithms, PCDDL has stronger stability and better performance.

DISCUSSION
According to the experimental results on the AR, Extended Yale B and LFW databases, we draw the following conclusions:

(1) The proposed PCDDL outperforms the other methods in terms of recognition performance on the AR, Extended Yale B and LFW databases. According to the results in Tables 5-10, PCDDL obtains the highest recognition rate in the highest-dimensional feature spaces. From Figures 15 and 16, as the dimension increases, the recognition rate curves of PCDDL, in contrast to those of the other methods, present a basically upward trend, which is a valuable property. However, as the number of classes increases, the recognition rates of all the methods show a downward trend, which indicates that current pattern classification methods are not good at classification tasks with a large number of classes; this is the problem we will address next.

(2) The proposed PCDDL has stronger stability than the other methods. From Figure 17 and Table 11, PCDDL simultaneously possesses a smaller variance, larger mean and fewer outliers than the other methods, which shows that PCDDL, in most cases, obtains better results than the other methods.

CONCLUSION
In this study, the classification mechanism of SRC is first clearly interpreted from the view of the probabilistic subspace, and the PCSRC model is offered at the same time. To overcome the problems that the features hidden in the training samples cannot be fully extracted and that the coding complexity is too high due to the overcomplete basis, we introduce DL into the PCSRC model to obtain the PCDL model; by learning a dictionary with as few atoms as possible, these problems can be solved. Meanwhile, in order to further enhance the discrimination of the coding coefficients, the Fisher criterion is introduced into our PCDDL model used for classification tasks, which guarantees the discrimination of both the dictionary and the coefficients. Finally, we test the validity of our model on AR, Extended Yale B, LFW3D-hassner and LFW3D-sdm.

FIGURE 15 The recognition rates of various methods on the LFW3D-sdm database: (a) 50 subjects, (b) 75 subjects, (c) 100 subjects

FIGURE 16 The recognition rates of various methods on the LFW3D-hassner database: (a) 50 subjects, (b) 75 subjects, (c) 100 subjects

FIGURE 17 The recognition rate boxplot of various methods