Model compression of SDM-based face alignment for mobile applications

Face alignment is widely used in face recognition, expression recognition, face-based AR applications, etc. Cascaded-regression-based face alignment algorithms have been popular in recent years for their low computational costs and impressive results in uncontrolled scenarios. Unfortunately, the trained models of cascaded-regression-based methods are quite large, which makes them unsuitable for commercial applications on mobile phones. In this study, the authors propose a data compression method for the trained model of the supervised descent method (SDM). First, the distribution of the model data is estimated using a non-parametric method. Then an adaptive quantisation algorithm is proposed to quantise the model data. Finally, the adaptive quantisation algorithm is tightly coupled with the SDM training process to fine-tune the results. Quantitative experiments show that the proposed method can compress the model to <20% of its original size without hurting performance. The proposed method has been integrated into a mobile AR application, and subjective evaluations show that the compressed model provides visual effects very similar to those of its uncompressed counterpart.


Introduction
Face alignment is important for facial image analysis. It takes a face image as input and automatically localises facial feature points such as the eyes, nose, eyebrows, mouth, etc. Fig. 1 shows an example of face alignment with the supervised descent method (SDM) algorithm [2].
Current methods could be roughly divided into two categories: generative methods and discriminative methods [3].
Generative methods explicitly construct generative models for the shape and appearance of the face. Feature locations are derived according to the best fit of the model to the test image. The classical active appearance models [4], which consist of a shape model, an appearance model and a motion model, belong to this category. Their main defect is a lack of robustness against occlusions. Cootes et al. [5] propose to calculate an appearance model for each facial part separately. In [6], the authors propose the Gauss-Newton deformable part model, which constructs generative models for all facial parts simultaneously.
Discriminative methods aim to estimate the mapping between the facial appearance and the feature locations directly. Constrained local models (CLM) learn independent local detectors for each feature point [7]; a shape model is then used to regularise these local detectors. Yang et al. [8] propose a nonlinear correlation filter which can be used for individual face landmark detection and as a local expert for face alignment algorithms. Different from CLM, cascaded regression methods learn a vectorial regression function directly to calculate the face shape stage-by-stage. Explicit shape regression [9] is one of the first cascaded regression algorithms; it is a two-level boosted regression framework. Burgos-Artizzu et al. [10] introduce occlusion information into the regression process. Kazemi and Sullivan [1] propose to use regression trees instead of random ferns and achieve very high speed. Besides the above-mentioned two-level boosted regression framework, Xiong and De la Torre [2] present a cascaded linear regression method using hand-crafted features. The contribution of [2] is a provable SDM. They also extend SDM to global SDM in order to cope with the problem of conflicting gradient directions [11]. The SDM is a popular face alignment method, especially for mobile applications, since it achieves state-of-the-art results in realistic 2D scenarios while maintaining real-time performance.
With the development of deep learning technology, deep neural networks have been successfully applied in many computer vision tasks in recent years. Sun et al. [12] are the first to use a deep convolutional network cascade for face alignment. Trigeorgis et al. [13] propose a recurrent neural network approach. Recently, there have been studies achieving 3D face alignment by fitting a 3D morphable model through convolutional neural networks [14,15]. Bulat and Tzimiropoulos [16] propose a 3D face alignment network by stacking four hourglass networks. Although these methods perform better than traditional methods on images with large head poses, they are still hard to apply on mobile platforms. The main reasons are as follows: (i) Deep learning models need huge amounts of training data, which are not easy to collect for the face alignment task. (ii) A deep learning model is usually several hundred MB, which is too big for mobile applications. (iii) The computational cost is still quite high, and real-time processing can hardly be achieved on mobile platforms.
As mentioned before, SDM can achieve satisfactory results while maintaining relatively low computational costs. However, the trained model of SDM can easily exceed 80 MB, and it is unacceptable for a commercial mobile application to pack such a large data file. Standard lossless compression techniques such as Huffman coding and run-length coding offer rather limited compression efficiency on this data.
In recent years, research on deep-network compression has emerged. Iandola et al. [17] propose new kinds of convolutional operations so that parameters can be reduced. Li et al. [18] compress the network by pruning unimportant filters according to weight analysis. Rastegari et al. [19] and Courbariaux et al. [20] transfer the weights into binary values to reduce the model size. Most of these methods are aimed at specific tasks or network structures and cannot readily be applied to face alignment networks.
In this paper, we propose an adaptive data compression method for the trained model of the SDM which reduces the model to <1/5 of its original size without obvious performance degradation. This method opens the gates to mobile applications using SDM-based face alignment technology. This paper is organised as follows: Section 2 briefly describes the SDM algorithm; Section 3 explains our proposed algorithm in detail; Section 4 shows qualitative and quantitative experimental results; and Section 5 draws the conclusion.

Brief introduction of SDM
In order to make this paper easy to follow, we briefly introduce the workflow of the SDM in this section. Please refer to [2] for details.
We assume that the face feature points are represented by N 2D landmarks s = [x_1, y_1, …, x_N, y_N]^T. Given a face image I and the initial 2D landmarks s_0, we define a series of regressors r_d = {A_d, b_d}, d = 1, …, D, where A_d is the projection matrix, which can also be called the descent direction, and b_d is the bias term. In the SDM, D is usually chosen between four and six. Thus the estimated 2D landmarks in the dth step are calculated through

    s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d    (1)

where f(I, s_{d-1}) denotes the shape-related features computed from the landmarks s_{d-1} in the image I; SIFT values [21] or histograms of oriented gradients (HoG) features [22] are used for better performance [23]. The final estimated face landmarks are s_D.
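The cascaded update above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function names and the generic `extract_features` callback are our own stand-ins:

```python
import numpy as np

def sdm_inference(image, s0, regressors, extract_features):
    """Run the cascade: s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d for d = 1..D."""
    s = s0.copy()
    for A, b in regressors:                  # one (A_d, b_d) pair per step
        phi = extract_features(image, s)     # shape-indexed features around current landmarks
        s = s + A @ phi + b
    return s
```

In practice `regressors` holds the D = 4..6 learned pairs and `extract_features` computes SIFT or HoG descriptors around each current landmark.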
Given the M training images, the SDM obtains each r_d by minimising

    min_{A_d, b_d} Σ_{i=1}^{M} ‖Δs_i^d − A_d f(I_i, s_i^{d−1}) − b_d‖²    (2)

where Δs_i^d = s_i^* − s_i^{d−1} are the shape residuals of the ith training image at the dth regression step and s_i^* are the ground-truth landmark locations of the ith training image. Equation (2) is a standard linear least-squares problem and can be solved in closed form.
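Since (2) is linear least squares, each step's A_d and b_d can be obtained in closed form by absorbing the bias into an augmented design matrix. A small sketch (variable names and the ridge term are ours, not from the paper):

```python
import numpy as np

def train_step(Phi, dS, reg=1e-3):
    """Closed-form solution of min_{A,b} sum_i ||dS_i - A Phi_i - b||^2 (ridge-regularised).

    Phi: (M, F) matrix of features f(I_i, s_i^{d-1}), one row per training image.
    dS:  (M, 2N) matrix of shape residuals Delta s_i^d.
    Returns A with shape (2N, F) and b with shape (2N,).
    """
    M, F = Phi.shape
    X = np.hstack([Phi, np.ones((M, 1))])  # absorb the bias b into the last column
    W = np.linalg.solve(X.T @ X + reg * np.eye(F + 1), X.T @ dS)
    return W[:F].T, W[F]
```

The small ridge term keeps the normal equations well conditioned when the feature dimension is large, a common practical choice.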
The training process of the SDM is listed as follows (See Fig. 2):

Data compression algorithm
The key components of the SDM are r_d for all D steps, so the trained model of the SDM stores all the information of r_d for all D steps. In a typical scenario, the face landmarks include 68 points, i.e. N = 68. Table 1 shows the data ranges of A_d and b_d for a typical HoG-based six-step regressor.
Since the data ranges vary across regression steps, we compress the data separately for each step. Furthermore, the data range of A_d is very different from that of b_d, and b_d contains only 136 floating-point numbers per step (about 3 KB for all six steps), which is negligible compared with the size of A_d. We therefore propose to compress only the data in A_d.

Data distribution estimation
The data distribution of A_d cannot be described by a parametric model, so we apply a non-parametric method, specifically the Parzen window method [24], to estimate it. Let T be the number of elements in A_d and h the window size. The probability density function (PDF) of the elements x in A_d can be estimated through

    p̂(x) = (1/(T h)) Σ_{t=1}^{T} φ((x − x_t)/h)

where φ(u) is the square window function, equal to 1 for |u| ≤ 1/2 and 0 otherwise.
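The Parzen estimate with a square window simply counts how many samples fall inside a window of width h around x. A minimal sketch (assuming the square window φ(u) = 1 for |u| ≤ 1/2):

```python
import numpy as np

def parzen_pdf(x, samples, h):
    """Parzen estimate p(x) = (1/(T*h)) * sum_t phi((x - x_t)/h) with a square window."""
    u = (x - samples) / h
    inside = np.abs(u) <= 0.5  # square window: phi(u) = 1 for |u| <= 1/2, else 0
    return inside.sum() / (samples.size * h)
```

Here `samples` would be the flattened elements of A_d; the window size h trades smoothness against resolution of the estimated PDF.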

Adaptive data quantisation
In order to compress each element of A_d, which is stored as a single-precision floating-point value, we propose to quantise it so that it can be represented by fewer bits. As shown in Fig. 4, the bell-like distribution of the elements in A_d makes a uniform quantiser inefficient.
In this paper, we propose an adaptive data quantisation method, i.e. the quantisation step size is inversely proportional to the probability density of the data. Thus each quantisation step covers an equal probability mass:

    ∫_{v_{k−1}}^{v_k} p̂(x) dx = 1/Q,  k = 1, …, Q

where Q is the number of quantisation levels and v_{k−1} and v_k are the lower and upper bounds of the kth quantisation step, respectively. The optimal quantisation strategy is obtained by solving the above equation subject to v_0 = V_min and v_Q = V_max, where V_min and V_max are the minimum and maximum values of A_d as shown in Table 1. Unfortunately, the above equation is very difficult to solve exactly, so we propose an approximate solution (see Fig. 5). With this algorithm, we can efficiently obtain all the quantisation step bounds. However, unlike traditional methods, we do not choose the mid-value between the lower and upper bounds as the quantised value, since the data distribution inside each quantised region is not uniform. Instead, we set the quantised value to the mean of all the elements falling in the same region. First, we collect the set of elements belonging to the kth quantisation step:

    S_k = { x ∈ A_d : v_{k−1} ≤ x < v_k }

Then the quantised value of the kth quantisation step is calculated as

    q_k = (1/|S_k|) Σ_{x ∈ S_k} x
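Equal-probability-mass bounds can be approximated directly from the empirical distribution via quantiles, with each bin's quantised value set to the bin mean as described above. The following is our own approximation for illustration, not the authors' exact algorithm from Fig. 5:

```python
import numpy as np

def adaptive_quantise(A, Q=64):
    """Quantise the elements of A into Q levels with ~equal probability mass per level.

    Returns (idx, levels): idx has A's shape and holds each element's level index;
    levels[k] is the mean of all elements falling in the kth bin.
    """
    flat = A.ravel()
    # bin bounds v_0..v_Q: empirical quantiles put ~1/Q of the mass in each bin
    edges = np.quantile(flat, np.linspace(0.0, 1.0, Q + 1))
    edges[-1] += 1e-12  # make the last bin right-inclusive
    idx = np.clip(np.searchsorted(edges, flat, side='right') - 1, 0, Q - 1)
    # quantised value of bin k = mean of the elements in that bin (not the mid-value)
    levels = np.array([flat[idx == k].mean() if np.any(idx == k)
                       else 0.5 * (edges[k] + edges[k + 1]) for k in range(Q)])
    return idx.reshape(A.shape), levels
```

The quantile edges concentrate narrow bins where the bell-shaped density is high, which is exactly the behaviour the adaptive scheme targets.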

Compressed model data storage organisation
The final compressed model consists of the stacked regressors r_d for all D steps. Each regressor r_d consists of three parts: the quantised values for all Q quantisation levels; the quantised projection matrix AQ_d, which corresponds to A_d; and the bias term b_d.
The quantised values are stored in the order of quantisation levels with single-precision floating point numbers. There are Q floating point numbers for each regression step.
The quantised projection matrix AQ_d has the same dimensions as A_d. Each element of AQ_d is an index to one of the quantised values described above. Since there are Q different quantised values, each element of AQ_d needs only log2(Q) bits, which is usually much smaller than a 32-bit floating-point number. Through AQ_d, we can use the index to look up the corresponding quantised value and reconstruct an approximate projection matrix. In this way, we achieve the data compression purpose. Throughout this paper, we choose Q = 64, which is justified in the experimental results section.
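Packing the log2(Q)-bit indices tightly is what yields the storage saving. A sketch of packing 6-bit indices into a byte stream with numpy; this is a hypothetical layout for illustration, not the authors' exact file format:

```python
import numpy as np

def pack_indices(idx, bits=6):
    """Pack unsigned indices (< 2**bits each) into a compact uint8 byte array."""
    assert idx.max() < (1 << bits)
    # write each index as `bits` binary digits (MSB first), then regroup into bytes
    digits = ((idx.ravel()[:, None] >> np.arange(bits - 1, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(digits.ravel())

def unpack_indices(packed, count, bits=6):
    """Recover `count` indices from a byte array produced by pack_indices."""
    digits = np.unpackbits(packed)[:count * bits].reshape(count, bits)
    return digits @ (1 << np.arange(bits - 1, -1, -1))
```

With Q = 64 this stores each element of AQ_d in 6 bits instead of 32, i.e. less than 1/5 of the original storage for the projection matrices.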
The bias term b_d is stored directly as floating-point numbers. Fig. 6 illustrates the outline of the compressed data storage arrangement.

Training process embedding
If we directly quantise the learned A_d, the quantisation process introduces extra errors into the feature localisation results. To constrain these errors, we modify the traditional SDM training algorithm described in Section 2 and propose the following algorithm (see Fig. 7).
In the above algorithm, we embed the data quantisation process into the training process so that the errors caused by quantisation in one step are propagated into the next regression step. As a result, the projection matrix of the next step can partially correct the errors introduced by quantisation, and the final results are improved.
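The quantise-in-the-loop idea can be sketched as follows: after each A_d is learned, it is immediately replaced by its quantised reconstruction before the shapes for the next step are computed, so later steps see (and can partially correct) the quantisation error. This is a schematic sketch under our notation; `extract`, `solve_step` and `quantise` are stand-in callbacks, not the authors' code:

```python
def train_sdm_quantised(images, gt_shapes, init_shapes, D, extract, solve_step, quantise):
    """Train D regression steps, quantising each A_d before it updates the shapes."""
    shapes = [s.copy() for s in init_shapes]
    model = []
    for d in range(D):
        Phi = [extract(img, s) for img, s in zip(images, shapes)]
        dS = [gt - s for gt, s in zip(gt_shapes, shapes)]
        A, b = solve_step(Phi, dS)
        A_q = quantise(A)  # quantise-then-reconstruct A_d immediately
        model.append((A_q, b))
        # propagate with the quantised matrix so step d+1 sees (and corrects) its error
        shapes = [s + A_q @ f + b for s, f in zip(shapes, Phi)]
    return model
```

The only change from plain SDM training is that `A_q` rather than `A` is used to update the training shapes, which is what couples quantisation to the cascade.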

Experimental results
In this section, we compare our method with the standard SDM on the 300-W dataset [25]. We use the open-source implementation of the standard SDM by Patrik Huber [26].

Choice of Q
In this section, we evaluate the face alignment accuracy using the average distance between the detected landmarks and the ground truth, normalised by the inter-ocular distance [25]. We vary the number of bits per element in AQ_d and calculate the corresponding normalised mean error loss against the standard SDM algorithm [26]. Fig. 8 shows the result.
From the figure we can conclude that if each element in AQ_d uses more than six bits, the loss is small and changes smoothly. However, if each element is represented by <6 bits, the error loss increases rapidly. We therefore choose six bits as a balance between error loss and compression efficiency, which means Q = 2^6 = 64.
In this paper, we use HoG features and localise 68 feature points, and the number of regression steps is six; the dimensions of A_d and b_d follow from these settings.

Qualitative experimental results
In this section, we compare our feature localisation results with those of the uncompressed training model [26]. Fig. 9 shows the results on the 300-VW dataset. The left column shows the results of the SDM algorithm with the uncompressed training model; the right column shows the results with our compressed training model. The red lines in both columns are the ground-truth feature locations. From this figure we can conclude that our compressed training model generates results very similar to those of its uncompressed counterpart; both fit the ground truth very well. Fig. 10 shows the feature localisation results on our own dataset. Since this dataset contains no ground-truth information, we only show the results with the uncompressed training model and our compressed training model in the left and right columns, respectively. Again, our compressed model obtains results similar to the uncompressed one, and our method is also robust in occlusion scenarios.

Quantitative analysis
The per-point normalised mean errors of the two models are shown in Fig. 11. From the figure we can conclude that our compressed training model produces feature points very close to those of the uncompressed training model, especially for the eyes and the face contour. These two parts are very important for AR-based applications. Fig. 12 shows the cumulative error distribution curves of the two methods. The two curves are clearly very close to each other, which again confirms the similar performance of both methods despite our training model being much smaller.

Ablation study
In this section, we study the effect of adaptive data quantisation and the training process embedding proposed in Sections 3.2 and 3.4, respectively.
In the first experiment, we compare adaptive data quantisation with uniform quantisation, i.e. quantisation steps uniformly distributed inside the data range. The training process embedding step is used for both methods. The results are shown in Fig. 13. With uniform quantisation, the normalised mean error for each feature point increases by about 1%. Considering that the average normalised mean error is about 3.5%, the benefit of adaptive data quantisation is obvious.
In the second experiment, we analyse the effect of the training process embedding by comparing results with and without it, applying adaptive data quantisation in both cases. The results are also shown in Fig. 13. Without training process embedding, the normalised mean error for each feature point increases by about 2%, showing that training process embedding is even more important.

User study for mobile applications
We develop an AR mobile application based on the SDM with our proposed compressed trained model. Sample effects of this application are shown in Fig. 14. The application adds interesting virtual decorations to the face video in real time.
We pre-record 20 short face videos of different people; the length of each video is about 30 s. We generate AR effects for each video with both the compressed and the uncompressed training models. We then recruited 20 people to score the results of the two methods on the 20 face videos, with scores ranging from one to ten points. Half of the subjects are male and the other half female, with ages ranging from 19 to 40. They are undergraduate students, graduate students and teachers, and have no relationship with this research project.
In this experiment, we show the two output videos simultaneously on the monitor and the test subject scores the visual effects. The results show that 14 of the 20 people gave exactly the same scores to the two methods on all the videos. The scores of the other six people are listed in Table 2.
From this table we can conclude that the visual effects generated by our compressed model are very similar to those of the uncompressed counterpart. This also proves the effectiveness of our proposed algorithm.

Computational cost analysis
The computational overhead of our proposed method for online feature tracking is the decompression of the data file; the face alignment process itself is exactly the same for our method and superviseddescent [26]. Fortunately, the decompression only needs to be done once before feature tracking. It takes about 20 ms on an iPhone 6, which is negligible compared with the loading time of the mobile application.

Conclusion
This paper proposes an adaptive data compression method for the trained model of the SDM face alignment algorithm. We propose an efficient method to quantise the data according to its probability density distribution. Furthermore, we couple our quantisation method with the training process so that the accuracy loss is minimised. Experimental results show that our method achieves results comparable to the standard method while needing <1/5 of the original storage space. Our proposed method shows the potential of the SDM algorithm to be applied in mobile applications.