Novel multi-scale deep residual attention network for facial expression recognition

Abstract: Recently, deep convolutional neural networks (CNNs) have shown great success in facial expression recognition (FER). These CNN-based approaches have achieved breakthroughs in accuracy by using deeper networks. However, because the amount of training data is small, these models easily overfit during training. In this study, a novel multi-scale deep residual attention network (Ms-RAN) is proposed for FER. The proposed Ms-RAN is mainly based on the multi-scale residual attention unit, which consists of two different-scale sub-units. Each sub-unit is composed of convolutional layers, parametric rectified linear units (PReLUs), and a residual attention connection. By focusing on the relationships between channels and automatically learning the importance of different channel features, Ms-RAN makes the model pay more attention to the most informative channel features while suppressing unimportant ones. Owing to this design, the proposed method enhances the combination of features from various levels and provides valuable, complementary ranges of expressive information for recognition. The experimental results demonstrate that the proposed method achieves superior performance to other state-of-the-art approaches on five databases: CK+, Oulu-CASIA, BU-3DFE, BP4D+, and MMI.


Introduction
Facial expression is one of the most important signals for conveying emotional states and intentions between human beings [1]. Recognising the expression in a facial image is a classic problem in computer vision with a wide range of application prospects, including intelligent medical treatment, polygraph detection, internet education, and so on. Various facial expression recognition (FER) methods and systems have been proposed in the fields of computer vision and machine learning. These works usually define six basic emotions: anger, disgust, fear, surprise, sadness, and happiness.
Most FER works focus on feature extraction and classifier construction, and can be classified into two categories: static image-based and image sequence-based approaches. Static image-based models [2,3] are applied to a single static image for FER and include random forests, support vector machine classifiers, and softmax classifiers. Image sequence-based approaches are applied to facial video sequences; by extracting useful temporal features from the image sequence, methods such as hidden Markov models and conditional random fields can increase recognition performance. In practical applications, building on these two kinds of methods, other modalities, such as the physiological and audio channels, have also been used in multimodal systems to assist FER.
Most traditional methods used handcrafted features or shallow learning for FER, such as histograms of oriented gradients (HOG) [4], histograms of local binary patterns (LBP) [5], the scale-invariant feature transform (SIFT) [6], LBP on three orthogonal planes (LBP-TOP) [7], histograms of local phase quantisation, sparse learning [8], and non-negative matrix factorisation [9]. In addition, there are also other spatiotemporal approaches, such as the interval temporal Bayesian network [3], spatiotemporal covariance descriptors (Cov3D) [10], temporal modelling of shapes [11], and expressionlets on spatiotemporal manifold (STM-ExpLet) [12]. Owing to their sparse feature representations, the performance of these methods leaves room for improvement. Since emotion recognition competitions, such as Emotion Recognition in the Wild [13,14] and FER2013 [15], have been held successfully, sufficient training data have been collected from challenging real-world scenarios, which has also implicitly promoted the transition of FER from lab-controlled to in-the-wild settings. In addition, with the dramatic development of chip processing abilities and deep convolutional neural networks (CNNs), studies in various fields have begun to move to deep learning methods.
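To make one of these handcrafted descriptors concrete, the sketch below implements the basic 3×3 LBP operator in plain NumPy. It is a simplified illustration: the LBP descriptors cited above additionally aggregate these codes into per-block histograms, and the helper names here are ours, not from any cited work.

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre
    pixel and pack the resulting bits into one 8-bit code."""
    center = patch[1, 1]
    # Neighbours in clockwise order starting from the top-left corner.
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 << i) for i, p in enumerate(neighbours) if p >= center)

def lbp_image(img):
    """Map every interior pixel of a grey-scale image to its LBP code."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = lbp_code(img[i:i + 3, j:j + 3])
    return out

# The histogram of codes over the image (or over blocks) is the
# actual feature vector fed to a classifier.
img = np.random.randint(0, 256, (70, 70)).astype(np.uint8)
hist, _ = np.histogram(lbp_image(img), bins=256, range=(0, 256))
```

In practice, the image is divided into a grid of cells and the per-cell histograms are concatenated, which preserves coarse spatial layout.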
Owing to their ability to automatically extract useful representations from image data, deep neural networks (DNNs) and CNNs have shown excellent performance in face-related recognition tasks [16-24] in recent years. For example, Sun et al. [25] applied the region-based CNN (R-CNN) to extract features for FER. By generating high-quality region proposals, Li et al. [26] used faster R-CNN to identify facial expressions. Moreover, for image sequences, Li et al. [27] proposed a 3D CNN to capture the motion information encoded in multiple adjacent frames. However, despite recent significant progress, FER is still a challenging task: for instance, different subjects often display the same expression with different visual appearances and diverse intensities. Moreover, there is a limit to applying deep CNNs directly to FER databases, mainly because the amount of training data is too small, so the deep network easily overfits during training.
Although DNN architectures have shown great success and superior performance in FER tasks [28-34], there are still some important limitations. For example, most approaches consider each sample independently during learning and ignore the intrinsic correlations between pairs of samples, which limits the discriminative capability of the learned models. In addition, a DNN with many parameters can easily overfit when trained on a small amount of data, and the overfitting problem becomes more severe when the training data are high dimensional. Even when regularisation techniques such as dropout [35] and batch normalisation [36] are used during training, the results are not satisfactory.
Moreover, despite significant recent progress, FER remains challenging for several reasons. First, different subjects often display the same expression with diverse intensities and visual appearances; in a video stream, for example, an expression will first appear in a subtle form and then grow into a strong display of the underlying feeling. Second, the peak and non-peak expressions of the same person can vary significantly in attributes such as facial wrinkles and the curve of the mouth corners. Non-peak expressions are displayed more commonly than peak expressions, yet the critical and subtle expression details in non-peak images are difficult to capture, which makes them hard to distinguish across expressions.
Motivated by these observations, we propose a novel multi-scale deep residual attention network (Ms-RAN) in this paper. The model is designed to improve the feature extraction and recognition ability of the network. Two special designs improve performance and avoid overfitting. First, a multi-scale network structure is adopted, which combines features from multiple scales and provides varied features for FER. Second, a residual attention mechanism is proposed and applied in each unit, which improves the quality of the extracted features. Moreover, concatenation is used to connect the features from each branch with the original facial image input. The experimental results show that the proposed method achieves better feature extraction and recognition performance with a small amount of data.

Related work
Recently, methods based on deep CNNs have been proposed for FER. For example, Yang et al. [37] proposed a weighted-mixture DNN to automatically extract features that are effective for the FER task. Kuo et al. [38] proposed a compact frame-based FER framework that achieves very competitive performance with respect to state-of-the-art methods while using fewer parameters. Yang et al. [39] proposed a de-expression residue learning (DeRL) method that recognises facial expressions by extracting information from the expressive component. Yu [40] proposed an ensemble of multiple deep CNNs for FER. Mollahosseini et al. [41] used three inception structures [42] in the convolutional layers to achieve facial expression recognition.
In addition, VGGNets [43] and the inception models [42] showed that increasing the depth of a network could significantly increase the quality of the representations it is capable of learning. By regulating the distribution of the inputs to each layer, batch normalisation adds stability to the learning process in deep networks and produces smoother optimisation surfaces [44]. Building on these works, the residual network demonstrated that considerably deeper and stronger networks can be learned through identity-based skip connections [45]. This design not only eliminates degradation during stochastic gradient descent optimisation, but also enables data to flow across layers. In this way, the residual network counteracts internal covariate shift and trains deeper CNNs efficiently. Compared with a single deep CNN, ResNet can avoid overfitting and achieve superior performance. Thus far, residual networks have been widely used for computer vision problems from low-level to high-level tasks, such as image classification and detection. Since then, many state-of-the-art algorithms based on ResNet have been proposed and have achieved excellent performance in face-related recognition tasks. Owing to the efficient residual connection, these residual-based methods keep improving and continually set new recognition accuracy records.
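The effect of the identity-based skip connection described above can be illustrated with a minimal NumPy sketch. The linear map `f` and its near-zero scale are arbitrary choices for illustration: even when the learned transformation contributes almost nothing, the identity branch keeps the layer's Jacobian close to the identity, which is what lets gradients flow through very deep stacks.

```python
import numpy as np

# Hypothetical residual layer: y = f(x) + x, with f a small linear map.
W = np.full((4, 4), 1e-3)   # a near-zero transformation
f = lambda x: W @ x

x = np.ones(4)
y = f(x) + x                # identity-based skip connection
jacobian = W + np.eye(4)    # dy/dx = W + I, never vanishes
```

A plain (non-residual) layer would instead have Jacobian `W`, whose entries here are 10^-3, so stacked gradients would shrink towards zero.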
Until now, many studies have been conducted on FER. Traditional local features, such as LBP, HOG, and SIFT, have been applied to FER successfully and have been extended to video. Additionally, using conditional random fields and manually created shape-appearance features, Jain et al. [46] attempted to improve accuracy through temporal modelling of each facial shape.
Besides the traditional methods, deep CNNs have also been widely used for FER in recent years and have achieved excellent performance. Many models based on the CNN architecture have been proposed for FER. For instance, Yu [40] applied an ensemble of multiple deep CNNs to identify facial expressions. Scovanner et al. [47] adopted three inception structures in the network for FER. Moreover, the recent use of generative adversarial networks (GANs) [48] has shown success in FER; for example, Zhou and Shi [49] used neutral faces to synthesise facial expression images with cGANs.

Proposed methods
In this section, we introduce the proposed method in detail. First, the architecture of the proposed multi-scale residual attention unit is presented. The overall model structure is then discussed and analysed.

Multi-scale residual attention unit
In this work, a multi-scale deep residual attention unit (Ms-RAU) is proposed to perform the feature learning and extraction. As shown in Fig. 1, the Ms-RAU consists of two sub-units, each of which is composed of convolutional layers, parametric rectified linear units (PReLUs), a fully connected layer, and the residual attention connection. The filter sizes differ between the two sub-units, which are designed to extract multi-scale features and learn a hierarchical feature representation. Each unit is then defined as

y = σ(h(x)) + x

where σ represents the function of the fully connected layer, x is the input of each unit, and h denotes the transformation of the PReLU and convolution layers, which can be described as

h(x) = f_2(g(f_1(x, w_1)), w_2)

Here, f_1 and f_2 represent the functions of the first and second convolution layers of each unit, respectively, g is the PReLU activation, and w_1 and w_2 are the weight parameters of the first and second convolution layers in each unit, respectively.
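As an illustration, the following NumPy sketch gives our interpretation of one sub-unit's forward pass. It is a sketch, not the paper's exact formulation: the convolutions are replaced by per-pixel 1×1 linear maps, and the squeeze-and-excitation-style sigmoid gate standing in for the residual attention connection is our assumption.

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU: identity for positive inputs, slope `a` for negative ones.
    return np.where(x > 0, x, a * x)

def channel_attention(h, w_fc):
    """SE-style gate: global average pool per channel, one FC layer,
    sigmoid, then rescale each channel of h by its learned weight."""
    squeeze = h.mean(axis=(0, 1))                    # (C,)
    gate = 1.0 / (1.0 + np.exp(-(w_fc @ squeeze)))   # in (0, 1) per channel
    return h * gate                                  # broadcast over H, W

def ms_rau_subunit(x, w1, w2, w_fc):
    """One sub-unit: conv -> PReLU -> conv (1x1 stand-ins), channel
    attention, then the residual addition back to the input."""
    h = x @ w1            # stand-in for the first convolution
    h = prelu(h)
    h = h @ w2            # stand-in for the second convolution
    return x + channel_attention(h, w_fc)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))            # H x W x C feature map
w1, w2 = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
w_fc = rng.standard_normal((16, 16))
y = ms_rau_subunit(x, w1, w2, w_fc)
```

The gate learns one scalar per channel, so informative channels can be amplified and uninformative ones suppressed before the residual addition, matching the channel-attention behaviour described above.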

Model architecture
As shown in Fig. 2, the proposed network consists of convolutional layers, Ms-RAUs, a max-pooling layer, PReLU layers, a fully connected layer, and a skip connection. Given a facial expression image as input, we use the first convolutional layer to extract feature maps, which are sent to the Ms-RAUs. After that, a skip connection concatenates the original facial expression input with the output of the Ms-RAUs, so that feature maps from different levels can be combined. Lastly, we use the fully connected layer and softmax to complete the recognition of the facial expression.
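The concatenation-based skip connection and the classification head described above can be sketched as follows. The shapes, the two-branch setup, and the 7-way output are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(x, branch_outputs, w_fc, b_fc):
    """Concatenate the input feature map with each branch's output
    along the channel axis (the skip connection), flatten, then apply
    the fully connected layer and softmax for the 7-way prediction."""
    fused = np.concatenate([x] + branch_outputs, axis=-1)
    logits = w_fc @ fused.ravel() + b_fc
    return softmax(logits)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))                  # input feature map
branches = [rng.standard_normal((8, 8, 16)) for _ in range(2)]  # two scales
n_features = 8 * 8 * 16 * 3                          # input + 2 branches
probs = classify(x, branches, rng.standard_normal((7, n_features)),
                 np.zeros(7))
```

Concatenation (rather than addition) keeps the original low-level features and each scale's features side by side, so the classifier can weight them independently.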

Datasets
To assess the proposed model, extensive experiments have been conducted on two public facial expression databases, CK+ [50] and Oulu-CASIA [51], which have been widely used for evaluating FER in other works. CK+ provides emotion labels for frame sequences running from neutral to peak states; a total of 123 subjects participated and 593 image sequences were included, 327 of which are labelled with seven universal emotions (anger, contempt, disgust, fear, happiness, sadness, and surprise). Fig. 3 shows examples of frontal facial images from the CK+ dataset. Moreover, to further demonstrate that the proposed algorithm generalises, we also evaluate FER performance on four further databases: JAFFE [52], MMI [53], BU-3DFE [54], and the spontaneous expression database BP4D+ [55]. JAFFE consists of grey-scale frontal facial expression images of ten Japanese women and contains a total of 213 images covering seven facial expressions. Fig. 4 shows examples from the JAFFE dataset.

Training details and parameters
During the experiments, we use five landmark points to align the face region. Since the databases do not provide landmarks, the TSM [11] is applied for face detection and landmark localisation. After that, the aligned face region is cropped and resized to 70 × 70 for model training. Additionally, data augmentation is used to generate more training data and avoid overfitting; the augmentation methods used in this paper include random cropping, mirroring, colour jitter, noise, rotation, and translation. With this augmentation, the training dataset is 110 times larger than the original one. The proposed model is trained with Adam, with parameters β1 = 0.9, β2 = 0.99, and ε = 10^−8. The mini-batch size is set to 16. All variants of our network are trained for 300 epochs with an initial learning rate of 10^−4. All experiments are performed under Ubuntu 14 and TensorFlow on a computer with an NVIDIA GTX 1080 GPU.
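For reference, one Adam update with the stated hyperparameters (β1 = 0.9, β2 = 0.99, ε = 10^−8) looks like the following sketch. The toy minimisation at the end uses a larger learning rate than the paper's 10^−4 so that it converges in a few hundred steps.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.99, eps=1e-8):
    """One Adam update with the hyperparameters used in the paper."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy check: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
```

Note that β2 = 0.99 is smaller than the common default of 0.999, which makes the second-moment estimate adapt faster to recent gradients.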

Comparisons with state-of-the-art methods
We provide qualitative and quantitative comparisons between the proposed method and other state-of-the-art approaches in this part. To examine the performance of our model fairly, the results and average expression recognition accuracies reported in this section are the averages of ten experimental runs. In Table 1, we compare LBP-TOP [7], HOG 3D [56], STM-ExpLet [12], DTAGN [57], DeRL [39], DCN [58], Kuo et al. [59], and the proposed model on the CK+ and Oulu-CASIA databases, covering both sequence-based and image-based FER models. As the experimental results show, the proposed algorithm performs well for FER on static images and achieves 99.06% and 91.89% recognition rates on the CK+ and Oulu-CASIA datasets, respectively, outperforming the compared state-of-the-art approaches. Meanwhile, we show the confusion matrices of our proposed method on the CK+ and Oulu-CASIA databases in Figs. 5 and 6, respectively, where fear and anger show the lowest recognition rates, at 95% and 82%, respectively. To validate the proposed model further, additional evaluations are reported in Tables 2 and 3, and a comparison with state-of-the-art works on the JAFFE dataset is given in Table 4. Lopes et al. [61] suggested a representative CNN-based FER method; they shuffled the training data so that the method could learn from less data, and the accuracy of this basic CNN using only static features was 84.48%. Liu et al. [62] proposed a method that extracts features from salient areas using LBP and HOG features with gamma correction, which resulted in 90% accuracy. Goyani and Patel [63] built feature vectors by constructing multi-level Haar wavelets over the face, eyes, and mouth. Kim et al.
[64] proposed a hierarchical deep learning method for FER, with an accuracy of 91.27%. Hamester et al. [65] proposed the MCCNN for FER and achieved 95.80% accuracy. From Table 4, we can see that our proposed Ms-RAN achieved the best performance among these state-of-the-art works, with 96.03%. Table 2 enumerates the average accuracy of six-expression recognition over ten runs on the MMI database. In Table 3, the average experimental results on the BP4D+ and BU-3DFE databases are reported. For all experiments, we only used the 2D texture images. Compared with DeRL, which is also an image-based, CNN-based model, the proposed model shows an improvement of around 4.06%. The proposed method achieved the best performance among the state of the art, with 78.65%, 86.10%, and 78.23% recognition accuracy on the MMI, BU-3DFE, and BP4D+ datasets, respectively. In addition, we show the confusion matrices of our proposed method on the MMI and BU-3DFE databases in Figs. 7 and 8, respectively, which show that happiness and surprise are easy to recognise, while fear is hard to recognise.
Furthermore, we have also conducted a cross-database validation, as shown in Table 4, and the confusion matrix of the proposed algorithm is shown in Table 5. The results are similar to those on the CK+ dataset. As can be seen from Fig. 9, the models converge after a certain number of training iterations. Taking the test set as an example, the following conclusions can be drawn from Fig. 9. First, the proposed method began to converge after about 25 epochs of training, the convergence of the DeRL method was relatively slow, and the convergence of the CNN baseline was the slowest. Second, once training has converged, the proposed Ms-RAN achieves a higher recognition rate than the other three methods. Therefore, Fig. 9 shows that the proposed approach improves the recognition rate and can, to a certain extent, be used for FER against complex backgrounds.
In summary, the proposed method outperforms the other algorithms in terms of objective FER accuracy. The experimental results show that the proposed algorithm classifies facial expressions better than the other recognition works. In addition, our proposed method could be used in future IoT systems: for example, the lights in a house could be controlled by recognising facial expressions, and facial expressions could be used to build a health-monitoring system.
Lastly, we briefly discuss security aspects. Image recognition, such as face recognition, is important and useful in daily life. However, since attackers have begun using duplicated images to trick recognition systems, image recognition, including FER, also carries significant risks. Improving the recognition rate of facial expressions can therefore help to prevent such attacks.

Conclusion
In this study, a novel Ms-RAN is proposed for FER. Our model mainly consists of multi-scale convolution blocks, each of which contains two different-scale sub-blocks. For each sub-block, we propose the residual attention connection between the input and output, which is newly proposed in this study. Besides, a skip connection is applied to combine the original input facial expression image with the output from the multi-scale blocks. The experimental results validate that our proposed algorithm improves the visual feature representation and achieves higher recognition accuracy than various state-of-the-art methods on public datasets. In the future, we will continue to study the problem of FER, and expect to find better rules and strategies for superior performance.

Table 4 Comparisons between our method and state-of-the-art FER algorithms on the JAFFE dataset

Method                                  Accuracy, %
CNN [61]                                84.48
salient feature [62]                    90.00
multi-level Haar wavelet [63]           90.56
hierarchical deep learning method [64]  91.27
MCCNN [65]                              95.80
Ms-RAN (ours)                           96.03