Improved softmax loss for deep learning-based face and expression recognition

Abstract: In recent years, deep convolutional neural networks (CNNs) have been widely used in computer vision and have significantly improved the performance of image recognition tasks. Most works use the softmax loss to supervise the training of a CNN and then adopt the output of the last layer as features. However, the discriminative capability of the softmax loss is limited. Here, the authors analyse and improve the softmax loss by manipulating the cosine value and the input feature length. As the approach does not change the principle of the softmax loss, the network can easily be optimised by typical stochastic gradient descent. The MNIST handwritten digits dataset is employed to visualise the features learned by the improved softmax loss. The CASIA-WebFace and FER2013 training sets are adopted to train deep CNNs for face and expression recognition, respectively. Results on both the LFW dataset and the FER2013 test set show that the proposed softmax loss can learn more discriminative features and achieve better performance.


Introduction
Convolutional neural networks (CNNs) have achieved great success in many image recognition tasks [1][2][3][4]. DeepFace [5] is one of the classic CNN methods for face recognition; it employs the Siamese [6] network trained on a dataset of 4 million pictures captured from 4000 individuals. As a metric learning-based approach, the network receives image pairs and compares the output features with the Euclidean distance. The purpose of this method is to minimise the distances between intra-class samples and maximise the distances between inter-class samples. In addition to adopting a large training dataset, DeepFace uses a 3D face alignment method to preprocess faces and integrates several CNNs to generate the final prediction. DeepFace achieved the best results on the LFW [7] and YTF [8] datasets at that time. Standing on the shoulders of DeepFace, Sun et al.'s DeepID series [9][10][11][12] continuously improved the performance on the LFW and YTF datasets. Instead of 3D face alignment, the DeepID series uses 2D affine transformation to align the faces. DeepID1 [9] generates many patches for each face image and inputs them into different networks, then combines the output features of the different networks into one long feature, and finally uses the Joint Bayesian [13] approach to reduce the dimension of the feature. Based on the Siamese network, DeepID2 [10] uses a multi-task learning framework to learn face identification and verification at the same time. DeepID2+ [11] further applies the face verification loss after each convolutional layer. Inspired by GoogleNet [4] and VGGNet [3], DeepID3 [12] uses advanced structures to build very deep CNNs for face recognition.
As the winner of the ImageNet [14] Classification Competition in 2014, GoogleNet builds a deep CNN by repeating the carefully designed Inception structure. Inspired by the work of [15], the Inception structure adopts a large number of 1 × 1 convolutional kernels to greatly reduce both parameters and computational complexity. Google's research team then trained a network, called FaceNet [16], using a large number of Inception structures and the triplet loss with a large dataset of 200 million faces and 800 million pairs. The goal is to make the distance between the positive sample and the anchor in a triplet smaller than the distance between the negative sample and the anchor, which closely mirrors the real-life verification scenario.
VGGNet is the runner-up of the ImageNet Image Classification Competition in 2014. It repeatedly uses 3 × 3 convolutional layers and 2 × 2 max pooling layers to build deep CNNs and improves the performance of the network. Compared with GoogleNet, VGGNet has a simpler structure and is widely used in many image recognition tasks. VGG-Face [17] adopts the structure of VGGNet to recognise faces. Furthermore, it collects and refines a large dataset of more than 2 million faces from the Internet. This work also obtains comparable performance on the LFW and YTF datasets.
ResNet [2] is the champion of the ImageNet Image Classification Competition in 2015. It proposes the residual block and simply repeats it to build very deep CNNs. The residual block not only accelerates the convergence of the network but also improves the performance over plain CNNs. Wen et al. [18] use the residual block to design Face-ResNet for face recognition. Liu et al. [19] also use this structure and propose SpherefaceNet.
Most approaches above use the softmax function with the cross-entropy loss to train their networks. However, as described in [18], the softmax loss encourages the separability, rather than the discriminability, of the learned features. For facial identity and expression recognition tasks, features with large inter-class variation and compact intra-class distribution are strongly required. While the contrastive loss [20] and the triplet loss [16] are widely used in the literature, both impose a Euclidean margin on the features. Wen et al. [18] proposed the centre loss to jointly learn discriminative features with the softmax loss supervision. Several works [19, 21, 22] incorporate an angular margin into the softmax loss to achieve this purpose in a more natural way. Following [21], the softmax loss is used to represent the combination of the softmax function with the cross-entropy loss and the last fully connected (FC) layer. The value of the softmax loss is only related to the length ∥W_i∥ of the connection weight, the length ∥x_i∥ of the input feature, and their included angle θ.
In this paper, we propose three different strategies to improve the discriminative ability of the original softmax loss. Specifically, we restrict the length of the feature x_i to reduce its influence once x_i is correctly classified; we replace cos θ with a linear function to ensure that θ still receives a useful gradient even when θ is close to 0 or π; and a margin m is introduced to explicitly encourage decreasing cos θ, namely increasing θ.
Our main contributions are summarised as follows:
• An improved softmax loss is proposed to learn discriminative features. The proposed method combines three different strategies, and each one of them can be used to improve the capacity of the network.
• The proposed loss function is easy to implement and optimise.
• Experimental results on both face and expression recognition tasks demonstrate the effectiveness of the proposed loss function.

Related work
As CNNs with the softmax loss achieve state-of-the-art results in many image classification tasks, more researchers have started to work on improving the softmax loss to learn discriminative features. The main drawback of the traditional softmax loss is that it only focuses on classification ability, instead of discrimination ability. The Large Margin softmax (L-softmax) loss [21] introduced a margin parameter m to explicitly encourage intra-class compactness and inter-class separability between the learned features. Based on the L-softmax loss, Liu et al. [19] further proposed the Angular Margin softmax (A-softmax) loss, which normalises the weights and zeroes the biases. These two methods achieved 98.71% and 99.32% verification performance on the LFW [7] dataset, respectively. Different from the A-softmax loss, whose margin is incorporated into the loss in a multiplicative way, the Additive Margin softmax (AM-softmax) loss [22] introduces the margin parameter m in an additive way. Furthermore, the AM-softmax loss performs input feature normalisation and introduces a scale parameter s to control its learning process. Without feature normalisation, the AM-softmax loss is very similar to our approach (shown in (5)); the two methods even have the same decision boundary when addressing binary classification problems. The AM-softmax loss introduces the angular margin to increase the intra-class cosine similarity, whereas we mainly use the margin to increase the inter-class separability of the learned features, which is the major difference between the AM-softmax loss and our approach.
Many methods in the literature do not directly improve the softmax loss. Wen et al. [18] proposed the centre loss to jointly learn discriminative features with the softmax loss supervision. The centre loss tries to minimise the intra-class variations in each training batch; meanwhile, a parameter α is introduced to control the learning rate of the centres and avoid large perturbations caused by a few mislabelled samples. The contrastive loss [20] and the triplet loss [16] incorporate a Euclidean margin into the loss function, and their optimisation involves the selection of image pairs and image triplets. Compared to the number of image samples, the number of image pairs and triplets grows dramatically, which brings much inconvenience to the training process and significantly increases the computational cost.

Method
As described before, the softmax loss is used to represent the combination of the softmax function with the cross-entropy loss and the last FC layer, as given in (1):

L = −(1/M) ∑_{i=1}^{M} log( e^{W_{y_i}^T x_i + b_{y_i}} / ∑_{j=1}^{n} e^{W_j^T x_i + b_j} )    (1)

where x_i ∈ R^d denotes the ith input feature with ground-truth label y_i, and d represents the feature dimension. W_j ∈ R^d denotes the weights of the jth neuron of the last FC layer, i.e. the jth column of the connection weights W ∈ R^{d×n}. Then, b ∈ R^n denotes the bias term, M represents the number of examples in each training iteration, and n denotes the number of categories. θ_j is the included angle between W_j and the input feature x_i. For simplicity, following the A-softmax loss [19], we use a binary classification problem (i.e. y ∈ {1, 2}) to analyse the softmax loss. The decision boundary of the original softmax loss is ∥W_1∥∥x∥ cos θ_1 + b_1 = ∥W_2∥∥x∥ cos θ_2 + b_2. After normalising the weights (∥W_j∥ = 1) and zeroing the biases as in [19], the formulation of the softmax loss becomes (2):

L = −(1/M) ∑_{i=1}^{M} log( e^{∥x_i∥ cos θ_{y_i}} / ∑_{j=1}^{n} e^{∥x_i∥ cos θ_j} )    (2)

and the new decision boundary becomes ∥x∥(cos θ_1 − cos θ_2) = 0, which only depends on ∥x∥, θ_1, and θ_2.
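As a numeric sanity check of the identity used above — each FC logit W_j^T x_i + b_j equals ∥W_j∥∥x_i∥ cos θ_j + b_j — the following plain-Python sketch evaluates the logit both ways; the weights and feature values are toy numbers chosen purely for illustration:

```python
import math

def softmax_loss(W, b, x, y):
    """Cross-entropy over the softmax of the last FC layer's logits."""
    logits = [sum(wk * xk for wk, xk in zip(Wj, x)) + bj for Wj, bj in zip(W, b)]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[y]  # -log softmax probability of class y

def norm(v):
    return math.sqrt(sum(t * t for t in v))

# Toy setup: 2 classes, feature dimension d = 2.
W = [[1.0, 0.2], [-0.5, 1.0]]
b = [0.0, 0.0]
x = [0.8, 0.3]

# The logit of class 0 rewritten through the included angle theta_0.
cos0 = sum(wk * xk for wk, xk in zip(W[0], x)) / (norm(W[0]) * norm(x))
logit0 = norm(W[0]) * norm(x) * cos0 + b[0]  # equals W_0^T x + b_0
```

The angular decomposition leaves the logit value unchanged, which is why the loss can be analysed purely in terms of ∥W_j∥, ∥x_i∥, and θ_j.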
The proposed approaches are designed to learn more discriminative features by manipulating the cosine value and feature length.

Constrain the length of x
From (2), the length ∥x_i∥ scales the activation of x_i on each category. When an input feature x_i is correctly classified and its length ∥x_i∥ is large enough, we believe it is not necessary to further optimise ∥x_i∥: a larger ∥x_i∥ will only decrease the value of the softmax loss, instead of increasing the discriminative ability of the features. Therefore, a threshold parameter t ∈ (0, +∞) is employed. When ∥x_i∥ > t, ∥x_i∥ stops updating during back propagation. This constraint encourages the network to update cos θ_j and the features with ∥x_i∥ < t, as shown in (3):

L = −(1/M) ∑_{i=1}^{M} log( e^{x̂_i cos θ_{y_i}} / ∑_{j=1}^{n} e^{x̂_i cos θ_j} ),  x̂_i = min(∥x_i∥, t)    (3)
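The length constraint can be sketched as follows in plain Python, reading the rule as "the length that drives the loss is capped at t"; in practice the gradient-stopping behaviour for ∥x_i∥ > t would be handled by the training framework:

```python
import math

def effective_length(x, t):
    """Length of feature x as seen by the loss, capped at the threshold t.
    Features with ||x|| > t contribute a constant length term, so only
    cos(theta) (and features shorter than t) keep being optimised."""
    length = math.sqrt(sum(v * v for v in x))
    return min(length, t)
```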

Replace the cos function
When θ is close to 0 or π, the gradient of the cosine function tends to saturate (blue line in Fig. 1), so cos θ is replaced with the linear function f(θ) = −k · (θ − π/2). Another advantage of this replacement is that, as the value of k grows larger, the gradient of f(θ) becomes correspondingly greater; as a result, the network will tend to put relatively more effort into optimising the included angle θ. Fig. 1 shows the distribution of f(θ) with k = 1; the effect of varying k is described in the following section.
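The replacement can be written directly; f(θ) agrees with cos θ at θ = π/2 and keeps a constant slope −k, so its gradient never vanishes at θ = 0 or π (the k = 1 case matches Fig. 1):

```python
import math

def f_theta(theta, k=1.0):
    """Linear surrogate for cos(theta): f(theta) = -k * (theta - pi/2)."""
    return -k * (theta - math.pi / 2)

# cos has gradient -sin(theta), which vanishes at theta = 0 and pi;
# f_theta keeps the constant gradient -k everywhere.
grad_cos_at_zero = -math.sin(0.0)  # 0.0: saturated endpoint
grad_f_anywhere = -1.0             # constant, for k = 1
```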

Introduce the margin
A margin m is adopted to explicitly enlarge the inter-class separability of the features. This approach is very similar to the AM-softmax loss [22]. However, while the AM-softmax loss tends to increase the intra-class similarity in a multi-class classification scenario, our approach mainly decreases the inter-class cosine similarity.
The final improved softmax loss, as a combination of these three individual strategies, is given by (5):

L = −(1/M) ∑_{i=1}^{M} log( e^{x̂_i (f(θ_{y_i}) − m)} / ( e^{x̂_i (f(θ_{y_i}) − m)} + ∑_{j ≠ y_i} e^{x̂_i f(θ_j)} ) ),  x̂_i = min(∥x_i∥, t)    (5)
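Since the exact expression is not reproduced in this text, the following plain-Python sketch combines the three strategies as described — capping ∥x∥ at t, scoring with f(θ) = −k(θ − π/2), and subtracting the margin m from the target-class score — and should be read as one plausible instantiation rather than the paper's verbatim formula:

```python
import math

def improved_softmax_loss(cos_thetas, x_len, y, t=25.0, k=1.0, m=0.15):
    """Sketch of the combined loss for one sample.
    cos_thetas: cosine between x and each class weight; x_len: ||x||;
    y: ground-truth class index; t, k, m are the three new hyper-parameters."""
    length = min(x_len, t)                               # strategy 1: cap ||x||
    thetas = [math.acos(max(-1.0, min(1.0, c))) for c in cos_thetas]
    scores = [-k * (th - math.pi / 2) for th in thetas]  # strategy 2: f(theta)
    scores[y] -= m                                       # strategy 3: margin
    logits = [length * s for s in scores]
    mx = max(logits)  # stabilised log-sum-exp
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[y]
```

A larger m shrinks the target logit, so for the same input the loss is strictly larger than the margin-free one, which is what pushes the classes apart during training.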

Visualisation of the learned features
To better explain the usefulness of the proposed approach, the LeNets++ network [18] is trained to classify the MNIST dataset. The employed LeNets++ has six convolution layers and one FC layer with two neurons. During training, all images are normalised by subtracting 127.5 and then dividing by 128. The initial learning rate is set to 0.01 and reduced by a factor of 10 at the 10th and 15th epochs. The batch size is 256. We illustrate the feature scatter maps using features learned from the original softmax loss and the improved softmax loss, as shown in Figs. 2a and b. Although there are ten clusters presented in the original softmax feature scatter map, features from different categories are not well separated. In contrast, features learned from the improved softmax loss have better intra-class compactness and larger inter-class margins. Both the feature scatter map and the recognition accuracy (i.e. 99.41%) prove the effectiveness and discrimination ability of the improved softmax loss.
We also evaluate the effect of the different parameters, i.e. the length threshold t on the input feature x, the slope k of f(θ), and the margin m. As shown in Fig. 2d, when t is equal to 25, an accuracy as high as 98.53% is achieved, and the intra-class features are more compact compared to those of the original softmax loss. We believe that if t is too small, the softmax function may not be fully optimised, and if t is too large, the constraint will barely affect the optimisation of cos θ. Therefore, when t equals 15 or 25, the approach achieves a promising accuracy, i.e. 98.41% or 98.49%, respectively.
Similarly, when the parameter k of f(θ) becomes larger, the network focuses more on the optimisation of the angle θ, resulting in smaller feature magnitudes, with the feature range shrinking, e.g. from [−80, 60] to [−20, 20], as shown in Figs. 2f-h. The maximum accuracy of 98.87% is achieved with k = 4.
The margin m is designed to enlarge the inter-class separability of the features. As m increases, the network tends to enlarge the inter-class distance between features as well as decrease the intra-class distance, as shown in Figs. 2i-k. The samples are well separated when m = 0.15 and show a more compact and discriminative distribution.

Network architecture
Face and expression recognition tasks are used to evaluate the proposed softmax loss. A deep residual convolutional network proposed in [18] is trained for these two tasks. This network consists of 27 convolution layers and one FC layer. We replace the original softmax loss and centre loss with the proposed softmax loss. Fig. 3 shows the parameters of the convolution layers, i.e. '‹kernel size› conv, ‹number of kernels›/‹strides, padding›'.

Dataset
The face recognition network is trained on the CASIA-WebFace dataset [23]. This dataset has 494,414 images from 10,575 identities. These images are randomly split into training and validation sets with a ratio of 9:1. MTCNN [24] is employed to detect the face and facial landmarks in each image. All faces are aligned by affine transformation and then cropped to a resolution of 112 × 96. Each pixel of the RGB images is normalised by subtracting 127.5 and then dividing by 128.
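The pixel normalisation step maps 8-bit values into roughly [−1, 1); as a one-line sketch:

```python
def normalise_image(pixels):
    """Normalise 8-bit pixel values: subtract 127.5, then divide by 128."""
    return [(p - 127.5) / 128.0 for p in pixels]

normalise_image([0, 127.5, 255])  # -> [-0.99609375, 0.0, 0.99609375]
```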
We also verify the performance of our improved softmax loss on the similar-looking (SLLFW [25]), cross-age (CALFW [26]), and cross-pose (CPLFW [27]) datasets. These three datasets were set up as much more challenging alternatives to LFW. While 3000 similar-looking negative face pairs were selected to replace the random negative pairs in LFW, the 3000 positive pairs of SLLFW are the same as those of LFW. CALFW selects 3000 positive face pairs with age gaps to introduce intra-class ageing variance, and selects 3000 negative pairs with the same gender and race. CPLFW selects 3000 positive face pairs with pose differences to introduce intra-class pose variance, and likewise selects 3000 negative pairs with the same gender and race.
The FER2013 dataset [28] is employed to train the expression recognition model. This dataset contains 35,887 grey-scale images, i.e. 4953 'Anger' images, 547 'Disgust' images, 5121 'Fear' images, 8989 'Happiness' images, 6077 'Sadness' images, 4002 'Surprise' images, and 6198 'Neutral' images. According to the test protocol, the whole dataset is divided into three subsets, i.e. 28,709 training images, 3589 public test images, and 3589 private test images. Since the face images of the FER2013 dataset were released at a resolution of 48 × 48, we perform no further facial alignment and directly resize these faces to 120 × 120. All images are normalised by subtracting 127.5 and then dividing by 128.

Face recognition
This network is implemented in PyTorch and trained from scratch with four Nvidia Tesla P100 GPUs. The weights are updated by SGD with 0.9 momentum and 0.0005 weight decay. The batch size is set to 512. The initial learning rate is set to 0.1 and reduced by a factor of 10 at the 16th and 24th epochs. We stop the training at the 28th epoch.
We notice that the proposed softmax loss with margin m is difficult to converge when training the face recognition model. Hence, a new parameter λ is employed to control its learning process. This strategy was proposed by the L-softmax loss [21]: the target logit is computed as the weighted combination (λ ∥x_i∥ cos θ_{y_i} + ∥x_i∥ (f(θ_{y_i}) − m)) / (1 + λ), so that training starts close to the original softmax loss. The initial λ is set to 1000 and gradually reduced to a smaller value. Following the unrestricted, labelled outside data protocol of the LFW dataset, we test 6000 image pairs and report the final verification performance in Table 1. Each test image is processed using the same procedure applied to the training images. Then, we feed each test image and its mirror image to the network and extract the FC features. Finally, the two 1 × 512D feature vectors are added together, and the cosine similarity is employed to compute the similarity between two test images.
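The test-time scoring described above — sum the FC features of an image and its mirror, then compare pairs by cosine similarity — can be sketched as follows; the toy 2D vectors stand in for the real 1 × 512D features:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pair_score(feat_a, feat_a_mirror, feat_b, feat_b_mirror):
    """Add each image's feature to its mirrored-image feature,
    then score the pair by cosine similarity of the summed features."""
    a = [u + v for u, v in zip(feat_a, feat_a_mirror)]
    b = [u + v for u, v in zip(feat_b, feat_b_mirror)]
    return cosine_similarity(a, b)
```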
As can be seen in Table 1, using the same network architecture, the proposed softmax loss outperforms the centre loss [18] and NormFace [29]. Compared to the original softmax loss, an improvement of 5.5% is achieved. Note that the performances of the A-softmax loss and the centre loss are not taken from their original papers, since these methods employed various training datasets and test methods. For a fair comparison of the usefulness of the different losses, the results reported by Wang et al. [22] are adopted.
As shown in Table 2, our method achieves 96.87%, 92.72%, and 86.13% accuracy on the SLLFW, CALFW, and CPLFW datasets, outperforming the original softmax loss by margins of 9.04%, 5.85%, and 7.85%, respectively. The accuracies achieved by our improved softmax loss are even higher than human performance. Note that the parameter settings of our improved softmax loss on these three datasets are consistent with those listed in Table 1, i.e. t = 25, k = 1, m = 0.5, and λ = 1.

Expression recognition
The expression recognition model is trained by fine-tuning the trained face recognition model. All images are augmented online by random cropping to 112 × 96 during training.
For testing, each 120 × 120 image is augmented to ten 112 × 96 images by cropping four corners and centre and horizontal flipping. Then, the ten sub-images are fed into the network to generate ten features. The mean feature is used to predict the final label with cosine distance based nearest neighbour classifier.
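The ten-crop test augmentation takes five 112 × 96 crops from each 120 × 120 image (four corners plus the centre) and mirrors each one; the crop offsets can be sketched as:

```python
def ten_crop_offsets(h, w, ch, cw):
    """(top, left) offsets of the four corner crops and the centre crop
    of size ch x cw inside an h x w image; each crop is additionally
    flipped horizontally, giving ten sub-images in total."""
    return [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
            ((h - ch) // 2, (w - cw) // 2)]

ten_crop_offsets(120, 120, 112, 96)  # centre crop starts at (4, 12)
```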
The results are listed in Table 3. Compared to the original softmax loss, the network trained with the proposed softmax loss significantly improves the accuracy from 67.79% to 70.91%. As the margin m increases, the learned features should become more discriminative; however, the network trained with m = 0.7 achieves the lowest accuracy, which we believe is caused by the stochasticity of network training.

Conclusion
In this paper, an improved softmax loss is proposed to enhance the discriminative capacity of deep CNNs. Results on both face recognition and expression recognition prove the effectiveness of the proposed approach. However, there are still some drawbacks in our work, e.g. we introduce some new hyper-parameters, which need to be manually tuned. Hence, designing better algorithms to automatically select the optimal parameters will be our future work.