Crowd counting with segmentation attention convolutional neural network

Deep learning occupies an undisputed dominance in crowd counting. In this paper, we propose a novel convolutional neural network (CNN) architecture called SegCrowdNet. Despite the complex backgrounds in crowd scenes, the proposed SegCrowdNet adaptively highlights the human head regions and suppresses the non-head regions through segmentation. With the guidance of an attention mechanism, the proposed SegCrowdNet pays more attention to the human head regions and automatically encodes a highly refined density map. The crowd count can be obtained by integrating the density map. To adapt to the variation of crowd counts, SegCrowdNet classifies the crowd count of each image into several groups. In addition, multi-scale features are learned and extracted in the proposed SegCrowdNet to overcome the scale variations of the crowd. To verify the effectiveness of our proposed method, extensive experiments are conducted on four challenging datasets. The results demonstrate that our proposed SegCrowdNet achieves excellent performance compared with the state-of-the-art methods.


Introduction
Recently, crowd counting has become a research hotspot owing to its wide applications, including public safety management [1], crowd analysis [2], and urban planning [3]. Besides, the methods of crowd counting serve as important references for vehicle counting, cell counting, and other object-counting tasks [4]. However, due to the crowd's complexity, such as severe occlusion, high diversity, scale variation, viewpoint variation, and non-uniform distribution, the accuracy of crowd counting still has significant room for improvement. Much effort has been devoted to crowd counting. We generally classify these methods into detection-based methods and regression-based methods. The detection-based methods [5][6][7][8] obtain the crowd count by counting the number of positive detections. The regression-based methods [6,[9][10][11] utilize the mapping between extracted features and cropped patches to regress the crowd count. However, the aforementioned methods are complex, because their features need to be extracted by extremely numerous and cumbersome hand-crafted designs. In recent years, deep learning has achieved breakthroughs in many fields [12][13][14]. Many CNN-based methods [15][16][17] have been proposed for crowd counting. They learn the non-linear mapping between training images and crowd counts. Researchers employ these CNN models to generate a density map that records the count and spatial information of the crowd at each pixel location; the density map is then integrated to obtain the crowd count.
Although deep learning has achieved great improvement in crowd counting, we find that the background is extremely complex in crowd scenes, and we believe that the accuracy of crowd counting can be further improved with a segmentation attention mechanism that highlights the foreground, suppresses the background, and makes the proposed SegCrowdNet pay more attention to the foreground. Some example images are shown in Fig. 1. We first carefully distinguish the foreground and background in crowd counting. The crowd counting result is obtained by counting the number of human heads, so the human head belongs to the foreground, and everything apart from the human head belongs to the background. Therefore, the background is extremely complicated. In Fig. 1, we can also observe that the crowd count changes dramatically across images, which is the second challenging problem. To address these two challenges, firstly, we predict a segmentation map to adaptively highlight the human head region and suppress the non-head region. With the guidance of the segmentation results, adaptive attention weights are used in the estimation of the density map to guide our network, called SegCrowdNet, to pay more attention to the human head region and generate a highly refined density map. Integrating the refined density map yields the crowd counting result. Unfortunately, the present datasets do not provide the ground truth of segmentation. We propose a simple but effective method in which a ones template is pasted onto a binary map to encode the ground truth of segmentation. Secondly, to adapt to the large variation of crowd counts, we utilize a classification task in which the crowd count of each image is classified automatically.
Meanwhile, to extract multi-scale features to adapt to the multi-scale crowd, our proposed SegCrowdNet not only utilizes different convolution kernels to encode the image but also fuses rich feature hierarchies from different depths of convolutional layers, where the lower layers extract discriminative features of the pedestrian and the higher layers learn semantic concepts of the same pedestrian.

Fig. 2:
The proposed architecture of our SegCrowdNet. It is a multi-task model including the classification task (Cla-task), segmentation task (Seg-task), and density estimation task (Des-task). The parameters of the convolutional layers are represented as "Conv(kernel size)-(number of filters)-(dilation rate)". The ⊕ symbol represents the element-wise add operation. The estimated density map (outlined in blue) is encoded in the intermediate supervision process.
In this paper, we propose a novel end-to-end framework called SegCrowdNet. To the best of our knowledge, the proposed SegCrowdNet is the first network to utilize a segmentation attention mechanism in crowd counting. It adaptively highlights the human head region and suppresses the non-head region by optimizing a novel loss. With the guidance of the segmentation attention mechanism, the proposed SegCrowdNet pays more attention to the human head region and automatically encodes a highly refined density map. Our proposed SegCrowdNet can also automatically adapt to the variation of crowd counts by learning a classification function. Extensive experiments are conducted on the ShanghaiTech Part_A dataset [15], ShanghaiTech Part_B dataset [15], UCF_CC_50 dataset [11], and WorldExpo'10 dataset [18]. The results demonstrate that our proposed method outperforms many state-of-the-art methods.

Counting by detection
In early research, many kinds of detection frameworks were proposed for crowd counting. Pivotal features of the human body are extracted and classified by well-trained classifiers based on descriptors such as HOG [5], Random Forest [6], and Haar wavelets [7]; the classifier then outputs the positive samples. The total number of positive samples represents the crowd count. [25][26][27] complete the crowd counting task using the aforementioned approach, and [22] leverages adaptive thresholds to binarize the image and detect the crowd. These methods achieve good results in sparse crowd scenes. However, when the crowd density becomes high, some persons are too small to be detected.

Counting by regression
In highly congested crowd scenes, regression-based methods are usually chosen. They have two important stages: low-level features are first extracted from the input image, then the crowd count is regressed from these features. To preserve the spatial information of the crowd, Lempitsky et al. [10] proposed a method that learns a linear mapping between local area features and the corresponding density maps. Based on Lempitsky's work, Pham et al. [6] proposed a more applicable method, replacing the linear mapping with a non-linear mapping based on random forest regression. In [24], an SVM is proposed to map the features extracted by AdaBoost [28] to the crowd counting result.

Counting by CNN
In recent years, CNN has achieved great success in computer vision tasks, including crowd counting. Boominathan et al. [29] employed a CNN-based framework to estimate the density map. In this paper, we propose a novel segmentation attention mechanism to guide an end-to-end architecture called SegCrowdNet to pay more attention to the human head region. Our segmentation is based on the dice coefficient, and we employ the CNN and the ground-truth density map to fine-tune the final density map. Moreover, our proposed SegCrowdNet can adapt to the variation of crowd counts through a classification task.

SegCrowdNet Architecture
An overview of the proposed SegCrowdNet can be seen in Fig. 2. The backbone is designed to extract multi-scale features. The proposed SegCrowdNet contains the classification task, the segmentation task, and the density estimation task. The configurations of the backbone are shown in Fig. 3. To extract multi-scale features to overcome the scale variations of the crowd, firstly, four convolution kernels with different receptive fields are placed at the beginning of the backbone. Each of them has 16 filters. The input image is mapped at different scales by them synchronously, and the results are fed to the following modules. Secondly, inside this module, we design several 2×2 max-pooling layers to extract multi-scale features. Thirdly, we fuse the feature maps with complementary information from different depths of convolutional layers. Every convolutional layer is followed by ReLU. To increase the depth of the network and enhance its learning ability without introducing too many parameters, the parameters of the shared module are shared, which also alleviates the overfitting caused by excessive parameters. Since dilated convolution [34] can increase the receptive field with fewer parameters, it is widely used in our SegCrowdNet.
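To illustrate why dilated convolutions enlarge the receptive field cheaply, the effective spatial extent of a dilated kernel can be computed with standard receptive-field arithmetic (a minimal sketch; the helper name is ours):

```python
def effective_kernel(k, d):
    # Spatial extent covered by a k x k kernel with dilation rate d:
    # the k taps are spread d pixels apart, so the covered span grows
    # while the number of learned weights stays k * k.
    return k + (k - 1) * (d - 1)
```

For example, a 3×3 kernel with dilation rate 2 covers the same 5×5 area as a plain 5×5 kernel while keeping only 9 weights per channel.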
In the 'Range' column of Table 1, it can be observed that the crowd count varies greatly across images. For example, the UCF_CC_50 dataset contains only 50 images, yet the crowd counts range from 94 to 4,543. Therefore, the classification task is employed. The classifier can automatically learn the crowd count distribution to adapt to the variation of crowd counts. Inspired by [35], the crowd count of each image in the crowd dataset is quantized into several groups. As shown in Fig. 2, the fully connected (FC) layers are connected at the end of the backbone. The two fully connected layers, each followed by PReLU, have 64 neurons and 5 neurons, respectively. The 5 neurons represent the five count groups. To avoid distorting the images and to maintain the original distribution of the crowd, we do not resize the input image. Spatial Pyramid Pooling (SPP) [36] layers are placed between the convolutional layers and the FC layers. Feature maps of arbitrary size extracted from the input image can be fed to the SPP layer, which produces fixed-size outputs to feed the FC layers. In the classification task, the crowd counts of each dataset are classified into five count groups to adapt to the variation of crowd counts. We minimize the cross-entropy loss to optimize this process.
Two other tasks are also shown in Fig. 2. In the segmentation task, the segmentation map, which adaptively emphasizes the head region and suppresses the non-head region, is encoded under the supervision of the ground-truth segmentation. In the density estimation task, the estimated density map (outlined in blue) and the final estimated density map are predicted under the supervision of the ground-truth density map. Integrating the final estimated density map yields the crowd count. As shown in Fig. 2, we add the segmentation map to the estimated density map (outlined in blue), and the result is fed to the following convolution and ReLU layers to automatically encode the final estimated density map. With the guidance of the segmentation map, in which the human head regions have higher weights, more attention is paid to the human head region in the density estimation task. In the segmentation task, a novel loss based on the dice coefficient is employed; this loss and the ground truth of segmentation are elaborated in Sec. 3.2. In the density estimation task, the Euclidean distance loss is employed to optimize the estimated density map and the final estimated density map.
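The attention step above can be sketched schematically: the segmentation map is added element-wise to the intermediate density map, so head pixels (segmentation value near 1) enter the subsequent convolution layers with larger activations. This toy NumPy example shows only the add; the refining convolutions and the learned weighting are omitted, and the numbers are illustrative:

```python
import numpy as np

# Toy 2x2 maps: seg marks head pixels with 1 and non-head pixels with 0.
density = np.array([[0.2, 0.1],
                    [0.0, 0.3]])
seg = np.array([[1.0, 0.0],
                [0.0, 1.0]])

# Element-wise add: head pixels receive a boost before the following
# convolution layers encode the final estimated density map.
fused = density + seg
```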

Model Learning
Ground truth generation: The ground-truth density map is extremely important for supervised methods. For any training image, the 2D point p located at the center of each human head is provided. The ground-truth density map is encoded by employing a normalized Gaussian kernel centered on each point p, which is defined as

D_j(c) = \sum_{p \in A_j} \mathcal{N}(c - p; \mu, \sigma),

where c represents the pixel location, A_j represents the set of 2D points annotated in image j, and \mathcal{N}(\cdot; \mu, \sigma) represents the normalized Gaussian kernel with mean µ = 0 and isotropic variance σ = 4. The window size of the Gaussian kernel is 15×15. We utilize this simple method to generate the ground-truth density map to ensure that the improvement of the results comes from our novel method rather than from an innovative way of generating the ground-truth density map. It is extremely expensive to manually label the ground truth of segmentation, owing to the huge number of people in the datasets. For example, there are 330,165 people in the ShanghaiTech dataset. Therefore, a simple but effective method is proposed to encode the ground-truth segmentation map, which should have the same foreground and background as the ground-truth density map. First, we construct a ones template with the same size as the Gaussian kernel, which is set to 15×15. Second, the ground truth of segmentation is encoded by pasting the ones template, centered on each annotated point p, onto a binary map. They are visualized in Fig. 4. Experiments using templates of different scales are conducted on the challenging ShanghaiTech Part_A; the results are given in Table 2. It can be found that the performance with the 15×15 template is the best. We use the same template scale (15×15) on all four datasets to show the robustness of our proposed method.

Model Optimization: The proposed SegCrowdNet is a multi-task model including the classification task, segmentation task, and density estimation task.
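The ground-truth generation described above can be sketched as follows, assuming head annotations are given as (row, col) pixel coordinates; the function names are ours, and clipping of the 15×15 window at image borders is handled explicitly:

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    # Normalized Gaussian window: values sum to 1, so each head
    # contributes exactly one count to the integrated density map.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def make_ground_truth(shape, points, size=15, sigma=4.0):
    density = np.zeros(shape)
    seg = np.zeros(shape)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in points:
        y0, y1 = max(y - r, 0), min(y + r + 1, shape[0])
        x0, x1 = max(x - r, 0), min(x + r + 1, shape[1])
        # Accumulate the Gaussian for the density map ...
        density[y0:y1, x0:x1] += k[y0 - y + r:y1 - y + r,
                                   x0 - x + r:x1 - x + r]
        # ... and paste the 15x15 ones template for the binary map.
        seg[y0:y1, x0:x1] = 1.0
    return density, seg
```

Because each pasted Gaussian integrates to 1, summing the density map recovers the annotated head count.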
Multi-task learning [37] can serve as a regularization to alleviate overfitting, as it requires our proposed SegCrowdNet to consider every task synchronously rather than only one of them. The proposed SegCrowdNet is optimized by minimizing four loss functions in a synergistic manner, including an intermediate supervision loss. In the density estimation task, the Euclidean loss is utilized to optimize the estimated density map (outlined in blue) and the final estimated density map:

L_{mid} = \frac{1}{2U} \sum_{i=1}^{U} (\hat{d}_i - D_i)^2, \qquad L_{den} = \frac{1}{2U} \sum_{i=1}^{U} (\hat{D}_i - D_i)^2,

where \hat{d}_i represents the estimated density in the intermediate supervision process, \hat{D}_i represents the final estimated density, D_i represents the ground-truth density, and U represents the number of pixels in the ground-truth density map.
In the segmentation task, inspired by [38,39], we introduce a novel loss into crowd counting that is based on the dice coefficient. The loss is optimized to predict the segmentation map of the human head. The dice coefficient D(\hat{s}, s) lies between 0 and 1. In the process of optimization, D(\hat{s}, s) is expected to be maximized and the loss L_{seg} is expected to be minimized. They are formulated as

D(\hat{s}, s) = \frac{2 \sum_{i=1}^{U} \hat{s}_i s_i}{\sum_{i=1}^{U} \hat{s}_i^2 + \sum_{i=1}^{U} s_i^2}, \qquad L_{seg} = 1 - D(\hat{s}, s),

where U represents the total number of pixels in the ground-truth segmentation map, s_i represents the i-th value in the ground-truth segmentation map, and \hat{s}_i represents the i-th value in the predicted segmentation map.
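A minimal implementation of the dice-based loss, written with squared sums in the denominator following the V-Net formulation [38]; the small smoothing constant `eps` is our addition to avoid division by zero on empty maps:

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-7):
    # L_seg = 1 - D(s_hat, s): the loss is 0 for a perfect match
    # and approaches 1 when prediction and ground truth are disjoint.
    num = 2.0 * np.sum(pred * gt)
    den = np.sum(pred ** 2) + np.sum(gt ** 2) + eps
    return 1.0 - num / den
```

Unlike a per-pixel cross-entropy, the dice loss is insensitive to the heavy foreground/background imbalance of head segmentation maps, which is the usual motivation for choosing it.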
In the classification task, the crowd counts are quantized into five groups. For example, if the crowd counts of a dataset range from 1 to 500 and are quantized into five classes, the images with populations between 1 and 100 belong to the first class, and the images with populations between 401 and 500 belong to the fifth class. The cross-entropy loss function is utilized:

L_{cla} = -\frac{1}{M} \sum_{a=1}^{M} \sum_{b=1}^{K} y_{ab} \log \hat{y}_{ab},

where M represents the total number of training samples, K represents the total number of classes, y_{ab} represents the ground-truth class label, and \hat{y}_{ab} represents the output of the classification. The final weighted loss is given by equation (7). Empirically, we set λ_1 to 0.01.
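The equal-width quantization illustrated by the 1-500 example can be sketched as follows (the helper name and the equal-width binning are our illustrative assumptions):

```python
def quantize_count(count, lo, hi, k=5):
    # Map a crowd count in [lo, hi] to one of k equal-width groups;
    # the min() clamp keeps the upper boundary in the last group.
    width = (hi - lo + 1) / k
    return min(int((count - lo) / width), k - 1)
```

With `lo=1, hi=500, k=5`, counts 1-100 fall into class 0 and counts 401-500 into class 4, matching the example in the text.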

Experiment setting
We conduct extensive experiments on four challenging datasets. The statistics of the four datasets are summarized in Table 1, and some example frames are shown in Fig. 6. In the process of creating the training data, 9 patches are cropped randomly from each original image, and the size of each patch is 1/4 of the original image. Horizontal flipping and noise addition are also applied to the training data. We utilize the Adam optimizer with a fixed momentum of 0.9 to train our model and set the learning rate to 1e-6.
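The patch-cropping augmentation can be sketched as below, reading "1/4 of the original image" as 1/4 of its area, i.e. half the height and half the width (a common convention in crowd counting, but an assumption on our part):

```python
import random
import numpy as np

def random_patches(img, n=9):
    # Crop n patches at random positions; each patch is half the
    # height and half the width of the input, i.e. 1/4 of its area.
    h, w = img.shape[:2]
    ph, pw = h // 2, w // 2
    patches = []
    for _ in range(n):
        y = random.randint(0, h - ph)
        x = random.randint(0, w - pw)
        patches.append(img[y:y + ph, x:x + pw])
    return patches
```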

Evaluation Metric
The mean absolute error (MAE) and the mean squared error (MSE) are widely used to measure the crowd counting error of different methods [15,16,32,33]. They are defined as follows:

\mathrm{MAE} = \frac{1}{V} \sum_{i=1}^{V} |z_i - \hat{z}_i|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{V} \sum_{i=1}^{V} (z_i - \hat{z}_i)^2},

where z_i represents the ground-truth crowd count, \hat{z}_i represents the estimated crowd count, and V represents the number of test images.
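The two metrics can be implemented directly; note that, following the convention of the crowd counting literature [15], "MSE" here denotes the root of the mean squared error:

```python
import numpy as np

def mae(gt, est):
    # Mean absolute error over the V test images.
    return np.mean(np.abs(np.asarray(gt) - np.asarray(est)))

def mse(gt, est):
    # "MSE" in crowd counting papers is the ROOT of the mean squared
    # error; it penalizes large per-image errors and so reflects the
    # robustness of a method.
    return np.sqrt(np.mean((np.asarray(gt) - np.asarray(est)) ** 2))
```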

Ablation study using ShanghaiTech Part_A
To verify the effectiveness of each task in the proposed SegCrowdNet, we perform ablation studies on ShanghaiTech Part_A [15], a large-scale, high-density dataset containing 482 images with 241,667 annotated persons. Similar results can be observed on the other datasets.
To investigate the ability of the classification task, in which the crowd count of each image is classified, we conduct a set of comparative experiments. In the first experiment, the classification task functions normally; in the second, the classification task is removed. The estimation errors are given in Table 3. We can observe that the performance with the Cla-task is better, with MAE/MSE 12.4/19.9 lower than without the Cla-task. We believe that the proposed SegCrowdNet can learn the crowd count distribution to adapt to the variation of crowd counts, which contributes to reducing the crowd counting errors. In the proposed SegCrowdNet, we classify the crowd counts of the images in each dataset into five categories, chosen according to extensive experiments. From Table 4, we find that our proposed SegCrowdNet with five categories performs best. We believe that because the crowd counting data are limited, five categories are the most suitable. Last but not least, comparing the classification task with different numbers of categories in Table 4 against the performance without the Cla-task in Table 3, the MAE is reduced in every setting, which further indicates that the Cla-task helps reduce the estimation error of crowd counting.
The proposed segmentation attention mechanism is the most important idea in this paper. It is included in the segmentation task.
To demonstrate the effectiveness of the segmentation task, we also conduct a set of comparative experiments. In the first experiment, the segmentation task functions normally; in the second, the segmentation task is removed from our proposed SegCrowdNet, and the segmentation attention mechanism is consequently removed. The quantitative results are illustrated in Table 5. We can observe that SegCrowdNet greatly reduces the error of crowd counting with the guidance of the segmentation results, with MAE/MSE 17.5/24.5 lower than without the Seg-task, which reveals the power of the segmentation attention mechanism. In Fig. 5, we show the estimated crowd count of each image in the comparative experiment. It can be observed that the blue circles (predicted crowd counts with the guidance of the Seg-task) are closer to the red triangles (actual crowd counts) than the green pentagrams (predicted crowd counts without the Seg-task) in most images, which indicates that the segmentation attention mechanism plays a vital role in reducing the errors of crowd counting.
We visualize the effectiveness of the Seg-task in Fig. 7. In the third column, it can be observed that the human head regions are highlighted and the non-head regions are suppressed effectively in the estimated segmentation map, which is important for guiding our proposed SegCrowdNet to focus on the human head region. In the fourth and fifth columns, we can observe that, with the guidance of the segmentation results, the proposed SegCrowdNet pays more attention to the human head region and the crowd counts are predicted accurately. Similar results can be found in Fig. 8. From these results, we can see that every task in the proposed SegCrowdNet is necessary; they work collaboratively to better accomplish the task of crowd counting.

ShanghaiTech dataset
The ShanghaiTech dataset [15] is highly challenging because of its large-scale crowds. It contains 1,198 images: Part_A with 300 training images and 182 test images, and Part_B with 400 training images and 316 test images. Part_A consists of high-density crowd images downloaded from the internet. Part_B consists of relatively low-density crowd images collected from Shanghai streets. As shown in Table 6, we compare our proposed method with fourteen other state-of-the-art methods. In the LBP+RR method, the LBP features were extracted by hand to regress the crowd count. Owing to the scale variation of the crowd, MCNN [15], CAFN [40], TDF-CNN [32], Switch-CNN [16], and CP-CNN [17] employed multi-column CNNs to extract multi-scale features. Instead of using a multi-column CNN, Hossain et al. [30] proposed a scale-aware attention network to encode multi-scale density maps. In [41], IG-CNN adapted to multi-scale crowd scenes by increasing its capacity. Cascaded-MTL [35], Marsden et al. [4], DecideNet+R3 [42], and DDCN [43] introduced multi-task learning to assist in reducing the count estimation errors. SE Cycle GAN [33] designed a data collector and labeler to generate crowd data with corresponding labels to alleviate the overfitting caused by the limited training data. In [44], deep negative correlation learning was utilized to extract general feature representations. The estimation errors are illustrated in Table 6. It can be observed that, firstly, the performance of deep learning is far better than that of hand-crafted features. Secondly, much effort has been devoted to extracting the multi-scale features of the crowd, which is very important for reducing the estimation errors. Finally, our proposed SegCrowdNet achieves the best performance in MAE and MSE on both datasets.

UCF_CC_50 dataset
The UCF_CC_50 dataset [11] contains 50 crowd images collected from the internet. The number of people in each image ranges from 94 to 4,543. Since there are only 50 images in this dataset, we follow the standard protocol discussed in [11]: 5-fold cross-validation is utilized to evaluate the performance on this dataset. In Table 8, we compare our method with other recent state-of-the-art methods ('-' indicates that the result on this dataset is not reported in that paper). Hand-crafted features were extracted in [11,45]. The methods compared on ShanghaiTech are also compared on this dataset; the detailed comparison results are shown in Table 8. In the same way, it can be observed that all of the CNN-based methods significantly outperform the traditional feature extraction methods [11,45]. The proposed SegCrowdNet achieves the best performance in MAE; more specifically, its MAE is 38.0 lower than that of the second-best method. However, its MSE, which indicates the robustness of a method, is not the best. We believe the root cause lies in the limited data, as there are only 50 images in this dataset.
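The 5-fold protocol can be sketched as follows; the interleaved fold assignment below is our illustrative choice of partition, not necessarily the exact split of [11]:

```python
def five_fold_splits(n_images=50, k=5):
    # Assign image indices to k folds in an interleaved fashion;
    # each fold serves once as the test set while the remaining
    # images form the training set.
    idx = list(range(n_images))
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = set(folds[i])
        train = [j for j in idx if j not in test]
        splits.append((train, sorted(test)))
    return splits
```

The reported MAE/MSE are then averaged over the five test folds, so every image contributes exactly once to evaluation.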

WorldExpo'10 dataset
The WorldExpo'10 dataset [18] contains 1,132 video sequences captured from 108 scenarios and consists of 3,982 annotated frames with a size of 576×720. The frames with the region of interest (ROI) are divided into a training set with 3,380 frames and a test set with 600 frames. The test set consists of five different scenes (S1-S5), and each scene contains 120 frames. For fair comparison, we follow the method in [18] to generate the density map.
The results of recent state-of-the-art methods are summarized in Table 7. The MAE is utilized to evaluate these methods. We can observe that the CNN-based methods are still superior to the traditional methods. Our proposed SegCrowdNet obtains the best results on S2, S5, and the average, while it achieves comparable performance on the other three scenes. By reviewing the test images of these three scenes, we find that the people in them are mostly densely gathered, which makes it challenging for our proposed SegCrowdNet to further improve the accuracy.

Conclusions
In this paper, an end-to-end architecture named SegCrowdNet is designed for crowd counting. We propose a novel segmentation attention mechanism to guide our SegCrowdNet to pay more attention to the human head regions. The proposed SegCrowdNet can also automatically adapt to the variation of crowd counts by optimizing a classification task. Moreover, the proposed four-loss optimization improves the generalization ability of SegCrowdNet. We verify our method on four popular crowd counting datasets (the ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and WorldExpo'10 datasets). Extensive experimental results demonstrate that our proposed method outperforms many state-of-the-art methods.