Human behaviour recognition with mid-level representations for crowd understanding and analysis

Crowd understanding and analysis have received increasing attention over the past couple of decades, and progress in human behaviour recognition strongly supports applications of crowd understanding and analysis. Human behaviour recognition seeks to automatically analyse ongoing movements and actions across different camera views, applying machine learning methodologies to unknown video clips or image sequences. Compared with other data modalities such as documents and images, video data demand much higher computational and storage resources. This paper explores the idea of using mid-level semantic concepts to represent human actions in videos and argues that these semantic attributes enable the construction of more descriptive methods for human action recognition. The mid-level attributes, initialized by a clustering process, are built upon low-level features and fully exploit the discrepancies between action classes, capturing the importance of each attribute for each action class. The resulting representation is semantically rich and highly discriminative even when paired with simple linear classifiers. The method is verified on three challenging datasets (KTH, UCF50 and HMDB51), and the experimental results demonstrate that it outperforms the baseline methods on human action recognition.


INTRODUCTION
Visual recognition and understanding [1][2][3][4][5] of human behaviours have been widely used in crowd analysis [6][7][8] and security surveillance. Moreover, the development of human action recognition can support the application of crowd understanding and analysis. In simple terms, the objective of human behaviour recognition is to correctly classify a video into its action category, where the video is fragmented to contain only one instance of human movement. In more general cases, human action recognition aims to continuously recognize every human action that appears in the input video from start to end. The human action recognition task is significant in several applications. For instance, a train station surveillance system can automatically recognize suspicious activities such as 'people suddenly running in panic' or 'a person waving his/her arms with a sword in hand.' Recognition of human actions is also helpful for the real-time monitoring of patients, children and elderly persons. More importantly, a high-performance action recognition system makes the construction of vision-based intelligent environments and gesture-based human-computer interfaces possible.
Though recent advances in machine recognition of human actions [9][10][11][12] have generated a lot of enthusiasm in the vision community, several important issues remain to be addressed. One of the most crucial is to find algorithms that can robustly overcome the variability of features within the same action class. This variability arises because there are large variations in how an action is performed: hand-waving motions, for example, can differ in speed and amplitude. In addition, there are anthropometric differences between individuals. Moreover, similar observations can be made for different actions, especially for non-periodic actions or actions adapted to the environment (e.g. walking with a dog or handing somebody something). A thorough human action recognition framework should tolerate variations within one class while distinguishing actions of different classes. As the number of action classes increases, this becomes even more challenging because the overlap between classes grows.
Many previous action recognition frameworks directly match low-level features to action class labels [13][14][15][16], but raw low-level features can hardly generalize the abundant visual spatiotemporal information. To overcome this drawback, recent works show that attributes built upon raw low-level features can act as higher semantic concepts and bridge the gap between low-level features and action class labels. Some works [17][18][19][20][21] even treated attributes as mid-level features for classification or learning problems. Lampert et al. [22] proposed the direct attribute prediction model, which was learned to predict the presence of each attribute and then used these predictions to train object models. Liu et al. [23] treated action attributes as latent variables, whose classifiers are pre-trained by linear SVMs and whose outputs serve as inputs to a latent SVM. These methods have proved that attributes are versatile and effective for recognizing actions.
In this work, we simultaneously consider the relationship between low-level features and attributes to bridge the gap between low-level features and action class labels. Attributes can build a semantically expressive bridge between raw data and higher-level representations, as they focus on depicting the characteristics shared by all instances of a class rather than naming any one instance directly. However, intra-class variability can still make attributes inaccurate for specific members of the training classes. As a result, some action samples may be associated with slightly different sets of attributes even if they share the same action class label. To address this problem, we construct a mid-level representation by combining low-level features and attributes. Figure 1 illustrates our method and the procedure that generates the mid-level representation: it not only introduces a set of attributes to bridge the gap between low-level features and action classes, but also formulates the connection between attributes and action classes. Each action attribute corresponds to a set of low-level features that share the same action class label. Meanwhile, every mid-level representation contains both intra-class and inter-class information, which makes it highly discriminative. The contributions of this paper can be summarized as follows:
1. The mid-level representations can overcome the appearance variation across cameras to some extent and represent the discriminative information of motion patterns in video sequences.
2. The proposed model is discriminative: it makes the same actions more similar under the mid-level representation while making different motion patterns more distinct, thereby reducing the negative impact of similar actions.
3. Extensive experiments have been conducted on three action datasets to verify the performance of the proposed method.
The experimental results show that our mid-level representation of human action patterns is capable of constructing a powerful action recognition framework. In addition, our method can give prominent performance on several standard datasets (KTH, UCF50, HMDB51).
The rest of this paper is organized as follows. In Section II, we review recent works on action recognition. In Section III, we introduce the proposed method in detail, including the construction of the mid-level representation and the whole algorithmic framework for classifying human actions. We then present experimental results on several datasets for video-based action recognition in Section IV. Finally, we conclude this paper in Section V.

RELATED WORKS
The recognition of human activities from video sequences is one of the most promising applications of computer vision and effectively supports crowd understanding and analysis. In recent years, this task has attracted the attention of researchers from academia, security agencies, industry and the general populace. According to the features used for recognizing human actions, related works can be mainly classified into two categories: global feature-based methods and local feature-based methods. Neural network-based methods have also attracted attention because of their outstanding performance in many computer vision tasks. In addition, methods for crowd understanding and analysis are reviewed.

Global feature based method
Global feature based methods [24][25][26] encode the visual observation as a whole, and the region of interest (ROI) is usually detected through background subtraction or tracking. Common global features are obtained from edges, silhouettes or optical flow [27]. These representations are powerful since they encode much of the information. The work of Bobick and Davis [27,28] is one of the earliest methods utilizing silhouettes. They extracted silhouettes from a single view and accumulated the differences between subsequent frames of a motion clip into a binary motion energy image (MEI) and a motion history image (MHI). Instead of silhouettes, the observation within the ROI can also be described with motion information such as optical flow [28]. Flow-based methods are usually employed when background subtraction cannot be performed. Efros et al. [28] calculated optical flow in person-centred video frames of sports footage, where persons in the frame were very small. There is also a lot of work combining flow and shape descriptors, which can overcome the limitations of a single representation. Optical flow was employed in ref. [29], which investigated what makes a flow algorithm good for action recognition. In addition, the spatio-temporal volume approach is also a global representation, although it shares many similarities with local approaches.
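The MEI/MHI construction described above can be sketched with simple frame differencing. The threshold and linear decay below are illustrative choices, not the exact formulation of Bobick and Davis:

```python
import numpy as np

def motion_energy_and_history(frames, tau=None, thresh=0.1):
    """Compute a binary motion energy image (MEI) and a motion history
    image (MHI) from a sequence of grayscale frames, in the spirit of
    Bobick and Davis.  `frames` is a sequence of 2-D float arrays."""
    frames = np.asarray(frames, dtype=float)
    T = len(frames)
    tau = T if tau is None else tau
    mei = np.zeros(frames[0].shape, dtype=bool)
    mhi = np.zeros(frames[0].shape, dtype=float)
    for t in range(1, T):
        moving = np.abs(frames[t] - frames[t - 1]) > thresh  # frame difference
        mei |= moving                                        # MEI: where motion ever occurred
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))  # MHI: motion recency
    return mei, mhi
```

The MEI records where motion occurred at any time, while the MHI additionally encodes how recently it occurred, which is what lets a single image summarize the temporal course of an action.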

Local feature based method
Local representations describe the object as a collection of individual patches. To estimate the local features, spatiotemporal interest points are detected first. After that, local cuboids are computed around these points. Then, the cuboids are combined into a final feature, and bag-of-features approaches are commonly applied afterwards. Local features are less sensitive to noise and partial occlusion, and do not strictly require background subtraction or tracking. However, as they depend on the extraction of a sufficient number of relevant interest points, pre-processing is sometimes indispensable.
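The detect-describe-quantize pipeline above ends in a histogram over a visual codebook. A minimal sketch of the quantization step follows, assuming the codebook has already been learned (e.g. by K-means on training descriptors):

```python
import numpy as np

def bag_of_features(descriptors, codebook):
    """Quantize local descriptors against a visual codebook and return a
    normalized histogram -- a standard bag-of-features video signature.
    `descriptors`: (n, d) local cuboid descriptors; `codebook`: (k, d)."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                 # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                  # L1-normalized histogram
```

The resulting fixed-length histogram is what a downstream classifier consumes, regardless of how many interest points a given clip produced.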
To detect the interest points or salient regions, a large variety of methods [30][31][32][33][34] have been put forward. Laptev et al. [28] modified the 2D spatial interest points to detect 3D spatiotemporal interest points for action recognition. The detected local features preserve some rotation invariance, which is robust to the succeeding action recognition. Wang et al. [34] presented a detector to obtain spatio-temporal interest points by employing a Geometric Algebra for action recognition in multi-channel videos. Yi et al. [32] developed the salient point detector by taking advantage of trajectory space information based multiple kernel learning.
After the detection of interesting points or salient regions, many methods have been proposed to describe them. Scovanner et al. [35] extended the SIFT descriptor to 3D-SIFT, by using the spatiotemporal information which is denoted by a sub-histogram. Laptev et al. [36] applied histograms of gradient orientations (HoG) and histograms of optic flows to describe the information of local motion and cubes of the space-time neighbourhoods of detected interest points, and Klaser et al. [37] ameliorated HoG to the 3D case.

Deep learning based method

Deep learning based methods [38] can effectively perceive spatial features, temporal dependencies and motion structure. Zufan et al. [39] address human action recognition by applying Conv-LSTM and FC-LSTM (fully connected LSTM) combined with different attention mechanisms. AGC-LSTM [40] (attention-enhanced graph convolutional LSTM) can explore relationships and capture discriminative features in the spatio-temporal domain. Meng et al. [41] propose a novel sample fusion network to achieve data augmentation for skeleton-based human action recognition (SHAR). Graph convolution techniques have also been applied to SHAR: G3D [42] applies cross-space-time edges as skip connections, which propagate information directly for disentangling the importance of nodes.

Crowd understanding and analysis
Besides human behaviour recognition, crowd understanding and analysis comprise many subproblems such as crowd counting and person re-identification. Wang et al. [6] proposed a large-scale benchmark for crowd localization and counting named NWPU-Crowd. Synthetic data [7] have also been adopted for pixel-wise crowd understanding. In ref. [8], a pixel shuffle decoder and density-aware curriculum learning are designed to unlock the potential of crowd counting models. Moreover, Zhao et al. [43] present PISNet to address the interference of other pedestrians in person re-identification, and Li et al. [44] focus on different and similar components for person re-identification.

PROPOSED METHOD
In this section, details about the proposed method are introduced. Framework of our method is shown in Figure 2. In III-A, we illustrate the 'low-level' features used in this work. In III-B, we present our method for human action recognition in detail, which explicitly employs the attribute-based mid-level representation. In III-C, we give the optimization procedures for the recognition method.

Low-level features
To extract low-level features, we follow the procedure described in ref. [45] and use the features named 'action bank.' In this implementation, the action bank is obtained using the action spotting detector [46], following the previous work [45]. As described in [45], the action bank comprises many individual action detectors sampled broadly in semantic space as well as viewpoint space and is constructed to be semantically rich. Moreover, it is capable of highly discriminative performance, making it a sound basis for a powerful action recognition method.

Attribute-based mid-level representations
Let X ∈ R^{d×N} represent the raw low-level features extracted from action video clips. Each column of X corresponds to the feature representation of one human action, and N is the number of training samples. The low-level action bank feature x_i ∈ X represents the i-th sample of the training set, and y_i is the corresponding action class label of x_i. Most previous works represent actions with raw extracted features and simply connect the feature vectors to action class labels. In that case, the raw features cannot convey all the information that recognition needs, so a gap exists between the low-level features and class labels. In this paper, we consider that human actions can be better described by attribute-based mid-level representations, since these mid-level representations act as higher semantic concepts and bridge the gap between low-level features and action class labels.
Attributes are mainly used to describe an instance (e.g. its shared parts or characteristics) rather than to name it directly. In a human action recognition context, for example, a 'leg motion' may be shared between the 'jogging' action and the 'running' action. Such descriptions of components or parts shared between similar or different classes are called attributes, and attribute learning has recently been popularized as an effective method for image and video understanding [22]. In this paper, we design an action attribute space A ∈ R^{d×r} as a semantic metric space in which each column denotes one kind of information of an action class. The attribute space is composed of r attributes, and each attribute is mined from a group of low-level features sharing the same action class label by means of a clustering algorithm. The cluster centers are defined as attributes because they are more discriminative than low-level features and able to capture the inherent intra-class variability of each action class. Thus, they are capable of representing the complete information of the action classes. Given a training set D = {(x_i, y_i)}_{i=1}^{N}, we try to build a classification model to recognize an unknown action x. To this end, both attributes and low-level features are used to generate mid-level representations Φ_C(X), which represent the action video clips with higher-level concepts. The mid-level representation of a single sample is defined as

Φ_C(x) = [s(x, c_1), s(x, c_2), …, s(x, c_r)]^T,     (1)

where c_i ∈ A is the i-th attribute generated from X by K-means clustering and r is the total number of cluster centers. s(x, c) is a similarity measure function; we measure similarity with a Gaussian Hausdorff function, that is,

s(x, c) = exp(−‖x − c‖² / (2σ²)),     (2)

where σ is set to the average distance between all pairs of instances. For the training set D, we define

Φ_C(X) = [Φ_C(x_1), Φ_C(x_2), …, Φ_C(x_N)] ∈ R^{r×N},     (3)

where N is the number of training samples. For each action class l, there is a label vector Y_l ∈ {+1, −1}^N, where +1 denotes the positive samples and −1 denotes the negative ones.
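The attribute mining and the mid-level mapping can be sketched as follows. The inlined Lloyd's-algorithm K-means and the fixed number of centers per class are simplifications used to keep the sketch self-contained:

```python
import numpy as np

def build_attributes(X_by_class, r_per_class, n_iter=20, seed=0):
    """Mine attributes as K-means centers of each class's low-level
    features (columns of each X), then stack them into C (d x r)."""
    rng = np.random.default_rng(seed)
    centers = []
    for X in X_by_class:                       # X: (d, n_l) features of class l
        pts = X.T.astype(float)
        c = pts[rng.choice(len(pts), r_per_class, replace=False)].copy()
        for _ in range(n_iter):                # plain Lloyd iterations
            lab = ((pts[:, None] - c[None]) ** 2).sum(-1).argmin(1)
            for j in range(r_per_class):
                if np.any(lab == j):
                    c[j] = pts[lab == j].mean(0)
        centers.append(c)
    return np.concatenate(centers, axis=0).T   # C: (d, r) attribute matrix

def mid_level(x, C, sigma):
    """Phi_C(x): Gaussian similarity of a sample x to every attribute."""
    d2 = ((C - x[:, None]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

A sample close to a class's attribute then receives a high response in the corresponding coordinate of its mid-level vector, which is what makes the representation discriminative.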
To recognize all the different action classes, we learn a classifier for each class, denoted as

f_l(x) = w_l^T Φ_C(x),

where w_l ∈ R^{r×1} is the parameter vector, w_l^T denotes the transpose of w_l, and f_l is a linear model. Given the training set D, the maximum likelihood estimate of the weight vector ŵ_l is the one that minimizes the sum of squared errors:

ŵ_l = argmin_{w_l} Σ_{i=1}^{N} (Y_{l,i} − w_l^T Φ_C(x_i))².

Let W = [w_1, w_2, …, w_m] denote the matrix of parameter vectors. To find the best parameter matrix W, we optimize all loss functions jointly and regularize them with the l_{2,1}-norm:

min_W Σ_{l=1}^{m} ‖Y_l − Φ_C(X)^T w_l‖² + λ‖W‖_{2,1},     (4)

where λ is the regularization parameter and m is the total number of action classes. If Z ∈ R^{D×T}, then ‖Z‖_{2,1} is the l_{2,1}-norm of the matrix Z, that is,

‖Z‖_{2,1} = Σ_{d=1}^{D} √(Σ_{t=1}^{T} Z_{dt}²).

This norm first computes the l_2 norm of each row of the matrix and then the l_1 norm of the results, which is well known to encourage sparse solutions. Thus, the l_{2,1}-norm can provide a score for the mid-level representation and help find the best parameter matrix W. However, there is still a latent defect in this formulation: the K-means clustering technique is non-convex, so the process that generates the action attributes may result in suboptimal performance. To address this problem, the formulation is refined in the following subsection.
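The l_{2,1}-norm and the regularized objective can be checked numerically. The shapes follow the definitions above; `lam` is an assumed name for the regularization weight, which the extraction dropped:

```python
import numpy as np

def norm_21(Z):
    """l2,1-norm: the l2 norm of each row of Z, summed (an l1 norm
    across rows), which drives whole rows of Z toward zero."""
    return np.sqrt((Z ** 2).sum(axis=1)).sum()

def objective(W, Phi, Y, lam):
    """Sum of squared errors of all class-wise linear models f_l = w_l^T Phi,
    plus the l2,1 penalty.  W: (r, m), Phi: (r, N), Y: (N, m) in {+1, -1}."""
    return ((Phi.T @ W - Y) ** 2).sum() + lam * norm_21(W)
```

Because the penalty sums row-wise l_2 norms, shrinking any single entry of a row only helps once the entire row is small, so whole attributes get switched off together across all classes.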

Optimization method
So far, we have assumed that the attributes C are available. However, leaving the attribute generation out of the learning model may lead to suboptimal performance. An alternative is to learn W and C simultaneously, so our model is extended as follows:

min_{W,C} Σ_{l=1}^{m} ‖Y_l − Φ_C(X)^T w_l‖² + λ‖W‖_{2,1} + γ Θ({X}_1^m, C),     (5)

where γ is a parameter that measures the influence of the clustering process and Θ({X}_1^m, C) is the optimization function for the sub-problem of clustering. The rationale behind this model is that, by minimizing the integrated objective function, we can find a good set of attributes C and the classification model W simultaneously. The model W can predict the data correctly with a large margin while minimizing the loss of mutual information caused by feature merging. In this paper, we exploit the positive instances to update the attributes C as

Θ({X}_1^m, C) = Σ_{l=1}^{m} dist({X}_l^p, c_l),     (6)

where {X}_l^p represents the positive instances of the l-th action class and dist(·, ·) is the square of the average Euclidean distance.

ALGORITHM 1 Attribute-based mid-level representation method
Input: Φ_C(X), Y, λ, γ
Output: W, C
1: Perform clustering on the positive features of each action class label to initialize the attributes C.
2: While not converged, do
3:   While not converged, do
4:     Fix C, update W ← Equation (4)
5:   End while
6:   Fix W, update C ← Equation (6)
7: End while
Although the objective function of Equation (5) is convex with respect to W or C individually, it is not jointly convex in both W and C. Therefore, it is difficult to obtain a global optimal solution directly. In this case, it is necessary to solve the objective function of Equation (5) by iterating over one variable while fixing the other. The global solution is not guaranteed, but the local optimal solution is satisfactory in practice. To this end, we adopt an efficient block coordinate descent algorithm. Specifically, we first optimize the objective function with respect to W while the attributes C are fixed, and then optimize it with respect to C while W is fixed. These two procedures are repeated until convergence. Algorithm 1 summarizes the pseudo-code of the block coordinate descent algorithm.
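The alternating scheme of Algorithm 1 can be sketched as below. The ridge-style W update and the averaging C update are illustrative surrogates for the Equation (4) and Equation (6) solvers, and for brevity the sketch assumes one attribute per class:

```python
import numpy as np

def block_coordinate_descent(phi_fn, X, Y, C0, lam, n_outer=5):
    """Skeleton of Algorithm 1: fix C and solve for W, then fix W and
    update C, repeating until convergence (here a fixed iteration count).
    phi_fn(X, C) must return the (r, N) mid-level representation.
    X: (d, N) features, Y: (N, m) labels in {+1, -1}, C0: (d, r) with r = m."""
    C = C0.copy()
    for _ in range(n_outer):
        Phi = phi_fn(X, C)                       # (r, N)
        # inner step: fix C, update W (ridge surrogate for the Eq. (4) step)
        W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(len(Phi)), Phi @ Y)
        # outer step: fix W, nudge each attribute toward the mean of its
        # class's positive instances (a simple stand-in for the Eq. (6) step)
        for l in range(Y.shape[1]):
            pos = X[:, Y[:, l] > 0]
            if pos.size:
                C[:, l] = 0.5 * C[:, l] + 0.5 * pos.mean(axis=1)
    return W, C
```

The structure (outer loop over C, inner solve for W) is the part that carries over to the paper's method; the two update rules themselves would be replaced by the l_{2,1} solver and the subgradient step of the next paragraphs.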
When W is fixed, we need to solve an optimization problem that is non-convex and non-differentiable. Let g(C) denote the objective function of Equation (5) and C^(0) denote the current solution. Here, we employ the subgradient method to find a refined solution C̃ such that g(C̃) ≤ g(C^(0)). Let ∇C denote the subgradient with respect to C; we update C^(u) as

C^(u+1) = C^(u) − α_u ∇C,     (7)

where α_u is the step size and u is the iteration index. Because the subgradient method is not a descent method, it is common to keep track of the iterates and retain the best candidate, that is, the one with the smallest objective value:

C̃ = argmin_u g(C^(u)).     (8)

FIGURE 3
Diagram of the one-versus-rest vote evaluation method. Each action's mid-level representation is fed to all action class classifiers; the classifier outputs are then passed to the response evaluation function, and the action label of the classifier with the maximum output is assigned to the representation.

FIGURE 4 Illustration of the 'Precision/Recall' evaluation method
It is evident that g(C̃) ≤ g(C^(0)). When g(C̃) = g(C^(0)), the algorithm stops and outputs C̃ as the final result.
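The best-candidate tracking described above can be sketched generically. Here `g`, `grad`, the step-size schedule and the test objective are illustrative stand-ins, not the paper's actual g(C):

```python
import numpy as np

def subgradient_min(g, grad, C0, steps=200, lr=0.5):
    """Subgradient method with best-candidate tracking: since the method
    is not a descent method, keep the iterate with the smallest objective
    value seen so far.  `g` is the objective, `grad` a subgradient oracle."""
    C, best_C, best_val = C0.copy(), C0.copy(), g(C0)
    for u in range(steps):
        C = C - lr / (1 + u) * grad(C)        # diminishing step size
        val = g(C)
        if val < best_val:                    # track the best candidate
            best_val, best_C = val, C.copy()
    return best_C, best_val
```

Because individual steps may increase the objective, returning the best iterate rather than the last one is what guarantees g(C̃) ≤ g(C^(0)).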

EXPERIMENTS
Before learning, an important step is feature extraction. In this paper, as mentioned above, we use the action bank features as the low-level features. After extracting the low-level features, the block coordinate descent algorithm is employed to train the model. To verify the effectiveness and robustness of the proposed method, we conduct experiments on three representative datasets, that is, KTH, UCF50 and HMDB51, using different experimental settings. For the KTH, UCF50 and HMDB51 datasets, we employ two methods to predict the labels of the test samples in order to better evaluate the algorithm. The first is the 'one-versus-rest vote' method. As can be seen from Figure 3, the mid-level representation Φ_C(x) is fed to all classifiers, and the action label of the representation is chosen as the one whose classifier holds the maximum output, selected by the 'response evaluation function.' The second evaluation method is the 'Precision/Recall' method, illustrated in Figure 4. In this method, for the response of every action classifier, we calculate the recall (R) and precision (P), and the average precision (AP) is then computed over the precision-recall curve. For both evaluation methods, the average classification accuracy serves as the final measurement of classification performance in this paper. Let y_i denote the true label of the unlabelled feature x_i, ŷ_i its predicted label, and n the number of testing samples of the action class. The average classification accuracy of an action class is defined as

Accuracy = (1/n) Σ_{i=1}^{n} 1(ŷ_i = y_i),

where 1(·) is the indicator function. The larger the accuracy, the better the classification performance.
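Both evaluation readouts reduce to an argmax over classifier responses and a label-match average; a minimal sketch:

```python
import numpy as np

def one_vs_rest_vote(scores):
    """'One-versus-rest vote': each row holds the responses of all class
    classifiers for one sample; the predicted label is the argmax."""
    return np.argmax(scores, axis=1)

def average_accuracy(y_true, y_pred):
    """Average classification accuracy: the fraction of test samples
    whose predicted label matches the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())
```

Usage: stack the m classifier responses per test sample into an (n, m) score matrix, vote, then compare against the ground-truth labels.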

Performance and discussion

1) Performance on different datasets: In this subsection, experimental results on the KTH, UCF50 and HMDB51 datasets with different training sizes and numbers of action classes are presented. All evaluations were done with 3-fold cross-validation. To verify the performance of the proposed framework, we compare our method with other human action recognition methods. For the KTH dataset, Figure 5 shows the confusion matrix for the six actions when using the 'one-versus-rest vote' evaluation method. Each column of the matrix represents the instances to be classified, while each row represents the corresponding classification results. In the confusion matrix, some 'hand-waving' actions are misclassified as 'clapping', since both classes share an 'alternate hands motion' and the 'hand-waving' representation does not carry strong enough information to differentiate itself from 'clapping.' The leg-related actions, however, are perfectly classified even though they have similar appearances (jogging and running). Table 1 compares the average accuracies of our method with other strong methods. We can see from Table 1 that our method achieves the highest recognition accuracy of 96.1%, over 9% better than the SA baseline, which convincingly proves the effectiveness of our method. When using the 'Precision/Recall' evaluation method, the final experimental results are shown in Table 2. To further verify the performance of the proposed method on the UCF50 dataset, the one-versus-rest vote evaluation method is exploited in the experiments. As can be seen from Figure 6, the darker the colour of a block, the higher the accuracy of the corresponding action. The errors in this confusion matrix are randomly distributed across action labels with no apparent trends, and the average accuracy on the UCF50 dataset is 68.9%.
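A row-normalized confusion matrix like the ones read off in Figures 5-7 can be computed as follows (here rows index the true class and columns the predicted class; the figures described above use the transposed convention):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: cell (t, p) is the fraction of
    class-t samples predicted as class p; each row of a perfect
    classifier has a single 1.0 on the diagonal."""
    M = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1.0
    row_sums = M.sum(axis=1, keepdims=True)
    # avoid division by zero for classes absent from the test set
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```

Off-diagonal mass then directly exposes confusions such as 'hand-waving' being absorbed into 'clapping'.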
Since this database contains considerable variations in action performance, camera movement, human appearance, viewpoint, illumination and background, the classification result is acceptable. Table 3 shows the comparison of our method with other methods (Sadanand et al. [45]: 57.9%, Gist3D [54]: 65.3%, our method: 68.9%). To further demonstrate the effectiveness of the method on this database, we also use the 'Precision/Recall' evaluation method and list the average accuracy of each action class in Table 4. The majority of action classes retain high classification accuracy, while actions like 'PizzaTossing,' 'SoccerJuggling,' 'WalkingWithDog,' 'TennisSwing' and 'Basketball' yield poor results below 50%. We also conduct experiments on the HMDB51 dataset. Figure 7 shows the confusion matrix of our proposed method when using the 'one-versus-rest vote' evaluation method; the average accuracy over our action classes is 45.1%. As mentioned before, HMDB51 is a large and highly challenging video database for human motion recognition, capturing a substantial degree of the richness and full complexity of video clips commonly found in movies and online videos. Table 5 compares our method with other methods; it can be seen that the performance of our method is far better than the baseline method, whose classification accuracy is 23.2%. Furthermore, when using the 'Precision/Recall' evaluation method, the experimental results are shown in Table 6. Only actions like 'golf,' 'pullup,' 'pushup' and 'situp' achieve high classification accuracy.
2) Parameter selection: In this part, we conduct a series of experiments on two datasets, that is, the UCF50 and HMDB51 action datasets, in order to further verify the effect of the mid-level representations and to evaluate the impact of the number of attributes, training sizes and numbers of action classes.
Moreover, we also conduct experiments to study the influence of different values of the regularization parameter. To simplify the exposition, the training size p is defined as the number of training samples in each action class, and the cluster scale r is expressed as the percentage of the training set used for initializing the attributes. We first study the impact of the training sizes and the number of attributes. When the number of training samples varies from 20 to 50 with fixed cluster sizes, the experimental results on the UCF50 and HMDB51 datasets can be seen in Figures 8 and 9. On both datasets, the number of testing samples is fixed to 20 and the number of action classes is also 20. Figure 8 shows the average recognition accuracy for different values of the cluster scale r when the training size is equal to 20, 30, 40 and 50 on 20 action classes selected randomly from the 50 actions in the UCF50 dataset. It can be observed from Figure 8 that the experimental results improve as the value of r increases. It can be inferred that the proposed method is somewhat sensitive to the parameter r within a reasonable range on the UCF50 dataset.
As for the HMDB51 dataset, the performance of the experiment is significantly affected by a range of factors such as camera position and motion as well as occlusions. Figure 9 shows that as the cluster scale r increases, the experimental performance also improves for all training sizes.
We then test the performance under different values of the regularization parameter and different numbers of action classes. Experimental results are shown in Figures 10 and 11. We employ 3-fold cross-validation in these experiments; the training size is 60 in each fold, while the number of testing samples is fixed to 20. Figure 10 shows that when the number of action classes increases, the average classification accuracy decreases significantly, which means our method is sensitive to the number of action classes. As the number of action classes increases, the margins between different classes become blurred, which greatly affects the accuracy of the classification results. We can also see from Figure 10 that the performance of the proposed method remains stable when the parameter is smaller than 10, regardless of the number of action classes, which shows the robustness of our action recognition framework. As shown in Figure 10, a larger parameter value causes a lower overall matching rate, and the proposed method appears insensitive to small parameter values. In Figure 11, the experimental results indicate that the result is best for all numbers of action classes when the parameter is set to 0.001, and the performance becomes unstable when the value is larger than 10.

CONCLUSION
In this paper, to support the application of crowd understanding and analysis, we explore the idea of using mid-level semantic concepts to represent human behaviours from videos and argue that these semantic attributes enable the construction of more descriptive methods for human action recognition. Compared with previous works in the action recognition domain, the proposed mid-level representation fully utilizes the discriminative information among diverse action features. Each mid-level attribute is built from the low-level features and focuses on the dissimilarity between its own action and other actions. This makes the representation more discriminative by driving the same action's features closer together while pushing different actions' features further apart. As reported in Section IV, the experimental results on the KTH dataset demonstrate that the proposed method obtains better action recognition results than the state-of-the-art. We also demonstrate that our method is comparable with other traditional methods on the challenging UCF50 and HMDB51 datasets, despite their limited number of training examples and the diversity of the actions. In particular, our method can reduce the negative effect of similar actions even in difficult cases, and it overcomes illumination and view angle changes to a large degree, which verifies the effectiveness and robustness of the proposed method. In future work, we will investigate deep learning based methods that integrate mid-level features for better performance.