A specific perspective: Subway driver behaviour recognition using CNN and time-series diagram

Urban rail transit, especially the subway, has been booming in China for a decade, imposing safety challenges on all related parties. Drivers' behaviours are particularly crucial. Typically, drivers' actions are recorded by cameras, and the surveillance videos are evaluated manually. Current driver behaviour recognition methods mostly target bus or car drivers and can hardly be implemented for subways, because subway drivers follow a rigid working code that requires a time sequence of movements to describe. In this study, we first propose a recognition model to automatically recognise behaviours from single-frame images extracted from surveillance videos; second, we convert the recognition results into time-series diagrams, so the recognised behaviours can be interpreted and analysed statistically and effectively. The validation experiments demonstrate that the convolutional neural network model can recognise 96.20% of driver behaviours, and the time-series diagrams add time information to the behaviours, providing a convincing reference for subway driver evaluation.


INTRODUCTION
Rapid urbanisation in China has caused a huge surge in rail transit [1], which in turn has raised concerns about safe operation. Among the influential factors, drivers' behaviours are crucial and can directly impact the safety of passengers [2]. Typically, a surveillance camera is deployed in the cab to record the driver's activities, and drivers' behaviours are evaluated by analysing those videos. This process, however, has to be completed manually, which is laborious and ineffective, and the job has become particularly burdensome in China given the increasing number of trains and drivers in recent years. Therefore, a system specifically for subway trains that can intelligently identify and evaluate drivers' behaviours is highly desirable, and algorithms for human behaviour recognition exhibit the potential to fulfil this task.
Human behaviour recognition has received extensive attention in transportation systems. For instance, Wang et al. proposed sampling motion trajectories [3] to identify human behaviours and then proposed an improved version of the dense trajectory extraction method [4], which once occupied a leading position in the field of behaviour recognition. Ohn-Bar et al. [5] proposed a framework leveraging three cues obtained from videos capturing car drivers' hands and heads in order to provide activity recognition and prediction. Yang et al. [6] collected images recording the movements of car drivers' upper-body joints from surveillance videos and identified their distractions and dangerous behaviours. A similar method was proposed by Chao Yan et al. [7], which monitors drivers' hand movements and considers them a predictor of safe driving. These methods can accurately identify human behaviours and extract movement trajectories, but they rely on cameras with a broad, clear view that can capture the drivers' detailed movements. Subway train cabs, however, are more cramped and the views of their surveillance cameras more limited, so current trajectory-based methods can barely function there.
Some researchers have found another way that does not require such a clear view [8]: they classify images before sending them to behaviour recognition. For example, Poppe et al. transformed the problem of behaviour recognition into a classification of image sequences and discussed various representation and classification methods for images [9]. Fortunately, the convolutional neural network (CNN) model can take this approach further, because it can find the most effective features directly from the input data [10]. CNNs can classify and detect two-dimensional images through convolution and pooling operations [11,12].

FIGURE 1 Technical route for subway driver behaviour recognition
Many scholars have noticed CNNs' strength and applied it to driver behaviour recognition. Behera et al. integrated body posture into existing deep networks and significantly improved the performance of driver activity recognition [13]. Yan et al. [14] and Xing et al. [15] used the Gaussian mixture model (GMM) to extract images of car drivers' skin areas, then adopted deep learning methods, either R*CNN or Alexnet, to classify behaviours. Pang et al. [16] directly classified car drivers' behaviours into eight categories and constructed a deep CNN framework to identify and classify them. Chen et al. [17] processed raw RGB images with a GMM-based segmentation algorithm and proposed a driving-related activity recognition system based on a deep CNN model. Although these deep learning-based methods are effective, they all operate on separate frames without time-series information, and their targets, that is, car drivers, follow more flexible operation codes. Subway drivers' working specification, however, is more rigid, and safety requirements can only be met when sequence information is incorporated.
Therefore, we utilised the CNN model's recognition proficiency on single-frame images, specifically targeting videos taken from the fixed camera angles in subway train cabs. We then adopted time-series diagrams to further investigate the recognition results and provide a direct, statistical reference for evaluating subway drivers' behaviour.

Technical route
The proposed approach works in three steps, as shown in Figure 1. First, a surveillance video is decomposed and a sequence of frames is extracted. Second, these frames are input into a CNN model, which recognises the behaviour in each frame. Third, the recognition results are converted into a time-series diagram, from which the drivers' behaviours can be further calculated and analysed and a judgement of whether the driver follows the rules can be made.

2.2.1 Preprocessing of input data

A video recording a driver driving a subway train is obtained as input data. First, the video is converted into a sequence of consecutive frames, which are used to construct the dataset of subway driver behaviours. Second, some sequential frames are selected from that dataset as test samples, and the rest are taken as training samples. Training samples need not be sorted in sequence, but they must be classified and labelled according to the manual of 'Daily Operation for Drivers'. The manual defines four types of driver job: mainline driving, inbound operation, platform operation and outbound operation. Their detailed descriptions are shown in Figure 2.
Thus, driver behaviours can be classified into six categories accordingly: standing on the platform, pointing call, hanging gear, pressing the start button, driving between two stations and leaving the seat. Images in the training samples can then be labelled and classified into these six categories. We assign each category a number as its label, as illustrated in Table 1.
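For concreteness, the category-to-number assignment can be held in a small mapping. Table 1 itself is not reproduced in this text, so the numeric codes below are inferred from the behaviour order 4→1→2→5→0→5→1→2→3→4 given in the analysis section; treat them as that inferred assignment rather than a verbatim copy of the table.

```python
# Behaviour categories and their numeric labels (inferred from the
# behaviour order given in the analysis section, since Table 1 itself
# is not reproduced here).
LABELS = {
    0: "standing on the platform",
    1: "pointing call",
    2: "hanging gear",
    3: "pressing the start button",
    4: "driving between two stations",
    5: "leaving the seat",
}
```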

CNN architecture
CNNs are a class of feedforward neural networks with convolutional computation and a deep structure. A typical CNN contains three layers, as illustrated in Figure 3 [18]: the input layer; the combination of the convolutional layer and the pooling layer; and the fully connected output layer. First, the input layer accepts data such as raw audio, RGB images and so forth. Then the convolution layer uses a convolution kernel to convolve a local image area and obtain local information of the image, that is, to extract features. The calculation for each convolution operation is as follows:

y_{m,n} = Σ_{j=1}^{J} Σ_{i=1}^{I} w_{j,i} x_{m+j−1, n+i−1} + b

where J, I are the width and the height of the convolution kernel, respectively; M, N are the width and the height of the input image, respectively; w is the weight; b is the bias; x_{m,n} is the pixel value at position (m, n) in the input image; and y_{m,n} is the result of the convolution operation.
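The per-position convolution sum described above can be sketched as a naive NumPy routine; this is a minimal illustration assuming 'valid' (no-padding) sliding and the cross-correlation form commonly implemented in CNN libraries, not the paper's actual code.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """Naive 'valid' 2-D convolution: at each position, multiply the
    kernel w elementwise with the underlying patch of image x, sum the
    products and add the bias b."""
    M, N = x.shape                      # input height, width
    J, I = w.shape                      # kernel height, width
    out = np.empty((M - J + 1, N - I + 1))
    for m in range(M - J + 1):
        for n in range(N - I + 1):
            out[m, n] = np.sum(w * x[m:m + J, n:n + I]) + b
    return out
```

Each output pixel is one evaluation of the double sum, so the output shrinks by the kernel size minus one in each dimension.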
Second, the pooling layer subsamples the feature maps obtained from the convolution layer, reducing both the data size and the amount of computation. Third, the final fully connected layer acts as a classifier that maps the extracted features to the label space. Among these layers, the convolutional and fully connected layers contain neurons whose weights are learned and adjusted, so the input data can be better represented after training.

CNN model establishment
Considering the excellent performance of CNNs in image recognition, this study takes the behaviour data of subway drivers as the input of a CNN model and the behaviour category as the output, thus obtaining the mapping relationship between them as shown in Figure 4. We choose the Alexnet model to identify the behaviours of subway drivers; its network structure is shown in Figure 5. First, the first convolutional layer filters the 150 × 150 × 3 input image with 96 kernels of size 11 × 11 and a stride of 4, resulting in 96 feature maps of size 148 × 148. Second, max pooling with a kernel size of 3 × 3 and a stride of 2 filters the maximum activations from these 96 feature maps. Third, another convolution layer with a filter size of 3 × 3 and a stride of 1 is applied, followed by another max pooling layer with a kernel size of 3 × 3 and a stride of 2. Continuing in this way, the Alexnet network used in this study is established as explained in Table 2, where 'f' is the size of a convolution kernel or a pooling kernel, 's' is the stride and 'd' is the number of convolution kernels in a layer.
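The layer-by-layer spatial sizes in Table 2 follow from the standard no-padding size formula; a minimal sketch (the particular f and s values below are examples, chosen because they reproduce the sizes quoted above under this formula):

```python
def out_size(n, f, s):
    """Spatial size after a no-padding convolution or pooling layer
    with kernel size f and stride s: (n - f) // s + 1."""
    return (n - f) // s + 1

# A 3 x 3 kernel with stride 1 maps a 150-pixel input to 148 pixels;
# a 3 x 3 max pooling with stride 2 then maps 148 pixels to 73.
```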
Thus, the Alexnet network is ready for training. We have to ensure that the mapping relationship is stable, that is, that each input corresponds to an output closely approaching the expected value. During training, one can adjust the parameters to improve accuracy and to reduce the loss function value until it gradually converges.

Time-series diagram of subway driver behaviours
The CNN model can identify drivers' actions from single-frame images, but this alone is not enough: individual actions cannot reveal actual behaviours unless they are connected in time. Therefore, we convert the recognition results output by the CNN model into time-series diagrams and recommend decisions based on them. First, the recognition results of the CNN model are converted into a time-series diagram. Second, by analysing the diagram, behaviour information and parameters of the running train are obtained, including the number of behaviours of each category, the duration of behaviours of each category and the status of the train. Third, this information, combined with the standard operation manual, is evaluated, and decisions on whether the subway driver's behaviour is qualified are recommended.
y_i = argmax(R_i)

where i is the frame index and R_i is the array of class probabilities for image i. The argmax function finds the label corresponding to the maximum probability in the array, and this label is taken as the classification result. The number of frames of each type can then be calculated as follows:

N_a = Σ_i 1{y_i = a},  a = 0, 1, 2, 3, 4, 5

where i ranges over the frames and y_i is the classification result.
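The argmax conversion and the per-class frame counts can be sketched as follows, assuming the CNN emits one six-element probability array per frame:

```python
import numpy as np

def classify_frames(probs):
    """y_i = argmax(R_i): pick the most probable class for each frame."""
    return [int(np.argmax(r)) for r in probs]

def frames_per_class(labels, n_classes=6):
    """Count how many frames were assigned to each class a = 0..5."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    return counts
```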

FIGURE 6 Time-series diagram analysis
We take the frame number i as the abscissa and the classification result y as the ordinate, thus constructing a time-series diagram. Figure 9 exemplifies the time series of a certain sequence of frames.

Time-series graph analysis
The behaviour of a subway driver is a continuous process, and the number of frames of each type of behaviour alone can hardly describe the driver's actions. Therefore, we adopt the number of occurrences of each type of behaviour and the duration of each occurrence to further analyse drivers. For example, Figure 6 illustrates how to calculate the number and duration of occurrences of 'standing on the platform', labelled as class 0. There are two occurrences, with starting points f1, f3 and endpoints f2, f4.
The number of occurrences of a behaviour is defined as the number of maximal consecutive runs of frames carrying its label, and the duration of each occurrence can be calculated as

T = (f_b − f_a) / FPS

where f_b and f_a are the endpoint and the starting point of an occurrence, respectively, and FPS is the frame rate in frames per second.
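The occurrence count and per-occurrence durations can be computed by run-length scanning the label sequence; a minimal sketch (the 25 fps default is a placeholder, not the actual camera rate):

```python
def occurrences(labels, cls, fps=25.0):
    """Return (f_a, f_b, duration) for every maximal run of frames
    labelled `cls`, with duration = (f_b - f_a) / FPS seconds."""
    runs, start = [], None
    for i, y in enumerate(labels):
        if y == cls and start is None:
            start = i                                        # run begins: f_a
        elif y != cls and start is not None:
            runs.append((start, i - 1, (i - 1 - start) / fps))
            start = None                                     # run ended: f_b
    if start is not None:                                    # run reaches the end
        runs.append((start, len(labels) - 1, (len(labels) - 1 - start) / fps))
    return runs
```

The number of occurrences of a class is then simply the length of the returned list.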

Dataset of subway driver behaviours
We took a video clip from a cab camera installed in a train on Beijing metro line 9. The clip was shot on 7 February 2018, from 8:25:26 to 9:00 a.m., lasting 35 minutes with a total of 64,651 frames. During this period, the train passed 13 stations, and each run between two stops lasted 1 to 2 min. As shown in Figure 7, the camera was installed above the driver's head; thus, the driver's face was partly out of the frame, while his hand movements and other behaviours could be clearly recorded. The monitoring screen also displayed the time and the train number. Given that all subways in China follow the same operation standards, all drivers wear the same uniform during work, and the surveillance cameras in the cabs capture the same view, our model could easily be applied to any other subway system in China. Therefore, a video containing all the behaviours of one driver is a sufficient training dataset.
Given the small difference between two adjacent frames, we extracted every five consecutive frames at regular intervals and included these frames in a dataset. In order to balance the sample size of each type of behaviour, we took 1830 frames as CNN training samples and 1801 frames as CNN test samples. We classified and labelled the images for training by using the method described in Section 2.2.1 and placed 1290 of them into a training set and the remaining 540 into a validation set. Table 3 shows the grouping of network training samples and test samples.

TABLE 3 Grouping of training and test samples by behaviour category
Training set: 215, 215, 216, 217, 219, 214
Validation set: 92, 89, 91, 84, 78, 93
Training samples: 307, 304, 307, 301, 297, 307
Test samples: 1801

FIGURE 8 Accuracy of the training process

CNN model accuracy
We established a CNN model on the keras 2.1.5 platform following the rules described in Section 2.2 and trained it using the classified and labelled images in the training set. By adjusting parameters across multiple training runs, we set the batch size to 16 and the number of epochs to 10, which improved the training accuracy. Figure 8 exhibits the accuracy of the training process: the validation accuracy is 0.9620 and the training accuracy is 0.9790. Therefore, the final model is satisfactory.

Time-series diagram analysis result
We selected a sequence of frames that recorded the train passing through four stations from the test dataset and input them into the trained CNN model. The recognition output was then converted into a time-series diagram as shown in Figure 9.
Based on the diagram, we calculated the number of occurrences of each behaviour and the duration of each occurrence; the results are listed in Table 4. Table 4 shows that 'pointing call' and 'hanging gear' last only a short time, 'pressing the start button' lasts 3 s and the driving time is the longest. When the driver is 'standing on the platform', indicated by category '0', the train is stopped at the platform; when the driver is 'driving between two stations', indicated by category '4', the train is running. Therefore, the diagram also tells the durations for which the train stops and runs, as listed in Table 5.

Validation of subway driver behaviour recognition model
The test samples, a total of 1801 images, were input to the subway driver behaviour recognition model. The result for each frame was an array of probabilities [a, b, c, d, e, f] over the six kinds of behaviours the driver might perform. Table 6 exemplifies part of the results.
The numbers in red in Table 6 mark the most probable category for each frame.
We manually checked each image to validate the behaviour recognition. Among the 1801 frames, only 77 were misclassified, so the accuracy of the proposed network reached 95.7%, which is consistent with its training results and reflects its strong performance.
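The quoted 95.7% follows directly from the misclassification count:

```python
def accuracy(n_total, n_wrong):
    """Fraction of correctly classified frames."""
    return (n_total - n_wrong) / n_total

# 77 of the 1801 test frames were misclassified,
# giving (1801 - 77) / 1801, i.e. about 95.7%.
```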

Analysis of the behaviour of subway drivers
The manual of 'Daily Operation for Drivers' defines the sequence of actions a subway driver should take: mainline driving, inbound operation, platform operation and outbound operation. We interpreted this sequence as the behaviour order 4→1→2→5→0→5→1→2→3→4, that is, driving between two stations → pointing call → hanging gear → leaving the seat → standing on the platform → leaving the seat → pointing call → hanging gear → pressing the start button → driving between two stations. This sequence of behaviours should repeat throughout the driver's working time. Therefore, we used it as the standard for analysing drivers' behaviours; the results for the test samples are shown in Table 7. When all the driver's actions comply with this sequence, we can determine that the driver follows the rules; otherwise, the relevant personnel are informed.
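A compliance check against the standard order can be sketched as follows. Two assumptions are made here that the paper does not state explicitly: the recording starts in the 'driving between two stations' state (class 4), and, since the last state of one cycle is the first of the next, the repeating period is nine behaviours.

```python
# Standard order from the manual: 4→1→2→5→0→5→1→2→3→4; the final 4
# doubles as the start of the next cycle, so the period is nine states.
STANDARD_PERIOD = [4, 1, 2, 5, 0, 5, 1, 2, 3]

def behaviour_sequence(labels):
    """Collapse per-frame labels into the sequence of distinct
    consecutive behaviours (run-length encoding without counts)."""
    seq = []
    for y in labels:
        if not seq or seq[-1] != y:
            seq.append(y)
    return seq

def follows_manual(labels):
    """True if every observed behaviour appears in the position the
    standard cycle prescribes (recording assumed to start at class 4)."""
    seq = behaviour_sequence(labels)
    return all(seq[i] == STANDARD_PERIOD[i % 9] for i in range(len(seq)))
```

When `follows_manual` returns False, the frame range of the first mismatch could be reported to the relevant personnel.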

Accuracy comparison
To further validate our CNN model, we compared it with other deep learning methods proposed in [3,4,14,16,17]; the results are shown in Figure 10. The methods in [3,4], which use the positions of hands or joints, achieved accuracies of 82.4% and 89.62%, respectively, which are reasonable but not very high. The methods in [14,16,17] used deep learning for behaviour classification and achieved accuracies of 97.76%, 94% and 98.89%, respectively, a large leap from [3,4]. Our approach, the proposed Alexnet network, achieves an accuracy of 96.20%, not the highest, but it goes further: it converts the recognition results into a time-series diagram for later analysis and decision making.

CONCLUSION
In this study, we proposed a behaviour recognition model to intelligently identify the behaviours of subway drivers. Our approach combines a CNN model and time-series diagrams, targeting the restricted angles of surveillance videos and the strict operation codes of subway drivers. Through the validation experiments, the following conclusions can be drawn:
1. The CNN model we construct achieves a high accuracy of 96.20%, which is impressive, especially since the drivers' joint and hand movements are not clearly visible.
2. The time-series diagrams we adopt for interpreting the behaviours recognised by the CNN model provide statistical, time-related information about drivers' actions, a solid reference for driver evaluation.
3. To our knowledge, our approach is the first to add time information to subway driver behaviour recognition, providing not only a widely required automatic tool but also an effective reference for safe-driving evaluation.
We intend to further analyse drivers' behaviours and divide them into more categories so that the behavioural evaluation model becomes more convincing. We are also interested in improving the practicality of our approach in other settings, for example in cities and on highways, where drivers' movements are more flexible.