Driver activity recognition using spatial-temporal graph convolutional LSTM networks with attention mechanism

Driver engagement in secondary activities while driving plays a vital role in driving safety and can lead to negative outcomes. To reduce traffic accidents and ensure driving safety, a real-time driver activity recognition architecture is proposed in this study. Specifically, a total of eight kinds of common driving-related activities are identified: normal driving, left or right checking, texting, answering the phone, using media, drinking, and picking up objects. Raw experiment videos are collected via onboard monocular cameras and used to extract the driver's upper-body skeleton information. Then, graph convolutional networks (GCN) are constructed for spatial structure feature reasoning in a single frame, consecutively followed by long short-term memory (LSTM) networks for temporal motion feature learning within the sequence. Moreover, an attention mechanism is further utilised to emphasise keyframes and select discriminative sequential information. Finally, a large-scale driver activity dataset, consisting of both naturalistic driving data and simulative driving data, is collected for model training and evaluation. Experimental results show that the general recall ratio of the eight driving-related activities reaches up to 88.8% and the real-time recognition efficiency reaches up to 24 fps.


INTRODUCTION
According to the statistical data of the World Health Organization, over 90% of traffic casualties were directly or indirectly caused by human factors such as drunk driving, fatigue, distraction, and mal-operation [1,2]. However, traffic accidents can be expected to be reduced via effective driver activity recognition and giving hints or corrections accordingly while driving [3][4][5], even in the era of automated driving. According to the definition of driving automation from the Society of Automotive Engineers International (SAE International), human drivers in level 2 or 3 automated-driving vehicles are allowed to participate in secondary tasks (e.g. texting, phone talking) while driving [6], but the driver still has the responsibility to take over the vehicle in emergencies. Thus, it is necessary to monitor driver activities and evaluate the driver's take-over ability to assist automated driving systems [7,8].
Driver activities are usually characterised by subtle spatial movements, as well as narrow durations in the temporal dimension. Various activity analyses and feature extraction methods have been presented in previous research using traditional machine learning approaches [9][10][11]. For instance, Braunagel et al. proposed five contextual features consisting of saccades, fixations, blinks, head position, and rotation to recognise driver activity using a Support Vector Machine (SVM) model [12]. Liang et al. utilised the driver's eye movement and vehicle dynamic data to recognise driver cognitive distraction in real time through an SVM model and a logistic regression model [13]. Moreover, deep learning methods have become extremely popular in recent years because of their powerful feature extraction ability. For example, Okon et al. implemented a pre-trained convolutional neural network (CNN) model with a novel triplet loss to improve the performance of driver activity classification [14]. Yan et al. captured pictures of the driver's skin region through a Gaussian mixture model (GMM) and transferred them into an R*CNN model to explore driver motion clues for driver activity recognition [15]. A sensor-rich architecture was proposed by Jain et al. for driver manoeuvre anticipation via LSTM networks [16]; to utilise the temporal characteristics of human body motion, recurrent neural networks (RNN) are adopted for driver activity recognition. Nevertheless, considering that human action is generally a continuous process in both the temporal and spatial domains, the aforementioned work on driver secondary-task recognition seldom takes both spatial structure representation and temporal motion features into consideration. Recently, the spatial-temporal feature of the human skeleton sequence has been proved to be vital for human action recognition. For example, Si et al.
proposed attention-enhanced convolutional LSTM networks to establish the co-occurrence between the spatial and temporal domains [17]. Rui et al. proposed the Bayesian graph convolution LSTM network, which not only constructs spatial-temporal dependency but also explores the variation among subjects [18]. These skeleton-based human activity recognition methods achieve promising performance by constructing a hybrid of GCN and LSTM to deeply extract the spatial-temporal feature of human motion. However, the results are usually obtained from open-source human skeleton datasets, whose skeleton data are generally of high quality, captured indoors by depth cameras, and the action recognition algorithms are implemented offline. Also, the numbers of samples in the different categories are usually well balanced, which is not the case in driving conditions, because normal driving generally accounts for most of the samples.
Therefore, we intend to present a driver activity recognition framework that can steadily extract the driver skeleton data from low-cost monocular cameras and, meanwhile, accurately recognise different driving-related tasks in real time. However, several difficulties must be overcome to achieve this goal. First, the video clips collected by the onboard monocular camera always contain considerable noise due to non-negligible environmental factors such as illumination and camera motion, which directly affects skeleton joint detection. Second, the driver activity recognition network should have a concise framework with accurate classification performance, as onboard computing resources are usually limited. Third, there usually exists a huge data imbalance between normal driving and other secondary driving tasks in real-time applications, which especially affects the recognition performance of small-scale categories.
To overcome these difficulties, we propose an end-to-end driver activity recognition system, namely, the spatial-temporal graph convolutional LSTM (ST-GCLSTM) networks. The whole architecture is depicted in Figure 1. The positions of the driver's upper-body joints are first captured in the original frames. To correct lost or misrecognised joints, a temporal exponential mean filter is utilised to smooth the skeleton data through a sliding window. To formulate the discriminative spatial-temporal feature, a concise five-layer GCN is proposed to build the driver skeleton graph, from which the spatial structure representation among joints is derived. Then, the spatial features captured by the GCN are transferred into a single-layer attention-enhanced LSTM network for temporal motion feature extraction among consecutive frames. To solve the data imbalance problem, the focal loss function is utilised to balance the loss value between normal driving and the other driving-related tasks; finally, a well-structured dataset is collected for model training and evaluation.
In summary, the main contributions of this study are listed as follows. (i) An end-to-end driver activity recognition system is introduced. With the combination of the GCN and the LSTM, typical spatial-temporal dynamics and dependencies of driver motions are effectively captured from the smoothed skeleton data. A total of eight common driving-related tasks, namely, normal driving, left or right checking, texting, answering the phone, using media, drinking, and picking up objects, are recognised in real time. (ii) A large-scale dataset consisting of naturalistic driving data and simulative driving data is collected. Essential data processing methods, including the temporal exponential mean filter, transfer learning, and the focal loss function, are implemented to adapt to practical applications while solving the data imbalance problem.
The remainder of this study is organised as follows. Related work is presented in Section 2. The detailed modelling of the ST-GCLSTM is interpreted in Section 3. Section 4 focuses on the experiment setup, model evaluations, and discussions of classification results. Finally, conclusions and future work are presented in Section 5.

RELATED WORK
Driver activity recognition has been widely studied in existing publications from different aspects. We focus the review on classification methods, and related research can generally be divided into two categories: single-frame-based methods and consecutive-sequence-based methods.
The single-frame-based method utilises one frame to predict driver activity, ignoring the influence of adjacent frames. For example, Vicente et al. used a 3D geometric reasoning and SVM approach to detect the driver's eyes-off-road activity [19]. Liang et al. utilised Bayesian networks for real-time detection of driver cognitive distraction using the driver's eye movement and driving performance [20]. With the development of large-scale datasets such as the Southeast University Distracted Driver Detection Datasets and the State Farm Distracted Driver Detection competition on Kaggle, much attention has been devoted to CNN-based models for recognising driving-related activities. Xing et al. utilised three fine-tuned CNNs to predict driver activity from the driver's body region extracted by a GMM [21]. Tran et al. developed a driver distraction detection system based on a driving simulator [22]; four CNN models, including VGG-16, AlexNet, GoogleNet, and ResNet, were implemented to detect driver distraction in real time. Eraqi et al. proposed a genetically weighted ensemble of CNNs: five parallel CNNs are built to operate on original and segmented images and are finally ensembled by a genetic algorithm [23].
Considering that driver body motion is a highly continuous process in the temporal domain, consecutive-sequence-based methods are widely adopted to extract high-level temporal features from sequential frames [24,25]. Craye et al. utilised the Hidden Markov Model (HMM) to recognise driver distraction from multiple sensor data such as audio, colour video, depth maps, heart rate, and steering wheel and pedal positions [26]. Martin et al. used the body joints and the distances between joints and the car interior to train an RNN for secondary-task recognition [27]. Olabiyi et al. proposed a driver activity prediction system based on deep bi-directional RNNs using camera data as well as vehicle dynamics [28]. Peng et al. proposed a driver manoeuvre detection (DMD) system: the vehicle trajectory and image features captured by VGG-19 were utilised as the input of LSTM networks to recognise driver manoeuvres on sequential data [29]. Weyers et al. utilised the 3D driver body key points and segmented images of the driver's hands as the input of RNN networks for driver activity recognition [30].
Recently, to better extract human motion features, much research has been devoted to fusing the spatial representation and the temporal dependency of human skeleton data. Liu et al. proposed a novel tree-structure traversal framework for the skeleton and a new gating mechanism for the LSTM unit to model dependencies in both the spatial and temporal domains [24]. Zhang et al. transferred skeleton data into hand-crafted geometric relational features and then utilised a three-layer LSTM to recognise human action from sequential data [31]. Since a simple chain or hand-crafted variation of body joints lacks the structural information of the human body (e.g. the connectivity of different joints), the GCN has been introduced to extract the spatial representation from raw human skeleton data. The GCN has attracted much attention in various domains including social networks, knowledge graphs, molecular analysis, etc. [32]. Yan et al. first utilised the GCN for human action recognition: spatial links, as well as temporal connectivity, are designed to formulate a spatial-temporal graph model [33]. Li et al. presented multi-scale graph convolutional filters to encode the temporal motion and the spatial structure [34]. Inspired by the powerful spatial feature extraction ability of the GCN, we fuse graph convolution with LSTM networks for skeleton-based driver activity recognition, which effectively models the dynamics and dependencies in both the spatial and temporal domains.

METHODOLOGY
In this section, the algorithm of the proposed ST-GCLSTM model is introduced in detail. More specifically, the process of graph construction, the implementation of graph convolution, and the construction of the attention-enhanced LSTM networks are presented.

Spatial graph convolution
In this work, let G_t = {N_t, L_t} denote the skeleton graph of a single frame at time t, where N_t = {n_i^t | i = 1, …, m} contains m joint nodes and L_t is the link set, in which (n_i^t, n_j^t) ∈ L_t denotes that node n_i^t and node n_j^t are connected. For matrix operations, L_t can also be denoted as an adjacency matrix A_t ∈ ℝ^(m×m):

A_t(i, j) = 1 if (n_i^t, n_j^t) ∈ L_t, otherwise A_t(i, j) = 0. (1)

The feature value of every node is a 2D coordinate vector in the image coordinate system, that is, F(n_i^t) = (x_i^t, y_i^t). In this paper, 14 joints of the driver's upper body are selected to build the node set N_t. The input sequence of skeleton graphs is denoted as G = {G_1, G_2, …, G_T}. Different from the traditional convolutional operation on a digital image, where pixels are naturally ordered as regular grids, the nodes of a graph lie irregularly in non-Euclidean space. Therefore, the sampling function and the weight function must be redefined for graph convolution.
Similar to the sampling function of traditional convolution, the graph sampling function at node n_i^t is defined as follows:

B(n_i^t) = { n_j^t | d_ij ≤ D }, (2)

where d_ij is the shortest path length between node n_i^t and node n_j^t according to the link set L_t. In this paper, we set D = 1, which means the receptive nodes are those directly connected to the root node.
We further simplify the node-labelling process and the weight distribution based on Yan et al. [33]. In this work, the graph connectivity is represented by a fixed link set L_t, so the labelling process can be simplified by manually defining the order of all nodes, as shown in Figure 2(b). Then, we define the part-aware partition strategy shown in Figure 2(c). Generally, the human body moves in local parts; for instance, the eyes, head, and hands will jointly sustain the task load when pointing to a specific visual target [35]. Therefore, three nodes (No. 0, No. 4, No. 8) are defined as sub-central nodes representing the centres of the head, left hand, and right hand. Nodes in the receptive field can then be divided into three subsets: i) the root node, ii) sub-central nodes, and iii) neighbour nodes (the remaining nodes in the receptive field). The partition strategy can be denoted as

l(n_j^t) = 0 if n_j^t is the root node; 1 if n_j^t is a sub-central node; 2 otherwise. (3)

Next, every subset is assigned a trainable weight value, so the graph weight function can be simplified as

w(n_j^t) = W(l(n_j^t)). (4)

That means the graph weight function can be denoted as a matrix W ∈ ℝ^(3×C). According to the graph sampling function, the graph weight function, and the partition strategy introduced above, the output feature vector F_out^t of the node n_i^t at time step t is defined as

F_out^t(n_i^t) = Σ_{n_j^t ∈ B(n_i^t)} (1 / Z_i(n_j^t)) F_in^t(n_j^t) w(l(n_j^t)), (5)

where F_in^t denotes the input feature vector at time step t and Z_i(n_j^t) is the cardinality of the subset containing n_j^t, which normalises each subset's contribution. To apply the graph convolution to the whole human skeleton graph, we use the following implementation of graph convolutional networks. Assume the graph connectivity is represented by the aforementioned matrix A_t ∈ ℝ^(m×m), and A_t is divided into three sub-matrices according to the node partition:

A_t = A_t^1 + A_t^2 + A_t^3, (6)

where A_t^1 contains the self-connections of root nodes, A_t^2 contains the connections between root nodes and sub-central nodes, and A_t^3 contains the connections between root nodes and neighbour nodes.
Thus, the graph convolution over the whole skeleton graph can be implemented as

F_out^t = Σ_{k=1}^{3} Λ_k^(−1/2) A_t^k Λ_k^(−1/2) F_in^t W_k, (7)

where Λ_k denotes the normalisation matrix of partition A_t^k, aiming to balance the contributions of the different partitions. The spatial feature can then be denoted as V = {F_out^1, F_out^2, …, F_out^T} by stacking multiple layers of graph convolution and assembling the output feature vector at every time step. The architecture of the graph convolutional networks is shown in Table 1.
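The partitioned graph convolution of Equation (7) can be sketched in a few lines of numpy. Note this is an illustrative toy example, not the paper's implementation: the five-joint layout, link choices, and random weights are assumptions, and only the root/sub-centre/neighbour partition structure follows the text.

```python
import numpy as np

def normalised(A):
    """Symmetrically normalise one adjacency partition: Lambda^-1/2 A Lambda^-1/2."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def graph_conv(F_in, partitions, weights):
    """One spatial graph-convolution step: sum_k norm(A_k) @ F_in @ W_k."""
    return sum(normalised(A_k) @ F_in @ W_k for A_k, W_k in zip(partitions, weights))

m, c_in, c_out = 5, 2, 8                  # 5 toy joints with 2-D coordinates
A_root = np.eye(m)                        # partition 1: self-connections of root nodes
A_sub = np.zeros((m, m)); A_sub[0, 1] = A_sub[1, 0] = 1   # partition 2: root <-> sub-centre
A_nbr = np.zeros((m, m)); A_nbr[1, 2] = A_nbr[2, 1] = 1   # partition 3: root <-> neighbour
rng = np.random.default_rng(0)
W = [rng.standard_normal((c_in, c_out)) for _ in range(3)]  # one W_k per partition
F_out = graph_conv(rng.standard_normal((m, c_in)), [A_root, A_sub, A_nbr], W)
print(F_out.shape)  # (5, 8): m joints lifted from 2-D coordinates to 8-D features
```

Each partition carries its own trainable weight matrix W_k, which is exactly what lets the layer treat the root, sub-central, and neighbour joints differently.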

Temporal LSTM
To construct the relation between the spatial configuration and the temporal dynamics, LSTM networks are adopted on top of the aforementioned GCNs. The spatial structure feature V = {F_out^1, F_out^2, …, F_out^T} captured by the GCN is transferred into the LSTM networks. Through the input gate i_t, forget gate f_t, and output gate o_t, LSTM cells learn how much information to store or discard and update the hidden state h_t according to the input data v_t and the previous hidden state h_{t−1}. For the first LSTM layer, v_t = F_out^t. The functions of the basic LSTM cell are defined as follows:

i_t = σ(W_i [v_t, h_{t−1}] + b_i),
f_t = σ(W_f [v_t, h_{t−1}] + b_f),
o_t = σ(W_o [v_t, h_{t−1}] + b_o),
g_t = tanh(W_g [v_t, h_{t−1}] + b_g),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t,
h_t = o_t ⊙ tanh(c_t), (8)

where σ denotes the sigmoid activation function, ⊙ denotes element-wise multiplication, and W and b are the weight matrices and bias vectors, respectively. The single-layer LSTM networks are implemented as shown in Figure 3. The output vectors of all memory cells of the last LSTM layer are concatenated as H = (h_1, h_2, …, h_T). Generally, the hidden state of the last memory cell of the LSTM layer, that is, h_T, is adopted as the final output vector for classification. However, human action is an evolutionary process, so different frames contribute differently. For example, when the driver is answering the phone, the frames of holding the phone near the ear are certainly more representative than those of lifting the phone. To fully utilise all hidden states of the last LSTM layer and adaptively emphasise keyframes, an attention mechanism is placed on top of the LSTM layer to allocate a trainable weight value to every hidden state:

α = softmax(w^T tanh(H)),
H* = H α^T, (9)

where H ∈ ℝ^(n×T) is the output feature matrix of the LSTM layer, n denotes the size of each LSTM memory cell, T denotes the length of the sequence, w denotes the trainable weight vector, and α denotes the distribution coefficient vector.
The dimensions of w and α are n and T, respectively, and H* refers to the final feature vector of the attention model. Figure 3 shows the structure of the attention-enhanced LSTM networks. The spatial feature captured by the GCN is transferred into a single-layer LSTM network, and each memory cell calculates spatial-temporal features through the state transition. The attention mechanism then allocates each memory cell a unique weight and captures the synthesised spatial-temporal feature of the ST-GCLSTM model. Finally, a fully connected layer with a SoftMax classifier computes the score of each class from the spatial-temporal feature of the driver's activity.
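The attention read-out over the LSTM hidden states can be sketched as follows. This is a minimal numpy illustration under the assumption that the score takes the common form α = softmax(wᵀ tanh(H)); the exact scoring function of the paper's attention layer is not fully specified, and the dimensions n and T follow the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_readout(H, w):
    """H: (n, T) hidden states; w: (n,) trainable vector -> weighted context vector."""
    alpha = softmax(w @ np.tanh(H))   # (T,) distribution over frames
    return H @ alpha, alpha           # context = weighted sum of hidden states

n, T = 4, 6                           # toy cell size and sequence length
rng = np.random.default_rng(1)
H = rng.standard_normal((n, T))
context, alpha = attention_readout(H, rng.standard_normal(n))
print(context.shape, round(float(alpha.sum()), 6))  # (4,) 1.0
```

Frames whose hidden states align with the learned vector w receive larger α entries, which is how the keyframes of an activity (e.g. the phone held at the ear) come to dominate the final feature.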

EXPERIMENTS
In this section, detailed descriptions of the data collection will be introduced first. Then, the ablation study and the comparative study with other methods are conducted to evaluate the effectiveness of the proposed model. Finally, the real-time feasibility of the proposed architecture is evaluated, verified, and discussed.

Data collection
In this study, a low-cost monocular camera is mounted on the right side of the vehicle to collect driving activity data. The raw video resolution is 1920 × 1080 × 3, later compressed to 320 × 180 × 3 to reduce the computation cost. A total of seven drivers, including six males and one female, were invited to participate in the driver activity data collection. The mean age of the participants is 23 years and the mean driving experience is 3.9 years. The procedure of data collection can be briefly divided into two stages as follows.
(i) Naturalistic driving data collection. For the sake of safety, participants were asked to drive the experimental vehicle on a specific test route (a road inside the campus with few pedestrians and vehicles). Participants executed secondary tasks according to their driving habits, and the order of task execution was random. Moreover, a director monitored the participants and gave timely safety warnings in case of an emergency. (ii) Simulative driving data collection. Since secondary-task engagement data in realistic driving conditions are limited, participants were asked to mimic the driving-related tasks in a stopped vehicle. Participants performed secondary tasks according to their habits and their experience in the first stage. Participants resumed normal driving for 1 min every time they finished a secondary task, and the execution order was random, so that the simulative data of driving-related tasks would be more representative of naturalistic driving conditions.
It took around 90 min for every participant to finish the experiment. The collected videos were then cut into video clips and classified into the eight driver activity categories mentioned above. According to Li et al. [36], most driver activities last several seconds, and short-term activities such as mirror checking generally last 0.5 to 1.0 s. Thus, we set the sequence length to 15 frames so that the temporal variations among different activities are balanced. Finally, the driver activity dataset contains 12,177 video clips; the numbers of the secondary driving-related tasks are listed in Table 2.

Transfer learning and data pre-processing
We use the open-source toolbox AlphaPose [37][38][39] to obtain the 2D pixel coordinates (x, y) of the driver's upper-body joints in every frame. However, in naturalistic driving conditions, the illumination varies and the lower body of the driver may be missing from the camera view, so the original AlphaPose can underperform in this condition. Therefore, we use our collected datasets to fine-tune the last few layers of the pre-trained model. Specifically, the original fully connected layer of the AlphaPose model outputs 25 joints of the whole human body; we replace it with a new fully connected layer that outputs the 14 joints of the driver's upper body. Figure 4 shows some collected images and the skeleton estimation results. Nevertheless, the in-vehicle scenario remains complex because of illumination changes and camera shake, so some joints of the human skeleton can be lost or misrecognised. Therefore, we use a temporal exponential mean filter with a confidence threshold to correct misrecognised joints. Specifically, we set a temporal sliding window of five frames; if the confidence value of a joint falls below the threshold, the joint is replaced by the exponential mean value of the corresponding joint over the past five frames. The exponential mean filter is defined as

S_i^t = α P_i^t + (1 − α) S_i^{t−1}, (10)

where P_i^t is the observed position of joint i at frame t, S_i^t is its smoothed estimate, and α is a hyper-parameter set to 0.7. All skeleton joint positions are filtered; an example of the smoothed trajectory of the driver's right wrist is shown in Figure 5.
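A minimal sketch of the confidence-gated exponential mean filter described above. The exact recursion and fallback behaviour are not fully specified in the text, so this is an assumption-laden illustration: a low-confidence joint falls back to its previous smoothed position, and α = 0.7 as stated.

```python
def exp_mean_filter(positions, confidences, alpha=0.7, threshold=0.5):
    """positions: list of (x, y) joint coordinates per frame;
    confidences: per-frame detection scores for that joint."""
    smoothed = [positions[0]]
    for pos, conf in zip(positions[1:], confidences[1:]):
        prev = smoothed[-1]
        if conf < threshold:        # joint lost or misrecognised this frame
            pos = prev              # fall back to the running smoothed estimate
        smoothed.append((alpha * pos[0] + (1 - alpha) * prev[0],
                         alpha * pos[1] + (1 - alpha) * prev[1]))
    return smoothed

# Frame 2 is a spurious detection (low confidence) and gets smoothed away.
traj = [(0.0, 0.0), (1.0, 1.0), (50.0, 50.0), (2.0, 2.0)]
conf = [0.9, 0.9, 0.1, 0.9]
out = exp_mean_filter(traj, conf)
print(out[2])  # (0.7, 0.7): the outlier is replaced, not propagated
```

The effect is that single-frame detection glitches are suppressed while genuine motion (frame 3 onwards) is tracked with only a small lag.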

The solution of data imbalance
In naturalistic driving conditions, the number of secondary driving tasks is limited. Even though we sample the normal-driving data and ask participants to mimic secondary tasks in a stopped vehicle, the data imbalance between normal driving and the other tasks is still large. Therefore, we utilise the focal loss [40] to overcome the data imbalance problem. The focal loss is defined as

FL = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} α_c (1 − p_{i,c})^γ y_{i,c} log(p_{i,c}), (11)

where N is the number of samples, M is the number of classes, y_{i,c} is a binary indicator (0 or 1), with y_{i,c} = 1 if class label c is the correct classification for observation i, and p_{i,c} is the predicted probability of class c for observation i. α_c is a balance factor; we set a different α_c for every class so that the loss of small-scale classes is increased and the loss of large-scale classes is reduced. γ is a modulating factor which increases the loss of hard samples while reducing the loss of easy samples, improving the robustness of the model. According to our experiments, α_c = 0.7 if class c is normal driving, α_c = 0.3 otherwise, and γ = 2.
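The class-balanced focal loss above can be sketched in numpy, following the standard formulation of Lin et al. [40] with the text's values α_c = 0.7 for normal driving, 0.3 otherwise, and γ = 2. The 8-class probabilities below are toy inputs, not the paper's data.

```python
import numpy as np

def focal_loss(probs, labels, alpha, gamma=2.0):
    """probs: (N, M) predicted class probabilities; labels: (N,) true class
    indices; alpha: (M,) per-class balance factors. Returns the mean loss."""
    p_t = probs[np.arange(len(labels)), labels]      # probability of the true class
    return float(np.mean(-alpha[labels] * (1 - p_t) ** gamma * np.log(p_t)))

alpha = np.array([0.7] + [0.3] * 7)                  # class 0 = normal driving

# Confident (easy) predictions: true class gets 0.86, the rest share 0.14.
probs_easy = np.full((2, 8), 0.02)
probs_easy[0, 0] = probs_easy[1, 3] = 0.86
easy = focal_loss(probs_easy, np.array([0, 3]), alpha)

# Uncertain (hard) predictions: uniform over the 8 classes.
probs_hard = np.full((2, 8), 0.125)
hard = focal_loss(probs_hard, np.array([0, 3]), alpha)
print(easy < hard)  # True: the modulating factor down-weights easy samples
```

The (1 − p_t)^γ term is what shifts training effort onto misclassified or ambiguous clips, while α_c rescales the contribution of each class.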

Equipment and software details
The proposed model is implemented on the 64-bit open-source Linux operating system Ubuntu 16.04 with CUDA 10.0 and cuDNN 7.4 on an Intel Core i7 3.2 GHz computer. The program is written in Python 3.5 on the TensorFlow 1.4 deep learning platform. We use two NVIDIA RTX 2080 Ti GPUs for human skeleton extraction and ST-GCLSTM model training and testing. Specifically, we build a five-layer GCN and a single-layer attention-enhanced LSTM network.

Evaluations of the ST-GCLSTM model
In this section, the driver activity recognition performance of the proposed framework is evaluated. According to [16], the leave-one-out (LOO) validation method can better verify the generalisability of the proposed model. Specifically, the data of one arbitrary driver are used as the testing dataset and the rest as the training dataset, so the testing data of each driver are new to the model. We use LOO validation for all models in our experiments, in which labels T1 to T8 indicate normal driving, left checking, right checking, texting, answering the phone, using media, drinking, and picking up objects, respectively, while D1 to D7 refer to seven different drivers.

FIGURE 6
The confusion matrix of the basic LSTM

The effectiveness of the components of the proposed ST-GCLSTM model is evaluated in the following contents, including the evaluations of the graph convolution and of the attention mechanism. Since the recall ratio is the fraction of the total amount of positive instances that are correctly recognised, we mainly focus on the recall ratio among the evaluation indexes. For the sake of safety, a high recall ratio of secondary-task engagements is essential, whereas accuracy and precision are inadvisable in this experiment because of the data imbalance.
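The LOO protocol described above can be sketched as follows: the clips of one held-out driver form the test set and the remaining six drivers' clips form the training set, rotating over all seven drivers. The clip identifiers are illustrative placeholders, not the dataset's actual structure.

```python
def loo_splits(clips_by_driver):
    """clips_by_driver: dict driver_id -> list of clips.
    Yields (held_out_driver, train_clips, test_clips) for each fold."""
    for held_out in clips_by_driver:
        test = clips_by_driver[held_out]
        train = [c for d, cs in clips_by_driver.items()
                 if d != held_out for c in cs]
        yield held_out, train, test

# Toy dataset: 7 drivers (D1..D7) with 3 clips each.
data = {f"D{i}": [f"D{i}_clip{j}" for j in range(3)] for i in range(1, 8)}
folds = list(loo_splits(data))
print(len(folds), len(folds[0][1]), len(folds[0][2]))  # 7 18 3
```

Because every fold tests on a driver the model has never seen, the protocol measures cross-subject generalisation rather than memorisation of individual drivers' habits.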
First, the effectiveness of the graph convolution is evaluated, where the baseline is a basic single-layer LSTM network with 128 units whose inputs are simple concatenations of the joint positions. We compare the ST-GCLSTM (without attention) with the basic LSTM to show the advantages of the graph convolution. Figures 6 and 7 show the confusion matrices of the baseline and the ST-GCLSTM. The rightmost column reflects the recall ratio and the bottom row shows the precision ratio. The general recall ratios of the basic LSTM and the ST-GCLSTM are 81.2% and 87.9%, respectively; thus, the effect of the graph convolution is evident. Moreover, texting has the most considerable enhancement, with its recall ratio increasing by 12.3% over the baseline. Besides, the confusion between normal driving and the secondary tasks is largely alleviated; specifically, the misprediction ratio between normal driving and texting decreases by 13.4%. The graph convolution has a strong ability to capture subtle features through the connectivity of different joints. These facts reflect that the graph construction extracts effective spatial features from the human skeleton positions and improves the driver activity recognition performance.
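The recall figures quoted above come straight from the confusion matrices: for each class, recall is the diagonal entry divided by its row sum (rows being the true classes). A quick sketch with a toy 3-class matrix, not the paper's results:

```python
import numpy as np

def recall_per_class(cm):
    """cm: square confusion matrix, rows = true classes, cols = predictions."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

cm = [[90,  5,  5],   # true class 0: 90 of 100 recognised
      [10, 80, 10],   # true class 1
      [ 0, 20, 80]]   # true class 2
print(recall_per_class(cm))  # [0.9 0.8 0.8]
```

The "general recall ratio" in the text is then the recall taken over all instances, which is why under-represented secondary tasks can drag it down despite high accuracy on normal driving.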
Then we compare the ST-GCLSTM with its attention-enhanced variant to evaluate the influence of the attention mechanism. Figure 8 shows the confusion matrix of the attention-enhanced ST-GCLSTM. The general recall ratio is 88.8%, which is slightly higher than that of the ST-GCLSTM. Regarding the task of answering the phone, the attention-enhanced ST-GCLSTM achieves an improvement of 11.0%. Answering the phone is a long-term activity whose sequence within the time window contains many irrelevant frames. The introduced attention mechanism adaptively allocates a unique weight value to every frame in the series, so that the critical frames containing more informative features can be emphasised and the extraction of temporal dependency can be further strengthened. Table 3 illustrates the classification results of the attention-enhanced ST-GCLSTM using the LOO evaluation. The rightmost column indicates the mean recall ratio of every driver over all eight driving tasks, while the bottom
row depicts the average recognition result of each driving task among all drivers. Specifically, the classification recall ratios of all drivers are higher than 80%, and Driver #3 achieves the highest recognition recall ratio of 95.6%. The worst classification performance (80.2%) is found for Driver #6. One reason is that Driver #6, a cautious female driver, tends to perform driving tasks slightly and swiftly, so the duration is too short to capture enough motion features. Moreover, Driver #6 tends to use eye movement rather than head movement to check targets, and those body dynamics are harder to extract. Regarding the results of the secondary tasks, picking up objects gets the best recall ratio of 98.8%, while drinking gets the lowest recall ratio of 71.5%. These results are all consistent with the confusion matrix shown in Figure 8. Table 4 further shows the macro precision ratio, recall ratio, and F1 score of the aforementioned models in the component comparison study. Figure 9 shows the PR curves of the baselines and the proposed method; it shows that the proposed method outperforms the baselines.

Comparisons with other methods
To further evaluate the proposed model, we also compare it with other relevant methods. Please note that the results of the other methods may be lower than originally reported because we use our own collected datasets to maintain consistency; moreover, the LOO evaluation approach is utilised, which is stricter than normal cross-validation approaches. Generally, the related works in the comparison can be divided into three classes.

FIGURE 10
Comparison of the PR curves between the proposed method and other methods

(ii) Graph convolution based method, which operates on human skeletons. In comparison with the ST-GCN, our model puts the LSTM at the back-end of the GCN, so that spatial-temporal features are calculated synthetically instead of only computing temporal convolutions over the same joints between consecutive frames. (iii) Convolutional neural network based method. Compared with the GMM-AlexNet, which predicts driver activity from raw image data, the ST-GCLSTM model formulates the temporal dependency in sequences and uses the human skeleton as the network input, diminishing background noise while preserving the representativeness of the image data. Table 5 further shows the comparison results of the proposed model and other related methods, and Figure 10 demonstrates the precision-recall (PR) curves of the proposed model and the other methods, which show that the proposed ST-GCLSTM model outperforms the relevant methods and presents a promising classification performance.

Evaluations of the real-time application
To further verify the implementation feasibility of the proposed model, the experiment for real-time application is also discussed. Specifically, the test system is implemented on a laptop with an Nvidia GTX 1060 GPU and the Linux Ubuntu 16.04 system installed.

FIGURE 11
Results of driver activity recognition

The driver skeleton extraction toolbox AlphaPose contributes the majority of the computation cost. To accelerate the computation, on the one hand, we simplify AlphaPose and discard irrelevant functions to avoid repetitive image reading and writing; on the other hand, we reduce the size of the input image to 320 × 180. Thus, the quality of the extracted skeleton remains satisfying while the time consumption of the AlphaPose processing is reduced to 32 ms. Moreover, we use data-parallel processing, and it takes about 9 ms for the ST-GCLSTM to recognise the driver activity. Therefore, the calculating efficiency of the whole system reaches 24 fps, which satisfies the requirement of real-time applications. Figure 11 shows the confidence curves of the different driving-related activities in real-time applications, demonstrating the transition from normal driving to texting. The lines and shades reflect the mean probability and the standard deviation of the different tasks, and the two blue dashed lines mark the changing points between tasks. The proposed ST-GCLSTM model can effectively and precisely recognise driver activity. The dataset will be enriched and more categories will be included in our future work, so that the real-time application can be further improved.

CONCLUSION AND FUTURE WORK
Overall, a real-time driver activity recognition system is proposed in this study. Human body skeleton data are extracted from raw videos to represent human body motion. To reduce the noise in the raw joint data, a temporal exponential mean filter is utilised to correct lost or mispredicted joints. Then, the ST-GCLSTM model with an attention mechanism is constructed to reason about spatial features among different skeleton joints and capture temporal dependencies through consecutive frames. To overcome the data imbalance problem, a well-structured driving task dataset is collected, and the focal loss function is further utilised for model training. Manifold comparison experiments show that the proposed ST-GCLSTM model achieves a recall ratio of up to 88.80% on eight common driving-related tasks. The real-time application efficiency reaches up to 24 fps, which shows the potential of transferring the model to embedded hardware and satisfying the requirements of engineering applications.
In the future, the proposed system will be further updated and tested in naturalistic driving conditions, and more driver activities will be taken into consideration. Moreover, the head motions and facial features of the driver will be taken into account to offer more insight into the driver's status. Also, the cases of multiple passengers in the vehicle will be handled by utilising an object detection algorithm.