Temporal pyramid attention-based spatiotemporal fusion model for Parkinson's disease diagnosis from gait data

Parkinson's disease (PD) remains an ongoing challenge in daily clinical practice. To reduce the time and effort of diagnosis and even to assess PD severity levels, a temporal pyramid attention-based spatiotemporal (PAST) fusion model for the diagnosis of PD is proposed using gait data from ground reaction forces. This model is innovative in two aspects. First, the temporal pyramid attention module extracts multiscale temporal attention from the raw sequences. Second, 1D convolutional neural network and bidirectional long short-term memory layers are used together to learn spatial fusion features from multiple channels, yielding multichannel, multiscale fusion features. Experiments are performed on the PhysioBank data set, and the results show that the proposed PAST model outperforms other state-of-the-art methods on classification. This model can assist in the diagnosis and treatment of PD using gait data.


| INTRODUCTION
Parkinson's disease (PD) is a neurodegenerative brain disorder. It predominantly affects dopaminergic neurons in the substantia nigra, which control balance and movement [1]. It is the second most common neurodegenerative disease in people over 60 years of age, after Alzheimer's disease. In 2015, PD affected 6.2 million people and resulted in approximately 117,400 deaths globally [2]. With increasing public awareness, computer-aided diagnosis (CAD) of PD has been developed to assist clinicians in diagnosing the disease at an early stage.
The primary symptoms of PD are tremor, postural instability, muscle rigidity, and slowness of movement. These symptoms alter gait in the early stage of PD [3]. Gait parameter analysis, as an automatic, non-invasive method, has been widely used for the diagnosis of PD by observing different kinds of gait features, such as stride fluctuations, time-stamped gait data, wrist sensor signals, gait rhythm signals, and signal turn counts.
Ground reaction forces (GRFs) are a kind of gait variable that can be measured easily and non-invasively by force-sensitive resistors on the feet. GRFs differ among persons and even for the same person at different times [4]. Therefore, handcrafted feature extraction methods have limited power for gait analysis using GRFs. Although many researchers focus on gait data analysis, clinicians still lack objective tools to assist in gait evaluation. Detecting PD and predicting PD severity from gait data remain challenging.
With the current deluge of the success of deep learning and the urgent need for PD diagnosis, we propose a spatiotemporal fusion model to classify and estimate PD severity levels from gait reaction force data. The flow chart of our temporal pyramid attention-based spatiotemporal (PAST) fusion model is shown in Figure 1.
The main contributions of this work are summarized as follows: (1) A data-level fusion method is proposed that incorporates multiple sources of raw sensor data to acquire fused data that are more definite, informative, and comprehensive than the primitive sources. (2) The combination of 1D convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) layers is studied to extract spatial features from multichannel sensor data and obtain more comprehensive multisensor fusion features. (3) With the proposed temporal pyramid attention module, global and local temporal attention can be obtained from input sequences to enhance the multiscale temporal characteristics of the input signals. This module is a self-attention module requiring no additional training parameters.
Extensive experiments are performed to validate the effectiveness of our model, which realizes state-of-the-art results on the PhysioBank data set [5].

| RELATED WORKS
In PD diagnosis, large quantities of human activity data have been collected, and numerous feature extraction and machine learning methods have been explored in previous works. Here, we divide these methods into traditional handcrafted feature extraction methods and deep learning methods.
In traditional handcrafted feature extraction methods, Song-Hong Lee et al. [6] combined gait characteristics of the vertical ground reaction force (VGRF) from PD patients' feet with wavelet-based feature extraction to classify PD patients and healthy controls. Daliri et al. [7] applied a short-time Fourier transform (STFT) to each input signal using the same VGRF input data. Ertugrul et al. [8] proposed an algorithm based on shifted 1D local binary patterns with a multilayer perceptron to classify the feature vectors. Wei Zeng et al. [9] used phase space reconstruction, empirical mode decomposition, and neural networks to classify gait patterns of PD patients and healthy controls. Although the aforementioned methods achieve satisfactory results on specific data sets, their capacity is limited across data sets because the features are manually defined.
In deep learning, recurrent neural networks (RNNs) are widely used to process time series and sequences. Many RNN variants, such as long short-term memory (LSTM) [10] and gated recurrent units (GRUs) [11], have been proposed to improve RNNs. BiLSTM [12] is a further development of LSTM that accesses both the preceding and following contexts by combining a forward hidden layer and a backward hidden layer. The attention mechanism (AM) [13] highlights important contextual information by assigning different weights, and combining LSTM with AM can further improve classification accuracy. Aite Zhao et al. [14] proposed a two-channel model that combines LSTM and CNN to learn the spatiotemporal patterns of gait data, which was defined as a decision-level fusion method. Maachi et al. [15] proposed a network with 18 parallel 1D CNNs for feature extraction from VGRF signals. Pyramid attention models have also been proposed to obtain multiscale attention for classification. Yang Du et al. [16] presented an interaction-aware spatiotemporal pyramid attention network for action classification, which constructed a spatial pyramid and utilized multiscale information to obtain more accurate attention scores. Hanchao Li et al. [17] introduced a feature pyramid attention module and a global attention upsampling module, which exploited global contextual information as guidance for low-level features to select category localization details.
Inspired by the aforementioned methods, we present a temporal PAST fusion model for the diagnosis of PD from gait data. It involves a temporal pyramid attention module for multiscale attention extraction in the time domain and a spatiotemporal-domain fusion method for multiple channels, which together yield higher classification accuracy than other methods. It is an end-to-end deep learning method whose input is gait data and whose output is the classification result for PD levels.

| OUR APPROACH
In this section, we describe the PAST model in detail. First, this method uses a temporal pyramid attention model to extract multiscale temporal attention. Second, considering spatial relations between channels, it uses a spatiotemporal fusion model to learn the spatial and temporal features of the sequences. Then, features are classified into PD levels by the classification layer.

| Temporal pyramid attention model
The spatial pyramid attention module has been widely used to obtain better feature representations by combining multiscale features [16,17]. Motivated by this model, we present a temporal pyramid attention model to extract multiscale temporal attention from raw inputs.
The temporal pyramid attention model is composed of multiple streams for extracting multiscale attention, as shown in Figure 1. According to the range that attention covers, we define global attention G_attention and local attention L_attention. The top stream calculates G_attention, and the other streams calculate L_attention with different time intervals. Global attention G_attention compares the value at every time step with the global maximum: X_Global is obtained by a global max-pooling layer and then upsampled to the same dimension as the input X, and G_attention is calculated using the 1-norm by Equation (1). Local attention L_attention compares the value at every time step with the local maximum, as in Equation (2): X_Local is obtained by a max-pooling layer with a certain time interval and then upsampled to the same dimension as the input X. Different time intervals are used to compute X_Local across the multiple streams, yielding multiscale temporal attention.
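As a concrete illustration, the global and local streams can be sketched in a few lines of NumPy. This is a minimal sketch only: Equations (1) and (2) are not reproduced in the text, so the per-time-step 1-norm comparison against the (upsampled) maxima below is an illustrative assumption rather than the exact published formula, and the function name is our own.

```python
import numpy as np

def pyramid_attention(x, intervals=(4,)):
    """Sketch of the temporal pyramid attention streams.

    x: (T, C) multichannel sequence. The top stream compares each time
    step with the global maximum; each extra stream compares it with a
    local maximum pooled over `interval` samples and upsampled back to
    length T. The 1-norm comparison is an illustrative assumption.
    """
    T, C = x.shape
    # Global stream: global max pooling, "upsampled" here by broadcasting.
    x_global = x.max(axis=0, keepdims=True)           # (1, C)
    streams = [np.abs(x - x_global).sum(axis=1)]      # 1-norm per time step
    # Local streams: max pooling over each interval, repeated back to length T.
    for k in intervals:
        pooled = np.array([x[i:i + k].max(axis=0) for i in range(0, T, k)])
        x_local = np.repeat(pooled, k, axis=0)[:T]    # upsample to length T
        streams.append(np.abs(x - x_local).sum(axis=1))
    return np.stack(streams)                          # (1 + len(intervals), T)
```

Time steps at (or near) a maximum score zero distance, so a downstream layer can turn these scores into attention weights at any of the pyramid's scales.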

| Spatial feature learning
VGRF sensors are placed at specific positions on the human feet, and there is strong correlation among the multiple sensors, so we present a spatial feature learning model to learn the spatial features among sensors at time t. To maintain the temporal order of the sequences, a 1D CNN is used to learn the relations among sensors, while BiLSTM fuses the multiple features into a new representation of all channels. As shown in Figure 2, 1D convolutional filters slide over the sensor channels to learn representations of the spatial features, producing a group of feature maps C = [X_1, X_2, …, X_m] at time t, t ∈ {1, 2, …, T}. We treat the values at each point i, i ∈ {1, 2, …, N}, across the feature maps as a vector f. BiLSTM then learns the relations between channels at time t by reading the points from 1 to N and from N to 1, yielding a whole representation of all channels at time t.
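A minimal sketch of the convolutional step may clarify the shapes involved. The loop below slides filters over the sensor axis at a single time step; the kernels are illustrative stand-ins for the learned 1D CNN filters, and the BiLSTM fusion over the resulting points is omitted.

```python
import numpy as np

def spatial_conv1d(x_t, kernels):
    """Valid 1D convolution (cross-correlation) over the sensor axis.

    x_t: (N,) readings of the N sensors at one time step t.
    kernels: (m, k) m filters of width k, standing in for learned weights.
    Returns feature maps C = [X_1, ..., X_m], each of length N - k + 1.
    """
    m, k = kernels.shape
    n_out = len(x_t) - k + 1
    maps = np.empty((m, n_out))
    for j in range(m):
        for i in range(n_out):
            maps[j, i] = x_t[i:i + k] @ kernels[j]
    return maps
```

Each column `maps[:, i]` is then the vector f for point i that the BiLSTM reads from 1 to N and back from N to 1.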

| Temporal feature learning
LSTM is used to learn the relations between time steps from 1 to T. Our model consists of two LSTM layers: the input to the first LSTM layer is the feature vector for each time t, and its output is a T-length sequence; the second LSTM layer outputs a one-dimensional vector. Fully connected layers map this vector to the label space, and a softmax layer classifies it into the labels of the different PD levels.
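The classification head at the end of this pipeline can be sketched as follows; `W` and `b` are illustrative stand-ins for the learned fully connected weights, and the LSTM layers that produce the vector `h` are omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(h, W, b):
    """Map the vector h emitted by the second LSTM layer to class
    probabilities: a fully connected layer produces C = 4 logits, and
    softmax normalizes them into probabilities for the four classes
    (healthy controls and PD levels 1.0, 1.5, 2.5)."""
    return softmax(W @ h + b)
```

With zero weights the head outputs a uniform distribution over the four classes, which is a handy sanity check before training.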

| VGRF signal and data set
The VGRF signal is obtained from sensors on the feet. In our experiments, data from the public PhysioNet data set are used.
This data set consists of three PD gait sub-data sets, the Ga [18], Si [19], and Ju [20] data sets, which were contributed by three groups of researchers, including Frenkel-Toledo et al. [18].

| Rating scales for PD
The Hoehn and Yahr scale (H&Y) [21,22] is used worldwide for scaling functional disability and objective signs in PD. It was originally designed as a five-point scale (1-5). A modified H&Y scale that includes 0.5 increments was introduced for clinical trials: Stage 1.0, unilateral involvement only; Stage 1.5, unilateral and axial involvement; Stage 2.0, bilateral involvement without impairment of balance; Stage 2.5, mild bilateral disease, some postural instability, physically independent; Stage 4.0, severe disability, still able to walk or stand unassisted; Stage 5.0, wheelchair bound or bedridden unless aided. In our experiments, we only classified PD levels 1.0, 1.5, and 2.5 together with healthy controls (CO).

| Implementation details
Experiments were implemented using the TensorFlow and Keras deep learning frameworks and run on an Intel Core i7 computer with 16 GB RAM. During training, the root mean square propagation (RMSprop) optimizer was used to optimize the network. The initial learning rate was set to 0.01 and multiplied by 0.1 every 50 epochs. The batch size was 32. Cross-entropy loss between the network predictions and the target values was used for classification, as defined by Equation (3), where X_{i,j} is the network response for a given category, T_{i,j} is the target value of that category, and C is the total number of categories (here, C = 4). The output was one of four labels 0, 1, 2, and 3, where 0 indicated a healthy person and PD levels 1.0, 1.5, and 2.5 were labelled 1, 2, and 3, respectively. The cross-entropy loss was calculated from the probability of each observation being assigned to each category, summed over all categories and observations and normalized by the number of observations. For preprocessing, we reshaped the VGRF signal segments as 100 × 19 × N, so each sequence had 100 × 19 dimensions. One hundred samples were collected in two seconds, at an interval of 0.02 s. Because the characteristics were not obvious at the beginning of the signal, we discarded the first 50 seconds. To enlarge the training set, we shifted the time series by 50 samples at a time, obtaining a data set of 100 × 19 × 60,000, where 60,000 is the number of sequences. We used fivefold cross-validation to validate our model and ensured that the training and testing data came from different persons.
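The windowing and labelling scheme described above can be sketched as follows. The function and mapping names are our own; only the window length, stride, channel count, and label assignment come from the text.

```python
import numpy as np

def segment(signal, win=100, stride=50):
    """Sliding-window segmentation used to enlarge the training set:
    windows of 100 samples (2 s of VGRF data) taken every 50 samples,
    matching the 50-sample shift described in the text.

    signal: (L, 19) array, one column per VGRF sensor channel.
    Returns an array of shape (n_windows, 100, 19).
    """
    starts = range(0, signal.shape[0] - win + 1, stride)
    return np.stack([signal[s:s + win] for s in starts])

# Label mapping used in the experiments: healthy controls (CO) are 0,
# and PD levels 1.0, 1.5, and 2.5 are labelled 1, 2, and 3.
LABELS = {'CO': 0, '1.0': 1, '1.5': 2, '2.5': 3}
```

The 50-sample overlap doubles the number of training windows relative to non-overlapping segmentation while keeping each window a full two-second gait snippet.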

| The design of the temporal attention model
In this section, we discuss the design of the temporal attention model. Local attention was calculated over several local time intervals. We first selected T = 2, 4, 8, 16, and 32 samples as timescales. The classification results of models with different timescales are shown in Figure 3. We found that T = 2 had no effect on improving the classification results, while T = 8, 16, and 32 had little effect compared with T = 4. Increasing the interval to 64 samples also had little influence on the classification results, and combining two different intervals had the same effect as T = 4 alone. Therefore, we chose T = 4 samples as the local time interval for learning the local attention of the series.
Global attention and local attention were used separately to show the effectiveness of the temporal pyramid attention model. Experiments were performed for PD level classification with different attention models. The results are shown in Figure 4.
It can be seen that temporal pyramid attention obtained better classification results than single global attention or local attention on all data sets. Although the improvement in classification accuracy was less than 1%, it is important for PD level classification because it corresponds to a reduction in the PD misdiagnosis rate [23,24].

| The selection of model parameters
We performed experiments to set the parameters and select the layers of the model. We compared the classification accuracy under different settings and finally chose the parameters and layers with the best classification accuracy, as shown in Table 1.
We compared our method with state-of-the-art methods, all validated on the same data set. The classification results are shown in Table 2, which reports accuracy (Acc.), precision (Pr.), F1-score (F1), and recall (Re.). Compared with these methods, the PAST method achieved the highest performance in classifying PD patients and healthy controls, with a classification accuracy of 99.5% for the Ga data set, 98.6% for the Ju data set, 99.5% for the Si data set, and 99.2% for the whole data set. The PAST method also had the highest precision, F1-score, and recall for all data sets.
The second experiment was PD level classification. The results are shown in Table 3. The PAST method also achieved the highest accuracy in classifying PD patients with different severity levels: 99.7% for the Ga data set, 98.5% for the Ju data set, 99.5% for the Si data set, and 98.9% for the whole data set. It likewise had the highest precision, F1-score, and recall for all data sets. Based on these results, we created confusion matrices for the different data sets to show the accuracy of predicting PD rating levels with the PAST model in Figure 5.

| Ablation study
In this section, we conduct ablation experiments to investigate the effect of the different components of our model. First, we evaluated the necessity of the temporal pyramid attention model by using only the spatiotemporal fusion model, which we call the ST model. Second, we evaluated the necessity of the CNN-BiLSTM layer for spatial feature learning by using only a stacked LSTM model, which we call the S-LSTM model. Third, we evaluated the necessity of the BiLSTM layer by using a CNN for spatial feature learning and LSTM for temporal feature learning, which we call the C-LSTM model. The results of these models for PD level classification on the whole data set are shown in Table 4. We also increased the layers and parameters of every model in our experiments; the results in Table 4 are the best for each model, so we consider that the classification results depend only on the structure of the model.
The PD classification accuracy of the C-LSTM model was 95.72%, the lowest of the four. The ST model outperformed both S-LSTM and C-LSTM, and the PAST model achieved the highest classification accuracy of 99.18%, which verifies that PAST performs best for PD classification and that all of its components are necessary for PD level classification.

| CONCLUSION
We presented a temporal PAST fusion model for the diagnosis of PD from gait data. A temporal pyramid attention module was proposed to enhance the multiscale temporal characteristics, and 1D CNN and BiLSTM layers were used to learn the spatial-domain features of the sequences. With this structure, we obtained a data-level fusion method for multiple channels. Experiments on the PhysioBank data set show that classification accuracy was enhanced by our method for both PD versus healthy control (CO) classification and PD level classification. Moreover, the PAST model can be extended to CAD of other diseases [26] in future work using spatiotemporal signals.