Consumer Wi-Fi device based action quality recognition: An illustrative example of seated dumbbell press action

A system called WiSDP is proposed, which uses Wi-Fi signals from inexpensive consumer Wi-Fi devices to detect whether a Seated Dumbbell Press action is standard. Compared with schemes based on high-speed cameras and wearable sensors, Wi-Fi devices are insensitive to light and colour, do not require the user to wear any device, and decrease the risk of disclosing privacy. WiSDP senses environment changes through Channel State Information, which is fine-grained physical layer information compared with the frequently used Received Signal Strength Indicator. Compared with action recognition, action quality recognition depends on slight differences between non-standard actions and the standard action, which makes it challenging. The authors propose an improved sliding window algorithm that calculates action energy to extract Seated Dumbbell Press actions from the Channel State Information streams, use Principal Component Analysis and a Butterworth low-pass filter to remove noise, and estimate action quality by choosing an appropriate classifier. The authors conduct experiments in two different scenarios, and the average true positive rates of WiSDP are 94.66% and 95.11%, respectively.

is to evaluate whether the same action meets requirements including accuracy, coordination, elegance and economy. This paper mainly studies the accuracy and coordination of action quality. Accuracy refers to whether the technical action conforms to the specification requirements. For example, a user should raise the dumbbells with his arms directly above his body. Coordination refers to whether the technical movement is coherent, the rhythm is reasonable, and all parts of the body coordinate and cooperate in the process of completing the movement. For example, a user should not shake his body when lifting dumbbells. From the perspective of data acquisition, action recognition can be divided into three categories: sensor-based, vision-based and wireless-based. Sensor-based schemes use wearable sensing devices to capture body movements and postures [2,3], but they require users to wear sensors on the body, which makes users feel uncomfortable. Vision-based schemes use high-speed video recorders to record detailed motion information frame by frame [4,5]. However, the recognition accuracy is easily affected by the light intensity and the contrast between the foreground and background colours. Most importantly, the video contains a lot of sensitive information that is not relevant to the target to be detected, which is likely to result in the disclosure of users' privacy. Wireless-based schemes utilise the way human actions affect Wi-Fi signals to distinguish between different actions [6,7]; this is a device-free recognition technology and can protect users' privacy.
Action quality recognition needs more fine-grained features [8,9]. Most researchers use a wearable sensor-based approach; WiQ [10] is an exception. To estimate driving behaviour quality, Lv et al. present a system called WiQ based on Wi-Fi signals. In that system, two Universal Software Radio Peripheral (USRP) devices are used for sending and receiving Wi-Fi signals, and the Received Signal Strength Indicator (RSSI) is used for sensing. WiQ adopts an activity-based fusion policy which uses at least three actions to estimate the quality of a driving activity, which means WiQ cannot effectively recognise the quality of a single driving action. Beyond that, compared with an inexpensive Wireless Local Area Network (WLAN) card, a USRP is a dedicated wireless device that can obtain more precise RSSI data than a commercial WLAN card, but it has one shortcoming: it is expensive. Radio Frequency Identification Devices (RFID) are used to detect drowsy driving actions in [11]. To get the movement information of the driver's head, the driver needs to wear a hat with an RFID tag. The RFID-based approach requires the driver to wear a special device and can only detect simple head movements, so it is not suitable for detecting more complex body movements. Fine-grained Channel State Information (CSI) obtained from a commercial WLAN card can provide more sensitive information. CSI reflects the channel gain of each subcarrier of the Orthogonal Frequency Division Multiplexing (OFDM) signals used in Wi-Fi systems, while the RSSI detected by inexpensive Wi-Fi devices represents the received power of all OFDM subcarriers. Based on CSI, Chen et al. use Bi-directional Long Short-Term Memory (BLSTM) to recognise six common daily activities: 'Lie down, Fall, Walk, Run, Sit down, Stand up' [12]. There are big differences between these actions, so they can easily be distinguished in various ways.
We design and implement a wireless-based recognition system called WiSDP, which uses an inexpensive consumer WLAN card to obtain CSI and uses a machine learning algorithm to recognise action quality. The system allows users to find out whether there are any errors in their SDP actions without the supervision of a professional fitness trainer and without wearing any special sensors. With such a device-free action quality recognition system, users can easily exercise and get guidance at home. To the best of our knowledge, WiSDP is the first system to use fine-grained CSI to estimate the quality of body building actions. In WiSDP, we have solved several challenges.
First, existing action segmentation algorithms [10,13,14] are not applicable to our system, in which the interval between actions is short and the start time of an action is uncertain. For example, Ren et al. extracted the target action from the CSI waveform within a specific frequency range, which will not work when the frequency of the CSI waveform caused by the action changes significantly [15]. Therefore, we propose the concept of 'action energy', which is the average value of the difference between adjacent packets in a data stream segment. With the action energy scheme, we do not need to estimate the frequency range caused by actions, because changes in the CSI waveform frequency caused by the speed of actions have little influence on the action energy. We propose a sliding window algorithm with variable step size to calculate the action energy. We use a large step size to search for the start and end ranges of the action, and a small step size to search accurately within the range determined by the large step size. In this way, we can remove the data corresponding to no action, transient actions and small actions, which do not belong to SDP.
Second, to solve the problem of high noise caused by cheap Wi-Fi devices, we propose a noise reduction method based on Principal Component Analysis (PCA). We calculate the eigenvalues and eigenvectors of the CSI matrix and remove the eigenvector corresponding to the maximum eigenvalue, because random noise has been shown to be distributed mostly in the first dimension, which corresponds to the maximum eigenvalue [16]. The denoised data can be obtained from the remaining eigenvectors.
Third, to select an appropriate classifier for action quality recognition, we test various machine learning algorithms and find that the Support Vector Machine (SVM) performs well in action quality recognition. Although classifiers based on neural networks can achieve better recognition accuracy, they require many more training samples and a longer training time than the SVM. Therefore, based on SVM, we can easily migrate WiSDP to other action quality recognition scenarios.
In summary, the main contributions of this paper are as follows:
• We propose and implement WiSDP, a system based on cheap consumer Wi-Fi devices, to recognise action quality, taking seated dumbbell press actions as an example.
• We propose to use action energy, which has good robustness, to accurately reflect action changes. An improved sliding window algorithm that dynamically adjusts the search step size is proposed for extracting actions accurately.
• To overcome the shortcoming caused by inexpensive Wi-Fi devices, we propose to use PCA to remove the first-dimensional feature, where noise is mainly distributed.
• After trying many popular machine learning algorithms, we find a classification algorithm with high recognition accuracy, short training time and short running time. Thus, WiSDP can be easily adapted to recognise action quality in different scenarios.
The rest of this paper is organised as follows. We first introduce the related work in Section 2. Then, in Section 3, we present the principle and structure of the WiSDP system, followed by a description of the implementation and evaluation in Section 4. Finally, we give our discussion and conclusions in Sections 5 and 6, respectively.

RELATED WORK
With the development of artificial intelligence technology, coarse-grained action recognition has achieved good results in many fields, but the evaluation of action quality still needs further study. In this section, we introduce the relevant technologies for action quality recognition under three schemes: sensor-based, vision-based and WiFi-based.

Sensor-based action quality recognition
Sensor-based schemes have been widely used in action quality recognition [17,18]. Increasing the number of sensors can effectively improve the recognition accuracy, which gives these schemes good robustness. To recognise lifting action quality, Velloso et al. used four 9-degrees-of-freedom Razor inertial measurement units to record data [19]. Three sensors were worn by the user and one was attached to the dumbbell. Giorgis et al. used the signals of many wearable inertial sensors to detect relevant peaks in the acceleration of limbs (arms and legs) [20]. These data were used to classify the skill level of athletes performing karate kata. The results judged by the system were almost the same as those judged by experts. However, the sensors must be tightly attached to the athletes, which is likely to affect their performance.

Vision-based action quality recognition
Vision-based schemes record the action information as video, which turns the original three-dimensional information into two-dimensional information [21,22]. To get richer visual information, Elkholy et al. used depth cameras to detect abnormality in the performance of common daily actions [23]. They used medical features extracted from three-dimensional skeletal data provided by a depth camera to train a probabilistic normalcy model and built a linear regression model to assess action quality. Depth cameras are susceptible to the angle of view and the environment, so a combination of vision-based and sensor-based schemes is often used to evaluate action quality. Niewiadomski et al. tested two karate actions using video and MoCap to evaluate action quality, and the results showed that the action quality evaluated by the scheme was highly similar to that evaluated by human experts [24].

WiFi-based action quality recognition
For WiFi-based schemes, prior work mainly focuses on how to distinguish different actions, and few researchers pay attention to how to judge whether an action is standard or not. Based on Wi-Fi signals, there are two main sensing schemes: RSSI-based and CSI-based. RSSI measures the superposition effect of multi-path signal propagation and cannot distinguish the individual propagation paths one by one. Therefore, RSSI is often used for coarse-grained indoor positioning and rarely for fine action recognition [25,26]. For example, Haseeb et al. defined seven custom hand gestures to operate a media player using RSSI signals, and the results show that the system can achieve an average accuracy of 87.5% [27]. It is difficult to obtain high recognition accuracy with an RSSI-based scheme even if the movement is very simple. For CSI-based schemes, there is no system for detecting action quality, but researchers have shown that CSI-based schemes can detect even small body actions [28]. Ali et al. proposed a system called WiKey to recognise keyboard typing [29]. The detection accuracy of WiKey was more than 97.5%. However, the high accuracy of WiKey arises because the receiver is only 30 cm away from the keyboard and a high-performance enterprise router, which is too expensive for ordinary consumers, is used as the sending device. Zhang et al. presented a system called WriFi, in which users wrote 26 letters in an air area of 25 cm × 25 cm [30]. A hidden Markov model was established in the system by dividing each letter according to its strokes, and the average recognition accuracy reached 86.75% in the experimental environment. In practice, it is difficult to separate each stroke of a continuous letter, so the system requires the writer to make an unnatural pause.

WiSDP SYSTEM MODEL
WiSDP is a wireless sensor system that works on cheap consumer Wi-Fi devices, obtains fine-grained CSI data from the physical layer and uses feature-based machine learning technologies to recognise human action quality in indoor scenarios. In the overall design, WiSDP is divided into an action sensor module, an action detection module and an action quality recognition module. Figure 1(a) shows the sensor module of the system and Figure 1(b) shows the workflow of the whole system. In the sensor module, the WiFi-based sensor is mainly composed of two commercial devices: an Access Point (AP) (such as a router) and a Detection Point (DP) (such as a laptop). To let the DP sense changes in the environment, we constantly send data to the DP to maintain intensive communication. Any action between the AP and the DP changes the propagation paths of the signals, resulting in different CSI which can be sensed by the DP.
In the detection module, we first pre-process the original CSI signals to remove outliers and high-frequency noise. Then we detect the signals that contain the target actions and separate them from the original signals.
In the action quality recognition module, we first select some representative characteristics from the samples to describe the action quality. Then we train the SVM model using these characteristics. Finally, the trained SVM model can be used to recognise the quality of samples.

Data collection

SDP actions
The seated dumbbell press, one of the most popular body building actions, is chosen for our experiments. In the experiments, we use a stool with no backrest, which causes users to make more wrong actions and brings more challenges to action quality recognition. We list nine SDP actions, including one standard action and eight actions with different
3. Exhale as the dumbbells slowly fall and the arms fall to the starting position.
The difference between action recognition and action quality recognition can be illustrated by Table 1. Action recognition detects different actions, and the CSI waveforms caused by different actions are usually quite different. Therefore, different actions can be easily recognised by analysing the CSI waveforms. Action quality recognition compares instances of the same action, which usually differ only slightly, so it is more challenging. For example, compared with SDP1 (the standard action) in Table 1, SDP2 is identical except that the user lowers his head slightly when pushing up. Compared with the SDP action as a whole, a slight lowering of the head is a small action. To recognise action quality, we have to find the characteristics of these small actions in the CSI waveforms.

CSI measurement
In the PHY layer, to improve the transmission efficiency of data, OFDM converts a high-speed data signal into parallel low-speed sub-data streams by dividing the channel into several orthogonal subcarriers and modulating the sub-streams onto the subcarriers for transmission. OFDM has been applied in IEEE 802.11 a/g/n and is used in the proposed WiSDP.
In the wireless channel, the received signal matrix $Y$ can be expressed as Equation (1):

$$Y = HX + N \tag{1}$$

where $X$ is the transmitted signal matrix, $H$ is the channel gain matrix and $N$ is Gaussian white noise, so the approximation of $H$ can be expressed as Equation (2):

$$h = \frac{Y}{X} \tag{2}$$

In Equation (2), $h$ is the CSI actually obtained, which contains the channel gain of the signal in each subcarrier. Because subcarriers have different frequencies, each subcarrier has a different channel gain. Compared with CSI, the RSSI measured by the Intel 5300 Network Interface Card (NIC) cannot distinguish multiple signals one by one, so RSSI carries less information than CSI [31]. Therefore, CSI is more sensitive to changes in the wireless channel and can perceive subtle differences between actions.
The data structure of a CSI packet is $q \times s$, where $q$ is the number of sending and receiving antenna pairs and $s$ is the number of subcarriers in a sending and receiving antenna pair. In a CSI packet, we use $h_i^{(j)}$ to express the channel gain of the $i$th subcarrier in the $j$th sending and receiving antenna pair. $h_i$ is the channel gain matrix of the $i$th subcarrier in all sending and receiving antenna pairs. $h^{(j)}$ is the channel gain matrix of all subcarriers in the $j$th sending and receiving antenna pair. The channel gain $h$ can be denoted as

$$h = \begin{pmatrix} h_1^{(1)} & h_2^{(1)} & \cdots & h_s^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ h_1^{(q)} & h_2^{(q)} & \cdots & h_s^{(q)} \end{pmatrix} \tag{3}$$

We use $h[k]$ to denote the channel gain matrix of the $k$th packet. In a sending and receiving antenna pair, we can obtain only one value when sensing RSSI, whereas we can obtain $s$ values when sensing CSI.

Data pre-processing
Raw CSI measurements are often affected by interference and environmental noise, so we need to remove the interference and noise to obtain data related to human activities.

Channel gain amplitude outlier removal
We remove outliers from the CSI streams with a Hampel filter [28,32]. Outliers are the points falling outside the closed interval $[\mu - \gamma\sigma, \mu + \gamma\sigma]$, where $\mu$ is the median, $\sigma$ is the median absolute deviation and $\gamma$ is an adjustable coefficient that depends on the data sequence. Finding outliers and replacing them with the average of the data helps to obtain more reliable CSI data.
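The outlier rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it applies the median and median absolute deviation to the whole sequence rather than a sliding window, the coefficient value 3.0 is an assumed default, and replacing outliers with the mean of the remaining points is one reading of "the average of the data".

```python
import numpy as np

def hampel_replace(amplitude, coeff=3.0):
    """Flag points outside [median - coeff*MAD, median + coeff*MAD] and replace them."""
    x = np.asarray(amplitude, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))        # median absolute deviation
    outlier = np.abs(x - median) > coeff * mad
    cleaned = x.copy()
    if outlier.any() and not outlier.all():
        # Assumption: outliers are replaced by the mean of the non-outlier points.
        cleaned[outlier] = x[~outlier].mean()
    return cleaned, outlier

# Hypothetical CSI amplitudes of one subcarrier with a single spike.
stream = np.array([10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8])
cleaned, mask = hampel_replace(stream)
```

Only the spike at index 4 falls outside the interval here, so it is the only value that changes.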

Time outlier removal
Due to the instability of a receiving system built from inexpensive devices, data are occasionally missing from the received packets. To ensure a uniform data format, we set each missing value to the mean of the two adjacent packets. In addition, the inexpensive devices we use cannot precisely control the time interval between received packets. As shown in Figure 2, the receiving interval can sometimes be as much as five times the sending interval. We use a linear interpolation algorithm [33] to recalculate the CSI strictly according to the time interval.
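The resampling step can be illustrated with a short sketch. It assumes each packet carries a receive timestamp and uses `numpy.interp` for the linear interpolation; the function name, the timestamps and the 10 ms interval are hypothetical.

```python
import numpy as np

def resample_uniform(timestamps, values, interval):
    """Linearly interpolate irregularly timed samples onto a uniform time grid."""
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)
    # Number of grid points that fit between the first and last timestamp.
    n = int(np.floor((t[-1] - t[0]) / interval + 1e-9)) + 1
    grid = t[0] + interval * np.arange(n)
    return grid, np.interp(grid, t, v)

# Hypothetical receive times: one gap is four times the 10 ms sending interval.
t_rx = [0.00, 0.01, 0.05, 0.06]
amp = [1.0, 2.0, 6.0, 7.0]
grid, uniform = resample_uniform(t_rx, amp, interval=0.01)
```

The three grid points inside the gap are filled along the straight line between the packets on either side of it.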

Noise removal
CSI can be obtained for each subcarrier, which provides more information than RSSI. To demonstrate the denoising effect, we display the CSI of a single subcarrier. Figure 3(a) shows the original CSI waveform of an SDP action. Compared with the expensive USRP used in [10,13], the CSI data collected by cheap Wi-Fi devices contain more noise, which has random intensity in the time domain and is uncorrelated with human actions [34,35]. Therefore, we use a PCA-based scheme to remove random noise from the CSI streams. First, we use the Z-score standardisation algorithm to normalise the matrix to mean 0 and variance 1. Second, we calculate the covariance matrix and its eigenvalues, and order the eigenvalues from largest to smallest. Then we remove the first eigenvalue and calculate the eigenmatrix corresponding to the remaining eigenvalues, because random noise has been shown to be distributed mostly in the first dimension [16]. Finally, we obtain the denoised CSI matrix by the inverse transformation. Figure 3(b) displays the CSI waveform denoised by PCA. We can see that although there is still a lot of high-frequency noise in the data, much of the noise has been removed and the characteristics of the original data have been retained.
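The four PCA steps above can be sketched in NumPy as follows. This is an illustrative reading of the description, assuming the CSI amplitudes are arranged as a (packets × subcarriers) matrix; the function name and the random demo data are hypothetical.

```python
import numpy as np

def pca_remove_first_component(csi):
    """csi: (packets, subcarriers) amplitudes. Drops the largest principal component."""
    x = np.asarray(csi, dtype=float)
    mean, std = x.mean(axis=0), x.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant columns
    z = (x - mean) / std                       # Z-score normalisation
    cov = np.cov(z, rowvar=False)              # covariance between subcarriers
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    kept = eigvecs[:, order[1:]]               # discard the first component
    z_denoised = (z @ kept) @ kept.T           # project, then map back
    return z_denoised * std + mean             # undo the normalisation

rng = np.random.default_rng(0)
csi = rng.normal(size=(200, 6))                # placeholder data, not real CSI
denoised = pca_remove_first_component(csi)
```

The output keeps the shape of the input but spans one dimension fewer, because the direction of the largest eigenvalue has been projected out.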
Our experimental results show that the frequency of the CSI waveform changes caused by arm and body movements is less than 20 Hz. Therefore, we use a Butterworth low-pass filter to remove high-frequency noise, and the filter frequency response can be expressed as Equation (5):

$$|F(j\omega)|^2 = \frac{1}{1 + (\omega/\omega_c)^{2I}} \tag{5}$$

where $I$ is the order of the filter and $\omega_c$ is the cut-off frequency. In Figure 3(c), we can see that most of the noise has been removed and only the effective information is left.
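To make the low-pass behaviour concrete, the standard Butterworth magnitude response can be evaluated numerically. The cut-off of 20 Hz and the order of 4 are assumptions chosen for illustration; the text only states that movement-related changes stay below 20 Hz.

```python
import numpy as np

def butterworth_gain(freq_hz, cutoff_hz, order):
    """Magnitude |F(jw)| of an ideal Butterworth low-pass filter."""
    ratio = np.asarray(freq_hz, dtype=float) / cutoff_hz
    return 1.0 / np.sqrt(1.0 + ratio ** (2 * order))

# Gain at DC, at the assumed cut-off, and at three times the cut-off.
gains = butterworth_gain([0.0, 20.0, 60.0], cutoff_hz=20.0, order=4)
```

The gain is 1 at 0 Hz, 1/√2 at the cut-off, and falls off steeply above it, which is why components well above the movement band are suppressed.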

SDP action detection
The CSI of adjacent subcarriers is very similar [29]. Thus we divide the $s$ subcarriers into $g$ groups. We merge adjacent subcarriers to get the average channel gain of the $g$ groups, denoted as $\bar{h}_i$, which is expressed in Equation (6), where $\omega_i$ is the weight of the corresponding combined subcarriers, $\lambda \times i \le s$ and $i \in \mathbb{Z}^+$. The value of $\lambda$ is 3 in WiSDP.
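A minimal sketch of the subcarrier merging is given below. It assumes equal weights within each group of three adjacent subcarriers and a demo packet with 2 antenna pairs and 30 subcarriers; both the weighting and the packet size are assumptions, not values taken from the text.

```python
import numpy as np

def merge_subcarriers(packet, group_size=3):
    """packet: (antenna_pairs, subcarriers). Averages each run of adjacent subcarriers."""
    p = np.asarray(packet, dtype=float)
    q, s = p.shape
    g = s // group_size                        # leftover subcarriers are dropped
    return p[:, :g * group_size].reshape(q, g, group_size).mean(axis=2)

packet = np.arange(60, dtype=float).reshape(2, 30)   # placeholder amplitudes
merged = merge_subcarriers(packet)
```

With these sizes the 30 subcarriers of each antenna pair collapse into 10 group averages.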
According to Equations (3) and (6), a packet in a CSI stream after merging the adjacent subcarriers can be expressed as follows:

$$\bar{h} = \begin{pmatrix} \bar{h}_1^{(1)} & \bar{h}_2^{(1)} & \cdots & \bar{h}_g^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ \bar{h}_1^{(q)} & \bar{h}_2^{(q)} & \cdots & \bar{h}_g^{(q)} \end{pmatrix} \tag{7}$$

Rapid actions tend to make the CSI waveform change more drastically [30,36]. We can learn from Figure 3(c) that there is an SDP action between 0.5 s and 3.5 s, and that the CSI waveform changes relatively slowly when there is no action. In related studies, there are many techniques for splitting target actions from CSI streams [14,30,36]. However, these techniques are not suitable for WiSDP, which needs to detect actions separated by small time intervals that can start at any time. We define the action energy to reflect the intensity of the waveform change in a CSI stream. The Mean Absolute Difference (MAD) algorithm, a common way to compare the similarity of two pictures, is used in this paper to calculate the difference between two adjacent packets. In Equation (8), ∇h[i] is the instantaneous action energy between packets i and i − m, and m is an adjustable number. Because actions start at different times, we measure the average energy of a CSI stream over time, as given by Equation (9).
Figure 4 displays the denoised energy of the SDP action. Figure 5 shows the action energy of different actions. The energies of sitting, walking and SDP are close, and the energy of drinking is relatively lower. Since there is a big difference in energy between an action and no action, we can use action energy to estimate whether there are human actions and to extract human actions from CSI streams.
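The idea of action energy can be illustrated with a short sketch. It assumes the instantaneous energy is the mean absolute difference between packet i and packet i − m taken over all merged channel gains, and that the average energy is a plain mean over a window; the exact normalisation in the paper's Equations (8) and (9) may differ, and the function names are hypothetical.

```python
import numpy as np

def instantaneous_energy(stream, m=1):
    """stream: (packets, features). Mean absolute difference between packets i and i-m."""
    x = np.asarray(stream, dtype=float)
    return np.abs(x[m:] - x[:-m]).mean(axis=1)

def window_energy(energy, start, length):
    """Average instantaneous energy over one window of packets."""
    return float(np.mean(energy[start:start + length]))

# Toy stream: five static packets followed by five steadily changing packets.
still = np.ones((5, 4))
moving = np.cumsum(np.ones((5, 4)), axis=0)
stream = np.vstack([still, moving])
energy = instantaneous_energy(stream)
```

The static part yields zero energy and the changing part yields a constant non-zero energy, which is the contrast the detection step relies on.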
In a real-time CSI stream, we do not know when the user starts or ends an SDP action. Before recognising the quality of an SDP action, we must split it from the CSI stream by checking the action energy. To look for possible SDP actions in a real-time CSI stream, we propose a sliding window algorithm with variable step size that searches for a possible SDP action along the timeline, as shown in Algorithm 1. With this algorithm, we use action energy to remove invalid data from the CSI stream, such as data for no action, instantaneous actions and small actions. To improve the retrieval efficiency and ensure the accuracy of the data, we use big steps to detect the start and end ranges of an action, and small steps to accurately retrieve the start and end points. The effect of our algorithm mainly depends on the length of the sliding window a, the step length of the sliding window b and the minimum energy threshold of the sliding window E_m. ∇h[1 : n] is the sequence of CSI packets pre-processed by Equation (8), [E_l, E_h] is the energy range of an SDP action packet, [T_l, T_h] is the time range of an SDP action, and a further coefficient adjusts the step of the accurate search. To keep the search efficient while locating the start and end of the action as accurately as possible, we first use the bigger step b to roughly estimate the range of SDP actions and then use a smaller step for accurate detection. In lines 6 and 16, we first find the start point and end point, respectively. Then we take a step back and continue with a smaller step in line 27. Finally, we get the action range h[start : end] in line 19. We can learn from Figure 5 that SDP actions cannot be effectively distinguished from other actions by action energy alone. We therefore use an AdaBoost classifier with a maximum depth of five to separate the SDP actions from other indoor actions.
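The coarse-then-fine search can be sketched as below. This is a simplified illustration and not Algorithm 1 itself: it uses a single energy threshold instead of the energy range, a minimum length instead of the full time range, and hypothetical parameter names (`window`, `coarse_step`, `fine_step`, `min_len`).

```python
import numpy as np

def find_action(energy, window, coarse_step, fine_step, threshold, min_len=0):
    """Return (start, end) window indices of the first active span, or None."""
    e = np.asarray(energy, dtype=float)
    last = len(e) - window                     # last valid window start

    def active(i):
        return e[i:i + window].mean() > threshold

    def scan(begin, step, want_active):
        for i in range(begin, last + 1, step):
            if active(i) == want_active:
                return i
        return None

    coarse_start = scan(0, coarse_step, True)  # big steps: rough start
    if coarse_start is None:
        return None
    # Step back one big step, then refine the start with small steps.
    start = scan(max(coarse_start - coarse_step, 0), fine_step, True)
    if start is None:
        start = coarse_start
    coarse_end = scan(start, coarse_step, False)   # big steps: rough end
    if coarse_end is None:
        end = last
    else:
        end = scan(max(coarse_end - coarse_step, start), fine_step, False)
        if end is None:
            end = coarse_end
    if end - start < min_len:                  # discard transient actions
        return None
    return start, end

energy = [0.0] * 20 + [1.0] * 30 + [0.0] * 20
span = find_action(energy, window=4, coarse_step=8, fine_step=1, threshold=0.5)
```

In this toy stream the coarse pass first lands inside the action at index 24, and the fine pass walks the start back to index 19, the first window whose mean energy exceeds the threshold.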

Feature extraction
After obtaining the data corresponding to the SDP actions from the CSI stream, we extract features from the denoised CSI matrix. First, we use PCA to extract features from each SDP action. Although PCA extracts features by unsupervised learning, it can often achieve better results when few samples are available. By mapping the n-dimensional features onto k orthogonal dimensions, the correlation between variables is eliminated. Through experiments, we found that the features corresponding to the top 20 eigenvalues express the overall shape of SDP actions well. Then, we divide an SDP action into two parts, push up and pull down, and extract the features caused by different parts of the body and different speeds. We extract a profile for each part and choose three features as the profile: root mean square, variance and peak. Therefore, the number of extracted features is 26 (20 PCA features plus three profile features for each of the two parts).
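The profile part of the feature vector can be sketched as follows. The split index between push up and pull down, the use of the maximum value as the peak, and the zero-filled placeholder for the 20 PCA features are all assumptions made for illustration.

```python
import numpy as np

def profile(segment):
    """Root mean square, variance and peak of one part of an action."""
    s = np.asarray(segment, dtype=float)
    rms = float(np.sqrt(np.mean(s ** 2)))
    return [rms, float(np.var(s)), float(np.max(s))]

def action_profile(waveform, split):
    """Profile features of the push-up part followed by the pull-down part."""
    w = np.asarray(waveform, dtype=float)
    return profile(w[:split]) + profile(w[split:])

pca_part = np.zeros(20)                        # placeholder for the 20 PCA features
features = np.concatenate([pca_part, action_profile([3.0, 4.0, 0.0, 2.0], split=2)])
```

Concatenating the two groups gives the 26-dimensional vector described in the text.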

Classification
Although classifiers based on neural networks can achieve high accuracy, they need a large number of samples, and collecting samples takes a lot of time. Once the categories change, the network needs to be retrained, and the training time is long. We have tested the SVM, K-Nearest Neighbor (KNN), Gradient Boosting Decision Tree (GBDT), Random Forest (RF) and Convolutional Neural Network (CNN) algorithms. Finally, we select an SVM model based on the Radial Basis Function (RBF) kernel to evaluate the SDP action quality. The reason is that it needs far fewer samples than a neural network to perform well, and the model trains quickly. The whole model does not need to be retrained when categories are added or deleted, which makes it more flexible in different action quality recognition scenarios. SVM recognises action quality by finding the decision surface with the largest margin between the samples of different categories. A single SVM can distinguish two categories, but we need to distinguish nine actions to evaluate SDP action quality. We therefore establish $k(k-1)/2$ SVM classifiers for $k$ categories in the One-Versus-One (OVO) way. In the process of classification, the category that receives the most votes across all classifiers is the predicted category of a sample. Compared with One-Versus-Rest (OVR), OVO does not need to retrain all classifiers when categories are added, and it also avoids the problem that negative samples greatly outnumber positive samples. The performance of the SVM models depends on their parameters. To find the optimal parameters, we use a grid search, scored by cross-validation, to traverse the main parameters.
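A minimal sketch of this setup with scikit-learn is shown below. The synthetic data, the parameter grid values and the four folds are assumptions for illustration only; they are not the paper's data or tuned settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for 9 action classes with 26 features each (not real CSI data).
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 9, 20, 26
centres = rng.normal(scale=5.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(per_class, n_features)) for c in centres])
y = np.repeat(np.arange(n_classes), per_class)

# RBF-kernel SVM with pairwise (one-versus-one) decision values,
# tuned by a grid search scored with cross-validation.
search = GridSearchCV(
    SVC(kernel="rbf", decision_function_shape="ovo"),
    param_grid={"C": [1, 10], "gamma": ["scale", 0.01]},  # assumed grid
    cv=4,
)
search.fit(X, y)

# One decision value per pair of classes: 9 * 8 / 2 pairwise classifiers.
n_pairwise = search.best_estimator_.decision_function(X[:1]).shape[1]
```

For nine categories the pairwise scheme produces 36 binary decision values per sample, matching $k(k-1)/2$ with $k = 9$.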

Implementation
In WiSDP, we use the TL-WDR6300 router as the AP. The router has four antennas, two of which work in the 2.4 GHz frequency band while the other two work in the 5 GHz frequency band. We use 5 GHz in WiSDP, which reduces the interference between signals and gives a higher receiving packet rate. We use a Lenovo G480 laptop as the DP, which is equipped with an Intel i5-3210M processor, 8 GB of DDR3 1600 MHz memory and an Intel 5300 NIC. We modify the BIOS white list of the laptop to ensure compatibility between the devices.

Evaluation setup
In our experiments, the AP and DP are installed at a height of 1 m and separated by 2.8 m. The height of the backless stool is 0.45 m. Four male volunteers, who are 170-175 cm tall and weigh 57-70 kg, took part in our experiments. In the line-of-sight environment, users sit midway between the AP and DP and repeat each SDP action 60 times. During the experiment, we require that the layout of the room remain unchanged and that only one volunteer be in the room. We build a one-key sampling program based on CSITool to let the Intel 5300 NIC receive the CSI packets. When the program runs, the computer camera begins to record the user's actions for model training. Using the timeline of the video, we can label the CSI data for each SDP action.

SDP detection analysis
To evaluate the accuracy of the SDP detection, we introduce four basic indicators: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), which are based on the real category and the predicted category. The True Positive Rate (TPR), False Positive Rate (FAR) and Accuracy can be expressed as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

We invite four volunteers to carry out experiments involving five indoor actions (motionless, sitting, drinking, walking and SDP). For motionless, sitting, drinking and walking, we require each user to repeat each action 90 times. For SDP, we collect 10 samples per user for each action in Table 1. Figure 6 is the grey-scale confusion matrix for the five indoor actions. We use cross-validation to obtain average values. From the confusion matrix, we can see that WiSDP, using action energy, can distinguish the motionless scenario from the other four scenarios without any errors. We use the AdaBoost classifier to distinguish SDP from sitting, drinking and walking with a TP rate of 99%. We can infer from the confusion matrix that the FAR of WiSDP is 0. In conclusion, the average detection TPR over the five indoor scenarios is 98.6%.
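For reference, the three rates can be computed from the four indicators using their standard definitions. The counts below are made-up numbers for illustration and are not the paper's results.

```python
def detection_rates(tp, fp, tn, fn):
    """True positive rate, false positive rate and accuracy from raw counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, accuracy

# Hypothetical counts, chosen only to exercise the formulas.
tpr, fpr, acc = detection_rates(tp=99, fp=0, tn=360, fn=1)
```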

Impact of the improved sliding window algorithm on recognition accuracy
The improved sliding window algorithm can reduce redundant data. To verify that the improved sliding window algorithm can improve the recognition accuracy, we compare it with the general sliding window algorithm in WiSDP. To ensure the reliability of the experimental results, we use a data set collected from different volunteers to evaluate the algorithms. The results can be seen in Figure 7. Comparing the recognition accuracy of the two algorithms, we find that the improved sliding window algorithm improves the recognition accuracy in WiSDP: the recognition accuracy for user1, user3 and user4 is clearly improved, and the recognition accuracy over all users is improved by 0.53% on average. In theory, our algorithm can only improve the recognition accuracy slightly, because each person has different behavioural characteristics, which sometimes makes the improvement less obvious. For example, our algorithm does not achieve a good result for user2. Although the recognition accuracy for this user is not improved, it is not reduced either. In general, the improved sliding window algorithm helps to improve the recognition accuracy.
Figure 7. The recognition accuracy of two sliding window algorithms.

Impact of classifier on recognition accuracy
We compare the performance of various machine learning algorithms on our data set, including SVM, KNN, GBDT, RF and CNN. For each SDP in Table 1, we collect 240 samples from four volunteers. We randomly extract 20% of the entire data set as a test data set, and then use a fourfold cross-validation on the remaining data set to adjust and optimise model parameters. In Table 3, we present the main parameters of the algorithms in Table 2 in our experimental environments. We can know from Table 2 that SVM is significantly better than other algorithms with a recognition rate of 95.11%. The specific role of these parameters can be referred to the scikit-learn machine learning library [37]. Training time and running time of KNN, GBDT and RF are close to SVM, but recognition accuracy of SVM is much higher than others. Although the training time and running time of GBDT are the least, the training time and recognition accuracy of GBDT are greatly affected by the learning_rate parameter. We have tested three CNN-based schemes: 1D-CNN, 2D-CNN and 3D-CNN, and 1D-CNN have better performances in recognition accuracy and training time. The archi-  tecture of 1D-CNN is illustrated in Table 4. Conv1D layers are used to perceive the local information of the CSI waveforms. MaxPooling1D layers are used to reduce dimensions of feature, compress the number of data and parameters, reduce the overfitting, and improve the fault tolerance of the algorithm. Flatten layer is used to convert multi-dimensional inputs into one-dimensional outputs and connect the layer 6 and layer 8. Although CNN and SVM have similar recognition accuracy, the training time and running time of CNN are longer than non-neural network algorithms. In addition, improper network structure will lead to overfitting or underfitting, and network structure needs to be adjusted accordingly to achieve a better recognition accuracy. In summary, CNN is more suitable for large training data sets. 
Acquiring a large number of training samples in a practical application takes a significant amount of time, which hinders the deployment of WiSDP in a variety of scenarios. Compared with CNN, SVM can achieve good recognition accuracy while training fewer parameters.
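The evaluation protocol described above (a 20% hold-out test set, then fourfold cross-validation on the remainder to tune parameters) can be sketched with scikit-learn as follows. This is a minimal sketch, not the authors' code: the feature matrix is synthetic data from `make_classification` standing in for features extracted from CSI, and the feature dimension, parameter grid, stratified split and random seeds are illustrative assumptions rather than the values in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for CSI features: 9 SDP classes, 240 samples each.
# The 30-dimensional feature vector is an assumption for illustration only.
X, y = make_classification(
    n_samples=9 * 240, n_features=30, n_informative=15,
    n_classes=9, n_clusters_per_class=1, random_state=0,
)

# Hold out 20% as the test set. Stratifying by class is an assumption;
# the text only states that the split is random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fourfold cross-validation on the remaining data to tune the SVM.
param_grid = {"C": [1, 10], "gamma": ["scale", 0.01]}  # hypothetical grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=4)
search.fit(X_train, y_train)

# The held-out set is touched only once, after tuning.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 4))
```

Keeping the test set out of the parameter search means the reported accuracy is not biased by the tuning step.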

Robustness analysis
We evaluate the system performance by considering four factors that affect the action recognition result: the receiving packet rate, the number of antennas, the number of samples and the environment.

Impact of receiving packet rate on recognition accuracy
We simulate different receiving packet rates by using different time intervals between packets. We collect 50 samples of each SDP action from four volunteers, using one sending and receiving antenna pair. A sample contains all the CSI packets from the beginning to the end of an action. We use fourfold cross-validation to obtain more convincing results. Figure 8(a) shows that increasing the receiving packet rate does not necessarily improve the recognition accuracy of SDP actions. Between 20 and 200 packets/s, the recognition accuracy differs little; below 20 packets/s, it declines markedly. An excessively high receiving packet rate increases the data processing time, while an excessively low rate does not achieve a good recognition rate. Based on the recognition rates in the four users' experiments, we therefore take 20 packets/s as the optimal receiving packet rate.
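One way to emulate lower receiving packet rates from a single high-rate capture is to keep only packets that are at least a target interval apart. The sketch below assumes each CSI packet carries a timestamp in seconds; the function name `downsample_by_interval` and the 200 packets/s capture rate are hypothetical and are not details given in the text.

```python
def downsample_by_interval(timestamps, target_rate):
    """Return indices of packets kept so that consecutive kept packets
    are at least 1/target_rate seconds apart."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    min_gap = 1.0 / target_rate
    kept = []
    next_time = None
    for i, t in enumerate(timestamps):
        # Small tolerance absorbs floating-point rounding in the timestamps.
        if next_time is None or t >= next_time - 1e-9:
            kept.append(i)
            next_time = t + min_gap
    return kept


# Hypothetical capture: 2 seconds at 200 packets/s.
timestamps = [i / 200 for i in range(400)]
for rate in (200, 100, 20):
    print(rate, len(downsample_by_interval(timestamps, rate)))
```

The returned indices can then be used to select the corresponding CSI packets of each sample before feature extraction.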

Impact of the antenna number on recognition accuracy
Based on the preceding packet-rate experiments, we set the receiving packet rate to 20 packets/s. We have two transmitting antennas and three receiving antennas, giving six sending and receiving antenna pairs. We increase the number of antenna pairs one at a time and use cross-validation to test the recognition accuracy. As depicted in Figure 8(b), increasing the number of antenna pairs improves the recognition accuracy. Compared with using only one antenna pair, the recognition accuracy improves by 5.38% and 8.37% on average when using two and six antenna pairs, respectively. The more sending and receiving antenna pairs are in use, the more CSI data we obtain. The larger amount of CSI data lets us analyse the details of the SDP action more comprehensively and achieve higher recognition accuracy.
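With two transmitting and three receiving antennas, the six antenna pairs and the incrementally growing feature sets of such an experiment can be enumerated as below. This is only a sketch: the per-pair feature vectors are placeholders, the helper `features_for_pairs` is hypothetical, and the order in which pairs are added is an assumption, since the text does not state it.

```python
from itertools import product

N_TX, N_RX = 2, 3
# All (transmit, receive) antenna index pairs: 2 x 3 = 6 pairs.
antenna_pairs = list(product(range(N_TX), range(N_RX)))


def features_for_pairs(per_pair_features, pairs):
    """Concatenate the feature vectors of the selected antenna pairs."""
    combined = []
    for pair in pairs:
        combined.extend(per_pair_features[pair])
    return combined


# Placeholder: three made-up feature values per pair for one sample.
sample = {pair: [float(10 * pair[0] + pair[1])] * 3 for pair in antenna_pairs}

# Add one antenna pair at a time; the feature vector grows accordingly.
for k in range(1, len(antenna_pairs) + 1):
    vector = features_for_pairs(sample, antenna_pairs[:k])
    print(k, len(vector))
```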

Impact of sample number on recognition accuracy
In this experiment, the receiving packet rate is 20 packets/s and the number of antenna pairs is 6. We collect 60 samples for each SDP action to train the model, and a number of samples are randomly selected from each action category to form a new data set. We conduct three groups of experiments to evaluate the relationship between sample number and recognition accuracy. We rebuild the SVM model with 40, 50 and 60 samples for each action. Figure 8(c) compares the leave-one-out cross-validation results for the nine SDP actions. The recognition accuracy increases with the number of samples. With 40 samples per action, the recognition accuracy is 93.67%. When the number of samples for each SDP action increases from 40 to 50, the average accuracy over the four users rises to 95.07%. When it increases further to 60, the recognition accuracy rises by another 1.11% on average. When the sample number is small, the SVM model may overfit, which leads to a low recognition rate. As the sample number increases, the generalisation performance of our model improves and so does the recognition accuracy; SDP3, SDP4, SDP5 and SDP7 are more sensitive to the number of training samples than the other actions.
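The subsampling and leave-one-out procedure can be sketched in plain Python as follows. The helper names `subsample_per_class` and `leave_one_out`, the labels SDP1 to SDP9 and the random seed are assumptions for illustration; the classifier training inside each fold is omitted.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, n_per_class, seed=0):
    """Randomly pick n_per_class sample indices from every action class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        if len(by_class[label]) < n_per_class:
            raise ValueError(f"class {label!r} has too few samples")
        chosen.extend(rng.sample(by_class[label], n_per_class))
    return sorted(chosen)


def leave_one_out(indices):
    """Yield (train_indices, test_index), holding out one sample each time."""
    for i, test_idx in enumerate(indices):
        yield indices[:i] + indices[i + 1:], test_idx


# Hypothetical label list: nine SDP actions with 60 samples each.
labels = [f"SDP{k}" for k in range(1, 10) for _ in range(60)]
for n in (40, 50, 60):
    subset = subsample_per_class(labels, n)
    n_folds = sum(1 for _ in leave_one_out(subset))
    print(n, len(subset), n_folds)
```

Leave-one-out produces one fold per selected sample, so the cost of the evaluation grows with the subset size.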

Impact of the environment on recognition TPR
We conduct our experiments in two different scenarios to assess the impact of the environment on the recognition TPR. With the same experimental parameters, we carry out the experiments in a student dormitory and a family living room. Figure 9 shows the room layouts of the two scenarios. Figure 9(a) is an ordinary student dormitory with beds, tables, chairs and some everyday furnishings. Figure 9(b) is an ordinary family living room with a TV set, a dining table and a sofa.
In the experiments, we do not need to move any furniture; we only need to keep irrelevant people out of the experimental environment. In each scenario, we collect 60 CSI samples for each action from four volunteers. The receiving packet rate is 20 packets/s and six antenna pairs are used for collecting data. Figure 10 displays the recognition results for the nine SDP actions in the two scenarios. The confusion matrices show that WiSDP recognises the nine SDP actions with an average TPR of 94.66% in the student dormitory and 95.11% in the family living room. The TPR of SDP6, SDP7 and SDP8 is lower than the average, and misclassification among them is the main reason for the lower recognition accuracy. The other six SDP actions have good TPR in both scenarios. On the whole, WiSDP works well in different room layouts.
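For reference, the per-action TPR and the average TPR can be read off a confusion matrix as sketched below. The sketch assumes that rows are true classes and columns are predicted classes; the 3-class matrix is made up for illustration and is not the data in Figure 10.

```python
def per_class_tpr(confusion):
    """TPR of class i = correctly recognised samples of class i
    divided by all true samples of class i (rows are true classes)."""
    tprs = []
    for i, row in enumerate(confusion):
        total = sum(row)
        tprs.append(row[i] / total if total else 0.0)
    return tprs


# Made-up 3-class confusion matrix (not the paper's results).
confusion = [
    [58, 1, 1],
    [2, 55, 3],
    [0, 4, 56],
]
tprs = per_class_tpr(confusion)
average_tpr = sum(tprs) / len(tprs)
print([round(t, 4) for t in tprs], round(average_tpr, 4))
```

Off-diagonal entries in a row show which other actions a given action is confused with, which is how misclassification among a group of similar actions becomes visible.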

DISCUSSION
Although WiSDP achieves good recognition accuracy for the quality of SDP actions, our system has three limitations:
1. Unrelated users walking in the room reduce recognition accuracy. CSI is easily disturbed by the environment, and the movement of other users introduces considerable interference, so that the system cannot effectively recognise the action quality.
2. WiSDP currently works only within the line-of-sight area. In the line-of-sight area, CSI is more sensitive to human movement, which helps to capture subtle differences.
3. WiSDP can identify only a single error in an SDP action. When there is more than one error in an SDP action, the system cannot recognise all the errors, because the errors are superimposed on each other and the resulting characteristics differ from those of the original errors.

CONCLUSION
We propose a fine-grained action quality recognition system, taking the SDP action as an example. The system uses two cheap consumer devices, a laptop and a router, to collect the CSI of Wi-Fi signals. We use PCA and a Butterworth low-pass filter to denoise the collected CSI. We propose the concept of action energy to estimate the intensity of the waveform change, which can be used to distinguish some actions by thresholds. We propose an action energy detection algorithm based on a sliding window of adaptive size to obtain CSI stream segments that contain only the CSI packets related to SDP actions. We test different algorithms and choose SVM, which recognises action quality with high accuracy from a small number of samples. The experimental results show that WiSDP achieves an average TPR of 94.66% and 95.11% in two different scenarios.