Vision skeleton trajectory based motion assessment system for healthcare rehabilitation

Abstract: Most existing motion assessment systems rely on contact, infrared, or depth sensors, and few works provide a solution based on a digital camera. To address this gap, the authors propose a new motion assessment system based on a digital camera. In this work, motion assessment is treated as a pattern regression problem over skeleton joint trajectories. Firstly, the system uses the camera to capture image sequences, and a pose estimation method is used to obtain the body skeleton from each image. Secondly, because each person moves at a different frequency, the image sequences differ in length, and so do the joint trajectories. The Fourier transform is applied to normalise each trajectory, and its coefficients are used as the joint trajectory feature. Finally, a regression model is built to assess the motion quality. Experimental results and discussion on action video data verify the effectiveness of the system.


Introduction
With social and economic development, people's attention has shifted from basic needs such as food and clothing to health. Exercise rehabilitation and healthcare systems are needed not only by professional athletes, but also by the general public after injuries. Motion rehabilitation systems provide systematic and standardised performance indices for exercise rehabilitation and healthcare. They can monitor and manage patients promptly and effectively, and reduce the risk to patients during motion rehabilitation. Motion rehabilitation assessment is therefore very important for exercise rehabilitation and healthcare systems. Assessment systems can be divided into two types: contact sensor-based systems and non-contact sensor-based systems. Contact sensors are the more popular way to estimate the human motion state [1]; they extract human biological data such as arm movement speed, frequency, and strength. The motion rehabilitation system can use these data to assess exercise status and evaluate rehabilitation status, providing users with a more accurate and convenient service.
However, a contact sensor affects the user's movement: it can slow the movement down, lower its amplitude, and increase its resistance, which is especially problematic in a motion evaluation system. With the development of computer vision technology, more and more high-performing approaches have been extended to various fields, including human pose estimation, action recognition, and object tracking. Deep learning-based methods in particular achieve very good results in all of these fields. How to build a non-intrusive motion evaluation system based on a vision camera has attracted increasing attention in recent years; the key is to use visual methods, instead of contact sensors, to acquire the biological information of the human body.
To address this problem, some techniques have been proposed to analyse motion status from videos [2][3][4], including spatio-temporal interest point (STIP) based methods and skeleton based methods. STIP is a representation of shape and motion, mainly including Hu invariant moments based on motion history and energy, iterative filtering, frame grouping, and Poisson significance. STIP based methods regard human motion as a set of key points and encode STIP features to model human motion.
Other methods represent human action with the human skeleton, using a human pose estimation algorithm to obtain it [5][6][7]. Action analysis based on skeleton features mainly uses the changes of the body's joints to describe the action, focusing on the variation of joint position and appearance. However, skeleton features depend on the quality of pose estimation and are strongly affected by background clutter and occlusion. In addition, skeleton data are usually accompanied by depth information, and many skeleton based methods represent human action by the change of both skeleton and depth information. However, only some public datasets provide depth information, and depth sensors are sensitive to the environment and costly.
Building an action analysis system based on a vision camera is very important for the practical application of these methods. In this paper, we design a motion assessment system for healthcare rehabilitation based on a digital camera. To evaluate the status of healthcare rehabilitation, the tester is required to complete a specified action. A novel motion assessment method using a vision camera is proposed in this system. The assessment score corresponds to the status of healthcare rehabilitation. The framework of the proposed method is shown in Fig. 1.
Unlike other methods, which use a depth camera to capture the human skeleton, we use only a vision camera and extract the human skeleton with deep learning methods. Firstly, a human pose estimation algorithm is used to detect the skeleton joints in real time. Although an assessment system that uses only a camera sensor is more convenient to deploy, the problem its algorithm must solve is more complicated. The scene is similar to an open world, and it is difficult to control the posture and angle of the people being measured. Because of differences in their height and posture, and in the viewpoint and distance of the camera, the skeleton size of the measured people varies. This is defined as the skeleton scale change in this work. Some examples are shown in Fig. 2.
On the other hand, different people move differently when performing the same action, so their joint trajectories differ in length, range, and frequency. To overcome the skeleton scale change, the proposed method normalises the joint trajectory feature by the relative displacement between each joint and a reference joint, instead of using absolute skeleton coordinates. The relative joint trajectory is modelled by the Fourier transform to eliminate the influence of motion differences. Finally, a regression model is built to assess the motion status based on the feature distance between the input video and a standard motion video.
J. Eng., 2020, Vol. 2020 Iss. 9, pp. 805-808. This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)
The contributions of our work are twofold:
(i) We propose a systematic framework for motion assessment based on a vision camera.
(ii) We propose a pose estimation-based method to evaluate motion quality for healthcare rehabilitation, and discuss its experimental results.
The remainder of this paper is organised as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents and discusses our experimental results. Finally, Section 5 concludes the paper.

Related work
A high-performance motion assessment system can accurately judge the recovery of a patient's body, saving a lot of time for both doctor and patient. Recently, many approaches have appeared that capture human motion information from different sensors to assess motion quality, and together they have advanced motion assessment systems. This section summarises the related work on motion analysis with a vision camera. First, vision based human pose estimation methods are introduced. Then, motion recognition and assessment methods are reviewed.

Vision based human pose estimation methods
The field of computer vision has advanced rapidly over the past few years, especially in human pose estimation, which is the problem of estimating the body skeleton. A common approach is to use a depth sensor [8]. Marin-Jimenez et al. [9] proposed 3D human pose estimation from depth data, which takes a depth map containing a person and a set of predefined 3D prototype poses and returns the 3D position of the person's body joints. Gilbert et al. [10] proposed an approach that fuses multi-viewpoint video with inertial measurement unit sensor data (without optical markers, a complex hardware setup, or a full body model) to accurately estimate 3D human pose.
Because depth sensors are costly and have low accuracy in outdoor environments, many approaches estimate the human skeleton from still images instead [11]. Felzenszwalb et al. [12] proposed a mixture of multiscale deformable part models to detect and locate objects in images. These objects vary not only in brightness and perspective, but also in deformation. Ukita and Uematsu [13] used part-segment features to estimate an articulated pose in still images. With the advance of deep learning in computer vision, many effective pose and skeleton estimation algorithms based on deep learning have been proposed. Cao et al. [7] proposed the OpenPose algorithm to estimate multi-person 2D pose in real time. Compared with existing methods, its biggest advantage is that the detection speed is not sensitive to the number of people, which greatly improves speed while maintaining detection accuracy. In [14], body part proposals are first detected with ResNet and then assigned to different people based on association and spatial information. A body part detector is applied, and an incremental optimisation strategy is used to estimate the human pose.

Vision based motion recognition and assessment
Human motion recognition is a hot research topic in computer vision and has been widely used in many applications [2,15]. Especially with the advancement of vision-based pose estimation [16], skeleton sequences have become a key feature for motion recognition. Skeleton based motion recognition methods can be divided into two categories: hand-crafted feature-based methods and deep learning-based methods.
Hand-crafted feature methods: Keceli and Can [17] extracted human action features from the angle and displacement information of skeleton joint points captured by a Kinect sensor. Yang and Tian [18] proposed a human action representation named EigenJoints, which uses an accumulative motion energy function to select video frames and more informative joint points to model the action. However, hand-crafted features have limitations: they work well only for specific datasets and application scenarios, and they can only extract low-level semantic information.
Deep learning methods: Deep learning has achieved good results in human action recognition, target detection, and other areas of computer vision. Many researchers apply it to motion rehabilitation and evaluation, where it extracts not only more diverse features but also high-level semantic information. Yan et al. [19] introduced ST-GCN (spatio-temporal graph convolutional network), which builds a spatio-temporal graph of the skeleton and uses graph convolution to learn features of the graph data, to solve action recognition based on human skeleton joints. Zhu et al. [20] integrated the calculation of the optical flow sequence into the network and merged the original image with the optical flow sequence to represent human action. All of these works show excellent performance in their respective fields.
Several works have applied these methods to motion assessment [21]. Müller et al. [3] presented a motion capturing system based on the Kinect sensor that delivers accuracy in gait parameters comparable to a gold standard motion capturing system. Zia et al. [4] provided video and accelerometer-based motion analysis for automated surgical skills assessment. Unlike these works, which use multiple sensors or a depth sensor that is costly and has low accuracy, we focus on building an effective motion assessment system on a digital camera.

Human skeleton estimation and representation
Compared with RGB information, skeleton information has clear and simple features and is not easily affected by appearance. We therefore locate the body using skeletal joints. There are two ways to obtain a human skeleton: by joint estimation from an RGB image, or directly from a depth camera (such as Kinect). In our system, the OpenPose algorithm [7] is used to capture the human skeleton from an RGB camera. It detects multi-person skeletons in real time by using Part Affinity Fields. OpenPose provides 18 joints for each human skeleton, as shown in Fig. 3. Based on OpenPose, the input video can be regarded as the skeleton sequence $V = \{S_1, S_2, \dots, S_n\}$, where $n$ is the number of video frames and $S_i$ is the human skeleton in the $i$th frame. Each skeleton is $S_i = \{S_{i1}, S_{i2}, \dots, S_{im}\}$, where $S_{ij}$ is the $j$th joint of the skeleton and $m$ is the number of joints. In OpenPose, $m$ is set to 18. Each skeleton joint $S_{ij}$ is represented by two-dimensional coordinates in image space, $S_{ij} = (P^x_{ij}, P^y_{ij})$.
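As a rough illustration of this representation, the skeleton sequence of one video can be held in a single array of shape (n, m, 2). The sketch below assumes the per-frame joint coordinates have already been obtained from a pose estimator; the function name `to_skeleton_sequence` and the input layout are hypothetical and not part of OpenPose.

```python
import numpy as np

NUM_JOINTS = 18  # m in the text: joints per OpenPose skeleton


def to_skeleton_sequence(frames):
    """Stack per-frame joint lists into an array V of shape (n, m, 2).

    `frames` is a hypothetical input format: one list of 18 (x, y)
    pairs per video frame, for a single person.
    """
    V = np.asarray(frames, dtype=float)
    if V.ndim != 3 or V.shape[1:] != (NUM_JOINTS, 2):
        raise ValueError(f"expected shape (n, {NUM_JOINTS}, 2), got {V.shape}")
    return V


# Toy example: 3 frames of synthetic coordinates, not real detections.
rng = np.random.default_rng(0)
frames = rng.uniform(0, 480, size=(3, NUM_JOINTS, 2)).tolist()
V = to_skeleton_sequence(frames)
print(V.shape)   # (n, m, 2)
print(V[0, 1])   # (P^x, P^y) of the joint at array index 1 in the first frame
```

Keeping the whole sequence in one array makes the later per-joint trajectory operations simple slices along the frame axis.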

Skeleton feature
Because videos differ in the number of frames, the motion scale, and the skeleton size of each person, the joint feature is computed as the relative distance to the neck joint (joint 1, as defined in Fig. 3). For the skeleton in each frame, we extract the relative distance $d_{ij}$, the Euclidean distance between the $j$th joint and joint 1 in the $i$th frame. Then, for each joint, the distance feature is the sequence $d_j = \{d_{ij}\}, i = 1, 2, \dots, n$, where $i$ is the index of the video frame. The Fourier transform is applied to each $d_j$ to normalise the different video lengths; it also extracts the frequency information of the change in relative distance. The first ten coefficients of the Fourier transform are taken as the joint feature $td_j$ of the video. Finally, each video is represented by the Fourier coefficient vectors $D = \{td_2, td_3, \dots, td_m\}$, each containing $L$ coefficients, so the feature dimension is $(m - 1) \times L$.
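The feature extraction described above might be sketched as follows. Several details are assumptions made only for illustration: the neck is taken to be array index 1, the magnitude of each complex Fourier coefficient is used so that the feature is real-valued, and the coefficients are divided by the number of frames. The paper does not specify these choices, and `fourier_joint_features` and `toy_video` are hypothetical helper names.

```python
import numpy as np


def fourier_joint_features(V, ref_joint=1, num_coeffs=10):
    """Map a skeleton sequence V of shape (n, m, 2) to a fixed-length vector.

    Returns a vector of length (m - 1) * num_coeffs, independent of n.
    """
    V = np.asarray(V, dtype=float)
    n = V.shape[0]
    ref = V[:, ref_joint:ref_joint + 1, :]        # reference joint, shape (n, 1, 2)
    d = np.linalg.norm(V - ref, axis=2)           # d[i, j]: distance to reference, (n, m)
    d = np.delete(d, ref_joint, axis=1)           # drop the reference joint itself
    # FFT along the frame axis; very short videos are zero-padded.
    spectrum = np.fft.fft(d, n=max(n, num_coeffs), axis=0)
    # Assumption: keep magnitudes of the first coefficients, scaled by n.
    td = np.abs(spectrum[:num_coeffs]) / n        # (num_coeffs, m - 1)
    return td.T.reshape(-1)                       # one block of coefficients per joint


def toy_video(n_frames, seed):
    """Synthetic periodic motion, used only to exercise the function."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(100, 300, size=(1, 18, 2))
    t = np.linspace(0, 2 * np.pi, n_frames)[:, None, None]
    return base + 20 * np.sin(t) * rng.uniform(0, 1, size=(1, 18, 2))


feat_a = fourier_joint_features(toy_video(40, seed=0))
feat_b = fourier_joint_features(toy_video(65, seed=1))
print(feat_a.shape, feat_b.shape)  # same length despite different frame counts
```

Two videos with 40 and 65 frames yield feature vectors of the same length, which is the property the regression model relies on.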

Motion assessment by feature distance regression
With the skeleton information extracted and the differences between videos normalised, the processed features are passed to a model that evaluates the action. To evaluate the motion status, support vector regression (SVR) is used to compute the score of the input video. To make the regression model convincing and standardised, two main pieces of work were done. First, a standard dataset for action assessment was built; the golf videos were collected from the web and from a sports dataset. Second, the motion score of each video was labelled by professionals.
Suppose the training set contains labelled training videos $D = \{(D_1, y_1), (D_2, y_2), \dots, (D_l, y_l)\}$, for which the motion status score of each video is given. Here $y_i$ is the score, $D_i \in \mathbb{R}^k$ with $k = (m - 1) \times L$ is the feature vector of the $i$th video, and $l$ is the size of the training set. The regression model $G(x)$ is trained to find a real-valued function $f(D)$ of the feature vector. Let $\phi(D)$ be a feature transformation of $D$, $\phi : \mathbb{R}^k \to \mathbb{R}^K$. Consider the case in which $f(D)$ is linear in the transformed space, so that the target score $y$ can be modelled as
$$y = f(D) + \varepsilon = w^{\top}\phi(D) + \varepsilon$$
where $w \in \mathbb{R}^K$ and $\varepsilon$ is the observation noise.
In the proposed method, SVR is applied to train the model and predict the score of a test video. The training loss measures the error between the regression score $y_p$ and the ground truth $y_g$, and in SVR its definition includes a relaxation variable $b$. The goal of training is to find the parameters that minimise this loss, which can be solved with Lagrange multipliers.
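A minimal sketch of this regression step, assuming scikit-learn is available, is given below. The feature matrix and scores are random placeholders rather than the paper's data, and the hyperparameters `C` and `epsilon` are simply scikit-learn's defaults, not values reported by the authors. Note that scikit-learn's `SVR` minimises the standard epsilon-insensitive loss, whose notation may differ from the loss referred to above.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder data: 60 training videos, each with a 170-dimensional feature
# (17 non-reference joints x 10 Fourier coefficients) and a score in [0, 1].
X_train = rng.normal(size=(60, 170))
y_train = rng.uniform(0, 1, size=60)

# RBF-kernel SVR; C and epsilon are library defaults (assumed, not from the paper).
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict scores for 5 unseen (placeholder) test videos.
X_test = rng.normal(size=(5, 170))
y_pred = model.predict(X_test)
print(y_pred)
```

The kernel plays the role of the feature transformation $\phi$: the model stays linear in the transformed space while being non-linear in the original features.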

Experimental results and discussion
In this section, we evaluate the effectiveness of the proposed approach. We first introduce the evaluation criteria, which are commonly used for regression models. Then we present the results of our experiments.

Evaluation criteria
To evaluate the accuracy of motion assessment, the mean squared error (MSE) and mean absolute error (MAE) are used as the evaluation criteria in all experiments:
$$\mathrm{MSE} = \frac{1}{T}\sum_{i=1}^{T}(\hat{y}_i - y_i)^2, \qquad \mathrm{MAE} = \frac{1}{T}\sum_{i=1}^{T}\lvert \hat{y}_i - y_i \rvert$$
where $\hat{y}_i$ is the score estimated by the regression model for the $i$th video, $y_i$ is the ground truth of the $i$th video, and $T$ is the number of test videos. Both are reported because a single criterion is not sufficient for comparison: MSE is more convenient, while MAE is more robust to outliers.
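The two criteria can be computed directly from the predicted and ground-truth scores. The short sketch below uses four made-up scores purely to show the arithmetic; the function names are illustrative.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over T test videos."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(err ** 2))


def mae(y_true, y_pred):
    """Mean absolute error over T test videos."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))


# Made-up scores for illustration only.
y_true = [0.9, 0.5, 0.2, 0.7]
y_pred = [0.8, 0.6, 0.4, 0.7]

print(mse(y_true, y_pred))  # approximately 0.015
print(mae(y_true, y_pred))  # approximately 0.1
```

The single larger error (0.2) contributes 0.04 of the 0.06 total squared error but only half of the total absolute error, which illustrates why MAE is the less outlier-sensitive of the two.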
In this work, an action assessment dataset was built. The dataset contains 320 videos in total. Each video shows one person performing one action, and has an assessment score between 0 and 1 labelled by professionals. A higher score means the action was completed more fully, which indicates better rehabilitation.
In the experiments, we use a K-fold cross-validation strategy with K set to 10. The dataset is divided into ten portions of 32 videos each. In turn, each portion is used as the test data while the other nine are used as the training data. The average of the ten resulting errors is taken as the total error.
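The cross-validation protocol might look like the following sketch, assuming scikit-learn. The features and scores are random placeholders, so the printed errors carry no meaning; shuffling before splitting is also an assumption, since the paper does not say how the ten portions were formed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder dataset with the same size as in the paper: 320 videos.
X = rng.normal(size=(320, 170))
y = rng.uniform(0, 1, size=320)

kf = KFold(n_splits=10, shuffle=True, random_state=0)  # shuffling is assumed
fold_mse, fold_mae, fold_sizes = [], [], []

for train_idx, test_idx in kf.split(X):
    # Train on nine portions, test on the held-out portion.
    model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
    err = model.predict(X[test_idx]) - y[test_idx]
    fold_mse.append(np.mean(err ** 2))
    fold_mae.append(np.mean(np.abs(err)))
    fold_sizes.append(len(test_idx))

# The total error is the average over the ten folds.
print("MSE:", np.mean(fold_mse), "MAE:", np.mean(fold_mae))
```

Every video is used for testing exactly once, so the averaged error reflects the whole dataset rather than one particular split.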

Experimental result and discussion
Firstly, the effectiveness of the relative joint distance Fourier feature for motion assessment is evaluated. SVR with an RBF kernel is used to estimate the score. Because the assessment performance is influenced by the dimension of the skeleton feature, this experiment compares the assessment results for different feature dimensions. The experimental results are shown in Fig. 4.
In Fig. 4, the dimension of the skeleton feature varies from 8 to 34. Lower MSE and MAE values indicate better performance. The best result is MAE = 0.2038 and MSE = 0.0625, obtained with a feature dimension of 8. The error of each testing sample is shown in Fig. 5.
To further evaluate the performance, we divided the data into three categories by professional score: great level (0.8-1), good level (0.5-0.7), and fail level (0.1-0.4). For great level data, the maximum error of our method is 0.3310, the minimum error is 0.0204, and the average MSE is 0.145. For good level data, the maximum and minimum errors are 0.0542 and 0.0001, respectively, and the average MSE is 0.013. For fail level data, the maximum and minimum errors are 0.2376 and 0.0001, respectively, and the average MSE is 0.0461. Compared with the professionals' scores, the error of the scores predicted by the proposed method is small. Whatever the performer's level (great, good, or fail), our method evaluates the action with a low error.
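A per-level breakdown of this kind could be computed as in the sketch below. It assumes that the professional scores take values in steps of 0.1, so the thresholds 0.8 and 0.5 separate the three levels, and that "error" means the absolute difference between predicted and labelled score; neither detail is stated explicitly in the text. The scores are made up.

```python
import numpy as np


def level_of(score):
    """Assign a level from the professional score (thresholds assumed)."""
    if score >= 0.8:
        return "great"
    if score >= 0.5:
        return "good"
    return "fail"


def per_level_errors(y_true, y_pred):
    """Maximum, minimum, and mean squared error within each level."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    levels = np.array([level_of(s) for s in y_true])
    report = {}
    for name in ("great", "good", "fail"):
        mask = levels == name
        if mask.any():
            e = abs_err[mask]
            report[name] = {
                "max": float(e.max()),
                "min": float(e.min()),
                "mse": float(np.mean(e ** 2)),
            }
    return report


# Made-up scores for illustration only.
y_true = [0.9, 1.0, 0.6, 0.5, 0.3, 0.1]
y_pred = [0.7, 0.9, 0.6, 0.55, 0.2, 0.3]
for name, stats in per_level_errors(y_true, y_pred).items():
    print(name, stats)
```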

Conclusions
In this paper, we proposed a new method to evaluate human motion for healthcare rehabilitation. The motion assessment problem is treated as a motion status regression problem, which is addressed with skeleton based motion features and SVR. To verify the performance of the proposed method, an action assessment dataset was built for the assessment of healthcare rehabilitation. The experimental results showed that the proposed method is effective.