Energy expenditure estimation using visual and inertial sensors

Deriving a person's energy expenditure accurately forms the foundation for tracking physical activity levels across many health and lifestyle monitoring tasks. In this study, the authors present a method for estimating calorific expenditure from combined visual and accelerometer sensors by way of an RGB-Depth camera and a wearable inertial sensor. The proposed individual-independent framework fuses information from both modalities, leading to estimates that improve on the accuracy of single-modality and manual metabolic equivalents of task (MET) lookup-table-based methods. For evaluation, the authors introduce a new dataset called SPHERE_RGBD + Inertial_calorie, for which visual and inertial data are obtained simultaneously with indirect calorimetry ground truth measurements based on gas exchange. Experiments show that the fusion of visual and inertial data reduces the estimation error by 8% and 18% compared with the use of the visual sensor only and the inertial sensor only, respectively, and by 33% compared with a MET-based approach. The authors conclude from their results that the proposed approach is suitable for home monitoring in a controlled environment.


Introduction
The term 'energy expenditure' refers to a human's calorific uptake over time, which is one commonly used single metric to quantify physical activity levels. It is an important determinant in understanding the development of chronic diseases, such as obesity and diabetes. Current evidence-based guidelines [1] indicate that people who are regularly physically active have a 20-40% lower risk of developing conditions such as cardiovascular disease and type 2 diabetes than those who are inactive, and suggest that adults should accumulate at least 150 min of moderate-intensity physical activity each week or 75 min of vigorous activity, or a combination of the two. Most research into estimating and understanding calorific expenditure focuses on coarse energy totals over longer time segments or relates to specific activities only, such as walking and running, which generally occur outside the home.
Yet, very little attention has been paid to how activities of normal daily living in an indoor environment can be quantified and understood in terms of energy expenditure. Traditionally, physical activity levels have been measured in metabolic equivalents of task (MET) [2], where a fixed value is assigned to each activity, e.g. 1 MET corresponds to energy expended at rest. However, the method is highly unreliable because the activities are monitored using self-report approaches, such as questionnaires and occasional clinical check-ups.
There are various approaches that reliably estimate human energy expenditure via respiratory gas analysis, including both direct and indirect methods. Direct calorimetry measures, such as a sealed respiratory chamber [3], produce accurate outputs, but require a laboratory-based environment. Indirect calorimetry, on the other hand, measures energy expenditure based on inspired and expired respiratory gas flows, volumes and concentrations of oxygen consumption and carbon dioxide production. Some of these measurement devices are portable, less invasive and can produce accurate readings. They form the measurement standard for non-stationary scenarios where the person can move freely. Nevertheless, participants in experiments are required to carry gas sensors and wear a breathing mask [4].
Recently, with an increasing number of wearable devices becoming available, approximating energy expenditure using inertial sensors has become a popular monitoring choice due to its low cost, low energy consumption, and data simplicity. Acceleration reflects a relation between motion and energy expenditure, thus tri-axial accelerometers are the most broadly used inertial sensors [5]. Recent studies show that more sensors could be involved in the task: breath rate and chest and arm skin temperature also correlate with energy expenditure via estimation of oxygen consumption [6]. Such data can be obtained by a heart rate monitor and thermometers.
Vision-based systems, as alternative approximative sensors, do not require the wearing of extra devices. In fact, they are already a key part of home entertainment systems today [7], where RGB-Depth sensors allow for a rich and fine-grained analysis of human activity for purposes such as gaming within the field of view. Recent advances in computer vision have now opened up the possibility of integrating these devices seamlessly into home monitoring and assisted living systems [8][9][10].
Simultaneous use of visual and inertial sensors is not common today, but is receiving growing attention in various areas, including action recognition [5], gesture recognition [11], robotics [12], augmented/virtual reality [13], and assistive technology applications, such as fall detection [14], food preparation [15] and general ambient assisted living systems [16]. Although employing multi-modal sensors has the advantage of complementing the shortcomings of individual modalities, wearing a multitude of sensors can cause user acceptance issues.
With this in mind, in this paper we propose a framework for estimating energy expenditure in living environments based on a non-intrusive RGB-Depth visual sensor and two inertial sensors (worn on the wrist and waist), backed up in experiments by simultaneously taken indirect calorimetry measurements based on oxygen consumption and carbon dioxide production for accurate ground truth provision. This is a new application and, to the best of our knowledge, no dataset of a similar setup with reliable and accurate ground truth exists. Thus, in order to quantify the performance of the proposed system, we present a new dataset, SPHERE_RGBD + Inertial_calorie, for calorific expenditure estimation collected within a home environment. The dataset contains 11 common household activities performed over up to 20 sessions, lasting up to 30 min each, in each of which the activities are performed continuously. The experimental setup consists of an RGB-Depth Asus Xtion camera mounted in the corner of a living room, two accelerometer sensors, and a COSMED K4b2 [4] indirect calorimeter for ground truth measurement (see Fig. 1). The SPHERE_RGBD + Inertial_calorie dataset is publicly released (available online at http://doi.org/cc5k).
This paper builds on our recent work in [17][18][19], with significant extensions and improvements. Tao et al. [17] introduced a fusion framework for recognising human daily activity using visual and inertial sensors; that work did not address energy expenditure estimation. Tao et al. [19] proposed a framework for calorific expenditure estimation using only a visual sensor, so no sensor fusion was involved. In [18], we presented a system which allows real-time prediction of activity intensity levels, relying on light-weight bounding box features; this makes the method unable to produce precise calorific expenditure values. In this work, we have improved the feature representation for both inertial and visual sensor data by considering spatial and temporal information at the same time, and investigated both early and late fusion approaches for the data from these sensors. The key contributions of this work are as follows. (i) We propose a first-ever framework for the estimation of calorific expenditure from an RGB-Depth sensor and inertial wearable sensors. No work has been published on visual-inertial energy expenditure estimation, and only very few works offer purely vision-based estimation [18,20,21]. (ii) We improve the feature representation for both inertial and visual features in the previous fusion framework of [17] by extracting rich, multi-level information to give improved estimation accuracy. (iii) We introduce a new dataset, linking more than 10 h of RGB-Depth video data and inertial sensor data to ground truth calorie readings from indirect calorimetry based on gas exchange. (iv) We present a comparative study on the utility of both visual and inertial data when estimating energy expenditure in a living environment. The visual sensor and inertial sensors are evaluated individually first, followed by an evaluation of two fusion approaches.
The rest of the paper is organised as follows. Section 2 presents the background and work related to our study. Section 3 describes the proposed framework for estimating energy expenditure from RGB-Depth and inertial sensors alone, as well as in fusion. The experimental setup and the results are presented in Section 4, followed by a discussion and our conclusions in Section 5.

Inertial sensors
Acceleration, angular velocity, and rotation signals from wearable devices have been used for human action recognition [22], and are popular in healthcare-oriented applications, such as fall detection systems [23] and medication adherence monitoring systems [24]. Inertial sensors can offer particularly low-cost and ubiquitous monitoring solutions for physical activities. Techniques that control computational complexity and power consumption, and improve the unobtrusiveness of wearable computers [25], are applicable to many systems including the one at hand. Here, we first discuss inertial sensor feature extraction methods described in the literature, followed by an outline of existing models of energy expenditure estimation based on them.
Feature representation: Different features extracted from inertial sensor devices have been considered, ranging from raw signal samples to high-level descriptors. Raw time series data from accelerometers is most often provided as triples of scalars, where each scalar corresponds to acceleration in one of three orthogonal spatial dimensions. The same fundamental structure applies to angular velocity and orientation signals along three directions. There is no computational burden associated with feature extraction when the raw data is used. However, raw data may not expose enough discriminative structure to achieve high performance on specific classification tasks. Instead, statistical features may be extracted from each of the three axes, where sensor signal sequences are often partitioned into temporal windows over which features are generated. All features extracted from a temporal window are then concatenated to form a single combined descriptor vector.
Commonly used features include the first- and second-order statistics, namely the mean and variance [26]. In [17], apart from these commonly used features, correlation measures between each axis pair are also extracted. Basic statistical measures are computationally efficient and are able to capture structural patterns in inertial data. The feature descriptor can be further quantised into a number of codewords, as in [27]. Approaches based on deep learning are currently being explored to create more generalised learning methods that generate features directly from the input data and promise to optimise performance further [28].
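As a concrete illustration, the basic statistical window descriptor described above (per-axis mean and variance, plus a correlation measure for each axis pair, in the spirit of [17,26]) might be computed as follows; the function name, window shape, and feature ordering are our own choices, not taken from the paper:

```python
import numpy as np

def window_features(window):
    """Statistical descriptor for one temporal window of tri-axial data.

    window: (T, 3) array of [X, Y, Z] samples. Returns per-axis mean
    and variance plus the correlation of each axis pair, concatenated
    into a single 9-dimensional vector.
    """
    means = window.mean(axis=0)                       # 3 values
    variances = window.var(axis=0)                    # 3 values
    pairs = [(0, 1), (0, 2), (1, 2)]                  # axis pairs X-Y, X-Z, Y-Z
    corrs = np.array([np.corrcoef(window[:, a], window[:, b])[0, 1]
                      for a, b in pairs])             # 3 values
    return np.concatenate([means, variances, corrs])
```

Descriptors from consecutive windows would then be concatenated, as described in the text, to form the combined vector.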
Energy expenditure estimation: The first automatic methods for inertial energy expenditure estimation [29] were count-based estimation systems, fitted as a single regression model to all the data regardless of what activity was being performed. However, systems that map from a single wearable to calorie values struggle to accurately estimate the intensity of physical activity across a range of actions. For example, some actions involving only upper or lower body movements are difficult to recognise via a single wearable device, leading to high estimation error [30]. Different activities may require different models to represent them. Activity-specific (AS) methods split the estimation process into two steps, where activity groups are detected and classified first, and only then an AS model is applied to estimate energy expenditure. MET lookup tables are the most common approach for the latter, where a static MET value from a compendium of physical activities [2] is assigned to each cluster of activities [31]. However, MET-based approaches neglect any transitional effects of activities (continued calorie expenditure after rigorous activity has finished), and they overlook the fact that even activities in the same cluster can be performed at varying intensities, for example, walking at different speeds, or body exercise at different intensities.
An attempt to model the transition between activities was proposed in [32], where an accelerometer and a heart rate sensor were used and the transitions between sedentary activities, household activities, and walking were modelled. The work in [33] shows that by using data from multiple inertial sensors one can more accurately predict energy expenditure, although the limitations of wearable devices are considerable, particularly with respect to accuracy, as emphasised in [17].
Accelerometer feature descriptors are often formed within a temporal window. This raises another concern: window sizes are usually set at under 10 s in existing works [6,32], and the window length significantly affects the results. It should be short enough to recognise activities, as local temporal information is more descriptive, but long enough to predict calorie values, since current energy expenditure strongly depends on the previous activity intensity level.

Visual sensors
Visual sensor based techniques have emerged over recent years, for which there exists a significant body of literature describing the inference of activities from two-dimensional (2D) colour intensity imagery [34]. Meanwhile, the increasing availability of depth-measuring sensors, especially the introduction of the Microsoft Kinect, has generated an opportunity for utilising depth in conjunction with traditional RGB camera data, allowing for richer and more fine-grained analysis of human activity [7]. Applying computer vision techniques to help with the diagnosis and management of health and wellbeing conditions has gained significant momentum over the last years [35]. However, studies on energy expenditure using visual sensors have been relatively limited. Our work explores this field further and builds on several relevant subject areas in computer vision.
Visual feature representation: The visual trace of human activity in video forms a spatio-temporal pattern. To extract relevant properties for the task at hand, one aims at compactly capturing this pattern and highlighting important aspects related to the properties of interest. Assuming that both body configuration and body motion [36] are relevant to infer calorific uptake, the pool of potential features is large, ranging from local interest point configurations [37], over holistic approaches like histograms of oriented gradients and histograms of motion information [17], to convolutional neural network features [38].
Motion information itself can be recovered in various ways, e.g. from RGB data using optical flow or from depth data using 4D surface normals [39]. Whilst a composition of these features via concatenation of per-frame descriptors is straightforward, this approach suffers from the curse of dimensionality and unaffordable computational cost. Sliding window methods [40], on the other hand, can limit this by predicting current values only from nearby data within a temporal window. Further compaction may be achieved by converting large feature arrays into a single, smaller vector with a more tractable dimension count via, for instance, bags of visual words [41], Fisher vectors [42], time series pooling [43], or features extracted from convolutional neural networks [44]. In summary, the challenge of feature representation requires capturing visual aspects relevant to calorific expenditure, whilst limiting the dimensionality of the descriptor.
Activity recognition: There exists a significant body of literature describing the inference of activities from 2D colour intensity imagery [34], RGB-Depth data [7], and skeleton-based data [45]. Knowledge about the type of activity undertaken has been shown to correlate with the calorific expenditure incurred [2]. In alignment with Fig. 2a, we will argue in this work that an explicit activity recognition step in the vision pipeline can, as an intermediate component, aid the visual estimate of energy uptake.
Energy expenditure estimation: 2D video has recently been used by Edgcomb and Vahid [20] to coarsely estimate daily energy expenditure. In their work, subjects are first segmented from the scene background. Changes in height and width of the subject's motion bounding box, together with vertical and horizontal velocities and accelerations, are then used to estimate calorific uptake. Tsou and Wu [21] take this idea further and estimate calorie consumption using full 3D joint movements tracked as skeleton models by a Microsoft Kinect. We note, however, that both of the above methods use wearable accelerometry as the target ground truth, which does not provide an accurate benchmark; moreover, skeleton data is commonly noisy and currently only operates reliably when the subject is facing the camera. This limits applicability in more complex, in-the-wild visual settings as, for instance, contained in the SPHERE_RGBD + Inertial_calorie dataset. Our recent work in [19] introduced a vision-based framework for estimating calorific expenditure in a home environment, which we then extended to estimate physical activity intensity levels in real time [18]. Although that method is practically applicable to more complex settings, the light-weight features extracted from bounding boxes (the velocity vector and the ratio of height and width of the bounding box) can only support a gross estimate of calorific expenditure. In this paper, instead of using only simple bounding box features, we simultaneously collect RGB and depth imagery and then encode appearance and motion features via spatial pyramids. The temporal information is encoded using pyramidal temporal pooling with multiple pooling operators. This has the aim of extracting rich, multi-level information to give improved estimation accuracy, whilst maintaining applicability to complex human activities.

Sensor fusion
It is reasonable to expect that the use of multiple sensor types improves the overall performance compared with single-sensor settings, since sensors may complement each other's limitations. Given an accurate temporal synchronisation between the different modality sensors, learning from multi-modal data is applicable. In general, feature-level fusion (early fusion) and decision-level fusion (late fusion) are the two approaches most often employed to fuse multiple modalities. Both strategies are explained in further detail in [46].
Feature-level fusion: This methodology involves fusing features right after they are extracted from the raw data. The scheme only requires one learning stage and allows taking advantage of mutual information in the data. For instance, in [47], depth and inertial sensor data were concatenated, and a hidden Markov model (HMM) classifier was then employed for recognising basic gestures on the fused data. The results reveal significant improvements when the fusion scheme is applied compared with using each sensor individually. The work in [17] investigates the practical home use of body-worn mobile phone inertial sensors together with an RGB-Depth camera to achieve monitoring of daily living scenarios. The results indicate that the vision-based approach significantly outperforms the wearable-based method, while fusion of both sensors slightly improves the performance further. Clearly, feature-level fusion can be applied effectively in practical settings; however, it may suffer from the 'curse of dimensionality'.
Decision-level fusion: This approach fuses the decisions made by individual classifiers, each corresponding to one sensor. Since decision information is of low complexity, the curse of dimensionality can effectively be avoided. In [48], for instance, a Bayesian co-boosting training framework combines multiple hidden Markov model classifiers over two modalities, a Kinect sensor and an inertial measurement unit. The result is a strong classifier for gesture recognition, which achieved the best performance in the multi-modal gesture recognition challenge. A real-time action recognition system in [49] uses Dempster-Shafer theory to combine the classification outcomes from a depth camera and several inertial sensors. A Bayesian model for sensor fusion is introduced in [16], which aims at addressing the challenges of fusing heterogeneous sensor modalities in ambient assisted living.
Comparisons: In this work, we consider both fusion approaches and provide a direct comparison. In the feature-level fusion approach, features generated from the two modality sensors are merged before classification; decision-level fusion is performed by forming a linear combination of different classifiers using stacking regression [50] to improve overall accuracy. As outlined in the following section, our work uses skeleton-independent, RGB-Depth-based vision, together with two wearable accelerometer devices, to estimate calorific expenditure against a standardised calorimetry sensor (COSMED K4b2) based on gas exchange.

Method
To describe our framework for estimating calorific expenditure, we initially introduce the methods for visual and wearable sensors separately, and then describe two approaches for their fusion.
Fig. 2a shows a flowchart of the visual method, mapping visual flow and depth features to calorie estimates using AS models. The method implements a cascaded and recurrent approach, which explicitly detects activities as an intermediate step to select type-specific mapping functions for the final calorific estimation. Importantly, our video-based setup is designed to reason about activities first, before estimating calorie expenditure via a set of models which are each separately trained for particular activities. In contrast to this, our direct mapping (DM) method designed for wearable sensor data directly maps inertial features to calorie estimates via a monolithic classifier. A flowchart of the wearable approach is shown in Fig. 2b. In our fusion system, we consider both feature-level and decision-level fusion of these two approaches. Finally, we compare these methods against a ground truth of gas-exchange measurements and off-the-shelf alternatives, that is, manual mapping from activity classes to calorie estimates via MET lookup tables [2], as is often applied in clinical practice today.

Visual features
We obtain RGB and depth imagery using an Asus Xtion. For each frame t, appearance and motion features are extracted, with the latter being computed with respect to the previous frame (level 0). A set of temporal filters is then applied to form higher-level motion features (level 1). We extract features from within the bounding box returned by the OpenNI SDK [51] person detector and tracker, which can follow up to six persons in the camera view at the same time. To normalise the utilised image region against varying subject heights and distances to the camera, the bounding box is scaled by fixing its longer side to M = 60 pixels, a size recognised as optimal for human action recognition [52], while maintaining the aspect ratio. The scaled bounding box is then centred in an M × M square box and horizontally padded.
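The normalisation step can be sketched as below. This is an illustrative single-channel version: the nearest-neighbour resize by index sampling stands in for a proper library resize (e.g. OpenCV), and padding on both sides rather than only horizontally is a simplification of the procedure described in the text:

```python
import numpy as np

M = 60  # target length of the longer bounding-box side, as in the text

def normalise_box(patch):
    """Scale a cropped person patch so its longer side equals M pixels
    (preserving aspect ratio), then centre it in an M x M square and
    zero-pad the remaining border."""
    h, w = patch.shape[:2]
    scale = M / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # nearest-neighbour resize via index sampling
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = patch[np.ix_(rows, cols)]
    out = np.zeros((M, M), dtype=patch.dtype)
    top, left = (M - new_h) // 2, (M - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```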
Motion feature encoding: Inspired by Tran and Sorokin [52], optical flow measurements are taken over the bounding box area and split into horizontal and vertical components. These are re-sampled to fit the normalised box and a median filter with kernel size 5 × 5 is applied to smooth the data. A spatial pyramid structure is used to form hierarchical features from this; partitioning the image into an iteratively growing number of sub-regions increases discriminative power. The normalised bounding box is divided into an n_g × n_g non-overlapping grid, where n_g depends on the pyramid level, and the orientations within each grid cell are quantised into n_b bins. The parameters for our experiments are empirically determined as n_b = 9, and n_g = 1 and 2 for levels 0 and 1, respectively. Fig. 3a exemplifies optical flow patterns and their encoding for two different example activities.
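A minimal sketch of the per-cell orientation histogram over the two pyramid levels might look as follows; magnitude-weighting of the histogram votes is our assumption, not stated in the text:

```python
import numpy as np

def flow_histogram(u, v, n_g, n_b=9):
    """Quantise optical-flow orientations into n_b bins over an
    n_g x n_g grid of the normalised box, weighting by flow magnitude."""
    ang = np.mod(np.arctan2(v, u), 2 * np.pi)      # orientation per pixel
    mag = np.hypot(u, v)                           # magnitude per pixel
    bins = np.minimum((ang / (2 * np.pi) * n_b).astype(int), n_b - 1)
    cell = u.shape[0] // n_g
    hist = np.zeros((n_g, n_g, n_b))
    for gy in range(n_g):
        for gx in range(n_g):
            sl = (slice(gy * cell, (gy + 1) * cell),
                  slice(gx * cell, (gx + 1) * cell))
            hist[gy, gx] = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_b)
    return hist.ravel()

def pyramid_flow_feature(u, v):
    """Levels 0 and 1 as in the text: n_g = 1 and n_g = 2."""
    return np.concatenate([flow_histogram(u, v, 1), flow_histogram(u, v, 2)])
```

For a 60 × 60 box this yields a 9 + 4 × 9 = 45-dimensional motion descriptor per flow component.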
Appearance feature encoding: We extract depth features by applying the histogram of oriented gradients feature to raw depth images [53] within the normalised bounding box. We then apply principal component analysis and keep the first 150 dimensions of this high-dimensional descriptor, which retains 95% of the total variance.
Pyramidal temporal pooling: Given the motion and appearance features extracted from each frame in a sequence of images, it is important to capture both short- and long-term temporal changes, and summarise them to represent the motion in the video. Pooled motion features were first presented in [43], though designed for egocentric video analysis. We modify their pooling operator to make it more suitable for our data as follows; an illustration of the temporal pyramid structure and the pooling operations is shown in Fig. 3b. At level i, the time series data S is represented as a set of non-overlapping time segments, with level 0 covering the full sequence and each subsequent level doubling the number of segments. The final feature representation is a concatenation of multiple pooling operators applied to each time segment at each level. In matrix form, the time series data of a video can be written as T per-frame feature vectors, S = [S_1, …, S_N] ∈ ℝ^(N × T), where N is the length of the per-frame feature vector and T is the number of frames. A time series S_n = [s_n(1), …, s_n(T)] is the nth feature across frames 1, …, T, where s_n(t) denotes the nth feature at frame t. A set of temporal filters with multiple pooling operators is applied to each time segment [t_min, t_max] and produces a single feature vector for each segment via concatenation. We use two conventional pooling operators, max pooling and sum pooling, defined per feature as

f_max(S_n) = max_{t ∈ [t_min, t_max]} s_n(t),   f_sum(S_n) = Σ_{t = t_min}^{t_max} s_n(t),

as well as frequency domain pooling, which represents the time series S_n in the frequency domain by the discrete cosine transform (dct). Its pooling operator takes the absolute values of the j lowest frequency components of the frequency coefficients,

f_dct(S_n) = | [M S_n^⊤]_{1:j} |,

where M is the discrete cosine transformation matrix.
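Under the assumption that level i splits the sequence into 2^i equal segments (a common temporal-pyramid convention; the paper defers the exact segmentation to Fig. 3b), the three pooling operators could be sketched as:

```python
import numpy as np

def dct_matrix(T):
    """Orthonormal DCT-II matrix M, so that M @ S_n gives the frequency
    coefficients of a length-T time series S_n."""
    k = np.arange(T)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * T))
    M[0] *= np.sqrt(1.0 / T)
    M[1:] *= np.sqrt(2.0 / T)
    return M

def pool_segment(seg, j):
    """Max, sum, and frequency-domain pooling of one segment (N, T)."""
    f_max = seg.max(axis=1)
    f_sum = seg.sum(axis=1)
    coeffs = seg @ dct_matrix(seg.shape[1]).T      # DCT along the time axis
    f_dct = np.abs(coeffs[:, :j])                  # j lowest frequencies
    return np.concatenate([f_max, f_sum, f_dct.ravel()])

def pyramid_pool(S, levels=2, j=4):
    """Concatenate pooled features over a temporal pyramid where
    level i holds 2**i equal, non-overlapping segments."""
    T = S.shape[1]
    feats = []
    for i in range(levels):
        n_seg = 2 ** i
        for s in range(n_seg):
            lo, hi = s * T // n_seg, (s + 1) * T // n_seg
            feats.append(pool_segment(S[:, lo:hi], j))
    return np.concatenate(feats)
```

The same operators are reused for the inertial data in the next subsection.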

Inertial features
Raw time series data from accelerometers is measured as [X, Y, Z] vectors, where each column corresponds to acceleration along one orthogonal spatial dimension. Fig. 1 illustrates the raw accelerometer data collected from one wearable device for various actions. From the raw data, pooled motion features are formed from each of the three axes for each device. Abstracting short-term and long-term changes in the inertial feature descriptor is essential; it is particularly useful for modelling changes in activity intensity level. Thus, we apply three pooling operators (max pooling, sum pooling, and frequency domain pooling) to the inertial data.

Learning and recurrency
Energy expenditure estimation can be formulated as a sequential and supervised regression problem. We train a support vector regressor to predict calorie values from given features over a training set. The sliding window method is used to map each input window of width w to an individual output value y_t. The window contains the current and the previous w − 1 observations, and the window feature is produced by temporal pooling over the time series S = S_{t−w+1}, …, S_t. We note that energy values at a particular time are highly dependent on the energy expenditure history. In our system, this history is most directly expressed by previous calorific predictions made during operation. Thus, employing recurrent sliding windows offers the option to use not only the features within a window, but also the most recent d predictions ŷ_{t−d}, …, ŷ_{t−1} to help predict y_t. During learning, as suggested in [54], the ground truth labels in the training set are used in place of the recurrent values.
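The construction of recurrent sliding-window training samples might be sketched as follows; the simple mean over the window stands in for the pyramidal temporal pooling described earlier, and all names are illustrative:

```python
import numpy as np

def recurrent_windows(features, labels, w, d):
    """Build training pairs (X, y) for the recurrent sliding window.

    features: (T, F) per-frame descriptors; labels: (T,) calorie values.
    Each sample at time t pools the window [t-w+1, t] and appends the d
    previous ground-truth labels y_{t-d}, ..., y_{t-1}; at test time
    the regressor's own recent predictions take their place.
    """
    X, y = [], []
    for t in range(max(w - 1, d), len(features)):
        pooled = features[t - w + 1:t + 1].mean(axis=0)  # stand-in pooling
        recent = labels[t - d:t]                         # recurrent inputs
        X.append(np.concatenate([pooled, recent]))
        y.append(labels[t])
    return np.array(X), np.array(y)
```

The resulting pairs would then be fed to the support vector regressor.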

Fusion approach
Both feature-level and decision-level fusion are considered in our work.
Feature-level fusion: This is an early fusion approach, in which all features from all modalities are concatenated and employed as a single unified feature stream for the learning components. Given visual features in a d_1-dimensional feature space, S_v ∈ ℝ^{d_1}, and accelerometer features in a d_2-dimensional feature space, S_a ∈ ℝ^{d_2}, the fused feature set is constructed as S = (S_v, S_a) ∈ ℝ^{d_1 + d_2}. The fused feature vector is then used as input to the classifiers of the system. Fig. 4a shows a flowchart of this feature-level fusion approach.
Decision-level fusion: In this approach, a collection of models are learned, and the predictions are combined together only at the last stage to form the final decision.We apply the decision-level fusion via a stacking regression method, which forms linear combinations of different classifiers to improve overall estimation accuracy.
Consider K predicted values ŷ_1, …, ŷ_K, each estimated by one regressor individually. The final predicted value Ŷ(S) can then be represented as a linear combination of the individual predictions with different weighting coefficients,

Ŷ(S) = Σ_{k=1}^{K} α_k ŷ_k(S).

Given a set of training data {(S_1, y_1), …, (S_T, y_T)} with T training samples, where each S_t is an input vector, the goal is to minimise the distance between the ground truth y_t and the combined prediction Ŷ_t(S). The weights are obtained by solving

arg min_{α_1, …, α_K} Σ_{t=1}^{T} ( y_t − Σ_{k=1}^{K} α_k ŷ_k(S_t) )^2

with the constraints 0 ≤ α_k ≤ 1, k = 1, …, K. The resulting combined value Σ_{k=1}^{K} α_k ŷ_k(S) is then used as the prediction.
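The constrained weight fitting could be approximated, for illustration, by projected gradient descent rather than a dedicated constrained least-squares solver; the optimiser choice and all names are our own:

```python
import numpy as np

def stack_weights(preds, y, iters=2000, lr=1e-3):
    """Fit stacking coefficients alpha minimising ||y - preds @ alpha||^2
    subject to 0 <= alpha_k <= 1, via projected gradient descent.

    preds: (T, K) matrix whose column k holds regressor k's predictions.
    """
    K = preds.shape[1]
    alpha = np.full(K, 1.0 / K)                  # start from uniform weights
    for _ in range(iters):
        grad = -2.0 * preds.T @ (y - preds @ alpha)
        alpha = np.clip(alpha - lr * grad, 0.0, 1.0)  # project onto the box
    return alpha
```

A quadratic-programming solver would give the same solution exactly; the iterative sketch merely shows that the objective and box constraints are easy to handle.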
Fig. 4b shows a flowchart of this decision-level fusion approach.

Experimental results

Dataset and ground truth
We introduce the SPHERE_RGBD + Inertial_calorie dataset for human calorific expenditure estimation, comprising RGB-Depth and inertial sensor data captured in a real living environment. Colour and depth images were acquired at a rate of 30 Hz. The accelerometer data was captured at about 100 Hz and downsampled to 30 Hz, a frequency recognised as optimal for human action recognition [55]. The calorimeter gives one reading per breath, which occurs approximately every 3 s. To better model transitions between activity levels, we consider the nine different combinations of the three activity intensities (light, light+, moderate) in the design of each session.
Fig. 1 shows a detailed example of calorimeter readings and associated sample RGB images from the dataset (top) and the accelerometer data readings (bottom). The raw breath data is noisy (shown in red); we apply an average filter with a span of ∼20 breaths (shown in blue). The participants were asked to perform the activities based on their own living habits, without any extra instructions.
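The breath-level smoothing might be sketched as a centred moving average; the exact filter implementation used by the authors may differ:

```python
import numpy as np

def smooth_breaths(calories, span=20):
    """Centred moving average over per-breath calorimeter readings.

    With one breath roughly every 3 s, a span of ~20 breaths averages
    over about one minute. Window edges simply average fewer samples.
    """
    n = len(calories)
    half = span // 2
    out = np.empty(n, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half)
        out[t] = np.mean(calories[lo:hi])
    return out
```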

Parameter settings
In our experiments, we use non-linear support vector machines (SVMs) with radial basis function kernels for activity classification and a linear support vector regressor for energy expenditure prediction. The libsvm [56] implementation was used. We perform a grid search to estimate the hyper-parameters of the SVM. To test our individual-independent approach, we implement leave-one-subject-out cross-validation on the dataset, in which each subject's data is tested in turn using models trained on all other subjects' data combined. This process iterates through all subjects, and the average testing error and standard deviation over all iterations are reported. We use the normalised root-mean-squared error (normalised RMSE) as a standard evaluation metric to facilitate the comparison between data at different scales for the deviation of estimated calorie values from the ground truth.
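The evaluation protocol can be sketched as follows; normalising the RMSE by the ground-truth range is our assumption (normalisation by the mean is equally common), and `fit`/`predict` stand in for the libsvm training and prediction calls:

```python
import numpy as np

def normalised_rmse(y_true, y_pred):
    """RMSE divided by the ground-truth range, so sessions with
    different calorie scales are comparable."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

def leave_one_subject_out(data, fit, predict):
    """data: dict subject -> (X, y). Train on all other subjects,
    test on the held-out one; return mean and std of the errors."""
    errs = []
    for held in data:
        X_tr = np.vstack([X for s, (X, y) in data.items() if s != held])
        y_tr = np.concatenate([y for s, (X, y) in data.items() if s != held])
        model = fit(X_tr, y_tr)
        X_te, y_te = data[held]
        errs.append(normalised_rmse(y_te, predict(model, X_te)))
    return float(np.mean(errs)), float(np.std(errs))
```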

Evaluation of individual modalities
We start with tests on each sensor type (visual and inertial), and compare their performance when each is used independently.
Temporal window size: The accuracy of predicted calorie values is linked to the window of previous information utilised for making the prediction. In a first experiment, we look at the relation

Comparing sensor fusion approaches
Having tested the two modalities individually, we now study modality fusion approaches against the use of individual sensor systems, and also compare against the MET lookup table method for completeness. In feature-level fusion, we apply our AS approach with w = 15 s for activity recognition and w = 60 s for calorific expenditure estimation. In decision-level fusion, we again use the most suitable model for each sensor's data, which is the AS approach for the visual sensor data and the DM approach for the inertial sensor data. The estimation performance of the two fusion approaches is compared with the performance of each sensor modality individually, as shown in Fig. 8b. It can be seen that both fusion approaches on average outperform unimodal prediction. In particular, by combining the features from the visual and the inertial data, the overall prediction error decreases from 0.46 (inertial sensors alone) and 0.42 (visual sensor alone) to 0.39. The calorie prediction accuracy for most activities is improved when using fusion approaches. We also observed that the two fusion frameworks achieve similar performance.
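The decision-level fusion step, which combines the two unimodal calorie predictions via a regression method, can be sketched as follows; this uses a linear regressor as one plausible choice (the paper does not fix the regression method here), and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_fusion_regressor(cal_visual, cal_inertial, cal_truth):
    """Decision-level fusion: learn a regressor that maps the per-window
    calorie predictions of the two modalities to a fused estimate.

    cal_visual, cal_inertial, cal_truth: (n_samples,) arrays of the visual
    prediction, inertial prediction, and ground-truth calories per window.
    """
    stacked = np.column_stack([cal_visual, cal_inertial])
    return LinearRegression().fit(stacked, cal_truth)

def fuse(model, cal_visual, cal_inertial):
    # Apply the learned fusion model to new unimodal predictions.
    return model.predict(np.column_stack([cal_visual, cal_inertial]))
```

A learned combination can weight the more reliable modality per regime, which is consistent with the observation above that fusion lowers the error below either unimodal predictor.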
Finally, we present the results produced by MET, which is commonly used by clinicians and physiotherapists, and compare our proposed methods against it. It assumes N clusters of activities A = {A_1, A_2, …, A_N} are known. A MET value is assigned to each cluster, together with anthropometric characteristics of individuals. The amount of energy expended can then be estimated as energy = 0.0175 (kcal/kg/min) × weight (kg) × MET value [2]. Here, we use the ground truth labels to select activities, to keep this procedure identical to the commonly used manual estimate. Table 8 presents the detailed results for each sequence. The accuracy is calculated over the total calories expended in each recording session. We also measure the correlation between the ground truth and the predicted values [Note that the total calorie values for sequences 4, 5, 8, 11, 15, and 16 are relatively low due to shorter sequences.]. We can see that the fusion of visual and inertial sensors achieves higher accuracy and correlation in more sequences than the MET model or unimodal approaches, and obtains better rates on average, which points towards an advantage of using visual-inertial setups for the task of calorific expenditure prediction.
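The MET lookup-table estimate above reduces to a single formula per activity segment; a minimal sketch, with an added duration term since the 0.0175 constant is a per-minute rate (function and parameter names are our own):

```python
def met_energy_kcal(met_value, weight_kg, duration_min):
    """MET lookup-table estimate of energy expenditure.

    energy (kcal) = 0.0175 (kcal/kg/min) x weight (kg) x MET value x minutes.
    met_value comes from a standard MET table for the given activity.
    """
    return 0.0175 * weight_kg * met_value * duration_min

# e.g. 10 minutes of a 3.0-MET activity for a 70 kg person:
# met_energy_kcal(3.0, 70.0, 10.0) -> 36.75 kcal
```

Note that this baseline depends only on the activity label, weight, and duration, and so cannot capture within-activity intensity variation, which the sensor-based methods above can.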

Conclusion and future directions
We have presented a system for calorific expenditure estimation using data from two different sensor modalities: an RGB-Depth camera and wearable inertial sensors (accelerometers). We have demonstrated the effectiveness of the fusion approach through a comprehensive comparative study against single-modality setups and the widely used MET-based prediction. The proposed fusion system uses pooled spatial and temporal pyramids of visual and accelerometer features, which are subsequently fed into both early and late fusion approaches. To test the methodology, we introduced the challenging SPHERE_RGBD + Inertial_calorie dataset, which covers a wide variety of home-based human activities. The proposed fusion method outperforms both the MET-based estimation approach and single-modality sensors. The focus of this paper has been on presenting a system for estimating calorific expenditure from combined visual and accelerometer sensors, and on showing that the fusion of both modalities improves the estimates beyond the accuracy of a single modality and outperforms manual metabolic lookup table based methods, the main measure used in clinical practice today. We acknowledge that applying more advanced fusion approaches and different feature representations may improve the performance further. Possible future directions include introducing deep learning models and investigating advanced data fusion methodologies for sensors of different modalities. We hope this work, and the new dataset, will establish a baseline for future research in the area.

Acknowledgment
This work was performed under the SPHERE IRC project funded by the UK Engineering and Physical Sciences Research Council, Grant EP/K031910/1. The data from this study is available on request from the University of Bristol research data repository via http://www.irc-sphere.ac.uk/work-package-2/calorie or http://doi.org/cc5k.

Fig. 1
Fig. 1 Ground truth example sequence. Top: raw per-breath data (red) and smoothed COSMED-K4b2 calorimeter readings (blue), with sample colour images corresponding to the activities performed by the subject. Bottom: three-axis acceleration signals from the waist-worn sensor

Fig. 2
Fig. 2 Overview of our visual-based and wearable-based frameworks. (a) Visual-based framework. RGB-Depth videos are represented by a combination of flow and depth features. The proposed recurrent method then selects AS models which map to energy expenditure estimates. (b) Wearable-based framework. Inertial features are formed from the data of two accelerometer sensors, then mapped directly to calorie estimates via a monolithic classifier

Fig. 3
Fig. 3 Flow feature encoding via spatial pyramids, and temporal pyramid pooling with its feature representation. (a) Flow feature encoding via spatial pyramids. First row: limited motion while standing still. Second row: significant motion features when moving during vacuuming. First column: colour images with detected person. Second column: optical flow patterns. Third column: motion features at level 0. Last column: motion features from the top-right quadrants of the image at level 1 (at which the image is subdivided into four quadrants). (b) Temporal pyramid pooling and its feature representation. This schematic shows the temporal subdivision of data into various pyramidal levels (left) and the concatenation of the resulting features (e.g. max, sum, and dct) into a descriptor vector (right)

Fig. 4
Fig. 4 Fusion approaches overview. (a) Feature-level fusion framework. The features from visual and inertial sensors are concatenated to form a monolithic input into activity recognition and AS models. (b) Decision-level fusion framework. Calorie values are predicted individually by the different sensor modalities, and then combined using a regression method to form final calorie estimates

Fig. 5
Fig. 5 Example poses from the activity 'vacuuming'. It can be seen that the sequences contain a large variety of body positions, viewpoints, and distances naturally associated with the action. The two example sequences were captured in daytime and nighttime, respectively, indicating different lighting conditions

Table 1
Activities, their associated MET values, and the levels of activity intensity

Table 8
Ground truth and predicted calorie values in total per sequence, with accuracy and correlation. The best results for each sequence are in bold.