Data behind mobile behavioural biometrics – a survey

Abstract: Behavioural biometrics is becoming more and more popular. It is hard to find a sensor embedded in a mobile/wearable device that cannot be exploited to extract behavioural biometric data. In this study, the authors give the reader an overview of mobile device behavioural biometric data and how these data are used in experiments, especially examining papers that introduce new datasets. They do not examine the performance accomplished by the algorithms used, since a system's performance is enormously affected by the data used, their amount and quality. Altogether, 40 papers are examined, assessing how often they are cited, whether their databases are published, what modality of data is collected, and how the data are used. The authors offer a roadmap that should be taken into account when designing behavioural data collection and using collected data. They further look at the General Data Protection Regulation and its significance to scientific research in the field of biometrics. It is possible to conclude that there is a need for publicly available datasets with comprehensive experimental protocols, similar to those established in facial recognition.


Introduction
Biometric technology is getting more and more popular and accepted by society, mainly due to its success in mobile devices - a wide range of traditional biometric modalities are used in modern smartphones, including fingerprint, iris and face. This is unsurprising - the Bank of America 'Trends in Consumer Mobility Report 2016' found that on an average day 39% of millennials interact with their smartphone more than with anything or anyone else and feel anxious when they do not have access to their smartphone [1]. Even without an alert from the mobile device, we decide that we 'must' check in on social media - and thus our phones - immediately [2].
Today, the idea of the use of behavioural biometrics does not seem too unusual. Even South Park, the animated TV series, jokes about this topic. In a 2016 episode of the satirical show titled 'Fort Collins', one of the characters invents 'emoji analysis', which allows an individual to figure out each student's and teacher's emoji usage and compare it to that of the person who is trolling the other characters [3].
There is an intersection between mobile behavioural biometrics and cognitive psychology, because many of the features exploited in behavioural biometrics can be seen as sequences of motor actions [4]. Thus, there should be some lessons to learn from the way humans learn about the possible effect on behavioural biometrics. In psychology, this is known as the 'power law of practice' [4][5][6]. This law suggests that learning does not occur at a constant rate - when learning a new task, the speed of performance improvement declines. This variance in improvement should also affect the recognition rate if this new 'skill' is used as a behavioural biometric feature [4]. Haasnoot et al. [4] discuss the effects of practice and time on behavioural biometric recognition performance, working with a dataset in which subjects performed a 6-element discrete sequence production (DSP) task [7], with each participant completing 864 trials. They investigate how the usage of initial samples (when a subject starts learning to perform a DSP task) and the time between enrol and probe sessions affect performance. The authors find that early samples negatively affect recognition performance - this reflects the 'power law of practice'. Even after a recognition plateau is reached, there is evidence that behaviour patterns keep changing - this is in line with known facts about motor sequence learning, e.g. motor chunks [8] for the DSP task and other tasks connected with motor memory. Both these findings support the idea that how data is acquired and selected for behavioural experiments is crucial. For example, if behavioural data is acquired with a device that does not belong to the subject, or the subject needs to perform a specific task, enough time should be allowed for the user to get accustomed to the device and enough attempts to learn the new task.
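As an illustration, the power law of practice is commonly written as T(n) = T1 · n^(−b), where T(n) is the completion time on the nth practice trial. The sketch below is a hypothetical model: the initial time and the learning-rate exponent are arbitrary assumptions of ours, not values reported in [4].

```python
def power_law_of_practice(trial, initial_time=1.0, learning_rate=0.3):
    """Completion time after `trial` practice trials under the power law
    T(n) = T1 * n**(-b): improvement is fast at first, then plateaus.
    The parameter values here are illustrative assumptions only."""
    return initial_time * trial ** (-learning_rate)
```

The diminishing returns this curve produces are exactly why early (still-learning) samples behave differently from later ones when used as biometric features.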
The fact that behavioural biometrics can involve subjects learning tasks, and that humans can change their behaviour, is what makes this research domain challenging. A disadvantage of (continuous) behavioural authentication methods is that they cannot cope with unusual behaviour, which can be caused, e.g. by alcohol or injuries [9].
Researchers nowadays even offer an open-source, extensible behavioural biometrics framework for Android called Itus [10]. The framework enables real-time classification on resource-constrained mobile devices for prototyping and deployment of new behavioural biometric schemes. This framework is widely used, including in [11][12][13].
It is hard to find a sensor embedded in a mobile/wearable device which cannot be exploited to extract behavioural biometric data. Various types of mobile devices are used to capture sensor data, ranging from wearable sensors to tablet devices. This is why in this paper we want to investigate data in behavioural biometrics and how the behavioural data is used. We only examine papers that introduce new datasets. If a paper introduces a dataset but does not perform any experiments or report baseline results (as in [14]), we look into the same authors' successive work that presents such results. Especially, we focus on papers that:
• promote reproducible research by offering public datasets or using publicly available data;
• test generalisation properties of developed algorithms by using different subject data for training and testing or by using multiple datasets.
The remainder of this paper is structured as follows: Section 2 presents related work - different surveys of papers published in the field of behavioural biometrics; Section 3 presents the motivation for this work and specifics about biometric data collection. Section 4 presents what we examine when looking at the papers that introduce new datasets; Section 5 describes the sensors currently used in behavioural biometrics. The main contribution is in Section 6, presenting the summary of 40 articles. In Section 7, selected papers are analysed in detail. We discuss the General Data Protection Regulation (GDPR) and its importance to research in biometrics in Section 8. Conclusions are given in Section 9.

Related work
Multiple review papers summarise and categorise existing work in mobile behavioural biometrics. Alzubaidi and Kalita [15] present an extensive study of the research on behavioural biometrics current at the time of its publication (2016). The authors analyse more than 70 studies and use seven behavioural biometric feature categories: hand waving (2 studies); keystroke (9 studies); touch screen (22 studies); gait (13 studies); signature (11 studies); voice (5 studies); and behaviour profiling (9 studies). The paper also presents lessons learned, open problems and future trends, including the opinions that behavioural biometrics are considered promising for providing continuous authentication for consumers; that machine learning algorithms are well-suited to generalise from past user behaviours to 'predict the future'; that most current studies gather and record data under laboratory conditions; and that most published methods have been tested on the Android platform and ignore other platforms, e.g. iOS and Windows Mobile.
Rybnicek et al. [16] present an overview of biometric traits that can be used to secure mobile devices. The authors describe in detail keyboard, touchscreen, accelerometer, gyroscope based as well as hybrid authentication methods. Moreover, the authors discuss multiple aspects of behavioural biometrics for mobile devices and offer a roadmap. These aspects are criteria for using biometric data; data collection; system architectural structure; and biometric features and classification.
In [17], behavioural biometric systems are referred to as transparent authentication systems. The paper presents a review of these systems for mobile device security. The authors classify behavioural biometric systems into six categories: keystroke-based authentication; gait-based authentication; touch-based authentication; device sensor-based authentication; behavioural profiling-based authentication; and multi-modal transparent authentication, and analyse 33 papers published between 2007 and 2015.
Meng et al. [18] summarise biometric authentication methods on mobile phones. In their study, the authors classify the biometrics used into 11 groups. Five physiological: fingerprint, face, iris, retina, and hand/palm recognition; and six behavioural: voice, signature, gait, behaviour profiling (defined by the authors as 'techniques that aim to identify people based upon the way in which they interact with the services of their mobile devices'), keystroke dynamics, and touch dynamics. The authors also propose a framework for establishing a reliable authentication mechanism through implementing multimodal biometric user authentication. However, there seem to be some inaccuracies in the mathematics used when the expected average FRR (false rejection rate) and FAR (false acceptance rate) are calculated. The authors claim that both FRR and FAR can be improved by five orders of magnitude when the user is given three authentication tries and an option to enter a PIN if the person has not been recognised during the biometric authentication stage (similar to the authentication mechanism in iPhones with Touch ID). This would indeed improve the FRR (since the genuine user has more attempts to get accepted), but it is not true for the FAR, since more tries imply a more significant probability of being accepted incorrectly or guessing the PIN.
In [18], published in 2015, the authors survey eleven biometric modalities used in smartphones - five physiological and six behavioural. The behavioural biometric modalities that can be implemented in a mobile device and are summarised in that paper are voice recognition, signature recognition, gait recognition, keystroke dynamics, touch dynamics and behavioural profiling (in this case, the hypothesis is that mobile users use applications differently depending on the location and the time). The authors also describe a generic biometric authentication system and present eight possible attack points, summarising possible practical attacks and possible countermeasures. They also present guidelines for developing a robust biometric system, concluding that multilevel authentication is preferable.
The 2018 survey by Gupta et al. [19], as its title suggests (Demystifying Authentication Concepts in Smartphones), serves more as an explanatory dictionary for novices in the field of mobile biometrics. One of the authors' objectives is to explain to new researchers the sophisticated jargon of biometrics. Similarly to our paper, the authors observe that mobile biometric systems are usually reported in terms of accuracy, while other aspects are overlooked. The authors emphasise that usability is a vital aspect to investigate, whereas in our paper the focus is the data itself - how the dataset is collected and used. The paper summarises relevant, publicly available behavioural biometric datasets and briefly introduces the reader to research papers presenting recognition systems using these behavioural biometric modalities: gait, keystroke/touch dynamics and voice.
Buriro et al. [20] (published in 2017) is not a typical literature summary paper - instead, the authors of this six-page article discuss mobile biometrics and present 17 guidelines for designing and testing such systems. However, the focus is on both kinds of mobile biometric systems - physical and behavioural. The authors conclude that to maximise a proposed biometric solution's benefit, it is essential to evaluate the solution using multiple criteria.

Gap in state of art
We did not find existing reviews that specifically discuss the collection and usage of behavioural biometric data. That is why we offer this literature overview, presenting a summary of 40 research papers' approaches to the collection and use of behavioural biometric data. Except for [20], we currently do not see publications that present extensive guides on how behavioural biometric experiments should be performed. Because of this lack of guidelines, well-established recommendations and handbooks in the behavioural biometrics field, it is hard for new scientists to navigate this broad subject. In the last decade, thousands of research papers have been published in this field. Due to limited resources, we cannot summarise them all. Instead, we have selected 40 papers in which the authors present new mobile behavioural biometric datasets and perform experiments on the collected data. We choose papers from the last two decades, especially focusing on recently published papers, selecting those that explore different mobile sensors and underlying behavioural modalities. We hope to provide an insight into how different research teams approach data collection and use.

Data collection and use
Li and Jain [21] state that 'the heart of designing and conducting evaluations is the experimental protocol. The protocol states how an evaluation is to be conducted and how the results are to be computed'. Our investigation suggests that researchers do not disclose much about data collection or how the collected data are used - what the protocols are and what percentages of data are used for training and testing. Often the reader can conclude that collected data are used in an All VS All scenario. Generally speaking, this means that all data were available to the researcher during the development of the recognition algorithms. One or some samples of each subject's data are used as templates and the rest of the samples as probes (see definitions in Table 1). All VS All means that all of each subject's probes are scored against every subject's template: if there are N subjects, each with k samples, and for each subject 1 sample is used as a template, then all remaining samples (k − 1 per subject) are scored against all subject templates (N templates), generating N²·(k − 1) scores in total: N·(k − 1) genuine and N·(N − 1)·(k − 1) zero-effort impostor scores.
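The score counts of this All VS All protocol can be sketched as follows; this is a minimal illustration of the formulas above, and the function name is ours, not from any surveyed paper.

```python
def all_vs_all_score_counts(n_subjects: int, k_samples: int) -> dict:
    """Score counts for an All VS All protocol with one template per subject.

    Each subject contributes k_samples - 1 probes, and every probe is scored
    against every subject's template (N templates in total).
    """
    probes_per_subject = k_samples - 1
    total = n_subjects ** 2 * probes_per_subject                    # N^2 * (k - 1)
    genuine = n_subjects * probes_per_subject                       # N * (k - 1)
    impostor = n_subjects * (n_subjects - 1) * probes_per_subject   # N * (N - 1) * (k - 1)
    assert total == genuine + impostor  # every score is genuine or impostor
    return {"total": total, "genuine": genuine, "impostor": impostor}
```

For example, 10 subjects with 5 samples each yield 400 scores: 40 genuine and 360 zero-effort impostor comparisons.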
Often, no generalisation (algorithm testing on multiple databases/data subsets with non-overlapping subject data) of developed algorithms is researched - it is not clear to the reader how the developed system will perform with unseen subject data, or whether the system was over-fitted. Thus, it becomes hard to evaluate both the modalities used and the developed algorithms.

(IET Biom., 2020, Vol. 9, Iss. 6, pp. 224-237. This is an open access article published by the IET under the Creative Commons Attribution License, http://creativecommons.org/licenses/by/3.0/)
The authors of this paper think that data collection, preparation (design of data protocols, division of data into different datasets), as well as testing how the developed algorithms generalise to unseen subjects, are some of the most critical stages in the whole process of biometric research. Unfortunately, the literature often contains limited information on how this process should be done. In books about biometrics, there may be no such information at all (e.g. [28]), or only a description of the basic All VS All protocol (e.g. [29]).
In the literature about machine learning, especially deep learning, the importance of data and how data should be treated, as well as the testing of algorithm generalisation, is much more widely discussed. It is well known in the machine learning community that it is vital to have separate training, validation, and test sets [23,30]. 'The training set is used to fit the models; the validation set is used to estimate prediction error for model selection and the test set is used for assessment of the generalisation error of the final chosen model. The test set should be kept in a vault, and be brought out only at the end of the data analysis' [23]. A typical split might be 50% for the training dataset and 25% each for the validation and test datasets. The validation set (but not the test set) can also be approximated by data re-use, using cross-validation and bootstrap methods [23]. Luckily, some papers do discuss data importance for biometric experiments. [31] claims that to carry out realistic (and unbiased) experiments, it is necessary to use different populations and datasets for development and evaluation. The authors of [32] investigate the effect on performance of violating the rules for creating training sets of 'the Good, the Bad, and the Ugly' (GBU) biometric recognition challenge problem. They show that disregarding the GBU protocol can substantially overestimate performance on the specific face recognition task. Verification performance on the most challenging dataset (Ugly) increases from 11.4% (no subject overlap) to 15% when 91 subjects (out of 222) overlap in the training and test sets (same subjects, different images). If images are drawn directly from the test set, verification performance increases hugely, to 61.2%. The authors conclude that 'there are applications of face recognition where training and testing on the same people may be a reasonable thing to do, such as with family photo libraries'.
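The typical 50/25/25 split mentioned above can be sketched as follows; this minimal illustration shuffles individual samples freely and ignores subject identity, which is only appropriate when samples are independent.

```python
import random

def split_50_25_25(samples, seed=0):
    """Shuffle samples and split them 50% / 25% / 25% into
    training, validation and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    items = list(samples)
    rng.shuffle(items)
    n = len(items)
    n_train = n // 2
    n_val = (n - n_train) // 2
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For biometric data, splitting at the subject level rather than the sample level is usually required, as the GBU results above demonstrate.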
They also recommend that researchers publicly post their training sets. This should provide confidence in the veracity of the reported results.
Other researchers, as similarly suggested in the machine learning literature, introduce three subsets - training, development (in other literature, and in this paper, the development dataset is called the validation dataset) and test. There are no overlapping subjects between these subsets [24]. The authors claim that 'such choice guarantees that specific behaviour (such as eye-blinking patterns or head-poses) are not picked up by detectors and final systems generalise well'. The authors recommend that training and development samples be used to teach classifiers how to discriminate. The training dataset can be used for training the classifier and the development data to estimate when the training should be stopped. Another way, which may generalise less well, is to merge the training and development datasets and use the merged set both as training data and to formulate stop criteria. Finally, the test set should be solely used to report error rates and performance characteristics. Such a database division into three subsets, and the purpose of each, is summarised in Table 2. If a single number is desired, a threshold τ should be chosen on the development (validation) set, and the half-total error rate reported using the test set data [24,25].
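A subject-disjoint three-way split of the kind recommended above can be sketched as follows; the split ratios and the record format (subject ID, sample) are illustrative assumptions of ours.

```python
import random

def subject_disjoint_split(records, ratios=(0.5, 0.25, 0.25), seed=0):
    """Split (subject_id, sample) records into train/development/test sets
    such that no subject appears in more than one subset."""
    subjects = sorted({sid for sid, _ in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(subjects)
    n = len(subjects)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    groups = (set(subjects[:n_train]),
              set(subjects[n_train:n_train + n_dev]),
              set(subjects[n_train + n_dev:]))
    # Assign every record to the subset that holds its subject.
    return tuple([r for r in records if r[0] in g] for g in groups)
```

Because whole subjects (not samples) are shuffled, behaviour specific to one person cannot leak from training into the reported test error.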
The factors that we find essential when collecting and managing data -a roadmap to data use -are summarised in Table 3.

Assessment
In this section, we explain how we assess the 40 papers reviewed in this survey. The aim of this paper is not to analyse all articles that introduce new mobile behavioural biometric datasets. Instead, we try to give the reader an insight into how data are collected and used in this field.
In most of the papers surveyed in Table 4, data are collected so that researchers can develop recognition algorithms and perform biometric experiments, assessing the performance of the developed recognition systems. An exception is the 'MIT Reality Mining Dataset' [38]. This study explores the capabilities of smartphones and enables social scientists to investigate human interactions beyond the traditional survey-based methodology, rather than being a specific study about behavioural biometrics. Another exception is [39], where the researchers take a different approach. They choose a novel behavioural biometric modality - the time it takes for the user to slide his/her finger between the unlock pattern points. They select machine learning algorithms (one-class support vector machine (SVM) and K-means) that can distinguish between genuine and imposter behavioural patterns.

Biometric template (template): [22], enrolment data, stored for reference.
Biometric probe (probe): [22] incoming biometric sample that is compared to the stored template.
Enrol dataset: collection of biometric templates, e.g. in the training dataset.
Probe dataset: collection of biometric probes, e.g. in the training dataset.
Test dataset: the dataset that is used for assessment of the generalisation error of the final chosen model. Ideally, the test set should be kept in a 'vault' and brought out only at the end of the data analysis [23]. The test set consists of enrol templates and probes. The test set should be solely used to report error rates and performance curves [24,25].

Training dataset
Dataset that is used to fit the models [23] and train the classifier.

Validation dataset
Dataset that is used to estimate prediction error for model selection [23], and to estimate the system's thresholds. The validation set consists of enrol templates and probes.

Closed-set identification: opposite of 'open-set identification' [26].
Open-set identification: it is unknown whether the subject presented to the biometric system for recognition has been enrolled in the system or not; therefore, the system needs to decide whether to reject the subject or recognise him/her as one of the enrolled subjects [26].
Attempt: submission of one (or a sequence of) biometric samples to the system [27].
Identification: the process of searching against a biometric enrolment database to find and return the biometric reference identifier(s) attributable to a single individual [22]. Also referred to as 1:N matching [21].
Session: data collection process separated by at least one day.
Verification: the process of confirming a biometric claim through biometric comparison [22]. Also referred to as 1:1 matching [21].

After selecting the algorithm and developing an Android app, the authors measure the system's recognition performance - in this case, not by creating a database. Instead, the authors recruited 54 subjects who act as genuine users and 10 subjects who act as imposters. By observing the participants and counting how often they succeed in authenticating in the app, they determine the performance.

Data used
In the papers presenting new datasets, we examine what sensor data are collected. Due to space constraints, we describe only the types of sensors used, not specific usage details (e.g. which particular touch events are examined). The main data categories used are summarised in Section 5. We have summarised the subject count and whether subjects needed to perform a specific task (everything from walking [61] to navigating maps [14]). Because it takes time to learn a task, and because a subject's behavioural patterns can change over time, we examine whether researchers collect multiple-session data, and if so, the time gap between sessions. It is also possible to have an 'unconstrained' data collection, where data are captured over multiple days and the subject is not required to interact with the system in any particular way, nor to perform any task. We are interested in the overall duration of the data collection (the sum of all data collection session durations).
In some studies, it is argued that subjects should not know the real reason for the data capture, in order not to affect the way subjects interact with the mobile devices during data collection, e.g. [13]. Of course, it may be perceived as unethical if the subjects of an 'unconstrained' data collection did not know the reason for the collection and the type of data being collected.
Most of all, we think it is essential that authors publish the data used in experiments and provide others with the data protocols, so that others can reproduce their experiments identically.

Experiments performed
We are not examining the performance accomplished by the algorithms used, since a system's performance is enormously affected by the data used: for example, the number of subjects in the database, the data usage protocols (the instructions on which probes should be scored against which templates), the feature/score fusion used, and the quality of the data (e.g. is the data collection unbiased, were all participants' data collected in the same way). Some datasets are more challenging, allowing subjects to use the mobile device freely and consisting of a larger number of subjects. Lower performance on such a challenging dataset would not mean that the algorithms developed using it are worse. This is why we investigate only general facts that give an insight into how data were treated while performing experiments. We ask whether the authors explore generalisation (is the classifier designed to correctly classify unseen objects which are not used during the training process? Generalisation represents the capacity of the classifier to respond to this task; when a classifier has a good generalisation capacity, it can correctly classify unseen examples [79]). To investigate whether generalisation is performed, we separately evaluate whether the authors train system parameters and feature space on independent data (Multiple datasets used) or use multiple databases (Multiple databases used). We also enquire whether open-set identification is performed, whether new subjects can easily be enrolled into the database, and what kind of data was used for imposters. An extended summary of the parameters evaluated can be seen in Table 5.
On account of this, we summarise the vocabulary used in this paper in Table 1, based on the ISO/IEC 2382-37:2017 standard [22]. For example, some papers (e.g. [67]) use the term 'training' for building an enrol model and 'verification' for scoring templates against probes; in others (e.g. [52]) 'training' means the enrol templates and 'testing' means scoring them against probes. We will use the terms as defined in the machine learning community, as identified in, for example, [24,25].

Table 3 Roadmap - set of actions when collecting and using data (Action / Reason)
Action: Acquire explicit consent from subjects; implement the necessary technical and organisational measures for data collection, storage and usage.
Action: Use seamless data collection on the subject's own device, OR (allow the subject to practice/learn the task AND allow the subject to get used to the device).
Action: Allow time between template and probe dataset collection. Reason: to account for behaviour patterns changing, e.g. due to motor sequence learning such as motor chunks [8]; Saeed [34] discusses the necessity of acquiring data over multiple days in at least three sessions; ISO/IEC 19795-1 [27] states that enrolment and testing are normally carried out in different sessions, separated by days, weeks, months or years, depending on the target application.
Action: For verification experiments, report (false accept rate (FAR) AND false reject rate (FRR)) OR ROC OR DET curves. Reason: ISO standard [35]; the best way to present or compare biometric verification performance is the ROC curve [36].
Action: For open-set identification experiments, use (false positive identification rate AND the corresponding false negative identification rate) OR (false match rate AND false non-match rate).
Action: Use training, testing and validation datasets. Reason: the feature space and the verification system parameters must be trained using completely independent data from that used for specifying client models [31].
Action: Assess generalisation by testing algorithms using multiple, publicly available databases. Reason: to see how well the algorithms generalise on different datasets, and to compare results with other researchers.
Action: Results should be reproducible / a reproducible research approach should be used. Reason: to show evidence of the correctness of one's results; to aspire to reproducibility is to enable others to explore the methods used and the results acquired [37].
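The verification error rates named in the roadmap (FAR, FRR, and the points of an ROC/DET curve) can be sketched as follows; this is a minimal illustration assuming similarity scores where higher means more similar, not code from any surveyed paper.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR: fraction of impostor scores at/above the threshold (false accepts).
    FRR: fraction of genuine scores below the threshold (false rejects).
    Assumes similarity scores, i.e. higher = more likely genuine."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

def roc_points(genuine_scores, impostor_scores):
    """(threshold, FAR, FRR) triples over all observed score values,
    from which an ROC or DET curve can be plotted."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t, *far_frr(genuine_scores, impostor_scores, t))
            for t in thresholds]
```

Sweeping the threshold trades FAR against FRR, which is exactly the trade-off an ROC or DET curve visualises.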

Sensors used in behavioural biometrics
In mobile behavioural biometric studies, data are collected using mobile devices - phones, tablets and wearable devices. In this paper, we will not be looking at traditional biometric modalities.

Subjects: The number of subjects. This can be given as N(K), where N > K. In this case, N is the number of subjects in the used dataset and K the number of subjects used in experiments; this means that part of the subjects were excluded from the experiments. In some open-set identification papers, K is the number of enrol templates, and N subjects' data are used as imposters.
Purpose known: Do the subjects know the purpose of the data collection at the time the data collection takes place? (yes/no)
Unsupervised: Was the data collection unsupervised? (yes/no) Supervised data collection would mean that at all times there was a person interacting with the subjects (a biometric attendant [22]).
Specific task: Did users perform a specific task? (yes/no) The opposite would be that users were asked to use their mobile devices as usual, and data were collected in the background. The specific task need not be related to the actual use of the mobile device (e.g. in [61] the task is to walk).

No. of sessions
The number of data collection sessions. Although different definitions are used in some papers, we define a session as a data collection process separated from others by at least one day. This means that two data collection attempts in a single day are counted as a single data collection session. Continuous data collection which spans multiple days we also define as a single session.

Days between sessions
Days between sessions. If there are multiple sessions with irregular time gaps between them, the average is given. If data collection was performed on consecutive days, 'days between sessions' = 0.

Duration, approx.: Duration of the entire data collection (the sum of individual attempt and session durations).
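The session count and the average days-between-sessions figure, as defined above, could be computed from collection dates along these lines. This simplified sketch treats each distinct calendar day as a session and does not merge continuous multi-day collections, so it is an approximation of the convention stated above.

```python
from datetime import date

def session_stats(collection_dates):
    """Count sessions (distinct collection days) and the average number of
    whole days between consecutive sessions (0 for consecutive days)."""
    days = sorted(set(collection_dates))
    # Gap of 0 means the sessions took place on consecutive days.
    gaps = [(b - a).days - 1 for a, b in zip(days, days[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return len(days), avg_gap
```

For example, collections on 1, 2 and 5 January count as three sessions with gaps of 0 and 2 days, i.e. an average of 1.0.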

Multiple datasets used
Were multiple datasets used (e.g. one for training the feature space and fixing thresholds, another to obtain recognition results) with non-overlapping subjects? (yes/no)

Multiple databases used

Were multiple databases used to research generalisation? (yes/no)

Open-set identification
Were the developed recognition algorithms tested on an open-set probe set? (yes/no) In open-set identification, it is unknown whether the subject presented to the biometric system for recognition has been enrolled in the system or not; therefore, the system needs to decide whether to reject the person or recognise him/her as one of the enrolled subjects. It is the opposite of 'closed-set identification' [26]. We will use this definition, as similarly used in [80][81][82][83][84]. Li and Jain [21] define two metrics for open-set identification: False Alarm Rate and Detection Identification Rate. The Detection Identification Rate is the fraction of the genuine probes (those that belong to an enrolled subject) that are correctly detected and identified; it is a function of the operational threshold. It can be determined at a specific rank n; this requires that, after matching, all similarity scores between genuine probes and the enrolled client templates are examined and sorted. A probe has rank n if the similarity score between it and its true template in the gallery is the nth largest similarity score. The False Alarm Rate captures performance when a probe does not belong to any subject enrolled in the database [21]. We will not be so formal, only assessing what the probes are (e.g. whether probes of subjects that are not enrolled in the template subset - so-called unknown-unknowns [81,82] - are present), not which metrics are used.

New subjects easily enrollable

Can new subjects be easily enrolled? (yes/no) Whether it is easy to enrol a new subject is a qualitative measure, but the authors of this paper think it is vital to assess. The new-subject enrolment process is straightforward with most hand-crafted algorithms, simple similarity metrics, hidden Markov models (HMMs), and when machine learning is used and the problem is approached as a one-class problem (the one class is composed of the genuine templates; the classifier is unaware of the imposter data).
New subject enrolment is difficult if a classifier with k output classes, where k is the count of enrolled subjects, is used. We also argue that if the recognition is addressed as a two-class problem, where one class represents the genuine templates and the other class the imposters, it is difficult to enrol new subjects, since data to build both class models must be available before a subject is enrolled. We believe that a more convincing method should be used. Zheng et al. [45] go further and say that a 'two-class classifier need input data from impostors or non-target users at the training phase, which is unrealistic and raises privacy concerns'. Lui et al. [32] state that 'there are applications of face recognition where training and testing on the same people may be a reasonable thing to do, such as family photo libraries. However, it is entirely unacceptable for large scale deployed systems that must manage many enrolled people. For example, retraining a deployed system each time that a new person is enrolled is a logistical nightmare.'

Imposters
What are the imposter probe samples when a system is evaluated? For example, other subjects' data or synthetic data.

Comments
Our comments about the paper.

For readability purposes, we divide sensors into hardware and software sensors.
Hardware sensors:
• Accelerometer - used in [14,56,61,71]; a sensor that measures the acceleration of the mobile device [39]. Acceleration is measured in three dimensions - X, Y and Z (relative to the phone's orientation), meaning that the sensor provides information about the phone's movement. It is one of the two sensors that determine the position of a device [87]. Gravity data (the applied force of Earth's gravity (m/s²) on the mobile device) can be calculated using accelerometer data together with the magnetometer and the gyroscope. A device's orientation, as used in [54,71], can be computed using the accelerometer and magnetometer [88]. This is why, from now on, the use of orientation sensor data will be denoted as the use of accelerometer and magnetometer. The accelerometer can be used to extract information about a person's gait as well as other movements that are transferred to the phone.
• Shen et al. [71] focus on motion sensor output while subjects are performing touch-tapping and single-touch-sliding actions. When users touch a smartphone, the accelerometer measures the acceleration force applied to it, the gyroscope measures the rate of rotation, and the magnetometer measures the ambient geomagnetic field on three axes. It is also possible to determine the degrees of the phone's rotation around the three physical axes. Users may develop personal operational habits, based on different rhythm, strength and angle preferences of finger movements.
• Geomagnetic field sensor - monitors changes in the earth's magnetic field [89].
• Gyroscope - measures the rotation around a device's axis [90].
Accelerometer and gyroscope sensors are always hardware-based, and a variety of software-based sensors can use their data [91].
• Location - in this paper, denotes location derived from Wi-Fi information, cell tower information, the global positioning system (GPS) and Bluetooth information.
One of the key ideas in [38] is to exploit the fact that modern phones use multiple networks (e.g. Bluetooth and the global system for mobile communications (GSM)); data about multiple networks can complement each other to determine subjects' locations and actions. Both radio tower IDs and nearby Bluetooth devices can be logged to obtain different types of information. Every Bluetooth device is capable of 'device-discovery', which allows it to collect information on other Bluetooth devices within 5-10 m. This information includes the Bluetooth MAC address (BTID), device name, and device type. The BTID is a 12-digit hex number unique to the particular device. The authors of [40] recorded GPS coordinates if users enabled it. They also collected location, the obfuscated service set identifier (SSID) if a user was connected to a Wi-Fi network, and the IDs of cellular base stations.
In [69] location was determined using GPS data when outdoors and Wi-Fi when indoors.
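As an illustration of how a device's orientation can be derived from the accelerometer and magnetometer (the computation referred to above), here is a minimal sketch of the standard tilt-compensated formulation. Axis conventions and angle signs vary between platforms, so this is an assumption for illustration, not any cited paper's exact method; real implementations (e.g. Android's `getRotationMatrix`) add calibration and filtering on top.

```python
import math

def orientation_from_acc_mag(acc, mag):
    """Device orientation (roll, pitch, azimuth in radians) from raw
    accelerometer and magnetometer triples in the phone's X/Y/Z frame.
    Sketch only: assumes the device is static, so the accelerometer
    reading is dominated by gravity.
    """
    ax, ay, az = acc
    mx, my, mz = mag
    # Roll and pitch follow from the direction of gravity alone.
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    # Tilt-compensate the magnetometer, then take the heading angle.
    mx2 = mx * math.cos(pitch) + mz * math.sin(pitch)
    my2 = (mx * math.sin(roll) * math.sin(pitch) + my * math.cos(roll)
           - mz * math.sin(roll) * math.cos(pitch))
    azimuth = math.atan2(-my2, mx2)
    return roll, pitch, azimuth
```

For a phone lying flat and facing magnetic north, all three angles come out as zero, which matches the intuition that orientation is defined relative to gravity and magnetic north.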
• Magnetometer - provides general rotational information and the relative orientation of the mobile device to the Earth's magnetic north [92]. The magnetometer is enclosed in an embedded device that often incorporates an accelerometer. This helps correct magnetometer sensor measurements using tilt information from the auxiliary sensor. The magnetometer is similar to the geomagnetic field sensor, except that no hard iron calibration is applied to the magnetic field [87].
• Phone status - meta-data about the subject's phone.
• In [38] the authors collect information about phone status, such as charging and idle. In [65] information about battery level was recorded and analysed.
• Proximity sensor - determines how close the face of a device is to an object [87].
• Touch - denotes the usage of the touchscreen.
In [14] touch describes raw touch events: a timestamp, finger count, finger ID, raw touch type, X/Y coordinates, contact size, screen orientation data, tap gesture data, scale gesture data, scroll gesture data, fling gesture data, key press on virtual keyboard data including press type, and key ID.
The authors of [13] define two user touch actions that are frequent and primitive and call them 'trigger-actions'. Whenever the user performs such an action, the system records touch data. These actions are sliding horizontally over the screen (e.g. when browsing through images or navigating to the next page of icons on the main screen) and sliding vertically over the screen (e.g. to move screen content up or down, as is typically done when reading e-mail, documents or web pages).
In [71] it was found that more than 98% of touch-interaction behaviour comprises touch-tapping and single-touch-sliding actions.
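To make the kind of touch data described above concrete, the following sketch extracts a few features from a single sliding stroke. The feature set (duration, path length, direction, speed, pressure) is hypothetical, chosen for illustration in the spirit of the touch-based papers rather than taken from any one of them.

```python
import math

def stroke_features(points):
    """Simple features from one sliding stroke, given as a list of
    (t, x, y, pressure) samples ordered by time. Hypothetical feature
    set for illustration only.
    """
    t0, x0, y0, _ = points[0]
    t1, x1, y1, _ = points[-1]
    duration = t1 - t0
    # Path length: sum of distances between consecutive touch samples.
    length = sum(math.dist(p[1:3], q[1:3])
                 for p, q in zip(points, points[1:]))
    return {
        "duration": duration,
        "path_length": length,
        "end_to_end": math.dist((x0, y0), (x1, y1)),
        "direction": math.atan2(y1 - y0, x1 - x0),
        "avg_speed": length / duration if duration > 0 else 0.0,
        "avg_pressure": sum(p[3] for p in points) / len(points),
    }
```

A feature vector of this kind, computed per 'trigger-action', is what distance-based or one-class verifiers discussed later would operate on.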
Software sensors:
• App usage - monitoring of the applications being used.
• The authors of [66,67] collect data from five popular social network services - Facebook, Twitter, LinkedIn, Skype and WhatsApp. The collected data includes a session ID, application name, the duration of that session, the amount of data used in the session and the initial location where the session started.
Table 5 summarises the criteria used to evaluate papers, whilst Table 4 presents detailed information about the 40 papers.

Data publishing effect on citation count
Analysing the data summarised in Table 4 (if multiple papers were used to present one data set, both were included separately; papers in which data were only partially published (marked with ◐ in Table 4) were not included), we can conclude that the mean citation count of papers presenting a published data set is 34.9, whereas for papers presenting a data set that is not made publicly available the mean citation count is 8.7. Sadly, too few studies publish their data (8 studies, involving 11 papers), which makes it difficult to draw firm conclusions. Still, it can be observed that publishing data may help a paper become more popular, as can be seen with Eagle and Pentland [38]; such studies appear as outliers among the papers that do not publish collected behavioural data.
Looking at the ten most cited studies, we did not find any statistically significant evidence that a particular experimental setup (constrained versus unconstrained; asking the participants to perform a specific task versus using a device freely) has any effect on citation count.

Trends in use of mobile operating systems
In most of the 40 reviewed papers, devices running the Android operating system are used. These mobile devices include wearable devices - Al-Naffakh et al. [61] use the Microsoft Band 2, and Yang et al. [58] use the Samsung Galaxy Gear smartwatch. The use of the Android operating system is understandable, since it gives app developers greater freedom than iOS, where some concepts are difficult to accomplish [53]. The 'MIT Reality Mining Dataset' [38] uses Nokia 6600 phones running the Symbian Series 60 operating system (data were collected in the year 2004); the authors of [53,65] use iOS devices (an iPhone 5S and an iPhone 6, respectively). In [46], the authors recruited 18 subjects to collect data using their iPhones.
Two studies used different kinds of wearable devices: the authors of [70] used Sensoria smart socks, which are embedded with three proprietary textile pressure sensors and a 3-axis accelerometer. The authors of [42], in addition to an Android device, also use a digital sensor glove equipped with accelerometer and gyroscope sensor boards based on the Arduino platform, which provides finger movement information. The authors collected two datasets - one with only touch data from the mobile device, and one with touch data together with the finger movement data. They proved that the additional finger movement information improves the recognition rate.

Modalities used and subject count in presented databases
It can be summarised that different researchers' approaches to data collection are diverse: the duration of the data collection can span from ∼2.5 min, as in [65], to 19 months (in 'Mobile Device Application, Bluetooth, and Wi-Fi Usage Data as Behavioural Biometric Traits' [55]). The same applies to the number of subjects, ranging from 4 participants in [65] to 14,890 subjects in [43]. There are multiple notable examples with a subject count over 100, for example, Li and Bours [74] with 304 subjects, and Neal et al. [55] and Fridman et al. [69] with 200 subjects. Publicly available datasets with more than 100 subjects' data are the Teh et al. [75] dataset, published in the year 2018, containing 150 subjects' data, and the 'MIT Reality Mining Dataset' [38], the Hand Movement, Orientation, and Grasp (HMOG) dataset [14,60] and a dataset published by Abate et al. [59,60], each with 100 subjects' data.
How data are collected also differs considerably. The two extremes are entirely constrained, lab-based data collection and fully unconstrained collection, where the mobile devices are given to subjects to use as their own. There are in-between studies: the authors of [9] asked participants to use a self-made banking app. A similar approach is used by Kolly et al. [43] and De Luca et al. [41]. In [43] the authors developed a mobile game that was used by 14,890 subjects. In [41] the authors asked the subjects to input a password pattern consisting of five strokes over 21 data collection sessions. Thus, subjects were not limited in their choice of location/environment, but they still had to perform a specific task.
Regarding subjects, publication [78] is notable in that users as young as ten years old participated in its data collection.

How were experiments performed?
A summary of the papers shows that in behavioural biometrics, recognition algorithms and thresholds are not trained using separate datasets. One exception is [74], whose authors use separate subjects' data for crafting the recognition algorithms and for testing the developed biometric system. Data collected from 304 subjects are divided into two datasets - a development dataset containing 142 subjects' data for choosing the optimal weights, and an evaluation dataset with 162 subjects' data on which to report the results. Another exception is [43]. Its authors collected data from 14,890 subjects and test the recognition performance on small subject groups of 2 to 15 people. The authors also claim that 'in all experiments, we have separated training and test datasets to avoid over-fitting', but, unfortunately, they do not provide any additional details. The authors of [71] tested algorithms on two datasets, but it was not disclosed whether the thresholds and other parameters were fixed for both datasets or fine-tuned for each.
In other papers, even if some division was performed, all data subsets contained different data from the same subjects. Thus, the use of data in behavioural biometrics differs from that of traditional biometric modalities, e.g. facial recognition.
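A subject-disjoint division of the kind used in [74] can be sketched in a few lines; the sample representation (a list of records carrying a `subject` key) is an assumption for illustration.

```python
import random

def subject_disjoint_split(samples, dev_fraction=0.5, seed=0):
    """Split samples (dicts with a 'subject' key) so that the
    development and evaluation sets share NO subjects, rather than
    placing different data from the same subjects on both sides.
    """
    subjects = sorted({s["subject"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    cut = int(len(subjects) * dev_fraction)
    dev_subjects = set(subjects[:cut])
    dev = [s for s in samples if s["subject"] in dev_subjects]
    evaluation = [s for s in samples if s["subject"] not in dev_subjects]
    return dev, evaluation
```

The point of splitting by subject rather than by sample is that results on the evaluation set then measure how well the system generalises to people it has never seen.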
Across the papers, different biometric experiments are performed: verification, closed-set identification or open-set identification. In [53], verification experiments are conducted, and the data were even collected in a way that represents verification (a 1:1 scenario), by asking imposters to enter the genuine user's username.
In [40,47,51,63] the authors use open-set identification. Shi et al. [40] developed a data collection application which was posted in the Android Marketplace and was freely downloadable by anyone. Among the 276 users who downloaded the data collection application, the authors selected 50 users (who participated over a period of 12 days or longer) as the genuine users, and selected subjects who participated over a duration of 3 days or longer to serve as impostors (so-called 'unknown-unknowns' [81]). The authors of [51] recruited 23 genuine users, who used their own Android mobile devices, and selected 100 impostors. The paper by Inguanez and Ahmadi [63] offers an interesting study - the authors analyse text typed using an on-screen keyboard, focusing on touch features and time intervals. They collected a dataset consisting of 32 subjects. Of all touch data, 48% is generated by one subject, who represents the genuine device user; the remaining 31 users represent the imposters. In [47] the authors selected 28 out of 75 subjects as target users. None of the papers offering open-set identification experiments used ranking, which is understandable, as ranking would not make much sense when the task is to protect a single smartphone user rather than to find similarity in a database. Open-set identification has the benefit of allowing the addition of subjects who have only partially participated in the data collection. In this way, one can test the system more thoroughly, with a broader variety of imposters, since more imposters are scored against each genuine template. In the remainder of the papers, closed-set identification was performed.
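For concreteness, the rank-1 versions of the open-set metrics defined by Li and Jain [21] (Detection Identification Rate and False Alarm Rate) can be sketched as follows, assuming similarity scores where higher means more similar; the data layout is an assumption for illustration.

```python
def open_set_rates(genuine_scores, genuine_labels, gallery_labels,
                   impostor_scores, threshold):
    """Rank-1 Detection Identification Rate (DIR) and False Alarm Rate
    (FAR) for open-set identification.

    genuine_scores:  per-probe lists of similarity scores against the
                     gallery, for probes of enrolled subjects.
    genuine_labels:  true subject ID of each genuine probe.
    gallery_labels:  subject ID of each enrolled template.
    impostor_scores: per-probe score lists for unknown-unknown probes.
    """
    detected = 0
    for scores, true_label in zip(genuine_scores, genuine_labels):
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Detected and identified at rank 1: the top match is the true
        # identity AND its score clears the operational threshold.
        if gallery_labels[best] == true_label and scores[best] >= threshold:
            detected += 1
    dir_rank1 = detected / len(genuine_scores)
    # An unknown-unknown probe raises a false alarm when any enrolled
    # template scores above the threshold.
    alarms = sum(1 for scores in impostor_scores
                 if max(scores) >= threshold)
    far = alarms / len(impostor_scores)
    return dir_rank1, far
```

Sweeping the threshold trades DIR against FAR, which is why both are reported as functions of the operational threshold.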
Another aspect we were interested in is whether new subjects can be quickly enrolled in the system, in case such verification/identification systems are used in real-life scenarios.
We were pleased to find that in multiple cases, new subject enrolment is straightforward. Authors use simple similarity metrics - in [61] the authors use the average Euclidean distance between the reference template and the probe sample; this distance value represents the similarity between both samples. Zhao et al. [48] employ normalised cross-correlation and the L1 and L2 norms as similarity metrics. To compute similarity, Teh et al. [75] use score-level fusion of three simple similarity metrics - Gaussian Estimation, Z-Score and Standard Deviation Drift. The authors of [41,58,60] use Dynamic Time Warping (DTW). In [56], an HMM-based behavioural template training approach is presented. It does not require training data from subjects other than the owner of the mobile device and can be updated with new data over time.
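A distance-based verifier of the kind used in [61] is straightforward to state; the sketch below assumes fixed-length feature vectors and a threshold chosen by the system designer (both assumptions for illustration).

```python
import math

def avg_euclidean_distance(template_vectors, probe_vector):
    """Average Euclidean distance between a probe feature vector and
    the enrolled reference vectors; smaller means more similar.
    """
    return sum(math.dist(t, probe_vector)
               for t in template_vectors) / len(template_vectors)

def verify(template_vectors, probe_vector, threshold):
    # Accept the probe when it lies close enough to the template.
    return avg_euclidean_distance(template_vectors, probe_vector) <= threshold
```

Enrolling a new subject only requires storing their reference vectors, which is why such schemes score 'yes' on the easy-enrolment criterion.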
One-class classifiers are used, for example, in [52], where the authors performed experiments using three different one-class verifiers: scaled Manhattan (SM), scaled Euclidean (SE) and one-class SVM, since with a one-class approach there is no need for imposter data during the training stage. Zheng et al. [45] use a one-class classifier based on the nearest-neighbour distance to the training data. Yang et al. [78] found that the best-performing algorithm for recognising whether a particular touch operation belongs to the genuine user is a one-class SVM.
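A nearest-neighbour one-class scorer in the spirit of Zheng et al. [45] can be sketched in a few lines; the feature representation and threshold are assumptions for illustration, not the paper's exact formulation.

```python
import math

def nn_distance(train_genuine, probe):
    """One-class scoring: a probe's score is its distance to the
    nearest genuine training vector, so only the device owner's data
    is needed for training (no imposter data required).
    """
    return min(math.dist(x, probe) for x in train_genuine)

def accept(train_genuine, probe, threshold):
    # A small nearest-neighbour distance means the probe resembles
    # the genuine user's behaviour.
    return nn_distance(train_genuine, probe) <= threshold
```

Because the model is just the owner's own training vectors, enrolling a new subject amounts to collecting their data, which again makes enrolment easy.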
The paper by Kolly et al. [43] presents a behavioural biometrics dataset that was collected in an unusual way. The authors developed a mobile game for the Android platform that was freely downloadable by everyone. The game's user interface used different elements that users had to touch: buttons, lists and radio buttons. The mobile game was downloaded by 14,890 subjects. Overall, 1 million button touch events and over 2 million list touch events were generated while using the game app. Sadly, the authors do not present additional statistics about the collected data, such as the average interaction time with the app or how long the data collection continued. There were also few details about how the data was used. It was disclosed that subjects were not scored in an all-versus-all scenario. Instead, the recognition experiment was to recognise a person out of a set of 2 to 15 people, concluding that the recognition rate drops as the number of subjects increases. The authors highlight and measure three properties of touch: timing, pressure, and the position relative to the target.
Neal et al. [55] present a prolonged study - the authors collected 200 subjects' data over 19 months. The data collected includes application data, Bluetooth and Wi-Fi information. The authors present recognition rates for four data types - application, Bluetooth, Wi-Fi and combined. They offer results for consecutive-day, week, and month scenarios, concluding that a day may be too short a period in 'which to extract features' (at least for the features used).

Papers in detail
In this section, we will assess selected papers in detail. We have selected all eight papers that present public datasets.

• Abate et al. [59] -'Smartphone Enabled Person Authentication Based on Ear Biometrics and Arm Gesture' and [60] -'I-Am: Implicitly Authenticate Me Person Authentication on Mobile
Devices Through Ear Shape and Arm Gesture'. The authors present a novel approach for incorporating physical and behavioural biometric features into a biometric authentication system. They use accelerometer and gyroscope readings to record arm gestures and use the phone's secondary (frontal) camera to acquire images of a subject's ear. The idea is to capture these modalities when the subject is answering a phone call; arm dynamics should affect the smartphone motion pattern due to behavioural and anatomical characteristics. Arm gesture data in conjunction with the ear biometrics makes a compelling and never-before-seen combination. The authors present a publicly available (upon request by e-mail) database, consisting of 100 subjects' arm gesture and ear data. 30 of the 100 subjects participated in 3 data collection sessions, spread over a two-week time period.
• De Luca et al. [41]. The authors perform two sub-studies; the second study's database is made public (available upon request via e-mail). The authors pay great attention to the data collection process and to methods of making the study less biased. They investigate how to equip a shape-based unlocking scheme with behavioural authentication methods. This is done by exploiting the touch data generated while the subject enters the unlock shape. • The first study (data are not publicly available) is a time-limited, lab-based, two-session study, with sessions 2 days apart. In it, subjects were asked to unlock the data collection device by swipe (horizontal, vertical, vertical with two fingers and diagonal unlock swipe patterns). Subjects knew the purpose of the data collection and were asked to unlock the device in the same way on every occasion. The authors gave the data collecting device (an Android Nexus One mobile phone) to the subjects in test mode, for them to get familiar with the task they needed to perform.
In the actual data collection, the authors used two devices - one for data collection and the other for subject distraction. After every 20 unlock tries, subjects had to input a text message's text on the other phone. The second session was undertaken to minimise the subjects' habituation effects. • After this study, the authors performed a second experiment, reasoning that not enough touch data had been collected and that the subjects knew the original data collection's purpose. The follow-up study's data is publicly available. In this study, the authors investigate the possibility of adding additional protection to the password pattern by using behavioural biometric authentication. The password patterns consist of five strokes assigned randomly to the subjects. Subjects could practise their issued password pattern as many times as necessary before the actual data collection. After the practising stage was completed, subjects were asked to execute one authentication per day for 21 days. Subjects could perform authentication in an unconstrained environment. After completing the 21 sessions, participants were asked to attend a follow-up event, where the data was copied from the devices and imposter data recorded. The premise was that the imposter knows the genuine user's password pattern. Since the devices used by the participants varied, only the data generated by users who owned the same type of device were used as imposter data.
• Yang et al. [14] introduce a comprehensive behavioural biometrics dataset. Data is acquired from 100 subjects, and there are 9 channels of data - accelerometer, gyroscope, magnetometer, raw touch event, tap gesture, scale gesture, scroll gesture, fling gesture and key press on the virtual keyboard. Data were collected using an application developed by the authors for Android phones. Subjects were asked to perform three different types of task: document reading, text production and navigation on a map to locate a destination.
For each type of task, the volunteer either sits or walks while performing the task (thus there are two operational scenarios). One attempt lasts 5 to 15 min, and each volunteer is expected to perform 24 attempts (8 for each task type). In total, the dataset consists of 2 to 6 h of each subject's behavioural biometric data. The dataset is publicly available.
Sitová et al. [52] offer baseline experiments for this dataset. The authors propose a new set of behavioural biometric features for continuous authentication, calling them 'Hand Movement, Orientation, and Grasp (HMOG)'. They propose the use of a one-class classifier [94]. From the original 100-subject dataset, 10 subjects are excluded due to an insufficient amount of data. To improve authentication performance, the authors performed feature selection, feature transformation with principal component analysis, and outlier removal. For each verifier, HMOG features were selected separately; for this, the authors used 10-fold cross-validation.
Although the terms 'training' and 'testing' are used, in the paper they refer to enrolment template and probe data (e.g. 'we used all training vectors to construct the template'; 'we created authentication vectors by averaging test vectors'; 'authentication vector was matched against the template of the same user'). This suggests that an algorithm's ability to generalise to unseen subjects is not investigated (for example by dividing collected data into training, validation and test subsets, training the model on one subset and acquiring the recognition rates on another); the system is tuned for a particular set of subjects.

General data protection regulation
The GDPR [33] is a European Union (EU) legal requirement for data protection. GDPR enables citizens of the EU to control their data. It became enforceable on 25 May 2018 ( [33], Article 99), after a two-year transition period, replacing the 1995 Data Protection Directive.
The GDPR Article 4.14 defines biometric data as 'personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person, such as facial images or dactyloscopic (fingerprint) data'. This broad definition implicitly acknowledges that biometric technology will continue to evolve [95]. The definition recognises traditional biometric features as well as behavioural biometrics. The extent of the possible behavioural biometric features is not clear, since the definition raises the question of what unique identification implies: is the intended use of biometrics for identification/authentication enough to consider it unique identification, or must a certain level of precision be reached before the biometric feature in question is assumed to deliver unique identification?
The significant change regarding biometrics, when the existing Data Protection Directive is compared with the GDPR, is that the GDPR recognises biometric data as a subset of sensitive personal data [95,96]. Article 9.1 of the new regulation states that processing of such data is prohibited. Fortunately, there are exceptions for processing 'special categories of personal data'. The general exception applies if the 'data subject has given explicit consent to the processing of those personal data for one or more specified purposes' (Article 9.2a). This means that biometrics can still be used in various kinds of applications, but a user's consent is mandatory and special care is necessary to collect, store and use the data. Use of special category data is still allowed for scientific research purposes (Article 9.2j).
Data processing is defined (Article 4.2) as 'any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction'. The definition means that even the storage of such data falls under the GDPR and requires special care; this also applies to the scientific community. The greatest concern for the scientific community arises from the distinct care required with special category data. GDPR Article 28.2 says that the processor (defined as a 'natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller', GDPR Article 4.8) shall not 'engage another processor without prior specific or general written authorisation of the controller'. In other words, this mandates a strict system for accessing and using data inside scientific organisations, including sharing the data across researchers and organisations.
GDPR Article 17 defines the Right To Erasure (or Right To Be Forgotten). The data subject can, at any time, withdraw his or her consent, and their data must then be deleted. GDPR Article 17.2 states that 'where the controller has made the personal data public and is obliged (…) to erase the personal data, the controller (…) shall take reasonable steps, including technical measures, to inform controllers which are processing the personal data that the data subject has requested the erasure by such controllers of any links to, or copy or replication of, those personal data.' This implies that anyone at any time can ask to withdraw their biometric data from the database. If the data are published, the institution that published the data has to ask the third parties to delete the subject's data in question, but GDPR Article 17.2 releases the original authors from liability if a third party does not cooperate. Sharing the data would still mandate a written authorisation, as stated in Article 28.2.
GDPR Article 17.3.d states that the Right To Be Forgotten does not apply to the extent that processing is necessary 'for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes in accordance with Article 89.1', which comments on the necessary technical and organisational measures that need to be in place. Article 89.1 also states that 'Those measures may include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.' The Right To Be Forgotten does not necessarily apply to biometric databases used for scientific purposes, but it is not entirely apparent what the last sentence implies: is data subject pseudonymisation enough, or should the biometric data be distorted so as not to permit identification?
GDPR Article 25 defines 'Data protection by design and by default'. The article states that the data controller shall 'implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects'. As with the previous point, it is not apparent how extensive these measures should be.
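As an illustration only (the GDPR does not prescribe any particular mechanism), keyed pseudonymisation of subject identifiers might look as follows; whether such a measure would be sufficient for biometric data is precisely the open question discussed above.

```python
import hmac
import hashlib

def pseudonymise(subject_id, secret_key):
    """Keyed pseudonymisation sketch: replace a subject identifier
    with an HMAC of it. With the key stored separately (or destroyed),
    the published records no longer carry the direct identifier, while
    records of the same subject remain linkable to each other.
    """
    return hmac.new(secret_key, subject_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]
```

Note that this only pseudonymises the identifier; the behavioural data itself may still be identifying, which is where the distortion question arises.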
Summarising the GDPR with respect to biometrics, one can conclude that it tries to organise and regulate all aspects of a natural person's data usage. It has an enormous effect on biometric data and its use in the scientific community. One positive outcome is that these aspects are finally being regulated at the EU level. Unfortunately, there are currently many open questions and no known precedents or guidelines on how the necessary mechanisms (Right To Be Forgotten; data protection by design and by default; data processing; data sharing) should be implemented and maintained while collecting, using and sharing biometric data for scientific purposes.
Irwin [97] raises an interesting point: entities like to experiment with new technologies (e.g. biometrics) because they are accessible. It is possible that, because of the strict data-handling requirements set by the GDPR, this trend will change, and companies will evaluate more carefully whether they want to handle the special categories of personal data.

Conclusions
In [98] five levels of reproducibility are defined: Reviewable Research, Replicable Research, Confirmable Research, Auditable Research, and Open or Reproducible Research. Unfortunately, most of the papers published in the field of behavioural biometrics fall under the first level, 'Reviewable Research' - 'The descriptions of the research methods can be independently assessed and the results judged credibly. This includes both traditional peer review and community review and does not necessarily imply reproducibility' [98].
Although we cannot prove that there is a correlation between data publishing and citation count, we still think that it is beneficial, if not for the researchers themselves then for the whole research community, to publish databases and software for result reproduction. There is room for reproducible research and grand challenges (similar to the GBU challenge [32] in face recognition) in the field of behavioural biometrics.
There seems to be a need for standardisation of, and guidelines on, how biometric experiments should be performed, specifying how training is to be conducted and how results are to be computed.
The GDPR tries to regulate a somewhat grey area of personal data, including biometrics. Only time will tell how it will be interpreted by businesses, researchers and society at large. Regarding research in biometrics and the GDPR, hopefully the issues will become clearer and there will be precedents for implementing the necessary technical and organisational measures.

Acknowledgments
Thanks to Erwin Haasnoot, Tiago Freitas Pereira, Sushil Bhattacharjee, Elakkiya Ellavarason and Matthew Boakes for the valuable discussions. We also want to thank the anonymous reviewers for their insightful comments.
This work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement No. 675087.