Volume 8, Issue 1 p. 29-39
Research Article

Analysis of the effect of ageing, age, and other factors on iris recognition performance using NEXUS scores dataset

Dmitry O. Gorodnichy (Corresponding Author)
Science and Engineering Directorate, Canada Border Services Agency, Ottawa, Canada
Michael P. Chumakov
Business Application Services Directorate, Canada Border Services Agency, Ottawa, Canada
First published: 21 September 2018

Abstract

The historical NEXUS iris kiosk log dataset collected by the Canada Border Services Agency from 2003 to 2014 has become the focus of scientific attention due to its involvement in the iris ageing debate between the National Institute of Standards and Technology and the University of Notre Dame researchers. To facilitate this debate, this study provides additional details on how this dataset was collected and on its various properties and irregularities, and presents new results on the effect of ageing, age, and other factors on system performance, obtained using portions of the dataset that have not been previously analysed. In doing that, the importance of conducting subject-based performance analysis, as opposed to the traditional transaction-based analysis, is emphasised. The significance of factor effects is examined. Recommendations on further improvement of the technology are made.

1 Introduction

The biometric kiosks deployed by the Canada Border Services Agency (CBSA) since 2003 for the NEXUS trusted traveller programme [1] represent one of the longest-running deployments of iris recognition technology in automated border control to date. The performance log collected from these kiosks provides scientists and developers with a unique source of information that can be used to better understand and improve iris technology.

In 2012, a portion of anonymised NEXUS kiosk log data was shared with the National Institute of Standards and Technology (NIST) scientists for the IREX VI study aimed at investigating the effect of ageing on iris performance [2], where it was labelled the OPS-XING dataset. The results of the IREX VI study, published in 2013, attracted considerable attention from the scientific community and were actively discussed and contested by scientists from the University of Notre Dame [3–9].

One of the key arguments against the validity of the results obtained by the NIST scientists on the OPS-XING dataset is that, besides ageing, which was the main factor under investigation, the study examined only one additional factor affecting system performance – dilation. The log data related to other factors were not made available to the NIST scientists.

Another reason is that the dataset was obtained from an operational system, the full operation of which is not entirely known to external organisations. The dataset contained a number of irregularities, due to human and machine errors, which were not known to the investigators. A full explanation of how the system worked and of its performance objectives was not provided.

Finally, the evaluation methodology that was applied in the IREX VI study for analysing the effect of ageing was also put into question. The effect of habituation, although acknowledged by the NIST scientists, was not taken into account.

This paper addresses these three limitations of the IREX VI study. Detailed descriptions of the system operation (Section 2) and dataset irregularities (Section 3) are provided. An alternative methodology based on the use of subject-based metrics, instead of the transaction-based metrics conventionally used in previous studies, is described and is shown to be more appropriate for the application (Section 4). Finally, new results on the effect of age, ageing, and other factors, based on the new methodology and the previously unused portions of the dataset, are presented (Sections 5 and 6). Recommendations for the improvement of iris recognition performance based on the obtained results conclude this paper.

2 NEXUS system description

The CBSA commenced using iris recognition technology for automated authentication of travellers in airports in 2003, following the launch of a similar iris-enabled registered traveller programme in the United Kingdom. First, it was used for CANPASS-Air [10], a Canadian programme that provides pre-enroled, pre-cleared Canadians with expedited passage on arrival at airports for flights within Canada. Later, in 2004, the use of iris-enabled identification of travellers was extended to NEXUS-Air, a bi-national Canada–US programme for pre-approved low-risk travellers flying between Canada and the USA [1].

The expedited passage allows NEXUS members to proceed directly to the NEXUS self-serve kiosks, bypassing lengthy queues and interaction with customs border protection officers and border services officers. All kiosks are located in Canadian airports and are owned and controlled by the CBSA, with iris biometric data collected and stored by the CBSA. Kiosks used for travellers arriving in Canada are located in the primary inspection area. Kiosks used for travellers leaving Canada for the USA are located in a specially dedicated lane of the USA pre-clearance area. In total, 69 NEXUS kiosks have been installed in eight Canadian airports: Calgary, Edmonton, Halifax, Montreal, Ottawa, two terminals at Toronto Pearson International Airport, Toronto Billy Bishop (Toronto City Airport), Vancouver, and Winnipeg. Of these, 8 kiosks are used in enrolment centres and 22 kiosks are used at the USA pre-clearance. The same kiosks and iris database are used for both the NEXUS-Air and CANPASS-Air programmes. The number of CANPASS-Air users (about 2000 people by 2014), however, is significantly smaller than that of NEXUS-Air (over half a million in 2014).

Two designs (as shown in Fig. 1) were used for the first-generation NEXUS kiosks deployed from 2003 to 2014, the log of which comprises the OPS-XING dataset: one with a one-eye LG camera (deployed in 2003) and one with a two-eye Panasonic camera (deployed in 2007).

Fig. 1: Workflow and decision logic of the NEXUS kiosks of the first generation, the log of which was used in the NIST IREX VI iris ageing study. The system decision steps for match and rejection are shown in dark blue arrows. The user's procedural steps are shown in light orange arrows. Dashed orange arrows indicate optional steps for the users (best viewed online in colour)

2.1 System decision logic

At the enrolment stage, both irises of a traveller are photographed. Image quality (IQ) control is performed on the iris images. Only if their IQ metric is high will they be enroled into the system database. Owing to IQ control, in some cases only one eye can be enroled and in some rare cases neither eye can be enroled. Travellers also have the choice of opting out of enroling their iris images. For travellers enroling the iris, instructions are provided on how to use the kiosks, among which is the recommendation to remove eye glasses and contact lenses of any type. However, it is not known how closely these recommendations are followed.

At the time of crossing the border, referred to as the passage stage, the system is configured to search for the identity of the captured eyes using a 1-to-first search following the decision tree shown in Fig. 1. Once the system captures images of a person's eyes, it tries to authenticate the person using the left eye only. If the left eye is not matched, the right eye is used. In both cases, the match is performed against all images (i.e. both left and right images) stored in the database until the first image with a matching score below the threshold is found. This is because the first generation of NEXUS kiosks used single-eye iris cameras, which captured an eye of a person without knowing whether it was a left or right eye.
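As an illustration, the 1-to-first search can be sketched in R as follows. This is a reconstruction of the logic described above rather than the vendor's implementation; the function name, the score_fn argument, and the gallery fields ($template, $fake_id) are assumptions introduced for the example.

```r
# Illustrative sketch of the 1-to-first search described above (not the
# vendor's code). 'gallery' is assumed to be a list of entries with fields
# $template and $fake_id; 'score_fn' returns the HD between a probe image
# and a stored template.
one_to_first_search <- function(left_probe, right_probe, gallery, threshold, score_fn) {
  for (probe in list(left_probe, right_probe)) {     # left eye first, then right eye
    if (is.null(probe)) next                         # the eye image may have failed IQ control
    for (entry in gallery) {                         # both left and right templates are searched
      if (score_fn(probe, entry$template) < threshold) {
        return(entry$fake_id)                        # accept on the first score below THD
      }
    }
  }
  NA                                                 # no match found: this capture attempt is rejected
}
```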

When a traveller is rejected by the system (which happens for one of two reasons: either the IQ of the live image is poor or no matching image is found in the database), s/he is asked to try again, with a total of three attempts allowed in a single passage session with the kiosk. When a traveller is accepted, her/his attempt number at the given session is recorded.

A passage session ends either because of the traveller's inactivity or because the maximum number of capture attempts is reached, after which the system resets to the initial state with the 'Welcome. Please choose your language' message. Travellers who are not recognised within a single session receive the 'Please visit Special Services Counter' message. At the same time, they are also allowed to initiate an additional passage session using the same or a different kiosk, which they can do as many times as they want. Similarly, they are also allowed to proceed to the Special Services Counter at any time if they experience a problem with the kiosk.

It is possible that some travellers, particularly those who experienced rejection problems in the past, have proceeded directly to the Special Services Counter without initiating a single session with the kiosks. There is no data left in the system log about these travellers. The data from travellers who used the system but were rejected were also not logged. This presents a critical limitation of the OPS-XING dataset made from the historical NEXUS log data. By design, this dataset is biased toward better-performing users, as it contains mainly the data from travellers who did not experience problems with the system and does not contain any rejected transactions. Nevertheless, even with this limitation, this dataset presents a unique and very valuable source for the investigation of iris biometrics properties and limitations, specifically related to age and ageing, which becomes particularly important now with the iris modality becoming increasingly used in many government and United Nations programmes [11, 12] and the ongoing debate related to the tolerance of iris biometrics to ageing [3–9, 13–21].

2.2 Iris recognition algorithm: matching formula and threshold

NEXUS kiosks use Daugman's original iris recognition algorithm [22, 23]. The same version of the algorithm is used throughout the entire life cycle of the system. Since its deployment in the NEXUS system, iris technology has improved [24, 25], including more precise pupil and iris circle interpolation, better masking bits for occluding parts of the iris region affected by eyelashes, specular reflections, and boundary artefacts of hard contact lenses, and the use of both real and imaginary bits of the iris code. To our understanding, however, these later improvements of the algorithm are not implemented in the version that was used in the collection of the OPS-XING data.

Iris images are compared using the Hamming distance (HD), which is a dissimilarity score between the corresponding iris templates (IrisCodes). The score HD = 0 means a perfect match. A high score (i.e. HD > THD) results in a reject. The value of the threshold THD is automatically selected by the algorithm based on the theoretical prediction of the false accept rate for a given number of entries in the database, decreasing slightly every year as the number of enroled NEXUS members grew: from 0.282672 in 2006 (when the logging of the system commenced) to 0.271534 in 2014 (when the logging finished).

The HD is computed in two steps. First, the raw HD (HDRAW) is computed as the fraction of bits that disagree between two irises. Then, the normalised HD (HDNORM) is computed from HDRAW following the normalisation rule that gives less weight to comparisons performed on heavily occluded irises, using the following formula:

\[ \mathrm{HD}_{\mathrm{NORM}} = 0.5 - (0.5 - \mathrm{HD}_{\mathrm{RAW}})\,\sqrt{N_{\mathrm{bits}}/\langle N_{\mathrm{bits}}\rangle} \qquad (1) \]

where Nbits is the number of bits used in the comparison and <Nbits> is a vendor-defined constant equal to 911, which, according to the original algorithm [23], represents the average number of bits compared.
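For illustration, the two-step HD computation can be sketched as below. The bitwise details of the vendor's matcher are not available, so the raw-HD helper and its logical-vector inputs are assumptions; the normalisation step follows (1).

```r
# Illustrative sketch of the two-step HD computation (assumptions: iris codes
# and the validity mask are logical vectors; the vendor's exact implementation
# is not available).
hd_raw <- function(code_a, code_b, mask) {
  sum(xor(code_a, code_b) & mask) / sum(mask)        # fraction of disagreeing bits among compared bits
}
hd_norm <- function(hd, n_bits, n_bits_avg = 911) {  # 911 = vendor constant <Nbits>
  0.5 - (0.5 - hd) * sqrt(n_bits / n_bits_avg)       # normalisation step (1)
}
```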

Fig. 2a shows the HDNORM, HDRAW, and Nbits score distributions in the OPS-XING dataset. It is noted that, in contrast to the HDNORM score distributions, the HDRAW score distributions have far fewer visible artefacts due to score truncation and censoring, and are unimodal (i.e. have only one maximum). This makes an analysis of HDRAW scores using statistical techniques easier.

Fig. 2: Distribution of the number of bits compared, HDRAW, and HDNORM scores in the OPS-XING dataset (best viewed online in colour). (a) Left-eye scores (solid green) versus right-eye scores (dashed red), (b) Histograms of HDNORM scores for different numbers of attempts. Minimum, 25, 50, and 75% quartiles, and maximum values are shown at the bottom of each histogram

We also note that the actual average value of Nbits is 954, which is higher than <Nbits> constant used in the normalisation formula (1).

2.2.1 Observation related to score normalisation

Our analysis has called into question the value of the normalisation step (1) for the NEXUS application in a number of ways. Besides producing non-unimodally distributed values (as seen in Fig. 2a), which complicates modelling the system performance using statistical methods, it also contributes to higher false reject rates for travellers with occluded irises.

A number of ways are seen to further improve the matching formula for the application. This includes post-processing score normalisation described in [26], the use of conditional normalisation formula (conditioned on additional IQ metrics such as contrast and/or person's age), which is analysed further in this paper, or not applying the normalisation formula (1) at all. These, however, are outside of the scope of this paper.

In this paper, it is the importance of analysing HDRAW in combination with IQ metrics, as opposed to analysing HDNORM scores only as done in the past, that is emphasised.

2.2.2 Observation related to the correlation between matching score and number of attempts

In our analysis, in addition to the matching score (HDRAW and HDNORM), we also use the number of recorded attempts (# Attempts) as one of the important kiosk performance metrics. There exists a subtle relationship between the two, as illustrated in Fig. 2b, which shows the distribution of matching scores for a different number of attempts and the corresponding five-number statistics for HDNORM.

On one hand, it is seen that the larger the number of attempts, the larger (worse) the matching scores, as reported in [7]. On the other hand, a higher matching score does not necessarily mean that a person gets rejected (as long as the matching score is less than the threshold, the person is accepted). Similarly, recognition from a single attempt does not necessarily mean that a person had not already tried and been rejected multiple times during other sessions that were not logged. Therefore, using both metrics in the analysis provides richer complementary evidence for the results obtained.

3 OPS-XING dataset

The OPS-XING dataset, a part of which was used in the IREX VI evaluation by NIST [2, 4] and the evaluations conducted by UND [5, 6, 9], consists of over a quarter of a billion matching and IQ metric values that were recorded during enrolment and passage transactions by the NEXUS system. These metrics are listed in Table 1. The metrics that were shared with NIST and UND and used in the previous research [2–9] are marked in bold.

Table 1. Metrics recorded in the OPS-XING dataset

At enrolment:
  FAKE_ID, age, EYE (L – left or R – right), CAMERA ('L' for old LG camera, 'B' for new Panasonic camera)
  ENROLLMENT_DATE (month, year, time of the day)
  IQ metrics:
    related to localisation accuracy – iris centre x, iris centre y, iris radius, pupil centre x, pupil centre y
    related to dilation – pupil radius, pupil–iris ratio (the same as DILATION)
    related to image contrast – iris sclera contrast, iris pupil contrast, average iris intensity, iris texture energy
    related to occlusion – iris area, number of bits encoded

At passage:
  FAKE_ID, EYE used, TRANSACTION_DATE (month, year, time of the day)
  ELAPSED_TIME (the number of days since enrolment), HDNORM, HDRAW
  CAPTURE_NUMBER_WITHIN_PA (capture-and-recognise attempts)
  FAKE_KIOSK_ID (OK_01, …, OK_69), THD (matching threshold), MATCHING_MODE (two-eye pilot, regular one-eye operation)
  IQ metrics: same as at enrolment, plus the number of bits encoded and the number of bits compared

  • Metrics used in the previous work [2–9] are marked in bold. The distributions of metric values are shown in Fig. 3.

In total, there were 1,370,890 enrolment transactions [recorded from September 2003 to May 2014, from 705,553 travellers – most (662,220) done with dual-eye Panasonic cameras deployed in 2007, others done by single-eye LG cameras] and over 10,000,000 passage transactions (recorded from October 2007 to May 2014, from 467,314 travellers – all done with dual-eye Panasonic cameras). Distribution of enrolment and passage transactions over the years is shown in Fig. 3. Seasonal patterns in passage data can be noted.

Fig. 3: Number of enrolment (left) and passage (right) transactions per month

3.1 Aberrations in data

The OPS-XING dataset contains a number of abnormal entries that are not described by the system logic. Mostly caused by human error or temporary experimentation with the system (either by kiosk users or programmers), such aberrations in the data may pose additional challenges for external researchers processing this historical dataset in understanding the technology and arriving at correct conclusions. These data aberrations are described below. They needed to be removed or taken into account prior to conducting the analysis:
  • HD scores higher than the threshold: There are 351 passage transactions at two kiosks, all involving a right eye, which have a matching score higher than the threshold (i.e. HDNORM > THD). These are from the pilot conducted in 2012, in which the first eye is recognised but the second eye is verified 1:1.

  • More than three attempts: There are 1495 passage events in which there were more than three attempts. These are due to some users unexpectedly interrupting the kiosk in the middle of its operation.

  • Enrolments of left and right eyes on different days: Some (14) travellers have their eyes enroled in different years. When there was a problem enroling an eye image, the older eye image was often kept.

  • Multiple enrolments (dilation scores) at enrolment: Some (1405) travellers have multiple IQ data (including the dilation score) at enrolment transactions for the same eye, due to several attempts being made to enrol the iris.

  • Other issues: As mentioned above, the system performs a 1-to-first search. In doing that, a new probe iris image, which can be either from the left eye (the default eye) or the right eye (when the left eye did not find a match), is compared with all iris images stored in the enrolment database, including left and right eye images and sometimes both old and new name records of a person. This results in some unknown number of zero-effort false match scores being recorded as part of the dataset.

A filtered version of the OPS-XING dataset with data aberrations marked or removed (other than the unknown number of false match scores) has been prepared and used in our analysis.

4 Methodology for analysing the performance of NEXUS kiosks

This section presents one of the key results of our paper, which shows that the performance of the system varies considerably among the subjects and that subjects who experience problems with the system use it much less than the others. On the basis of this finding, a methodology for subject-based performance analysis is developed to allow one to investigate the factors affecting the system performance. The taxonomy for categorising such factors is established.

4.1 Variation of performance among subjects

As mentioned in Section 2, the OPS-XING dataset does not contain the data about travellers who were rejected by the kiosks. Therefore, the following two metrics are used to estimate the number of travellers who have experienced difficulty in using the system, knowing that some of them used the system only once and some used it more than 100 times, with 942 passages being the largest number of passages for a subject:
  • Metric 1: Traveller's average number of Attempts is higher than 1.5 (i.e. s/he is over 50% likely to be rejected by the system on the first attempt).

  • Metric 2: Traveller's minimum matching score HDNORM is higher than 0.2.

The first metric relates directly to the border wait time, which is a performance metric that the agency needs to minimise. This metric, however, may not always show the actual number of attempts taken by a traveller (e.g. as described in Section 2, when a traveller tries different kiosks or different sessions at the same kiosk, only the number of attempts from the last session is recorded). The second metric addresses this issue, as it allows one to estimate the difficulty of using the kiosk in situations where the number of recorded attempts is the same.

As highlighted in Section 2.2.2, the HDNORM metric correlates with the Attempts metric (the more attempts it takes a traveller to be recognised, the worse the HD value). This allows one to use the HDNORM metric as a proxy performance metric for kiosk performance instead of Attempts.
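A minimal sketch of how these two per-traveller 'difficulty' flags can be computed from the passage log is given below; the data frame name is illustrative and the column names follow Table 1.

```r
# Sketch of the per-traveller 'difficulty' flags (Metrics 1 and 2 above);
# 'passage_df' and its column names (as in Table 1) are illustrative.
avg_attempts <- tapply(passage_df$CAPTURE_NUMBER_WITHIN_PA, passage_df$FAKE_ID, mean)
min_hdnorm   <- tapply(passage_df$HDNORM,                   passage_df$FAKE_ID, min)
difficulty   <- (avg_attempts > 1.5) | (min_hdnorm > 0.2)   # Metric 1 or Metric 2 triggered
mean(difficulty)                                            # fraction of travellers flagged
```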

Table 2 shows the number of travellers who used the system a different number of times and the percentage of them who experienced 'difficulty' using it, where the difficulty is defined using the two metrics described above.

Table 2. Number of travellers as a function of the number of times they used the system, and the percentage among them experiencing 'difficulty'

Times used the system          2+        4+        8+        16+       32+      64+      128+
Number of travellers           383,463   287,472   196,573   119,538   61,332   24,383   6,530
% having HDNORM > 0.2          4.2       2.4       1.3       0.8       0.6      0.3      0.2
% having Attempts > 1.5        3.4       2.4       1.3       0.6       0.3      0.12     0.06

  • 'Difficulty' is measured by a high minimum HD score (HDNORM > 0.2) and a high average number of attempts (Attempts > 1.5). The temporal information (i.e. whether a traveller used the system over a short or long period of time) is not used. More details are provided in [27].

It is observed that travellers who experience 'difficulty' in using the system use it much less than those who do not. Therefore, any performance evaluation results obtained by aggregating transaction metrics, such as those obtained in the previous analyses of the OPS-XING dataset [2–9], will be highly skewed toward 'better'-performing subjects. To provide an objective picture of the system performance quality, subject-based performance analysis is required.

In contrast to the transaction-based analysis, established by the International Organization for Standardization (ISO) and currently used by the industry [28], which answers the question 'How many times did the system reject a person?', the subject-based analysis answers the question 'How many persons were rejected by the system?'.

4.2 Subject-based performance analysis

Subject-based variation of biometric performance is well studied for the voice and face modalities [29, 30]. It has been much less documented and analysed for the iris modality. The first major evidence of subject-based variation of biometric performance in iris systems was presented in our earlier work in 2011 [31] and has since become an important guiding principle for us in performing evaluations of biometric systems.

As a general rule for conducting subject-based analysis, the following approach is used. All performance metrics X that are computed for a population are computed using the averages obtained separately for each individual (2), as opposed to using averages computed over all transactions of the entire population (3), as is done in the transaction-based analysis.
\[ \bar{X}_{\mathrm{subject}} = \frac{1}{S}\sum_{s=1}^{S}\left(\frac{1}{T_s}\sum_{t=1}^{T_s} X_{s,t}\right) \qquad (2) \]
\[ \bar{X}_{\mathrm{transaction}} = \frac{1}{\sum_{s=1}^{S} T_s}\sum_{s=1}^{S}\sum_{t=1}^{T_s} X_{s,t} \qquad (3) \]
where S is the number of subjects, T_s is the number of transactions recorded for subject s, and X_{s,t} is the value of metric X for transaction t of subject s.
In general, one should expect transaction-based metrics to differ from subject-based ones, being skewed toward the average metrics of the most frequently observed subjects. By conducting a subject-based analysis, one is able to better decipher the factors that negatively affect the system performance. These factors are categorised and analysed next.
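As an illustration, the two averaging schemes (2) and (3) can be computed as follows, assuming a data frame with one row per passage transaction and HDNORM as the metric X; the data frame name is illustrative.

```r
# Sketch of subject-based (2) versus transaction-based (3) averaging of a
# metric X (here HDNORM); 'log_df' holds one row per transaction, with
# columns FAKE_ID and HDNORM as in Table 1.
transaction_based <- mean(log_df$HDNORM)                       # equation (3)
per_subject       <- tapply(log_df$HDNORM, log_df$FAKE_ID, mean)
subject_based     <- mean(per_subject)                         # equation (2)
```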

4.3 Factor categorisation

From an operational perspective, it is important to distinguish factors by their prime cause. Using the approach that we first developed for video surveillance applications [32], the factors that affect biometric systems performance are classified into one of three types according to the ‘technology–process–subject’ factor triangle:
  • Technology-related factors: This group of factors relates to the general limitations of the technology. They affect all users regardless of the process and user-specific characteristics. Any improvement of the system performance due to these factors requires contacting a vendor and potentially replacing the technology. Ageing (i.e. deterioration of the technology performance with time) is an example of a technology-related factor.

  • Process-related factors: The second group of factors relates to the conditions in which the technology is used. It is normally the responsibility of the organisation deploying the technology to make sure that the technology is used under the conditions where it works best. Kiosk location is a prime source of process-related factors, potentially leading to worse IQ and performance for all users.

  • Subject-related factors: The last group of factors relates to particular characteristics of a person or group of people that make some travellers more prone to problems in operating biometric systems than others. This includes a person's gender, age, and other subject-specific physiological and behavioural peculiarities such as eye colour, size or shape of the pupil, and medical conditions including wearing contact lenses. If such factors are detected, they can be used to improve the performance of the system by either alerting a user (e.g. by automatically detecting contact lenses and asking the user to remove them) or by allowing different thresholds for users of different groups (e.g. for the elderly).

In the following, the effect of these three groups of factors is examined, using the enrolment data and then using the passage data.

5 Analysis of enrolment data

Enrolment data allows one to examine subject-related factors, specifically the effect of age on IQ. It does not require subject-based metrics because all enroled travellers have exactly one enrolment transaction.

5.1 Young and elderly have worse IQ and are harder to enrol

Fig. 4 shows the number of NEXUS members who enroled irises and the percentage among them who could enrol one iris only, for each age: from newborns to 100-year-old people. A dip at 19–20 years of age is explained by the NEXUS programme rules, whereby children 18 and under can enrol at no charge with their parents.

Fig. 4: Number of travellers by age at enrolment and passage. The left image shows the number of travellers who enroled iris (in blue) and the percentage among them who were able to enrol one iris only (in red). The right image shows boxplots summarising the number of passages for each age. The inset shows 95% truncated boxplots (i.e. with 5% of outliers removed) (best viewed online in colour)

Two important observations are made. First, it is seen that the majority of enroled travellers are between 30 and 60 years old, and almost all of them (>98%) were able to enrol both irises.

Second, it is seen that the ability to enrol both eyes is much worse for young travellers and diminishes steadily with age for older travellers. This is an indication that the IQ of these age groups is worse than that of the middle-aged group. This conjecture is validated next.

To remove the factors due to camera quality, only the data from travellers enroled with new ('B') cameras are used. These data account for over 95% of the dataset. The boxplots for dilation, contrast, and number of bits encoded at enrolment in these data, for each age, are shown in Fig. 5.

Fig. 5: Variation of IQ and matching scores by age. Boxplots on the top show the distribution of dilation, contrast, and number of bits encoded/compared scores for each age – at enrolment (left) and passage (right). Boxplots at the bottom show the distribution of HDNORM and HDRAW scores at passage. Box width is proportional to the population size. Data from new 'B' cameras are used (best viewed online in colour)

It is observed that dilation monotonically decreases with age for adults, which supports the conclusions from [2]. However, it is also observed that other IQ metrics also slightly decrease with age for adults. The decrease of all IQ metrics for young people is also observed. This explains the lower number of successful iris enrolments for older and younger users.

To further examine the relationship between traveller's age and IQ metrics at enrolment, we plot in Fig. 6 the correlation of age and IQ metrics, and the distribution of age and IQ metric scores at enrolment – for cases where both irises were captured versus those cases where only one iris was captured.
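A sketch of such a correlation analysis is given below; the data frame and IQ column names are illustrative stand-ins for the Table 1 metrics.

```r
# Sketch of the age-IQ correlation analysis behind Fig. 6; 'enrol_df' and the
# IQ column names are illustrative stand-ins for the Table 1 enrolment metrics.
iq_cols <- c("AGE", "DILATION", "IRIS_SCLERA_CONTRAST", "IRIS_PUPIL_CONTRAST",
             "IRIS_AREA", "NUMBER_OF_BITS_ENCODED")
round(cor(enrol_df[, iq_cols], use = "pairwise.complete.obs"), 2)   # correlation matrix incl. age
```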

Fig. 6: Analysis of scores at enrolment: relative distribution of age and IQ scores for 'two-eye' (solid green) versus 'one-eye' (dashed red) enrolments (shown at left); correlation of age and IQ scores (shown at right). Data from new 'B' cameras are used (best viewed online in colour)

The observation is that older (over 60 years) and younger (under 15 years) users are harder to enrol, i.e. have more ‘one eye only’ enrolments. Three distinct IQ metric groups are also observed – related to dilation (pupil–iris ratio), contrast, and openness, of which the dilation group correlates with age the most (at 0.53).

6 Analysis of passage data

6.1 Variation of performance by kiosk location

As pointed out by the UND researchers in [6, 9], the NEXUS system performance varies among airports. Using subject-based analysis we can now further quantify this observation while demonstrating the importance of applying such analysis for the NEXUS application.

Fig. 7 shows the performance of all kiosks measured by the average number of additional attempts (Attempts-1) and the average matching score (HDNORM), computed using transaction-based and subject-based metrics, sorted from worst to best. The average number of transactions per subject (T/S) is shown as well.

Fig. 7: Effect of kiosk location: performance of NEXUS kiosks measured by the average number of attempts (left) and the average matching score (right), using transaction-based (in red) and subject-based (in blue) metrics, sorted from best performing to worst performing. The average number of transactions per subject (T/S) is shown (in green). Kiosk numbers are obscured to protect airport identities (best viewed online in colour)

It is observed that some kiosks perform 10–20% better than others, according to both metrics. Furthermore, it is seen that performance reported using subject-based metrics is always worse than that reported using transaction-based metrics, sometimes by more than 30%. Kiosks with a higher transactions-per-subject ratio (T/S) report better averaged performance, which is not surprising taking into account the finding presented earlier that people who use the system more frequently tend to have better matching scores.

It is also observed that the variation in kiosk performance within the same airport and the same direction of border crossing is less than that across different airports or different directions of border crossing. We use this finding later when we need to minimise the effect of kiosk location on the system performance.

To further quantify the difference in performance due to the kiosk location, we apply a t-test [33] to the HDRAW scores measured at different kiosks. The application of the t-test is justified in this case, because we have over a thousand points measured at each kiosk and the distribution of HDRAW scores is unimodal, as highlighted earlier in Section 2.2 (Fig. 2). Fig. 8 shows the result. It shows the 95% confidence intervals for the difference in the kiosk average HDRAW score computed for two better-performing (in green) and two worst-performing (in red) kiosks. Kiosks are chosen so as to have different traffic densities (one has much higher traffic than the other). Results are obtained using both subject-based and transaction-based metrics. The HDRAW scores are shown in the greyed area, and the numbers of transactions and subjects (T/S) for each kiosk are shown on the margin. The 95% confidence intervals for the score difference are shown in the middle part of this table.
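A minimal sketch of this comparison is given below; the kiosk labels and the data frame name are illustrative, and the per-subject averaging implements the subject-based variant.

```r
# Sketch of the kiosk comparison: a t-test on per-subject average HDRAW scores
# from two kiosks (subject-based metrics). Kiosk labels follow the FAKE_KIOSK_ID
# convention of Table 1 and are illustrative.
per_subject <- aggregate(HDRAW ~ FAKE_ID + FAKE_KIOSK_ID, data = passage_df, FUN = mean)
a <- per_subject$HDRAW[per_subject$FAKE_KIOSK_ID == "OK_01"]
b <- per_subject$HDRAW[per_subject$FAKE_KIOSK_ID == "OK_02"]
t.test(a, b, conf.level = 0.95)   # 95% CI for the difference in mean HDRAW, as in Fig. 8
```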

Fig. 8: Difference in the average HDRAW score, computed for two better-performing kiosks (marked in green) and two worst-performing kiosks (marked in red) using subject-based (s-b) and transaction-based (t-b) metrics (best viewed online in colour)

It is observed that the difference in system performance due to different kiosk locations can be as high as 15%. This confirms that kiosk location is one of the most important factors affecting iris recognition performance.

6.2 Variation of performance by age

This section presents the main finding of our analysis related to the demographic bias of iris biometrics, i.e. that iris biometrics performs worse for certain age groups. The existence of a demographic bias in other biometric modalities (face and fingerprint) has been reported previously and has become the basis for the development of new ISO guidelines on mitigating such biases [34]. Nothing, however, has been reported so far on the existence of a demographic bias in iris systems.

By examining the passage statistics for each age (as shown in Fig. 5), it is noted that middle-aged travellers use the system much more often than young and elderly travellers. At the same time, as highlighted earlier (Section 5.1), middle-aged travellers have better quality enrolment images, and therefore should be expected to have better performance at passage. Hence, the subject-based analysis introduced in Section 4.2 needs to be applied in order to objectively measure the effect of age on the technology performance. This is done below. Meanwhile, knowing the high interest in using iris biometrics for humanitarian and national ID programmes [12], we can confirm (from our enrolment and passage age statistics) that iris biometrics is as successfully used by young children and youth as it is by the elderly.

Fig. 5 shows boxplots of IQ scores (dilation, contrast, number of bits compared) at passage. The bottom of the figure shows the boxplots of matching scores (HDNORM and HDRAW) for each age group in the OPS-XING dataset: from newborns to 99-year-old persons who have used the kiosks. Data are taken from all kiosks and all cameras.

As with the enrolment data, a variation of IQ scores among different age groups is observed. Increased (worse) matching scores for young and elderly travellers are also observed. In the following, we further quantify the variation of the system performance due to age and compare it with that due to other factors.

Fig. 9a plots the average HDNORM and dilation (DIL) scores as a function of AGE computed using generalised additive model (GAM) regression [33] for the three largest Canadian airports (Toronto Terminal 1, Vancouver, and Montreal). The subject-based analysis is conducted separately for each airport for travellers enroled with old ('L') and new ('B') cameras. The number of subjects at each airport for each camera is indicated on the top of each graph. The grey area shows a 95% confidence interval. Large grey areas for travellers of over 80 years of age indicate that there is insufficient data to reliably compute the function.
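A sketch of such a per-airport GAM fit is given below, using the 'mgcv' R package (referenced later in Section 6.4 for GAMMs); the data frame and the choice of smooth term are illustrative assumptions.

```r
# Sketch of the per-airport GAM fit behind Fig. 9a; 'airport_df' (passages at
# one airport for 'B'-camera enrolees) and the smooth term are illustrative.
library(mgcv)
fit <- gam(HDNORM ~ s(AGE), data = airport_df)
plot(fit, shade = TRUE)   # shaded band shows the 95% confidence interval, cf. Fig. 9a
```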

Fig. 9: Effect of age (best viewed online in colour). (a) Average HDNORM and DIL computed for subjects enroled with old ('L') and new ('B') cameras at the three largest airports using GAM regression, (b) Average HDNORM computed for subjects enroled with new ('B') cameras at different times of day: 0:00–8:00 (in red), 8:00–16:00 (in green), and 16:00–24:00 (in blue)

A clear drop in average matching scores (i.e. better performance) for middle-aged travellers is observed at each airport: from 0.18 (for those younger than 15 years and older than 80 years) to <0.14 for 40-year-old travellers. This is in contrast to average dilation, which monotonically decreases with age: from 0.55 (at 15 years of age) to 0.35 (at 80 years of age). This is an indication that dilation is not the only factor that contributes to the worsening of the matching score. Other IQ metrics are also likely affecting the result.

It is also noted that kiosks in Vancouver airport were relocated during the period of data collection, resulting in their improved performance (which was noted in [9]). This, however, did not affect the result related to the variation of system performance by age. It is also seen that the variation due to age is larger than that due to kiosk location.

6.3 Age versus time of day and time of year

The data used in the previous experiment are further split into three subsets, corresponding to three different times of day (morning, mid-day, and evening), using left-eye transaction data from travellers enroled with 'B' cameras. Fig. 9b shows the results for two airports. The bottom row shows results for kiosks in the USA pre-clearance area, the top row for kiosks in the arrival area.

A slight increase in matching scores for all ages at mid-day, i.e. during the brightest time of the day, is seen in both areas. This is consistent with earlier results suggesting that iris recognition produces poorer match scores when passage image acquisition takes place in strong sunlight, and is an indication that kiosks in those two airports are likely located where a large amount of sunlight comes through the windows. Critically, however, it is seen that the performance variation due to time-of-day differences is much less than that due to age differences.

In another experiment, a consistent increase in HDNORM during December–January was also observed, supporting an earlier such finding in [9]. In contrast to [9], however, where such variation is explained by the effect of season on eye dilation, we are inclined to think that this is most likely due to subject-based performance variation, as more people travel and use the technology during the holiday season, including those who do not travel often and who (based on the results presented above) are at a higher risk of experiencing difficulty in using the system. In either case, the effect of the time of year is also seen to be much less than that of age and kiosk location.

6.4 Age versus ageing

To address the debate between NIST and UND researchers related to the effect of ageing, we compare this effect with that of age and other factors. To do that, we apply generalised additive mixed models (GAMM) regression [35] to compute average HDRAW scores as a function of age (AGE) and ageing (measured by the number of days since enrolment, ELAPSED_TIME) using left-eye passage data from all kiosks for all users enroled with ‘B’ cameras.

In contrast to the GAMs used earlier (Fig. 9), GAMMs allow one to include random effects, which in this case are kiosk location (FAKE_KIOSK_ID) and person's physiology (FAKE_ID), in addition to the fixed effects (AGE and ELAPSED_TIME). The 'gamm' function from the 'mgcv' R package is used for this purpose [35].
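A sketch of such a GAMM fit is given below; the exact smooth and random-effect terms of the study's model are assumptions, while the column names follow Table 1.

```r
# Sketch of the GAMM described above, with kiosk and traveller identifiers as
# random intercepts; the exact terms used in the study are assumptions.
library(mgcv)
m <- gamm(HDRAW ~ s(AGE) + s(ELAPSED_TIME),
          random = list(FAKE_KIOSK_ID = ~1, FAKE_ID = ~1),
          data = passage_df)
summary(m$gam)
# Expected HDRAW over an age-ageing grid (5-year / 100-day steps) can then be
# obtained with predict(m$gam, newdata = expand.grid(...)), as used for Fig. 10a.
```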

Once the predictive model is computed, it is applied to compute the expected average HDRAW scores for a grid of age–ageing values, where age is incremented by 5 years, and ageing (ELAPSED_TIME) by 100 days. The result is shown in Fig. 10a. The following observations are made.

Fig. 10: Effect of ageing (best viewed online in colour). (a) Average HDRAW as a function of AGE and the number of days since enrolment (ELAPSED_TIME) computed using GAMM regression. Kiosk fake id (FAKE_KIOSK_ID) and traveller's fake id (FAKE_ID) are treated as random effects, whereas AGE and ELAPSED_TIME are treated as fixed effects. (b) Average HDNORM as a function of the number of months since enrolment (ELAPSED_MONTH) computed using GAM regression at different times of day of the passage. Data from a single airport, where variation due to kiosk location is small, are used

First, for all ELAPSED_TIME groups (i.e. along the horizontal axis), the relationship between the matching score and age is exactly the same as found earlier (as seen in Fig. 9): the matching score is the lowest at 35–40 years of age and monotonically increases as one moves further away (left or right) from the middle.

Second, for most age groups (i.e. along the vertical axis), ageing has no negative effect on matching scores. It is only for the 55–65 age group that slightly increased (worse) matching scores with ageing are observed. Critically, the variation in matching score due to ageing is much less than that due to the age difference.

To explain the observed improvement of the HDNORM score with ELAPSED_TIME, we offer the following four reasons: (i) habituation (travellers learn how to make the machine work better for them, e.g. by opening their eyes wider), (ii) the improved positioning of the kiosks (as in Vancouver, found in [9]), (iii) the use of transaction-based metrics (which show 'better' results for travellers who use the system more often), and (iv) the decrease over time in the threshold for recording a match score, THD, which means that subjects who use the system over a period of years are able to record a higher score in the earlier years of using the system than they are able to record in the later years.

To place the effect of ageing in context with other factors, we compare it with that of time of day. Fig. 10b shows the average HDNORM computed using GAM regression on the data taken from a single airport (which has little variation among its kiosks) as a function of ageing (ELAPSED_MONTH) for four different times of day (morning, mid-day, evening, and night). It is observed that the effect of ageing is less than that of time of day of passage transaction, which in turn (as discussed earlier) is less than the effect of age and kiosk location.

To conclude, taking into account the results from previous sections, where it was shown that age correlates with IQ metrics, particularly with dilation and (to a lesser degree) with contrast, it can be stated that the 'ageing problem' is not about 'whether a biometric modality changes in time' (yes, it does) but rather about 'whether the technology can deal with the changes due to ageing'. Evidently, iris biometrics can deal with changes due to ageing quite well, at least over the range of years analysed in this study (which is 7 years). At the same time, it is seen that, as with all other biometric modalities, its performance is affected by sensor quality, capture conditions (lighting), and also by a person's age (when comparing technology performance for travellers of different age groups).

6.5 Factor significance

Once the effect of certain factors (explanatory variables) on the performance of the system (response variables) is hypothesised through the observation of descriptive statistics (Figs. 9 and 10), it is possible to apply analysis of variance to obtain the values of statistical significance for each factor and their combinations [33]. This is done below, where a combination of subject-related (age), technology-related (ageing), and process-related (time of day and time of year) factors is examined for statistical significance.

To avoid the variation due to kiosk location, the data from kiosks in a single airport (where variation due to kiosk location is small) are used. Age is presented as a nine-level factor (each level representing a decade), ageing is an eight-level factor (each level representing a year since enrolment), time of day and time of year are presented as four-level factors (as done in previous sections). Table 3 shows the result, as produced by running the analysis of variance in R language [33]. The plots showing 95% confidence level intervals on matching score differences for all pairwise combinations of factors values are presented in Fig. 11.

Table 3. Analysis of variance in matching scores due to various factors

                          Df   Sum Sq   Mean Sq   F value   Pr(>F)
AGE                        8   16       2.015     476.87    <0.0000000000000002 ***
ELAPSED_YEAR               7    0       0.024       5.76    0.0000010803 ***
timeOfYear                 3    0       0.043      10.07    0.0000012431 ***
timeOfDay                  3    1       0.257      60.91    <0.0000000000000002 ***
AGE:ELAPSED_YEAR          46    0       0.007       1.73    0.0015 **
timeOfYear:timeOfDay       9    0       0.026       6.19    0.0000000091 ***

  • The last column shows the probability Pr(>F) of having the same mean output value despite the change in input factor value.
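As a hedged illustration, an analysis of variance of this kind can be produced with base R's aov() as sketched below; the factor binning and the column names (including the assumed HOUR field and the single_airport_kiosks vector) are illustrative, not the study's exact code.

```r
# Sketch of the analysis of variance behind Table 3 using base R's aov();
# the factor binning and column names are illustrative assumptions.
d <- subset(passage_df, FAKE_KIOSK_ID %in% single_airport_kiosks)    # one airport only
d$AGE_F          <- cut(d$AGE, breaks = seq(0, 90, by = 10))          # nine decade levels
d$ELAPSED_YEAR_F <- factor(pmin(floor(d$ELAPSED_TIME / 365), 7))      # eight yearly levels
d$timeOfDay      <- cut(d$HOUR, breaks = c(0, 6, 12, 18, 24), include.lowest = TRUE)
d$timeOfYear     <- factor(quarters(as.Date(d$TRANSACTION_DATE)))     # four seasonal levels
summary(aov(HDRAW ~ AGE_F * ELAPSED_YEAR_F + timeOfYear * timeOfDay, data = d))
```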
Fig. 11: 95% confidence intervals on matching score difference for all pairwise value combinations of four factors: clockwise – age (nine-level factor), ageing (eight-level factor), time of day (four-level factor), and time of year (four-level factor)

It is seen that all listed factors are statistically (>99.9%) significant, with a combination of age and ageing being less significant than other factors. From a practical point of view, however, the important question is not which factors affect the technology performance but to what degree they affect it.

Critically, for an organisation deploying the technology, it is important to know whether any action is required to improve the system performance. According to the 'technology–process–subject' factor triangle (described in Section 4.3), three types of actions are possible: replacing the technology, improving the process, or implementing subject-based customisation of the decision rules or procedures. As presented in the concluding section, the recommendations related to these actions appear evident from the results presented in this paper, without the need for more detailed statistical analysis.

7 Conclusions

Iris biometrics was introduced to automated border control as an extremely robust biometric [22]. Results obtained from a watch-list screening border application in the United Arab Emirates [23] solidified this belief. When the University of Notre Dame researchers later published results showing that iris performance varied over time [14–16], it raised considerable concern among technology users, including many government organisations that actively rely on iris technology in their operations [18–21]. To address these concerns, NIST undertook an effort to better understand the effect of ageing and other factors on iris biometrics [2]. This effort opened a whole new range of questions related to the factors that affect iris recognition and the ways iris biometrics is evaluated [4–9].

Thanks to the efforts of NIST and UND scientists, our understanding of the properties and limitations of iris biometrics and of current evaluation practices has improved significantly. The results presented in this paper further contribute to these efforts. Three major conclusions are drawn from the obtained results.

First, in applications where the use of the technology is not mandatory, as in automated border control [27], it should be expected that subjects who experience problems using the system will use it less than those who do not experience problems. Hence, the performance of biometric systems in such applications, if measured using traditional transaction-based metrics, may show unrealistic, 'overly optimistic' results. Therefore, the subject-based metrics introduced in this paper should be used when analysing and reporting the performance of such systems.

Second, in relationship to the ageing debate [3], where the CBSA-collected OPS-XING dataset played a very important role, it is concluded that the effect of ageing is negligible, compared with that of other factors such as kiosk location, time of day, and person's age.

While the effect of kiosk location and time of day on system performance had already been uncovered by the UND researchers [9] using the previous releases of the OPS-XING dataset, the discovery of the effect of a person's age on system performance was made possible only now, using the previously unused portions of the dataset. It is shown that older (over 60 years of age) and young (under 20 years of age) travellers are disadvantaged by the system. The system log shows worse IQ and matching scores for these groups, compared with those of middle-aged travellers. The variation of system performance due to age differences is larger than that due to lighting changes or different kiosk locations.

In a society concerned with providing services of equal quality to all its demographic groups (see [36]), this finding may help an organisation adjust its technology settings so as to mitigate the demographic bias exhibited by the iris recognition technology. A new guidelines document is being prepared by ISO in this regard [34].

To conclude, it may still be theoretically possible to improve the results of the analysis conducted on the OPS-XING dataset (e.g. by applying non-linear mixed-effect models [33]). From a practical perspective, however, this additional effort appears of little importance, since none of the analysed factors appeared to affect the system performance to a practically significant degree, and critical recommendations related to auditing and improving iris recognition systems can be made based on the results already obtained. These are listed below.

Using the 'technology–process–subject' factor categorisation triangle described in Section 4.3, the first step for improving iris recognition performance is seen in optimising the kiosk placement (a process factor). Then, the performance can be further optimised by applying different matching decisions or process rules for different age group populations (a subject factor). For example, a higher threshold or a larger number of attempts may be allowed for older and younger subjects, or the score normalisation formula can be further improved to take into account a person's age and other IQ metrics, as discussed in Section 2.2. This will mitigate the demographic bias exhibited by the system. However, no action in relation to ageing-related concerns (a technology factor) appears to be needed.

8 Acknowledgments

This work was initiated and partially funded by the Canadian Safety and Security Program (CSSP) managed by the Defence Research and Development Canada, Centre for Security Science (DRDC-CSS), as part of the CSSP-2013-CP-1020 (‘ART in ABC’) project [27] led by the CBSA. It has also contributed to the DRDC-funded CBSA-led CSSP-2015-TI-2158 (‘Roadmap for Biometrics at the Border’) project deliverables related to the Gender-Based Analysis Plus (GBA + ) [36]. Feedback from Kevin Bowyer, Adam Czajka, Patrick Grother, and Jim Matey on iris technology-related matters, and assistance from Jordan Pleet and Rafael Kulik on statistical matters are gratefully acknowledged.