Advanced Image Quality Assessment for Hand- and Finger-Vein Biometrics
Abstract
Natural scene statistics (NSS), commonly used in nonreference image quality measures, and a proposed deep-learning (DL)–based quality assessment approach are suggested as biometric quality indicators for vasculature images. While NIQE (natural image quality evaluator) and BRISQUE (blind/referenceless image spatial quality evaluator), when trained on common images with typical distortions, do not work well for assessing the quality of vasculature pattern samples, their variants trained on high- and low-quality vasculature sample data behave as expected from a biometric quality estimator in most cases (deviations from the overall trend occur for certain datasets or feature extraction methods). A DL-based quality metric is proposed in this work, designed to assign the correct quality class to vasculature pattern samples in most cases, independent of whether finger or hand vein patterns are assessed. The experiments, evaluating NIQE, BRISQUE, and the newly proposed DL quality metric, were conducted on a total of 13 publicly available finger and hand vein datasets and involve three distinct template representations (two of them especially designed for vascular biometrics). The proposed (trained) quality measures are compared to several classical quality metrics, and the achieved results underline their promising behavior.
1. Introduction
The term vascular biometrics describes a set of biometric modalities (commonly finger and hand vein biometrics, but also sclera pattern-based biometrics) that uniquely characterize people by the specific pattern of their blood vessel system. As the blood vessels are located inside the human body, their pattern can hardly be made visible using visible light. Instead, near-infrared (NIR) illumination in combination with NIR-sensitive cameras is utilized during the acquisition process to render the vessels visible as dark lines in images or videos [1]. This is based on the fact that the deoxygenated hemoglobin contained in the blood exhibits a higher light absorption coefficient in the near-infrared spectrum than the surrounding tissue. Samples exhibiting vasculature patterns can be acquired using either a reflected light set-up or a light transmission one, as shown in Figure 1. When using a reflected light set-up, the light source and the image sensor are positioned on the same side, opposite to the finger or hand that is to be captured. Thus, the NIR light emission and the recording of the reflected light are done on the same side. In the light transmission set-up, the light source and the image sensor are located on opposite sides. The finger or hand from which the vasculature pattern is to be captured is placed in between; thus, the emitted light needs to travel through the human tissue before it reaches the image sensor.


The most popular vasculature biometric is finger vein (FV) recognition as the acquisition of FV samples is almost as easy as capturing a fingerprint, can be done in a contact-less manner, and is more resistant to forging than fingerprints (a vein pattern is not left on a surface like, e.g., a fingerprint is). However, especially with FVs, the area where meaningful vessel information can be detected is limited. Depending on the acquisition set-up, this can affect the quality of the recorded biometric sample’s information. In the majority of all known cases, it is easier to make the vessels visible if the acquisition of the FV images is done from the palmar view in a light transmission set-up. Hence, most publicly available databases contain only palmar FV samples (Table 1). In hand vein biometrics, a larger area exhibiting venous patterns can be utilized to extract biometric information. One downside with hand vein biometrics is that usually more skin wrinkles and skin hair will be present and visible in the acquired samples than in the FV ones. These wrinkles and hairs tend to be mistakenly detected as blood vessels, influencing the quality of the biometric information and thus, the recognition performance of the whole biometric system. Furthermore, the different tissue layers of the human skin and the bone structure also influence the quality of the vasculature pattern samples as they absorb parts of the emitted light, which results in a reduced light intensity and thus, in a reduced contrast of the vasculature patterns and the background, lowering the quality of samples in general.
Database name | Dor/Pal | Subj | Fgs | Imgs | S | Image size |
---|---|---|---|---|---|---|
Finger vein | ||||||
FV-USM [2] | Palmar | 123 | 4 | 5904 | 2 | 640 × 480 |
HKPU-FV [3] | Palmar | 156 | 6 | 6264 | 2 | 513 × 256 |
MMCBNU_6000 [4] | Palmar | 100 | 6 | 6000 | 1 | 640 × 480 |
PLasDOR [5] | Dorsal | 60 | 6 | 3600 | 1 | 1280 × 1024 |
PLEDDOR [5] | Dorsal | 60 | 6 | 3600 | 1 | 1280 × 1024 |
PLasPAL [5] | Palmar | 60 | 6 | 3600 | 1 | 1280 × 1024 |
PLEDPAL [5] | Palmar | 60 | 6 | 3600 | 1 | 1280 × 1024 |
SDUMLA [6] | Palmar | 106 | 6 | 3816 | 1 | 320 × 240 |
UTFVP [7] | Palmar | 60 | 6 | 1440 | 2 | 672 × 380 |
Hand vein | ||||||
CIE-HV [8] | Palmar | 50 | 8 | 1200 | 3 | 1280 × 960 |
PTrans [9] | Dorsal | 40 | 5 | 400 | 1 | 384 × 384 |
PRefl [9] | Dorsal | 40 | 5 | 400 | 1 | 384 × 384 |
VERA [10] | Palmar | 110 | 5 | 2200 | 2 | 580 × 680 |
For the successful application of any biometric authentication system, the sample quality plays a vital role. Hence, for a vasculature-based biometric authentication system, it is fundamental to determine the quality of the acquired sample data. This quality assessment is necessary for both enrollment and the actual authentication, such that a recapturing can be initiated in case the sample’s quality turns out to be insufficient. Furthermore, quality assessment is useful for selecting subsequent steps in the signal processing pipeline (algorithms/parameters) and, in unsupervised scenarios, for user guidance during capturing. Especially in an unsupervised scenario, it is highly likely that different amounts of rotation occur when a finger or a hand is presented to the capturing device. Even a limited amount of finger rotation (within ±30°) is known to be one of the major causes of poor recognition performance in FV biometrics, in particular rotation in the longitudinal direction [11]. Rotation is not only an FV-specific quality factor; hand vein recognition is also influenced by out-of-plane hand rotations. The influence of the rotation cannot be determined based on the analysis of a single sample image, but only in the context of an analysis of template comparison results (rotation difference of the two samples involved in the comparison).
1.1. Contribution of Work
This work is based on the study “Fingervein Sample Image Quality Assessment using Natural Scene Statistics” (NSS) introduced in [12]; a preprint of this manuscript has previously been published [13]. The original idea of [12] was to propose a learning-based FV sample quality assessment scheme by training the NSS used in general-purpose image quality metrics (IQMs) like NIQE (natural image quality evaluator) [14] and BRISQUE (blind/referenceless image spatial quality evaluator) [15] on FV sample data. In order to ensure sufficient generalisability (i.e., independence from the utilized recognition features), the IQMs are trained on low- and high-quality sample data, as classified by human judgment. This approach was selected to demonstrate sufficient generalisability across different datasets and FV recognition schemes.
1. While in [12], the evaluation of the proposed quality assessment scheme was done on two FV databases, this work evaluates the methodology on nine FV databases. As most FV databases exhibit a high amount of variation introduced by the capturing process and the utilized capturing devices, the extension to nine databases aims at confirming the generalisability of the quality assessment scheme.
2. The second extension is the application of the proposed quality assessment scheme to another modality, hand vein biometrics, represented by four additional databases. The templates used for comparing finger or hand vein samples are similar, as the feature extraction methods are the same. Hence, in the best case, the application of the suggested quality assessment schemes should yield similarly stable results.
3. In recent years, more and more deep-learning (DL)–based applications for quality evaluation have become feasible as well (see the related work in Section 2). Hence, as a third contribution, a new DL-based method for vascular quality assessment is proposed in this article.
All in all, a total of 13 vasculature databases are evaluated by using a broad set of various quality assessment methods. The NSS-based quality schemes have been retrained to include the characteristics of the additional databases, instead of reusing the retrained versions of BRISQUE and NIQE proposed in [12].
The remainder of the paper is organized as follows: Section 2 presents an overview of related work in quality assessment for vascular biometrics. Section 3 lists the utilized datasets, followed by the description of the NSS-based quality methods. Afterward, the proposed DL-based quality estimation method is presented, followed by the details of the experimental protocol. The experimental results are presented and discussed in the penultimate section. Finally, Section 8 concludes this work and gives an outlook on potential future work.
2. Related Work on Biometric Quality Evaluation
Biometric sample quality measures are applied in order to estimate whether a recorded sample of a specific biometric trait can successfully be evaluated by an automated biometric recognition system. Hence, the ISO/IEC 29794:2016 Biometric Sample Quality standard contains a definition of how quality evaluation can be performed for most biometric modalities. As a consequence, for well-established and widely employed biometric modalities like fingerprint, face, or iris, dedicated image quality evaluation algorithms have been established and successfully applied [16, 17]. However, as already discussed in [12], the ISO/IEC 29794:2016 Biometric Sample Quality standard does not yet include a unified quality evaluation criterion for vasculature sample images.
Several studies on finger- and hand-vein quality evaluation have been published in the past 10 years. Based on these publications, two main classes of vasculature quality assessment techniques can be distinguished: (a) nonvasculature-feature–based techniques and (b) vasculature-feature–based ones. Methods belonging to the first class can be used directly on finger or hand vein samples after the acquisition process is completed, while methods belonging to the second class need to extract vascular-specific features prior to the application of the quality measure. In the following, a short description of methods from these two classes is given.
Techniques from the first group make use of low-level image information like gradient, contrast, entropy, clarity, and brightness uniformity. Methods utilizing gradient, contrast, and entropy information have been proposed in [18] and [19] for an application to FV samples. These three low-level features can also be used in a combined manner, as proposed by Peng, Li, and Niu [20], who fused gradient, contrast, and entropy using a triangular norm scheme. In [20], the triangular norm scheme was applied to FV images as well, while in [21], clarity and brightness uniformity information was used to estimate the quality of palm vein samples.
Other methods using nonvasculature-specific features are based either on the Radon transform as, for example, in [22], or on the human visual system–based signal-to-noise ratio (HSNR) method [23]. HSNR mimics the human visual system to evaluate the quality of vasculature samples by combining four different indices: the image contrast; the deviation of the foreground area’s center of mass from the geometric center of the whole image; the effective area of an image, that is, the area and locations where vascular information can be detected; and a signal-to-noise index adapted with respect to the human visual system.
Techniques belonging to the second group make use of vasculature-specific features that need to be extracted during or prior to the quality evaluation process. In [24], the same features that are subsequently used during the recognition process are also utilized to estimate the biometric sample quality. The authors used the number of pixels representing vascular pattern information as the main feature for the quality estimation. Several studies employing learning-based approaches (e.g., [25–27]) analyze incorrect and/or poor template comparisons in recognition experiments and use the gained information to improve the learning procedure. Furthermore, a traditional CNN can be trained to establish a vascular quality measurement, as described in [28]. This particular Light-CNN was designed to treat FV quality estimation as a classification problem. Thus, the model was built to distinguish between vein images that contain rich and stable vein characteristics and those which only exhibit poor vein characteristics. The obtained results outperformed other well-established methods like [22]. The Light-CNN method was selected as the comparison method for the newly proposed DL approach of this study (details of the comparison can be found in Section 7.2, Table 2).
Biometric modality and evaluation method | Mean accuracy over all folds | Fold | Poor | Middle | Good |
---|---|---|---|---|---|
Finger vein—dorsal | | | | | |
Light-CNN [28] | 0.6771 | — | — | — | — |
New approach | 0.7250 | — | — | — | — |
— | — | Mean over all folds | 0.5594 | 0.7461 | 0.8088 |
— | — | 1-fold | 0.4193 | 0.7714 | 0.6558 |
— | — | 2-fold | 0.7571 | 0.5641 | 0.8460 |
— | — | 3-fold | 0.4260 | 0.7918 | 0.8322 |
— | — | 4-fold | 0.6351 | 0.8571 | 0.9010 |
Finger vein—palmar | | | | | |
Light-CNN [28] | 0.6175 | — | — | — | — |
New approach | 0.7336 | — | — | — | — |
— | — | Mean over all folds | 0.2404 | 0.8348 | 0.7820 |
— | — | 1-fold | 0.2705 | 0.8689 | 0.6960 |
— | — | 2-fold | 0.2193 | 0.8495 | 0.6969 |
— | — | 3-fold | 0.2003 | 0.8031 | 0.7425 |
— | — | 4-fold | 0.2715 | 0.8179 | 0.7201 |
Hand vein | | | | | |
Light-CNN [28] | 0.5271 | — | — | — | — |
New approach | 0.6773 | — | — | — | — |
— | — | Mean over all folds | 0.4894 | 0.7447 | 0.7224 |
— | — | 1-fold | 0.5414 | 0.7507 | 0.6382 |
— | — | 2-fold | 0.4196 | 0.7961 | 0.6779 |
— | — | 3-fold | 0.4712 | 0.6901 | 0.7391 |
— | — | 4-fold | 0.5251 | 0.7418 | 0.8344 |
- Note: The bold values signify the best performance achieved for each dataset and feature extraction method.
- Abbreviation: SVM, support vector machine.
Most recently, in [12], a learning-based vascular sample quality assessment scheme was proposed, which is based on retraining the general-purpose IQMs NIQE [14] and BRISQUE [15] on vascular images. This study found that, due to the large differences between vascular sample images and typical natural scene images, a retraining of the quality estimators is mandatory. Similar to the original training of NIQE and BRISQUE, the retraining of NIQE uses only high-quality images, while for BRISQUE both high- and low-quality images are considered. The results indicated that the retrained versions of both measures provide better results on FV samples than their original versions.
As mentioned in the introduction of this study (cf., Section 1), the findings of [12] are extended/validated by including additional vascular databases (especially also including hand vein ones as a further biometric trait) and feature types, as well as by proposing a new DL-based method for vascular quality estimation (cf., section on vascular quality estimation using DL). Hence, most of the quality measures evaluated in [12] are included in this study as well. This includes the following ones from the first class (nonvasculature-feature based): the global contrast factor (GCF) [19], the gray level energy score (EntropyBased, based on the entropy information of the image [18]), the Radon transformation approach (Radon) [22], the triangular norm scheme (TNorm) [20], and the scheme proposed in [21] (Wang). Furthermore, HSNR [23] is also included. From the second group (vasculature-feature based), only the retrained versions of NIQE and BRISQUE are used, but not the same ones as in [12]. Instead, new versions are used that have been trained with a larger amount of training data.
3. Datasets
The experiments evaluating the discussed vasculature specific quality metrics were conducted on 13 publicly available vascular pattern databases (Table 1). Four of these databases are hand vein ones, while the remaining ones contain FV images captured from palmar or dorsal view. Example impressions of the utilized databases are visualized in Figures 2 and 3.


The FV-USM [2] contains 5904 palmar FV images with an image resolution of 640 × 480 pixels. The images have been acquired from 123 subjects in two independent acquisition sessions. The image-capturing process was the same for all subjects and both sessions, resulting in six images per finger, whereby a total of four fingers were recorded for each subject.
The “Hong Kong Polytechnic University Finger Image Database (Version 1.0) (HKPU-FV)” [3] is composed of 6264 palmar vascular finger images with a resolution of 513 × 256 pixels. These images have been acquired from 156 volunteers in two acquisition sessions. Six samples each were recorded from the index and middle finger of the left hand.
The “Chonbuk National University MMCBNU-6000 FV database (MMCBNU_6000)” [4] contains 6000 palmar light transmission FV images (resolution of 640 × 480 pixels) acquired from 100 subjects. For each subject six fingers (10 images per finger) have been acquired in a single acquisition session.
The “PLUSVein-FV3 FV Data Set (PLUSVein-FV3)” [5] is composed of a total of four subsets using two different capturing devices both being capable of acquiring samples from the palmar as well as the dorsal view. In Table 1, these subsets are mentioned separately as PLasDOR, PLEDDOR, PLasPAL, and PLEDPAL. Each subset contains 1800 images, the samples having an image resolution of 200 × 750 pixels, acquired from 60 subjects in a single acquisition session. In total six fingers per subject (index, middle, and ring finger of both hands) and five images per finger have been recorded.
The “Shandong University Machine Learning and Applications—Homologous Multi-Modal Traits Database (SDUMLA-HMT)” [29] contains several biometric modalities (as the name implies). In the current study, only the subset containing the vascular finger patterns is used. This subset contains a total of 3816 palmar images from 106 subjects (each image has a resolution of 320 × 240 pixels). From each subject, six samples each of the index, middle, and ring finger of both hands were captured.
The “University of Twente Finger Vascular Pattern (UTFVP) Database” [7] contains 1440 palmar FV sample images, acquired in two acquisition sessions from 60 different subjects. For each subject vascular pattern images of the ring, middle, and index finger from both hands have been captured (two samples per finger/acquisition session). The acquired images exhibit a resolution of 672 × 380 pixels.
The “University of Poznan Hand Vein Data Set (CIE-HV)” [8] contains 1200 hand vein images, which have been acquired from the palmar view in a reflected light illumination set-up, exhibiting a resolution of 1280 × 960 pixels. Images of both hands (four samples per hand) from 50 subjects were acquired in three acquisition sessions.
The “PROTECT Hand Vein Dataset (PROTECT-HV)” [9] is composed of two subsets. For the acquisition of each subset, a different illumination technique (light transmission or reflected) was used. Hence, in Table 1, these subsets are mentioned separately as PTrans, PRefl. In contrast to the CIE-HV, the samples have not been captured from the palmar, but the dorsal view. Each subset contains a total of 400 images from both hands of 40 subjects (five images per hand). All images exhibit a resolution of 384 × 384 pixels.
The “Idiap Research Institute VERA Palmvein Database (VERA)” [10] contains 2200 palmar hand vein images, recorded from 110 subjects in two acquisition sessions from both hands using reflected light illumination. The images exhibit a resolution of 580 × 680 pixels.
4. NSSs in Vascular Image Quality
In the following, details about nonreference (NR) IQMs are described, particularly focussing on methods based on the concept of NSS.
4.1. NR IQMs
Current state-of-the-art NR image quality assessment algorithms are based on models that learn to predict human judgments from databases of human-rated, distorted images. These kinds of IQM models are necessarily limited, since they can only assess quality degradations arising from the distortion types that they have previously seen and been trained on. However, it is also possible to contemplate subcategories of general-purpose NR IQM models having tighter conditions. A model is said to be opinion aware (OA) if it has been trained on a database(s) of human-rated distorted images and associated subjective opinion scores. Algorithms like BRISQUE, as described below, are OA IQM measures. IQMs like NIQE (see below), however, are opinion unaware (OU): they make use only of measurable deviations from statistical regularities observed in natural images, without being trained on human-rated distorted images and indeed without any exposure to distorted images.
Systematic comparisons of the NR IQM as used in this paper have been published in [30, 31]. Both, in nontrained [31] as well as in specifically trained manner [30], the correspondence to human vision turns out to be highly dependent on the dataset considered and the type of distortion present in the data. Thus, no “winner” has been identified among the techniques considered with respect to correspondence to subjective human judgment and objective distortion strength.
4.2. BRISQUE
BRISQUE [15] is an NSS-based spatial NR quality assessment algorithm. It is based on the principle that natural images possess certain regular statistical properties that are measurably modified in the presence of distortions.
In the first stage, an image is locally normalized (via local mean subtraction and divisive normalization), resulting in the so-called mean subtracted contrast normalized (MSCN) coefficients. An asymmetric generalized Gaussian distribution (AGGD) is used to fit the MSCN statistics from pristine as well as distorted images. In order to quantify the dependency between neighbors, the relationships between adjacent MSCN coefficients are analyzed via pairwise products at a distance of 1 pixel along four orientations. The parameters of the best AGGD fit are extracted for each orientation, which leads to a total of 16 parameters (4 parameters/orientation × 4 orientations). Because images are naturally multiscale, and distortions affect their structure across scales, these features are extracted at two scales: the original image scale and a reduced resolution. Thus, a total of 32 features are selected, 16 at each scale.
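The first stage can be sketched as follows. This is an illustrative reimplementation, not the reference code of [15]; the Gaussian window width and the stabilizing constant are typical choices, not values prescribed by the text above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(image, sigma=7 / 6, c=1.0):
    """Mean subtracted contrast normalized (MSCN) coefficients:
    local mean subtraction followed by divisive normalization."""
    img = image.astype(np.float64)
    mu = gaussian_filter(img, sigma)                      # local mean
    var = gaussian_filter(img * img, sigma) - mu * mu     # local variance
    sigma_map = np.sqrt(np.abs(var))                      # local contrast
    return (img - mu) / (sigma_map + c)                   # c stabilizes flat regions

def neighbor_products(m):
    """Pairwise products of adjacent MSCN coefficients at a distance
    of 1 pixel along the four orientations used by BRISQUE."""
    return {
        "horizontal":    m[:, :-1] * m[:, 1:],
        "vertical":      m[:-1, :] * m[1:, :],
        "main_diagonal": m[:-1, :-1] * m[1:, 1:],
        "sec_diagonal":  m[:-1, 1:] * m[1:, :-1],
    }
```

The AGGD parameters would then be fitted to each of these four product maps (and to the MSCN map itself) at both scales.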
A set of pristine images from the Berkeley image segmentation database is taken; additionally, distortion types similar to those present in the LIVE image quality database were introduced into each image at varying degrees of severity to form the distorted image set: JPEG 2000, JPEG, white noise, Gaussian blur, and fast fading channel errors (thus, BRISQUE is an OA IQM). The computed AGGD features, in combination with their associated difference mean opinion scores, are used to train a probabilistic support vector regression (SVR) model, which is then used for prediction. The difference mean opinion score represents the subjective quality of each image and is obtained by averaging across the human ratings for each of the visual signals in the study.
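The final regression step can be illustrated with a small sketch using scikit-learn's SVR; the feature matrix and scores below are synthetic placeholders for the 32 NSS features and the associated difference mean opinion scores, and the SVR hyperparameters are illustrative defaults, not those of the original BRISQUE training.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 32))   # 32 NSS features per image (synthetic stand-in)
dmos = rng.uniform(0, 100, size=200)    # difference mean opinion scores (synthetic)

# The SVR maps the NSS feature vector of an image to a subjective quality score.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(features, dmos)
predicted_quality = model.predict(features[:1])
```

At evaluation time, only the 32-feature extraction and a single `model.predict` call are needed per image.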
4.3. NIQE
NIQE [14] is an NR OU-DU (distortion unaware) IQM. Thus, it uses only measurable deviations from statistical regularities in natural images, without training on human-rated distorted images. The NSS features used in the NIQE index are similar to those used in BRISQUE; however, NIQE only uses the NSS features of natural images and is not, as BRISQUE is, trained on features obtained from both natural and distorted images (and the corresponding human judgments of the quality of the latter). As a consequence, the NIQE index is not tied to any specific distortion type, while BRISQUE is limited to the types of distortions it has been trained on and tuned to. The MSCN coefficients are computed in P × P image patches, but only patches with sufficient sharpness are selected for further processing. NIQE is applied by computing the 32 identical NSS features from those patches, fitting them with the AGGD model, and then comparing this fit to the model derived from pristine images.
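The patch-selection step can be sketched as follows; the patch size, the sharpness proxy (local contrast), and the relative threshold are assumptions for illustration, not the exact choices of the original NIQE implementation.

```python
import numpy as np

def select_sharp_patches(image, P=96, threshold=0.75):
    """Partition the image into non-overlapping P x P patches and keep
    only those with sufficient sharpness, approximated here by the
    patch standard deviation (a simple local-contrast proxy)."""
    h, w = image.shape
    patches, sharpness = [], []
    for i in range(0, h - P + 1, P):
        for j in range(0, w - P + 1, P):
            patch = image[i:i + P, j:j + P].astype(np.float64)
            patches.append(patch)
            sharpness.append(patch.std())
    sharpness = np.array(sharpness)
    keep = sharpness >= threshold * sharpness.max()   # relative sharpness criterion
    return [p for p, k in zip(patches, keep) if k]
```

The NSS features would then be computed only on the returned patches before comparing their model fit against the pristine-image model.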
5. DL-Based Vascular Quality
In contrast to the quality methods discussed in Section 2 (on related work), the newly designed metric to predict the quality of finger and hand vein images is a CNN-based one. For training the CNN, the biometric image data discussed in the previous section (describing the utilized datasets), manually separated into the quality classes poor, middle, and good, are used. The detailed procedure for the class separation and further details regarding the experimental protocol are given in the subsequent section. The CNN is based on the SqueezeNet (SqNet) [32] architecture, whereby the employed SqNet has already been pretrained on the ImageNet database (http://www.image-net.org/). The CNN is then fine-tuned (all CNN layers are fine-tuned) using vein image training data in order to learn to predict the quality of biometric vein images. During the refined CNN training, data augmentation is applied by first resizing the input images to a size of 234 × 234 and then extracting a patch of size 224 × 224 at a random position of the resized image (±5 pixels random shift in each direction). The implementation of the network is realized in PyTorch. A batch size of 180 images is used for training (60 images per quality class). The CNN is trained for 400 epochs with the Adam optimizer, starting with a learning rate of 0.001, which is divided by 10 every 120 epochs.
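Two of the training details above, the random-crop augmentation and the step-wise learning-rate schedule, can be sketched framework-independently (NumPy stands in here for the actual PyTorch pipeline):

```python
import numpy as np

def augment(image_234, rng):
    """Random-crop augmentation: from a 234 x 234 input, extract a
    224 x 224 patch whose position varies by up to +/-5 px around
    the center (234 - 224 = 10 possible offsets per axis)."""
    top = int(rng.integers(0, 11))
    left = int(rng.integers(0, 11))
    return image_234[top:top + 224, left:left + 224]

def learning_rate(epoch, base_lr=1e-3):
    """Schedule from the text: start at 1e-3, divide by 10 every 120 epochs."""
    return base_lr / (10 ** (epoch // 120))
```

In the actual PyTorch training loop, the same effect is typically obtained with resize/random-crop transforms and a step learning-rate scheduler.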
A triplet of training images (Anchor, Positive, and Negative) is processed through the CNN, resulting in a CNN output for each of the three images. These CNN outputs are then used to compute the triplet loss to update the CNN. In order to select only those triplets for training that are able to improve the model, we employ hard triplet selection [33]: only triplets (A, P, N) with L(A, P, N) > 0 are permitted for training.
In summary, the CNN is trained to produce an output f(I) such that the squared Euclidean distances between the outputs of images from the same quality level are small, whereas the distances between the outputs of any pair of images from different quality levels are large. The CNN thus clusters images of the same quality level together in the output space, apart from images of other quality levels.
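The triplet loss and the hard triplet selection described above can be sketched as follows; the margin value is an assumption for illustration, as it is not specified in the text.

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """Standard triplet loss on embedding vectors:
    L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + margin, 0)."""
    d_ap = float(np.sum((a - p) ** 2))   # anchor-positive distance
    d_an = float(np.sum((a - n) ** 2))   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

def hard_triplets(triplets, margin=0.2):
    """Hard triplet selection: keep only triplets with L(A, P, N) > 0,
    i.e., those that still produce a gradient and can improve the model."""
    return [(a, p, n) for a, p, n in triplets if triplet_loss(a, p, n, margin) > 0]
```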
However, the applied CNN architecture is not able to directly predict the quality level of an image. For this prediction task, a support vector machine (SVM) is applied additionally. The SVM is trained using the CNN outputs from the training data. To predict the quality level of an evaluation image, the image is fed to the CNN, and the SVM classifier then predicts the quality level based on the CNN output.
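A minimal sketch of this final classification stage, using scikit-learn's SVC on synthetic stand-ins for the CNN output vectors (the 2-D embeddings, cluster centers, and kernel choice are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D CNN embeddings: one well-separated cluster per quality class,
# emulating the clustering behavior the triplet training is meant to produce.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, (60, 2)) for c in centers])
y = np.repeat(["poor", "middle", "good"], 60)

clf = SVC(kernel="rbf").fit(X, y)   # SVM trained on the CNN embedding space

def predict_quality(embedding):
    """Predict the quality class of one CNN output vector."""
    return clf.predict(embedding.reshape(1, -1))[0]
```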
6. Experimental Protocol
Finger detection, finger alignment, and RoI extraction (being mandatory for the stable application of quality metrics) are done for all finger and hand vein databases as described in [34]. After preprocessing, the extracted features are used to perform the baseline experiments, resulting in the equal error rate (EER), the false nonmatch rate for a false match rate (FMR) less than or equal to 0.1% (FMR1000), and the false nonmatch rate for zero FMR (ZeroFMR) for each of the aforementioned databases. The experiments are conducted by utilizing the PLUS OpenVein Finger- and Hand-Vein Toolkit (http://www.wavelab.at/sources/OpenVein-Toolkit/ [35]).
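As an illustration of the reported performance figures, the EER can be computed from genuine and impostor comparison scores with a simple threshold sweep (a sketch assuming higher scores indicate better matches; toolkit implementations may interpolate between thresholds):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal error rate: operating point where the false match rate
    (impostor scores accepted) equals the false nonmatch rate
    (genuine scores rejected)."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, best_eer = np.inf, 1.0
    for t in thresholds:
        fmr = np.mean(impostor >= t)    # impostors wrongly accepted
        fnmr = np.mean(genuine < t)     # genuines wrongly rejected
        if abs(fmr - fnmr) < best_gap:
            best_gap, best_eer = abs(fmr - fnmr), (fmr + fnmr) / 2
    return best_eer
```

FMR1000 and ZeroFMR follow the same pattern, reporting the FNMR at the threshold where the FMR first drops to 0.1% and 0%, respectively.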
To extract the feature information from the given vasculature patterns, three very distinct techniques are applied: two binary vessel structure-based feature extraction schemes, (i) Gabor filter (GF) [3] and (ii) maximum curvature (MC) [36], as well as (iii) the keypoint-based SIFT [37]. The GF and MC feature templates are subsequently compared using a correlation-based approach proposed in [36], the so-called Miura matcher, while SIFT-based recognition is applied as described in [37].
Prior to the training of NIQE, BRISQUE, and the DL method, and prior to the quality metric evaluation, each of the databases has been manually separated into three quality levels: poor, middle, and good. There are two reasons for this separation of the samples: first, as this study is an extension of [12], the quality metrics BRISQUE and NIQE needed to be retrained using more data. Due to the design of those metrics, a separation into different quality classes is mandatory; otherwise, the training of the NSS-based linear SVM classifier would not be possible. Second, the new DL-based quality metric is designed to separate different quality levels. Hence, a class-specific data separation is necessary for training this method. Note that images originally recorded from the same subject may be assigned to different classes. Table 3 lists the number of images belonging to each of the three classes for each database, including the average quality values per class of HSNR and Wang. Example images for each of the quality classes are presented in Figure 4.

Database name | Metric | Poor | Middle | Good |
---|---|---|---|---|
F. vein—dorsal | nr. img | 936 | 817 | 1790 |
PLasDOR | nr. img | 520 | 385 | 895 |
PLasDOR | HSNR | 59.84 | 59.84 | 59.97 |
PLasDOR | Wang | 0.24 | 0.23 | 0.21 |
PLEDDOR | nr. img | 416 | 489 | 895 |
PLEDDOR | HSNR | 59.72 | 59.70 | 59.91 |
PLEDDOR | Wang | 0.23 | 0.22 | 0.19 |
Hand vein | nr. img | 749 | 1332 | 1360 |
CIE-HV | nr. img | 80 | 436 | 684 |
CIE-HV | HSNR | 60.67 | 60.67 | 60.62 |
CIE-HV | Wang | 0.25 | 0.25 | 0.25 |
PTrans | nr. img | 134 | 233 | 270 |
PTrans | HSNR | 57.08 | 56.91 | 58.55 |
PTrans | Wang | 0.26 | 0.24 | 0.20 |
PRefl | nr. img | 132 | 152 | 320 |
PRefl | HSNR | 59.32 | 59.67 | 58.74 |
PRefl | Wang | 0.57 | 0.60 | 0.57 |
VERA | nr. img | 403 | 511 | 86 |
VERA | HSNR | 59.45 | 59.00 | 60.52 |
VERA | Wang | 0.32 | 0.33 | 0.33 |
F. vein—palmar | nr. img | 1977 | 11,747 | 10,167 |
FV-USM | nr. img | 235 | 3723 | 1946 |
FV-USM | HSNR | 59.27 | 59.25 | 59.84 |
FV-USM | Wang | 0.13 | 0.13 | 0.14 |
HKPU-FV | nr. img | 495 | 2009 | 628 |
HKPU-FV | HSNR | 60.95 | 59.99 | 60.20 |
HKPU-FV | Wang | 0.14 | 0.11 | 0.11 |
MMCBNU | nr. img | 346 | 3415 | 2239 |
MMCBNU | HSNR | 61.04 | 60.99 | 60.98 |
MMCBNU | Wang | 0.08 | 0.08 | 0.10 |
PLasPAL | nr. img | 305 | 792 | 703 |
PLasPAL | HSNR | 59.73 | 59.75 | 59.81 |
PLasPAL | Wang | 0.19 | 0.18 | 0.18 |
PLEDPAL | nr. img | 164 | 776 | 859 |
PLEDPAL | HSNR | 59.55 | 59.80 | 59.80 |
PLEDPAL | Wang | 0.20 | 0.18 | 0.19 |
SDUMLA | nr. img | 132 | 152 | 320 |
SDUMLA | HSNR | 59.32 | 59.67 | 58.74 |
SDUMLA | Wang | 0.57 | 0.60 | 0.57 |
UTFVP | nr. img | 124 | 228 | 1088 |
UTFVP | HSNR | 60.44 | 60.52 | 60.18 |
UTFVP | Wang | 0.13 | 0.12 | 0.13 |
- Note: For each database and class, the HSNR and Wang quality values are shown as well.
Percentage-wise, far more images originating from the hand vein and dorsal FV databases are categorized into the poor class compared to the palmar FV ones. This is not only due to the subjective assessment of the persons involved in the manual selection process but also due to the fact that in dorsal images (especially the FV ones) skin folds are often more prominent and visible than the venous structures. Hence, the subjective perception of existing venous structures is reduced, since a human observer focuses more on the skin folds during the quality assessment process, or these folds simply overlay the existing venous structures. If the subjective classification is correct for most images, a clear quality difference between the classes middle and good compared to poor should be measurable. A large difference between middle and good would be desirable, but since the main subjective difference between middle and good was the clarity of the vein structure (contrast between the vein lines and the background), it is likely that the metrics compensate for such nuance-based differences and thus reflect similarly good quality in these two groups, that is, the quality scores of both groups will not differ significantly. The reported quality values for HSNR and Wang (Table 3) clearly do not reflect the desired behavior. Instead, the reported values are almost identical across all manually selected quality classes or even tend to decrease with higher subjective quality. The same overall trend holds for the other, untrained quality assessment methods, clearly indicating that those are not suitable to reflect the chosen quality classes. As shown in [12], retraining the NSS-based metrics on the vasculature databases improved the relation between the assessed quality values and the subjective quality. This motivated the refined retraining of BRISQUE and NIQE done in this work.
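The retraining step motivated above, i.e., fitting a classifier on NSS features extracted from class-labelled vein samples, can be sketched as follows. The feature vectors are synthetic stand-ins (the real BRISQUE features are 36-dimensional NSS fit parameters), and a nearest-class-mean rule stands in for the linear SVM to keep the sketch dependency-free; this is illustrative, not the actual retraining code:

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ("poor", "middle", "good")
# Stand-in for 36-dimensional NSS feature vectors; in the real pipeline these
# are extracted from the vein images of each manually labelled quality class.
X = {c: rng.normal(loc=m, scale=1.0, size=(50, 36))
     for c, m in zip(classes, (0.0, 1.0, 2.0))}

# Class-specific training data is mandatory: each class contributes its own
# feature distribution, summarized here by its centroid.
centroids = {c: feats.mean(axis=0) for c, feats in X.items()}

def predict(v):
    """Assign the quality class whose centroid is closest to feature vector v."""
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

# Training accuracy on the synthetic, well-separated classes.
acc = np.mean([predict(v) == c for c, feats in X.items() for v in feats])
```

In the actual pipeline, a linear SVM replaces the nearest-centroid rule, but the requirement is the same: without a separation into distinct quality classes, no such supervised model can be trained.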
| Database | Feature | EER | FMR1000 | ZeroFMR |
|---|---|---|---|---|
| *F. vein—dorsal* | | | | |
| PLasDOR | GF | 0.0088 | 0.0166 | 0.0452 |
| | MC | **0.0055** | **0.0083** | **0.0166** |
| | SIFT | 0.0164 | 0.0508 | 0.2000 |
| PLEDDOR | GF | 0.0044 | 0.0080 | 0.0255 |
| | MC | **0.0013** | **0.0016** | **0.0061** |
| | SIFT | 0.0142 | 0.0669 | 0.4100 |
| *F. vein—palmar* | | | | |
| FV-USM | GF | 0.0400 | 0.1250 | 0.2799 |
| | MC | **0.0261** | **0.0662** | **0.2516** |
| | SIFT | 0.0753 | 0.2442 | 0.4333 |
| HKPU-FV | GF | 0.1124 | 0.3033 | 0.5163 |
| | MC | **0.1028** | **0.2387** | **0.3771** |
| | SIFT | 0.1672 | 0.6417 | 0.9218 |
| MMCBNU | GF | 0.1071 | 0.4126 | 0.7322 |
| | MC | **0.0517** | **0.2042** | **0.5928** |
| | SIFT | 0.0809 | 0.9981 | 0.9997 |
| PLasPAL | GF | **0.0172** | **0.0241** | **0.0286** |
| | MC | 0.0213 | 0.0413 | 0.0663 |
| | SIFT | 0.0742 | 0.2788 | 0.6136 |
| PLEDPAL | GF | 0.0038 | 0.0050 | 0.0097 |
| | MC | **0.0019** | **0.0019** | **0.0044** |
| | SIFT | 0.0610 | 0.2786 | 0.9991 |
| SDUMLA | GF | 0.1198 | 0.2803 | 0.6027 |
| | MC | **0.0492** | **0.0975** | 0.5269 |
| | SIFT | 0.0884 | 0.2462 | **0.5007** |
| UTFVP | GF | 0.0073 | 0.0203 | **0.0333** |
| | MC | **0.0069** | **0.0138** | 0.0337 |
| | SIFT | 0.1190 | 0.5388 | 0.8157 |
| *Hand vein* | | | | |
| CIE-HV | GF | 0.0532 | 0.0842 | 0.1696 |
| | MC | **0.0433** | **0.0535** | **0.0600** |
| | SIFT | 0.2642 | 0.7642 | 0.8471 |
| PTrans | GF | 0.0645 | 0.4601 | 0.8276 |
| | MC | 0.0348 | 0.1358 | 0.2079 |
| | SIFT | **0.0125** | **0.0178** | **0.0212** |
| PRefl | GF | 0.0500 | 0.3125 | 0.7975 |
| | MC | 0.0292 | 0.1270 | 0.6697 |
| | SIFT | **0.0210** | **0.0461** | **0.5026** |
| VERA | GF | **0.0193** | **0.0413** | **0.0477** |
| | MC | 0.0310 | 0.0446 | 0.0506 |
| | SIFT | 0.0340 | 0.1126 | 0.1911 |

- Note: The bold values signify the best (lowest) value achieved for each dataset and error measure.
For the training of BRISQUE and NIQE, two different experimental protocols are chosen: (i) leave one dataset out and (ii) 10-fold sampling. In the first case, all good (NIQE) or poor and good (BRISQUE) finger or hand vein images are chosen as training set, excluding the images of the one database that is to be evaluated later on (e.g., MMCBNU). Thus, the images and corresponding characteristics of the evaluated database are never included in the training process. The second protocol is used to validate the stability of the retrained versions of BRISQUE and NIQE and their independence from the quality values of the particular vasculature pattern samples (each random subset of the same class should exhibit the same average quality value). A traditional, randomly performed 10-fold sampling is conducted: the images of all datasets of the same image type are combined and then divided into 10 folds, where each fold consists of a tenth of the subjects from the combined data. It does not matter from which dataset the images originate, but all images of the same subject have to be in the same fold. This is done to prevent any bias between the training and evaluation data. Images selected for training are excluded from the evaluation databases and hence from all subsequently performed evaluations.
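The subject-disjoint fold construction described above can be sketched as follows; the function name and the `(subject_id, image_name)` data layout are illustrative assumptions:

```python
from collections import defaultdict

def subject_disjoint_folds(samples, n_folds=10):
    """Split (subject_id, image_name) pairs into folds such that all images
    of one subject end up in the same fold, preventing any subject overlap
    between training and evaluation data."""
    by_subject = defaultdict(list)
    for subject, image in samples:
        by_subject[subject].append(image)
    folds = [[] for _ in range(n_folds)]
    # Whole subjects are distributed round-robin over the folds.
    for i, subject in enumerate(sorted(by_subject)):
        folds[i % n_folds].extend((subject, img) for img in by_subject[subject])
    return folds
```

In a leave one database out run, the held-out database's samples would simply never enter `samples`, so its characteristics cannot leak into training.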
For the training of the DL method, the experiments and training procedures are applied separately for the three types of samples/biometric traits that are evaluated (palmar FV, dorsal FV, HV), using a fourfold cross-validation analogous to the 10-fold protocol described before. Since CNN training is time-consuming, a fourfold cross-validation was employed so that the CNN only needs to be trained four times instead of 10 times per experiment. Each fold is used once for evaluation, while the images of the remaining three folds serve as training data. The entire DL-based method is trained on the training portion and subsequently evaluated on the evaluation fold.
For calculating the quality scores, the IQMs described in the NSS methods' section and the DL-based method (cf. the section on DL-based vascular quality estimation) are employed. While the DL-based method is implemented in Python, all other metrics (EntropyBased, GCF, HSNR, Radon, TNorm, and Wang17) are implemented in MATLAB. For BRISQUE and NIQE, the MATLAB implementations from their developers (available from http://live.ece.utexas.edu/research/quality/) are utilized, and both (i) the default (pretrained) versions and (ii) the versions retrained on the vasculature data as described are evaluated. For these two metrics, lower values indicate better quality, while for all other metrics, higher values indicate better quality.
Details regarding the exact implementation, in particular the new DL method, are available at https://wavelab.at/sources/Kirchgasser24a/.
7. Experimental Evaluation
In the following subsections, the results of the baseline performance evaluation of the selected databases (intending to establish a comparison benchmark for subsequently conducted experiments), the results of the extended evaluation based on [12] and the findings of the proposed DL quality assessment method are presented and discussed.
7.1. Baseline Performance Evaluation
Table 4 lists the corresponding recognition performance results in terms of the EER, the FMR1000, and the ZeroFMR for the finger as well as the hand vein databases, obtained using the FVC verification mode of the PLUS OpenVein Finger- and Hand-Vein Toolkit [35]. The best performance values for each database are highlighted in bold. These results serve as a baseline for the subsequent analysis of the recognition performance progression: if a portion of the lowest quality images is successively removed from the evaluation database, the performance should improve over the baseline.
Table 4 shows that MC as feature extraction method achieves the best recognition performance, especially for the FV databases. The only exception is PLasPAL, where GF was superior to MC and SIFT. For the hand vein databases, SIFT was the best feature extraction method in two of four cases, while MC and GF were best in one case each.
The overall recognition performance is on a very high level, with an EER of close to zero in several cases for most databases. Thus, it is assumed that the subsequently performed quality-based experiments, where the lowest quality images are successively discarded from the performance evaluation, will not significantly improve the EER values, but might result in an improvement of the FMR1000 and ZeroFMR.
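The three reported error figures can be derived from genuine and impostor comparison scores. Below is a minimal sketch, assuming similarity scores where a higher score means a better match (the actual toolkit's conventions may differ); FMR1000 is taken as the lowest FNMR at which the FMR does not exceed 0.1%, and ZeroFMR as the lowest FNMR at which the FMR is zero:

```python
import numpy as np

def verification_metrics(genuine, impostor):
    """Compute EER, FMR1000 (lowest FNMR with FMR <= 0.1%) and ZeroFMR
    (lowest FNMR with FMR = 0) from similarity scores."""
    genuine = np.asarray(genuine, float)
    impostor = np.asarray(impostor, float)
    # Candidate decision thresholds; +inf guarantees a point with FMR = 0.
    thresholds = np.unique(np.concatenate([genuine, impostor, [np.inf]]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])   # false matches
    fnmr = np.array([(genuine < t).mean() for t in thresholds])    # false non-matches
    i = np.argmin(np.abs(fmr - fnmr))          # operating point closest to FMR = FNMR
    eer = (fmr[i] + fnmr[i]) / 2
    return eer, fnmr[fmr <= 1e-3].min(), fnmr[fmr == 0.0].min()
```

On a finite score set the exact FMR = FNMR crossing rarely exists, so the EER is approximated at the closest operating point, a common practical choice.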
7.2. Quality Evaluation Using Extended Training Data
In this extension of [12], BRISQUE and NIQE were retrained using all databases mentioned in Section 3. The evaluation results of the 10-fold BRISQUE and NIQE training are depicted in Figure 5, showing the whole range of the quality values for all dorsal (Figure 5a1,b1), palmar (Figure 5a2,b2) and hand vein (Figure 5a3,b3) samples, respectively, across all three quality classes. This statistical evaluation was only performed for BRISQUE and NIQE to highlight their stability or possible variations regardless which fold is used. Ideally, the quality values should remain stable across all folds. It can be seen that the training of BRISQUE (Figure 5a) and NIQE (Figure 5b) highly corresponds to the databases used. While, for NIQE, the boxplots representing the obtained quality values are quite stable (independent of the selected fold), BRISQUE exhibits more variation, especially using dorsal FV data samples. Hence, NIQE should also achieve more stable results as compared to BRISQUE for the leave one database out experiments. Furthermore, the overall quality values for NIQE are expected to be higher than for BRISQUE, which is a general trend as NIQE is trained on high-quality images only.


The following results focus on the extended evaluation based on the protocol of [12] (third experiment), with the difference that in this study a leave one database out evaluation is performed, whereas in [12] samples of all databases were included during training. The leave one database out protocol was performed for each of the databases listed in the dataset section, using the metrics described in the sections on related work and the NSS principle. In this experiment, an increasing number of low-quality samples (according to the assessed quality scores) is discarded from the databases and not used during the sample comparisons. The samples are sorted according to the assigned quality scores; first, the 5% of images exhibiting the lowest quality are excluded, and this portion is increased stepwise up to 50%. The recognition performance values (EER, FMR1000, and ZeroFMR) were recomputed using the remaining samples, resulting in the trends visualized in Figures 6–9. For a well-performing and suitable sample quality evaluator, all three recognition performance values should decrease with an increasing percentage of rejected images. From the EER and ZeroFMR progression figures, two aspects can be observed immediately: First, the quality assessment performance highly depends on the combination of database and feature extraction method. There is no general trend, and thus no conclusion can be drawn as to which quality evaluation method is suited best across all databases and feature types. Second, if only the EER is taken into account, some of the quality measures do not work as expected, except for particular databases or feature extraction methods. For example, while Radon shows a poor performance on the PLasDOR database using GF, it turns out to be a good indicator for SIFT (both examples can be seen in Figure 8a).
In general, the obtained EER and ZeroFMR values stay within a low value range, which is a desirable observation from a recognition point of view, but a complicating factor regarding the significance of the performance figure-based quality evaluation.
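The progressive discarding protocol (removing the lowest-quality 5% and increasing the portion stepwise up to 50% before recomputing the error rates) might look as follows; the function name and interface are illustrative:

```python
import numpy as np

def progressive_discard(sample_ids, quality_scores, fractions):
    """For each discard fraction, return the samples remaining after removing
    the lowest-quality portion (higher score = better quality)."""
    order = np.argsort(quality_scores)  # ascending: worst quality first
    n = len(sample_ids)
    remaining = {}
    for f in fractions:
        drop = int(round(f * n))
        remaining[f] = [sample_ids[i] for i in order[drop:]]
    return remaining
```

The EER, FMR1000, and ZeroFMR would then be recomputed on each remaining subset, e.g., for `fractions = [0.05, 0.10, ..., 0.50]`, yielding one progression curve per quality metric.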








TNorm exhibits consistent behavior, i.e., the expected monotonically decreasing EER, across all experiments on FV images, apart from some outliers (e.g., see Figure 8a3,b3). The same holds for Wang. NIQE-train is a suitable method for quality assessment, especially for VERA (Figure 8b) and CIE-HV as well as for most FV databases. Compared to BRISQUE-train, NIQE-train is more stable and exhibits a monotonically decreasing EER in most cases, while the EER for BRISQUE-train is slightly increasing in several cases. The benefits of retraining NIQE are clearly visible, as the original NIQE is usually among the worst performing metrics, clearly evident for UTFVP and MC (Figure 6a2). The results for VERA and CIE-HV are similar; thus, the plots are omitted. For PTrans and PRefl, only TNorm and Wang provide reasonably realistic values; for the other metrics, there is a clear, almost linear increase in the EER with an increasing number of discarded low-quality images. Although fluctuations in the EER-based analysis were to be expected, as can be seen in Figures 6 and 8, this cannot be explained by differences in the visual quality of the hand vein database samples, as no major changes can be detected (cf. Figure 3).
Compared to [12], the retraining of both BRISQUE and NIQE on UTFVP and SDUMLA does not improve the results as expected. As visualized in Figure 6, the EER increases with a high number of discarded low-quality images. The most important difference between the current experiments and those in [12] is the data used during the retraining of BRISQUE and NIQE. In [12], a total of 50 poor and 50 good images each for UTFVP and SDUMLA were utilized and no leave one dataset out training was performed. The focus of [12] was on the general robustness of the retrained versions of BRISQUE and NIQE for a given biometric system in which the capturing device as well as the feature representation are known, so that the quality assessment method can be optimized with respect to the particular biometric system. The current study focuses more on the generalisability of those metrics: all images from the selected databases, except those of the database being evaluated (UTFVP or SDUMLA for Figures 6 and 7), are included in the training set. This updated training protocol was chosen to evaluate the generalisability of the retrained methods. As the results reveal, whether data from the same database is included during training has a crucial impact. While in [12] the retrained methods showed the expected trend of a decreasing EER with an increasing number of discarded samples, independent of the database and feature type, the current leave one database out evaluation cannot confirm this trend, as described above. It is also worth noting that, compared to [12], a parameter optimization of the feature extraction methods was done to improve the baseline and overall recognition performance. This makes the existing subtle fluctuations in Figures 6 and 8 more pronounced.
While the EER might not show the influence of a small number of low-quality samples on the biometric recognition performance, especially for a high number of samples, the ZeroFMR (FMR1000) is influenced even by a single low-quality sample if it leads to a false non-match. Hence, the ZeroFMR, shown in Figures 7 and 9, allows a more detailed insight into the quality assessment performance. On UTFVP and SDUMLA, all evaluated quality estimators show the expected trend of a decreasing ZeroFMR toward a higher number of discarded low-quality samples, with a few exceptions (BRISQUE-train for UTFVP using all feature extraction methods, NIQE for UTFVP using GF and MC, and NIQE for SDUMLA using SIFT). On PLasDOR and VERA, the same expected trend can be observed, except for NIQE on PLasDOR, BRISQUE on PLasDOR using MC, NIQE-train on PLasDOR using SIFT, as well as BRISQUE-train on VERA using GF and TNorm on VERA using MC.
Summing up, there is no best performing quality assessment methodology across all databases and feature types. All evaluated methods suffer from a high dependence on the database and feature type. Hence, no metric is able to generalize sufficiently, independent of the database and feature type choice. Furthermore, the retrained versions of BRISQUE and NIQE do not necessarily perform better than their non-retrained versions or other vein quality metrics.
7.3. DL-Based Vascular Quality Assessment
The last experiment evaluates the DL-based quality measure. As mentioned in the previous section (describing the experimental setup), a fourfold cross-validation using a triplet loss function was performed for dorsal FV data, palmar FV data, and the remaining hand vein databases. The results reflecting the classification accuracy after SVM application are presented in Table 2 for each of the manually selected categories. This also includes the results obtained with the reference DL-based quality metric Light-CNN [22], which was clearly outperformed. All in all, the values presented in Table 2 clearly indicate that for the hand vein and palmar FV data the accuracy for the middle class is highest, while for dorsal FV samples the highest accuracy is achieved for the good class. If only the mean accuracy over all folds for each of the three database groups is considered, the accuracy using palmar FV images is higher compared to the other two groups. However, this statement is only partly true: the classification accuracy of the manually categorized poor images is much lower compared to the middle and good categories using palmar venous data, and also lower compared to the accuracy of the poor class when evaluating dorsal vein or hand vein images. There are two potential explanations for this observation. First, as shown in Table 3, the percentage of images contained in the poor class of the palmar FV databases is much lower compared to the dorsal and hand vein ones. In particular, for the palmar databases, the number of images manually labeled as poor is much lower than the number of images labeled as middle or good.
As a consequence, the portion of poor-quality images selected during the training of the DL methods is low and thus, the proposed network focuses (learning) more characteristics given for middle- and good-quality images, which makes it more difficult to assign poor data samples to the correct class. This statement is not only true for the palmar FV samples, the same trend (reduced classification accuracy of images exhibiting poor quality) is also found in the other two data groups, as the overall accuracy for classifying poor-quality images correctly is also much lower for the other two groups.
Second, the differentiation between the single manually selected quality classes is not always an easy task (even for trained humans) due to the nature of the biometric trait. In general, the images exhibiting vasculature information are often of low contrast or illumination variations during the acquisition can lead to higher intensity values in some areas of the finger/hand (enabling a better visibility of the vein pattern), while other areas remain quite dark (reducing the likelihood of correctly determining the presence of vasculature patterns). For example, such illumination variations can easily be detected by the trained network as problematic areas and thus, samples containing such areas were mistakenly classified as poor-quality images (even if they are not).
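The triplet objective underlying the described training pulls embeddings of same-quality-class samples together and pushes those of different classes apart; the SVM then classifies the learned embeddings into poor/middle/good. A generic NumPy formulation of the standard triplet margin loss (the paper's exact network architecture and margin value are not reproduced here):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on batches of embedding vectors: the anchor should
    be closer to the positive (same quality class) than to the negative
    (different class) by at least the margin."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)  # squared anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2, axis=-1)  # squared anchor-negative distance
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

Triplets whose negative is already farther than the positive by more than the margin contribute zero loss, so training concentrates on the hard, ambiguous cases, exactly those borderline poor/middle samples discussed above.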
The experiments reporting the EER performance for the DL method were conducted differently. While for the non-DL quality methods each image is labeled with a continuous quality value (allowing the databases to be sorted and a certain percentage of low-quality samples to be removed), the proposed DL method categorizes each data sample into one of three quality classes. Hence, only a rough ordering can be established by successively excluding from the score computation all images assigned to one of the three manually selected classes. To enable a comparison, the samples assessed by the non-DL-based methods are divided into three quality classes based on their quality scores, with the same number of samples per class as assigned by the DL-based method (the samples exhibiting the lowest quality are assigned to the poor class, the next lowest to the middle one, and the remaining ones to the good class). Thus, a similar performance-specific evaluation can be performed using all quality assessment techniques. In Figures 10 and 11, the corresponding results, once more using EER and ZeroFMR, are depicted.
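The matching of the continuous non-DL scores to the DL class sizes can be sketched as follows; the function and its interface are illustrative assumptions:

```python
import numpy as np

def score_based_classes(quality_scores, class_sizes):
    """Assign 'poor'/'middle'/'good' labels by sorting continuous quality
    scores (higher = better) and matching the per-class sample counts
    produced by the DL classifier."""
    n_poor, n_middle, _ = class_sizes
    order = np.argsort(quality_scores)  # ascending: worst quality first
    labels = np.empty(len(quality_scores), dtype=object)
    labels[order[:n_poor]] = "poor"
    labels[order[n_poor:n_poor + n_middle]] = "middle"
    labels[order[n_poor + n_middle:]] = "good"
    return labels
```

With identical class sizes on both sides, excluding "the poor class" (or "poor and middle") removes the same number of samples for every metric, which makes the error-rate comparison in Figures 10 and 11 fair.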




The overall performance trend is similar to the non-DL methods in that there is no best-performing quality assessment methodology; there is a high dependence on the selected database and feature type. However, two distinct trends are still observable. The first one, visible in Figures 10a and 11a for the PLasDOR database, indicates nonmeaningful quality assessment (the EER increases in some cases while it remains stable or even decreases in others). The second trend can be seen on the CIE dataset, where the exclusion of the poor and middle class finger and hand vein images resulted in a performance improvement compared to the other databases. A significant performance improvement is expected if the quality assessment method is suitable to support the successful application of a biometric authentication system. This trend is only detectable when using the DL method on FV-USM, PLasPAL (only for GF), PLEDDOR, PLEDPAL (excluding SIFT), SDUMLA, UTFVP, and CIE-HV. Thus, at least for these databases, an improvement over the traditional non-DL-based quality assessment can be reported for the DL-based method.
8. Conclusion
This work evaluated the suitability of several image quality assessment schemes as biometric quality estimators for vasculature pattern samples. As an extension of the previous work [12], additional databases as well as a new DL-based approach were included. As in the previous work, BRISQUE and NIQE turned out not to be suited as finger and hand vein quality measures if pretrained on common images and classical distortions only. Their counterparts retrained on the vasculature pattern databases exhibited a better performance, although, compared to classical vasculature sample quality measures, these retrained versions of BRISQUE and NIQE do not necessarily perform better. The DL-based approach achieved only mediocre classification performance and, due to its limitation to three quality classes instead of a dedicated quality score, was not fully satisfactory. However, with an adaptation of the evaluation protocol allowing a comparison between the nonlearning and the DL methods, it was shown that for some databases the DL quality prediction ensured a performance improvement. Hence, if the number of quality classes is limited, the DL method is likely to yield a clearer quality prediction, which is beneficial for successful authentication. In general, the results showed that the optimal quality measure highly depends on the selected database and feature representation, which is once again in line with the previous findings in [12].
In the future, it is planned to refine and extend the DL-based approach, enabling to output a dedicated quality score and improving its classification accuracy.
Disclosure
For open access purposes, the author has applied a CC BY public copyright license to any author accepted manuscript version arising from this submission.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This research was funded in whole or in part by the Austrian Science Fund (FWF) under project no FWF P32201 and by the Salzburg State Government.
Open Research
Data Availability Statement
All utilized datasets are publicly available using the following hyperlinks: FV-USM http://drfendi.com/fv_usm_database/, HKPU-FV https://www4.comp.polyu.edu.hk/~csajaykr/fvdatabase.htm, MMCBNU 6000 https://ieeexplore.ieee.org/document/6744030, PLUSVein-FV3 https://wavelab.at/sources/PLUSVein-FV3/, SDUMLA https://time.sdu.edu.cn/kycg/gksjk.htm, UTFVP https://pythonhosted.org/bob.db.utfvp/, CIE-HV http://atlas.put.poznan.pl/noindex.html, PROTECT Handvein dataset https://protect.mozello.com/dataset/, and VERA https://www.idiap.ch/en/dataset/vera-palmvein.