Full-reference tone-mapped image quality assessment

Various tone mapping operators have been proposed to convert high dynamic range images to low dynamic range images in order to improve their visualization on low dynamic range displays. This paper presents a full-reference objective quality assessment index to evaluate the perceived quality of tone-mapped images. The proposed method seamlessly combines the multi-scale structural fidelity, statistical naturalness, colourfulness and multi-scale free energy of the image to create a similarity score between a produced low dynamic range image and its corresponding high dynamic range image. Extensive experiments on three publicly available datasets using Spearman's rank-order correlation coefficient, Kendall's rank-order correlation coefficient and receiver operating characteristics analyses indicate that the proposed tone-mapped quality index is superior to recently proposed state-of-the-art objective quality indices.


INTRODUCTION
Traditional imaging methods, known as low dynamic range (LDR) imagery, only handle 8 bits or less per colour channel per pixel, whereas high dynamic range (HDR) imagery records very dark and bright areas of a scene at the same time while avoiding under-exposed and over-exposed areas [1].
Most display devices are not able to display HDR content. As a result, over the last two decades, researchers have devoted significant time and effort to the so-called tone mapping operations, which compress HDR images and videos so that the results are visualized more naturally on LDR displays. A typical tone mapping operator (TMO) preserves, to some extent, characteristics such as local and global contrast and the details of the HDR content, and is defined mathematically as follows [1]:

T : ℝ_i^{w×h×c} → o^{w×h×c},   (1)

where T is the TMO, I is the image, w and h are the width and height of the image, c is the number of colour bands in I (for the RGB colour space, c = 3), ℝ_i ⊂ ℝ and o ⊂ ℝ_i (for normal LDR monitors, o = [0, 255]). TMOs can be categorized into global, local, frequency/gradient and segmentation operators [1]. Global operators apply the same operation to all pixels. Local operators apply, for each pixel, an operation on a neighbourhood of that pixel. Frequency operators first separate the image's low and high frequencies and then apply the operator to the low frequencies while keeping the high frequencies to preserve fine details. Segmentation operators segment the image into broad regions and then apply a different mapping to each region [1].
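As a minimal illustration of a global operator, the sketch below maps every pixel with the same logarithmic curve. This is a simplified, hypothetical mapping in the spirit of logarithmic operators such as [2], not any published TMO verbatim:

```python
import math

def log_tonemap(hdr, ldr_max=255.0):
    """Toy global TMO: compress luminance with log(1 + L), normalized so
    the brightest HDR value maps to ldr_max. Every pixel gets the same
    curve, which is what makes the operator 'global'."""
    l_max = max(max(row) for row in hdr)
    return [[ldr_max * math.log(1.0 + p) / math.log(1.0 + l_max)
             for p in row] for row in hdr]
```

A local operator would instead recompute the curve (or its parameters) from a neighbourhood around each pixel.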
Some representative examples include adaptive logarithmic [2], contrast-based scale factor [3] and quantization techniques [4] as global operators; photographic tone reproduction [5] and a TMO for high-contrast images [6] as local operators; low curvature image simplifiers [7], trilateral filtering [8] and the image colour appearance model (iCAM06) [9] as frequency operators; and interactive manipulation (IM) [10] and lightness perception [11] as segmentation operators.
Reducing the dynamic range of an image inevitably results in the loss of some information. Therefore, a TMO aims to produce a natural-looking, realistic LDR image while preserving the structural information of the given HDR image [12]. TMOs can act differently on different HDR images, which means that not every TMO performs well for any given HDR image [13]. Similar to image quality assessment (IQA) in other areas, TMO methods and their effectiveness in producing high-quality LDR images can be evaluated using two different approaches: traditional subjective IQA (SIQA) and objective IQA (OIQA). In SIQA, which relies on human subjective evaluations, a number of subjects are asked to rate LDR images produced by different TMOs based on their overall quality, brightness, contrast, naturalness, detail reproduction and colour [12]. The mean opinion score (MOS) obtained from SIQA is usually regarded as the best approach for IQA [14]. However, subjective tests are expensive, time-consuming and difficult to employ for improving TMOs, and some subjects may miss valuable structures contained in HDR images [12].
OIQA designs computational models to predict the perceived image quality accurately and automatically [14]. Although various OIQA methods have been proposed in different areas of IQA, only a few full-reference assessment methods have been developed for HDR images. Mantiuk et al. [15] proposed the HDR visible difference predictor (HDR-VDP) to distinguish between visible and invisible distortions. HDR-VDP is not specifically designed to compare HDR images with their LDR versions, as it only predicts the visibility of differences between two HDR images of the same dynamic range. Aydin et al. [16] further improved HDR-VDP to be dynamic-range independent, but their improved method does not produce a single quality score for a given image, which makes it almost impossible to validate against subjective evaluations of overall image quality [12]. One of the first OIQA methods specifically designed for evaluating TMOs is the tone-mapped quality index (TMQI) [12]. Inspired by the structural similarity (SSIM) approach [17], its multi-scale derivations [18,19] and the natural scene statistics (NSS) approach [20], TMQI combines a multi-scale structural fidelity measure and a statistical naturalness measure to produce an overall quality score [12]. Ma et al. [21] later investigated each component of TMQI (i.e. the structural fidelity and statistical naturalness components) and improved each term separately. The improved index is known as TMQI-II. While both TMQI and TMQI-II yield a decent quality score compared with the subjective score of a tone-mapped image, they do not consider the colourfulness of the image, which indicates the image's colour vividness and the colour information preserved after the tone-mapping operation. The feature similarity index for tone-mapped images (FSITM) [13] is another OIQA metric, which compares the locally weighted mean phase angle map of the HDR image with that of its converted LDR image.
The phase angle in FSITM can be calculated using any of the three red (FSITM_R), green (FSITM_G) and blue (FSITM_B) channels. The final FSITM score, obtained from any of the three channels, can be combined with the TMQI score to further improve the performance of each of the two methods. FSITM effectively uses the gradient domain to capture image quality, but, similar to TMQI and TMQI-II, it does not consider the colour information of the tone-mapped image either. In addition, in real applications, one challenge of using FSITM is deciding which colour channel should be employed, as different channels (i.e. R, G or B) yield different quality scores. This paper, inspired by TMQI, presents a new quality assessment metric for tone-mapped images, which combines multi-scale structural fidelity, statistical naturalness, colourfulness and image free-energy measures to produce a quality index. The quality index is called SNCF, an acronym of its four components: structural fidelity, naturalness, colourfulness and free energy. The proposed index first computes the structural fidelity of the image at different scales to produce the multi-scale structural fidelity measure, which is weighted based on psychophysical experiment results. Then, it measures the statistical naturalness and colourfulness of the image and combines them to produce the colour quality measure. Next, SNCF employs the free-energy principle to model the perception and understanding of an image by the brain and produce a psychovisual quality metric. Finally, these metrics are combined to produce the final tone-mapped image quality SNCF score.
The rest of the paper is organized as follows. Section 2 presents the methodology, including the four main features of the proposed measure. Section 3 shows the experimental results on the three publicly available databases and compares the proposed SNCF index with some of the recently proposed stateof-the-art methods. Finally, Section 4 draws the conclusion.

METHODOLOGY
TMOs are not able to preserve all structural information since they reduce the dynamic range of HDR images [22]. Therefore, assessing the preserved structural information is a key element in evaluating a TMO. Another key element in assessing TMOs and LDR images is the naturalness of the LDR image [12]. Naturalness is the degree of correspondence between human perception and the real world [23]. Yeganeh and Wang showed that some LDR images that maintain the structural information of their corresponding HDR images well still look overly dark [22]. Their proposed index, TMQI, attempts to strike a balance between these two factors (i.e. structural fidelity preservation and high naturalness). However, these are not the only factors influenced by tone-mapping operations. Colourfulness, which represents the degree of colour vividness, and the image's free energy, which reflects how the brain infers the meaningful part of the visual stimuli, are two other important factors that need to be considered when evaluating image quality [23,24]. Generally, colourfulness and naturalness are both decisive elements when images are judged for their perceived quality of colour reproduction [25]. The image's free energy, on the other hand, models the perception and understanding of the image as an active inference process in which the brain tries to explain the scene using an internal generative model [26]. It increases in the presence of detail and decreases in the presence of blurriness [27].
The following subsections first present these four factors and then propose the SNCF index.

Structural fidelity
The SSIM method is one of the fundamental methods to measure structural fidelity from the image formation point of view [17]. SSIM combines structure (s), luminance (l) and contrast (c) comparisons of two images (assuming one image has perfect quality, SSIM then quantifies the quality of the second image). l, c, s and SSIM for two image patches x and y are defined as follows [17]:

l(x, y) = (2μ_x μ_y + C_l) / (μ_x² + μ_y² + C_l),
c(x, y) = (2σ_x σ_y + C_c) / (σ_x² + σ_y² + C_c),
s(x, y) = (σ_xy + C_s) / (σ_x σ_y + C_s),
SSIM(x, y) = l(x, y)^α · c(x, y)^β · s(x, y)^γ,

where μ_x and μ_y are the mean intensity values of the two signals; σ_x, σ_y and σ_xy are the standard deviations and the cross-correlation between the two images; C_l, C_c and C_s are constants used to avoid instability in each component; and α, β and γ are positive values that adjust the relative importance of each component. The overall SSIM score for a given image is computed as the mean of the SSIM values of all pixels. The dynamic range of an LDR image changes completely after conversion from its respective HDR image; therefore, the SSIM luminance component is not suitable for quality assessment of TMOs [12]. The SSIM contrast component penalizes any change in signal strength, whereas differences between HDR and LDR images should only be penalized when one of the two signal strengths is above and the other is below the visibility threshold [12]. This led researchers to modify the contrast component and use the psychometric function known as Galton's ogive [28], which is similar to the cumulative normal distribution function, to nonlinearly map signals above and below the visibility threshold towards 1 and 0, respectively. This mapping function is defined as follows:

σ̃ = (1 / (√(2π) θ)) ∫_{−∞}^{σ} exp( −(u − t)² / (2θ²) ) du,

where t is the modulation threshold and θ is the standard deviation of the normal distribution, which controls the slope of the detection probability variation [12]. t is set to λμ/A, where μ is the mean intensity value, λ is a constant and A is the contrast sensitivity function. θ is then set to t/k, where k is roughly a constant with a value between 2.3 and 4.
The structural fidelity for each pixel of an HDR and an LDR image is then defined as follows:

SF(x, y) = [(2σ̃_x σ̃_y + C_l) / (σ̃_x² + σ̃_y² + C_l)] · [(σ_xy + C_s) / (σ_x σ_y + C_s)],

where σ̃_x and σ̃_y are the nonlinearly mapped local standard deviations of the HDR and LDR patches. Finally, the mean of the SF values of all pixels is taken as the overall SF value. The structural fidelity can be computed at different scales to ensure that image details are visible [12,18]. To this end, the HDR and LDR images are iteratively low-pass filtered and down-sampled before their corresponding overall SF value is calculated. The final multi-scale SF score, MSF, can be computed as follows:

MSF = ∏_{l=1}^{L} SF_l^{w_l},

where L is the total number of scales, and SF_l and w_l are the overall structural fidelity score and its respective weight at the lth scale. To compute the structural fidelity of an RGB colour image, the image is first converted to the Yxy space and the structural fidelity of the Y component is then measured (x and y are the chrominance channels and Y is the luminance channel).
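The multi-scale pooling step can be sketched as follows. The per-scale SF scores are assumed to be computed already, and `downsample2` is a simple stand-in for the low-pass filter plus decimation:

```python
import math

# Per-scale weights from the psychophysical experiments cited in the paper [18].
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def downsample2(img):
    """Average non-overlapping 2x2 blocks (stand-in for low-pass + decimate)."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)] for r in range(0, h, 2)]

def multi_scale_score(sf_per_scale, weights=WEIGHTS):
    """MSF = prod_l SF_l ** w_l, the MS-SSIM-style weighted product."""
    return math.prod(s ** w for s, w in zip(sf_per_scale, weights))
```

Since the weights roughly sum to 1, a constant per-scale score is (almost) preserved by the pooling.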

Naturalness and colourfulness
While naturalness is a subjective quantity, TMO methods try to reproduce the naturalness of reference images to obtain realistic and natural low dynamic range images. Studies show that brightness and contrast are highly correlated with perceived naturalness, and brightness mapping is an especially unavoidable issue in all tone-mapping operations [12]. Therefore, statistical naturalness models based on these two factors are generally used to capture perceptual naturalness [12,21]. In [12], Yeganeh et al. built a model by analyzing the histograms of the means and standard deviations of about 3000 grey-scale images and found that these histograms can be properly fitted using a Gaussian and a beta probability density function, GP and BP, as follows:

GP(m) = (1 / (√(2π) σ_m)) exp( −(m − μ_m)² / (2σ_m²) )

and

BP(d) = ((1 − d)^{β_d − 1} d^{α_d − 1}) / B(α_d, β_d),

where m and d represent the mean and standard deviation, and B(·,·) is the beta function. The values of μ_m, σ_m, α_d and β_d are found by regression to be 115.94, 27.99, 4.4 and 10.1, respectively [12]. Then, the NSS measure is computed by

NSS = (1/K) GP(m) BP(d),

where K is a normalization factor, set to max(GP(m)) · max(BP(d)), which ensures the NSS is bounded between 0 and 1. The naturalness is also applied to the Y channel of the Yxy colour space. Colourfulness, on the other hand, which represents the degree of colour vividness, determines the attribute of colour appearance responsible for the strength of the subjective chromatic response by which the hue is recognized [25]. When the dynamic range of an HDR image is compressed to produce its corresponding LDR image, colour information gets lost to some extent [29]. Therefore, one factor in evaluating TMOs can be measuring how much colour information is preserved in the LDR image.
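The statistical naturalness measure can be sketched as below. The raw standard deviation is assumed to be pre-normalized to (0, 1) before the beta density is evaluated, and the normalization factor K is passed in explicitly:

```python
import math

MU_M, SIG_M = 115.94, 27.99    # Gaussian fit for the patch mean [12]
ALPHA_D, BETA_D = 4.4, 10.1    # Beta fit for the (normalized) std [12]

def gaussian_pdf(m, mu=MU_M, sigma=SIG_M):
    return math.exp(-(m - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def beta_pdf(d, a=ALPHA_D, b=BETA_D):
    # Beta function B(a, b) via the gamma function; requires 0 < d < 1.
    bfun = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return d ** (a - 1) * (1 - d) ** (b - 1) / bfun

def nss(m, d, k_norm):
    """Statistical naturalness: GP(m) * BP(d) / K, bounded in [0, 1]."""
    return gaussian_pdf(m) * beta_pdf(d) / k_norm
```

K is the product of the two density maxima (the Gaussian peaks at μ_m and the beta density at its mode (α − 1)/(α + β − 2)), so NSS reaches 1 only for an image whose mean and spread both sit at the most "natural" values.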
Computing the perceived colourfulness of a natural scene is a challenging task. As a result, researchers have proposed different techniques to measure an image's overall colourfulness [23,25,29,30]. Inspired by the colourfulness indices proposed in [29,30], the index to compute the perceived colourfulness of the LDR image is defined as follows. First, the RGB colour space is converted into an opponent colour space:

rg = R − G,
yb = 0.5(R + G) − B.

Then, the colourfulness index (CI) is computed as follows:

CI = (σ_rg + σ_yb) / (|μ_rg| + |μ_yb| + C_CI),

where σ_rg, σ_yb, μ_rg and μ_yb are the standard deviation and mean values of the respective rg and yb colour channels, and |·| indicates the absolute value operation. C_CI is a constant used to avoid instability in the CI index. Based on our experiments, images with more natural colour have CI values between 3 and 3.5, and the CI value of other images is either less than 3 or greater than 3.5. The experiment for this setting was done using 20 collected natural-colour images to which different types of noise were added. The degree of colourfulness for each reference image and its respective noisy images was then computed and scaled using different pairs of values; 3 and 3.5 are the pair of values that give the 20 original natural images the best colourfulness scores compared with the noisy images. To ensure the CI values are bounded between 0 and 1, a bell-shaped (i.e. Gaussian) multi-option function is used to scale them. This function sets the scaled CI to 1 for images whose CI index lies in [3, 3.5]; for CI values less than 3 or greater than 3.5, it uses the left and right sides of the function, respectively. The multi-option function to calculate the scaled colourfulness index SCI is as follows:

SCI(CI) = exp( −(CI − p_l)² / (2θ²) )  if CI < p_l,
SCI(CI) = 1                             if p_l ≤ CI ≤ p_r,
SCI(CI) = exp( −(CI − p_r)² / (2θ²) )  if CI > p_r,

where p_l = 3, p_r = 3.5 and θ is a constant that determines how CI values outside the [3, 3.5] range are scaled. Figure 1 plots the general form of the SCI measure for θ = 1.
The figure clearly indicates that the scaled colourfulness of any image with a CI value outside the [3, 3.5] range decreases with its distance from this range. Finally, the colourfulness and naturalness of the scene can be linearly combined to form the colour quality NC of the image:

NC = w · NSS + (1 − w) · SCI,

where w is a weighting parameter that compromises between the naturalness and colourfulness constraints. We set w = 0.75 to be consistent with the research of Yendrikhovski et al. [25], in which their proposed colourfulness and naturalness indices were linearly combined with a weighting parameter experimentally set to 0.75.
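The colour-quality pipeline can be sketched end to end as follows. The exact CI formula here is a plausible reading of the paper's ambiguous definition, not the verbatim index, and `colourfulness`, `scaled_ci` and `colour_quality` are illustrative names:

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _std(xs):
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def colourfulness(R, G, B, c_ci=0.25):
    """Chromatic spread over chromatic mean magnitude in the rg/yb
    opponent space, stabilized by C_CI (assumed form)."""
    rg = [r - g for r, g in zip(R, G)]
    yb = [0.5 * (r + g) - b for r, g, b in zip(R, G, B)]
    return (_std(rg) + _std(yb)) / (abs(_mean(rg)) + abs(_mean(yb)) + c_ci)

def scaled_ci(ci, p_l=3.0, p_r=3.5, theta=1.0):
    """Bell-shaped scaling: 1 inside [p_l, p_r], Gaussian falloff outside."""
    if ci < p_l:
        return math.exp(-(ci - p_l) ** 2 / (2 * theta ** 2))
    if ci > p_r:
        return math.exp(-(ci - p_r) ** 2 / (2 * theta ** 2))
    return 1.0

def colour_quality(nss_val, sci_val, w=0.75):
    """NC = w * NSS + (1 - w) * SCI, with w = 0.75 following [25]."""
    return w * nss_val + (1 - w) * sci_val
```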

FIGURE 1
The CI values scaled in accordance with the plot

Image free energy
The free-energy (FE) principle unifies several brain theories in the biological and physical sciences about human action, perception and learning [26]. Since all adaptive biological agents resist the natural tendency to disorder in an ever-changing environment, the FE principle avoids encountering 'surprise' in order to keep the agents' internal states at a low entropy level under different environments and maintain those states within physiological bounds. This lets biological agents appear to resist the second law of thermodynamics (which states that the entropy of a system not at equilibrium tends to increase over time, approaching a maximum value at equilibrium [26]). A biological agent cannot measure or avoid 'surprise' directly, so free energy is usually used as an upper bound on it; the minimization of free energy implicitly minimizes 'surprise'. On the other hand, the Bayesian brain theory indicates that the human visual system uses an internal generative mechanism (IGM) for visual perception, while some structural uncertainties are discarded during understanding [31]. FE minimization is closely related to predictive coding [26]. Thus, if we consider the IGM to be a linear autoregressive (AR) model that separates the disorderly uncertainty from the input scene, then the process of FE minimization is equivalent to encoding the input visual signal I with the minimum number of bits based on the AR model [24,26,31,32]. To minimize the coding length, a piecewise AR model is used, which ensures the model parameters adjust on a pixel-by-pixel basis [26].
As a result, the total description length of the image I under a kth-order AR model is defined as follows [24,26]:

L(θ) = −log P(I | θ) + (k/2) log N,

where θ is the parameter vector, N is the number of pixels and the model is selected by minimizing L(θ). In addition, for large samples, when N → ∞, the FE ℑ(I) approaches the total description length:

ℑ(I) ≈ L(θ).

This means that the total description length of the image data computed using the AR model is an approximation of the image's FE. In other words, the entropy of the prediction residuals plus the model cost can be used as an estimate of the image's FE. In practice, a fixed model order such as k = 8 is used to simplify the complicated model-selection process, and the second term (k/2) log N is ignored [24,26]. In addition, changes in visual appearance such as direction, scale and frequency lead to images with different structures. Therefore, we propose to estimate the image's free energy at different scales. As a result, the free energy for a given image I is defined in Equation (19) as a combination of the multi-scale estimates fe_i, where fe_i represents the approximated FE of the image at a reduced resolution in which each non-overlapping i × i patch of I is replaced by the local mean value of the patch. Equation (19) not only guarantees employing the estimated FE at multiple scales, but also results in a more robust value by considering a proportion that approximately indicates the slope of the FE's change across scales. Finally, the four discussed features (i.e. structural fidelity, naturalness, colourfulness and free energy) are combined to propose the TMO quality assessment index SNCF as follows:

SNCF = a · MSF + b · NC + c · FE,   (20)

where a, b and c are real values between 0 and 1 that adjust the relative importance of the three components (MSF, NC and FE).
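The residual-entropy idea behind the FE estimate can be sketched as a rough surrogate. The paper fits a piecewise AR model of order k = 8; the sketch below replaces that learned fit with a fixed equal-weight causal predictor, so it only illustrates the principle (entropy of quantized prediction residuals), not the actual estimator:

```python
import math
from collections import Counter

def free_energy(img, levels=256):
    """FE surrogate: predict each pixel from its three causal neighbours
    with fixed equal weights, then take the Shannon entropy (bits) of the
    quantized residual distribution. Detailed content -> spread-out
    residuals -> high entropy; smooth/blurred content -> low entropy."""
    residuals = []
    for r in range(1, len(img)):
        for c in range(1, len(img[0])):
            pred = (img[r - 1][c] + img[r][c - 1] + img[r - 1][c - 1]) / 3.0
            residuals.append(int(round(img[r][c] - pred)) % levels)
    counts = Counter(residuals)
    n = len(residuals)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

The multi-scale fe_i values of the paper would be obtained by rerunning this on mean-pooled versions of the image.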

Implementation
The proposed SNCF method has several parameters that need to be set to compute the final tone-mapped image quality score. Some parameters, which relate to previous research, are set to their corresponding recommended values, and others are set empirically. Following the psychophysical experiment results in [18], the number of scales L and the weights w_l are set to 5 and {0.0448, 0.2856, 0.3001, 0.2363, 0.1333}, respectively [12,18]. C_l and C_s in the structural fidelity index are set to 0.01 and 10, respectively. C_CI in the colourfulness index is set to 0.25. The real values a, b and c in Equation (20) are empirically set to 0.7742, 0.1290 and 0.0968, respectively. As will be discussed shortly, three datasets are used to evaluate the proposed metric. All the empirically set values are investigated using only one of these datasets (the Yeganeh&Wang dataset), so the results on the other two datasets can verify the robustness of these parameters. The algorithmic view of the proposed index is illustrated in Algorithm 1.

ALGORITHM 1 Algorithmic view of the proposed tone-mapped image quality score
Input: The reference HDR image and the respective LDR image.
Output: The final tone-mapped image quality score.
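With all four measures in hand, the final combination can be sketched as below. The linear weighted-sum form and the function name are assumptions consistent with the three weights a, b and c reported above (which sum to roughly 1):

```python
# Empirically set weights from the implementation section of the paper.
A, B, C = 0.7742, 0.1290, 0.0968

def sncf(msf, nc, fe):
    """Final score as a weighted sum of multi-scale structural fidelity
    (MSF), colour quality (NC) and the free-energy term (FE). The linear
    form is an assumption; the paper only states the three weights."""
    return A * msf + B * nc + C * fe
```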

EXPERIMENTAL RESULTS
In this section, the proposed index is compared with some recently proposed state-of-the-art methods. The compared methods include TMQI, TMQI-II, the three variants of the FSITM method (i.e. FSITM_R, FSITM_G and FSITM_B) and the three variants of FSITM_TMQI (i.e. FSITM_R-TMQI, FSITM_G-TMQI and FSITM_B-TMQI). All of these methods are implemented in MATLAB using the code available on the authors' websites. Moreover, for each metric, the parameters are set as recommended by the respective researchers. These TMO quality assessment metrics are evaluated by conducting experiments on three publicly available datasets: the Yeganeh&Wang [12], MMSPG [33] and TMQID [34] datasets. For the first two datasets, since their MOSs are available, the Spearman's rank-order correlation coefficient (SRCC) and the Kendall's rank-order correlation coefficient (KRCC) are employed to objectively evaluate the performance of the proposed method and the other state-of-the-art methods. SRCC and KRCC are defined as follows:

SRCC = 1 − (6 ∑_{i=1}^{N} d_i²) / (N(N² − 1)), with d_i = r_{s,i} − r_{o,i},

KRCC = (N_c − N_d) / (0.5 N(N − 1)),

where r_{s,i} and r_{o,i} are the subjective and objective rank scores of the ith image, respectively, N is the number of LDR images, and N_c and N_d are the numbers of concordant and discordant rank-order pairs in the dataset, respectively. For the third dataset, the authors in [34] provided two new measures to answer two questions about metric performance. The two questions are as follows [34]: Q1) 'Can the metrics successfully predict which images are perceived to be statistically different by the observers?', and Q2) 'If the two images are different (with statistical significance), how well can the metrics determine which one is of better quality?'.
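The two rank correlations can be computed directly from their definitions; the following is a minimal pure-Python sketch (simple ordinal ranking with no tie correction, which the full definitions would handle):

```python
def ranks(xs):
    """1-based rank scores (ties broken by input order; fine for a sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def srcc(subjective, objective):
    """SRCC = 1 - 6 * sum(d_i^2) / (N (N^2 - 1))."""
    n = len(subjective)
    rs, ro = ranks(subjective), ranks(objective)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, ro))
    return 1 - 6 * d2 / (n * (n * n - 1))

def krcc(subjective, objective):
    """KRCC = (N_c - N_d) / (0.5 N (N - 1)) over all pairs."""
    n = len(subjective)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (subjective[i] - subjective[j]) * (objective[i] - objective[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (0.5 * n * (n - 1))
```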
To answer these questions, based on the scores obtained by a metric, a binary classification problem is formulated, and its decision ability (and therefore the effectiveness of the respective metric) is evaluated by receiver operating characteristics (ROC) analyses and area under the curve (AUC) values. Answering each question results in a different ROC analysis: a different-vs.-similar ROC analysis for Q1, with AUC values called AUC-DS, and a better-vs.-worse ROC analysis for Q2, with AUC values called AUC-BW. Higher AUC-DS and AUC-BW values for a metric indicate better performance compared with its peers [34]. It should be mentioned that these measures do not need subjective ratings/rankings of the images.
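The AUC in both analyses can be sketched with the standard rank-based estimator (a generic two-group AUC; the grouping of pairs into 'different/similar' or 'better/worse' is assumed to come from the subjective data of [34]):

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: the probability that a randomly chosen positive
    example (e.g. a 'different' or 'better' pair) is scored above a
    randomly chosen negative one, with ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 1.0 means perfect separation; 0.5 is chance level, which is why values near 0.5 in Table 3 indicate random guessing.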

Yeganeh&Wang dataset
The Yeganeh&Wang dataset is introduced in [12]. To have a better visual comparison, boxplots of the SRCC and KRCC values of the compared metrics are included in Figure 2. In each boxplot, the lower hinge, the bold red horizontal line and the upper hinge correspond, respectively, to the first quartile (i.e. 25th percentile), the second quartile (i.e. the median) and the third quartile (i.e. 75th percentile). The upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the upper hinge (where IQR is the distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value no further than 1.5 × IQR from the lower hinge. Points beyond the ends of the whiskers are 'outlying' points and are plotted individually. This further confirms the effectiveness of the proposed SNCF index.

MMSPG dataset
The MMSPG dataset is introduced in [33] and contains 20 HDR images. It provides compressed versions of the display-adapted HDR images produced by three JPEG XT profiles, referred to as profiles A, B and C. For each of the three profiles, four different bit-rate levels are used to compress the images, so for each of the 20 HDR images, 12 compressed images along with their MOSs are provided. Table 2 summarizes the SRCC and KRCC values for the proposed metric together with TMQI, TMQI-II and the different channels of the FSITM and FSITM_TMQI metrics. While the first row indicates the compared methods, the other rows list the minimum, mean, median, maximum and standard deviation (STD) values across all 20 HDR images, respectively. Clearly, the proposed SNCF index achieves the best minimum values. Similar to the previous subsection, the boxplots of the SRCC and KRCC values of the compared metrics are included in Figure 3, which further shows the effectiveness of the proposed SNCF index.

TMQID dataset
The TMQID dataset is introduced in [34] with two different types of content: natural and computer-generated. Each source content type contains 10 HDR images. Each HDR image has nine LDR images, which are produced by different settings of the Drago et al. [2], Mantiuk et al. [36], Kuang et al. [9], Reinhard et al. [5] and Mai et al. [38] TMOs. This paper uses only the images with natural content, and the AUC-DS and AUC-BW measures are computed to compare the performance of the proposed index with the state-of-the-art indices. Table 3 summarizes the AUC-DS and AUC-BW values for the nine compared indices. Clearly, the proposed SNCF index achieves the highest AUC-DS value. It should be noted that an AUC value around 0.5 is roughly equivalent to random guessing [34], and among the AUC-DS values obtained by the compared state-of-the-art methods, only FSITM_G-TMQI obtains an AUC-DS value higher than 0.6 (i.e. 0.6007), while the proposed SNCF metric achieves 0.6597, which is significantly higher than 0.6. For AUC-BW, only one method achieves a value lower than 0.6 (i.e. FSITM_G). It is worth mentioning that, based on these extensive experiments, most of the compared methods do not have a consistent performance and behave differently on the three databases. For example, the FSITM_R method obtains the fourth highest mean SRCC and KRCC values on the Yeganeh&Wang dataset, the seventh highest mean SRCC and KRCC values on the MMSPG dataset and the third highest AUC-DS and AUC-BW values on the TMQID dataset. Similarly, the FSITM_G-TMQI method obtains the third highest mean SRCC and KRCC values on the Yeganeh&Wang dataset, the second highest mean SRCC and KRCC values on the MMSPG dataset and the sixth highest AUC-DS and AUC-BW values on the TMQID dataset. In addition, we can note from the experiments that, while some of the six variants of the FSITM method yield good performance, not knowing which of the three channels (i.e. red, green or blue) to use is a major issue in practice.
This further confirms the robustness and effectiveness of the proposed SNCF index across different datasets: except for the AUC-BW metric on the TMQID dataset, where it achieves a comparable value, SNCF outperforms all the compared state-of-the-art methods in all other comparisons across the three datasets.

Components analysis
This subsection analyzes the importance of each of the four SNCF components: S (structural fidelity), N (naturalness), C (colourfulness) and F (free energy). To show the contribution of each component to the performance of the proposed SNCF index, the quality of the images in the Yeganeh&Wang dataset is evaluated using four variants of the SNCF index: the NCF index, which excludes the structural fidelity component; the SCF index, which excludes the naturalness component; the SNF index, which excludes the colourfulness component; and the SNC index, which excludes the free-energy component. The SRCC and KRCC mean and median values for these four variants, together with the results of employing each component individually, are summarized in Table 4. The results in the table clearly indicate that, while the structural fidelity and naturalness components are individually the most sensitive to image quality, the best metric for assessing the quality of tone-mapped images consists of all four components. This confirms that all four components need to be measured to better evaluate the image quality.
On the other hand, the computational run-time of each individual component is further analyzed and summarized in Table 5. This table lists the average run-time, in seconds, to evaluate the quality of a tone-mapped image for each of the four main components on the Yeganeh&Wang dataset. The images in this dataset have a minimum size of 535 × 357 × 3 pixels, a maximum size of 535 × 803 × 3 pixels and an average size of 521 × 572 × 3 pixels. As shown in Table 5, the S and C components have the highest and lowest computational run-times, respectively. However, since the average run-time of each of the four components is low, the run-time of the proposed SNCF index, which is roughly the sum of the four components' run-times, is also low. This further indicates that, while some variants of the proposed metric such as SNC yield comparable results, it is still worthwhile to include all four components, as SNCF is superior to all of its variants in Table 4 and to the compared state-of-the-art indices.

CONCLUSIONS
This paper proposed a tone-mapped IQA index based on the FE principle, image naturalness, colourfulness and structural fidelity. Extensive experiments on three publicly available databases (i.e. Yeganeh&Wang, MMSPG and TMQID) illustrate the effectiveness of the proposed metric in terms of SRCC and KRCC accuracy and AUC-DS and AUC-BW analyses when compared to eight recently proposed state-of-the-art metrics. The comparison also clearly demonstrates that the compared metrics behave differently on the three databases, while the proposed metric consistently yields better performance. The contributions are as follows: (1) computing the structural fidelity of the image at different scales to produce a multi-scale structural fidelity measure that is weighted based on psychophysical experiment results; (2) utilizing statistical naturalness and colourfulness indices that are further scaled and combined to produce the colour quality measure; (3) effectively employing the FE principle to produce a multi-scale psychovisual quality metric that models the perception and understanding of an image by the brain; and (4) seamlessly fusing these four components, which improves the impact of each component and leads to the robust SNCF index.