Graph-based saliency and ensembles of convolutional neural networks for glaucoma detection

Glaucoma is the second leading cause of vision loss worldwide, after cataracts. An ophthalmologist may use various tools and methods to diagnose a glaucomatous eye. Computer-aided methods involving deep convolutional neural networks have also recently made it possible to detect glaucoma on fundus images. Previous studies traditionally trained a single convolutional neural network for automatic glaucoma detection. In this study, a more accurate approach to automated glaucoma recognition is proposed. First, graph-based saliency region detection is used to crop the optic disc and remove the redundant parts of the fundus images. Then, four methods are used to ensemble convolutional neural network models comprising up to three deep learning architectures for glaucoma classification. The detection performance of this model is better than that of a recent study that used the same dataset. It is also as good as, or better than, the results reported by other recent research in the literature on glaucoma detection.

et al. [12] used the wavelet features of the segmented optic disc from fundus images to diagnose glaucoma. Odstrcilik et al. [13] used retinal nerve fiber layer texture analysis on fundus images. Joshi et al. [14] exploited saliency to detect peripapillary indicators related to glaucoma in retinal images.
The second major field of focus for glaucoma research is deep learning techniques. Shankaranarayana et al. [15], for example, used fully convolutional networks and adversarial training for the joint segmentation of the optic cup and disc. Likewise, Zilly et al. [16] used entropy sampling and ensemble learning to segment the optic cup and disc. Panda et al. [17] used patch classification based on deep convolutional neural networks to detect retinal nerve fiber layer defects in early glaucoma. Chen et al. [6] detected glaucoma using convolutional neural networks. Li et al. [18] used deep convolutional networks, along with holistic and local features, for glaucoma classification. Li et al. [9] used a deep learning algorithm to detect referable glaucoma on fundus images. Fu et al. [8] integrated the deep hierarchical context and the optic disc region of fundus images to detect glaucoma using a novel disc-aware ensemble network. Cheng et al. [19] proposed methods based on superpixel classification to segment the optic cup and disc and detect glaucoma. Shibata et al. [20] developed a deep residual learning algorithm to detect glaucoma on fundus images and compared the results to those of ophthalmologists. Within deep learning, saliency maps have likewise been used to emphasise the pixels that drive the glaucoma classification result [21,22]. Hemelings et al. [21] used deep learning for automated glaucoma detection and saliency maps for expert glaucoma analysis. In [22], Kucur distinguished healthy and early glaucomatous visual fields with the help of convolutional neural networks and saliency maps.
Note that the studies mentioned earlier focused on using a single CNN model to detect glaucoma on fundus images. This paper proposes using graph-based saliency region detection and an ensemble of different CNN models to improve on their results. The novelty is that the proposed architecture uses a saliency map to guide the different CNN models to the region of interest for early and advanced glaucoma detection. Using several CNN models, instead of one, makes this detection more robust and accurate.
The main differences between this study and the earlier studies are:
- We first find the graph-based saliency regions of the fundus images. Hence, we remove the redundant image regions and the CNN models can focus on the crucial regions of interest for better glaucoma detection.
- Earlier studies use a single CNN model. This research uses a combination of three parallel CNN models, fused by four different methods, for early and advanced glaucoma detection.
The main novelties of this work are:
- We propose a new graph-based saliency region detection for early and advanced glaucoma classification. We automatically detect the optic disc and remove the redundant areas of the fundus images for CNN model supervision.
- We propose a new ensemble CNN model for better early and advanced glaucoma classification. This new model comprises three parallel deep learning models obtained through an extensive experimental search.
- We optimise the model arguments, fuse the output probabilities using four different methods and find the best CNN combination that gives the most accurate glaucoma detection.
The organisation of this paper is as follows: Section 2 details the dataset used, the saliency and pre-processing techniques and the deep learning models of the parallel architecture. Then, Section 2.5 explains the deep learning ensemble methods and Section 3 gives the details of the hardware and software used for implementation. Finally, Section 4 tabulates, analyses and discusses the graph-based saliency deep learning ensemble performance.

MATERIALS AND METHODS
We base the proposed unified convolutional neural network on detecting the glaucoma region and then feeding it into multiple convolutional neural networks that learn to predict glaucoma. The proposed method includes several steps, as shown in Figure 1. First, we calculate the saliency of the fundus image. Second, we use a threshold value to separate the more salient regions and create a region map of the fundus image. Third, we use the generated map to crop that region.
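The threshold-and-crop step described above can be sketched as follows. This is a minimal illustration, assuming a precomputed graph-based saliency map normalised to [0, 1]; the function name, the threshold value and the `margin` parameter are illustrative, not taken from the original implementation:

```python
import numpy as np

def crop_salient_region(image, saliency, thresh=0.5, margin=10):
    """Crop the image to the bounding box of the salient region.

    `saliency` is assumed to be a precomputed graph-based visual
    saliency map in [0, 1] with the same height/width as `image`.
    """
    mask = saliency >= thresh                      # binary region map
    ys, xs = np.nonzero(mask)
    if ys.size == 0:                               # nothing salient: keep image
        return image
    h, w = saliency.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    return image[y0:y1, x0:x1]

# Example: a synthetic 100x100 image with one highly salient patch
img = np.zeros((100, 100, 3), dtype=np.uint8)
sal = np.zeros((100, 100))
sal[40:60, 30:50] = 1.0                            # salient block
crop = crop_salient_region(img, sal, thresh=0.5, margin=5)
print(crop.shape)                                  # (30, 30, 3)
```

In the paper's pipeline, the saliency map itself would come from the graph-based visual saliency method of [25]; only the thresholding and cropping are shown here.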
To detect early and advanced glaucoma, we use an ensemble of three different deep convolutional neural network models. These models are AlexNet [23], ResNet-50 [24] and ResNet-152. The following sections give the details of the dataset processed by these models, the saliency and image processing techniques used during this processing and brief descriptions of each model.

Dataset
To train and test the deep learning ensembles, we used the image database of [5]. Table 1 lists this dataset's details. This public dataset has 1542 fundus images in three categories: no glaucoma, early glaucoma and advanced glaucoma. Out of the 1542 fundus images, we use 1078 for training and 464 for testing. For each category, the number of training images is 550, 202 and 326, respectively, and the number of testing images is 236, 87 and 141, respectively. Note that we also resized the images to 256×256×3 so that they all have the same dimensions.
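The fixed-size resizing mentioned above can be sketched with simple nearest-neighbour sampling. The actual pipeline likely used a library routine (e.g. OpenCV's resize); this function is only an illustrative stand-in:

```python
import numpy as np

def resize_nearest(img, size=(256, 256)):
    """Resize an image to a fixed size with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row for each target row
    cols = np.arange(size[1]) * w // size[1]   # source column for each target column
    return img[rows][:, cols]

# Example: a fundus image of arbitrary size becomes 256x256x3
img = np.zeros((605, 700, 3), dtype=np.uint8)
print(resize_nearest(img).shape)  # (256, 256, 3)
```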

Detection with graph-based visual saliency
We achieve glaucoma region detection by using graph-based visual saliency [25]. This way, we aim to imitate the amount of visual and cognitive attention a human being would pay to the glaucomatous region. The goal is to detect the region of interest, in this case the optic disc of the eye [26]. By ranking the relative importance of the image's visual contents, the amount of processing is reduced, improving performance.
We distinguish the salient region by changes in the intensity, contrast or pattern of the image. Here, we use static saliency, carrying out a single-frame detection of the optic disc and removing the redundant regions of the images, as shown in Figure 2.

Data augmentation
We perform data augmentation because the number of fundus images in the dataset is not sufficient for training the deep learning models. Table 2 shows the number of images in the augmented dataset.
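The paper does not specify which augmentation operations were applied; a typical set for fundus images is flips and 90-degree rotations, sketched below. The `augment` function and the choice of variants are assumptions, not the authors' exact procedure:

```python
import numpy as np

def augment(image):
    """Generate simple augmented variants of a fundus image:
    the original, horizontal/vertical flips and 90-degree rotations."""
    return [
        image,
        np.fliplr(image),          # horizontal flip
        np.flipud(image),          # vertical flip
        np.rot90(image, 1),        # 90-degree rotation
        np.rot90(image, 2),        # 180-degree rotation
        np.rot90(image, 3),        # 270-degree rotation
    ]

img = np.zeros((256, 256, 3), dtype=np.uint8)
aug = augment(img)
print(len(aug))  # 6 variants per input image
```

Each input image thus yields six training samples; square inputs keep their 256×256×3 shape under every variant.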

Prediction using single CNN
We used the AlexNet, ResNet-50 and ResNet-152 CNN models for early and advanced glaucoma detection. After an extensive experimental search, we found that these models give the best single CNN early and advanced glaucoma detection performance. Table 3 gives the details of the generated models for the proposed fundus image based detection. The models use fine-tuned 8-layer, 50-layer and 152-layer CNN architectures, respectively.
The following sections summarise the deep convolutional neural networks used in this research.

AlexNet architecture
Krizhevsky et al. [23] proposed AlexNet, which won the ImageNet Large-Scale Visual Recognition Challenge in 2012. There are eight layers in this architecture, the first five being convolutional and the last three fully connected. The first convolutional layer convolves a 224×224×3 input image using 96 kernels of size 11×11×3. The second convolutional layer convolves the output of the first layer using 256 kernels of size 5×5×48. Likewise, the third convolutional layer convolves the output of the second layer using 384 kernels of size 3×3×256. The fourth convolutional layer applies 384 kernels of size 3×3×192 to the output of the third layer. The fifth convolutional layer convolves the output of the fourth layer using 256 kernels of size 3×3×192. Finally, the first two fully connected layers have 4096 neurons each.

ResNet architecture
He et al. [24] proposed the residual network (ResNet) in 2015. Similar to AlexNet, this architecture won the ImageNet Large-Scale Visual Recognition Challenge in its year. Two of the deep learning models used in this work are fine-tuned 50-layer and 152-layer ResNet (ResNet-50 and ResNet-152) network architectures, with 50 and 152 layers, respectively.

Deep learning ensembles
We base the proposed method on fusing three saliency-based CNN models in parallel. First, we represent fundus images as saliency maps. Then, we use the three CNN models to recognise glaucomatous fundus images. Finally, we fuse the predictions of the CNN models using a probabilistic model to detect glaucoma [27].
The following sections overview the four methods used to fuse the CNN model predictions.

Sum of the probabilities (SP)
A single CNN model provides probability values for healthy and glaucomatous fundus images $x$. Decision fusion allows combining these output probabilities. We denote these probabilities by $p_i$, where $i = 1, \ldots, n$ and $n$ is the number of output probabilities of each single model. Here, $n = 2$, as there are two classes: healthy and glaucoma. Note that $p_i \in [0, 1]$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} p_i = 1$. The ensemble of CNNs likewise provides output probabilities, denoted by $p_i^j$, where $i = 1, \ldots, n$ and $j = 1, \ldots, m$. The value of $m$ is the number of CNN models. In this study, we use three CNN models, so $m = 3$.
We fuse the generated probabilities $p_i^j$ of the CNN models by the sum of probabilities (SP) method to give $p_i$ as

$$p_i = \sum_{j=1}^{m} p_i^j, \qquad (1)$$

where $p_i^j$ is the $i$th class probability of model $\mathrm{CNN}_j(x)$. Notice that we normalise $p_i$ so that it is between 0 and 1 and its sum is equal to 1 for $i = 1, \ldots, n$.

Product of the probabilities (PP)
Besides, we can derive a combined probability from the CNN models using the product of each output as

$$p_i = \prod_{j=1}^{m} p_i^j, \qquad (2)$$

where $p_i^j$ is the $i$th class probability of model $\mathrm{CNN}_j(x)$. Once again, each $p_i$ is normalised to make sure its value is between 0 and 1 and it sums to 1 for $i = 1, \ldots, n$.

Majority voting
Similarly, we use majority voting (MV) to determine the image class based on the decision of the majority of the models. Here, $p_i$ is given by

$$p_i = \frac{1}{m} \sum_{j=1}^{m} d_i^j, \qquad (3)$$

where we normalise Equation (3) using $m$ and

$$d_i^j = \begin{cases} 1, & \text{if } i = \arg\max_{k} p_k^j, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

Sum of the maximal probabilities
Finally, we use the sum of the maximal probabilities (SMP) of each model to determine the image class. Here, $p_i$ is determined using

$$p_i = \sum_{j=1}^{m} q_i^j, \qquad (5)$$

where

$$q_i^j = \begin{cases} p_i^j, & \text{if } p_i^j = \max_{k} p_k^j, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Again, we normalise $p_i$ so its value is between 0 and 1.
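The four fusion rules above (SP, PP, MV, SMP) can be sketched together as follows. This is a minimal illustration assuming each model's probability row sums to 1; the `fuse` function and its interface are hypothetical, not the authors' code:

```python
import numpy as np

def fuse(probs, method="SP"):
    """Fuse per-model class probabilities.

    `probs` has shape (m, n): m models, n classes; each row sums to 1.
    Returns fused probabilities normalised to sum to 1.
    """
    m, n = probs.shape
    if method == "SP":                      # sum of the probabilities
        p = probs.sum(axis=0)
    elif method == "PP":                    # product of the probabilities
        p = probs.prod(axis=0)
    elif method == "MV":                    # majority voting
        votes = probs.argmax(axis=1)        # winning class per model
        p = np.bincount(votes, minlength=n).astype(float)
    elif method == "SMP":                   # sum of the maximal probabilities
        p = np.zeros(n)
        for j in range(m):
            i = probs[j].argmax()
            p[i] += probs[j, i]             # add only each model's top score
    return p / p.sum()                      # normalise: in [0, 1], sums to 1

# Three models (m = 3), two classes (n = 2): healthy vs. glaucoma
probs = np.array([[0.7, 0.3],
                  [0.6, 0.4],
                  [0.2, 0.8]])
print(fuse(probs, "SP"))   # [0.5 0.5]
print(fuse(probs, "MV"))   # ~[0.667 0.333]
```

Note that after normalisation, summing and averaging the per-model probabilities give the same SP result, so the post-normalisation outputs match Equations (1)–(6) regardless of any constant factor.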

IMPLEMENTATION
We carry out the proposed methods on a desktop computer with 25 GB of memory and an Intel Core i7-4790 3.6 GHz CPU. We perform the data augmentation in the C++ programming language using the OpenCV library. We train the AlexNet, ResNet-50 and ResNet-152 architectures on an NVIDIA GeForce GTX 1080 Ti GPU running the Caffe deep learning framework. The model training took around 20 min.

PERFORMANCE EVALUATION
In this section, we report, analyse and discuss the results of the different experiments we have conducted. There are three sets of experiments. The first set uses a single CNN, the second a single CNN along with graph-based visual saliency, and the third CNN ensembles with graph-based visual saliency.

Saliency and single CNN
We next conduct experiments by adding graph-based visual saliency to the single CNN experiments of Section 4.1.1. Table 5 shows the results, with the AUC and accuracy now equal to or higher than those of Table 4. The model with the ResNet-50 architecture gives the highest AUC and the one with the AlexNet architecture the highest accuracy, again shown in bold. The relatively better performance of the ResNet-50 and AlexNet architectures can also be seen in the ROC curves of Figure 3b. We further observe from Figure 3a,b that, with a single CNN, the AlexNet and ResNet-50 architectures distinguish glaucomatous fundus images from healthy ones better than the ResNet-152 architecture, in the presence or absence of saliency.
ResNet-50, with graph-based saliency added, is overall the best model of these experiments (see Figure 3a,b). Clearly, the use of graph-based saliency improved the results of Table 4. Figure 3a,b also shows that when we add saliency to the glaucoma detection, the AlexNet and ResNet-50 architectures distinguish glaucomatous fundus images from healthy ones better. This is true for the ResNet-152 architecture as well.

Saliency and ensemble of CNNs
The third set of experiments aims to improve the results reported in Tables 4 and 5. We now apply graph-based visual saliency and then ensemble the single CNN architectures of Section 4.1.1 using the methods of Section 2.5. First, we combine the single CNN architectures two at a time. Then, we combine all three architectures together.
The results in bold of Table 6 show combining AlexNet and ResNet-50 architectures with the SMP method results in the highest AUC. They likewise show that joining the same architectures with the MV method produces the highest accuracy. Figure 3c confirms these results.
The results of the same table indicate that combining two and three CNNs with the SMP method gives the best overall results in terms of both AUC and accuracy (also shown in Figure 3c,d). They also show that adding the ResNet-152 architecture to the ensemble did not necessarily improve the AUC and accuracy values.

Discussions
The work in [5] uses a single CNN model and maps the same dataset of fundus images to healthy or glaucomatous classes. As reproduced in Table 7, the performance results of this mapping are inferior to those of the graph-based saliency and ensemble of CNN architectures proposed here. Specifically, using graph-based visual saliency and fusing the architectures two (AlexNet+ResNet-50) and three (AlexNet+ResNet-50+ResNet-152) at a time with the SMP method results in higher AUC and accuracy than those reported in [5].
To the best of our knowledge, [30] is the only other work at this point that used Ahn et al. [5]'s dataset for research. However, that work involves transfer learning, where the authors use different training and testing datasets for glaucoma detection. Hence, we next compare our work with other glaucoma research where the training and testing datasets are the same. Even though researchers have used different methods for glaucoma detection (e.g. [31-33]), here we focus on studies that use CNN architectures. Table 8 compares the results of such research with the results reported here, in terms of the number of training and testing images, AUC, accuracy, sensitivity and specificity values. Except for two, every author reported only AUC performance values. We see from this table that the AUC values reported by the authors, except for two, are inferior to the results reported in this work. Those two, Shibata et al. [20] and Li et al. [9], reported better AUC values, but the number of training images they used is about 3 times and 30 times, respectively, more than what we used here. The higher number of training images has been to their advantage, improving their results.
We can compare the accuracy, sensitivity and specificity values of this work only to those of Raghavendra et al. [29] (note that Christopher et al. [28] only reported sensitivity and specificity values, and they are lower than those of our models). This comparison shows that these three performance values of this work are worse than those reported by Raghavendra et al. [29]. Since the number of training and testing images is about the same, we believe the reason behind the performance difference lies elsewhere. The dataset of [5] we used is more challenging than that of Raghavendra et al. [29] because it has three categories (normal, early glaucoma, advanced glaucoma) as opposed to two (normal, glaucoma). It is harder for our model to distinguish between fundus images with no glaucoma and early glaucoma. Hence, our model produced results that are worse than those reported by [29].
Overall, Table 8 proves that the best performance results reported here are as good as, or better than, the recent studies that used CNNs for glaucoma detection on fundus images.

CONCLUSIONS
This paper proposes a glaucoma detection model for fundus images that uses graph-based saliency and ensembles of convolutional neural networks. Using graph-based saliency, we first detect the optic disc on the fundus images. We then feed this to an ensemble of three powerful CNN architectures whose outputs we fuse using four different methods to recognise glaucoma.
The results show that our model outperforms a similar work in the literature that uses the same dataset. They further show that our results are as good as, or better than, those reported by recent research on glaucoma detection.