Selective focus saliency model driven by object class-awareness

Many current salient object detection (SOD) models focus only on highlighting visually conspicuous regions but fail to perform saliency detection for specific targets. In this paper, a selective focus saliency model driven by object class-awareness (SF-OCA) is proposed. The framework consists of a visual saliency detection flow, a segmentation-classification flow, and a class-awareness selection module. It combines bottom-up visual perception with a top-down task-driven manner, and is thus capable of detecting salient targets of a specific category while eliminating interference from other salient areas, providing a new idea for saliency detection. Experimental results show that the method achieves performance comparable to state-of-the-art models on four public saliency datasets. In addition, a new dataset was built to test the proposed framework on the selective focus saliency detection task. Compared with other SOD methods, the proposed method not only highlights visually salient regions but can also choose the more important or more noteworthy targets in a class-aware manner. The method also shows better robustness under a variety of conditions, including multiple targets, small targets and complex backgrounds.


INTRODUCTION
How do we pick out regions of interest in a complex visual scene and ignore uninteresting areas? Given multiple foreground objects, we humans often choose to focus on the targets of interest and overlook the others based on our knowledge or motivation. Figure 1 shows an example that simultaneously contains two foreground objects, a zebra and an elephant. Depending on the motivation, for example 'find the foreground salient target of interest (e.g. the zebra or the elephant in Figure 1) in the image', the detection model should confine the saliency areas to the target of interest (e.g. the zebra) and ignore the unrelated one (e.g. the elephant). In this paper we propose to add a subjective category-selection capability to a saliency detection framework from the perspective of visual saliency perception. We aim not only to perceive salient areas from a visual point of view, but also to add a subjective willingness to make a choice among multiple salient objects. Salient object detection aims to imitate the human visual system (HVS) to localise and segment the most conspicuous regions, acting as an important pre-processing step in computer vision tasks such as visual tracking [1,2], video compression [3,4] and image segmentation [5,6]. Generally, SOD methods can be divided into two classes: methods based on bottom-up features and those based on deep features. Owing to the rapid development of convolutional neural networks (CNNs), CNN-based methods such as [7-13] have already made remarkable progress in visual saliency detection.

(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.)
However, challenges remain: (1) existing models are unable to choose salient targets of the categories of interest; (2) most saliency detection methods suffer from heterogeneous salient-target interiors and loss of boundary detail. To address these issues, in this paper we present a framework named SF-OCA that unites bottom-up visual saliency with a top-down task-driven manner to run selective focus saliency detection. Figure 2 shows a comparison between visual-saliency-only detection and our selective focus saliency detection. Figure 1 illustrates that, compared with the top-down attention method [14], our method not only locates but also depicts the targets with a more homogeneous target interior and a clearer border.
Recent models such as [15,16] aim to improve saliency segmentation performance by integrating multi-scale features. Motivated by these observations, we propose a selective focus saliency model driven by object class-awareness, which simultaneously leverages two tasks: visual saliency detection (VSNet) and object segmentation and classification (SCNet). Additionally, we design a class-awareness selection module (CASM) that fuses multiple features to output the class-aware saliency map. Benefiting from the advancement of deep CNNs such as VGG [17] and ResNet [18], extracting high-level features to train a valid CNN-based saliency detection model is popular and practical. The proposed VSNet, which follows a simple encoder-decoder manner, generates the visual saliency map by extracting convolutional features and upsampling the feature map to the same size as the input. The proposed SCNet shares common convolutional layers with the VSNet but runs object segmentation and classification. Furthermore, the designed CASM unites the VSNet and SCNet for the final selective focus saliency detection. When multiple salient objects coexist, the proposed method is able to focus on the salient targets of interest and eliminate unrelated ones.
Our main contributions are summarised as follows:

• We propose a selective focus saliency model, which combines bottom-up visual perception with a top-down task-driven manner to selectively highlight specific saliency areas in accordance with the detection task. The proposed method utilises multiple cues, including visual saliency, region segmentation and classification, which goes beyond the limitation of using visual information only.

• We design a universal saliency detection framework in which the choice of CNN baseline is not critical. The layout of the proposed framework comprehensively takes account of saliency detection, task-driven selection and boundary maintenance, and is able not only to highlight class-aware salient targets but also to retain homogeneous interiors and distinct boundaries.

• The proposed method executes multiple tasks simultaneously. Saliency detection and a segmentation module are designed together in the proposed framework, so our method can run selective focus saliency detection with no extra post-processing.
Extensive experiments on four universal saliency datasets, including ECSSD [19], DUTS-TE [20], DUT-OMRON [21] and HKU-IS [22], validate the effectiveness of the proposed framework. To run selective focus detection, we also built an image dataset named MCSD to test our SF-OCA. The proposed dataset simultaneously contains saliency binary masks, object segmentation labels and corresponding category labels; the details are introduced in Section 4.3.

RELATED WORK
Due to their strong generalisation capacity, CNN-based approaches are popular for achieving high-performance saliency detection. However, most methods focus on purely visual features [10,22], leading to ambiguous boundaries and inhomogeneous salient-object interiors. On the other hand, exploring which target is more significant with respect to the task is a meaningful and unexplored direction in saliency detection research. We introduce the related work on the following two topics.

Salient object detection
In former years, traditional SOD methods such as [23,24] utilised low-level image features such as colour, intensity and shape to detect saliency areas, but these methods are generally unable to adapt to complex scenes. With the rapid development of deep learning techniques, SOD models based on deep neural networks have substantially improved detection performance. For instance, SOD methods built on CNNs have been explored in [10,15,16,25-31]. These methods developed various networks using CNNs as baselines and achieved satisfying progress. According to their architectures, these methods are designed with either a single-stream network [8,32-36] or a multi-stream network [37-40]. The former is a standard architecture consisting of a sequential cascade of convolution layers, pooling layers and non-linear activation operations, while the latter has multiple network streams for explicitly learning multi-scale saliency features from inputs of varied resolutions or with different structures. Some multi-stream networks handle different tasks in separate pathways; the outputs from different streams are combined to form the final prediction [41]. In this paper, we propose to run visual saliency detection using a multi-stream architecture. However, many CNN-based methods such as [22,25,42,43] are generally easily influenced by the background and lose target boundary details. To address these problems, Li et al. [29] proposed to combine saliency detection and semantic segmentation by sharing common convolutional layers, where the segmentation results are used to enhance the saliency map. To achieve accurate saliency-region boundaries, Qin et al. [27] added salient-edge information to the saliency network to refine the visual saliency map, Feng et al. [15] used multi-scale convolutional features including boundary-aware information, and [44] fused multi-scale outputs from the network to generate accurate boundaries. Wu et al.
[16] designed an intertwined multi-supervision form to merge saliency detection, edge detection and foreground contour detection. Moreover, the atrous spatial pyramid pooling module (ASPP) [45] and the pyramid pooling module (PPM) [46] have been used to extract multi-scale context-aware features and enhance single-layer representations. Recently, Pang et al. [47] aimed to highlight the foreground/background difference and preserve intra-class consistency to obtain more efficient multi-scale features. Although these up-to-date approaches achieve remarkable saliency detection performance, they cannot run saliency detection for a specific target category in accordance with the detection task.

Task-driven saliency detection
Most SOD methods are object-level methods, that is, designed to detect pixels that belong to the salient objects without being aware of the individual instances [41]. Based on our motivation, we aim to realise saliency detection while being aware of the individual instances. Recent task-driven saliency detection research reveals human subjective focus areas, for example eye-fixation models [48-50], which aim to predict the locations of human eye fixations. Cao et al. [51] presented a feedback CNN to uncover the feedback mechanism of deep neural networks and to capture visual attention on salient objects. From another point of view, distributing finite attention resources to regions of interest is also a way to run task-driven saliency detection. Zhang et al. [14] proposed an excitation backprop scheme to pass top-down signals downwards through the network hierarchy; by weighting the connections between high-level and low-level neurons, neural attention in CNNs can be focused on specific targets. Fernando et al. [52] proposed a task-driven visual saliency method by jointly learning contextual semantic information and the relationships among different tasks in a generative adversarial manner. However, the methods above can localise specific salient targets but generally produce ambiguous target boundaries. Some recent works have started to rethink the relations among multiple salient objects. Wang et al. [49] infer object saliency using a fixation prior, which imitates human visual attention mechanisms and allows the suggested model to explicitly segment out the most visually important objects in an interpretable manner. Umeki et al. [53] suppress inaccurate saliency and estimate the importance degree for multi-object images.
Going beyond the above-mentioned approaches, we propose a method that localises class-aware salient targets while preserving homogeneous saliency areas and clear target boundaries, as described in Section 3.

Architecture overview
In this paper, we propose a framework named SF-OCA. As seen in Figure 3, the SF-OCA mainly consists of three modules, that is, VSNet, SCNet and CASM. The proposed framework is capable of detecting class-aware salient targets with homogeneous interior and distinct boundary. The following subsections start from the visual saliency detection network in Section 3.2, then the segmentation and classification network in Section 3.3, and the class-awareness selection module in Section 3.4.

Visual saliency detection network
The essential concept of the proposed SF-OCA is to jointly utilise visual saliency detection, object segmentation and classification for selective focus saliency detection. To underline the significance of the object segmentation and classification module in promoting visual saliency performance, we design the VSNet following a simple encoder-decoder form. Each encoder in the encoder part performs convolution with a filter bank to produce a set of feature maps, which are then batch normalised, after which a ReLU (max(0, x)) is applied. Following that, max-pooling with a 2×2 window and stride 2 is performed and the resulting output is sub-sampled by a factor of 2. In the encoder part, we do not pay much attention to boundary information, which we resolve later in our SCNet. Each decoder in the decoder network upsamples its input feature map using the memorised max-pooling indices from the corresponding encoder feature map. A more complicated visual saliency detection network could of course improve the performance; however, that is not a major concern in this paper. The purpose of the VSNet is to locate the visual saliency area, which is denoted as S, and to lay the basis for the following segmentation and optimisation. The encoder part can be one of various feedforward convolutional networks, such as a ResNet [18] based or VGG [17] based saliency detection network. Given its state-of-the-art performance, we take ResNet-50 [18] as an example to extract deep saliency features. In detail, the encoder part includes five convolution blocks as introduced in [18] and [54], namely Res-1 to Res-5. The pooling layers are placed for parameter reduction. Correspondingly, in the decoder part we apply upsampling with stride 2 and deconvolution layers to output the visual saliency map, which has the same size as the input, and we use ReLU in the hidden layers.
For training, we employ an ImageNet pretrained model and use a saliency dataset to train the VSNet. The loss for VSNet is the binary cross-entropy [55]:

L_VS = −Σ_(r,c) [G(r, c) log S(r, c) + (1 − G(r, c)) log(1 − S(r, c))],

where G(r, c) ∈ {0, 1} represents the ground truth label of the pixel (r, c) and S(r, c) is the predicted probability of being a salient object.
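As a concrete illustration of the VSNet objective, the following is a minimal numpy sketch of the per-pixel binary cross-entropy above (the function name and the clipping epsilon are our own choices, not from the paper):

```python
import numpy as np

def vsnet_bce_loss(S, G, eps=1e-7):
    """Binary cross-entropy between a predicted saliency map S (values in
    (0, 1)) and a binary ground-truth mask G, averaged over all pixels."""
    S = np.clip(S, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(G * np.log(S) + (1.0 - G) * np.log(1.0 - S)))

G = np.array([[1.0, 0.0], [0.0, 1.0]])
# A confident, correct prediction yields a small loss.
print(vsnet_bce_loss(np.array([[0.99, 0.01], [0.01, 0.99]]), G))
```

In practice the same quantity would be computed on GPU by the framework's built-in BCE loss; this sketch only makes the equation concrete.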

Segmentation and classification network
The VSNet design is able to output all visual saliency targets but cannot achieve class-aware saliency detection, and it suffers from the boundary challenge. We therefore propose the SCNet, which shares common convolutional layers with the VSNet, to help obtain saliency regions with distinct boundaries and homogeneous interiors. The SCNet consists of two branches, which simultaneously run object segmentation and classification. For convenience, we denote the target segmentation result and category label as M and L, respectively. In addition, a Canny [56] edge detector is introduced to extract the contour from the object segmentation result, and the contour output is denoted as C. In detail, after Res-5 the feature maps are fed into a region proposal network (RPN) [57], which generates candidate regions of interest (RoIs). We denote the original proposal as P, the predicted ground truth as Ĝ and the labelled ground truth as G. For each proposal, we use RoIAlign [54] to extract a small feature map and align the extracted features with the input; RoIAlign uses bilinear interpolation to extract the feature map from each RoI, so the feature map has floating-point coordinates after pooling. Next we present a two-branch design to obtain the object segmentation results and category labels, respectively. On the one hand, as seen in Figure 4, taking a 224×224×3 input as an example, we upsample the RPN's output by 2×2 with stride 2 to enlarge the scale of the feature map, which helps promote the segmentation resolution. After upsampling, the map is fed into a fully convolutional network to obtain the segmentation map, on which the edge extractor is employed to extract the foreground contour. On the other hand, we employ fully connected layers on the RPN's output and then use a softmax function for classification to acquire the object category labels.
Note that ReLU is used in the hidden layers. In our implementation, for each testing image, SCNet generates N proposals and runs box prediction on each proposal but preserves only the highest-scoring N₀ proposals, which are used to run object segmentation. In particular, for each RoI the segmentation flow predicts K masks (where K is the number of object classes); we follow the instructions in [54] and only preserve the k-th mask for the class k predicted by the classification branch.
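The proposal-filtering and per-class mask selection just described can be sketched in a few lines of numpy (function and variable names are ours; `n_keep` plays the role of N₀):

```python
import numpy as np

def select_class_masks(masks, scores, classes, n_keep):
    """Keep the n_keep highest-scoring proposals and, for each kept RoI,
    keep only the mask of its predicted class.
    masks:   [N, K, h, h] per-class masks for N proposals
    scores:  [N] box scores, classes: [N] predicted class indices."""
    keep = np.argsort(scores)[::-1][:n_keep]  # top-N0 proposal indices
    return np.stack([masks[i, classes[i]] for i in keep]), keep

# Toy example: 4 proposals, 3 classes, 2x2 masks.
N, K, h = 4, 3, 2
masks = np.arange(N * K * h * h, dtype=float).reshape(N, K, h, h)
scores = np.array([0.2, 0.9, 0.5, 0.1])
classes = np.array([0, 2, 1, 0])
sel, keep = select_class_masks(masks, scores, classes, n_keep=2)
print(keep)  # indices of the two highest-scoring proposals
```

This mirrors the Mask R-CNN convention of decoupling mask prediction from classification: K masks are predicted per RoI, but only the one matching the classification branch's output survives.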
For training the SCNet, we follow the suggestions in [54]. The loss for SCNet includes three parts: L_box for the bounding box, L_cls for classification and L_mask for the mask,

L_SCNet = L_box + L_cls + L_mask.

Specifically, since a mask must be generated for each object, L_mask adopts an average binary cross-entropy loss. For an input image containing K categories, the mask branch generates K binary masks of size h × h; for an RoI whose ground truth class is k, only the k-th mask contributes to its L_mask. L_cls is a cross-entropy loss and L_box is a regression loss following the suggestion in [58],

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*),

where the ground truth label p_i* is 1 if the anchor is positive and 0 otherwise, p_i is the predicted value, t_i is the coordinate vector of the predicted bounding box, and t_i* is that of the ground truth box.
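The regression loss L_reg referenced in [58] is conventionally the smooth L1 loss, quadratic for small residuals and linear for large ones. A minimal numpy sketch under that assumption (function name ours):

```python
import numpy as np

def smooth_l1(t, t_star):
    """Smooth L1 regression loss from Fast R-CNN: 0.5*x^2 when |x| < 1,
    |x| - 0.5 otherwise, summed over box coordinates."""
    d = np.abs(np.asarray(t, dtype=float) - np.asarray(t_star, dtype=float))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

print(smooth_l1([0.5, 2.0], [0.0, 0.0]))  # 0.125 + 1.5 = 1.625
```

The quadratic region keeps gradients small near the target, while the linear region makes the loss less sensitive to outlier boxes than plain L2.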

Class-awareness selection module
As introduced in Sections 3.2 and 3.3, we obtain the visual saliency map S, object masks M_i and contours C_i, as well as object category labels L_i, i ∈ {1, 2, …, K}. As shown in Figure 5, we design a class-awareness selection module to generate the class-aware saliency map. Firstly, on the basis of the detection task, we use L as the instruction to localise the corresponding targets. Particularly, we take the target category label L_T as the criterion to pick out the target M_T. Assuming an object instance belongs to an RoI, L_T is defined as:

L_T = L_i, if L_i matches the task-specified category;
      None, otherwise.
Thus we obtain the target category label for the class-aware saliency map. We use L_T to sort out the corresponding M_T, which is a binary mask map, and C_T is obtained by applying the Canny [56] detector to M_T. Moreover, we apply C_T to S to pick out the task-relevant saliency region, which is denoted as S_T:

S_T = S ∩ A_(C_T),

where A_(C_T) represents the non-zero area in M_T limited by C_T, and ∩ indicates region intersection. In other words, we only preserve the saliency area in S that also belongs to the mask region in M_T limited by C_T. Pixel values in S_T are the same as those defined in S. Ultimately, M_T and S_T are merged with weight factors to form S_final. Compared with S, S_final achieves a more homogeneous interior and a more distinct boundary.
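One plausible reading of the CASM fusion step is sketched below in numpy: the saliency map is restricted to the selected target's mask and then linearly blended with that mask. The function name and the default weights are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

def casm_fuse(S, M_T, w1=0.5, w2=0.5):
    """Restrict the visual saliency map S to the selected binary target
    mask M_T, then blend mask and masked saliency with weight factors.
    w1/w2 are placeholder weights; the paper tunes its own factors."""
    S_T = S * M_T                    # keep saliency only inside the mask
    S_final = w1 * M_T + w2 * S_T    # weighted merge of mask and saliency
    return np.clip(S_final, 0.0, 1.0)

S = np.array([[0.8, 0.6], [0.4, 0.2]])
M_T = np.array([[1.0, 0.0], [1.0, 0.0]])
print(casm_fuse(S, M_T))
```

The mask term pushes the interior toward a homogeneous high value, while the masked-saliency term retains the original saliency ordering inside the target, matching the paper's stated goal of a homogeneous interior with a distinct boundary.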
Here, it is important to clarify the discrepancy between object segmentation and saliency detection. The former aims to classify every single pixel into sets, each set representing an object. The latter outputs a saliency map whose pixel values vary from 0 to 255 (for an 8-bit map), with higher values indicating higher saliency. In the proposed method, we use the object segmentation results to enhance visual saliency detection so that the final detection better approximates the ground truth.

Training strategy
As introduced in Section 3, the proposed SF-OCA runs class-aware saliency detection. We train our framework as follows.
Step 1: For visual saliency detection, we import a saliency detection dataset to train the VSNet. The dataset should contain images and corresponding binary saliency masks.
Step 2: For the class-aware SOD implementation, we proceed to train the SCNet under the supervision of L_box, L_cls and L_mask, as described in Section 3.3. We use MS-COCO [59] to train the SCNet.
Implementation details are introduced in Section 4.2.

Parameter settings
The VSNet and SCNet share a common CNN baseline, and we use ResNet-50 [18] as the backbone.

Running environment
We implement our method in Python with the PyTorch toolbox. The training and testing environment is a machine with a 3.2 GHz CPU, 32 GB RAM and a GTX 1080Ti GPU.

Datasets and evaluation metrics
We carry out our experiments on four public saliency detection datasets: ECSSD [19], DUTS-TE [20], DUT-OMRON [21] and HKU-IS [22]. To test selective focus saliency detection, we also built the MCSD dataset, whose images are selected from MS-COCO [59]. Considering that each image contains one or multiple salient instances, we re-label each salient object, and annotate the object segmentation as well as the category ground truth with Wada's toolbox [60]. Thus the MCSD simultaneously contains three kinds of annotations: saliency binary masks, object segmentation masks and corresponding category labels. The details are introduced in Table 1.
For the purpose of visual saliency evaluation, we use four effective metrics for testing, including the precision-recall (PR) curve, F-measure, mean absolute error (MAE) and area under curve (AUC).

PR curve
The PR curve is a universally used metric for evaluating saliency performance. The saliency map is binarised at every threshold from 0 to 255 (for an 8-bit map) and the resulting binary maps are compared with the ground truth.
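The thresholding procedure can be sketched as follows in numpy (function name ours); each threshold yields one (precision, recall) point on the curve:

```python
import numpy as np

def pr_points(S, G, thresholds=range(0, 256)):
    """Binarise an 8-bit saliency map S at every threshold and compute
    (precision, recall) against the binary ground truth G."""
    pts = []
    for t in thresholds:
        B = (S > t).astype(float)            # binarised prediction
        tp = float((B * G).sum())            # true positive pixels
        precision = tp / B.sum() if B.sum() > 0 else 1.0
        recall = tp / G.sum()
        pts.append((precision, recall))
    return pts

S = np.array([[200, 50], [200, 50]])         # 8-bit saliency map
G = np.array([[1.0, 0.0], [1.0, 0.0]])       # binary ground truth
print(pr_points(S, G)[100])                  # point at threshold 100
```

Plotting all 256 points with recall on the x-axis gives the PR curve reported in the experiments.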

F-measure
It is an overall measurement computed as a weighted combination of precision and recall:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),

where we set β² = 0.3 to weigh precision more than recall.
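A direct transcription of the F-measure, with the conventional β² = 0.3 default (function name ours):

```python
def f_measure(precision, recall, beta2=0.3):
    """Weighted combination of precision and recall; beta^2 = 0.3 weighs
    precision more heavily, the SOD evaluation convention."""
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

print(f_measure(0.8, 0.5))
```

Because β² < 1, a model with high precision and moderate recall scores higher than one with the values swapped.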

MAE
It calculates the mean absolute error between the prediction and the ground truth pixel by pixel:

MAE = (1 / (M × N)) Σ_{x=1}^{M} Σ_{y=1}^{N} |S(x, y) − G(x, y)|,

where M and N indicate the size of the visual saliency map S, and S(x, y) and G(x, y) represent the prediction value and ground truth value at coordinate (x, y), respectively.
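The MAE formula reduces to a single numpy expression, assuming both maps are scaled to [0, 1] (function name ours):

```python
import numpy as np

def mae(S, G):
    """Mean absolute error between a saliency prediction S and ground
    truth G, both in [0, 1], averaged over all M x N pixels."""
    return float(np.mean(np.abs(np.asarray(S, float) - np.asarray(G, float))))

print(mae([[0.5, 0.0]], [[1.0, 0.0]]))
```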

AUC
It represents the area under the ROC curve, ranging from 0.5 to 1; a higher value means better accuracy. Considering that the proposed framework simultaneously performs visual saliency detection and class-aware segmentation, we additionally introduce a new metric, FAP, to evaluate both saliency accuracy and segmentation precision. The FAP combines F, the F-measure for visual saliency detection, and mask AP, the mask average precision averaged over IoU thresholds for segmentation.
In the proposed framework, we design the VSNet for visual saliency detection with a simple encoder-decoder architecture and no additional tricks, which is marked as Ours-baseline. We also replace the VSNet with two up-to-date visual saliency detection approaches [16,30], which are denoted as Ours-MLM and Ours-PFA. Figure 6 illustrates PR curves and F-measure scores; our method performs favourably against other methods on all four datasets. Table 2 shows various metrics across the four datasets; our method is highly comparable with the state-of-the-art over the four metrics. For instance, Ours-baseline achieves 0.8801 F-measure and 0.0488 MAE on the ECSSD [19] dataset, higher than [7,10], while Ours-MLM and Ours-PFA achieve state-of-the-art performance on DUT-OMRON [21] and the proposed MCSD.
However, the most significant advantages of the proposed SF-OCA are as follows. On the one hand, our method can run visual saliency detection. On the other hand, when multiple classes coexist, our method preserves task-relevant salient targets while eliminating the insignificant background and irrelevant targets in accordance with the detection task. In Figure 7, we present several visual illustrations comparing our baseline framework with other methods. (Table 2 reports F-measure, MAE and AUC of the proposed SF-OCA and the state-of-the-art methods on the four public datasets and the proposed MCSD; the best two scores are marked in bold and italic, respectively. Ours-baseline uses the VSNet for visual saliency detection; Ours-MLM replaces the VSNet with MLMSNet [16]; Ours-PFA replaces the VSNet with PFA [30].) For images with a simple background and legible texture, our method has little gap with the state-of-the-art. Nevertheless, under circumstances such as an eye-catching background or low contrast, many state-of-the-art visual saliency detection methods fail to locate and detect foreground salient objects, whereas our method can effectively detect the foreground saliency areas. Take the last three rows in Figure 7 as examples: the state-of-the-art models are easily affected by the striking background and give improper predictions (e.g. the background tower in the last row), but our method successfully outputs the foreground salient targets, which demonstrates its robustness.

Beyond plain visual saliency detection, our method can run class-aware saliency detection. Additionally, to detect saliency maps for specific targets, we test our model on the proposed MCSD. Compared with popular saliency datasets, the MCSD consists of 200 samples with multiple labels, including saliency binary masks, object segmentation annotations and category labels; the details are introduced in Section 4.3 and Table 1. The class-aware saliency detection results on the proposed dataset are shown in Figure 8 (zoom in for a better view). The output saliency maps are not only visually salient but also class-aware, that is, for different objects belonging to different categories, the proposed method is able to generate a saliency map for a chosen category with a homogeneous interior and a distinct boundary. For example, for the first image in the first row of Figure 8, there are two salient objects, a person and a sports ball; our model can detect and output the class-aware saliency map for each corresponding target.

Baselines
The VSNet in the proposed SF-OCA framework plays the role of providing the visual saliency areas, and different baselines of the VSNet achieve different performance on visual saliency detection and on the final class-aware results. The test dataset is the proposed MCSD, which supports visual saliency evaluation. Figure 9 illustrates PR curves and the correlative metrics, precision and recall. The curves in Figure 9(b) drop fast at high recall; to the best of our knowledge, this is because the CASM eliminates the irrelevant targets and areas, which lets the proposed framework focus on the task-relevant saliency target with a homogeneous interior. As seen in Figure 9(b), employing various baselines in the VSNet has little effect on promoting performance. We further evaluate the impact of the weight factors ω₁ and ω₂ on ECSSD [19] and the proposed MCSD. As seen in Table 4, diverse weights of ω₂ contribute differently to the final performance. Compared with employing the VSNet only, that is, ω₂ = 0, the testing indexes indicate that the proposed SCNet and CASM effectively improve the visual saliency performance. Nevertheless, simply increasing the weight ω₂ does not always benefit the final result, which shows that the segmentation work can enhance the saliency detection performance but there are still limits. For example, on the ECSSD dataset, when ω₂ = 0.5 the proposed method achieves the best F-measure and MAE, while on the proposed MCSD dataset the F score is highest when ω₂ = 0.7 but the best MAE occurs when ω₂ = 1.0. So for different datasets, the best results may be achieved with different parameter settings. In addition, the runtime analysis of each part of the proposed framework is exhibited in Table 5; note that the testing batch size is set to 1.

Segmentation performance
The proposed SCNet runs object segmentation and classification, and the segmentation result contributes to the final class-aware saliency detection. The results are shown in Table 6: the SCNet achieves 28.7 mask AP and 39.0 mask AP on the two testing datasets, respectively. As introduced in Section 4.3, we use FAP to evaluate the selective focus saliency detection result, which simultaneously takes into account visual saliency and class-aware segmentation. The results are shown in Table 7. Compared with our baseline model, that is, using the VSNet for coarse visual saliency detection, more robust visual saliency detection models such as MLMSNet [16] and PFA [30] lead to better performance due to their high-quality visual saliency output.

Tough circumstances
The proposed SF-OCA utilises multiple cues for the final robust prediction. Figure 10 displays test samples under several difficult circumstances. Compared with state-of-the-art saliency detection models [10,16,30], row (a) demonstrates that our method can effectively eliminate the interference of a complex background and generate a preferable saliency map. Rows (b) and (c) illustrate that our model can be applied to small objects and low-contrast scenes, respectively, while row (d) shows testing results in a multiple-object situation: for different target categories (e.g. the giraffe and elephant in the last row), our method successfully detects both kinds of salient targets. Moreover, when multiple objects adhere to each other, our method can segment and classify each salient target individually, offering the choice of outputting a class-aware saliency map with a distinct boundary. Figure 11 gives some failure examples of the proposed method: under challenging circumstances such as small or multiple targets coexisting in a large visual field, or a small target with strong interference, the proposed method may not be well qualified. These issues will be further explored in future work.

CONCLUSION
In this paper, we propose a novel framework named SF-OCA, aiming to achieve selective focus saliency detection. Our framework balances salient object detection, segmentation and classification to output saliency targets with homogeneous interiors and distinct boundaries. Experimental results show that our framework achieves performance comparable with the state-of-the-art. To sum up, this paper jointly considers human visual attention and a subjective task-driven manner to run class-aware saliency detection driven by object class-awareness, going further than visual saliency detection alone. Additionally, the backbone of the proposed framework is not fixed; there is still potential to improve the performance by using more effective network architectures.