Multi‐scale object detection by bottom‐up feature pyramid network

Deep neural networks have developed rapidly and shown great success in many significant fields, such as smart surveillance, self-driving and face recognition. However, detecting objects of multiple scales and aspect ratios remains a key problem. In this study, the authors propose a bottom-up feature pyramid network (BUFPN), coordinating multi-scale feature representation with multi-aspect-ratio anchor generation. Firstly, the multi-scale feature representation is formed by a set of fully convolutional layers concatenated after the backbone network. Secondly, to link the multi-scale features, a deconvolutional layer is inserted after each multi-scale feature map. Thirdly, to handle objects with different aspect ratios, the anchors on each multi-scale feature map are generated in six shapes. The proposed method is evaluated on the PASCAL visual object detection dataset and reaches an accuracy of 80.5%.


Introduction
In the last 10 years, driven by deep neural networks, object detection has developed through three stages. In the first stage, hand-crafted features were used and the classifiers were mostly discriminative. Classical hand-crafted features such as the histogram of oriented gradients [1] and the scale-invariant feature transform were designed to represent the characteristics of objects. Discriminative classifiers, such as the support vector machine (SVM) [2], AdaBoost and Random Forest, were then applied for classification. In this stage, computer vision tasks mainly consisted of hand-crafted feature representation and classification, and candidate object locations were proposed by a sliding window. Felzenszwalb et al. [3] proposed the deformable part model (DPM), which treats the object as a spring model divided into several components. Although DPM is an excellent detector, the computational cost of the sliding-window strategy is huge and the detection efficiency is far from real time. In the second stage, region proposal methods such as Edge Boxes and Selective Search replaced the sliding window for efficiency. Here, the detector detects the target from the proposed regions, and the computational cost drops because the number of candidate regions per image decreases by almost two orders of magnitude. The third stage began in 2012, when a classification method based on convolutional neural networks won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012). In that competition, Krizhevsky proposed AlexNet, built on convolutional layers, fully connected layers and a simple softmax classifier, which turned convolution-based methods into the mainstream of computer vision. Since then, two classes of object detection frameworks have emerged: two-stage frameworks and one-stage frameworks.
Two-stage object detection framework: Ross Girshick proposed an object detection framework based on CNNs, named Regions with convolutional neural network features (R-CNN), which became the baseline detection framework. The detection flow of R-CNN mainly contains three steps: (i) R-CNN obtains candidate regions by Selective Search on the original image; (ii) the deep features of the candidate regions are generated by resizing each region to a unified size and feeding it forward through a CNN (VGG16); (iii) for each candidate region, a set of SVMs performs classification. Comprehensive experiments show that features generated by a CNN are more discriminative than hand-crafted features. It is noteworthy, however, that Selective Search takes several seconds per image, the deep features of different regions are extracted repeatedly, and the training procedure is not end-to-end. Accelerated region-proposal variants, Fast R-CNN and Faster R-CNN [4], were subsequently proposed.
One-stage object detection framework: Joseph Redmon proposed You Only Look Once (YOLO) [5], which treats detection as a regression task. Liu et al. [6] then proposed the Single Shot MultiBox Detector (SSD), which combines the anchor idea of R-CNN with the regression idea of YOLO and thus achieves both higher accuracy and faster detection speed. The detection flow of SSD contains four parts: (i) a set of convolutional layers builds a multi-scale feature representation; (ii) anchors of different shapes are generated on the multi-scale feature maps; (iii) classification and bounding-box regression are performed for each anchor; (iv) non-maximum suppression (NMS) eliminates redundant detections. SSD still suffers from low detection accuracy on small-scale objects; the Deconvolutional Single Shot MultiBox Detector (DSSD) was proposed to tackle this problem by enlarging the feature maps with a deconvolution procedure.
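Step (iv) above, non-maximum suppression, can be sketched as follows. This is a generic greedy NMS in NumPy; the IOU threshold of 0.45 is a common SSD default and an assumption here, not a value taken from this letter.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.

    Repeatedly keeps the highest-scoring box and discards any remaining
    box whose IOU with it exceeds iou_thresh.
    """
    order = np.argsort(scores)[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```

A detector would call this once per class on the boxes surviving a score threshold.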
With the rapid development of convolutional neural networks, object detection in still images has made tremendous progress, mostly due to the emergence of deep neural networks and their region-based descendants. However, problems remain. Firstly, small object detection is still a tough task in computer vision due to the small scale of such objects and the resulting loss of information. Secondly, most detection frameworks treat the multi-scale features independently, so the relationship between upper and lower feature maps is not fully exploited.
In this paper, we propose a bottom-up feature pyramid network (BUFPN) based on a fully convolutional network, which maintains fast detection speed and high accuracy. To relate the multi-scale feature maps, we place a deconvolutional layer after each scale's feature map and merge the lower-scale feature map with the deconvolved upper-scale feature map. In this way, the lower-scale feature map contains some of the context information from the upper feature map, which increases the detection accuracy for small-scale objects.

Backbone network
The backbone network of the proposed BUFPN is VGG-16, pre-trained on the ImageNet classification-localisation dataset. The whole pipeline of VGG-16 is shown in Fig. 2; it contains 13 convolutional layers and three fully connected (FC) layers. Firstly, the deep features are generated by the convolutional layers. Secondly, as in the Single Shot MultiBox Detector, the three fully connected layers are removed.
The details of the backbone are as follows: (i) Input: images with RGB channels.
(ii) Convolutional layers: there are five sets of convolutional layers, Conv1 to Conv5. Conv1 consists of two convolutional layers with 3 × 3 kernels; similarly, each of the other sets consists of two or three convolutional layers, all with 3 × 3 kernels.
(iii) Rectified linear units (ReLUs) are used as the activation function, which makes the network easier to train.
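As a rough sketch of how the spatial resolution shrinks through these five sets, the following assumes size-preserving 3 × 3 convolutions (padding 1) with a stride-2 max pool after each set, using ceil-mode rounding as in SSD's VGG variant; these pooling details are an assumption, since the letter does not state them.

```python
# VGG-16 convolutional configuration: (number of 3x3 conv layers, output
# channels) for Conv1..Conv5. The 3x3 convs preserve spatial size.
VGG16_SETS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def feature_map_sizes(input_size=300):
    """Spatial size after each conv set, assuming a 2x2 stride-2 max pool
    (ceil mode) follows every set."""
    sizes = []
    s = input_size
    for _n_convs, _channels in VGG16_SETS:
        s = (s + 1) // 2  # stride-2 pool with ceil-mode rounding
        sizes.append(s)
    return sizes
```

For a 300 × 300 input this yields the familiar 38 × 38 resolution at Conv4_3, the finest map used for detection.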

Bottom-up multi-scale feature representation
As shown in Fig. 3, the bottom-up multi-scale feature representation is cascaded after the backbone network; it contains a set of fully convolutional layers and pooling layers. The difficulty in detecting small objects is that they lose much information, so a feed-forward convolutional procedure alone is not enough. Inspired by FTF [7], where context information is shown to be useful for detecting small objects, we add a deconvolution process after the multi-scale feature maps are generated; we call this the bottom-up multi-scale feature representation.
The deconvolution we use here is bilinear up-sampling, as used in FCN [8] for up-sampling feature maps.
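A minimal NumPy sketch of this bilinear 2× up-sampling and of merging an upper (coarser) map into a lower (finer) one. The half-pixel coordinate convention and the additive fusion are our assumptions; the letter does not specify either detail.

```python
import numpy as np

def bilinear_upsample2x(fmap):
    """Double the spatial size of an (H, W, C) feature map with
    bilinear interpolation (align_corners=False convention)."""
    h, w, c = fmap.shape
    # Map each target pixel centre back to source coordinates
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]  # vertical weights
    wx = np.clip(xs - x0, 0, 1)[None, :, None]  # horizontal weights
    top = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    bot = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def merge(lower, upper):
    """Fuse the up-sampled upper-scale map into the lower-scale map."""
    return lower + bilinear_upsample2x(upper)
```

In the network this merge would be followed by the detection heads on the enriched lower-scale map.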

Anchor generation and prediction
The anchor in a deep neural network has two functions. The first is selecting positive samples: the IOU between the anchor and the ground truth is the key index that decides whether a sample is positive, and in many object detection frameworks a window is a positive sample if the IOU is >0.5. The second is producing the classification and bounding-box results: in the training procedure, the classification and bounding-box losses are computed from the anchors, and in the testing procedure, the locations and classes of the objects are obtained after NMS of the bounding boxes. In the proposed BUFPN framework, each feature map corresponds to a certain scale s_n:

s_n = s_min + ((s_max − s_min)/(m − 1)) (n − 1),  n ∈ [1, m]

where s_min and s_max are the scales of the smallest and largest detected objects, respectively, and m is the number of feature maps.
There are six shapes of anchors for each pixel on the feature maps:

anchor_w_n = s_n √(a_r),  anchor_h_n = s_n / √(a_r),  a_r ∈ {1, 1/2, 1/3, 2, 3}

plus one extra square anchor at the intermediate scale √(s_n s_{n+1}). Here, anchor_w_n and anchor_h_n are the width and height of the anchor and a_r is its aspect ratio. In the prediction process, each anchor obtains a classification score and bounding-box offsets, computed by classification and bounding-box regression kernels of size 3 × 3 (Fig. 4). Let n be the number of classes to be detected; the output for each anchor then has (n + 1) + 4 values: (n + 1) for the classes plus the background, and 4 for the offsets of this anchor.
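The scale and shape computation above can be sketched as follows. The defaults s_min = 0.1 and s_max = 0.9 follow common SSD practice and are assumptions here; the letter only fixes s_min per input size.

```python
import math

def anchor_scales(m, s_min=0.1, s_max=0.9):
    """Linearly spaced scales s_1..s_m between s_min and s_max."""
    return [s_min + (s_max - s_min) * (n - 1) / (m - 1)
            for n in range(1, m + 1)]

def anchor_shapes(s_n, s_next):
    """Six (width, height) pairs for one feature map: five aspect ratios
    at scale s_n, plus one square anchor at sqrt(s_n * s_next)."""
    ratios = [1, 1 / 2, 1 / 3, 2, 3]
    shapes = [(s_n * math.sqrt(r), s_n / math.sqrt(r)) for r in ratios]
    shapes.append((math.sqrt(s_n * s_next),) * 2)  # extra square anchor
    return shapes
```

Each (width, height) pair is expressed as a fraction of the input image size and is tiled at every pixel of the corresponding feature map.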

Loss function
The design of the detection loss is a key procedure in training, since the loss encodes the information of the object. Following Faster R-CNN, the detection loss is the sum of a classification loss L_cls and a bounding-box regression loss L_bbox. Let f be the matching indicator for the anchors whose IOU with a ground truth is >0.5:

L(f, c, l, g) = (1/N) [L_cls(f, c) + L_bbox(f, l, g)]

Here, N is the number of anchors whose IOU with a ground truth is >0.5, c is the detection score, and l and g are the predicted location offsets and the ground truth, respectively. L_cls is a multi-class softmax loss:

L_cls(f, c) = − Σ_{i ∈ Pos} f_ij^label log(ĉ_i^label) − Σ_{i ∈ Neg} log(ĉ_i^0),  ĉ_i^label = exp(c_i^label) / Σ_p exp(c_i^p)

where c_i^label is the score for a certain label of the ith anchor; c_5^person, for example, is the prediction score of the class person for the fifth anchor box. The indicator f_ij^label ∈ {1, 0} denotes whether the ith anchor is matched to the jth ground truth of that label, and label 0 is the background, so f_ij^0 = 1 means the ith anchor is matched to the background. L_bbox is formulated with the smooth L1 loss between the detection box and the ground truth, similar to R-CNN; compared with the L2 loss, smooth L1 is less sensitive to outliers:

L_bbox(f, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {bx, by, bw, bh}} f_ij^label smooth_L1(l_i^m − ĝ_j^m)

A location has four indexes (bx, by, bw, bh): the centre of the location is (bx, by) and its width and height are (bw, bh). Here g refers to the ground truth and l to the predicted location. Finally, the ground truth is parameterised relative to its matched anchor box d before the regression loss is computed:

ĝ_j^bx = (g_j^bx − d_i^bx)/d_i^bw,  ĝ_j^by = (g_j^by − d_i^by)/d_i^bh,  ĝ_j^bw = log(g_j^bw/d_i^bw),  ĝ_j^bh = log(g_j^bh/d_i^bh)

This parameterisation makes the loss more sensitive to the centre of the location, because the centre is more important than the width and height.
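A minimal NumPy sketch of the smooth L1 term and the centre-size parameterisation, assuming the standard R-CNN/SSD encoding of boxes as (centre x, centre y, width, height); the helper names are ours, not the letter's.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| > 1, so large
    regression errors are penalised less harshly than under L2."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def encode_box(gt, anchor):
    """Parameterise a ground-truth box (cx, cy, w, h) relative to its
    matched anchor: centre offsets scaled by anchor size, log size ratios."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def bbox_loss(pred_offsets, gt_boxes, anchors):
    """Sum of smooth L1 over the four offsets of every positive anchor."""
    targets = np.array([encode_box(g, a) for g, a in zip(gt_boxes, anchors)])
    return smooth_l1(pred_offsets - targets).sum()
```

A prediction that exactly matches its encoded target incurs zero regression loss, and the loss grows only linearly for badly mislocalised boxes.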

Results
In this section, we first describe the dataset used for the experiments and analyse its characteristics. Then, we explain the details of the training and testing procedures. Finally, we report the accuracy statistics on this dataset.

Description of dataset
We evaluate BUFPN on the PASCAL VOC detection benchmark, which contains 20 object categories, and provide our results on this benchmark. Fig. 5 shows the object sizes in the PASCAL VOC dataset: less than half of the objects are larger than 0.33 of the image size, which is why we set many more small-scale anchors than large-scale anchors. Fig. 6 shows the area statistics of the small objects. These characteristics of the dataset guide the selection of the anchors: if the size of the smallest anchor is 0.1 (corresponding to an input image size of 300), the framework will miss ∼10% of the ground truth; if the size of the smallest anchor is 0.07 (input image size 512), it will miss ∼5%.

Experiment setup
Following the baseline Single Shot MultiBox Detector, the input is first resized to 300 × 300 (512 × 512). We use Conv10_2 (Conv11_2), Conv9_2, Conv8_2, Conv7_2, Conv6_2 and Conv4_3 as the feature pyramid to detect objects. Following the object-area characteristics, the minimum scale is set to 0.1 (0.07) in order to include more positive samples, considering the receptive field.

Conclusion
This letter introduces a multi-scale feature representation method named BUFPN, which exploits the features fully by relating the feature maps and merging lower-scale and upper-scale feature maps. In addition, anchors of various shapes are generated on the feature maps to handle objects of different scales and aspect ratios. Compared with the baseline, the detection results gain 0.9% and 0.8% on the PASCAL VOC dataset. The proposed method only contains a bottom-up flow; in the future, a top-down flow should be considered as well.

Acknowledgments
This work is supported by the 111 Project of China under Grant B14010 and the Chang Jiang Scholars Programme (Grant no. T2012122).