Fast Genre Classification of Web Images Using Global and Local Features

An enormous number of images are present on the Web, and the number is increasing every day. To effectively mine the contents embedded in Web images, it is useful to classify the images into different types so that they can be fed to different procedures for detailed analysis, such as text and non-text image discrimination. We herein propose a hierarchical algorithm for efficiently classifying Web images into four classes, namely natural scene images, born-digital images, scanned paper documents and camera-captured paper documents, which are the most prevalent image types on the Web. Our algorithm consists of two stages: the first stage extracts global features reflecting the distributions of colour, edge and gradient, and uses a support vector machine (SVM) classifier for preliminary classification. Images assigned low confidence by the first-stage classifier are processed by the second stage, which further extracts local texture features represented in the bag-of-words framework and uses another SVM classifier for final classification. In addition, we design two fusion strategies to train the second classifier and generate the final prediction label, depending on the usage of local features in the second stage. To validate the effectiveness of our proposed method, we also built a database containing more than 55,000 images from various sources. On our test image set, we obtained an overall classification accuracy of 98.4%, and the processing speed is over 27 fps on an Intel(R) Xeon(R) CPU (2.90 GHz).


Introduction
On the Internet and mobile networks, the explosive growth of multimedia data, including texts, images and videos, brings us rich information but also the difficulty of efficiently mining relevant information. While texts are explored by most web mining tools, mining the contents of images is also important. In particular, the texts embedded in images provide easily understandable semantics, and such images occupy a considerable proportion of web pages. A study [1] showed that 17% of the words visible on web pages are in image form, and a large proportion (76%) of the text information embedded in images cannot be found anywhere else in the web pages. The texts in images, however, are hard to extract by computers, though easily read by humans. For text detection and reading methods to process efficiently in the Internet environment, we need to quickly classify images into different types of sources such that each type of image undergoes detailed analysis by a specialised procedure. Also, for accurate processing, different types of text images (document images), such as natural scene text images, born-digital images (BDIs), scanned paper documents and camera-captured paper documents (CPDs), are better analysed by different procedures.
In this paper, we propose a fast classification algorithm for classifying web images into four major types, namely natural scene images (NSIs), BDIs, scanned paper documents (SPDs) and CPDs. NSIs (photographs) are captured by surveillance or mobile cameras and are the most popular on the web. Whether or not they contain texts needs to be judged by a more detailed procedure, but the fast identification of this image type is helpful for the overall process of web image analysis. The other three types, BDIs, SPDs and CPDs, usually contain rich texts. They also show different characteristics of image quality, e.g. BDIs usually have large areas of constant colour, and SPDs are more uniform in intensity and show less distortion than CPDs. For a good tradeoff between classification accuracy and processing speed, our algorithm consists of two stages. The first stage uses global features, capturing the differences in appearance between the four types of images, for preliminary classification with a support vector machine (SVM) classifier. Images assigned low confidence by the first-stage classifier are then processed by the second stage, which extracts local texture features encoded in the bag-of-words (BoW) framework and uses another SVM classifier for final classification. Compared to global features, local texture features are able to represent the different patterns of colour transitions and properties of edges among the four types of images in a more detailed way, and yield higher classification accuracy. To validate the effectiveness of our proposed method, we built a large image database by collecting images from various sources such as web crawling, camera capture and other standard public databases. On our test image set, we obtained an overall classification accuracy of 98.4%, and the processing speed is over 27 fps on a central processing unit (CPU) (2.90 GHz).
The rest of this paper is organised as follows. Section 2 briefly reviews related works; Section 3 describes the proposed method; Section 4 introduces the image database; Section 5 presents experimental results; and Section 6 concludes the paper.

Related work
A large variety of feature extraction and classification methods have been proposed in the context of image classification and content-based image retrieval [2], but these existing methods are not directly applicable to our purpose of image genre classification. In the following, we outline some works related to our purpose.
Hammoud et al. [3] distinguished art paintings from scene photographs using colour texture signatures derived from the human visual system. The receptive field profiles and composite visual features they presented are helpful for solving our problem. Motivated by the physical image generation process, Ng et al. [4] proposed a novel geometry-based model for classifying photographic images and computer graphics in the context of image forgery detection. They exploited global geometry information at different scales as well as local patch statistics to discover the distinctive physical characteristics of images, such as the gamma correction of photographs and the sharp structures in graphics. Although the method was shown to be effective, its feature extraction is very time-consuming, e.g. global fractal geometry feature extraction alone takes 128.1 s on a 1280 × 1024 image. Athitsos et al. [5] presented a method for separating photographs and graphics on web pages. The graphics they considered, such as corporate logos, maps and navigation buttons, are very simple even compared with our BDIs, which contain both texts and graphics. Lienhart and Hartmann [6] also tried to solve this problem, and the metrics they designed depend mostly on statistics of global visual cues such as colour and edge orientation histograms. Lee et al. [7] tried to categorise images into art, photograph and cartoon using a neural network model. Five standard MPEG-7 visual descriptors [8] were employed in their work for extracting features, namely Colour Layout, Colour Structure, Homogeneous Texture, Region Shape and Edge Histogram, which are not only redundant but also time-consuming. Pourashraf et al. [9] adopted an ensemble model for classifying images embedded in commercial real estate flyers into one of five genres: aerial photograph, map, inside building, outside building and schematic drawing.
However, the model was only evaluated with a small database and the processing speed was not reported.
In recent years, deep neural networks, especially convolutional neural networks (CNNs) [10][11][12][13][14][15], have achieved great success in image recognition tasks including image categorisation, object detection, and scene text detection and recognition [16]. The superiority of the CNN is partly attributed to its ability of automatic feature extraction by learning from a large training dataset. However, the CNN suffers from heavy computation in both training and testing, and so is usually implemented on graphics processing units (GPUs) for parallel computation. This hinders its application in processing the huge amount of images on the web.
Our proposed method for fast genre classification of images uses both global visual features and local texture features, which have low computational complexity and moderate dimensionality. The local texture features, extracted from different types of image patches and represented in the BoW framework [17,18], are shown to be effective in differentiating photographs versus non-photographs and SPDs versus CPDs. Fig. 1 shows a schematic diagram of our hierarchical classification system. The first stage extracts global features and uses an SVM for preliminary classification. In this stage, images with high confidence (over a threshold T_c) are assigned a class label directly, while images with lower confidence are fed into the second stage, which extracts local texture features represented in the BoW framework and uses another SVM for final classification. In the second stage, different types of texture descriptors are extracted from local patches and each of them is represented as a BoW histogram. In particular, we carefully design four types of local patches: edge patches, key point patches, smooth region patches and random patches. This design is aimed at balancing computational complexity and classification accuracy for the second classifier, as the extraction of local features is much more computationally demanding than that of global features. Given a set of local features of a certain type, a two-step clustering method is adopted to generate a discriminative codebook, which is then used in the BoW framework. Finally, we concatenate the four BoW histograms into the local feature vector. Depending on the usage of local features, two fusion strategies are proposed to train the second classifier and generate the final prediction result.
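As an illustrative sketch (not the authors' implementation), the two-stage decision logic above can be written as follows; `stage1`/`stage2` are assumed to expose a scikit-learn-style `predict_proba` interface, and `extract_global`/`extract_local` are placeholders for the feature extractors described later:

```python
import numpy as np

def classify(image, stage1, stage2, extract_global, extract_local, T_c=0.95):
    """Two-stage pipeline sketch: accept the first-stage label when its
    confidence exceeds T_c, otherwise fall back to the second-stage
    classifier (first fusion strategy: global + local features)."""
    g = extract_global(image)
    p1 = stage1.predict_proba([g])[0]
    if p1.max() >= T_c:
        # High-confidence image: decide directly from global features.
        return int(np.argmax(p1))
    # Low confidence: extract the costlier local BoW features and
    # concatenate them with the global feature vector.
    feat = np.concatenate([g, extract_local(image)])
    p2 = stage2.predict_proba([feat])[0]
    return int(np.argmax(p2))
```

The confidence threshold T_c = 0.95 is the paper's default; raising it sends more images to the slower second stage.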

Training the second classifier with global and local features:
The first fusion strategy uses global and local features together to train the second classifier. Considering that global visual features alone are still not very discriminative for those 'difficult' images, which are assigned low confidence by the first classifier, we train the second classifier using both global and local features. Since the local texture features are more suitable for representing image details such as patterns of colour transitions and properties of edges, they can effectively compensate for the deficiency of their global counterparts. In particular, for each image sample, we concatenate its global visual feature calculated previously in the first stage and the new BoW features into a final feature vector and use it to train the second SVM classifier. In testing, the second classifier gives the final classification.

Training the second classifier with local features only:
In the second fusion strategy, we take advantage of ensemble learning. Ensemble methods use multiple learning algorithms (classifiers) to obtain better predictive performance than the constituent classifiers alone. For our problem, we can train our two SVM models in different feature spaces, namely global and local features, respectively, and improve final classification performance by combining the predictions of two classifiers. Herein, we use the global features to train the first classifier, and the local features to train the second classifier. In testing, for those images that cannot be labelled with high confidence by the first classifier, local features are extracted and fed into the second classifier. After that, we fuse the predictions of two classifiers by a weighted combination of posterior probabilities to make the final decision of image class.
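A minimal sketch of this weighted combination, assuming both classifiers output posterior probability vectors over the four classes (w = 0.80 is the paper's default weight for the local-feature classifier):

```python
import numpy as np

def fuse_predictions(p_global, p_local, w=0.80):
    """Second fusion strategy: weighted combination of the two SVMs'
    posterior probabilities.  Returns the fused label and probabilities."""
    p = (1.0 - w) * np.asarray(p_global) + w * np.asarray(p_local)
    return int(np.argmax(p)), p
```

With the default weight, a confident second-stage (local-feature) prediction can override a weak first-stage one.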

Global features
For our first-stage classification, the global features are extracted based on the different appearances of the four types of images. Compared to NSIs, BDIs tend to have fewer colours, sharper edges, larger constant colour regions and more highly saturated pixels. As for the other two types, SPDs are clearly more uniform in intensity and show less distortion than CPDs. We carefully designed our global features so that the differences mentioned above can be easily captured. Meanwhile, it is also necessary to consider the computational complexity of each type of global feature. Computationally intensive feature extraction methods such as the scale invariant feature transform (SIFT) or wavelet transform may be more discriminative and give higher classification performance, but they also consume more CPU time and memory. By contrast, our feature extraction procedures only involve first-order gradient computation and some basic image processing techniques such as thresholding, colour space conversion [from red, green and blue (RGB) to hue, saturation and value (HSV)] and binary erosion.

Coherence of highly saturated pixels f_1:
This feature is aimed at measuring the different patterns of colour transitions from pixel to pixel appearing in the four types of images. NSIs often depict objects of the real world and rarely have regions of uniform colour or coherent highly saturated pixels, because of the natural texture of objects, noise and the diversity of illumination conditions. On the other hand, BDIs tend to have larger regions of constant colour and more blocks consisting of highly saturated pixels. Let I_rgb, I_hsv and I_s denote a 3-channel RGB image, its HSV version and the saturation channel, respectively. A binary image I_mask1 is obtained by thresholding I_s with a given threshold T_s. A morphological erosion operation is then performed on I_mask1 with a 3 × 3 square structuring element to generate a new mask I_mask2. The numbers of non-zero pixels in I_mask1 and I_mask2 are calculated and denoted as N_1 and N_2. Finally, we define f_1 = N_2/N_1. To demonstrate the effectiveness of this measure visually, we randomly selected 3000 images from the NSI, BDI and CPD categories in our database (1000 samples per class) and calculated the normalised histogram of the three types of images over f_1. From Fig. 2a, we can observe that BDIs, which have more coherent and highly saturated regions, tend to have higher scores than NSIs.
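A minimal NumPy sketch of f_1, assuming the saturation channel is scaled to [0, 1] and an illustrative threshold T_s = 0.5 (the paper does not fix the value of T_s here):

```python
import numpy as np

def coherence_of_saturated_pixels(sat, T_s=0.5):
    """f_1: fraction of highly saturated pixels that survive a 3x3 erosion.

    `sat` is the saturation channel in [0, 1]; T_s = 0.5 is an assumed
    threshold, used only for illustration.
    """
    mask1 = (sat > T_s).astype(np.uint8)
    # Binary erosion with a 3x3 square structuring element: a pixel
    # survives only if its whole 3x3 neighbourhood is saturated.
    padded = np.pad(mask1, 1, mode="constant")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    mask2 = windows.min(axis=(2, 3))
    n1, n2 = mask1.sum(), mask2.sum()
    return n2 / n1 if n1 > 0 else 0.0
```

Coherent saturated blocks keep most pixels through the erosion (f_1 near 1), while scattered saturated pixels are erased (f_1 near 0), matching the BDI-versus-NSI behaviour described above.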

Average contrast of edge pixels f_2:
The second global feature focuses on the intensity transitions between edge pixels in images, which also reflect different patterns between NSIs, BDIs, CPDs and SPDs. For example, edges in NSIs and CPDs are usually generated by occlusion, illumination and changes of surface property, while BDIs tend to have more 'colour edges' [3] resulting from adjacent uniform regions. Accordingly, sharp transitions occur more frequently in BDIs than in the others. Let I_g and M_c denote a grey-scale image and its Canny edge [19] map, respectively. We define the max sharpness map M_ms, where for the current pixel (x, y) and its neighbours (x′, y′) within distance D, M_ms(x, y) is the maximum of |I_g(x′, y′) − I_g(x, y)|. In our experiments, D is set to 2. Then, f_2 is obtained by calculating the average value of M_ms with M_c as the mask. We also calculated the normalised histogram of the three types of images over f_2. As expected, we can observe that BDIs tend to have sharper edges than the others, as shown in Fig. 2b.

Coherence of smooth region f_3:
Given the gradient magnitude map M_g, a binary mask I_mask3 indicating smooth regions and its eroded version I_mask4 are generated with a threshold T_g in the same way as described in Section 3.2.1. Finally, we define f_3 from N_3 and N_4, which denote the numbers of non-zero pixels in I_mask3 and I_mask4, respectively, normalised by N_p, the total number of image pixels.
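The max-sharpness map used by f_2 above can be sketched as follows; the Canny edge mask M_c is assumed to be supplied (e.g. by OpenCV), and the wrap-around at image borders introduced by `np.roll` is accepted here as a border approximation:

```python
import numpy as np

def average_edge_contrast(gray, edge_mask, D=2):
    """f_2: mean of the max-sharpness map M_ms over Canny edge pixels.

    M_ms(x, y) = max over neighbours (x', y') within distance D of
    |I_g(x', y') - I_g(x, y)|.  `edge_mask` stands in for the Canny map
    M_c (boolean array); D = 2 follows the paper.
    """
    g = gray.astype(np.float64)
    m_ms = np.zeros_like(g)
    for dy in range(-D, D + 1):
        for dx in range(-D, D + 1):
            if dy == 0 and dx == 0:
                continue
            # Shift the image so each pixel is compared with one neighbour.
            shifted = np.roll(np.roll(g, dy, axis=0), dx, axis=1)
            m_ms = np.maximum(m_ms, np.abs(shifted - g))
    edge = edge_mask.astype(bool)
    return m_ms[edge].mean() if edge.any() else 0.0
```

On a hard step edge (constant regions meeting abruptly, as in BDI 'colour edges'), f_2 approaches the full intensity range, whereas blurred natural edges yield lower values.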

Colour histogram f_4:
This feature is designed based on the assumption that certain colours occur more frequently in certain types of images. For example, BDIs coming from business websites as advertisement images tend to be filled with highly saturated red or yellow blocks to grab people's attention. By contrast, most SPDs and CPDs are relatively monochrome, and their colours mainly consist of the colours of papers and notes, which come in very limited varieties. A colour histogram should be effective for representing this characteristic. Instead of directly calculating the histogram in the original RGB colour space, we choose the hue channel I_h for speed in practice. The dimensionality of the histogram vector in our implementation is 180, and the histograms are normalised to the 0-1 range. We also provide a scatter plot in Fig. 3a to visualise the discriminability of this feature. Similarly, 100 images per class are selected randomly and grouped together as a small visualisation dataset. Considering the high dimensionality of this feature, we adopt a dimensionality reduction algorithm, t-distributed stochastic neighbour embedding (t-SNE) [20], to map the feature vectors from 180-dimensional (180D) to 2D. From Fig. 3a, we can see that most of the SPD and CPD points are clustered and form two very distinguishable curves on the reduced map. However, it is also worth noting that there are large overlaps between the BDI and NSI points, which means that the colour histogram feature alone cannot differentiate between BDIs and NSIs. Fortunately, the features f_1 and f_2 calculated above complement it very well. For the large overlaps, one reasonable explanation is that the BDI samples contain certain small NSI patches, which we will discuss in Section 4.
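A sketch of the 180-bin hue histogram f_4, assuming hue values in [0, 180) (OpenCV's 8-bit HSV convention) and normalising by the total count so all entries lie in [0, 1]:

```python
import numpy as np

def hue_histogram(hue_deg):
    """f_4: 180-bin normalised histogram of the hue channel I_h.

    `hue_deg` holds hue values in [0, 180), so each bin spans one unit.
    Normalisation by the total pixel count is an assumption; the paper
    only states that histograms are normalised to the 0-1 range.
    """
    hist, _ = np.histogram(hue_deg, bins=180, range=(0, 180))
    hist = hist.astype(np.float64)
    total = hist.sum()
    return hist / total if total > 0 else hist
```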

Gradient magnitude histogram f_5:
The distribution of gradient magnitude values also reflects the style of images. We calculate an equal-interval histogram of M_g as f_5. The gradient value, in the range [0, 510], is quantised into 200 bins. We also show a scatter plot of the gradient magnitude histogram after dimensionality reduction to 2D in Fig. 3b.
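A sketch of f_5, assuming M_g is computed as |g_x| + |g_y| with first-order differences on an 8-bit image, which gives magnitudes in [0, 510] as quoted above (the paper does not spell out this gradient operator, so it is an assumption):

```python
import numpy as np

def gradient_magnitude_histogram(gray, bins=200):
    """f_5: 200-bin equal-interval histogram of the gradient magnitude
    map M_g over the assumed range [0, 510]."""
    g = gray.astype(np.float64)
    # First-order differences; prepending the first row/column keeps
    # the map the same size as the image.
    gx = np.abs(np.diff(g, axis=1, prepend=g[:, :1]))
    gy = np.abs(np.diff(g, axis=0, prepend=g[:1, :]))
    m_g = gx + gy
    hist, _ = np.histogram(m_g, bins=bins, range=(0, 510))
    return hist / hist.sum()
```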

Local features and BoW coding
Although the global features proposed in the above section successfully capture the different characteristics of appearance in common Web images, they are not sufficient to discriminate 'difficult' images. While the global features are aimed at quickly classifying relatively 'easy' images, the local features are aimed at discriminating the difficult images at the cost of higher computation. We introduce local texture features based on the observation that different types of images show distinct local texture patterns, e.g. BDIs and SPDs often have large constant regions, and CPDs show different texture patterns from SPDs due to the non-uniform illumination during photographing. In addition, some objects possessing typical texture patterns, such as sky, trees or walls, occur frequently in NSIs. To extract local texture features, we adopt local feature aggregation methods [21,22], which have been widely used for image classification and retrieval in recent years. We exploit four types of local patches and organise their corresponding descriptors in the BoW framework [17,18], which represents an image as a histogram of certain key descriptors and has been demonstrated to be very effective in image categorisation tasks. For computational simplicity and efficiency, the local patch types we adopt in this paper are edge patches, key point patches, smooth region patches and random patches. The details of the different types of patches and the construction of feature vectors are as follows.

Local patches and descriptors:
Four types of local patches are designed in this paper, i.e. edge patches, key point patches, smooth region patches and random patches. Local binary pattern [23] descriptors are used for the first three types of patches, and a reduced colour index histogram for the last. The number of each type of patch sampled from each test image is N_lp, and all patches have the same size S_lp × S_lp.
Edge patch: Inspired by the concepts of 'intensity edge' and 'colour edge' [3], we randomly select N_lp local patches whose centres are located exactly at Canny edge points, and build an edge patch collection for each image. Combined with the BoW framework, the differences in texture in the vicinity of edges between the four types of images are reflected in the local features.
Key point patch: Key point detectors and descriptors have been widely used in image analysis and categorisation. Considering that certain specific objects occur frequently in particular types of images, extracting key points can be useful for image genre classification. We adopt the features from accelerated segment test (FAST) corner detection algorithm [24] to locate key points for fast processing. Similarly, N_lp key points are randomly selected as the centres of the corresponding patches.
Smooth region patch: Another distinctive texture comes from smooth regions, e.g. sky, lawn and water surfaces in NSIs, and constant colour regions in BDIs and SPDs. Pixels from these regions usually have low gradient magnitudes. Therefore, we randomly select patches that have a high overlap with I_mask3. To make sure smooth pixels occupy a sufficient area of the patches, the overlap ratio threshold is set to 0.7.
Random patch: As the name suggests, patches of this type are cropped randomly from the image and mostly play a complementary role to the other types of patches. For speed, we use the histogram of the reduced colour index map of the raw image to describe these local regions. Given the original 256^3 colour space, a uniform quantisation is performed to generate a 64-level (4^3) one: each axis is divided into four equal-sized segments. We then convert the quantised 3-channel image to a 1-channel colour index map by replacing the original triple value (r, g, b) with r × 4^2 + g × 4^1 + b × 4^0 pixel by pixel. Finally, a 64D histogram based on the reduced colour index map is calculated and used as the random patches' descriptor. The reduced colour index histogram proposed in this paper, despite its simplicity, is efficient for image patch representation, owing to the relatively low cost and complexity of applying reduced colour index histograms as local descriptors.
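The quantisation and flattening steps above can be sketched directly in NumPy (on 8-bit channels, dividing each axis into four segments amounts to integer division by 64):

```python
import numpy as np

def reduced_colour_index_histogram(rgb):
    """64D descriptor for random patches: uniform 4x4x4 colour quantisation.

    Each 8-bit channel is mapped to {0, 1, 2, 3} (values // 64), and the
    triple (r, g, b) is flattened to r*4^2 + g*4 + b as in the text.
    """
    q = rgb.astype(np.uint8) // 64                      # per-channel level in {0..3}
    index = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]  # 1-channel colour index map
    hist = np.bincount(index.ravel(), minlength=64).astype(np.float64)
    return hist / hist.sum()
```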

Concatenated BoW representation:
After local feature extraction, each image is abstracted by several local descriptor vectors. Since the traditional 'hard' coding methods in the BoW framework fail to capture the spatial layout of the descriptors of local patches, we herein adopt the locality-constrained linear coding (LLC) [18] algorithm to organise the local descriptors. An approximate version is used for speed, which incorporates the locality constraint by reconstructing each descriptor with its K closest entries in the codebook. All the reconstruction vectors are then averaged to generate a final histogram vector. To achieve a more discriminative codebook, we also adopt a two-step clustering method: first, for each image in the training set, N_c1 sub-centres are selected with the K-means clustering algorithm; all the sub-centres are then gathered and clustered again to generate a codebook containing N_c2 entries. Finally, we build codebooks for each type of patch, generate the corresponding histogram vectors with LLC coding and concatenate them into a 4N_c2-D vector as the final local feature representation.
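The two-step clustering can be sketched as follows, using a minimal Lloyd's K-means for self-containment (the N_c1 and N_c2 values in the example are illustrative, not the paper's settings):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) returning k centres."""
    X = np.asarray(X, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre, then update centres.
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centres[j] = pts.mean(0)
    return centres

def two_step_codebook(per_image_descriptors, n_c1=2, n_c2=3):
    """Two-step clustering: cluster each image's descriptors into N_c1
    sub-centres, pool all sub-centres, then cluster the pool into an
    N_c2-entry codebook."""
    sub = [kmeans(d, n_c1) for d in per_image_descriptors]
    return kmeans(np.vstack(sub), n_c2)
```

Clustering per image first keeps rare but image-typical patterns from being swamped when all descriptors are pooled at once.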

Database
To validate the effectiveness of the proposed method, we have built a large database of four types of images, i.e. NSI, BDI, CPD and SPD. Depending on the degree of difficulty of labelling images, we divide our database into two sets: the single-label (SL) set and the double-label (DL) set. The first set consists of images that are easily classified by their appearance and tagged with only one label. Roughly more than 90% of the images in our database belong to the SL set. However, there is a small fraction of 'complex' samples, such as photorealistic images produced by cutting-edge computer graphics effects, images spliced with or embedding other, different types of smaller images, computer graphics displayed on liquid-crystal display monitors and then recaptured by a camera, and so on. With such confusing appearances, they cannot be classified clearly into one type. Hence, for these images, we carefully selected two proper labels as their ground truth in order to describe them as accurately as possible. Given the complexity and variety of images on the web, the DL set of 3693 images is a salutary supplement to the SL database and makes it more representative and scalable. In total, we collected 55,185 images from various sources such as web crawling, manual camera capture and other public databases including SUN397 [25] and the multilingual handwritten (MHW) dataset [26]. More details regarding the distribution of different types of images are listed in Table 1 and some samples are shown in Figs. 4 and 5. In addition, we also calculate the distributions of image size for each type of image, as shown in Fig. 6.

Experimental results and discussion
In this section, we first describe the experimental setting and implementation details, including the selection of classification models and feature parameters. Then we present the experimental results for the proposed method with global and local features, and compare the proposed method with popular CNN models.

Experimental setup
In our experiments, we adopt the radial basis function (RBF) kernel SVM as our learning algorithm, and all the classification experiments were implemented with the LIBSVM [27] package. Note that for the first-stage classification, though using a linear SVM can largely improve the classification speed, its accuracy is evidently lower than that of an RBF-kernel SVM; so, we use the RBF-kernel SVM in both stages. Since the global feature vector in the first stage has low dimensionality, the speed of the non-linear SVM is still acceptable. As for the CNN models, we implemented all of them using the PyTorch [28] platform, a popular Python package widely used by deep learning researchers in recent years. The maximum image size allowed by the system is 1000 × 1000 for processing speed; if the original image is larger than that, a 1000 × 1000 sub-region is randomly cropped and tested instead. About 70% of the images from each class are selected randomly for training the classifiers, and the rest are used for testing. Although our hierarchical classification algorithm involves several parameters, the ranges of their suitable values are relatively broad. For convenience, we divide all key parameters into two parts: feature parameters and system parameters. The first set contains parameters related to feature extraction, such as T_s, T_g, N_lp, S_lp, N_c1 and N_c2. The second consists of the confidence threshold T_c (default 0.95) and the weighting coefficient w (default 0.80) of the local feature classifier in the second fusion strategy. We observed in experiments that the feature parameters influence the final results only slightly over broad ranges (e.g. S_lp in [5, 30], N_c1 in [5, 20] and N_c2 in [50, 200]). As for the two system parameters, we give a detailed analysis in the following section.
Table 1. Numbers of images in the database: the first column gives the SL count per class; the remaining columns give the DL counts for each pair of classes

Class   SL       DL w/ NSI   DL w/ BDI   DL w/ CPD   Source note
NSI     26,410   -           -           -           SUN397: 5175
BDI     12,153   1792        -           -
CPD      6805    1484        282         -
SPD      6124      12         72         51          MHW: 6036
Total   51,492   DL total: 3693                      Overall: 55,185

There are 5175 NSIs in our database coming from the public database SUN397, and most of the SPDs (6036) used here come from the multilingual HW dataset. The classification results with global and local features are listed in Table 2. As we expected, both global and local features are discriminative for the different types of web images. Our ad hoc global features achieve 93.97% classification accuracy at a speed of 28 fps. Furthermore, compared to using global features alone, introducing local texture features evidently increases the final classification accuracy by around 5%, but at the sacrifice of processing speed. These results also justify the rationale of our hierarchical classification algorithm, which quickly filters most 'simple' images using global features and extracts the time-consuming local features only for the small number of 'difficult' images, to achieve further fine classification.

Performance of the proposed hierarchical classification method:
As is well known, deep learning models, especially deep CNNs, have achieved huge success in many computer vision tasks. To further validate the effectiveness of our classification method, we also compared our method with several of the most popular CNN models, namely AlexNet [11], VGGNet [12], ResNet [13], DenseNet [14] and SqueezeNet [15], on both the SL and DL datasets. We briefly introduce the key part of each network model; we preserve their original structures as provided in PyTorch, following the default configuration. All the models we used are pre-trained on the ImageNet [29] database, and we then use a transfer learning strategy to fine-tune the models (with the number of output nodes changed) on our training dataset.
As mentioned above, five types of CNN models are adopted in our comparison experiments. The first is AlexNet [11], one of the classic CNN architectures for image classification, which first demonstrated superior performance on the large-scale ImageNet task [29]. A typical AlexNet consists of five convolutional layers, three max-pooling layers and three fully connected layers. Regularisation techniques such as dropout are also used to reduce overfitting. The second network model is VGGNet [12], a family of neural networks sharing the same three-layer fully connected classifier. We perform experiments on VGG-11, VGG-13, VGG-16 and VGG-19, which differ from each other only in the depth of the convolutional feature extractor. The third model is ResNet [13], which has shortcut connections between layers so as to fit the difference between the input and the expected output (the residual), rather than fitting the output directly. The submodule that fits the residual is called a block; blocks consist of either two 3 × 3 convolutional layers with stride 1 and padding 1 followed by batch normalisation, or three convolutional layers with stride 1 and padding 1. All ResNets start with a convolutional layer with filter size 7 × 7, stride 2 and padding 3 followed by a batch normalisation layer, and the last layer is always a softmax classifier. DenseNet [14] has direct connections from each layer to all the layers before it. This allows it to alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse and substantially reduce the number of parameters. We experiment with its four variants, DenseNet-x, where x denotes the depth of the model. The last model we chose is SqueezeNet [15], a small and energy-efficient deep neural network (DNN).
With fewer parameters, SqueezeNet fits more easily into computer memory and can more easily be transmitted over a computer network. Table 3 shows the experimental results of the different models on the SL dataset. We give the classification accuracy, speed, number of parameters and GPU memory usage for each classifier. Compared to direct classification (Table 2), our hierarchical algorithm with the two fusion strategies achieves comparable accuracy but at a much faster speed (comparable to the speed of the global features alone). This is because most 'simple' samples have been confidently classified and filtered out by the first-stage classifier using global features. As for the comparison with CNN models, the proposed method yields accuracy comparable with that of some typical CNN models such as AlexNet, VGG-11, VGG-13, ResNet-18, ResNet-34 and the two SqueezeNet models. Some deeper architectures such as ResNet-152 and DenseNet-201 obtain higher (over 99%) accuracy, but their classification speed falls to 0.2 fps on a CPU, much slower than the proposed method. It is also worth noting that deeper models have to learn many more parameters and occupy much more memory during the training and validation phases. By contrast, our hierarchical classification algorithm achieves a good tradeoff between classification accuracy and processing speed. Although CNNs can run very fast on GPUs, GPUs are not always available and are much more energy-consuming, which limits the applicability of CNNs. On the other hand, implementing our proposed global + local feature-based classification on a GPU could also yield a hundred-fold speedup.

Performance on DL dataset:
In testing the images in the DL dataset, when the classification decision is identical to one of the ground truth labels, it is considered correct. In training, the label of a DL image is randomly selected from its label set. The results are listed in Table 4. In the setting 'SL + DL', the test performance of the proposed method is inferior to those of the CNN models ResNet-152 and DenseNet-201 but is comparable with those of the other CNN models. In the setting 'SL→DL', the proposed method performs comparably well with the CNN models, with little loss of performance when training without DL images. In contrast, the loss of performance for the CNN models from 'SL + DL' to 'SL→DL' is considerable. This is because deep neural networks rely heavily on the training set to guarantee generalisation performance.

Effects of parameters T_c and w on classification performance:
Fig. 7 shows the effects of the two major hyperparameters in testing on the SL dataset. In particular, the parameter T_c controls the number of images sent to the second-stage classifier. When it increases, more images are sent to the second-stage classifier, so the classification accuracy increases but the speed slows down. The weighting coefficient w of the local feature classifier in the second fusion strategy influences the accuracy of the fused classification. We can see that when w increases from a small value, the accuracy increases gradually and saturates at a larger value of w (from 0.5 to 0.95), because the local feature information has then played a sufficient role. Fig. 8 shows some images misclassified by our method. We can see that there are many confusions between NSIs and BDIs. Specifically, we show examples of three cases: NSIs misclassified as BDI, BDIs misclassified as NSI and CPDs misclassified as NSI. As we can see, most of the misclassified NSIs in Fig. 8a have large flat regions and highly saturated pixels, which violate the assumptions we proposed above, and thus are categorised as BDI. The BDIs and CPDs containing a large proportion of scene photographs in Figs. 8b and c are more likely to be classified as NSI. In future work, we will try to introduce more elaborate features to fix these errors.

Conclusion
In this paper, we proposed a fast two-stage classification method for categorising web images into one of four categories, i.e. NSIs, BDIs, CPDs and SPDs. The first-stage classifier uses global features, which have low dimensionality and low computational cost to guarantee high speed. The second-stage classifier extracts local texture features and represents them in the BoW framework. Our experimental results show that the proposed method yields high classification accuracy at high speed. Even compared with popular CNNs, the proposed method provides competitive accuracy, and its computational speed on a CPU is much higher than that of CNN models. For practical application, future work includes further differentiating between NSIs with and without texts, and detecting texts in NSIs.