Efficient recurrent attention network for remote sensing scene classification

Correspondence Guoli Wang, Tsinghua University, Beijing 100000, China. Email: wangguoli1990@mail.tsinghua.edu.cn

Abstract: Scene classification for remote sensing is a popular topic, and many recent convolutional neural network (CNN)-based methods have shown great model capacity and the ability to learn highly discriminative features. Given a large amount of training data, a CNN can extract extensive features and learn to predict the class of a remote sensing image. However, for supervised learning tasks, deep models often rely on a large number of labelled remote sensing images, which are difficult to prepare. Thus, training a lightweight deep learning model is essential. Moreover, an imbalance between easy-classified and hard samples in the training set may allow the easy samples to overwhelm the loss function. Accordingly, a novel Efficient Recurrent Attention Network (ERANet) for remote sensing scene classification is proposed. Different from traditional deep learning methods, Efficientnet-B0 is introduced as a lightweight backbone for the ARCNet framework, replacing the original one. By applying the modified efficient backbone, the low floating point operations (FLOPs) and parameter numbers of the proposed ERANet are maintained. The significance of focal loss is determined, and it is applied to address the sample imbalance problem and yield a desirable performance. Extensive experiments on several challenging remote sensing scene classification data sets prove the efficiency of the proposed ERANet.


INTRODUCTION
Scene classification via high-resolution remote sensing images has become an increasingly heated topic in academic research and industrial applications [1][2][3]. In recent years, given the rapid development and wide use of deep learning techniques, scene classification has also turned to this field and achieved desirable results [4]. However, when applying deep learning techniques to scene classification with high-resolution remote sensing images, the difficulty of extracting the most significant features from images through a convolutional neural network (CNN) becomes the main problem limiting the performance of algorithms. High-resolution satellite images differ from images collected from ordinary sources: they are usually taken from an overhead view covering a large area, and thus include far more types of objects and features than normal images. The negative effects of redundant parts that are irrelevant to scene classification severely limit the performance of deep learning-based scene classification methods [5].

In the development of remote sensing scene classification using different feature extraction methods, previous works can generally be sorted into three categories [6]. The first category includes methods with handcrafted features. These are commonly extremely early works, including the following: methods that use colour histograms and human-designed texture descriptors as global features that classifiers can use directly [7]; methods with local features and mid-level descriptors to produce entire representations, such as the scale-invariant feature transform (SIFT) [8]; and methods combining global and local features [9]. The second category is unsupervised learning of features. Many unsupervised methods have achieved better performance than handcrafted feature-based methods [10,11].
However, unsupervised learning remains insufficiently robust in distinguishing variations between scenes because of the lack of label information. The third category is deep learning features. In recent years, deep learning has been proven the most effective way to conduct feature extraction in almost all computer vision tasks, including remote sensing scene classification, and nearly all state-of-the-art methods in this field use CNNs to extract features. For example, Hu et al. [12] proposed two scenarios for producing features for high-resolution remote sensing images by extracting CNN features from different layers. Chaib et al. [13] proposed using a pre-trained VGG-Net as a deep feature extractor to acquire informative features from high-resolution remote sensing images.
Three significant problems that limit the large-scale industrial application of previous methods can be summarized as follows. First, in the above works, the main problem limiting the performance of algorithms for remote sensing image scene classification remains unsolved: all three categories, including deep learning feature extraction, which has been proven the most efficient, still suffer from redundant information in the extracted or fabricated features. Regarding this problem, some investigations have been conducted, and some breakthroughs have been achieved. One branch of investigation allows the models to select parts of the inputs rather than using all of them, as in saliency detection methods [14][15][16]. Although this direction has achieved remarkable breakthroughs in making the model concentrate on useful parts of remote sensing images, its performance is not desirable enough. Wang et al. [5] proposed ARCNet, a new effective method with desirable performance that innovatively employs attention mechanisms in remote sensing scene classification; it is a relatively successful attempt to solve this first problem. The second problem lies in the efficiency of ARCNet and other previous methods. Their backbones are large, and training can be slow and difficult. This hinders large-scale industrial application of remote sensing scene classification because bulky backbones prevent models from achieving desirable performance with limited computational resources. The third problem is the design of the loss function. If the framework pays less attention to easy-classified samples and focuses on hard samples, a more satisfactory performance can be acquired. The study of Lin et al. [17] indicated that in many training scenarios, the majority of well-classified samples tend to overwhelm the loss function and negatively influence the optimization process in backpropagation.
In the current study, a novel method is proposed as a solution to the above three problems, with three main measures corresponding to each problem, respectively. First, the framework of ARCNet [5] is employed, using the attention mechanism to distill useful information and extract effective, helpful, and relevant features from remote sensing images; long short-term memory (LSTM) units are used to process recurrent attention features. Second, EfficientNet, proposed by Tan and Le [18], specifically Efficientnet-B0, is employed as the backbone of the feature extraction network. EfficientNet is a recently proposed light, efficient, and effective backbone that has been successfully adapted to many computer vision tasks concerning feature extraction from images. With EfficientNet, the proposed method can acquire state-of-the-art performance within a relatively tight budget of computational resources and parameters. Third, in terms of the loss function, the focal loss proposed by Lin et al. [17] is employed to further enhance the performance of the proposed method. Focal loss reshapes the cross-entropy criterion by applying a modulating factor to the standard cross-entropy loss; it downweights the loss assigned to well-classified samples and makes the model focus on classifying hard samples.
To sum up, this study contributes to the literature as follows:
1. By introducing the Efficientnet-B0 lightweight backbone into the ARCNet framework, the low FLOPs and parameter numbers are maintained, and the overall classification performance is improved.
2. Focal loss is selected to address the sample imbalance problem and yield a desirable performance.
3. The proposed model achieves state-of-the-art performance on three challenging remote sensing scene classification data sets: the UC Merced (UCM) land-use data set, the Aerial Image Data set (AID), and the NWPU-RESISC45 (NWPU) data set.
The remainder of this study is structured as follows. Section 2 introduces some related works. Section 3 provides a detailed description of the proposed method. Section 4 reports the experimental results and analysis. Finally, Section 5 provides the conclusions of this study.

RELATED WORKS
In this section, related works on the application of attention mechanism in remote sensing and remote sensing scene classification methods are reviewed.

Attention mechanism
Although the attention mechanism has been a popular and widespread method for enhancing performance in numerous computer vision tasks, it was rarely used in the field of remote sensing image scene classification. Attention can be divided in accordance with the dimension on which selective weights are placed: channel-wise attention places attentive weights along channels, whereas spatial-wise attention places attentive weights inside each feature map. Recently, many works on remote sensing scene classification have employed attention mechanisms to enhance their performance and received satisfactory results. For instance, Tong et al. [19] devised a channel-wise attention mechanism in a DenseNet backbone to construct a remote sensing scene classification network. Li et al. [20] proposed a set of augmentation operations over attention feature maps to compel the model to find class-discriminative features and reduce redundant information as much as possible. Gao et al. [21] proposed using spatial-wise and channel-wise attention mechanisms simultaneously to explore contextual dependencies along the channel and spatial directions separately. Among these methods, ARCNet [5], which employs novel LSTM structures to process recurrent attention, has marked a milestone.

[Figure 1. The pipeline of the proposed method: a remote sensing image is first input to the pre-trained EfficientNet-B0 backbone to produce a feature map, which is then passed to the recurrent attention module for better representative ability; focal loss is used during the training stage to further improve classification performance.]
ARCNet not only uses attention mechanisms to extract useful features but also views attention feature extraction in scene classification as a recurrent procedure, using LSTM to address this issue, and it has achieved a significantly desirable performance. In other related studies, the attention mechanism has been used in aerial image scene classification and has become so mature that it has been integrated into both unsupervised learning and deep learning [10,22,23]. The task and background of aerial scene classification are quite similar to those of remote sensing scene classification; consequently, many techniques and inspirations from aerial scene classification can be adopted for remote sensing scene classification.

Remote sensing scene classification
Remote sensing high-resolution scene classification is a relevant topic in computer vision and remote sensing. As mentioned in the previous section, remote sensing scene classification has three kinds of approaches, namely, handcrafted features [7][8][9], unsupervised learning [10,11], and deep learning [5,12,13].
With the rapid development and widespread application of deep learning, methods with handcrafted features have been outrun by deep learning methods. Unsupervised learning also involves deep learning techniques; however, due to the lack of robust information from labels, it has not been as reliable as deep supervised learning. Currently, the mainstream approach with the best performance is deep learning-based scene classification, and nearly all state-of-the-art remote sensing scene classification methods are based on deep learning [5].

METHOD DESCRIPTION
The pipeline of the proposed method is shown in Figure 1. In this section, each procedure involved in the proposed method is elaborated. First, the backbone architecture for high-level feature extraction is explained. Second, a brief introduction of the used recurrent attention module is provided. Finally, the focal loss is used to further improve the classification performance of the proposed method.

Efficient backbone
Most deep learning-based methods for scene classification [5] utilize popular CNN architectures as a feature extractor, also called a 'backbone'. These famous backbones have been compared on different data sets to select the one with the best performance. AlexNet [24], VGGNet [25], and ResNet [26] are the three most frequently used models.

[Figure 2. The illustration of MBConv6, k5×5. DWConv denotes depthwise convolution, and BN denotes batch normalization; k5×5 is the kernel size, and H×W×C represents the tensor shape (height, width, depth).]
Researchers can also apply their pre-trained models for fast training and high performance. In the current study, a novel neural network architecture is employed as the backbone. Different from the above traditional CNN networks [24][25][26], which try to find connections between layers or design complicated structures from well-designed handcrafted experience, EfficientNet [18] takes an unusual way. Similar to [27], EfficientNet simultaneously optimizes accuracy and FLOPs by conducting a multi-objective neural architecture search. Suppose the accuracy and FLOPs of model m are denoted as ACC(m) and FLOPS(m), respectively; then the optimization goal is designed as:

max_m ACC(m) × [FLOPS(m) / T]^w

where T = 400M is the target FLOPs, and w = −0.07 is the trade-off hyper-parameter between accuracy and FLOPs. EfficientNet utilizes the same search space as MnasNet [27] and mainly builds on the mobile inverted bottleneck convolution (MBConv) [27,28] with the squeeze-and-excitation module [29]. It also uses the Swish activation [30] instead of ReLU [24] for better performance. An example of MBConv6, k5×5, is shown in Figure 2. The searched network architecture is treated as the baseline and denoted as EfficientNet-B0. Considering its favourable trade-off between accuracy and efficiency, EfficientNet-B0 is adopted as the backbone of the proposed ERANet.
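The multi-objective search goal above can be sketched as a small reward function. This is only an illustration of the formula; the function name `search_reward` and the example accuracy values are hypothetical and not part of EfficientNet's released code.

```python
def search_reward(acc, flops, target_flops=400e6, w=-0.07):
    """Multi-objective reward ACC(m) * [FLOPS(m)/T]^w from the
    MnasNet-style search that produced EfficientNet-B0.

    With w < 0, models cheaper than the target (flops < T) receive a
    reward bonus, while models above the target are penalized.
    """
    return acc * (flops / target_flops) ** w
```

For example, a model at exactly the target FLOPs keeps its raw accuracy as reward, whereas halving the FLOPs multiplies the reward by (0.5)^(−0.07) ≈ 1.05, which lets the search trade a small accuracy drop for a large efficiency gain.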

Recurrent attention module
High-resolution remote sensing images often include redundant information, which may go against the scene classification task. Therefore, Wang et al. [5] proposed ARCNet to generate attention features from high-level neural network features for better scene classification performance.
Suppose F = {f_1, f_2, …, f_{P×P}} is a feature map of size P × P with D channels (f_i ∈ R^D). Each f_i is the high-level feature representation of the corresponding receptive field. Assuming that a_t = {a_{t,1}, a_{t,2}, …, a_{t,P×P}}, t ∈ 1, …, T, is a P × P sized mask matrix of attention weights at time t, the multiplication between F and a_t is computed as follows:

x_t = a_t ⊙ F = {a_{t,1} f_1, a_{t,2} f_2, …, a_{t,P×P} f_{P×P}}

where x_t is a P × P × D feature block that is subsequently processed by the recurrent attention module at time t, as shown in Figure 1.
ARCNet employs a recurrent neural network (RNN) to obtain recurrent attention features, generate a series of sequential attention representations, and persistently learn and update them during the training stage. The state update function can be calculated as follows:

h_t = f(h_{t−1}, x_t)

where h_{t−1} and h_t are the hidden states at times t − 1 and t, respectively. The LSTM unit is used as the processor in ARCNet as follows:

(h_t, c_t) = LSTM(h_{t−1}, c_{t−1}, x_t)

where c_{t−1} and c_t denote the memory cell states at times t − 1 and t, respectively. Similar to ARCNet [5], three stacked LSTM layers, whose hidden layers have D memory cells, followed by a softmax activation, are used to generate the prediction vector as follows:

y_t = softmax(W_y h_t + b_y), t = 1, …, T

where y_t denotes the prediction vector at time t, and T is the total number of recurrent steps. L denotes the number of categories, and thus y_{t,l} is the probability of belonging to the l-th category at time t. All prediction vectors at different times are directly summed to obtain the final prediction vector V, as shown in Figure 1:

V = Σ_{t=1}^{T} y_t

The output h_t of the stacked LSTM is also input to another softmax, which outputs the probability revealing the significance of each pixel in the feature block at time t + 1, i.e. the attention matrix a_{t+1}. The recurrent attention architecture used in the proposed method is the same as in ARCNet; the ablation experiments in ARCNet are directly used as a reference to set the parameters, and the settings are elaborated in Table 2.
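The recurrent loop described above, attention weighting, state update, per-step prediction, and summation into V, can be sketched in NumPy. This is a toy illustration only: the three stacked LSTM layers of ARCNet are abstracted here as a simple mean-pooling state update, and the names `recurrent_attention`, `W_pred`, and `W_att` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def recurrent_attention(F, W_pred, W_att, T=4):
    """Toy sketch of ARCNet-style recurrent attention.

    F      : (P*P, D) feature map, one D-dim vector per spatial cell.
    W_pred : (L, D) projection producing class scores from the state.
    W_att  : (P*P, D) projection producing the next attention weights.
    The stacked-LSTM update is replaced by a mean over the attended
    features (a placeholder, NOT the actual LSTM of ARCNet).
    """
    n_cells, _ = F.shape
    a = np.full(n_cells, 1.0 / n_cells)   # uniform initial attention a_1
    V = np.zeros(W_pred.shape[0])
    for _ in range(T):
        x = a[:, None] * F                # x_t = a_t ⊙ F
        h = x.mean(axis=0)                # placeholder state update
        V += softmax(W_pred @ h)          # y_t summed into V
        a = softmax(W_att @ h)            # attention matrix a_{t+1}
    return V / V.sum()                    # normalised final prediction
```

Dividing by V.sum() at the end simply renormalises the summed prediction vectors so the output is a probability distribution over the L categories.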

Loss function
For scene classification tasks, most state-of-the-art (SOTA) methods choose cross-entropy loss and its variants to compose the loss function [5]. Although traditional cross-entropy loss performs well on benchmarks, actual scene classification problems often involve imbalanced data and classes with high similarity. For instance, the AID data set includes 30 scene categories at 600 × 600 resolution, where the number of images per class varies from 220 to 420. Moreover, the NWPU-RESISC45 data set contains classes such as basketball courts and tennis courts, which intensifies the characteristic of small interclass variance. The focal loss is applied in this study to relieve the imbalance among different classes of scene images and learn discriminative features from all classes of remote sensing images. Consequently, the intraclass distance is reduced, and the interclass distance is enlarged. The focal loss [17] derives from cross-entropy loss, and its details are discussed in the subsequent sections.
The focal loss for the one-stage object detector is designed to match the outstanding performance of two-stage detectors, even when an extreme imbalance exists between foreground and background pixels. Consider the cross-entropy loss for the following binary classification problem:

CE(p, y) = −log(p) if y = 1; −log(1 − p) otherwise  (7)

where y ∈ {±1} denotes the two classes, and p ∈ [0, 1] is the estimated probability for the ground-truth class (y = 1). p_t can be defined as follows:

p_t = p if y = 1; 1 − p otherwise  (8)

Equation (7) can then be rewritten as CE(p, y) = CE(p_t) = −log(p_t) for convenience.
In this study, the cross-entropy loss is multiplied by a modulating factor (1 − p_t)^γ, where γ is a tunable focusing parameter restricted by γ ≥ 0. Intuitively, (1 − p_t)^γ reduces easy samples' contribution to the total loss and extends the range in which a sample receives a low loss. By adding this modulating factor, CE(p_t) is converted to the focal version as follows:

FL(p_t) = −(1 − p_t)^γ log(p_t)  (9)

As mentioned previously, common practice is to introduce a weighting factor α ∈ [0, 1] for the imbalance problem. In practice, α can be set in accordance with the inverse class frequency or tuned as a hyper-parameter by cross-validation. By combining Equation (9) with the weighting factor, the final version of the focal loss function is defined by Equation (10) as follows:

FL(p_t) = −α (1 − p_t)^γ log(p_t)  (10)

To sum up, the weighting factor α can balance the weights of positive and negative samples, but it neglects the distinction between easy and hard examples. By introducing the modulating factor (1 − p_t)^γ, γ can tune the speed at which the weight of easy examples decays. In the particular case of γ = 0, the focal loss degenerates to cross-entropy loss. For the scene classification tasks in this study, compared with cross-entropy loss, the imbalance problem can be addressed, and classes with small interclass variance can be dealt with by tuning γ, with α set to 1.
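Equations (7)-(10) can be written as a minimal NumPy sketch. The function names are illustrative; γ = 0.3 and α = 1 are only defaults matching the settings reported later in this paper, not values fixed by the focal loss itself.

```python
import numpy as np

def cross_entropy(p, y):
    """Binary cross-entropy CE(p_t) = -log(p_t), Equations (7)-(8).

    p : predicted probability of the positive class, in (0, 1).
    y : ground-truth label in {+1, -1}.
    """
    p_t = p if y == 1 else 1.0 - p
    return -np.log(p_t)

def focal_loss(p, y, gamma=0.3, alpha=1.0):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t),
    Equation (10). Reduces to cross-entropy when gamma = 0."""
    p_t = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)
```

For a well-classified sample (p_t = 0.9) with γ = 2, the loss is scaled by (1 − 0.9)² = 0.01, so easy examples contribute far less to the total loss than hard ones.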

EXPERIMENTS
In this section, experiments are carried out to demonstrate the efficiency of the proposed method. First, the data sets and evaluation metrics used to verify the proposed method are exhibited. Second, the detailed setting of the parameters is introduced. Finally, the results of comparison experiments with other state-of-the-art methods and an ablation study are displayed to illustrate the performance of the proposed method and the sources of enhancement.

Experiment data set
The experiments are conducted on three well-known data sets that are widely used for evaluating remote sensing scene classification tasks. These data sets are the UCM land-use data set, AID, and NWPU-RESISC45 data set. All the details are illustrated in Table 3.
1. UCM land-use data set: The UCM land-use data set [31] is the earliest ground-truth data set developed from public high-resolution overhead imagery. It was manually processed from aerial orthophotos and can be accessed from the United States Geological Survey National Map. The land-use scenes cover 21 categories, including aeroplane, baseball court, beach, agricultural land, buildings, chaparral, dense residence, forest, expressway, golf court, harbour, intersection, medium-density residence, overpass, mobile home park, river, runway, parking lot, sparse residence, storage tanks, and tennis court. Each category has 100 images of 256 × 256 pixels, with a spatial resolution of 30 cm per pixel in RGB space. The interclass similarity among classes is extremely high, making UCM one of the most challenging remote sensing classification data sets.
2. AID: The AID [32] is a data set for aerial scene classification with 10,000 images of 600 × 600 pixels. It consists of 30 aerial scene types: viaduct, storage tanks, stadium, square, sparse residence, school, river, resort, railway station, port, pond, playground, parking lot, park, mountain, medium residence, meadow, industrial land, forest, farmland, centre, baseball field, bridge, commercial areas, church, beach, and airport. The ground-truth labels were made by experts in remote sensing image interpretation.
3. NWPU-RESISC45: The NWPU-RESISC45 data set [1] is a free and publicly available data set for remote sensing image scene classification created by NWPU. It has 31,500 images in 45 scene classes, that is, 700 images per class. Compared with other data sets for remote sensing image scene classification, NWPU-RESISC45 is of very large scale in terms of categories and the total number of images. It is also highly challenging because of its within-class diversity and interclass similarity.

Evaluation metrics
In scene classification, the most widely used evaluation metrics are OA (overall accuracy) and CM (confusion matrix). We apply OA and CM to the above three data sets for evaluation.
1. OA: Without considering which scene class the images belong to, the number of correctly classified images is divided by the total number of images in the data set. In training, results on the training set are usually extremely good; however, when the trained model is applied to samples never seen during training, performance may fall drastically. Therefore, not all images in the data set are used for training: only 10-80% of the whole data set is used as the training set, and the remaining data are used as the validation set. Cross-validation is also conducted: for each experiment on each data set, every model is trained and tested five times with a different train-validation split each time.
2. CM: The confusion matrix is a well-known way to visually illustrate the performance of supervised learning algorithms, especially for classification tasks. For a data set with N classes, its N × N CM can be computed from the ground-truth labels and the corresponding predicted labels and then normalized for clarity. In a CM, every row corresponds to a ground-truth category, and every column corresponds to a predicted category; the matrix can therefore be viewed as a category-level classification result.
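The two metrics can be sketched in a few lines of NumPy. The function names are illustrative; the row-normalization assumes every class appears at least once in the ground truth.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """OA: correctly classified images / all images, ignoring class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def confusion_matrix(y_true, y_pred, n_classes, normalize=True):
    """N x N CM: rows = ground-truth categories, columns = predictions.

    With normalize=True each row is divided by its count, so entry
    (i, j) becomes the fraction of class-i samples predicted as j
    (this assumes every class occurs at least once in y_true).
    """
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    if normalize:
        cm = cm / cm.sum(axis=1, keepdims=True)
    return cm
```

The diagonal of the normalized CM then gives the per-class recall, which is how the per-category results in Figures 3-5 are read.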

Training details
In this section, the detailed training parameter settings are exhibited, and the experiments on the focusing parameter γ of the focal loss are explained.

1. Training Parameters: In the experiments, all input images are first resized to 256 × 256 and then randomly cropped to 224 × 224 for training and centrally cropped to 224 × 224 for testing. On the three data sets, the batch size is set to 64, with a learning rate of 0.001. Parameters are updated with the Adam optimizer, with a weight decay of 0.0001, a momentum of 0.9, and a patience of 5. Training covers a total of 120 epochs. The proposed method is implemented in PyTorch, and training is conducted on one NVIDIA Titan X.
2. Focal Loss Parameter: The parameter γ in the exponent of the focal loss controls how much the model focuses on hard samples that are easily misclassified. In experiments on the AID data set, a range of γ values is tested to explore the most suitable intensity of focus: γ is set to 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, and 1.0 in each experiment, as shown in Table 4. In this series of experiments on AID, 20% of the images are used for training and 80% for testing. γ is finally set to 0.3.
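The training setup above might be configured in PyTorch as sketched below. Two interpretations are assumed here and should be treated as such: the reported "momentum of 0.9" is read as Adam's first-moment coefficient beta1 (Adam has no separate momentum argument), and "patience of 5" is read as a plateau-based learning-rate scheduler. The `model` is a stand-in placeholder, not the actual ERANet.

```python
import torch
import torchvision.transforms as T

# Data pipeline as described: resize to 256, random 224-crop for
# training, centre 224-crop for testing.
train_tf = T.Compose([T.Resize((256, 256)), T.RandomCrop(224), T.ToTensor()])
test_tf = T.Compose([T.Resize((256, 256)), T.CenterCrop(224), T.ToTensor()])

# Placeholder network; in the paper this is the EfficientNet-B0
# backbone plus the recurrent attention head.
model = torch.nn.Linear(10, 2)

# Adam with lr 0.001 and weight decay 0.0001; beta1 = 0.9 is assumed
# to correspond to the reported "momentum of 0.9".
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-4)

# "Patience of 5" assumed to refer to a plateau LR scheduler; batch
# size 64 and 120 epochs would be set in the DataLoader / train loop.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
```

This is a configuration fragment only; the training loop itself (forward pass, focal loss, backpropagation, and `scheduler.step(val_loss)` per epoch) is omitted.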

Experimental results
In this part, the comparison experiments between the proposed method and some other state-of-the-art methods are explained. In all these experiments, the evaluation metrics are OA and CM. All comparison results are covered and analysed in accordance with different data sets.
1. UCM land-use data set: Comparative experiments with other state-of-the-art methods in remote sensing scene classification are conducted on the UCM data set. Two experiment settings are used, as shown in Table 5. In the first setting, 80% of all data are used as the training set and 20% as the testing set; in the second setting, the split is 50%/50%. The proposed method is compared with ARCNet [5], GoogleNet [32], CaffeNet [32], SalM3LBPCLM [33], MS-CLBP+FV [34], and MSDFF [35]. Among all SOTAs, the proposed ERANet achieves the best scores with both the 80% and 50% training sets; ERANet even reaches 100% OA when trained with 80% of the images. The CM for 50% training images is shown in Figure 3 for detailed analysis. Most categories are predicted entirely correctly; only the classes of 'aeroplane', 'building', 'harbour', and 'sparseresidential' contain some difficult samples that are misclassified. This result confirms the strong performance of the proposed ERANet.
2. AID: Comparative experiments with other state-of-the-art methods in remote sensing scene classification are conducted on the AID data set. Two experiment settings are used, as shown in Table 6. In the first setting, 50% of the data is used as the training set and the other 50% as the testing set; in the second setting, the split is 20%/80%. The proposed method is compared with ARCNet [5], MG-CAP [36], DCNN [37], MSCP [38], Combined CNN and GCN [39], MSITL [40], ADFF [41], and MSDFF [35]. Among all SOTAs, the proposed ERANet achieves the best scores with both the 50% and 20% training sets, boosting the OA to 98% and 95%, respectively. The CM for 20% training images is shown in Figure 4 for detailed analysis. Most categories are predicted with over 90% precision; only the classes of 'Resort', 'School', and 'Square' fall below 90%. These categories have confusing features and are not easy to classify even for humans. To sum up, the proposed ERANet outperforms the others on the AID data set.
3. NWPU-RESISC45: Comparative experiments with other state-of-the-art methods in remote sensing scene classification are conducted on the NWPU data set. Two experiment settings are used, as shown in Table 7. In the first setting, 20% of all data is used as the training set and 80% as the testing set; in the second setting, the split is 10%/90%. The proposed method is compared with MG-CAP [36], RTN [42], DCNN [37], MSCP [38], Combined CNN and GCN [39], MSITL [40], ADFF [41], and MSDFF [35]. Among all SOTAs, the proposed ERANet achieves the best scores with both the 20% and 10% training sets: its OA is 1.57% higher than the others with 20% training images, and it still surpasses MSDFF when the smaller training set is used. The CM for 10% training images is shown in Figure 5 for detailed analysis. In this hardest setting, the model is trained with only 10% of the images, and the data set has up to 45 confusable classes. Although the proposed method still achieves the best scores, it performs relatively poorly on the classes of 'church', 'palace', and 'railway_station'. Even on this difficult NWPU data set, the superiority of the proposed ERANet over the other SOTAs remains evident.

Ablation study
In this part, an ablation study on the AID data set with 20% training data is addressed and discussed to elaborate the influence of each component, and the effect of every component is analysed based on the experimental results. In Table 8, each line represents a combination of several components, and ✓ means the corresponding part is applied in that model. Model 1 achieves the poorest performance in the ablation study. By adding the recurrent attention module, the accuracy increases from 94.86% to 95.78%, which proves the necessity of the recurrent attention module. The difference between Models 2 and 3 is the choice of loss function: Model 3 predicts better than Model 2, with an increase of 0.15%. Although traditional cross-entropy loss performs well on benchmarks, imbalanced data and classes with high similarity are difficult to deal with in actual scene classification problems, and focal loss is a good choice for this kind of problem.

CONCLUSIONS
In this study, a novel ERANet is proposed for remote sensing scene classification. Different from traditional deep learning methods, Efficientnet-B0 is introduced as a lightweight backbone for the ARCNet framework, replacing the original one. By applying the modified efficient backbone, the low FLOPs and parameter numbers of ERANet are maintained, and the overall classification performance is improved. Moreover, the significance of the focal loss is demonstrated, and it is applied to address the sample imbalance problem and yield a desirable performance. Finally, extensive experiments are conducted on several challenging remote sensing scene classification data sets, and the results show that the proposed ERANet achieves a relatively good SOTA performance.