Maximising robustness and diversity for improving the deep neural network safety

This article proposes a novel yet efficient defence method against adversarial attacks, aimed at improving the safety of deep neural networks. Removing adversarial noise by refining adversarial samples as a defence strategy has been widely investigated in previous works. Such methods are easily broken if an attacker has access to both the main and the refiner networks. To cope with this weakness, the authors propose to refine the input samples with a set of encoder–decoders that are trained to reconstruct the samples on completely different feature spaces. To this end, the authors learn several encoder–decoder networks and force their latent spaces to diverge as much as possible. In this way, if an attacker gains access to one of the refiner networks, the others can still act as a defence. The evaluation of the proposed method confirms its performance against adversarial samples.

Fig. 1: Vulnerability of encoder-decoder networks as a defence method. M: main convolutional neural network; R: refiner network. The targeted network for the attack is bordered by a green box. As can be seen, R works efficiently when the attacker has access to the M network only, not to both of them.

Fig. 2: An overview of our approach including a set of refiners, three in this case, with R_1 and R_2 trained with L_1 and R_3 trained with L_3. A schematic representation of the latent spaces of the refiners and their accuracy after an adversarial attack on S_1 is shown.

To cope with this weakness, we propose to represent and reconstruct the samples relying on diverse (independent) feature-space representations (see Figure 2). In this way, by analysing one of the networks, the behaviour of the other ones is not tractable, and the attacker cannot successfully attack the refiner and the targeted network.
Problem definition: It has been shown that a very small but targeted noise ε can compromise the safety of CNNs. Contaminating images X with such a noise confuses the CNN M and makes it misclassify its input (i.e. M(X) ≠ M(X + ε)), where ε ≈ 0.
Proposing a defence approach against such targeted noise would greatly improve the safety of CNNs. The goal of this article is to find a refiner network (or a set of them) that removes such targeted noise from the samples before feeding them to the main network. Because the attacker A generates the targeted noise by having access to only one of the refiner networks, that is, A(M, R, X) ⟹ ε, and the other refiners are completely different from that one, the attacker is not able to compromise the whole system, even when the attacker A has access to both M and one of the R networks; that is, M(R_?(X)) = M(R_?(X + ε)), where R_? is a randomly selected refiner network.
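For illustration, a minimal sketch of how such a targeted noise ε can be generated with the fast gradient sign method (FGSM), the attack used later in the experiments. This is not the authors' code; the model, labels and ε value are placeholders, and the loss is assumed to be a sparse cross-entropy over logits.

```python
import tensorflow as tf

def fgsm_noise(model, x, y_true, eps):
    """Minimal FGSM-style sketch of the targeted noise epsilon: a small
    perturbation in the direction of the loss gradient w.r.t. the input."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        logits = model(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, logits, from_logits=True)
    grad = tape.gradient(loss, x)
    return eps * tf.sign(grad)  # epsilon-scaled sign of the gradient

# x_adv = x + fgsm_noise(M, x, y, eps=0.3)   # typically M(x_adv) != M(x)
```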
Proposed method: As mentioned, previously proposed techniques for refining adversarial samples are not robust against attackers who know the refiner network parameters. Inspired by cryptographic methods, we intend to add diversity and randomness to the defence procedure in order to make it more robust. Randomness is added by considering several refiner networks R. During training, these networks are impelled to be completely independent from each other. In this way, the attacker A cannot compromise the targeted network M by accessing only one of the refiners. The proposed approach consists of a classifier M(x; θ_i) as the main network that should be defended against attacks and K refiner (i.e. encoder-decoder) networks R_j(x; θ_j). The parameters θ_j of network R_j are learned so as to map the clean input sample X to a latent space l_j and recover the sample by decoding this latent space. Independence among refiner networks is achieved by distancing each l_j from the other latent spaces. Each sample, before being fed to the M network, is processed by R_h, where h ∈ {1, ..., K} is randomly selected. As R_h is trained to map input samples to a clean version of them, it can act as a de-noising network that refines samples contaminated with adversarial noise such as ε.
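A minimal sketch of this inference-time procedure, assuming each refiner is a callable model mapping inputs to reconstructions; the function and variable names are illustrative, not taken from the authors' implementation.

```python
import random
import tensorflow as tf

def defended_predict(M, refiners, x):
    """Inference-time defence sketch: draw one refiner R_h uniformly at random
    from the K trained encoder-decoders and de-noise x before it reaches M."""
    R_h = random.choice(refiners)            # h ∈ {1, ..., K}, chosen at random
    x_refined = R_h(x)                       # reconstruct (refine) the sample
    return tf.argmax(M(x_refined), axis=-1)  # class prediction of the main network
```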
As mentioned previously, having several refiner networks learned in the same way does not work efficiently. To take advantage of all refiner networks, we propose that these networks refine the input samples based on completely different feature spaces. The intuition behind this idea is that the adversarial noise is highly dependent on the features learned by the target neural network. To this end, all of the refiner networks are trained on the same training samples, but a new constraint is imposed on their latent spaces (i.e. the bottlenecks of the encoder-decoder networks) to maximise the distance between the features learned for the same inputs.
R: Diverse feature learning: In previous works, encoder-decoder networks have been widely used for removing adversarial noise. Similarly, we also learn an encoder-decoder network R_1 by minimising the reconstruction error on the available training samples. Such a network is trained by optimising L_1 (see Equation (1)):

L_1 = ||X − X̂||²    (1)

where X refers to the training samples and X̂ = R_1(X). Learning more encoder-decoder networks and selecting one of them (similar to [7]) by optimising the same loss function (i.e. L_1) results in a set of networks with the same vulnerability. As mentioned, to cope with this weakness, we force R_j to represent and reconstruct the samples using different features. To this end, we learn the encoder-decoder networks iteratively. First, R_1 is learned by optimising the L_1 loss function. Then, the next encoder-decoder, R_2, is learned. Not only does R_2 learn to map adversarial samples to clean ones by optimising L_1, but its latent space for the same samples is also impelled to be different from the latent space of R_1. To this end, during training, the cosine similarity between the latent spaces of R_1 and R_2 (i.e. l_1 and l_2) is minimised by optimising the L_2 loss function (see Equation (2)):

L_2 = L_1 + (l_1 · l_2) / (||l_1|| ||l_2||)    (2)

where (·) is the dot product operation and ||·|| is the L2 norm.
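A minimal sketch of how these loss terms could be implemented in TensorFlow, assuming the latent codes are flat vectors and an unweighted sum of the two terms (the letter does not state a weighting factor). With previous_latents = [l_1] this corresponds to Equation (2); passing all previously learned latent codes extends it to later refiners.

```python
import tensorflow as tf

def reconstruction_loss(x, x_hat):
    # Equation (1): squared reconstruction error between input and output
    return tf.reduce_mean(tf.square(x - x_hat))

def cosine_penalty(latent_k, previous_latents):
    # Cosine similarity between the current latent code and each previously
    # trained refiner's latent code (dot product over the product of L2 norms)
    penalty = 0.0
    for l_j in previous_latents:
        dot = tf.reduce_sum(l_j * latent_k, axis=-1)
        norms = tf.norm(l_j, axis=-1) * tf.norm(latent_k, axis=-1) + 1e-8
        penalty += tf.reduce_mean(dot / norms)
    return penalty

def refiner_loss(x, x_hat_k, latent_k, previous_latents):
    # Unweighted sum of reconstruction error and latent-space similarity
    return reconstruction_loss(x, x_hat_k) + cosine_penalty(latent_k, previous_latents)
```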
In order to train the third encoder-decoder network, its latent space must not be similar to the latent spaces of the previously learned encoder-decoder networks, that is, l_1 and l_2. Generally, for training the kth network, in addition to minimising L_1, its latent space for the same samples must be dissimilar to l_1, ..., l_{k−1}. In summary, the parameters θ_k of the kth encoder-decoder network are learned by optimising the L_3 loss function (see Equation (3)):

L_3 = L_1 + Σ_{j=1}^{k−1} (l_j · l_k) / (||l_j|| ||l_k||)    (3)

Experimental results: In this section, we evaluate the performance of the proposed method on a standard classification dataset. The experimental results confirm that the proposed method works better than the other considered refiner-based methods.

Dataset: The task we examined was digit classification on the MNIST dataset [9] for training, attacks and evaluation. There are 48,000, 12,000 and 10,000 images for training, validation and evaluation, respectively. All experiments were done with the Python programming language and the TensorFlow machine learning platform. For generating adversarial samples, we used a white-box attack known as FGSM from the CleverHans library, a well-known TensorFlow-compatible library for implementing adversarial attacks. White-box attacks assume that the attacker has access to the parameters of a network, while black-box attacks do not.
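For reference, a minimal sketch of the data preparation under the split stated above; the flattening and [0, 1] normalisation are assumptions, as the letter does not specify preprocessing.

```python
import tensorflow as tf

# MNIST split used in the experiments: 48,000 training, 12,000 validation and
# 10,000 evaluation images; images flattened for the fully connected refiners.
(x_train_full, y_train_full), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train_full = x_train_full.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

x_train, y_train = x_train_full[:48000], y_train_full[:48000]
x_val, y_val = x_train_full[48000:], y_train_full[48000:]

# Adversarial samples are then generated with FGSM (e.g. via the CleverHans
# library, as described above) against the attacked sequence of R_1 and M.
```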
Experimental setup: The main network M considered here for the classification task is a LeNet [10] trained on MNIST with the Stochastic Gradient Descent (SGD) optimiser, a learning rate of 1e-1 and a batch size of 128. The architecture (a fully connected encoder-decoder network with 400 neurons in its hidden layer) is the same for all of the refiner networks R_i, i = 1, ..., k. For simplification, k is set to 3. R_1, R_2 and R_3 are trained for 400 epochs with the Adam optimiser, a learning rate of 1e-4 and a batch size of 256 (a minimal sketch of this configuration is given at the end of this section). R_1 is trained by optimising L_1, while R_2 and R_3 are trained to optimise the L_3 loss function.

Table 3: Accuracy of S_1, S_2 and S_3 after an adversarial attack with ε = 0.3 on S_1, when R_1, R_2 and R_3 are trained with L_1 and when R_2 and R_3 are trained with L_3.

As mentioned, a very small and targeted noise ε can compromise the safety of CNNs. In our experiments, in order to generate adversarial samples, the sequence of R_1 and M is considered the compromised CNN, so that we could examine the effect of R_2 and R_3 when the sequence of R_1 and M is attacked. Moreover, different values of ε were examined. From now on, the sequence of M and R_1 is called S_1, while the sequence of M with R_2 and the sequence of M with R_3 are called S_2 and S_3, respectively.

The results of the attacks described in Figure 1 can be found in Table 1, which gives a better understanding of the vulnerability of having only one refiner R_1. The results of our approach, described in Figure 2, can be seen in Tables 2 and 3. The results in Table 1 are based on attacks on two targeted networks: the classifier M on its own and S_1. In Table 2, each row includes the average accuracy of the three models after adversarial attacks on S_1 for a range of ε values. The first row is the average when R_1, R_2 and R_3 are trained solely on the L_1 loss function, and the second row is the average when R_1 is trained the same way while R_2 and R_3 are trained with L_3. Table 3 reports the accuracy after an attack on S_1 with ε = 0.3 for R_1, R_2 and R_3 trained with L_1, and also for R_2 and R_3 trained with L_3.

In Table 1, the performance before and after attacks with various values of ε can be seen. Adding a refiner network (R_1) as a pre-processing step before an attacked main network (M) can partly retrieve the lost performance; however, when the sequence of the refiner and the main network (S_1) is attacked, the presence of R_1 does not improve the performance. Further results in Tables 2 and 3 support the validity of having multiple refiners and using L_3 as the loss function for R_2 and R_3. As Table 2 shows, when S_1 is under attack, S_2 and S_3, where R_2 and R_3 are trained with L_1, have higher accuracy than S_1, which supports adding multiple refiners. Additionally, S_2 and S_3, where R_2 and R_3 are trained with L_3, have higher accuracy than when R_2 and R_3 are trained with L_1, which supports using L_3. These results confirm that adding multiple refiners with our proposed loss function and applying one of them in a random manner can reduce the effect of adversarial attacks. Our results were reported for the FGSM attack; however, the approach could be generalised to other attacks as well.
The results for every value of ε show that adding a second refiner (R_2) with Equation (3) as the loss function improves the performance after the attack, and adding a third refiner (R_3) with Equation (3) as the loss function improves the performance even more than R_2.
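As referenced in the experimental setup, a minimal sketch of the refiner architecture and training configuration described above (fully connected encoder-decoder with a 400-neuron hidden layer, Adam with learning rate 1e-4, batch size 256). The activation functions and the single-hidden-layer encoder/decoder split are assumptions not stated in the letter.

```python
import tensorflow as tf

def build_refiner(input_dim=784, hidden_dim=400):
    """Fully connected encoder-decoder with a 400-neuron hidden (latent) layer,
    matching the setup described above; activations are assumptions."""
    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="relu",
                              input_shape=(input_dim,)),   # latent code l_j
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(input_dim, activation="sigmoid",
                              input_shape=(hidden_dim,)),  # reconstruction
    ])
    return encoder, decoder

# Training configuration from the setup: Adam optimiser with learning rate 1e-4,
# batch size 256, 400 epochs per refiner.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```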

Conclusion:
We examined the idea of adding randomness and diversity to refiner networks which defend a main network (i.e. a classifier) against a class of attacks called adversarial attacks. These attacks target the main network by adding noise to its input. Adding more refiners and randomly choosing one of them provides randomness, while the proposed loss function brings about diversity. Our results consistently support the proposed method by showing better accuracy for refiners trained with our proposed loss function.