Compressing deep quaternion neural networks with targeted regularization

In recent years, hyper-complex deep networks (e.g., quaternion-based) have received increasing interest, with applications ranging from image reconstruction to 3D audio processing. Similarly to their real-valued counterparts, quaternion neural networks might require custom regularization strategies to avoid overfitting. In addition, for many real-world applications and embedded implementations there is a need to design sufficiently compact networks, with as few weights and units as possible. However, the problem of how to regularize and/or sparsify quaternion-valued networks has not yet been properly addressed in the literature. In this paper we show how to address both problems by designing targeted regularization strategies that minimize the number of connections and neurons of the network during training. To this end, we investigate two extensions of $\ell_1$ and structured regularization to the quaternion domain. In our experimental evaluation, we show that these tailored strategies significantly outperform classical (real-valued) regularization strategies, resulting in small networks especially suitable for low-power and real-time applications.


I. INTRODUCTION
Deep neural networks have achieved remarkable results in a variety of tasks and application scenarios over the last few years. The field of quaternion deep learning aims at extending these results to problems for which a hyper-complex (quaternion) representation, as opposed to a real-valued one, is more adequate [1], [2] (see also [3] for earlier works in the field). The resulting quaternion-valued neural networks (QVNNs) have been successfully applied to, among others, image classification [1], [4], image coloring and forensics [5], natural language processing [6], graph embeddings [7], and 3D audio processing [8]. By exploiting the properties of the quaternion algebra, QVNNs can achieve similar or higher accuracy than their real-valued counterparts, while requiring fewer parameters and computations.
The majority of the literature on QVNNs up to this point has focused on extending standard deep learning operations, such as convolution [1], batch normalization [5], or weight initialization [2], to the quaternion domain. Less attention, however, has been devoted to properly extending other aspects of the training process, including accelerated optimization algorithms [9] and regularization strategies. In particular, in many real-world scenarios (e.g., embedded devices) users need to take complexity and computational costs into careful consideration, making the networks as small as possible while maintaining a good degree of accuracy [10], [11].
In the real-valued case, compression of neural networks can be achieved during training by applying several regularization strategies, such as $\ell_2$, $\ell_1$, or group sparse norms [12]-[14], which can target either single weights or entire neurons. A direct extension of these strategies to the case of QVNNs, as done in the current literature, applies them independently on the four components of each quaternion weight. However, in this paper we argue, and we show experimentally later on, that this trivial extension results in highly sub-optimal regularization procedures, which do not sufficiently exploit the properties of the quaternion algebra.
The problem of sparsifying a quaternion is not restricted to neural networks, and it has received attention from other disciplines, most notably in quaternion extensions of matching pursuit [15] and compressive sensing [16]. In this paper, we build on these and other works to propose two targeted regularization strategies for QVNNs, filling an important gap in the literature. The first one (Section III-A) extends $\ell_1$ regularization to consider a single quaternion weight as a unitary component, and is akin to a structured form of regularization in the real-valued case. The second one is instead defined at the level of a single quaternion neuron, extending the ideas from [17] to the quaternion domain, thus allowing entire units to be removed from the network.
In our experimental evaluation, we show that the two proposed strategies significantly outperform naive application of classical regularization strategies on two standard image recognition benchmarks. The resulting QVNNs are thus smaller (both in terms of neurons and weights) and require a markedly smaller computational footprint when run in inference mode, with up to 5x reductions in the number of connections and 3x speedups in inference time.
Organization of the paper: Section II recalls quaternion algebra and how to design and train QVNNs. Section III describes our two proposed targeted regularization strategies. We provide an experimental evaluation in Section IV, before concluding in Section V.

II. PRELIMINARIES

A. Quaternion algebra
A quaternion-valued number $x \in \mathbb{H}$ can be represented by a tuple of four real-valued numbers $(x_r, x_i, x_j, x_k) \in \mathbb{R}^4$ as [18]:

$$x = x_r + x_i i + x_j j + x_k k \quad (1)$$

where the three imaginary units $i$, $j$, $k$ satisfy the fundamental axiom of quaternion algebra $i^2 = j^2 = k^2 = ijk = -1$. Given two quaternions $x$ and $y$, we can define their sum as:

$$x + y = (x_r + y_r) + (x_i + y_i) i + (x_j + y_j) j + (x_k + y_k) k \quad (2)$$

and similarly for multiplication by a real number. More importantly, the (Hamilton) product between the two quaternions is given by:

$$x \otimes y = (x_r y_r - x_i y_i - x_j y_j - x_k y_k) + (x_r y_i + x_i y_r + x_j y_k - x_k y_j) i + (x_r y_j - x_i y_k + x_j y_r + x_k y_i) j + (x_r y_k + x_i y_j - x_j y_i + x_k y_r) k \quad (3)$$

Note that the product is not commutative, setting apart quaternion algebra from its complex- and real-valued restrictions.
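To make the algebra concrete, the following minimal sketch implements the Hamilton product of (3) in PyTorch, assuming each quaternion is stored as a length-4 tensor of its real components; the function name and storage layout are our own illustration rather than part of any existing library.

```python
import torch

def hamilton_product(x, y):
    # x, y: length-4 tensors holding the (r, i, j, k) components of a quaternion
    xr, xi, xj, xk = x
    yr, yi, yj, yk = y
    return torch.stack([
        xr * yr - xi * yi - xj * yj - xk * yk,  # real part
        xr * yi + xi * yr + xj * yk - xk * yj,  # i component
        xr * yj - xi * yk + xj * yr + xk * yi,  # j component
        xr * yk + xi * yj - xj * yi + xk * yr,  # k component
    ])

# Non-commutativity: for x = i and y = j, x ⊗ y gives k while y ⊗ x gives -k.
x = torch.tensor([0., 1., 0., 0.])  # i
y = torch.tensor([0., 0., 1., 0.])  # j
print(hamilton_product(x, y), hamilton_product(y, x))
```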

B. Quaternion-valued neural networks
QVNNs are flexible models for transforming quaternion-valued inputs $x \in \mathbb{H}^d$ to a desired target value $y$, which in the majority of cases is real-valued (e.g., a probability distribution over a certain number of classes). A standard, fully-connected layer of a QVNN is given by:

$$h_{l+1} = \sigma\left( W \otimes h_l + b \right) \quad (4)$$

where $h_l$ is the input to the layer, $W$ is a quaternion-valued matrix of adaptable coefficients with components $(W_r, W_i, W_j, W_k)$ (and similarly for $b$), $\otimes$ performs matrix-vector multiplication according to the Hamilton product in (3), and $\sigma(\cdot)$ is a proper element-wise non-linearity. Similarly to the complex-valued case [19], choosing an activation function is more challenging than for real-valued NNs, and most works adopt a split-wise approach where a real-valued function $\sigma_r$ is applied component-wise:

$$\sigma(s) = \sigma_r(s_r) + \sigma_r(s_i) i + \sigma_r(s_j) j + \sigma_r(s_k) k \quad (5)$$

where $s$ is a generic activation value. Customarily, the input to the first layer is set to $h_1 = x$, while the output of the last layer is the desired target $h_L = y$. If the target is real-valued, one can transform $h_L$ to a real-valued vector by taking the absolute value element-wise, possibly applying one or more real-valued layers afterwards. In addition, (4) can be easily extended to consider convolutional layers [1] and recurrent formulations [2].
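As an illustration of (4)-(5), the sketch below implements a quaternion fully-connected layer in PyTorch by expanding the Hamilton product into four real-valued matrix products, followed by a split ReLU activation. The class name, the (batch, 4, features) layout, and the naive initialization are our own assumptions for illustration; in particular, they do not reproduce the quaternion-aware initialization of [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuaternionLinear(nn.Module):
    """h_out = sigma(W ⊗ h_in + b), with the four real components of W stored
    as separate matrices and the Hamilton product of (3) expanded into
    real-valued matrix products."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # one real-valued matrix per quaternion component (naive init for brevity)
        self.W_r = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.W_i = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.W_j = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.W_k = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(4, out_features))

    def forward(self, h):
        # h: (batch, 4, in_features), components ordered as (r, i, j, k)
        hr, hi, hj, hk = h[:, 0], h[:, 1], h[:, 2], h[:, 3]
        r = F.linear(hr, self.W_r) - F.linear(hi, self.W_i) - F.linear(hj, self.W_j) - F.linear(hk, self.W_k)
        i = F.linear(hi, self.W_r) + F.linear(hr, self.W_i) + F.linear(hk, self.W_j) - F.linear(hj, self.W_k)
        j = F.linear(hj, self.W_r) - F.linear(hk, self.W_i) + F.linear(hr, self.W_j) + F.linear(hi, self.W_k)
        k = F.linear(hk, self.W_r) + F.linear(hj, self.W_i) - F.linear(hi, self.W_j) + F.linear(hr, self.W_k)
        out = torch.stack([r, i, j, k], dim=1) + self.bias
        return torch.relu(out)  # split activation as in (5)
```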

C. Optimization of QVNNs
Now consider a generic QVNN $f(x)$ obtained by composing an arbitrary number of layers in the form of Eq. (4) or its extensions. We receive a dataset of $N$ examples $\{x(n), y(n)\}_{n=1}^{N}$, and we train the network by optimizing:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} l\big(y(n), f(x(n))\big) + \lambda \cdot r(\theta) \quad (6)$$

where $\theta$ is the set of all (quaternion-valued) parameters of the network, $l$ is a loss function (e.g., mean-squared error, cross-entropy loss), and $r$ is a regularization function weighted by a scalar $\lambda \ge 0$. Because the loss function in (6) is non-analytic, one has to resort to the generalized QR-calculus to define proper gradients for optimization [9]. Luckily, these gradients coincide with the partial derivatives of (6) with respect to all the real-valued components of the quaternions, apart from a scale factor. For this reason, it is possible to optimize (6) using standard tools from stochastic optimization popular in the deep learning literature, such as Adam or momentum-based optimizers.
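A single optimization step for (6) can then be written as in the sketch below, where `model`, `regularizer`, and `lam` stand for a QVNN, one of the penalties of Section III, and the weight $\lambda$; this is a schematic illustration rather than the exact training code used in the experiments.

```python
import torch.nn.functional as F

def train_step(model, regularizer, optimizer, x, y, lam):
    optimizer.zero_grad()
    # data loss plus weighted regularization term, as in (6)
    loss = F.cross_entropy(model(x), y) + lam * regularizer(model)
    # autograd on the real components matches the QR-calculus gradients
    # up to a scale factor, so standard optimizers (e.g., Adam) apply
    loss.backward()
    optimizer.step()
    return loss.item()
```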
While most components described up to now have received considerable attention in the literature, the design of a correct regularization term r(·) in (6) has been mostly ignored, and it is the focus of the next section.

III. TARGETED REGULARIZATION FOR QVNNS
In the real-valued case, a classical choice for the regularizer r(·) in (6) is the $\ell_2$ norm. Whenever sparsity is desired, it can be replaced with the $\ell_1$ norm, or a proper group version acting at the neuron level [12], [13]. In most implementations of QVNNs, these regularizers are applied element-wise on the four components of each quaternion weight. We argue that this results in far less regularization and sparsity than one would expect, which is inconvenient both from the generalization point of view and from an implementation perspective, where smaller, more compact networks are clearly desired. In this section we present two targeted regularization strategies that act on each quaternion as a unitary component, resulting in a more principled form of regularization for QVNNs.

A. $\ell_1$ regularization for quaternion weights
Given any weight $w \in \mathbb{H}$ of the QVNN (i.e., a single element of $\theta$ from Eq. (6)), the first method we explore is to regularize its norm as:

$$r(\theta) = \frac{1}{Q} \sum_{w \in \theta} \sqrt{w w^*} = \frac{1}{Q} \sum_{w \in \theta} \sqrt{w_r^2 + w_i^2 + w_j^2 + w_k^2} \quad (7)$$

where $Q$ is the number of weights in the network (allowing the influence of (7) on each weight to be balanced with respect to the loss function in (6)), and $w^*$ is the conjugate of $w$. This method can be seen as the natural extension of $\ell_1$ norm minimization to a quaternionic signal [16]. It is also equivalent to a structured form of sparsity [13] where we group all the components of the quaternion $w$ together. As a result, minimizing (7) tends to bring the entire quaternion weight to 0, instead of each component independently.
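A possible PyTorch sketch of the penalty in (7) is given below; it assumes the layers store their quaternion components as W_r, W_i, W_j, W_k attributes, as in the QuaternionLinear sketch above, and adds a small constant inside the square root for numerical stability of the gradient at zero.

```python
import torch

def quaternion_l1(model, eps=1e-12):
    """R_Q penalty of (7): average norm sqrt(w w*) over all quaternion weights."""
    total, count = 0.0, 0
    for m in model.modules():
        if hasattr(m, 'W_r'):
            # group the four components of each quaternion weight together
            norm = torch.sqrt(m.W_r ** 2 + m.W_i ** 2 + m.W_j ** 2 + m.W_k ** 2 + eps)
            total = total + norm.sum()
            count += norm.numel()
    return total / max(count, 1)
```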

B. Sparse regularization with quaternion batch normalization
The method described in Section III-A is effective for removing single quaternion weights, but in many real-world scenarios we also require a principled way to remove entire neurons during the training process [13]. To this end, we investigate a hyper-complex extension of the technique originally proposed in [17]. The basic idea is to compose each layer in (4) with a batch normalization (BN) layer [20], and then perform sparse regularization on the parameters of the BN layer, indirectly removing the original neurons in the network.
For implementing the BN model in the quaternion domain we build on [5]. Consider a single output of (4), i.e., the quaternion-valued output of a single neuron in the network. During training, we observe a mini-batch of $B$ inputs $x(1), \ldots, x(B)$ (a subset of the full dataset) and the corresponding outputs of the neuron $h(1), \ldots, h(B)$ (we omit the neuron index for notational simplicity). We can compute the mean and variance of the mini-batch as:

$$\mu = \frac{1}{B} \sum_{b=1}^{B} h(b) \quad (8)$$

$$\sigma^2 = \frac{1}{B} \sum_{b=1}^{B} \left| h(b) - \mu \right|^2 \quad (9)$$

These values are computed dynamically during training, while they are set to a fixed (pre-computed) value during inference. The output of the BN layer is defined as [5]:

$$\text{BN}(h) = \gamma \cdot \frac{h - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta \quad (10)$$

where $\varepsilon$ is a small value added to ensure stability, while $\gamma \in \mathbb{R}$ and $\beta \in \mathbb{H}$ are trainable parameters initialized at 1 and 0, respectively. Key to our proposal, the $\gamma$ parameter in (10) is real-valued, allowing us to apply standard real-valued regularization. In particular, similarly to [17], we apply (real-valued) $\ell_1$ regularization on the $\gamma$s, since pushing a single $\gamma$ to zero effectively allows us to remove the entire neuron from the QVNN.
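The sketch below illustrates a per-neuron quaternion BN of the form (8)-(10) together with the $\ell_1$ penalty on the $\gamma$s; running statistics for inference, and any coupling with the library of [2], are omitted, and the class and function names are our own assumptions.

```python
import torch
import torch.nn as nn

class QuaternionBatchNorm(nn.Module):
    """Whiten each quaternion-valued neuron output with its mini-batch mean and
    (real-valued) variance, then rescale with a real gamma and a quaternion beta,
    as in (10)."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))     # real-valued, init at 1
        self.beta = nn.Parameter(torch.zeros(4, num_features))  # quaternion-valued, init at 0

    def forward(self, h):
        # h: (batch, 4, num_features)
        mu = h.mean(dim=0, keepdim=True)                                           # eq. (8)
        var = ((h - mu) ** 2).sum(dim=1, keepdim=True).mean(dim=0, keepdim=True)   # eq. (9)
        return self.gamma * (h - mu) / torch.sqrt(var + self.eps) + self.beta      # eq. (10)

def gamma_l1(model):
    # l1 penalty on the BN scaling factors (Section III-B)
    return sum(m.gamma.abs().sum() for m in model.modules()
               if isinstance(m, QuaternionBatchNorm))
```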

C. Mixed regularization strategies
The strategies described in the previous sections are not mutually exclusive, and we can explore several mixed strategies with different regularization weights. In our experimental section we consider combining the two strategies, as well as combining one of the two with a classical $\ell_1$ regularization applied independently on each component.
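For completeness, one such combined penalty (the $R_{QL}$ variant used later in the experiments) could be sketched as below, reusing the quaternion_l1 helper defined above; the use of a single shared factor applied outside this function, and the inclusion of all parameters in the component-wise term, are assumptions on our part.

```python
def mixed_regularizer(model):
    # R_QL: quaternion-level penalty of (7) plus a component-wise l1 term,
    # with one shared regularization factor applied by the caller
    component_l1 = sum(p.abs().sum() for p in model.parameters())
    return quaternion_l1(model) + component_l1
```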

IV. EXPERIMENTAL EVALUATION

A. Experimental setup
We evaluate our proposal on two quaternion-valued image recognition benchmarks taken from [1]. We use the standard MNIST dataset by converting every image pixel to a quaternion with 0 imaginary components, and the more challenging CIFAR-10 dataset by converting its RGB representation to the three imaginary components of a pure quaternion with 0 real part.
Similarly to previous literature, for MNIST we use a quaternion convolutional network with two convolutional layers having 16 and 32 filters, interleaved by (quaternion-valued) max-pooling operations. After the second convolutional layer we apply a dropout operation for regularization, followed by a final quaternion fully connected layer to obtain the class probabilities. For CIFAR-10, we increase this to five convolutional layers having respectively 32, 64, 128, 256, and 512 filters. In this case we also apply dropout every two convolutional layers. Overall, the MNIST network has ≈ 10k parameters, while the CIFAR-10 network has ≈ 500k parameters.
All networks use a ReLU activation applied component-wise as in (5). After the last layer we take the absolute value of each output to obtain a real-valued score, and apply a softmax activation function to convert these scores to probabilities. The networks are trained to minimize the cross-entropy loss using the Adam optimization algorithm.
All experiments are implemented in the PyTorch framework, extending the QVNN library from [2]. For replicability, we release our demo files in a separate online repository. All hyper-parameters are fine-tuned independently for each network and dataset using the corresponding validation data. Importantly, this means that all regularization coefficients are optimized separately for every method.

B. Results for the quaternion-level sparsity
We start by evaluating the quaternion-level regularization strategy described in Section III-A, denoted as $R_Q$ in the experiments. We compare with classical $\ell_2$ and $\ell_1$ regularization applied independently on every component. In addition, we evaluate a mixed regularization strategy combining our proposed $R_Q$ method with an additional $\ell_1$ regularization on the components, denoted as $R_{QL}$, which is similar to the sparse group lasso technique in [13]. In this case we consider a single, shared regularization factor to be optimized, to make comparisons fair.
Results, averaged over 5 different repetitions of the experiments, are presented in Tab. I. We see that applying a regularization has only a marginal effect on the MNIST test accuracy, while it improves the accuracy in the more challenging CIFAR-10 case, possibly by counter-acting overfitting. In terms of sparsification, we report both the component sparsity (i.e., the ratio of zero-valued quaternion components) and the quaternion sparsity (i.e., the ratio of quaternions where all four components have been set to 0). We can see that the proposed $R_Q$ strategy results in significantly sparser architectures in both cases, with corresponding gains in computational power and inference speed. The mixed strategy $R_{QL}$ performs very well on MNIST and more poorly on CIFAR-10, possibly because we are using only a single shared regularization factor. For a clearer visualization, in Fig. 1 we show the corresponding sparsity levels during the first 20 epochs of training.

C. Results for the neuron-level sparsity
Next, we evaluate the inclusion of the neuron-level sparsity strategy described in Section III-B. We consider the $R_Q$ strategy from the previous section, and compare to a network where we add BN layers after every convolutional layer, penalizing the $\gamma$ coefficients with an $\ell_1$ strategy. For fairness, we also compare with two additional baselines where we add the BN layers, but regularize only with the $R_Q$ or $R_{QL}$ strategies. Due to space constraints, we only consider the CIFAR-10 dataset, which was the most challenging in the previous section.
The averaged results are presented in Fig. 2. We see that, when considering only quaternion sparsity, the proposed neuron-level strategy (denoted as $L_1(BN)$) is marginally superior to the proposed $R_{QL}$ applied to the network with BN layers. However, when evaluating the level of structure in this sparsity, we see that the proposed neuron-level technique in Fig. 2c vastly outperforms all other strategies, leading to a network with 83% fewer neurons than the original one and fewer than 15k remaining parameters. As a result, the final network has a memory footprint of only 1/5 of the original one, with an inference-time speedup of approximately 3x. From Fig. 2d, we also see that this is achieved with no loss in the test accuracy of the final networks.

V. CONCLUSIONS
The field of quaternion neural networks explores the extension of deep learning to quaternion-valued data processing. While several models and training algorithms have been previously extended to this domain, less attention has been given to the problems of regularizing and compressing the networks, which is essential in time-critical and embedded applications. In this paper, we proposed two regularization techniques that are specific to quaternion-valued networks. In the first case, we apply results from quaternion compressive sensing, regularizing each quaternion weight with an $\ell_1$-level norm. In the second case, we consider the problem of removing entire neurons from the network by appropriately regularizing batch normalization layers. Our experimental results on image classification benchmarks show that the two techniques vastly outperform standard regularization methods when applied to quaternion networks, yielding networks that are significantly smaller (and cheaper to implement) with no loss in test accuracy at inference time.