Accelerating Deep Neural Networks implementation: A survey

Recently, Deep Learning (DL) applications have become increasingly widespread across different fields. Deploying such Deep Neural Networks (DNNs) on embedded devices remains a challenging task given their massive computation and storage requirements. Since the number of operations and parameters increases with the complexity of the model architecture, performance strongly depends on the target hardware resources and, in particular, on the memory footprint of the accelerator. Recent research studies have discussed the benefits of implementing complex DL applications based on different models and platforms. However, when designing hardware accelerators for DL applications, it is necessary to guarantee the best performance so that they run at full speed, despite the constraints of low power, high accuracy and throughput. Field Programmable Gate Arrays (FPGAs) are promising platforms for the deployment of large-scale DNNs that seek a balance between the above objectives. Besides, the growing complexity of DL models has led researchers to apply optimization techniques that make them more hardware-friendly. Herein, the DL concept is presented first. Then, a detailed description of the different optimization techniques used in recent research works is given. Finally, a survey of research works aiming to accelerate the implementation of DNN models on FPGAs is provided.


| INTRODUCTION
Recently, DL technology has been used successfully for a variety of tasks in several fields related to signal, information and image processing, such as computer vision [1], Natural Language Processing (NLP) [2], medicine [3], video games [4] and many other areas of science and human activity. DL models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) continue to make great progress in solving complex problems. However, the deployment of such models is a hard task considering their massive computation and storage requirements. Therefore, the performance of a model depends on the target hardware resources. The training and inference phases of DL models are executed on powerful computation machines using advanced technologies such as modern multicore Central Processing Units (CPUs), Graphics Processing Units (GPUs) or clusters of CPUs and GPUs. Usually, GPU platforms are better at supporting the training and inference of sophisticated models. GPU technology offers high computation capacity and ensures the interdependence of the data, but at a high cost in terms of power. Application Specific Integrated Circuits (ASICs) can achieve even higher performance and can improve energy efficiency, which is a key factor in embedded systems. However, the deployment of a DL model on a customised ASIC requires high investment due to a long and complex design cycle. Recently, FPGAs have become a promising solution to accelerate inference: they offer the performance advantages of reconfigurable logic with a high degree of flexibility. Specific hardware designs on such platforms can be more efficient in speed and energy than on other platforms. Nevertheless, the deployment of large-scale DNNs with large numbers of parameters is still a daunting task, because the large dimensionality of such models increases computation and data movement.
So, to deploy such sophisticated models on embedded platforms and to obtain a more robust model, the internal operations and the number of parameters can be reduced by optimising the network architecture. Several optimization techniques have been discussed in the literature. One of the most popular optimization approaches that makes models faster, more energy efficient and more hardware-friendly is model compression, which includes data-precision reduction, network pruning, low-rank approximation, etc. Furthermore, for the efficient implementation of an optimised DL model, further acceleration improvement is required. Indeed, it is necessary to exploit all the opportunities offered at the various levels of hardware/software codesign to achieve high performance in terms of precision, energy consumption and throughput. This survey takes a deep dive into DL implementation on advanced and dedicated computation platforms and reveals its bottlenecks. In addition, it focusses on hardware and software techniques to optimise the implementation of DNNs and also provides a summary of recent research work. Some surveys dealing with DL implementation have already been published. However, those papers have not discussed the state of the art across different hardware platforms. Most recent surveys have focussed on FPGA-based CNN acceleration without justifying the choice of FPGAs over other platforms. Another strong aspect of our work is that we discuss the optimization of DNNs at both the software and hardware levels. Furthermore, we present a classification of advanced hardware acceleration techniques based on throughput and energy optimizations. An investigation of the algorithmic side and its effect on designing accelerators is also included in this survey. Additionally, we present the tools that can automatically generate hardware designs from software and that are used for implementing and evaluating deep learning approaches.
Herein, Section 2 presents the basics of DL, its popular models and architectures currently in use, and highlights the complexity of these models. Section 3 describes the various hardware platforms used to implement DNNs. Section 4 presents the optimization techniques that can be applied to make the model more efficient in terms of speed and power.
Finally, a synthesis of the different acceleration techniques explored in recent research works is given and analysed.

| BACKGROUND AND MOTIVATIONS
Currently, DL represents the leading-edge solution in virtually all relevant machine learning tasks in a large variety of fields [5, 6]. DL algorithms show significant improvement over traditional machine learning algorithms based on the manual extraction of relevant (handcrafted) features [7]. DL models perform hierarchical feature extraction and also show better performance as the amount of data increases [8, 9]. These models have covered several fields with a variety of applications. In particular, CNN models have demonstrated impressive performance in computer vision applications such as autonomous car vision systems [10], drone navigation, robotics [11], etc. CNNs have also proved very effective in the medical field, especially in image recognition: they have proved better than the most experienced radiologists at detecting tumours and other types of lesions [12]. In Ref. [13], images extracted from Magnetic Resonance Imaging (MRI) of a human brain were processed to predict Alzheimer's disease using a CNN. DL models are also used in drug research, predicting molecular properties such as toxicity or binding capacity. In particular, DL can be used to simulate biological or chemical processes of different molecules without the need for expensive software simulators, while being 30,000 times faster [14]. Moreover, RNN models have excelled in natural language processing, including automatic speech recognition, recommendation systems, audio recognition, machine translation, social media filtering, etc. For example, various LSTM models have been proposed for sequence-to-sequence mapping suitable for machine translation [15]. Furthermore, CNNs and RNNs have been combined to add sounds to silent movies [16] and to generate captions that describe the contents of images [17]. Besides, it is important to note that the effective implementation of DL models on embedded platforms is behind the diffusion of such applications.
The performance of such AI algorithms using DL models lies in the capacity of processors to support a DNN with its varied number of layers, neurons per layer, multiple filters, filter sizes and channels while processing large datasets. Indeed, DL workloads are both computation and memory intensive. For example, the well-known CNN ResNet50 [18] requires up to 7.7 billion floating point operations (FLOPs) and 25.6 million model parameters to classify a 224 × 224 × 3 image. As shown in Figure 1, the larger and more complex model VGG16 [19], with a model size of 138.3 million parameters, requires up to 30.97 Giga FLOPs (GFLOPs). Thus, the number of operations and parameters increases with the complexity of the model architecture. Table 1 presents state-of-the-art models' sizes and complexities. VGG models were developed by the Visual Geometry Group at the University of Oxford and are among the most preferred choices in the community for extracting features from images. They are widely used in many applications despite their expensive architecture in terms of both parameter count and computational requirements (Figure 1). The large dimensionality of these models increases computation and data movement. More precisely, it increases the amount of generated data, whose movement is considered more expensive than computation in terms of power on hardware platforms [21].
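To make these operation counts concrete, the cost of a single convolutional layer can be estimated from its dimensions. The sketch below (illustrative helper functions, not taken from the surveyed works) counts two FLOPs per multiply-accumulate; applied to the first 3 × 3 convolution of VGG-16, it reproduces the order of magnitude discussed above:

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of one convolutional layer: every output element needs
    k*k*c_in multiplies and roughly as many additions."""
    macs = h_out * w_out * c_out * (k * k * c_in)
    return 2 * macs  # 1 multiply + 1 add per MAC

def conv2d_params(c_in, c_out, k):
    """Weight count of the layer, plus one bias per output channel."""
    return c_out * (k * k * c_in + 1)

# First 3x3 convolution of VGG-16 on a 224x224x3 input (same padding):
flops = conv2d_flops(224, 224, 3, 64, 3)   # ~173 MFLOPs
params = conv2d_params(3, 64, 3)           # 1792 parameters
```

Summing such per-layer counts over all convolutional and fully connected layers yields totals of the kind reported in Table 1.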
At this inflection point, it is therefore necessary to benefit from new design methodologies, to make good use of new design opportunities and to explore optimization techniques that reduce the network size and enhance the implementation performance in terms of throughput and energy consumption. Besides, the choice of a suitable hardware platform to implement a DL model is of paramount importance [24]. In the next section, we explore the different computation platforms for DL implementation.

| COMPUTATION PLATFORM OF DL IMPLEMENTATION
The employment of DL in daily applications across different fields will depend on the ease with which DL models can be deployed on small, low-power devices rather than large servers. In the majority of cases, the training phase is performed in the cloud. The inference phase, however, is less demanding and can happen locally or in the cloud depending on the application [24]. Research is underway on the implementation of both phases using parallel architectures on different hardware targets and computing devices. Four major types of technology are being used to accelerate DNNs: CPUs, GPUs, FPGAs and ASICs.

| Central processing units
Traditionally, DNNs were mainly tested on the CPU of a computer. The CPU works by sequentially performing the computations that are sent to it. Sometimes, a programme has different tasks that can be computed independently of each other. To optimise the time required to complete all tasks, many processors have multiple threads or cores that can perform calculations in parallel. Some manufacturers have sought to optimise the hardware architectures of their processors to meet the needs of DL: Intel has tweaked the CPUs of its servers to improve their performance on DL workloads [25], and Google has developed a chip to perform DL tasks more economically [26]. However, it is still very difficult for CPUs, even with multicore architectures, to support the high computation and storage complexity of large DNN models.

| Graphics processing units
A GPU excels at parallel computing. A CPU typically has between one and eight cores, whereas high-end GPUs have thousands of cores (e.g. the GeForce GTX TITAN Z includes 5760 cores, and more recent cards such as the GeForce RTX 2080 follow the same trend). GPUs are slow during sequential operations but shine when given tasks that can run in parallel. Since the operations required to run a DL algorithm can be done in parallel, GPUs have become extremely valuable tools. Furthermore, by using OpenCL [27], an open standard for portable parallelisation, compute kernels written using a limited subset of the C programming language can be launched on GPUs. In this perspective, NVIDIA has invested heavily in its CUDA (Compute Unified Device Architecture) language to make it support most DL development frameworks. Similar to OpenCL, CUDA affords a general-purpose programming environment and enables parallel processing over NVIDIA GPU cores. NVIDIA GPUs are currently the most used for implementing DL algorithms. More recently, NVIDIA introduced NVDLA [28], a scalable and highly configurable open-source accelerator for DL inference designed to simplify integration and portability. In late 2018, AMD announced the first 7 nm (nanometer) GPU specifically designed for DL. The company's new Radeon delivers up to 7.4 TFLOPS (trillion floating point operations per second). AMD also revealed software to improve performance [29]. This shows the interest of manufacturers in choosing the hardware that best suits the deployment of DNNs. The Nvidia Tesla V100, for example, embeds 640 'Tensor' cores. These units offer neural networks a high computing capacity of over 100 teraflops and are particularly suited to popular development frameworks.

Figure 1: Computational cost of the most popular models (inference on the ImageNet dataset) [20]

| FPGA
When evaluating hardware acceleration platforms, the trade-off between flexibility and performance must inevitably be taken into consideration. FPGAs serve as a good compromise between flexibility and performance. They are reconfigurable integrated circuits with programmable logic cells. They offer the performance advantages of integrated circuits with a high degree of flexibility. At a low level, FPGAs can implement sequential logic using Flip-Flops (FFs) and combinational logic using Look-Up Tables (LUTs). FPGAs also contain hardened components for commonly used functions, such as full processor cores, communication cores, arithmetic cores and RAM blocks. In addition, the adoption of the System-on-Chip (SoC) design approach, in which ARM coprocessors and FPGA logic cells are generally located on the same chip, has enhanced the flexibility of such devices. The current FPGA market is dominated by Intel (née Altera) and Xilinx, which represent a combined market share of 85% [30]. On FPGAs, programmable logic cells can be used to implement the data and control paths. They are also able to exploit the distributed on-chip memory and pipeline parallelism, which are naturally suited to deep feed-forward networks. FPGAs also support partial dynamic reconfiguration, which may have implications for large DL models, where individual layers could be reconfigured on the FPGA without disrupting the ongoing calculation in the other layers. For speeding up hardware designs, FPGA platforms can be a promising alternative to GPUs. With fixed architectures like GPUs, a software execution model is followed and structured around the execution of tasks in parallel on independent computing units: the goal of developing DL techniques for GPUs is to adapt the algorithms to follow this architecture, where computation is carried out in parallel and the interdependence of the data is ensured. However, when developing DL techniques for FPGAs, it is less important to adapt algorithms to a fixed computation structure, which allows more flexibility to explore algorithmic optimizations. Techniques that require many complex low-level hardware control operations, which are difficult to implement in high-level software languages, are of particular interest for FPGA implementations. Recently, software-level programming models for FPGAs have been adopted, including OpenCL, High-Level Synthesis (HLS), C and C++, making them a more attractive option [31]. In this perspective, Xilinx introduced PYNQ to make designing embedded systems with Zynq SoCs easier.
It uses the Python language and libraries, combining the benefits of the programmable logic and microprocessors in Zynq to help build high-performance embedded DL applications [32, 33]. Furthermore, to accelerate DL inference with optimised and tuned hardware and software, Xilinx unveiled an Adaptive Compute Acceleration Platform (ACAP), Versal [34], a new heterogeneous compute architecture. Versal delivered higher performance (8×) than high-end GPUs. More recently, Xilinx designed an integrated IP block for Zynq SoC and MPSoC devices, a programmable engine dedicated to CNNs called the DL Processor Unit (DPU) [35]. Lately, FPGA-based accelerators with new architectures, like the Xilinx Alveo cards [36], have appeared more often. They offer FPGAs ready to programme on accelerator cards which can be plugged directly into servers and allow reconfigurable acceleration that adapts to the continuous optimization of DL algorithms. For example, when executing inference, the Alveo U250 reduces latency by 3× over GPUs. Another FPGA-based multi-accelerator platform that supports reconfigurable designs, Maxeler's MPC-X 2000 [37], is also widely used. It comprises Data Flow Engines (DFEs), each using a Xilinx Virtex-6 FPGA. Currently, the cloud represents a simple and efficient solution for using FPGAs without investing in specific hardware. In major cloud platforms and modern data centres, FPGA-based accelerators have shown impressive performance in terms of parallel computing and power consumption.

| ASIC
ASICs are designed for a specific, fixed functionality or application. During its operating life, a customised ASIC has a fixed logic function because its digital circuitry is made up of gates and flip-flops permanently connected in silicon. Several research works have focussed on building customised ASICs to accelerate DL model training and inference [44]. Compared to FPGAs, ASIC platforms with a customised architecture are more efficient in terms of power and speed. An ASIC can perform fixed operations extremely fast since the entire chip's logic area can be devoted to a set of narrow functions. Despite its high performance, designing an ASIC can be highly expensive due to the complexity of its construction process. A year later, Google announced the TPU v3, improving peak performance to 420 TFLOPs. In February 2018, the cloud TPUs that power Google products like Translate, Search, Assistant and Gmail became available for use in the Google Cloud Platform (GCP) [45]. The TPU can handle both training and inference, and it has the highest training throughput. More recently, the startup Habana Labs developed the HL-1000, a 16 nm custom ASIC chip [46]. The designed architecture is very similar to that of Google's TPU, using a large on-chip Static Random Access Memory (SRAM) and a large matrix-multiply accelerator. The main difference is that Habana includes eight programmable CPU cores to handle non-convolutional layers, whereas Google implements these layers in fixed-function logic. The startup Gyrfalcon Technology Inc (GTI) [47] introduced the Lightspeeur 2801S, Lightspeeur 2802M and Lightspeeur 2803S edge-based ASICs for the deployment of AI applications. The Lightspeeur 2801S, a 28 nm neural accelerator with no external memory and 28,000 parallel computing cores, performs up to 2.8 TOPs at 9.3 TOPs/W. In a 16-chip server configuration, the 2803S performs 271 TOPs at 28 W [48]. ASICs are still more efficient than FPGAs.
However, the combination of GPUs training performance and FPGAs efficiency and flexibility for inference can be an alternative and promising solution.
While running DNNs, it is still difficult for CPUs to achieve high performance compared to GPUs, FPGAs and ASICs due to the massive computation and memory bandwidth requirements. GPUs, with their high memory bandwidth and throughput, are the most widely used for training DNNs. GPUs' high performance is due to their parallel processing; however, they consume a large amount of power. FPGAs and ASICs can also offer very high bandwidth by being directly connected to inputs. Moreover, compared with GPUs, FPGAs and ASICs can provide higher performance with lower power consumption while running DL algorithms. As DL models rapidly evolve and change, FPGAs offer more flexibility and reconfigurability than ASICs. Additionally, FPGAs benefit from new tools that make programming DNN applications much easier.
For further improvement of performance, various optimization techniques have been proposed. In the next section, we give an overview of some of the most used techniques.

| OPTIMIZATION TECHNIQUES
Several techniques focus on modifying DL algorithms to make them more hardware-friendly with minimal loss of accuracy. Many approaches have been explored to effectively reduce the redundancy of models and improve computing efficiency, such as data-precision reduction, network pruning and Low-Rank Approximation (LRA).

| Precision reduction
The use of lower precision in representing the data used to run DNNs reduces the storage demand of DNN models and lowers data bandwidth requirements. It optimises computing efficiency and improves performance. However, special attention must be paid to the possible degradation of accuracy. From the algorithmic perspective, recent research work can be divided into three categories: precision reduction of weights; of both weights and activations; and of inputs, weights and activations. Many researchers have targeted weight precision reduction, since reducing weight precision directly shrinks the network size. In Ref. [49], a quantization-friendly scheme applied to the MobileNetV1 model reached an accuracy of 68.03% with an 8-bit weight representation, which almost closed the gap to the floating point representation (70.77%). Zhou et al. presented INQ [50], a generalised quantization framework to convert any pre-trained full-precision CNN model with 32-bit floating point into a lossless low-precision version with 5-bit, 4-bit, 3-bit or even 2-bit weights. The use of this framework on ResNet-18 improved accuracy for 5-bit and 4-bit quantization by 0.71% and 0.62%, respectively. Li et al. squeezed the representation to 2 bits in Ref. [51], which resulted in 6.47% accuracy degradation. Rastegari et al. also proposed binary-weight networks, called BWN, in Ref. [52]. BWN gained 32× memory savings with 12.4% accuracy degradation. Other recent research works applied the precision reduction technique to weights and activations. Indeed, in Ref. [53], CaffeNet inference was successfully performed with an 8-bit fixed-point representation of weights and activations, resulting in less than 1% degradation of accuracy. A Balanced Quantization method was introduced in Ref. [54]. It achieved 66.6% top-1 accuracy when applying an 8-bit representation of weights and activations on GoogLeNet, which is less than 5% degradation compared to the 32-bit floating point baseline.
Moreover, a quantized version of GoogLeNet with 4-bit weights and activations in Ref. [55] achieved 66.5% top-1 accuracy, a 5.1% drop in accuracy. Cai et al. [56] introduced Half-Wave Gaussian Quantization (HWGQ), which reduced the accuracy by 5.7% on GoogLeNet with binary weights and ternary activations. Other research has shown that quantizing inputs, weights and activations can achieve better computational efficiency. The binarisation of inputs, weights and activations is explored in Ref. [57]. The authors proposed fully Binarised Neural Networks (BNN) that drastically reduce memory size and accesses. Based on BWN, Rastegari et al. [52] presented XNOR-Net, binarising all activations and resulting in 58× faster convolutional operations. XNOR-Net achieved better accuracy than BNN [57]. In Ref. [58], the authors proposed DoReFa-Net, a method that uses low-bitwidth parameter gradients to train CNNs with low-bitwidth inputs, weights and activations. DoReFa-Net achieved accuracy comparable to the 32-bit baseline on the SVHN and ImageNet datasets. Detailed results are summarised in Table 2. From the hardware perspective, much work has applied fixed-point representations to implement DNNs and substantially reduced the bitwidth for energy and area savings and throughput increases. In Ref. [59], LSTM models (Google LSTM and Small LSTM) with a 16-bit fixed-point data type were implemented on two FPGA platforms, resulting in only 1.23% precision degradation. Moss et al. presented an FPGA-based customisable matrix multiplication framework to run DNNs [60]. It allows runtime switching between static-precision bit-parallel and dynamic-precision bit-serial MAC (Multiply and Accumulate) implementations. The experimental results on AlexNet, VGGNet and ResNet reached up to 50× throughput increases versus FP32 baselines. In Ref.
[61], the authors implemented the Google LSTM on a Xilinx FPGA using 12-bit fixed point, which resulted in better performance and only 0.3% precision degradation. In Ref. [62], Shen et al. implemented VGG-16 and C3D across multiple FPGA platforms with DSPs (Digital Signal Processors) that support one 16-bit fixed-point multiply and add. It achieved an end-to-end performance 7.3× better than the software implementation. Following the same strategy, Zhang et al. [63] achieved a 3.1× throughput speedup with the implementation of a long-term recurrent convolutional network (LRCN) on a Xilinx FPGA using fixed-point quantization. Although this technique offers substantial gains in throughput and energy efficiency, representing data values with fewer than 8 bits in large DNNs can increase the accuracy degradation (Table 4).
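As a minimal illustration of the weight-quantization schemes discussed above (a generic uniform symmetric quantizer, not the specific method of any cited work), the following sketch maps 32-bit floating-point weights to 8-bit signed integers and back:

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Uniform symmetric quantization of a weight tensor to signed
    n_bits integers (n_bits <= 8 for the int8 cast); returns the
    integer tensor and its scale factor."""
    qmax = 2 ** (n_bits - 1) - 1               # 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 27)).astype(np.float32)
q, scale = quantize_symmetric(w, 8)
err = np.abs(w - dequantize(q, scale)).max()   # bounded by scale / 2
```

Lowering the bitwidth simply shrinks qmax, which enlarges the rounding step; this is one way to see why accuracy degradation grows when fewer than 8 bits are used.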

| Pruning
Neural networks are considered over-parametrised, as they contain a large number of redundant parameters with little influence on accuracy, which is costly in both computation and memory footprint. These parameters can be removed through a process called pruning, which is often followed by some fine-tuning to recover accuracy. Recently, several research studies [75, 76] have shown the effectiveness of this technique in reducing model size and the amount of computation, and indirectly the energy consumption, with minimal accuracy degradation. There are many pruning methods, targeting weights, filters, channels and feature maps. The core idea of weight pruning is to remove redundant weights by setting them to zero. Rather than searching exhaustively for the weights to be pruned per layer, Ref. [66] explored a technique to automatically find the possible sets of pruned weights while minimising the loss over all weights. The test error of this method on ResNet110 and ResNet56 was 6.50% and 6.67%, respectively. To guarantee the weight reduction ratio, Zhang et al. [67] proposed a systematic framework for DNN weight pruning based on the Alternating Direction Method of Multipliers (ADMM). This approach achieves weight reductions of 71.2× and 21× on the LeNet-5 and AlexNet models, respectively, with no accuracy loss. Yang et al. [68] proposed the Energy-Aware Pruning (EAP) technique for weight pruning using an energy consumption estimation of the CNN. This method reduces the energy consumption of GoogLeNet and AlexNet by 1.6× and 3.7×, respectively, compared to the original models, with less than 1% top-5 accuracy loss. For filter pruning, the basic idea is to remove unimportant filters based on an estimation of each filter's importance. Li et al. [69] reported a methodology to prune whole filters and their related feature maps, using the sum of the absolute values of each filter as a measure of its importance.
This approach reduced the inference cost of VGG-16 and ResNet-110 by 34% and 38%, respectively, while maintaining nearly the original accuracy. The study by Huang et al. [70] suggested a 'try-and-learn' learning algorithm to prune filters in CNNs while maintaining performance. The proposed algorithm removes 63.7% of redundant filters in FCN-32s and accelerates inference by 37.0% on GPU and 49.1% on CPU. Recently, a new method for filter pruning based on inducing sparsity in the weights was explored in Ref. [71]. The proposed technique achieves FLOPs reductions of 90.50% and 96.6% for VGG-16 on the CIFAR10 and GTSRB datasets, respectively, without accuracy loss. Channel pruning reduces the model size by removing channels and the related filters as well as the corresponding feature maps. Several channel pruning methods have been proposed; for instance, Ref. [72] investigated a method for channel selection called Discrimination-aware Channel Pruning (DCP). Experiments with this method on ResNet-50 showed that, with a 30% reduction of channels, it outperforms several state-of-the-art methods by 0.39% in top-1 accuracy. The study by Liu and Wu [73] proposed a new channel pruning criterion based on the mean gradient of feature maps, which effectively reduces the network FLOPs. Using this approach on VGG-16 and ResNet-110 achieves 5.64× and 2.48× reductions in FLOPs, with less than 1% and 0.08% decreases in accuracy, respectively. Liu et al. [74] enforced a scaling factor during training for channel pruning. The effectiveness of this approach was evaluated with several CNN models (VGGNet, ResNet and DenseNet). For VGGNet, it achieves a 20× reduction in model size and a 5× reduction in computing operations. More details are presented in Table 2. To achieve further speedup, pruning can be combined with other optimization techniques. The work in Ref. [77] investigated the benefits and costs of quantization and pruning, as well as their combination.
The evaluation of the approach on the NVIDIA Jetson TX2 showed that pruning reduced the inference time and energy consumption by 28% and 22.5%, respectively, with little saving in storage size. However, quantization reduced the model storage size by 75%, while the inference time and energy were reduced by 1.41× and 1.19×, respectively. The combination of these techniques leads to a reduced model storage size (76%) with a small decrease in top-1 prediction accuracy (less than 7%). This work showed that the benefit of combining techniques depends on the neural network architecture and the optimization objective: the combination has a positive impact on the inference time for VGG-16, but results in a longer inference time for ResNet50, and hence less benefit in energy consumption for ResNet50 than for VGG-16. Tung et al. [78] explored the incorporation of network pruning and weight quantization in a single learning framework named CLIP-Q, where both are performed jointly and in parallel. Compared to state-of-the-art results, the CLIP-Q technique improves the compression rate for AlexNet, GoogLeNet and ResNet-50 by 51×, 10× and 15×, respectively. Several studies have investigated this compression technique from the hardware perspective. For instance, Faraone et al. [79] suggested a filter pruning framework that efficiently utilises FPGA resources without accuracy
degradation. The evaluation of this approach on the Xilinx KU115 board showed that the pruned AlexNet and TinyYolo networks achieved a 2× speedup and a 2× reduction in resources (LUTs, DSPs, BRAMs) without accuracy loss compared to the original networks. Posewsky et al. [80] proposed an FPGA-based accelerator for pruned DNN inference. This accelerator was implemented on the ZedBoard for evaluation. Compared to the software implementation, this approach achieves a 10× improvement in energy efficiency and a 3× improvement in runtime. The hardware implementations of the pruned and non-pruned networks differ in accuracy by less than 0.5%. The study by Zhang et al. [81] proposed a compression strategy for CNNs based on pruning and quantization, together with an FPGA-based accelerator for the compressed CNN. The evaluation of the proposed system on the Xilinx ZCU104 for AlexNet showed improvements in latency and throughput on convolutional layers of 182.3× and 1.1× compared with a CPU and a GPU, respectively, and improvements in energy efficiency of 822.0× and 15.8×, respectively.
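The two pruning granularities discussed above can be sketched in a few lines (illustrative code: a generic magnitude criterion for weight pruning, and the per-filter sum-of-absolute-values criterion attributed to Li et al. [69] for filter pruning):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Weight pruning: zero out the given fraction of smallest-magnitude
    entries (assumes 0 < sparsity < 1); returns the pruned tensor and a
    binary keep-mask."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > thresh
    return w * mask, mask

def filter_l1_scores(conv_w):
    """Filter pruning criterion: sum of absolute weight values per output
    filter (axis 0 = output channels); low-scoring filters are candidates
    for removal together with their feature maps."""
    return np.abs(conv_w).sum(axis=(1, 2, 3))

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 3))       # 64 filters of shape 3x3x3
pruned, mask = magnitude_prune(w, 0.5)       # half the weights set to zero
scores = filter_l1_scores(w)                 # one importance score per filter
```

In hardware, the resulting zeros only pay off if the accelerator's storage format and compute units can skip them, which is why structured (filter or channel) pruning is often preferred for FPGA designs.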

| Low-rank approximation
Layer decomposition, or LRA, has been extensively explored to reduce computational complexity and improve efficiency. This method decomposes the model into a compact, approximate one with more lightweight layers via matrix decomposition. Denton et al. [89] applied an LRA of the kernels to reduce computation in convolutional layers. The proposed model achieved a 2.5× speedup with a small drop in accuracy (<1%). In Ref. [85], Wang et al. proposed a factorised convolutional layer that outperforms standard ones in the performance-to-complexity ratio. The factorised network achieved similar performance to VGG-16 while requiring 42× less computation. The authors of Ref. [90] showed that the low-rank approximation technique can also be applied to decompose the weights of the FC layers, which resulted in up to a 50% reduction in the number of parameters. Following the same strategy, Qiu et al. [91] applied LRA to the FC layer to reduce the number of weights: with 63% fewer parameters, this method achieved 87.96% accuracy on VGG16-SVD. Also, a Tucker decomposition is used in Refs. [82, 92] to decompose pretrained weights. In Ref. [86], LRA was adopted for weights and inputs. Zhang et al. used a Generalized Singular Value Decomposition (GSVD) to reduce the accumulated error across multiple layers. By applying this method to VGG-16, the model achieved a 4× speedup with only a 0.3% increase in top-5 error. Chen et al. proposed a Layer Decomposition-Recomposition Framework (LDRF) [86], in which they applied a Singular Value Decomposition (SVD) to the weight matrices. During the SVD decomposition, they lowered the rank of each layer to estimate the layer's valid capacity. On VGG-16, the proposed method reached a 5.13× speedup with only a 0.5% top-5 accuracy reduction. In Ref. [83], the authors showed that low-rank tensor decompositions can speed up large CNNs while maintaining performance. The proposed approach achieved a 1.82× speedup with a 5.0× weight reduction for AlexNet, with less than 0.4% accuracy degradation.
The implementation of DNNs can be more effective when using the layer decomposition method. Wen et al. designed a new LRA to train a DNN model with lower ranks and higher computation efficiency [87] (Table 4). This method gained a 2× speedup on GPU while maintaining accuracy, and a 4.05× speedup on CPU with little accuracy degradation. To accelerate CNN inference computation, Wang et al. proposed an approach based on low-rank and group sparse tensor decomposition [88]. On VGG-16, this method achieved a 6.6× speedup on CPU with less than 1% degradation in top-5 error. In Ref. [93], the authors proposed a framework to accelerate DNNs based on low-rank approximation. On FPGA, it achieved an average computation efficiency of 64.5%. LRA can obtain a compact and approximate network model. However, to learn an accurate network structure, LRA requires repeated iterations of decomposition, fine-tuning, etc., resulting in extra computation overhead.
The aim of using optimization techniques is to reduce the model size while maintaining good performance. Lower-precision data representation (quantization) usually improves latency but can reduce accuracy, especially when dealing with large-scale DNNs. Pruning the network also reduces the size of the model and can even improve accuracy, but usually not latency. However, weight quantization is more hardware-friendly than weight pruning. LRA techniques are efficient for model compression, but the need for expensive decompression operations complicates their implementation. Furthermore, LRA techniques cannot perform global compression of parameters, as they are applied layer by layer.
To improve efficiency and achieve further compression, optimization techniques such as pruning, precision reduction (quantization) and LRA can be combined.
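To make the pruning and quantization techniques discussed above concrete, the following minimal sketch (arbitrary weight values; symmetric 8-bit quantization and simple magnitude pruning assumed, not any specific paper's scheme) shows both operations on a weight vector:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(1000).astype(np.float32)

# --- 8-bit uniform (symmetric) quantization ---
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale         # dequantized values
print("max quantization error:", np.abs(w - w_hat).max())

# --- magnitude pruning: zero out the 80% smallest weights ---
def prune(w, sparsity=0.8):
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

w_pruned = prune(w)
print("sparsity:", np.mean(w_pruned == 0))
```

The int8 tensor needs 4× less storage than float32, which is why quantization maps so directly onto fixed-point FPGA datapaths; the pruned tensor needs a sparse format (and hence decoding logic) to realise its savings.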

| HW ACCELERATION APPROACHES
DNNs have been successful in a wide range of applications thanks to the rapid development of custom hardware to speed up the training phase as well as the inference phase. Among the different hardware targets previously presented in section 3, FPGA platforms, with reconfigurable integrated circuits and embedded hard cores, make it easy to design dedicated hardware accelerators for complex DNNs. In this section, we review many recent research works and summarise FPGA-based acceleration methods.

| Throughput optimization
Throughput optimization is one of the main objectives when designing an efficient DNN-based accelerator. Several techniques have been explored to achieve higher throughput; the most widely used include loop optimization, systolic array architectures and Single Instruction Multiple Data (SIMD) based computation.
86 -DHOUIBI ET AL.

| Loop optimization
To achieve high throughput, loop optimization techniques such as loop unrolling, loop tiling and loop interchange have been widely used. They reduce the overheads associated with massive nested loops, which increases execution speed. These techniques make effective use of parallel processing capabilities. In Ref. [94], the authors exhaustively analysed loop optimizations and data movement patterns in CNN loops. They provided a new dataflow and architecture, in which they leveraged loop tiling, unrolling and interchange to minimise data communication. Their design achieved a throughput of 645.25 GOPs on an Intel FPGA using the VGG model. Loop tiling is used in Ref. [91] to fit large-scale CNN models into limited on-chip buffers. The proposed approach demonstrated higher acceleration on VGG16-SVD when applying a quantization method, and achieved 137 GOPs. Also, to explore the design space of dataflow across layers, the authors in Ref. [95] used loop tiling and developed a fused-layer CNN accelerator. The implementation of the proposed approach on a Xilinx FPGA reduced off-chip feature map data transfer by 95% and reached up to 61.62 GOPs in throughput. Based on loop unrolling and tiling, Rahman et al. [96] presented ICAN, a 3D compute tile for convolutional layers. With optimized on-chip buffer sizes for FPGAs, the proposed technique outperformed [95] by 22%. In Ref. [97], loop unrolling is used to define the computation pattern and the dataflow. The paper also proposed an RTL compiler, ALAMO, to automatically integrate the computing primitives to accelerate the operation on FPGA. On AlexNet, the accelerator reported a computational throughput of 114.5 GOPs. In Ref. [98], the authors designed DLAU, an accelerator architecture for large-scale DNNs exploiting data reuse to reduce the memory bandwidth requirements. It included three pipelined processing units to improve throughput and a loop tiling technique to improve locality and minimise data transfer operations. On a Xilinx FPGA, the proposed accelerator achieved up to a 36.1× speedup with 234 mW power consumption.
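The loop tiling idea above can be sketched on a matrix multiply, the core kernel of convolutional and FC layers. In the hypothetical `matmul_tiled` below (dimensions and tile size are illustrative, not from any cited design), each tile-sized block stands for the portion of data held in on-chip buffers:

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Loop-tiled matrix multiply: each (tile x tile) block of A and B is
    meant to fit in on-chip buffers, so off-chip traffic is amortised over
    many MACs. Slicing clamps at the edges, so dims need not divide evenly."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):           # loop tiling over output rows
        for j0 in range(0, N, tile):       # ... over output columns
            for k0 in range(0, K, tile):   # ... over the reduction dim
                # this block product plays the role of the unrolled inner kernel
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile])
    return C

rng = np.random.default_rng(2)
A, B = rng.standard_normal((64, 96)), rng.standard_normal((96, 80))
assert np.allclose(matmul_tiled(A, B), A @ B)
```

In an HLS or RTL design, the three tile loops would be pipelined and the inner block product fully unrolled across parallel MAC units; the tile size is then chosen to balance on-chip buffer capacity against off-chip bandwidth.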

| Systolic array architecture
The systolic array architecture is another technique that employs a high degree of parallelism to improve throughput. It consists of placing thousands of Processing Elements (PEs) in an organised structure and connecting them directly to each other to form a large physical matrix of operators. Each PE has its own limited private memory. In Refs. [99][100][101], the systolic array architecture is applied to FPGA-based CNNs. To accelerate CNN/DNN on FPGA, C. Zhang et al. designed and implemented Caffeine [99], a HW/SW co-designed library which reduced underutilised memory bandwidth. The authors proposed a massive number of parallel PEs and organised them as a systolic array to mitigate timing issues for large designs. The implementation of the proposed accelerator on a Xilinx FPGA using VGG achieved 636 GOPs. A 1-D systolic array architecture described in OpenCL is proposed in Ref. [101]. This approach is only suitable for small models because all input feature maps are stored in on-chip memory. The implementation of AlexNet on FPGA resulted in 1382 GFLOPS. In this work, DSP utilization is improved by adopting the Winograd transformation. In Ref. [100], Wei et al. implemented a CNN on an Intel FPGA using a systolic array architecture, which achieved up to 1171 GOPs. In their work, they provided an analytical model for resource utilization and performance and developed an automatic design space exploration framework. However, using current FPGA Computer Aided Design (CAD) tools to synthesise and lay out systolic arrays results in frequency degradation. In Ref. [102], 2-D systolic architectures are analysed to identify the causes of this degradation, and two methods are proposed to improve the frequency of systolic array designs, which is directly related to throughput. The evaluation results attained 1500 GOPs for VGG inference on a Xilinx FPGA platform (achieving a 1.29× higher frequency). Table 3 summarises some results.
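A cycle-level software model can help clarify how an output-stationary 2-D systolic array computes a matrix product. The `systolic_matmul` function below is an illustrative simulation (not any cited design): operands enter skewed at the array edges, and each PE performs one MAC per cycle while forwarding its operands to its right and bottom neighbours:

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an M x N output-stationary systolic array
    computing C = A @ B. a_reg/b_reg hold the operand currently sitting
    in each PE; C holds each PE's local accumulator."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    a_reg = np.zeros((M, N))
    b_reg = np.zeros((M, N))
    for t in range(K + M + N - 2):         # enough cycles to drain the array
        # operands advance one PE to the right / downwards each cycle
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # inject skewed inputs at the left and top edges
        for i in range(M):
            k = t - i                      # row i is delayed by i cycles
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):
            k = t - j                      # column j is delayed by j cycles
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        C += a_reg * b_reg                 # every PE does one MAC per cycle
    return C

rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((6, 5))
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The nearest-neighbour-only wiring modelled here is exactly what lets hardware systolic arrays scale to thousands of PEs without long broadcast nets, which is the timing advantage exploited by Caffeine [99] and related designs.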

| SIMD-based computation
To achieve high throughput, SIMD-based computation has been used in several recent research works. The authors in Ref. [105] designed a system architecture based on a heterogeneous FPGA with DSPs, supporting the SIMD paradigm to efficiently process parallel computation for CNN layers (convolutional and fully connected layers). The proposed architecture required 47% lower computation time than a non-SIMD implementation. Furthermore, to accelerate the CNN computation rate on FPGAs, Nguyen et al. proposed Double MAC [103], an approach for packing two SIMD MAC operations into a single DSP block with reduced bitwidth. This work improved the computation throughput by 2 times with the same resource utilization. Zhong et al. designed Synergy [104], a hardware-software co-designed pipelined framework based on a heterogeneous FPGA to accelerate CNN inference. Supporting multi-threading, Synergy leveraged all the available on-chip compute resources, including the CPU, FPGA and NEON SIMD engines. The FPGA and the NEON engines are used to accelerate the convolutional layers, while the CPU cores execute the fully connected layers and the preprocessing functions. Additionally, a workload-balancing mechanism was provided to adapt to various networks at runtime without the need to change the hardware or software implementations. The evaluation of Synergy showed higher throughput and energy efficiency than implementations on the same platform. Likewise, an architecture based on the SIMD technique was presented in Ref. [106] to accelerate a DNN for speech recognition. SIMD and MIMD (Multiple Instructions Multiple Data) modes are mixed in Ref. [107] to accelerate DL models. In addition, a SIMD-like architecture is adopted in Ref. [108] to minimise energy consumption, which is another important key to further improving accelerator efficiency.
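The operand-packing trick behind Double MAC [103] can be illustrated in a few lines. The sketch below assumes unsigned 8-bit operands for simplicity (the actual design handles signed values and DSP-specific wiring, omitted here): two products share one wide multiplication, and the results are recovered from disjoint bit fields:

```python
def double_mac(a1, a2, b):
    """Pack a1 and a2 (unsigned 8-bit) into one wide word so that a single
    multiplication by b yields both a1*b and a2*b, mimicking two MACs on
    one DSP multiplier."""
    SHIFT = 18                       # > 16 bits, so the two partial
                                     # products occupy disjoint bit fields
    packed = (a1 << SHIFT) | a2      # one wide operand
    p = packed * b                   # ONE multiplication on the "DSP"
    return p >> SHIFT, p & ((1 << SHIFT) - 1)

# each pair of products costs a single multiply
for a1, a2, b in [(200, 7, 255), (0, 255, 1), (123, 45, 67)]:
    hi, lo = double_mac(a1, a2, b)
    assert (hi, lo) == (a1 * b, a2 * b)
```

Because a2*b fits in fewer than 18 bits, the two partial products never overlap inside the wide result, which is what makes the unpacking exact; this is also why the technique requires reduced operand bitwidths.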

| Energy optimization
Reducing the energy consumption is a key challenge for designing an efficient DNN-based accelerator. Therefore, various techniques have been explored by researchers to obtain high throughput with low energy consumption.

| Reducing the memory bandwidth
Many recent researchers have focussed on reducing the on-chip and off-chip memory bandwidth. In Ref. [109], Zhang et al. presented a 2-D interconnection between PEs and local memory to minimise the on-chip memory bandwidth. The authors also increased data locality, which reduced the off-chip memory requirements. Using OpenCL, the implementation of VGG on FPGA achieved a throughput of 1790 GOPs and an energy efficiency of 47.78 GOPs/W. Memsqueezer [110], an on-chip memory subsystem for low-overhead DL accelerators that can be implemented on FPGA, was also proposed. It compressed the data and weights from the hardware perspective and eliminated data redundancy. With Memsqueezer buffers, CNN accelerators achieved an 80% energy reduction over conventional buffer designs with the same area budget. Reducing data transfer between on-chip and off-chip memory can also minimise energy consumption. In this context, Shen et al. [111] realised Escher, a CNN accelerator with a flexible data buffering scheme, which reduced the bandwidth requirements by 2.4× on FPGA using AlexNet. The study by Li et al. [112] observed that for CNN accelerators, over 80% of the energy is consumed by DRAM accesses. The authors proposed SmartShuttle, an adaptive layer, to minimise off-chip memory accesses by investigating the impact of data sparsity and reusability on the memory. The evaluation on AlexNet showed that SmartShuttle reduced the DRAM access volume by up to 47.6% and achieved up to 36% energy savings. In the same context, Ref. [113] designed an algorithm called block convolution to completely eliminate the off-chip transfer of intermediate feature maps. The authors in Ref. [115] showed that using low-rank approximation, a 31% to 53% energy reduction can be reached. Low-precision data representation can also reduce energy consumption, as with the binarised neural networks in Ref. [116].
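A back-of-envelope calculation illustrates why the works above target off-chip traffic. With hypothetical layer dimensions (not taken from any cited design), the sketch below compares a datapath with no on-chip reuse against ideal buffering, where each tensor crosses the DRAM interface exactly once:

```python
# Hypothetical VGG-style conv layer: 56x56 feature maps, 256 input and
# 256 output channels, 3x3 kernels. All figures count data words.
H = Wd = 56
C_in = C_out = 256
K = 3

ifmap = H * Wd * C_in
weights = K * K * C_in * C_out
ofmap = H * Wd * C_out

# worst case, no reuse: every MAC refetches its operands from DRAM
macs = H * Wd * C_out * K * K * C_in
naive_traffic = 3 * macs        # one ifmap word, one weight, one partial sum

# ideal on-chip buffering: each tensor is moved once
buffered_traffic = ifmap + weights + ofmap

print(f"MACs: {macs:,}")
print(f"DRAM words, no reuse:   {naive_traffic:,}")
print(f"DRAM words, buffered:   {buffered_traffic:,}")
print(f"traffic reduction: {naive_traffic / buffered_traffic:.0f}x")
```

Real accelerators land between these two extremes, since on-chip buffers rarely hold whole tensors; schemes like SmartShuttle [112] choose, per layer, which tensor to keep resident so as to move the gap toward the buffered bound.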

| Algorithmic optimization
Recent works [59,118,119] have applied algorithmic optimizations, such as Fast Fourier Transform based convolution, to reduce the computation complexity of DNNs. The authors in Ref. [123] designed an accelerator based on the Winograd algorithm. In this work, they evaluated the Winograd algorithm with different tile sizes. Using VGG, the design achieved 943 GOPs on FPGA. More details are presented in Table 5. Several techniques have been explored to achieve higher throughput, such as loop optimization, systolic array architectures and SIMD-based computation. A DNN accelerator designed using these techniques usually consumes more energy. Therefore, various techniques have been explored to obtain high throughput with low energy consumption, such as memory bandwidth reduction and model compression. For further improvement, algorithmic optimization approaches like the Fast Fourier and Winograd algorithms can be used. Furthermore, the automatic generation of high-performance hardware accelerators from software can significantly simplify development and speed up the process (e.g. HLS). Reducing energy consumption and improving throughput are key challenges in designing an efficient DNN-based accelerator. Therefore, various acceleration techniques can be combined along with the optimization approaches.
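The Winograd algorithm mentioned above can be demonstrated with the smallest 1-D case, F(2,3), which produces two convolution outputs with four multiplications instead of six; the transform matrices below are the standard ones from the minimal-filtering literature:

```python
import numpy as np

# Winograd F(2,3) transform matrices (input, filter, output).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs,
    using 4 elementwise multiplies instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D tile variants used by FPGA accelerators, such as F(2x2, 3x3), nest this construction and cut the multiplication count per output tile from 36 to 16, which is why larger tile sizes (as evaluated in Ref. [123]) trade DSP usage against transform overhead and numerical precision.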

| CONCLUSION
Herein, the DL concept was initially presented through the complexity of different models. We also reviewed different computation platforms for DL implementation. Then, we discussed the literature on the different approaches used to optimise DL models and make them more hardware-friendly. Finally, we presented and analysed the acceleration techniques used for the deployment of DL models on FPGA platforms. The deployment of DL on embedded equipment with high accuracy, high throughput and low power consumption is still a challenge. Indeed, the hardware constraints imposed by lower power consumption, such as limited processing power, a lower memory footprint and less bandwidth, reduce accuracy. Due to the increasing complexity of DNN models, it is difficult to integrate a large DNN into an embedded hardware design. This has made researchers think about applying optimization and acceleration techniques. Optimization techniques focus on modifying DL algorithms to make them more hardware-friendly. They effectively reduce the redundancy of models and provide improved computing efficiency with minimal loss of accuracy. Acceleration methods, on the other hand, aim to speed up DNNs while improving throughput and reducing energy consumption. Also, applying algorithmic optimizations like the Fast Fourier and Winograd algorithms can accelerate DNNs and improve resource productivity and efficiency. In addition, the use of frameworks to automatically map models onto hardware platforms simplifies development and speeds up the automatic generation of hardware accelerators. The efficient implementation of complex DNN models on new and increasingly powerful embedded platforms can offer many benefits for AI applications. Previous works faced challenges such as limited hardware resources, long development times and performance degradation.
Moreover, it is difficult to use all the functionalities of neural network algorithms in hardware compared to software implementations [131]. In this context, new FPGAs, using parallel processing and embedded programmable cores, have advantages over other hardware platforms for DNN implementations. Whole systems can be integrated on a chip using many hardware components such as memories, fast devices, DSP units and processor cores, which expedites the design of such large-scale systems. FPGAs are very flexible and allow reconfiguration to optimise bit resolution, clock rate, parallelisation and pipeline processing for a given application. Some FPGA manufacturers, like Xilinx, have provided accelerators (DPU) along with other tools and APIs to optimise pretrained DL models by applying pruning and quantization techniques.