Efficient generative model for motion deblurring

Abstract: This article proposes a generative model for motion deblurring based on the generative adversarial network. The generative model adopts a multi-level, multi-scale feature-fusion structure: by concatenating images and feature maps of different scales and adding them pixel-wise, image details are preserved at every level. The enlargement part of the feature maps uses three branches, which generate richer and more realistic detail. During training, three loss functions at the pixel level and the abstract level make convergence faster and effectively assist the parameter learning of the generative model. Experiments show that the proposed approach is more efficient than many state-of-the-art methods.


Introduction
In everyday image capture, blurring easily occurs because of camera shake and the motion of objects. Ideally, blur could be avoided at capture time by improving illumination, shortening the exposure time and reducing the shooting distance, but such conditions are rarely available in daily environments, which leaves deblurring in a post-processing stage as the only viable option.
Initial research focused on non-blind deconvolution, which assumes the blur kernel is known and restores a sharp image under that assumption; however, the blur kernel is generally unknown in practice, so these methods lack practical value.
Blind deblurring is the commonly used alternative. It attempts to estimate the blur kernel under unknown conditions and uses it to remove the blur caused by camera shake or object motion, outputting a sharp image.
Because a blurred image can be modelled as the convolution of a sharp image with a blur kernel, supervised learning is well suited to estimating that kernel. The algorithm of this paper therefore designs an effective generative network, based on convolutional neural networks (CNNs) and the idea of blind deblurring, to remove various kinds of blur.

Related work
Generally, blind deblurring algorithms can be divided into two categories: single-image and multi-image deblurring algorithms.
Multi-image deblurring usually uses multi-angle images or images in a time series to remove blur. Although the effect is good, it is not applicable to the single-image case, which is far more common in real environments.
Single-image deblurring algorithms suit reality: they need no additional hardware and apply to more imaging devices. Fergus et al. [1] proposed a deblurring algorithm based on a gradient-distribution model. This method focuses on estimating the blur kernel: according to the gradient distributions of blurred and non-blurred images, it constructs a joint posterior probability of the original image and the blur kernel given the observed image. Jia et al. [2] add a spatial random model of blur noise and local-smoothing prior knowledge to the probability model, and alternately estimate the blur kernel and the sharp image until convergence. Cai et al. [3] used framelet and curvelet transforms to obtain sparse representations of blur kernels and images. Xu et al. [4] use a sparse prior in a two-step blur-kernel estimation algorithm that estimates the kernel accurately; the model is then used to estimate the sharp image.
With the rapid development of deep learning, convolutional neural networks utilise massive image data and very deep models to learn feature representations of images at multiple levels, and have achieved remarkable results in computer vision; many deep-learning-based blind deblurring approaches have been proposed. Yan et al. [5] proposed an algorithm consisting of a deep neural network (DNN) and a general regression neural network (GRNN), making full use of both the classification ability of the DNN and the regression ability of the GRNN. Sun et al. [6] constructed a model that reconstructs the overall sharp image after predicting multiple pieces of local blur information through a CNN. Hradis et al. [7] first use a CNN to generate a set of blurred image data and then use sharp/blurred image pairs as training data to train a blind deblurring network that removes blur from text images. Tao et al. [8] used ResNet [9] modules and an LSTM [10] structure to construct a CNN model and achieved a satisfactory deblurring effect; the recovery of image details in particular is quite sharp.
The Generative Adversarial Network (GAN) [11] proposed by Ian Goodfellow et al. is a hot area of deep learning. It contains two trainable models, a generative model G and a discriminative model D, which continually play an adversarial game. Ideally, the two eventually reach a dynamic balance: the images generated by G become arbitrarily close to the real data distribution, while D cannot distinguish the authenticity of G's results. Based on the ideas of the conditional GAN [12] and perceptual loss [13], Kupyn et al. [14] were the first to propose a GAN-based deblurring algorithm, surpassing many state-of-the-art methods.
The ideas of CNNs, GANs and a multi-level, multi-scale feature-fusion structure are adopted to design the generative model in this paper.

Discriminative model
The discriminative model is a typical classification network consisting of six down-sampling modules and one output layer. Each down-sampling module includes a convolution layer, batch normalisation and a PReLU [15] activation layer. The final output uses a sigmoid, where 0 represents false and 1 represents true. Unlike an ordinary classification network, the discriminative network removes the pooling layers and directly uses convolutions with a stride of 2, reducing the computational complexity. Each down-sampling layer halves the height and width of its input; with six such layers, the original image is reduced to 1/64 of its side length. The input image is 256 × 256, so the down-sampled result is 4 × 4. Finally, a 4 × 4 convolution kernel reduces the output to one channel, followed by a sigmoid function (Fig. 1).
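The architecture just described can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the channel widths (`base`, the cap of 512) are our assumptions, since the paper only specifies the number of modules, the stride, and the activation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Six stride-2 conv blocks halve H and W each time (256 -> 4),
    then a 4x4 conv maps to one channel, followed by a sigmoid.
    Channel widths are illustrative assumptions, not from the paper."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(6):                      # 256 -> 128 -> ... -> 4
            out = min(base * 2 ** i, 512)
            layers += [nn.Conv2d(ch, out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(out),
                       nn.PReLU()]
            ch = out
        layers += [nn.Conv2d(ch, 1, kernel_size=4),  # 4x4 -> 1x1
                   nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

d = Discriminator()
score = d(torch.randn(1, 3, 256, 256))
print(tuple(score.shape))  # (1, 1, 1, 1)
```

Replacing pooling with stride-2 convolutions, as the text notes, lets the network learn its own down-sampling while saving the separate pooling computation.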

Generative model
Our focus is on designing the generative model G, which should produce refined output close to a sharp image. A fully convolutional network (FCN) can produce output of the same size as the input, so the FCN idea is also used to construct G. Because G requires higher precision than an ordinary segmentation network, an existing FCN cannot be used directly. The G network of this paper uses the U-Net idea to design a symmetric encoder and decoder, which reduces the original image layer by layer to one-quarter of its size and then symmetrically enlarges it back to the original size, as shown in Fig. 2.
The input and output of the network are paired images: the left input is the blurred image and the right output is the deblurred image. The obliquely textured rectangular box is a convolution module, which consists of three sub-layers: a convolution layer, a batch-normalisation layer and a PReLU [15] activation layer.
The input image is also downscaled to two additional scales, with reduction factors of 1/2 and 1/4. Each downscaled image is fed into the feature-extraction module as an intermediate input together with the feature maps extracted by the preceding layers. The feature maps produced by the module serve both as input to the next layer and as the summand for the same level of the enlargement module.
Each dashed box on the left side of the figure is a down-sampling module. The grey rectangular box is a down-sampling layer comprising three sub-layers: a convolution with a stride of 2 that halves the height and width of the feature maps, a batch-normalisation layer and an activation layer. The two white rectangular boxes are ResBlock layers [9].
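One encoder down-sampling module (grey box plus two white ResBlocks) might look as follows in PyTorch. This is a sketch under stated assumptions: the kernel sizes and the channel counts in the example are ours, as the paper does not give them.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block [9]: two 3x3 convs with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class DownModule(nn.Module):
    """Encoder module: a stride-2 conv halves H and W (the grey box),
    followed by two ResBlocks (the white boxes)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.PReLU())
        self.res = nn.Sequential(ResBlock(out_ch), ResBlock(out_ch))

    def forward(self, x):
        return self.res(self.down(x))

m = DownModule(32, 64)
y = m(torch.randn(1, 32, 128, 128))
print(tuple(y.shape))  # (1, 64, 64, 64)
```

The module's output feeds both the next encoder level and, as a skip connection, the decoder level of the same scale.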

Up-sample structure:
The right-hand part of Fig. 2 is the up-sampling structure for the feature maps. Bilinear interpolation is often used for image enlargement, and it is also used in our up-sampling operation. However, the task here is not simply to enlarge the feature maps but to restore a sharp version of the original image. After several down-sampling layers, the feature maps are quite abstract, and enlarging them back to the original size with bilinear interpolation alone loses many details through interpolation smoothing, making it unsuitable for this generation task. This paper therefore designs a composite up-sampling structure that enlarges the feature maps by a specified factor.
Besides bilinear interpolation, CNNs have many kinds of up-sampling operations; unpooling [16] is one of them, as shown in Fig. 3. Pooling down-samples the feature maps: when max pooling is used, for example, the position of the maximum value under each kernel window is recorded, and the up-sampling operation restores each value to its recorded position.
Transposed convolution inserts zeros into the original feature map and then applies an ordinary convolution, as shown in Fig. 4.
Given an input feature map x1, x2, …, x9 and a convolution kernel w1, w2, …, w9, part of the convolution results are: y1 = x1·w5, y2 = x1·w4 + x2·w6, y3 = x2·w5, … The kernel in Fig. 4 differs from ordinary bilinear enlargement in that it generates new detail on the original feature map, so transposed convolution is used as one branch of the up-sampling structure. Pixel shuffle [17] is a newer up-sampling operation that can up-sample feature maps without introducing new data, as shown in Fig. 5.
The main idea of pixel shuffle is to convolve the original feature maps of shape (c, W, H) to c·r² channels, where c is the number of channels, W and H are the width and height of the feature maps and r is the scale factor, and then rearrange every r² channels into one channel of shape (r·W, r·H). This is equivalent to combining corresponding pixels from r² feature maps into r² pixels of one enlarged map, which is realised using the following formula. We also adopt pixel shuffle as a branch of the up-sampling unit. The three branches above up-sample the feature maps to the required scale; at the end of the three branches, the enlarged feature maps are added pixel-wise.
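The three-branch unit can be sketched in PyTorch: bilinear interpolation, transposed convolution, and pixel shuffle each enlarge the feature maps by the same factor, and the results are summed pixel-wise. The kernel sizes here are our assumptions; the paper specifies only the three branch types and the pixel-wise addition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpUnit(nn.Module):
    """Composite 2x up-sampling unit with three parallel branches:
    bilinear interpolation, transposed convolution (Fig. 4) and
    pixel shuffle [17]. Branch outputs are added pixel-wise."""
    def __init__(self, ch, r=2):
        super().__init__()
        self.r = r
        # Branch 2: transposed conv doubles H and W (kernel size assumed).
        self.deconv = nn.ConvTranspose2d(ch, ch, 4, stride=r, padding=1)
        # Branch 3: conv to ch*r^2 channels, then rearrange every r^2
        # channels into one (r*H, r*W) map via pixel shuffle.
        self.ps = nn.Sequential(nn.Conv2d(ch, ch * r * r, 3, padding=1),
                                nn.PixelShuffle(r))

    def forward(self, x):
        # Branch 1: plain bilinear interpolation.
        up = F.interpolate(x, scale_factor=self.r, mode='bilinear',
                           align_corners=False)
        return up + self.deconv(x) + self.ps(x)

u = UpUnit(16)
y = u(torch.randn(1, 16, 32, 32))
print(tuple(y.shape))  # (1, 16, 64, 64)
```

The bilinear branch preserves smooth structure, while the two learned branches can synthesise new detail, matching the motivation given above.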

Losses
We use the cross-entropy loss function with a gradient penalty for the discriminative model to judge the two-class output, as mentioned in [14]. In the following formula, I_blur, I_sharp and L_d stand for the blurred image, the sharp image and the computed discriminator loss, respectively.
Generally, the difference between the generated image and the sharp label image can be expressed by the MSE [18], which compares the mean squared deviation of each pixel. If the blur is caused by large-scale jitter, the distance between corresponding pixels of the blurred and sharp images is large, so the computed loss is very large; equally large gradients then arise during back-propagation, causing the neuron weights to oscillate and making convergence difficult. We instead use the SmoothL1 loss [19] to compute the residual between the generated and sharp images. It can be formulated as follows: when the absolute value of the residual is less than 1, the loss is half the squared (Euclidean) distance; otherwise, it is the absolute value minus 0.5, avoiding derivatives that are too large for stable convergence.
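The element-wise piecewise form of SmoothL1 [19] described above can be written out directly:

```python
def smooth_l1(residual):
    """Per-element SmoothL1 loss: quadratic near zero, linear beyond
    |x| = 1, so large residuals yield bounded (unit) gradients."""
    x = abs(residual)
    return 0.5 * x * x if x < 1 else x - 0.5

# Quadratic region: behaves like a scaled MSE.
print(smooth_l1(0.5))   # 0.125
# Linear region: the 0.5 offset keeps the two pieces continuous at |x| = 1.
print(smooth_l1(4.0))   # 3.5
```

Because the gradient magnitude saturates at 1 in the linear region, large residuals from heavily jittered pixels no longer produce the oscillating updates described above.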
We adopt the perceptual loss [13] as another loss term. The perceptual loss compares two images in feature space, using an MSE-like function that computes the difference between pairs of activations and takes the mean of the sum of squares.
where C_j is the number of output channels of convolution layer j, H_j and W_j are its height and width, φ_j is the CNN activation of layer j, and y and ŷ are the real and generated images, respectively. We use the VGG16 structure as the feature-extraction network and use layer 16 to compute the perceptual loss. The final loss is as follows, where λ1 is the coefficient of the SmoothL1 loss with a value of 10 and λ2 is the coefficient of the perceptual loss with a value of 100, the same settings as DeblurGAN [14]; L stands for the total computed loss of the model, and L_d, L_smooth and L_feat stand for the quantities in Formulas (2)-(4), respectively.
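Assuming, as the text implies, that the total loss is a weighted sum of the three terms, the combination can be sketched in plain Python. The perceptual term here averages the squared differences over the C_j·H_j·W_j feature activations (the features themselves would come from VGG16 layer 16):

```python
def perceptual_term(feat_real, feat_fake):
    """MSE in feature space: mean of squared differences over the
    C_j * H_j * W_j activations (features assumed flattened to lists)."""
    n = len(feat_real)
    return sum((a - b) ** 2 for a, b in zip(feat_real, feat_fake)) / n

def total_loss(l_d, l_smooth, l_feat, lam1=10.0, lam2=100.0):
    """Assumed form of the final loss: L = L_d + lambda1 * L_smooth
    + lambda2 * L_feat, with lambda1 = 10 and lambda2 = 100 [14]."""
    return l_d + lam1 * l_smooth + lam2 * l_feat

print(perceptual_term([1.0, 2.0], [0.0, 0.0]))  # 2.5
print(total_loss(1.0, 0.2, 0.01))               # 4.0
```

The large λ2 weights the abstract (feature-space) term most heavily, consistent with the paper's emphasis on realistic detail over raw pixel agreement.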

Experimental results
All experiments were designed and conducted on a 3.2 GHz Intel CPU and two NVIDIA Pascal Titan X GPUs under CUDA 8, Python 2.7 and PyTorch 1.0. The proposed method is trained on the GoPro dataset [20], which consists of 2103 pairs of blurred and sharp images at 720p quality. We use the GoPro test set, which contains 1111 image pairs, as our test set.
Köhler Dataset [21] contains four clear images, each of which produces 12 blurred images. We adopt it as another test set.
We compared the experimental results with previous state-of-the-art image deblurring approaches, DeblurGAN [14], Sun et al. [6] and Xu et al. [4], in terms of peak signal-to-noise ratio (PSNR) and the structural similarity measure (SSIM). We report the results in Tables 1-3. As the tables show, the proposed algorithm surpasses the other algorithms on PSNR, indicating that its results differ least from the original images at the pixel level. Xu et al. [4] leads in PSNR on the Köhler dataset; however, although its PSNR is relatively high, its restorations are not necessarily the most realistic in terms of SSIM. Across many images, our algorithm's results are generally more similar to the original images, which can effectively provide more eye-friendly data for subsequent processing. Deblurred images from the test on the GoPro dataset are shown in Fig. 6.
In computer vision, deep learning shows prominent performance, but it requires heavy computation and relies heavily on hardware performance. Thus, we also compare the results from the perspective of computational efficiency. Our generative model requires only 8.7 MB of storage, whereas DeblurGAN requires 43.4 MB, so our approach is more suitable for lightweight computing environments.
When processing an image with a resolution of 1280 × 720, DeblurGAN takes 0.0094 s (106.38 FPS), while our approach takes 0.0081 s, reaching 123.46 FPS; it can effectively handle everyday deblurring tasks.

Conclusion
This paper describes an efficient generative model for motion deblurring. The proposed approach uses multi-level and multi-scale features. Experiments show that it achieves a good deblurring effect on existing datasets while consuming relatively few resources, making it suitable for lightweight devices.