A method of inpainting moles and acne on high-resolution face photos

With the rapid development of mobile phones, more and more high-resolution photos are being taken, and the demand for high-resolution image inpainting is becoming increasingly urgent. To repair high-resolution face images automatically and quickly, this paper proposes an improved generative adversarial network method. Firstly, we built a high-resolution dataset for training and testing, abandoning the traditional 256 × 256 data. Secondly, whereas existing methods can only repair masks of fixed size and shape, by using a global average pooling layer the improved network can repair moles and acne of arbitrary size and shape on face photos. Finally, to achieve optimal performance, a mixed loss function is used in training. The experimental results show that our method achieves good qualitative results as well as excellent quantitative results.

Image inpainting uses other information in the image to complete the hole, and the effect of the repair must be invisible to the human eye. This task seems easy to the human brain, but it is extremely difficult for computers. First, there is no unique solution to the problem. Second, how should the other information in the image be used? Third, how can we judge whether the completion result is realistic enough?
Although face photo beautification mainly relies on image inpainting technology, it differs from classic inpainting. Image inpainting means that the repaired region should be exactly the same as the original image. For face photo beautification, however, repair means that the moles and acne are removed, and the repaired skin must be smooth and blend well with the surrounding skin.
At present, image inpainting methods are mainly divided into two categories. One is the classic texture synthesis approach, whose core idea is to sample similar pixel blocks from the undamaged area of the image to fill the area to be completed [3]. The other is the generative model based on neural networks, which encodes the image into a high-dimensional latent feature and then decodes this feature into a repaired full image [1, 2, 4-7]. However, both approaches have limitations in ensuring reasonable semantics and clear textures.
The generative adversarial network (GAN) is a generative model proposed by Goodfellow et al. [8] in 2014 and has become a popular research direction in artificial intelligence. Yann LeCun even called it "the most exciting idea in machine learning in the past decade". GAN has also become the most widely used method in computer vision. In the image inpainting task, GAN-based methods [40-42] use a coarse-to-fine network and a contextual attention module to greatly improve inpainting performance. GAN has three obvious advantages. Firstly, GAN avoids the Markov chain learning mechanism, which distinguishes it from traditional probabilistic generative models: those models generally require Markov chain sampling and inference, whereas GAN avoids this computationally expensive process and performs sampling and inference directly, improving its efficiency and broadening its practical application scenarios. Secondly, GAN is a very flexible design framework. All kinds of loss functions can be integrated into the GAN model, so different loss functions can be designed for different tasks and optimized under the GAN framework. Finally, and most importantly, when the probability density is intractable, generative models that traditionally rely on a natural interpretation of the data cannot be learned or applied. GAN can still be used in this case, because its internal adversarial training mechanism can approximate objective functions that are not easy to calculate. When Goodfellow et al. [8] proposed the GAN framework, they showed, with a rigorous proof, that GAN has a global optimal solution.
They proved that when the generated data distribution is completely consistent with the real data distribution, the optimization function reaches its global minimum; that is, the GAN model can reach the global optimal solution.
The GAN model produces fairly good output through the mutual game between (at least) two modules in the framework: the generative model (GM) and the discriminative model (DM). Training the GM is a process of finding optimal parameters, and these parameter updates come not from the data samples themselves but from the backpropagated gradient of the DM. The training purpose of the DM is to maximize its own discriminative accuracy: when a sample is judged to come from the real data it is labelled 1, and when it is generated it is labelled 0. Inspired by GAN, Pathak et al. [4] proposed in 2016 the use of a deep convolutional neural network (CNN) with GAN to solve the image inpainting task. They use a convolutional encoder-decoder network as the GM, jointly trained with the DM to encourage coherency between generated and real images. More and more researchers have since used or improved GANs, and the emergence of these methods shows that GAN has a good application effect in the field of image inpainting.
Most face improvement applications (Apps) in App stores, such as TiantianPtu (Tiantian), Meituxiuxiu (Meitu), Photo Editor and Mobile Photoshop (Mobile PS), fall into two categories. The first is automatic processing, which can only process the entire image as a whole and cannot automatically process a specific part of it. The second requires manually clicking on the part of the face to be processed, and it is time-consuming and laborious to treat the acne on a face one spot at a time. In addition to these respective shortcomings, the two modes share a common one: they compress the image, so that after compression a 40 MB image becomes less than 10 MB, a very large loss of image quality. The GAN-based repair model proposed here automatically repairs the moles and acne on high-resolution face photos while the rest of the image remains unchanged, and the resolution after repair remains the same as that of the original image. The GAN-based inpainting model is also proposed for use in high-definition digital cameras (HD digital cameras) and Apps, to improve on the traditional algorithms previously used in them. At only 14 MB, the proposed model is well suited to HD digital cameras and Apps. The comparison between the proposed method and different Apps is shown in Figure 1, with the image size, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) given below each image.
In Figure 2, we show a case of repairing an image using the proposed method. The size of the test image is 3456 × 4608. The first row shows the original image, the ground truth image and the repaired image, with the PSNR and SSIM given under the repaired image. The second row shows details of the face, together with their PSNR and SSIM. It is well known that when the PSNR is greater than 30 dB, an image is considered high quality, and an SSIM closer to 1 means the two images are more similar. The PSNR of the sample image reaches 48.13 dB, indicating high quality, and the SSIM reaches 0.9970, indicating that the repaired image is very close to the ground truth image.
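The PSNR figure quoted above follows directly from its definition, PSNR = 10·log10(MAX²/MSE). The sketch below (numpy, with hypothetical toy arrays standing in for real photos) computes it; note that a uniform error of one gray level on 8-bit images yields exactly the 48.13 dB mentioned above. SSIM is more involved and is usually computed with a library such as scikit-image.

```python
import numpy as np

def psnr(ground_truth, repaired, max_val=255.0):
    """Peak signal-to-noise ratio between two images (higher is better)."""
    gt = ground_truth.astype(np.float64)
    rp = repaired.astype(np.float64)
    mse = np.mean((gt - rp) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 1 gray level gives MSE = 1, so PSNR = 20*log10(255)
a = np.zeros((64, 64), dtype=np.uint8)
b = a + 1
print(round(psnr(a, b), 2))  # 48.13
```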
The main advantages and innovations of this paper compared with previous work are listed below:
1. We collected and built a high-resolution human face image dataset (HRHF). We found that current GAN methods can only process photos of size 256 × 256, which is very different from the high-resolution photos taken by mobile phones or HD digital cameras; with existing methods, repairing such photos is far from meeting the requirements of human vision. To solve the problem of repairing high-resolution images, a high-resolution human face image dataset is built for training and testing, and the traditional 256 × 256 data are abandoned.
2. Global average pooling (GAP) layers are used in the network so that the model can process masks of arbitrary size and shape. Most methods use fully connected layers in the network structure and can only repair masks of fixed size and shape; in this paper, GAP layers are used to repair arbitrary masks such as moles and acne.
3. A mixed loss function is designed to optimize the proposed method in the training phase. To make the network converge better during training, the mixed loss function combines an l1 loss and a GAN loss.
4. The experimental results show that the proposed method achieves more advanced results on both a public dataset and the HRHF dataset, and brings good subjective visual quality.
The remainder of this paper is organized as follows. Section 2 briefly reviews previous work closely related to image inpainting. A detailed description of the proposed method is presented in Section 3. The experiments are presented in Section 4, and the conclusion is provided in Section 5.

RELATED WORK
In Section 2.1, we briefly summarize and review work on image inpainting and face photo completion. In Section 2.2, we introduce some commonly used public image inpainting datasets.

Image inpainting
The image inpainting task in computer vision is to complete the missing areas of an image according to the information of the image itself or of an image dataset, so that the repaired image is visually natural. To pursue higher quality, the task requires not only reasonable semantic content but also sufficiently clear texture. Traditional image inpainting algorithms [9-12] mainly include block-based and diffusion-based completion methods. Block-based methods fill the missing areas of the image by finding similar blocks in the image. Tijana et al. [] proposed a best-matching-block method that uses a Markov random field to find missing regions in the image texture components. Based on the annihilation property filter and low-rank structured matrices, Jin et al. [13] proposed another method for finding the best matching block. Other repair methods also bring good results, such as texture-based repair [3] and structure-based repair [14, 15].
The diffusion-based methods fill missing areas by smoothly propagating image content from the border of the hole to its interior. Li et al. [16] proposed a diffusion-based inpainting method that first locates the diffusion of the feature area, then identifies the repair area based on local changes within and between channels, and finally localizes the diffusion of the repair area. Li et al. [17] proposed repairing images using diffusion coefficients, where each coefficient is calculated from the distance and direction between a damaged pixel and its neighbouring pixels. Sridevi et al. [18] proposed another diffusion-based method based on fractional derivatives and the Fourier transform.
However, the block-based and diffusion-based repair methods use low-level image features. They use variational algorithms or block similarity to generate information from the background area of the image and fill it into the area to be repaired. Although these methods perform very well on images with smooth textures, their performance is limited on non-stationary textures. To better model non-stationary texture images, Simakov et al. [19] used an algorithm based on two-way patch similarity. However, this method requires a large amount of block-similarity computation, which inevitably brings greater overhead. To address this, Barnes et al. [20] proposed a fast nearest-neighbour field algorithm.
In recent years, with the strong performance of CNNs in various fields, many researchers have proposed CNN-based network structures for image inpainting. Pathak et al. [4] proposed the context encoder, the first work to use a deep CNN with an encoder-decoder structure to repair large missing areas in an image. Iizuka et al. [5] used dilated convolutions instead of fully connected layers and added global and local discriminators to the context encoder. Shift-Net [6] is a U-Net structure that uses rich structure and texture information for image inpainting. Weerasekera et al. [7] used the depth map of the image as the input of the CNN, and Zhao et al. [21] used image structure to repair X-ray medical images. For unmanned aerial vehicle (UAV) data, Hsu et al. [22] proposed using a VGG network architecture. To repair damaged artwork images, Alilou et al. [23] proposed a non-texture inpainting method, and Liao et al. [24] proposed Artist-Net for the restoration of artwork images. Yu et al. [1] used contextual information to repair the holes in the image, effectively alleviating the structural distortion and texture inconsistency around the holes in generated images.
The emergence of GAN has increased the diversity of image inpainting methods. GAN-based methods use coarse-to-fine networks and contextual attention modules, which makes inpainting performance better. For handwritten images, Li et al. [25] proposed an improved GoogLeNet with a deep convolutional generative adversarial network. Wang et al. [26] used an encoder-decoder network and a multi-scale GAN for image inpainting. For RGB-D images, Vitoria et al. [27] proposed a modified version of [7]. For ocean temperature images, Dong et al. [28] proposed a deep convolutional adversarial network to fill missing areas.
Face photo inpainting also plays an important role in the image inpainting task. Yeh et al. [29] searched for the features most similar to the missing areas in the latent space of the image and completed the restoration by decoding these features. At present, most CNN-based or GAN-based models only accept input images of size 256 × 256. As shown in Figure 3, two image sizes are displayed, with the size of the unrepaired area marked at the bottom of each image. It can be seen that for a 3456 × 4608 image, the average size of the area to be repaired is 120 × 120, while for a 256 × 341 image it is 9 × 9. In other words, the smaller the repair area, the better the repair effect. Therefore, repairing a high-resolution face image is a very difficult challenge.

Dataset
As is known to all, CNN-based or GAN-based methods often require a large amount of training data, and the quality of the training data is critical to model performance. Current image inpainting methods use many large public datasets to evaluate their algorithms and compare performance. These datasets cover a variety of content, including natural images, building images, facial images and many other types. The CelebFaces Attributes dataset (CelebA) [32] is a large-scale facial attribute dataset containing 202,599 face images of 10,177 celebrities, with 5 landmark locations and 40 attribute annotations per image; the images cover large pose variations and background clutter. It is widely used in face attribute recognition, face detection, landmark (or face part) localization, and face editing and synthesis.
The CelebFaces Attributes High Quality dataset (CelebA-HQ) [33] is a high-resolution dataset proposed by Karras et al., an upgraded version of CelebA. It consists of five levels: 64 × 64, 128 × 128, 256 × 256, 512 × 512 and 1024 × 1024. To obtain CelebA-HQ, Karras et al. used two pretrained neural networks to process all the images in CelebA and, by analyzing the quality of all the images, selected the 30,000 of best quality.
The DTD textures dataset [34] is a texture image dataset of 5,640 images, divided into 47 categories according to human perception, with 120 images per category. Resolutions range from 300 × 300 to 640 × 640, and all images were taken from the Google and Flickr websites.
The Paris StreetView dataset [35] comes from Google Street View and mainly contains street images from multiple urban areas around the world. Paris StreetView dataset contains 15,000 images in total. The size of each image is 936 × 537.
The Place dataset [36] is a scene image dataset, which contains multiple scene categories, such as bedrooms, streets, synagogues, canyons etc. The dataset consists of 10 million images, where each scene category contains about 400 images, and each image has a 256 × 256 size.
The Foreground-aware dataset [37] is different from the other datasets: it is used to corrupt images by adding masks with irregular holes. It contains 100,000 masks for training and 10,000 masks for testing. Each mask is a 256 × 256 grayscale image in which a value of 255 represents empty pixels and a value of 0 represents valid pixels.
The USC-SIPI image dataset contains 300 images in total and has four image types: textures, aerials, miscellaneous and sequences. The image resolutions in the dataset are mainly 256 × 256, 512 × 512 and 1024 × 1024.
The Indian Pines dataset [38] contains three scene types: agriculture, forest and natural perennials. The size of each image is 145 × 145.
The Middlebury Stereo dataset has many versions; here we introduce one of them, Middlebury 2006 [39], a dataset of stereo images with depth information. It contains images with different illuminations and exposures captured from 7 views. Image resolution falls into three categories: full size (1240 × 1110), half size (690 × 555) and third size (413 × 370).
With the development of HD digital cameras and mobile photography equipment, image resolutions are getting higher and higher; a current mobile phone or HD digital camera can capture a 4000 × 6000 image. The existing datasets therefore struggle to support the face photo beautification task, and a high-resolution human face image dataset is urgently needed. To address this lack of high-resolution data, we have produced a dataset called HRHF, containing about 10,000 high-resolution face images with sizes from 1000 × 1000 to 3648 × 6528. The details of HRHF are given in Section 3.1.

THE PROPOSED METHOD
The proposed method is an improvement of the method of [1], with four main differences. Firstly, the ground truth is different. In [1], the repaired image should be identical to the original image; that is, the original image itself is the ground truth. In the proposed method, the repaired image is different from the original: it should be close to a ground truth image in which people have used Photoshop to remove the moles and acne on the face. Secondly, the image resolution is different. The dataset resolution used in [1] (256 × 256) is much smaller than that of the HRHF dataset; our method can handle high-resolution data in both training and testing. Thirdly, the GAP layer is used to process masks of arbitrary shape and size, whereas [1] uses a fully connected layer and can only process masks of fixed size and shape. Fourthly, the loss function is different. The proposed method uses a loss defined over the global image and does not consider a separate loss for the repair area, while the loss of [1] includes a local loss over the repair area. Sections 3.1 (High-resolution human face image dataset), 3.3 (Global average pooling layer) and 3.4 (Loss function) give a detailed introduction to these choices, and Section 3.2 describes the overall framework of the proposed method.

High-resolution human face image dataset
To address the lack of high-resolution human face data, our research team built a dataset containing about 10,000 high-resolution face images, which we call the HRHF dataset.
The HRHF dataset was not collected in a single session but over ten collection batches; the quantity of each batch is listed in Table 1. In total, 10,904 images of 1,199 people were collected. Each person was asked to be photographed at three distances from the camera, with five different facial poses at each distance: frontal face, head up, head down, left side and right side. Samples of the five poses are shown in Figure 4. Not every person has all 15 photos, and some images were of low quality or damaged, so those images were discarded. In addition to the more than 10,000 high-resolution face images, we also labelled each image: each has a binary mask image and a ground truth image. To obtain the binary mask image, we performed the following steps.
Firstly, a high-resolution photo (Image_A) is taken with a mobile phone. Image_A is called the original image in this paper.
Secondly, Photoshop is used to edit Image_A to remove the moles or acne that the subject is not satisfied with. The edited photo is Image_B, which is considered the ground truth image in the HRHF dataset.
Thirdly, we obtain the differing pixels between the two images as the pixel-wise absolute difference Image_diff = |Image_A − Image_B|.
Finally, each pixel in Image_diff is judged: if its value is 0, it remains unchanged; if it is not 0, it is set to 1. As shown in Figure 5, some samples with masks are listed; the left side is the original image and the right side is the corresponding binary mask.
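The mask construction described in the steps above can be sketched in a few lines of numpy; the tiny 2 × 2 arrays here are hypothetical stand-ins for the real Image_A and Image_B, which would be loaded from disk.

```python
import numpy as np

# Hypothetical arrays standing in for the original photo (Image_A) and the
# Photoshop-edited ground truth (Image_B).
image_a = np.array([[10, 10], [10, 200]], dtype=np.int16)
image_b = np.array([[10, 10], [10, 180]], dtype=np.int16)

# Pixel-wise absolute difference, then binarize: 0 stays 0, nonzero becomes 1.
image_diff = np.abs(image_a - image_b)
mask = (image_diff != 0).astype(np.uint8)
print(mask.tolist())  # [[0, 0], [0, 1]]
```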
In Table 2, we list the advantages of the HRHF dataset compared with the Places2, CelebA, CelebA-HQ and DTD textures datasets. In the table, "Human face" indicates whether the dataset contains human face images ("Y" if so, "N" if not), "Mask label" indicates whether moles or acne are labelled in the dataset, and "High-resolution" indicates whether the images in the dataset exceed 3000 × 6000 pixels.

The architecture of our model
As far as we know, the model of [1] has already achieved good results on image inpainting, especially for faces, buildings and natural images. Inspired by [1], we first built our basic network structure and then improved it; the improvements are described in detail in the rest of this section. The proposed network mainly repairs acne and moles on human faces in high-resolution images. Its detailed structure is shown in Figure 6, where the meaning of each colour block is indicated in the lower right corner. As can be seen from the figure, the proposed network contains three stages: Stage1, Stage2 and Stage3. The input of Stage1 is the original image and the mask (for details on obtaining the mask, see Section 3.1).
Image preprocessing plays an important role. Unlike methods that scale the image, we crop it; the purpose of cropping is to preserve the size of the area to be repaired. Assuming the image to be repaired is 3000 × 6000, we crop it with a stride of 512 into 768 × 768 patches, and crop the mask in the same way. After preprocessing, the original image and the mask are each cropped into multiple 768 × 768 patches, and each original patch is merged with its mask patch into one image with the hole to be repaired.
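One plausible way to implement this sliding-window cropping is sketched below in numpy. Clamping the last window to the image border, so the whole image is covered without any scaling, is our assumption; the paper's exact tiling may differ.

```python
import numpy as np

def crop_patches(image, patch=768, stride=512):
    """Slide a patch x patch window over the image with the given stride.
    The last window along each axis is clamped to the image border so the
    full image is covered without scaling (a sketch, not the paper's code)."""
    h, w = image.shape[:2]
    ys = list(range(0, max(h - patch, 0) + 1, stride))
    xs = list(range(0, max(w - patch, 0) + 1, stride))
    if ys[-1] != h - patch:
        ys.append(h - patch)  # clamp final row of windows to the bottom edge
    if xs[-1] != w - patch:
        xs.append(w - patch)  # clamp final column of windows to the right edge
    return [image[y:y + patch, x:x + patch] for y in ys for x in xs]

img = np.zeros((3000, 6000, 3), dtype=np.uint8)
patches = crop_patches(img)
print(len(patches), patches[0].shape)
```

The mask would be passed through the same function so that each image patch keeps its matching mask patch.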
Stage1 is a coarse prediction network. It consists of 11 convolution layers, 4 dilated convolution layers and 2 deconvolution layers. Among the dilated convolution layers, the dilation rates are 2, 4, 8 and 16, respectively; the purpose of using dilated convolution is to expand the receptive field of the network. Stage2 is a fine prediction network. It consists of 17 convolution layers, 4 dilated convolution layers, 2 transposed convolution layers, 1 concat layer and an attention module, and is used to enhance the output of Stage1 and improve the repair performance of the network. As can be seen from Figure 6, Stage2 extracts image features through an upper and a lower branch, merges the two extracted features with the concat layer, and finally restores the fused features to an image of the input size. The attention module in Stage2 has the same structure as in [1]; its purpose is to improve the accuracy of feature extraction.
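The effect of the dilation rates 2, 4, 8 and 16 on the receptive field can be checked with a little arithmetic: for stride-1 convolutions, each k × k layer with dilation d adds (k − 1)·d pixels of context per axis. The sketch below (assuming 3 × 3 kernels, which the paper does not state explicitly) compares the dilated stack against ordinary convolutions.

```python
def receptive_field(kernel=3, dilations=(2, 4, 8, 16)):
    """1-D receptive field of a stack of stride-1 dilated convolutions.
    Each layer with dilation d widens the field by (kernel - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Four 3x3 layers with rates 2, 4, 8, 16 alone span a 61-pixel window,
# versus only 9 pixels for four ordinary (dilation-1) 3x3 convolutions.
print(receptive_field())                        # 61
print(receptive_field(dilations=(1, 1, 1, 1)))  # 9
```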
Stage3 is a global discriminator, whose function is to distinguish whether the repaired image is natural. The inputs of the global discriminator are the ground truth image and the repaired image; the ground truth images in the HRHF dataset are the original images that people have edited with Photoshop to remove unwanted moles or acne. Stage3 participates only in training, not in the testing phase.
Training: As shown in Figure 6, in the training phase, an image of size 1 × 3456 × 4608 × 3 (batch size × width × height × channels) and a mask of the same size are input. After preprocessing, the image is cropped into 45 patches of size 1 × 768 × 768 × 3 each, and the mask is cropped in the same way. Each pair of image and mask patches is merged to generate a hole image; that is, the place on the face that needs repair becomes a black hole with value 0. The face hole image is sent through 12 successive layers of Stage1 (6 convolution layers, 4 dilated convolution layers and 2 convolution layers), which encode the 1 × 768 × 768 × 3 image into a 1 × 192 × 192 × 128 feature map. This feature map is then decoded back to a 1 × 768 × 768 × 3 preliminary repaired image through 5 further layers (1 deconvolution layer, 1 convolution layer, 1 deconvolution layer and 2 convolution layers). The preliminary repaired image is sent to Stage2, which is divided into two branches. The upper branch (6 convolution layers, 4 dilated convolution layers) encodes the preliminary repaired image into a 1 × 192 × 192 × 128 feature map, and the lower branch (6 convolution layers, 1 attention module and 2 convolution layers) likewise produces a 1 × 192 × 192 × 128 feature map. The two feature maps are merged into a 1 × 192 × 192 × 256 feature map by the concat layer, which is then encoded into a 1 × 192 × 192 × 128 feature map by two further convolution layers.
The remaining layers of Stage2 (1 deconvolution layer, 1 convolution layer, 1 deconvolution layer and 2 convolution layers) decode the 1 × 192 × 192 × 128 feature map back to 1 × 768 × 768 × 3. This second repaired image is taken as the predicted image. Finally, the predicted image and the ground truth image are both sent to Stage3, the discriminator, where features are extracted through 4 convolution layers and a global average pooling layer. The discriminator analyzes the feature representations of the predicted image generated by Stage2 and of the ground truth image; when it judges both to be ground truth images, training stops, and in any other situation training continues.
Testing: The proposed algorithm is a GAN-based network model, in which Stage1 and Stage2 form the generator and Stage3 is the discriminator. In the testing stage, we only need to input an image with holes to be repaired into Stage1 and Stage2; the output of Stage2 is the final predicted image. Stage3 does not participate in testing, and the ground truth images are not used to predict the repaired image; they are used only in the training phase.

Global average pooling layer
The proposed network can process masks with arbitrary position, size and shape. The network achieves this because we replaced the last fully connected (FC) layer with a global average pooling (GAP) layer. GAP first appeared in [44], where it replaced the last FC layer, directly achieving dimensionality reduction and, more importantly, greatly reducing network parameters. Much later work continued to use GAP, and experiments have proven that GAP can indeed improve the ability of a CNN to extract features. In the proposed method, the reasons for using the GAP layer are as follows. Firstly, fully connected layers have long been the standard ending of CNN classification networks: the FC layer stretches the feature map from the last convolution layer into a vector, multiplies it by a weight matrix to reduce its dimension, and finally feeds it into a softmax layer to obtain per-class scores. The GAP layer can likewise reduce the dimensionality of the feature map.
Secondly, because the FC layer has too many parameters, it easily causes over-fitting. The GAP layer only performs an average over each channel of the feature map and introduces no parameters, while the length of its output vector remains the same as that of the FC layer. Therefore, GAP not only avoids the over-fitting risk of full connection but also achieves the same conversion function as the FC layer.
Thirdly, if there is an FC layer in the network, the input must have a fixed size and shape, whereas the GAP layer imposes no size or shape requirements on the input. Using a GAP layer therefore allows the network to accept masks of arbitrary shape.
Fourth, as is well known, the pooling layer in CNNs performs down-sampling: retaining significant features, reducing feature dimensions and increasing the receptive field of the kernel. The deeper the network, the more semantic information it captures, and this semantic information relies on a large receptive field. GAP retains the advantages of the traditional pooling layer but differs from average pooling: average pooling averages sub-regions of the feature map as its window slides, while GAP averages the entire feature map.
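The two properties discussed above, shape independence and the absence of parameters, are easy to see in code. The sketch below uses numpy with illustrative layer sizes (not claimed to match the paper's exact architecture): GAP reduces any H × W × C feature map to a C-vector, and the weight count an FC layer would need for the same reduction is enormous.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average each channel over all spatial positions: (H, W, C) -> (C,).
    Works for any H and W, which is why the network can accept inputs of
    arbitrary size, unlike an FC layer with a fixed input length."""
    return feature_map.mean(axis=(0, 1))

small = np.random.rand(24, 24, 128)
large = np.random.rand(192, 192, 128)
print(global_average_pool(small).shape, global_average_pool(large).shape)

# Illustrative parameter count: an FC layer from a flattened 192x192x128
# map down to 128 outputs would need 192*192*128*128 weights; GAP needs none.
fc_params = 192 * 192 * 128 * 128
print(fc_params)
```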
Note that using GAP requires setting the average pooling window size equal to the size of the feature map. At the same time, it should be noted that using a GAP layer may slow network convergence: the model of [1] trains in 5 days, while our model took 14 days to converge.

Loss function
The loss function is crucial for training the model. The designed loss function can be divided into three stages, combined as in Equation (2): L_stage1 is the reconstruction loss of Stage 1, L_stage2 is the reconstruction loss of Stage 2, and L_(stage2+stage3) is a Wasserstein GAN loss,

L = λ1 · L_stage1 + λ2 · L_stage2 + λ3 · L_(stage2+stage3),  (2)
where λ1, λ2 and λ3 are adjustable weights; in our experiments, λ1, λ2 and λ3 are all set to 1.2. L_stage1 and L_stage2 take the same form,

L_stage1 = (1/N) Σ_{i=1}^{N} (1/M) Σ_{j=1}^{M} | GT(i_(x_j, y_j)) − G_stage1(i_(x_j, y_j)) |,

where (x_j, y_j) is a pixel coordinate and i indexes the i-th region, N is the number of unrepaired regions and M is the number of pixels in each unrepaired region. G_stage1 and G_stage2 denote the output data of Stage 1 and Stage 2, respectively, and GT is the ground truth image.
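The reconstruction terms above amount to a mean absolute error accumulated over regions. A minimal sketch under that reading (the region-list data layout is our assumption for illustration, not the paper's actual code):

```python
# Sketch of the stage-wise L1 reconstruction term: mean absolute
# difference between generator output and ground truth, averaged first
# over the M pixels of each region, then over the N regions.

def l1_reconstruction_loss(regions_pred, regions_gt):
    """Average |GT - G| over N regions of M pixels each."""
    n = len(regions_pred)
    total = 0.0
    for pred, gt in zip(regions_pred, regions_gt):
        m = len(pred)
        total += sum(abs(g - p) for p, g in zip(pred, gt)) / m
    return total / n

pred = [[0.5, 0.5], [1.0, 0.0]]   # two regions, two pixels each
gt   = [[1.0, 0.0], [1.0, 1.0]]
print(l1_reconstruction_loss(pred, gt))  # 0.5
```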
where L_Global is defined in Equation (6). In our experiments, λ is set to 10.
where G_GT is the fused data, obtained by fusing G_stage2 and GT; Concat is the concat layer in TensorFlow.
G_GT ∼ P_r denotes that the data G_GT obeys the distribution P_r, and G̃_GT ∼ P_g means that the generated data G̃_GT obeys the distribution P_g. Further, Ĝ_GT = ε · G_GT + (1 − ε) · G̃_GT with ε ∼ Uniform[0, 1], so Ĝ_GT is sampled along the straight line between samples from P_r and P_g.
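Equation (6) itself is not reproduced in this excerpt, but the interpolation Ĝ_GT described above matches the standard WGAN-GP formulation, under which the global adversarial loss would read (a hedged reconstruction under that assumption, not the paper's verbatim equation):

```latex
L_{Global} = \mathbb{E}_{\tilde{G}_{GT} \sim P_g}\!\big[D(\tilde{G}_{GT})\big]
           - \mathbb{E}_{G_{GT} \sim P_r}\!\big[D(G_{GT})\big]
           + \lambda \, \mathbb{E}_{\hat{G}_{GT}}\!\Big[\big(\lVert \nabla_{\hat{G}_{GT}} D(\hat{G}_{GT}) \rVert_2 - 1\big)^2\Big],
\qquad \lambda = 10.
```

The gradient-penalty coefficient λ = 10 matches the value the text reports for its experiments.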
Unlike [1], the designed loss function does not compute a separate loss over only the repaired region. The reason is that when we added a loss term on the repaired region, the training loss value was always NaN. Therefore, we only calculate the loss between the whole repaired image and the whole ground truth image.

EXPERIMENTS AND RESULTS
In order to reduce the impact of the environment on the experiments, all experimental models use the same software and hardware environment in both training and testing. Models run on a server with an Intel i5-6600K (3.

Qualitative results
Qualitative comparison results are shown in Figure 7, where we annotate only the areas with visible moles. In each group, (a) is the original image (a test image to be repaired) and (b) is the image repaired by the proposed method; (c) is the region repaired by [1] and (d) is the same region repaired by the proposed method. It can clearly be seen that the image repaired by method [1] has obvious visual defects, and such visual effects are not satisfactory, whereas the proposed method achieves a very good fusion of the repaired area with the surrounding skin and a good overall visual effect. More examples of our results are shown in Figure 8.

Quantitative evaluation metrics
As is well known, image generation tasks lack good quantitative evaluation indicators. In order to better evaluate the proposed method, we use five objective metrics: mean l1 loss, mean l2 loss, mean PSNR, mean TV loss and mean SSIM. Assume that there are M images to be repaired in total, each of size w × h × 3, and let P_i(x, y) and G_i(x, y) denote the pixel values of the i-th predicted image and ground truth image, respectively. The mean l1 loss is the average of the absolute deviations between the predicted image and the ground truth image; because absolute errors cannot cancel each other, it accurately reflects the error between the two images:

mean l1 = (1/M) Σ_{i=1}^{M} (1/(3wh)) Σ_{x,y} | P_i(x, y) − G_i(x, y) |.

The mean l2 loss measures the degree of difference between the predicted image and the ground truth image:

mean l2 = (1/M) Σ_{i=1}^{M} (1/(3wh)) Σ_{x,y} ( P_i(x, y) − G_i(x, y) )².

The PSNR is the most common and widely used objective image-quality index; it is based on the error between corresponding pixels, i.e. error-sensitive image quality evaluation:

PSNR_i = 10 log10( 255² / ( (1/(3wh)) Σ_{x,y} ( P_i(x, y) − G_i(x, y) )² ) ).
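These three metrics can be sketched for a single image pair as follows, assuming flat lists of 8-bit pixel values in [0, 255] (an illustrative layout, not the paper's evaluation code):

```python
import math

# Pixel-level metrics for one predicted/ground-truth image pair,
# each given as a flat list of 8-bit pixel values.

def mean_l1(pred, gt):
    """Mean absolute error between the two images."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def mean_l2(pred, gt):
    """Mean squared error between the two images."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)

def psnr(pred, gt):
    """Peak signal-to-noise ratio for 8-bit images (peak value 255)."""
    mse = mean_l2(pred, gt)
    if mse == 0:
        return float("inf")   # identical images
    return 10 * math.log10(255 ** 2 / mse)

p = [100, 102, 98, 100]
g = [100, 100, 100, 100]
print(mean_l1(p, g))  # 1.0
print(mean_l2(p, g))  # 2.0
print(psnr(p, g))     # roughly 45 dB for this small error
```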
Rudin et al. [43] proposed the TV loss as a loss function. They found that the TV loss of noisy images is greater than that of noise-free images; therefore, the TV loss is often used in image restoration tasks because it encourages smoothness of the restored image. The TV loss can also be used to evaluate restored images, and in our experiments it quantitatively measures the smoothness of the predicted image.
The exponent in the TV loss is a constant value; in our experiments, it is set to 2.
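One common reading of the TV measure, with the exponent exposed as the constant the text sets to 2 (an illustrative sketch on our part, since Rudin et al.'s exact equation is not reproduced here):

```python
# Total-variation measure for a grayscale image stored as a 2-D list:
# sum of neighbouring-pixel differences raised to a constant exponent.
# Smooth images give small values; noisy images give large ones.

def tv_loss(img, exponent=2):
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:   # horizontal neighbour difference
                total += abs(img[y][x + 1] - img[y][x]) ** exponent
            if y + 1 < h:   # vertical neighbour difference
                total += abs(img[y + 1][x] - img[y][x]) ** exponent
    return total

smooth = [[1, 1], [1, 1]]
noisy  = [[0, 9], [9, 0]]
print(tv_loss(smooth))  # 0.0 -- constant image has no variation
print(tv_loss(noisy))   # 324.0 -- noise sharply raises the TV value
```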
The SSIM measures the structural similarity between the predicted image and the ground truth image; it is not affected by changes in contrast and brightness, and its value range is [0, 1]. The SSIM is expressed as

SSIM(P_i, G_i) = ( (2 μ_{P_i} μ_{G_i} + c1)(2 σ_{P_i G_i} + c2) ) / ( (μ_{P_i}² + μ_{G_i}² + c1)(σ_{P_i}² + σ_{G_i}² + c2) ),

where μ_{P_i}, σ_{P_i}, μ_{G_i} and σ_{G_i} are the means and standard deviations of the images P_i and G_i, respectively, σ_{P_i G_i} is their covariance, and c1 and c2 are constants.

Tables 3 and 4 show the results of different parameter settings used when designing the loss function. HRHF_1 and HRHF_2 are the test datasets in the HRHF dataset; in Tables 3 and 4, the masks of HRHF_1, HRHF_2 and CelebA-HQ mark the positions of moles and acne. Table 3 shows the results for different values of λ1, λ2 and λ3 in Equation (2); according to these results, λ1, λ2 and λ3 are all set to 1.2. Table 4 shows the results for different values of λ in Equation (5); according to these results, λ is set to 10. In all tables, the best results are shown in bold. Table 5 shows the results for the final combination of all optimal parameters, tested on the HRHF_1, HRHF_2 and CelebA-HQ datasets; the masks for CelebA-HQ are the same as those in Tables 3 and 4.
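The SSIM formula can be sketched as a single global-window computation (real evaluations typically use a sliding window, as in the original SSIM paper; this simplified single-window version is ours):

```python
# Global (single-window) SSIM between two images given as flat lists of
# pixel values in [0, 255]. c1 and c2 use the conventional k1 = 0.01,
# k2 = 0.03 defaults -- an assumption, since the paper does not state them.

def ssim(p, g, L=255, k1=0.01, k2=0.03):
    n = len(p)
    mu_p = sum(p) / n
    mu_g = sum(g) / n
    var_p = sum((x - mu_p) ** 2 for x in p) / n
    var_g = sum((x - mu_g) ** 2 for x in g) / n
    cov = sum((a - mu_p) * (b - mu_g) for a, b in zip(p, g)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))

img = [10, 50, 90, 130]
print(ssim(img, img))                  # 1.0 -- identical images
print(ssim(img, [0, 0, 0, 0]) < 1.0)   # True -- dissimilar images score lower
```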

Compared with other methods in quantitative results
In order to better compare with other algorithms, we verify the performance of the model on the Places2 and CelebA-HQ databases. The test images and masks on Places2 are the same as in [1], and those on CelebA-HQ are the same as in [49]. The comparison results are shown in Tables 6 and 7.
As listed in Table 6, the proposed method is compared with a traditional method based on a fast nearest-neighbour field algorithm and two GAN-based methods; the three methods are described in detail in Section 2. It can be seen that our algorithm achieves the best performance in l1 loss, l2 loss and PSNR, with results of 8.3%, 2.0% and 18.95, respectively. Although the TV loss is not as strong, the qualitative results show that the algorithm is effective. As listed in Table 7, the proposed method is compared with state-of-the-art methods, and the results show that it achieves the best accuracy: the l1 loss, l2 loss, PSNR and SSIM are 2.53%, 0.09%, 30.76 and 0.9676, respectively.

CONCLUSION
By surveying the literature, we found that existing GAN-based image inpainting methods did not propose a repair method for high-resolution face images, so we propose a GAN-based method to repair high-resolution face photos. In the process of repairing high-resolution face images, three problems need to be solved. The first is that there is no existing public high-resolution face dataset for training and testing. The second is that past inpainting methods repair only fixed masks, which is inconsistent with practical applications. The third is that the loss function needs to be designed to meet the needs of image inpainting. To solve the first problem, a high-resolution face dataset was built, containing both images to be repaired and ground truth images. To solve the second, a GAP layer is used in place of the FC layer, so that the network can handle masks of arbitrary shape and size. To solve the third, a mixed loss function is used as the global loss function. Academic research on image inpainting should not be limited to low-resolution restoration, but should consider the resolutions of real images, especially given today's rapid technological development.

Of course, the proposed algorithm has some limitations. It performs well only on face repair; for other scenes, such as natural scenery and city buildings, the model needs to be retrained, and without retraining it cannot achieve a good repair effect. In future work, we will try to modify the algorithm so that a single model can achieve good results on any image inpainting task.