Object detection method on station logo with single shot multi‐box detector

In this work, the authors design an object detection method for TV station logos with a convolutional neural network, exploiting characteristics of the logo such as its small variation in scale-to-height ratio and its relatively fixed position. To realise the pre-processing and feature extraction of the station logo data, they collect video samples and then filter, frame-extract, label and process these samples. The sample data are then divided proportionally into training and test sets to train the station logo detection model. After that, the test samples are used to evaluate the effect of the trained model in practice. Simulation experiments prove its validity.


Introduction
The TV station logo identifies the name of a TV station; it carries important information about the programme and is the unique mark that distinguishes different TV stations [1][2][3]. To claim ownership of a video, the station logo is added to the video.
Compared with manual identification of TV station logos, intelligent identification by computer not only saves time but also reduces errors and improves accuracy [4]. Therefore, station logo detection, as an auxiliary means of public opinion supervision, has strong engineering significance and social value and warrants in-depth research.
In recent years, the identification of TV station logos has been a hot issue in many universities and research institutions, with research mainly focusing on how to describe station logos and how to identify them.
In describing the TV station logo, the quality of logo description, extraction, analysis and comparison determines the effect of logo recognition. Therefore, the description of the station logo is the first and most critical step in logo detection. The existing standard feature analysis algorithms include those based on the colour histogram [5], the ordinary Hu invariant moment [6], the weighted Hu invariant moment [7], the spatial distribution histogram [8] and so on. Logo detection based on the colour histogram exploits the different colour tones of different types of logos. The described logo is matched against the pre-processed histograms in a reference library of logo templates, and the template with the smallest distance is taken as the detected logo. An obvious disadvantage of this algorithm is that it is sensitive to logos of similar colours and performs poorly on transparent logos. The logo detection algorithm using the ordinary Hu invariant moment identifies and matches logos mainly by their shape characteristics. Owing to background clutter and noise, the average accuracy of this method is only ∼56%. The weighted Hu invariant moment increases the detection accuracy to 76.7%, but the effect is still not satisfactory. The algorithm combining the spatial distribution histogram with the colour histogram in hue-saturation-value (HSV) space relies on traversal recognition, so increasing the number of logos in the knowledge base reduces computational efficiency, and the detection of translucent logos remains poor.
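As a concrete illustration of the histogram-matching idea described above, the following sketch matches a query colour histogram against a small template library by minimum distance. The bin layout, the L1 distance and the template entries are illustrative assumptions, not details taken from the cited algorithms.

```python
# Hedged sketch: nearest-distance matching of a colour histogram against
# a template library, as in histogram-based logo detection. Histograms
# and template names below are made-up examples.

def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def match_logo(query_hist, template_library):
    """Return the template name whose histogram is closest to the query."""
    best_name, best_dist = None, float("inf")
    for name, hist in template_library.items():
        d = l1_distance(query_hist, hist)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist

library = {
    "cctv13": [0.60, 0.25, 0.10, 0.05],   # hypothetical 4-bin hue histograms
    "hnws":   [0.10, 0.15, 0.55, 0.20],
}
name, dist = match_logo([0.58, 0.27, 0.10, 0.05], library)
```

This also makes the stated weakness visible: two logos with similar colour distributions would produce nearly equal distances, so the minimum-distance decision becomes unreliable.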
In research on logo identification, the theory mainly focuses on the following aspects. Firstly, logo detection based on a database template knowledge base and the maximum statistical probability criterion [9]. The knowledge base uses offline storage to hold information such as logo feature vectors and channel names. The maximum statistical probability criterion uses a traversal recognition algorithm to compare each key frame read with the knowledge base templates, but the computational efficiency of this method decreases as the number of logos in the knowledge base increases. Secondly, the extracted logo feature data are compared with the data in the template database, and the calculated difference is compared with a threshold. Since such recognition algorithms usually compare the colour or spatial features of the logo, the principle is relatively simple and the amount of calculation is small; however, they are susceptible to background and noise. Thirdly, classification and recognition methods based on support vector machines [10]. These have obvious advantages and strong pertinence for small samples, high-dimensional pattern recognition and non-linear problems, and can be widely applied to other problems such as function fitting.
In research on station logo detection, Wang et al. [11] proposed an image segmentation method that determines the segmented identification area from the geometric and positional features of the logo. In this method, a representative frame sequence is selected by time-domain sampling, and edge matching is then performed by computing the gradient of the representative frame sequence. Shi et al. [12] combined the spatial distribution histogram with the colour histogram of HSV space to describe the characteristics of the logo, and used a knowledge base to assist the histogram statistics method in identifying the logo. The algorithm adopts traversal recognition, so its computational efficiency gradually decreases as the number of logos in the knowledge base increases, and its recognition of semi-transparent logos is not ideal. Ozay et al. [13] proposed a logo detection method based on time-averaged boundaries, which performed well in extracting opaque and translucent logos but lacked a detailed description of the algorithm for obtaining individual logos. Yang et al. [14] introduced a polar coordinate point-pair matching method; although it better solves the problem of semi-transparent logo detection and greatly improves the accuracy rate, the real-time performance of the algorithm is poor.
In this paper, we propose a station logo detection algorithm based on a deep convolutional neural network [3] and the object detection tool single shot multibox detector (SSD). Firstly, we construct a deep learning experimental environment and optimise the algorithm parameters. Secondly, the algorithm is trained on diversified station logo data to build a model. Finally, after reaching the pre-determined detection accuracy, the model is applied to the real application environment.

Detection algorithm based on convolutional neural network
The core of the proposed algorithm is to predict object locations and score category classifications in a single pass, which makes detection very fast. Moreover, the algorithm uses small convolution kernels on the feature maps to predict the box offsets of a series of bounding boxes, giving a high detection success rate and strong robustness.

Network structure of algorithm
SSD is based on a feed-forward CNN [15] that produces a series of fixed-size bounding boxes, each scored for the presence of object instances. Non-maximum suppression is then applied to obtain the final predictions [16]. Fig. 1 is the framework diagram of the SSD algorithm. Firstly, layers 1-5 of the visual geometry group network (VGG-NET) form the base of the framework. Then several convolution layers are added, namely conv4-3, conv6, conv7, conv8-2, conv9-2 and conv10-2. Finally, after feature extraction through the Pool 11 pooling layer, non-maximum suppression yields the final detection results.
SSD is a pyramid detection network. Since the spatial size of the feature maps is reduced layer by layer, different features of the image can be obtained as the input image passes through the network. For example, a 300×300 image is convolved through the different convolution layers of the network into feature maps of different sizes, such as 256×256, 128×128, 64×64, 32×32 and so on, which maximises the extraction of key features.
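The shrinking pyramid of feature-map resolutions can be sketched as repeated stride-2 downsampling; the starting size and number of levels below are illustrative, not the exact layer schedule of the network:

```python
# Hedged sketch: repeated stride-2 pooling halves the spatial size of a
# feature map, producing a pyramid of resolutions as described above.

def pyramid_sizes(size, levels):
    """Successive 2x downsampling; ceil division keeps odd sizes valid."""
    sizes = [size]
    for _ in range(levels):
        size = -(-size // 2)  # ceiling of size / 2
        sizes.append(size)
    return sizes

sizes = pyramid_sizes(256, 3)  # 256 -> 128 -> 64 -> 32
```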

Algorithm principle
The SSD training objective function is derived from the objective function of the multiple detection box approach [17]. Let $x_{ij}^p = 1$ indicate that default box $i$ matches the labelled object box $j$ of category $p$; otherwise $x_{ij}^p = 0$ indicates no match. Under this matching strategy, $\sum_i x_{ij}^p \ge 1$, which means that multiple default boxes may match the labelled object box $j$.
The total objective loss function is the weighted sum of the localisation loss (loc) and the confidence loss (conf) [18]:

$$L(x, c, l, g) = \frac{1}{N}\bigl(L_{\text{conf}}(x, c) + \alpha L_{\text{loc}}(x, l, g)\bigr)$$

where $N$ is the number of default boxes matched to labelled object boxes. The localisation loss is the Smooth L1 loss of Fast R-CNN, which regresses the parameters of the predicted box ($l$), namely the centre coordinates, width and height, against the ground-truth box ($g$); the confidence loss is the softmax loss over the per-class confidences $c$; and the weight term $\alpha$ is set to 1.

Confidence loss (conf):

$$L_{\text{conf}}(x, c) = -\sum_{i \in \text{Pos}}^{N} x_{ij}^{p} \log\bigl(\hat{c}_i^{p}\bigr) - \sum_{i \in \text{Neg}} \log\bigl(\hat{c}_i^{0}\bigr), \qquad \hat{c}_i^{p} = \frac{\exp\bigl(c_i^{p}\bigr)}{\sum_{p} \exp\bigl(c_i^{p}\bigr)}$$

Localisation loss (loc):

$$L_{\text{loc}}(x, l, g) = \sum_{i \in \text{Pos}}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \text{smooth}_{L1}\bigl(l_i^{m} - \hat{g}_j^{m}\bigr)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where $g^{cx}, g^{cy}, g^{w}, g^{h}$ represent the ground-truth box, $d^{cx}, d^{cy}, d^{w}, d^{h}$ the default box, and $l^{cx}, l^{cy}, l^{w}, l^{h}$ the offsets of the predicted box relative to the default box. Generally speaking, different layers of a CNN have different receptive fields [19]. However, in the SSD structure, the default boxes do not need to correspond to the receptive field of each layer; instead, a specific feature map is responsible for objects of a certain scale in the image [20, 21]. On each feature map, the scale of the default boxes is calculated as:

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where $s_{\min} = 0.2$, $s_{\max} = 0.9$, and the aspect ratios of the default boxes are $a_r \in \{1, 2, 3, 1/2, 1/3\}$. For aspect ratio 1, an additional default box of scale $s_k' = \sqrt{s_k s_{k+1}}$ is added. The width and height of each default box are

$$w_k^{a} = s_k \sqrt{a_r}, \qquad h_k^{a} = \frac{s_k}{\sqrt{a_r}} \qquad (10)$$

and its centre is placed at $\bigl(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\bigr)$, where $|f_k|$ is the size of the $k$-th feature map. After matching, many of the default boxes are negative samples, which would leave positive and negative samples imbalanced and make training difficult to converge. Therefore, in this paper the negative samples are sorted by confidence, and the highest-ranked ones are selected so that the ratio of negative to positive samples is 3:1.
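The default-box geometry described above can be sketched in a few lines. This is a minimal illustration of the scale and aspect-ratio formulas, not the authors' implementation; the treatment of the extra aspect-ratio-1 box on the last feature map is an assumption.

```python
import math

# Hedged sketch of default-box geometry: scales interpolate linearly
# between s_min and s_max over the m feature maps, and each aspect
# ratio a_r gives a box of width s_k*sqrt(a_r) and height s_k/sqrt(a_r).
# All values are fractions of the input image size.

S_MIN, S_MAX = 0.2, 0.9

def scale(k, m):
    """s_k for feature map k in 1..m."""
    return S_MIN + (S_MAX - S_MIN) * (k - 1) / (m - 1)

def default_boxes(k, m, aspect_ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """(width, height) pairs for feature map k, plus the extra
    sqrt(s_k * s_{k+1}) box used for aspect ratio 1 (for the last
    feature map the extra scale is taken as 1.0, an assumption)."""
    s_k = scale(k, m)
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    s_extra = math.sqrt(s_k * scale(k + 1, m)) if k < m else 1.0
    boxes.append((s_extra, s_extra))
    return boxes
```

Note that every box for feature map $k$ has the same area $s_k^2$; only its shape changes with the aspect ratio.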

Algorithm design
The variation of the station logo is relatively small and its position is fixed. Considering the speed and precision of detection, the main network of VGG-NET [22] is modified by adding three different convolution layers, and a pooling layer is added after each convolution layer on top of layers 1-5 of the original VGG-NET. After sorting the initial 10,030 pictures and compressing them to a size of 300×300, it was found that the size of the station logo is basically between 45×28 and 60×35, so a large number of additional convolution layers is unnecessary. The modified network structure is shown in Fig. 2. After features are extracted through the SSD convolution pyramid, non-maximum suppression is applied to each feature map.

Network level design
It has been shown that, within a certain range, deeper networks with more levels perform better. To make the network thinner and deeper, the structure uses two 3×3 convolution kernels in place of one 5×5 kernel, and three 3×3 kernels in place of one 7×7 kernel. Fig. 3 shows the three-channel colour input image 'BTV' of size 300×300. The parameters of each layer of the network are shown in the diagram.
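The receptive-field equivalence motivating this substitution can be checked numerically. The helper below is the standard receptive-field recursion for stacked convolutions, assuming stride 1 as in the layers described here:

```python
# Hedged sketch: receptive field of stacked convolutions, showing why
# two 3x3 layers cover the same 5x5 region as one 5x5 kernel, and three
# 3x3 layers the same 7x7 region as one 7x7 kernel (all strides 1).

def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked conv layers: r grows by (k-1)*jump
    per layer, where jump is the cumulative stride."""
    strides = strides or [1] * len(kernel_sizes)
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

assert receptive_field([3, 3]) == receptive_field([5])      # both 5
assert receptive_field([3, 3, 3]) == receptive_field([7])   # both 7
```

The stacked 3×3 version uses fewer parameters (2×9 = 18 or 3×9 = 27 weights per channel pair versus 25 or 49) and inserts extra non-linearities between the layers.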
The first layer is Conv1_1: the convolution kernel is a 3×3 matrix, the stride is 1, the output feature map is 300×300, followed by the activation function Relu1_1. The second layer is Conv1_2: the convolution kernel is a 3×3 matrix, the stride is 1, the output feature map is 300×300, followed by the activation function Relu1_2 and a 2×2 max pooling layer. Finally, the feature map becomes 150×150 with 64 convolution kernels.
The third layer is Conv2_1: the convolution kernel is a 3×3 matrix, the stride is 1, the output feature map is 150×150, followed by Relu2_1. The fourth layer is Conv2_2: the convolution kernel is a 3×3 matrix, the stride is 1, and the output feature map is 150×150. After Relu2_2 and a 2×2 max pooling layer, which reduce the dimension, the final feature map becomes 75×75 with 128 convolution kernels.
The fifth, sixth and seventh layers are Conv3_1, Conv3_2 and Conv3_3: each applies a 3×3 convolution kernel with stride 1, followed by the activation functions Relu3_1, Relu3_2 and Relu3_3, respectively.
Finally, the parameters and complexity of the network are reduced by a 2×2 pooling layer; the final feature map becomes a 38×38 matrix with 512 channels.
The eighth, ninth and tenth layers are three 3×3 convolution layers followed by a 2×2 pooling layer; the image features continue to be extracted at these layers. Finally, the dimension is reduced by the pooling, the stride is 1, the number of channels is 512, and the feature map becomes a 19×19 matrix.
After this combination of convolution and pooling, the feature map becomes smaller and smaller while the number of convolution kernels keeps increasing. Finally, a 512×10×10 feature map is output.
In this paper, three layers are selected as detection layers, namely Conv4_3, FC7 and Conv6_2, and combined with the existing base network to detect the same object at different scales. Taking Conv6_2 as an example, the conv6_2_norm layer normalises the feature map of the Conv6_2 layer in Fig. 4. With across_spatial = false, each feature point is processed according to

$$x' = \frac{x}{\lVert x \rVert_2}$$

where $x'$ is the feature point after processing. In Fig. 4, conv6_2_norm_mbox_loc predicts the regression values; it is a convolution layer with 8 outputs, where 8 = 2×4: 2 represents two rectangular frames and 4 represents the location of each rectangle, i.e. the centre point coordinates and the height and width. Since the height-to-width ratio of the station logo is basically fixed at 1:1 or 2:1, other aspect ratios are not considered. Finally, the predicted values undergo Smooth L1 regression against the ground-truth box.
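The per-position normalisation step above can be sketched as follows. This is a minimal illustration of dividing each spatial position's channel vector by its L2 norm; the learned per-channel scale factor applied by the real Normalize layer is omitted, and the toy feature values are assumptions.

```python
import math

# Hedged sketch of L2 normalisation with across_spatial = false: each
# spatial position's channel vector is divided by its own L2 norm,
# i.e. x' = x / ||x||_2. (The real layer then rescales by a learned
# per-channel factor, omitted here.)

def l2_normalize(features):
    """Normalise each position's channel vector to unit length.
    `features` is a list of positions, each a list of channel values."""
    out = []
    for channels in features:
        norm = math.sqrt(sum(c * c for c in channels)) or 1.0
        out.append([c / norm for c in channels])
    return out

normed = l2_normalize([[3.0, 4.0], [0.0, 2.0]])  # toy 2-channel map
```

Normalising each position independently keeps the magnitudes of different detection layers comparable before their predictions are combined.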
conv6_2_norm_mbox_loc_perm is a Permute layer, which changes the arrangement of data in storage, and conv6_2_norm_mbox_loc_flat is a Flatten layer that flattens the data.
conv6_2_norm_mbox_conf is also a convolution layer, which predicts the class attributes of each point on the feature map. With 2 = 2×1 output channels, it uses the anchors (pre-defined rectangles) to predict whether each feature corresponds to background or a station logo.
conv6_2_norm_mbox_loc_perm and conv6_2_norm_mbox_conf_perm work identically: both are Permute layers, which change the arrangement of data in storage. Likewise, conv6_2_norm_mbox_loc_flat and conv6_2_norm_mbox_conf_flat are both Flatten layers.
conv6_2_norm_mbox_priorbox produces the prior boxes. Each time the convolution layer extracts features, the SSD tool slides a window over the feature map to generate prior boxes, using pre-defined rectangles of different sizes. Assuming the conv6_2_norm feature map is 10×10 and three pre-defined rectangular boxes are selected for each feature point, there will be 10×10×3 = 300 prior boxes over the whole picture. When the model is trained, these 300 prior boxes are compared with the ground-truth box; when the overlap area of the two is larger than a set value, the ground-truth box is considered to be detected by the current prior box, and the SSD tool then revises the position of the prior box through the regression frame. When testing the model, since the ground-truth box is unknown, the regression frame is used to adjust the predicted prior box. Finally, the centre position coordinates and the width and height of the object to be detected are obtained.
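The overlap test described above can be sketched with an intersection-over-union (IoU) criterion. The 0.5 threshold is the usual SSD choice and is assumed here, since the paper only says "larger than a set value":

```python
# Hedged sketch: a prior box "detects" the ground-truth box when their
# intersection-over-union exceeds a threshold (0.5 assumed).

def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_priors(priors, gt_box, threshold=0.5):
    """Indices of the prior boxes that match the ground-truth box."""
    return [i for i, p in enumerate(priors) if iou(p, gt_box) >= threshold]

priors = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
matched = match_priors(priors, (0, 0, 2, 2))
```

Only the matched priors contribute positive training targets; the rest become negatives subject to the 3:1 hard-negative mining described earlier.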

Application process
According to the characteristics of the station logo, namely its fixed width-to-height ratio and relatively fixed position, this paper proposes a detection method based on SSD. To apply the algorithm, we must have a sufficiently large annotated sample set containing the images together with the name and location of each marked object, with a one-to-one correspondence between images and annotations. The annotation files are stored as XML files, and all data are arranged in ascending order. Then the samples are divided into the training set (train) and test set (test) files in a 7:3 ratio, and a label file is written for each station logo. The algorithm runs in the CAFFE environment, so the deep learning framework CAFFE must be installed. Finally, we compile the program to train the network model and select the best iteration of the model to test new samples in the real environment. Fig. 5 shows the general workflow of the algorithm.

Data pre-processing
Considering the limited variation in the appearance of station logos, we chose videos of 79 kinds of common station logos. For each type of logo, three 3-min video clips were selected, and one frame per second was extracted from each. In total, 128,034 images were extracted from 237 videos of the 79 types.

Data cleaning:
After frames are extracted from the video, some images contain one station logo, while some news channel images contain not only a logo at the top of the image but also a logo in the small scrolling news area at the bottom, i.e. multiple targets appear in the image. However, some extracted frames may be completely black or contain no station logo, and in some pictures the logo is difficult to distinguish by eye; these are called invalid pictures. Since frame extraction produces a large amount of data, the 128,034 pictures were screened manually by several people. After cleaning, 125,375 valid images containing station logos remained.
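One part of this cleaning can be automated: frames that are almost entirely black can be discarded before human review. The sketch below assumes a flat list of 0-255 grey values per frame and an illustrative brightness threshold; the authors' actual screening procedure is manual.

```python
# Hedged sketch of an automatic pre-filter for the cleaning step:
# drop frames whose mean intensity is near zero (i.e. black frames).
# Pixel representation and threshold are illustrative assumptions.

def is_black_frame(pixels, mean_threshold=8.0):
    """True if the frame's mean grey level is below the threshold."""
    return sum(pixels) / len(pixels) < mean_threshold

frames = [
    [0, 1, 0, 2],        # nearly black -> invalid
    [120, 90, 200, 40],  # normal content -> keep for manual review
]
kept = [f for f in frames if not is_black_frame(f)]
```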

Data annotations:
Data annotation must not only label all the logos appearing in an image but also classify and describe them, converting all labelled logos into XML files. The following naming rules are used when describing each station logo. The general rule is to take the pinyin initials of the words in the station name, and the most basic principle is that duplicate names cannot appear. When the pinyin initials coincide, as for Hunan Satellite TV, Hainan TV and Henan TV, the names are distinguished using additional initials from the full station name, e.g. hnws, hhnws and hnnws. Data annotation uses the image annotation software labelImg, running on the Linux operating system. The annotation results are shown in Fig. 6.
In Fig. 6a, the frame is extracted from a video carrying the CCTV news logo, and the logo is described at the tenth second of the video. Note that the logo is located at the middle left of the image, and the red box marks its specific location. In the txt file, cctv13 is the description of the CCTV news logo in the red box, and the eight numbers in brackets, two per group, represent the positions of the four corners (upper left, upper right, lower right and lower left) of the red rectangle enclosing the logo. Fig. 6b shows the annotation when there is more than one station logo in the image: the image is from a btvxw (BTV News) video and shows two station logos in the 120th frame, one located at the upper left of the image and the other at the lower right.
After the location and category of the station logo in every valid image are marked, all folders are opened recursively, and all images are arranged in order and stored in a new folder. When the annotation result of each txt annotation file is read, the node of the corresponding image is written to the XML file and the new folder is updated. Finally, all XML file names are arranged in ascending order and matched to the images in the original folder. Fig. 7 shows the .xml file of Jiangxi Satellite TV generated by creat_xml.py (Table 1).
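A script like creat_xml.py might emit annotations along the following lines. This sketch assumes a Pascal VOC-style layout (`annotation`/`object`/`bndbox` tags), which is the format the Caffe SSD tooling commonly consumes; the exact schema used by the authors is not given in the text, and the file name and coordinates are made up.

```python
import xml.etree.ElementTree as ET

# Hedged sketch: build a VOC-style XML annotation with one <object>
# node per labelled logo. Tag names follow the common VOC layout.

def build_annotation(filename, objects):
    """objects: list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for label, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

# Hypothetical Jiangxi Satellite TV frame with one logo box
xml_text = build_annotation("jxws_0001.jpg", [("jxws", 12, 8, 70, 36)])
```

Passing a list with several tuples produces several `<object>` nodes, matching the multi-logo case described below.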
If there are two or more station logos in one image, the specific name and location information of each additional logo appears in a further <object> node of the XML file.

Generating the data sets
After data annotation, we obtained 125,375 images covering 79 types of station logos. According to experience, a deep neural network achieves the best training and test performance when the training set and test set are split 7:3. That is, the training set trainval.txt contains 87,762 samples and the test set contains 37,612 samples. Running creat_datasets.py generates the trainval.txt and test.txt files.
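The 7:3 split performed by creat_datasets.py can be sketched as follows; shuffling before splitting and the fixed seed are assumptions for reproducibility, not details given in the text.

```python
import random

# Hedged sketch of a 7:3 train/test split over sample names,
# analogous to what creat_datasets.py produces.

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle and split sample names into train and test lists."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # deterministic shuffle
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

names = [f"img_{i:05d}" for i in range(10)]  # stand-in for 125,375 images
trainval, test = split_dataset(names)
```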
When the network model is trained, the samples in trainval.txt are fed into the deep neural network. We therefore need a file recording the correspondence between the station logo names and the network labels; in other words, after generating the trainval.txt and test.txt files, we write the corresponding labelmap file. The labelmap file attributes and item list are as follows. The images come in many formats, such as jpg, gif, bmp and jpeg; if the data format is not normalised, converting pictures of different types takes extra time during training. To avoid converting a large number of different formats and delaying training, all pictures are processed in advance and the training and test sets are saved in a unified format.
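For illustration, a labelmap file in the Caffe SSD tooling is a prototxt list of items, each mapping a name to a numeric label, with label 0 conventionally reserved for the background class. The entries below are hypothetical examples following the paper's pinyin naming convention; the display names are assumptions.

```prototxt
item {
  name: "none_of_the_above"
  label: 0
  display_name: "background"
}
item {
  name: "hnws"
  label: 1
  display_name: "Hunan Satellite TV"
}
item {
  name: "cctv13"
  label: 2
  display_name: "CCTV News"
}
```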

Training the network model
Generally speaking, training a deep network model requires at least 20,000 pictures [23]. In this paper, the sample data satisfy this requirement, since the training samples number 87,762. Considering that the VGG model is a widely used and well-performing network, we use a pre-training approach, training the model by adjusting the parameters of the existing VGG network model.
To ensure that the model converges well and does not over-fit during training, the detect_eval value in the log is checked every 2000 iterations. When the detect_eval value tends to 0, the model tends to converge.
At the first model iteration, the detect_eval value reached 20.025. After 60,000 iterations, the detect_eval value had decreased to 12.023. After 78,000 iterations, the detect_eval value tended to 0 with only small subsequent fluctuation. Based on this analysis, we note that the detection result is globally optimal, since the model achieves convergence after 80,000 iterations; the training is then stopped.

Scoring the network model
During training, a snapshot of the network model is saved every 2000 iterations, and the test set data are used to score each snapshot.
Execute the 'python examples/ssd/score_ssd_TaiBiao.py' instruction to run the Python program for model scoring.
In the program, the detection set, which consists of 37,612 images with station logos, and the training set, whose images are all listed in trainval.txt, are mutually exclusive. The procedure scores each saved snapshot in turn, always grading the latest iteration of the model. Fig. 8 describes the model scores at every 10,000 iterations. Fig. 9 shows that the detection rate of the models from 10,000 to 50,000 iterations is still low but keeps rising. After about 80,000 iterations, the model score reaches an accuracy of 98.2%, and the scores from 80,000 to 90,000 iterations fluctuate near 98%. Thus, it can be judged that the station logo detection model converges at 80,000 iterations and reaches the global optimum. It is also observed that, as the amount of training data increases, the model becomes more and more accurate.

Comparison of experimental results
It is clear that the model converges at 80,000 iterations and reaches the global optimum, so the 80,000-iteration model is adopted for testing the algorithm. The selected sample types and test results are shown in Fig. 9. Table 2 indicates that the model has high accuracy, and the following are effect drawings of the test. Fig. 10 is an advertisement screenshot of Hunan Satellite TV (hnws), where the algorithm's score for the Hunan Satellite TV logo is 1 (the closer the score is to 1, the more confident the detection). Finally, to prove the advantage of the proposed algorithm, we compare the station logo detection algorithm of this paper with classical detection algorithms.
In the experiment, we selected 237 common station logo videos as the benchmark library and 125,374 images for training the station logo detection model and testing its performance. Taking 80,000 iterations as the termination condition, the accuracy rate reaches 98.12%, as shown in Table 2.

Conclusions
In this paper, we proposed a station logo detection method based on a convolutional neural network. The method covers the network structure and hierarchical design of the algorithm, as well as the application modules of the algorithm, whose functions include collection, screening, labelling, frame extraction, pre-processing, segmentation, training, testing, etc. The proposed method is compared with three classical station logo detection algorithms, and the results (Table 2) prove the superiority of the algorithm. Apart from its standalone utility, we believe that the proposed method provides a good building block for station logo detection systems that employ an object detection component. A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video.