An Efficient and Layout-Independent Automatic License Plate Recognition System Based on the YOLO detector

In this paper, we present an efficient and layout-independent Automatic License Plate Recognition (ALPR) system based on the state-of-the-art YOLO object detector that contains a unified approach for license plate (LP) detection and layout classification to improve the recognition results using post-processing rules. The system is conceived by evaluating and optimizing different models with various modifications, aiming at achieving the best speed/accuracy trade-off at each stage. The networks are trained using images from several datasets, with the addition of various data augmentation techniques, so that they are robust under different conditions. The proposed system achieved an average end-to-end recognition rate of 96.8% across eight public datasets (from five different regions) used in the experiments, outperforming both previous works and commercial systems in the ChineseLP, OpenALPR-EU, SSIG-SegPlate and UFPR-ALPR datasets. In the other datasets, the proposed approach achieved results competitive with those attained by the baselines. Our system also achieved impressive frames per second (FPS) rates on a high-end GPU, being able to perform in real time even when there are four vehicles in the scene. An additional contribution is that we manually labeled 38,351 bounding boxes on 6,239 images from public datasets and made the annotations publicly available to the research community.


INTRODUCTION
Automatic License Plate Recognition (ALPR) has been an important topic of research since the appearance of the first works in the early 1990s [1,2]. A variety of ALPR systems and commercial products have been produced over the years due to many practical applications such as automatic toll collection, border control, traffic law enforcement and road traffic monitoring [3,4].
ALPR systems typically include three phases, namely: license plate (LP) detection, character segmentation and character recognition, which refer to (i) locating the LP region in the acquired image, (ii) segmenting each character within the detected LP and (iii) classifying each segmented character. The earlier stages require higher accuracy since a failure would probably lead to another failure in the subsequent stages.
Many authors have proposed approaches with a vehicle detection stage prior to LP detection, aiming to eliminate false positives (FPs) and reduce processing time [5][6][7]. Regarding character segmentation, the use of segmentation-free approaches for LP recognition has become common [8][9][10][11], as character segmentation by itself is a challenging task that is prone to be influenced by uneven lighting conditions, shadows, and noise [12].
This paper is a preprint of a paper submitted to IET Intelligent Transport Systems. If accepted, the copy of record will be available at the IET Digital Library.
Despite the importance of having a robust ALPR system and the major advances (in terms of both accuracy and efficiency) achieved in computer vision using deep learning [13], several solutions are still not robust enough to be executed in real-world scenarios. Such solutions commonly depend on certain constraints such as specific cameras or viewing angles, simple backgrounds, good lighting conditions, search in a fixed region, and certain types of vehicles. Additionally, many authors still propose computationally expensive approaches that are not able to process frames in real time, even when the experiments are performed on a high-end GPU [12,14,15]. In the literature, a system is generally considered "real-time" if it is capable of processing at least 30 frames per second (FPS) since commercial cameras usually record videos at that frame rate [8,16,17].
ALPR systems must also be capable of recognizing multiple LP layouts since there might be various LP layouts in the same country or region. However, as stated in [18], most of the existing solutions work only for a specific LP layout. Even though most authors claim that their approaches could be extended with small modifications to detect/segment/recognize LPs of different layouts [14,19-21], this may not be an easy task. For instance, a character segmentation approach designed for LPs with simple backgrounds is likely to fail on LPs with complex backgrounds and logos that touch and overlap some characters (e.g., Florida LPs) [9].
In this paper, we propose an end-to-end, efficient and layout-independent ALPR system using YOLO-based models at all stages. YOLO [16,22,23] is a real-time object detector that achieved impressive results in terms of speed/accuracy trade-off in the Pascal VOC [24] and Microsoft COCO [25] detection tasks. Although YOLO has already been employed in the ALPR context in previous works, such works present certain limitations, for example, they handle only part of the ALPR pipeline [26][27][28], LPs from a single country/region [6,17], or cannot process frames in real time [7]. Also, some authors [18,28] simplified the problem by forcing their systems to output only one bounding box per image, even though it is very common to have more than one vehicle/LP in the scene [8].
We locate the vehicles in the input image and then their LPs within the vehicle bounding box. Considering that the bottleneck of ALPR systems is the LP recognition stage (see Section 2.3), in this paper we propose a unified approach for LP detection and layout classification to improve the recognition results using post-processing rules (this is the first time a layout classification stage is proposed to improve LP recognition, to the best of our knowledge). Afterward, all LP characters are recognized simultaneously, i.e., the entire LP patch is fed into the network, avoiding the challenging character segmentation task. We eliminate various constraints commonly found in ALPR systems by training a single network for each task using images from several datasets, which were collected under different conditions and reproduce distinct real-world applications. Moreover, we perform several data augmentation tricks and modifications to the chosen networks aiming to achieve the best speed/accuracy trade-off at each stage.
Our experimental evaluation demonstrates the effectiveness of the proposed approach, which outperforms previous works and two commercial systems in the ChineseLP [29], OpenALPR-EU [30], SSIG-SegPlate [31] and UFPR-ALPR [17] datasets, and achieves results competitive with those attained by the baselines in the other four public datasets. Our system also achieved an impressive trade-off between accuracy and speed. Specifically, on a high-end GPU (i.e., an NVIDIA Titan XP), the proposed system is able to process images in real time even when there are four vehicles in the scene.
Considering the aforementioned discussions, the main contributions of our work are summarized as follows: (i) a new end-to-end, efficient and layout-independent ALPR system using YOLO-based object detection Convolutional Neural Networks (CNNs) at all stages; (ii) annotations regarding the position of the vehicles, LPs and characters, as well as their classes, in each image of the public datasets used in this work that have no annotations or contain labels only for part of the ALPR pipeline (precisely, we manually labeled 38,351 bounding boxes on 6,239 images); and (iii) a comparative evaluation of the proposed approach, previous works in the literature and two commercial systems in eight publicly available datasets.
A preliminary version of the system described in this paper was published at the 2018 International Joint Conference on Neural Networks (IJCNN) [17]. The approach described here differs from that version in several aspects. For instance, in the current version, the LP layout is classified prior to LP recognition (together with LP detection), the recognition of all characters is performed simultaneously (instead of first segmenting and then recognizing each of them) and modifications were made to all networks (e.g., in the input size, number of filters, layers, and anchors, among others) to make them faster and more robust. In this way, we overcame the limitations of the system presented in [17] and were able to considerably improve both the execution time (from 28 ms to 14 ms) and the recognition results (e.g., from 64.89% to 90% in the UFPR-ALPR dataset). This version was also evaluated in a broader and deeper manner.
The remainder of this paper is organized as follows. We review related works in Section 2. The proposed system is presented in Section 3. In Section 4, the experimental setup is thoroughly described. We report and discuss the results in Section 5. Finally, conclusions and future works are given in Section 6.

RELATED WORK
In this section, we review recent works that use deep learning approaches in the context of ALPR. For relevant studies using conventional image processing techniques, please refer to [3,4]. We first discuss works related to the LP detection and recognition stages, and then conclude with final remarks.

License Plate Detection
Many authors have addressed the LP detection stage using object detection CNNs. Silva & Jung noticed that the Fast-YOLO model [16] achieved a low recall rate when detecting LPs without prior vehicle detection. Therefore, they used the Fast-YOLO model arranged in a cascaded manner to first detect the frontal view of the cars and then their LPs in the detected patches, attaining high precision and recall rates on a dataset with Brazilian LPs.
Hsu et al. [26] customized the YOLO and YOLOv2 models exclusively for LP detection. Despite the fact that the modified versions of YOLO performed better and were able to process 54 FPS on a high-end GPU, we believe that LP detection approaches should be even faster (i.e., 150+ FPS) since the LP characters still need to be recognized. Kurpiel et al. [32] partitioned the input image in sub-regions, forming an overlapping grid. A score for each region was produced using a CNN and the LPs were detected by analyzing the outputs of neighboring sub-regions. On a GT-740M GPU, it took 230 ms to detect Brazilian LPs in images with multiple vehicles, achieving a recall rate of 83% on a public dataset introduced by them.
Li et al. [12] trained a CNN based on characters cropped from general text to perform a character-based LP detection. The network was employed in a sliding-window fashion across the entire image to generate a text salience map. Text-like regions were extracted based on the clustering nature of the characters. Connected Component Analysis (CCA) was subsequently applied to produce the initial candidate boxes. Then, another LP/non-LP CNN was trained to remove FPs. Although the precision and recall rates obtained were higher than those achieved in previous works, such a sequence of methods is too expensive for real-time applications, taking more than 2 seconds to process a single image when running on an NVIDIA Tesla K40c GPU.
Xie et al. [28] proposed a YOLO-based model to predict the LP rotation angle in addition to its coordinates and confidence value. Prior to that, another CNN was applied to determine the attention region in the input image, assuming that some distance will inevitably exist between any two LPs. By cascading both models, their approach outperformed all baselines in three public datasets, while still running in real time. Despite the impressive results, it is important to highlight two limitations in their work: (i) the authors simplified the problem by forcing their ALPR system to output only one bounding box per image; (ii) motorcycle LPs might be lost when determining the attention region since, in some scenarios (e.g., traffic lights), they might be very close. Kessentini et al. [18] detected the LP directly in the input image using YOLOv2 without any change or refinement. However, they also considered only one LP per image (mainly to eliminate false positives in the background), which makes their approach unsuitable for many real-world applications that contain multiple vehicles in the scene [8,26,32].

License Plate Recognition
In [6], a YOLO-based model was proposed to simultaneously detect and recognize all characters within a cropped LP. While impressive FPS rates (i.e., 448 FPS on a high-end GPU) were attained in experiments carried out in the SSIG-SegPlate dataset [31], less than 65% of the LPs were correctly recognized. According to the authors, the accuracy bottleneck of their approach was letter recognition since the training set of characters was highly unbalanced (in particular, letters). Silva & Jung [7] retrained that model with an enlarged training set composed of real and artificially generated images using font-types similar to the LPs of certain regions. In this way, the retrained network became much more robust for the detection and classification of real characters, outperforming previous works and commercial systems in three public datasets.
Li et al. [12] proposed to perform character recognition as a sequence labeling problem, also without the character-level segmentation. Sequential features were first extracted from the entire LP patch using a CNN in a sliding window manner. Then, Bidirectional Recurrent Neural Networks (BRNNs) with Long Short-Term Memory (LSTM) were applied to label the sequential features. Lastly, Connectionist Temporal Classification (CTC) was employed for sequence decoding. The results showed that this method attained better recognition rates than the baselines. Nonetheless, only Taiwanese LPs were used in their experiments and the execution time was not reported.
Dong et al. [14] claimed that the method proposed in [12] is very fragile to distortions caused by viewpoint change and therefore is not suitable for LP recognition in the wild. Thus, an LP rectification step is employed first in their approach. Afterward, a CNN was trained to recognize Chinese characters, while a shared-weight CNN recognizer was used for digits and English letters, making full use of the limited training data. The recognition rate attained on a private dataset with Chinese LPs was 89.05%. The authors did not report the execution time of this particular stage.
Zhuang et al. [33] proposed a semantic segmentation technique followed by a character count refinement module to recognize the characters of an LP. For semantic segmentation, they simplified the DeepLabV2 (ResNet-101) model by removing the multi-scaling process, increasing computational efficiency. Then, the character areas were generated through CCA. Finally, Inception-v3 and AlexNet were adopted as the character classification and character counting models, respectively. The authors claimed that both an outstanding recognition performance and a high computational efficiency were attained. Nevertheless, they assumed that LP detection is easily accomplished and used cropped patches containing only the LP with almost no background (i.e., the ground truth) as input. Furthermore, their system is not able to process images in real time, especially when considering the time required for the LP detection stage, which is often more time-consuming than the recognition one.
Some papers focus on deblurring the LPs, which is very useful for LP recognition. Lu et al. [34] proposed a scheme based on sparse representation to identify the blur kernel, while Svoboda et al. [35] employed a text deblurring CNN for reconstruction of blurred LPs. Despite achieving exceptional qualitative results, the additional computational cost of a deblurring stage is usually prohibitive for real-time ALPR applications.

Final Remarks
The approaches developed for ALPR are still limited in various ways. Many authors only addressed part of the ALPR pipeline, e.g., LP detection [28,32,36] or character/LP recognition [33,37,38], or performed their experiments on private datasets [9,14,38], making it difficult to accurately evaluate the presented methods. Note that works focused on a single stage do not consider localization errors (i.e., correct but not so accurate detections) in earlier stages [10,33]. Such errors directly affect the recognition results. As an example, Gonçalves et al. [8] improved their results by 20% by skipping the LP detection stage, that is, by feeding the LPs manually cropped into their recognition network.
In this work, the proposed end-to-end system is evaluated in eight public datasets that present a great variety in the way they were collected, with images of various types of vehicles (including motorcycles) and numerous LP layouts. It should be noted that, in most of the works in the literature, no more than three datasets were used in the experiments (e.g., [12,17,18,33]). In addition, despite the fact that motorcycles are one of the most popular transportation means in metropolitan areas [39], motorcycle images have not been used in the assessment of most ALPR systems in the literature.
Most of the approaches are not capable of recognizing LPs in real time (i.e., 30 FPS) [7,15,33], which rules them out for some applications. Furthermore, several authors do not report the execution time of the proposed methods or report the time required only for a specific stage [12,14,38], making an accurate analysis of their speed/accuracy trade-off, as well as their applicability, difficult. In this sense, at each stage, we evaluate different YOLO models with various modifications, carefully optimizing and combining them aiming to achieve the best speed/accuracy trade-off. In our experiments, both the accuracy and execution time are reported to enable fair comparisons in future works.
It is important to highlight that although outstanding results in terms of mean Average Precision (mAP) have been achieved with other object detectors such as SSD [40] and RetinaNet [41], in this work we adapt YOLO since it focuses on an extreme speed/accuracy trade-off [41], which is essential in our domain application, being able to process more than twice as many FPS as other detectors while still achieving competitive results [22,23].
We consider LP recognition as the current bottleneck of ALPR systems since (i) impressive LP detection results have been reported in recent works [17,27,28], both in terms of recall rate and execution time; (ii) Optical Character Recognition (OCR) approaches must work as close as possible to optimality (i.e., a character recognition rate of 100%) in the ALPR context, as a single mistake may result in an incorrect identification of the vehicle [31]. Thus, in this work, we propose a unified approach for LP detection and layout classification in order to improve the recognition results using heuristic rules. Additionally, we design and apply data augmentation techniques to simulate LPs of other layouts and also to generate LP images with characters that have few instances in the training set. Hence, unlike [6,38], we avoid errors in the recognition stage due to highly unbalanced training sets of LP characters.

PROPOSED ALPR SYSTEM
The nature of traffic images might be very problematic to LP detection approaches that work directly on the frames (i.e., without vehicle detection) since (i) there are many textual blocks that can be confused with LPs such as traffic signs and phone numbers on storefronts, and (ii) LPs might occupy very small portions of the image. Thus, we propose to first locate the vehicles in the input image and then detect their respective LPs in the vehicle patches. Afterward, we detect and recognize all characters simultaneously by feeding the entire LP patch into the network. In this way, we do not need to deal with the challenging character segmentation task.
Although some approaches with such characteristics (i.e., containing a vehicle detection stage prior to LP detection and/or avoiding character segmentation) have already been proposed in the literature, none of them presented robustness for different LP layouts in both accuracy and processing time. In [6] and [8], for instance, the authors designed real-time ALPR systems able to process more than 50 FPS on high-end GPUs; however, both systems were evaluated only on LPs from a single country and presented poor recognition rates in at least one dataset in which they were evaluated. On the other hand, outstanding results were achieved on different scenarios in some recent works [7,12,15]; however, the methods presented in these works are computationally expensive and cannot be applied in real time. This makes them unsuitable for use in many real-world applications.
In order to develop an ALPR system that is robust for different LP layouts, we propose a layout classification stage after LP detection. However, instead of performing both stages separately, we merge the LP detection and layout classification tasks by training an object detection network that outputs a distinct class for each LP layout. In this way, with almost no additional cost, we employ layout-specific approaches for LP recognition in cases where the LP and its layout are predicted with a confidence value above a predefined threshold. For example, all Brazilian LPs have seven characters: three letters and four digits (in that order), and thus a post-processing method is applied to avoid errors in characters that are often misclassified, such as 'B' and '8', 'G' and '6', 'I' and '1', among others. In cases where the LP and its layout are detected with confidence below the predefined threshold, a generic approach is applied. To the best of our knowledge, this is the first time a layout classification stage is proposed to improve the recognition results.
As great advances in object detection have been achieved using YOLO-inspired models [42][43][44], we decided to specialize it for ALPR. We use specific models for each stage. Thus, we can tune the parameters separately in order to improve the performance of each task. The models adapted are YOLOv2 [22], Fast-YOLOv2 and CR-NET [6], which is an architecture inspired by YOLO for character detection and recognition. We evaluated several data augmentation techniques and performed modifications to each network (e.g., changes in the input size, number of filters, layers and anchors) to achieve the best speed/accuracy trade-off at each stage.
In this work, unlike [17,33,45], for each stage, we train a single network on images from several datasets (described in Section 4.1) to make our networks robust for distinct ALPR applications or scenarios with considerably less manual effort since their parameters are adjusted only once for all datasets.
The remainder of this section describes the proposed approach and is divided into three subsections, one for each stage of our end-to-end ALPR system: (i) vehicle detection, (ii) LP detection and layout classification and (iii) LP recognition. Figure 1 illustrates the system pipeline, explained throughout this section.
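The cascade described above can be summarized with a minimal Python sketch. It is only an illustration of the control flow: the `Detection` type, the `crop` helper and the three callables (`detect_vehicles`, `detect_plate`, `read_characters`) are hypothetical placeholders standing in for the three networks, and the image is represented as a plain list of pixel rows to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels, hypothetical convention


@dataclass
class Detection:
    box: Box
    label: str         # vehicle type or LP layout class
    confidence: float


def crop(image, box):
    """Return the sub-image covered by box (image is a list of pixel rows)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def recognize_plates(image, detect_vehicles, detect_plate, read_characters):
    """Run the three stages in cascade and return one result per vehicle."""
    results = []
    for vehicle in detect_vehicles(image):
        vehicle_patch = crop(image, vehicle.box)
        # Stage 2 returns a single Detection (box + layout class) or None.
        plate = detect_plate(vehicle_patch)
        if plate is None:
            results.append((vehicle, None, None))  # negative result
            continue
        plate_patch = crop(vehicle_patch, plate.box)
        # Stage 3 sees the whole LP patch plus the layout and vehicle type,
        # which the heuristic rules may use later.
        text = read_characters(plate_patch, plate.label, vehicle.label)
        results.append((vehicle, plate, text))
    return results
```

Each stage only sees the patch produced by the previous one, which is why a missed vehicle or LP yields a negative result for that vehicle rather than an error.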

Vehicle Detection
In this stage, we conducted experiments using and modifying the following models: Fast-YOLOv2, YOLOv2 [22], Fast-YOLOv3 and YOLOv3 [23]. Although the Fast-YOLOv2 and Fast-YOLOv3 models correctly located the vehicles in most cases, they failed in challenging scenarios such as images in which one or more vehicles are partially occluded or appear in the background. On the other hand, impressive results (i.e., F-measure rates above 98% in the validation set) were obtained with both YOLOv2 and YOLOv3, which successfully detected vehicles even in those cases where the smaller models failed. As the computational cost is one of our main concerns and YOLOv3 is much more complex than its predecessor, we adapt the YOLOv2 model for vehicle detection.
First, we changed the network input size from 416 × 416 to 448 × 288 pixels since the images used as input to ALPR systems generally have a width greater than height. Hence, our network processes less distorted images and performs faster, as the new input size is 25% smaller than the original. The new dimensions were chosen based on speed/accuracy assessments with different input sizes (from 448 × 288 to 832 × 576 pixels). Then, we recalculate the anchor boxes for the new input size as well as for the datasets employed in our experiments using the k-means clustering algorithm. Finally, we reduced the number of filters in the last convolutional layer to match the number of classes. YOLOv2 uses A anchor boxes to predict bounding boxes (we use A = 5), each with four coordinates (x, y, w, h), confidence and C class probabilities [22], so the number of filters is given by

$$\text{filters} = (C + 5) \times A. \qquad (1)$$

As we intend to detect cars and motorcycles (two classes), the number of filters in the last convolutional layer must be 35 ((2 + 5) × 5).
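As a quick check of the arithmetic in this paragraph, the short sketch below computes the number of output filters as (C + 5) × A and the relative reduction in input area. The function names are illustrative only and are not part of any YOLO implementation.

```python
def num_output_filters(num_classes: int, num_anchors: int = 5) -> int:
    """Filters in the last convolutional layer: (C + 5) * A.

    Each anchor predicts 4 coordinates, 1 confidence and C class scores.
    """
    return (num_classes + 5) * num_anchors


def area_reduction(old_size, new_size) -> float:
    """Fraction by which the input area shrinks, sizes given as (width, height)."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    return 1.0 - (new_w * new_h) / (old_w * old_h)


# Two vehicle classes (cars and motorcycles) with five anchors.
vehicle_filters = num_output_filters(2)           # (2 + 5) * 5
# Input changed from 416 x 416 to 448 x 288 pixels.
shrink = area_reduction((416, 416), (448, 288))   # roughly one quarter
```

With 80 classes and five anchors the same expression gives 425, which matches the original filter count mentioned for the unmodified detector later in this section.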
According to preliminary experiments, the results were better when using two classes instead of just one regarding both types of vehicles. The modified YOLOv2 architecture for vehicle detection is shown in Table 1. We exploit various data augmentation strategies, such as flipping, rescaling and shearing, to train our network. Thus, we prevent overfitting by creating many other images with different characteristics from a single labeled one. Silva & Jung [7] slightly modified their pipeline by directly applying their LP detector (i.e., skipping the vehicle detection stage) when dealing with images in which the vehicles are very close to the camera, as their detector failed in several of those cases. We believe this is not the best way to handle the problem. Instead, we do not skip the vehicle detection stage even when only a small part of the vehicle is visible. The entire image is labeled as ground truth in cases where the vehicles are very close to the camera. Therefore, our network also learns to select the Region of Interest (ROI) in such cases.
Fig. 1. The pipeline of the proposed ALPR system. First, the vehicles are detected in the input image. Then, the LP of each vehicle is detected and its layout classified (in the example above, the LPs are Taiwanese). Finally, all characters of each LP are recognized simultaneously with heuristic rules being applied to adapt the results according to the predicted layout class (e.g., Taiwanese LPs have 5 or 6 characters).

In the validation set, we evaluate several confidence thresholds to detect as many vehicles as possible while maintaining a low FP rate. Furthermore, we apply a Non-Maximum Suppression (NMS) algorithm to eliminate redundant detections (those with Intersection over Union (IoU) ≥ 0.25) since the same vehicle might be detected more than once by the network. A negative recognition result is given in cases where no vehicle is found.
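The NMS step can be illustrated with a minimal greedy sketch. It assumes boxes given as (x1, y1, x2, y2) corners and a class-agnostic suppression; the text does not state which NMS variant is used, so both choices are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.25):
    """Greedy NMS over (box, confidence) pairs.

    Detections are visited from highest to lowest confidence; one is kept
    only if its IoU with every already-kept box is below the threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

A low threshold such as 0.25 is reasonable here because two distinct vehicles rarely overlap that much within a frame, whereas duplicate detections of the same vehicle usually do.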

License Plate Detection and Layout Classification
In this work, we detect the LP and simultaneously classify its layout into one of the following classes: American, Brazilian, Chinese, European or Taiwanese. These classes were defined based on public datasets found in the literature [17,29-31,46-49] and also because there are many ALPR systems designed primarily for LPs of one of those regions [6,38,49]. It is worth noting that (i) among LPs with different layouts (which may belong to the same class/region) there is a wide variety in many factors, for example, in the aspect ratio, colors, symbols, position of the characters, number of characters, among others; (ii) we consider LPs from different jurisdictions in the United States as a single class; the same is done for LPs from European countries. LPs from the same country or region may look quite different, but still share many characteristics. Such common features can be exploited to improve LP recognition.
Looking for an efficient ALPR system, in this stage we performed experiments with the Fast-YOLOv2 and Fast-YOLOv3 models. In the validation set, Fast-YOLOv2 obtained slightly better results than its successor. This is due to the fact that YOLOv3 and Fast-YOLOv3 have relatively high performance on small objects (which is not the case since we first detect the vehicles), but comparatively worse performance on medium and larger size objects [23]. Accordingly, here we modified the Fast-YOLOv2 model to adapt it to our application and to achieve even better results.
First, we changed the kernel size of the next-to-last convolutional layer from 3 × 3 to 1 × 1. Then, we added a 3 × 3 convolutional layer with twice the filters of that layer. In this way, the network reached better results (F-measure ≈ 1% higher, from 97.97% to 99.00%) almost without increasing the number of floating-point operations (FLOP) required, i.e., from 5.35 to 5.53 billion floating-point operations (BFLOP), as alternating 1 × 1 convolutional layers between 3 × 3 convolutions reduce the feature space from preceding layers [16,22]. Finally, we recalculate the anchors for our data and make adjustments to the number of filters in the last layer. The modified architecture is shown in Table 2.

Table 2. Fast-YOLOv2 modified for LP detection and layout classification. First, we reduced the kernel size of layer #13 from 3 × 3 to 1 × 1, and added layer #14. Then, we reduced the number of filters in layer #15 from 425 to 50, as we use 5 anchor boxes to detect 5 classes (see Equation 1).

In Table 2, we also list the number of FLOP required in each layer to highlight how small the modified network is compared to others, e.g., YOLOv2 and YOLOv3. For this task, our network requires 5.53 BFLOP while YOLOv2 and YOLOv3 require 29.35 and 66.32 BFLOP, respectively. It is noteworthy that we only need to increase the number of filters in the last convolutional layer so that the network can detect/classify additional LP layouts.
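The anchor recalculation mentioned above (and in the vehicle detection stage) can be sketched as k-means over the widths and heights of the labeled boxes. The sketch below assumes the 1 − IoU distance introduced with YOLOv2 [22]; the text only states that k-means is used, so the distance, the initialization and the function names are assumptions, and the boxes are kept in pixels rather than in grid-cell units.

```python
import random


def wh_iou(box, centroid):
    """IoU of two boxes aligned at the same corner, so only (w, h) matter."""
    w, h = box
    cw, ch = centroid
    inter = min(w, cw) * min(h, ch)
    return inter / (w * h + cw * ch - inter)


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (w, h) pairs into k anchors, assigning by highest IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumed initialization: random boxes
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is equivalent to the smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which matters when the dataset mixes distant and close-up LPs.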
For LP detection and layout classification, we also use data augmentation strategies to generate many other images from a single labeled one. However, horizontal flipping is not performed at this stage, as the network leverages information such as the position of the characters and symbols on the LP to predict its layout (besides the aspect ratio, colors, and other characteristics).
Only the detection with the highest confidence value is considered in cases where more than one LP is predicted, as each vehicle has only one LP. Then, we classify as 'undefined layout' every LP that has its position and class predicted with a confidence value below 0.75, regardless of which class the network predicted (note that such LPs are not rejected, instead, a generic approach is used in the recognition stage). This threshold was chosen based on experiments performed in the validation set, in which approximately 92% of the LPs were predicted with a confidence value above 0.75. In each of these cases, the LP layout was correctly classified. A negative result is given in cases where no LP is predicted by the network.
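The selection rule in this paragraph can be expressed in a few lines. This is a minimal sketch assuming that the network output is available as a list of (box, layout, confidence) tuples; the tuple format and the `select_plate` name are hypothetical, while the 0.75 threshold comes from the text.

```python
LAYOUT_CONFIDENCE_THRESHOLD = 0.75  # value reported in the text


def select_plate(predictions):
    """Pick one LP per vehicle patch and decide whether to trust its layout.

    predictions: list of (box, layout, confidence) tuples for one vehicle.
    Returns None when nothing was predicted (negative result).
    """
    if not predictions:
        return None
    # Each vehicle has only one LP, so keep the most confident prediction.
    box, layout, confidence = max(predictions, key=lambda p: p[2])
    if confidence < LAYOUT_CONFIDENCE_THRESHOLD:
        # The LP is not rejected; a generic recognition approach is used.
        layout = "undefined"
    return box, layout, confidence
```

Keeping low-confidence LPs under an 'undefined' label trades a possible loss of layout-specific corrections for never discarding a valid detection.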

License Plate Recognition
Once the LP has been detected and its layout classified, we employ CR-NET [6] for LP recognition (i.e., all characters are recognized simultaneously by feeding the entire LP patch into the network). CR-NET is a model that consists of the first eleven layers of YOLO and four other convolutional layers added to improve nonlinearity. This model was chosen for two main reasons. First, it was capable of detecting and recognizing LP characters at 448 FPS in [6]. Second, very recently, it yielded the best recognition results in the context of image-based automatic meter reading [50], outperforming two segmentation-free approaches based on deep learning.
The CR-NET architecture is shown in Table 3. We changed its input size, which was originally defined based on Brazilian LPs, from 240 × 80 to 352 × 128 pixels taking into account the average aspect ratio of the LPs in the datasets used in our experiments, in addition to results obtained in the validation set, where several input sizes were evaluated (e.g., 256 × 96 and 384 × 128 pixels). As the same model is employed to recognize LPs of various layouts, we enlarge all LP patches (in both the training and testing phases) so that they have aspect ratios (w/h) between 2.5 and 3.0, as shown in Figure 3, considering that the input image has an aspect ratio of 2.75. The network is trained to predict 35 classes (0-9, A-Z, where the letter 'O' is detected/recognized jointly with the digit '0') using the LP patch as well as the class and coordinates of each character as inputs. It is worth mentioning that the first character in Chinese LPs (see Figure 2) is a Chinese character that represents the province with which the vehicle is affiliated [38,51]. Following [15], our network was not trained/designed to recognize Chinese characters, even though Chinese LPs are used in the experiments. In other words, only digits and English letters are considered. The reason is threefold: (i) there are fewer than 400 images in the ChineseLP dataset [29] (only some of them are used for training), which is employed in the experiments, and some provinces are not represented; (ii) labeling the class of Chinese characters is not a trivial task for non-Chinese people (we manually labeled the position and class of the LP characters in the ChineseLP dataset); and (iii) to fairly compare our system with others trained only on digits and English letters. Note that in the ALPR literature the approaches capable of recognizing Chinese characters, digits and English letters were evaluated, for the most part, on datasets containing only Chinese LPs [20,38,51].
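The enlargement of LP patches to an aspect ratio between 2.5 and 3.0 can be sketched as follows. The text does not detail how the extra area is distributed, so this example assumes that the box grows symmetrically around its center and only along the dimension that is too small; clipping to the image borders is left to the caller. The function name and box format are illustrative.

```python
def enlarge_to_aspect_ratio(box, min_ratio=2.5, max_ratio=3.0):
    """Grow an LP box (x, y, w, h) so that min_ratio <= w / h <= max_ratio.

    The box is never shrunk: a box that is too narrow gets wider and a box
    that is too flat gets taller, keeping the same center (an assumption).
    """
    x, y, w, h = box
    ratio = w / h
    if ratio < min_ratio:
        new_w, new_h = h * min_ratio, h      # widen
    elif ratio > max_ratio:
        new_w, new_h = w, w / max_ratio      # heighten
    else:
        return (float(x), float(y), float(w), float(h))
    center_x, center_y = x + w / 2, y + h / 2
    return (center_x - new_w / 2, center_y - new_h / 2, new_w, new_h)
```

For instance, a square 40 × 40 patch (as may occur for two-row motorcycle LPs) becomes 100 × 40, so that resizing it to 352 × 128 distorts the characters far less than resizing the original square would.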
As the LP layout is classified in the previous stage, we design heuristic rules to adapt the results produced by CR-NET according to the predicted class. Based on the datasets employed in this work, we defined the minimum and the maximum number of characters to be considered in LPs of each layout. Brazilian and Chinese LPs have a fixed number of characters, while American, European and Taiwanese LPs do not (see Table 4). Initially, we consider all characters predicted with a confidence value above a predefined threshold. Afterward, as in the vehicle detection stage, an NMS algorithm is applied to remove redundant detections. Finally, if necessary, we discard the characters predicted with lower confidence values or consider others previously discarded (i.e., ignoring the confidence threshold) so that the number of characters considered is within the range defined for the predicted class. We consider that the LP has between 4 and 8 characters in cases where its layout was classified with a low confidence value (i.e., undefined layout).
Additionally, we swap digits and letters on Brazilian and Chinese LPs, as there are fixed positions for digits or letters in those layouts. In Brazilian LPs, the first three characters correspond to letters and the last four to digits, while in Chinese LPs the second character is a letter that represents a city in the province with which the vehicle is affiliated. This swap approach, inspired by [6], is not employed for LPs of other layouts since each character position can be occupied by either a letter or a digit in American, European and Taiwanese LPs. The specific swaps are given by [1 ⇒ I; 2 ⇒ Z; ...].
The LP characters might also be arranged in two rows instead of one. We distinguish such cases based on the predictions of the vehicle type, LP layout, and character coordinates. In our experiments, only two datasets have LPs with the characters arranged in two rows. These datasets were captured in Brazil and Croatia. In Brazil, car and motorcycle LPs have the characters arranged in one and two rows, respectively. Thus, we look at the predicted class in the vehicle detection stage in those cases. In Croatia, on the other hand, cars might also have LPs with two rows of characters. Therefore, for European LPs, we consider that the characters are arranged in two rows in cases where the bounding boxes of half or more of the predicted characters are located entirely below another character. In our tests, this simple rule was sufficient to distinguish LPs with one and two rows of characters even in cases where the LP is considerably inclined. We emphasize that segmentation-free approaches (e.g., [8][9][10]) cannot recognize LPs with two rows of characters, unlike YOLO-based approaches, which are better suited to recognize them thanks to YOLO's versatility and ability to learn general component features, regardless of their positions [18].
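Two of the rules above, keeping the number of characters within the range of the predicted layout and detecting two-row LPs, can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the detection format `(char, confidence, box)` with boxes given as (x, y, w, h) and y growing downward, and the assumption that NMS has already been applied are ours.

```python
def select_characters(detections, min_chars, max_chars, threshold=0.5):
    """Keep detections above the threshold, then trim or extend the list so
    that its length falls within [min_chars, max_chars].

    detections: list of (char, confidence, box), assumed already NMS-filtered.
    Returns the kept detections ordered by decreasing confidence.
    """
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = [d for d in ranked if d[1] > threshold]
    if len(kept) > max_chars:
        # Too many characters: discard the least confident ones.
        kept = kept[:max_chars]
    elif len(kept) < min_chars:
        # Too few: bring back detections that were below the threshold.
        kept = ranked[:min_chars]
    return kept


def is_two_rows(boxes):
    """True if half or more of the boxes lie entirely below another box."""
    def entirely_below(a, b):
        # The top of a is at or below the bottom of b.
        return a[1] >= b[1] + b[3]

    n_below = sum(any(entirely_below(a, b) for b in boxes) for a in boxes)
    return len(boxes) > 0 and 2 * n_below >= len(boxes)
```

For a layout with a fixed number of characters, `min_chars` and `max_chars` would simply be equal; for an undefined layout, the text above gives the range 4 to 8.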
In addition to using the original LP images, we design and apply data augmentation techniques to train the CR-NET model and improve its robustness. First, we double the number of training samples by creating a negative image of each LP, as we noticed that in some cases negative LPs are very similar to LPs of other layouts. This is illustrated with Brazilian and American LPs in Figure 4. We also generate many other images by randomly rescaling the LP patch and adding a margin to it, simulating more or less accurate detections of the LP in the previous stage.
The datasets for ALPR are generally very unbalanced in terms of character classes due to LP allocation policies. It is well known that unbalanced data is undesirable for neural network classifiers since the learning of some patterns might be biased. To address this issue, we replace, on the LPs, characters that are overrepresented in the training set with underrepresented ones. In this way, as in [8], we are able to create a balanced set of images in which the order and frequency of the characters on the LPs are chosen to uniformly distribute them across the positions. We maintain the initial arrangement of letters and digits of each LP so that the network might also learn the positions of letters and digits in certain LP layouts. Figure 5 shows some images artificially generated by permuting the characters on LPs of different layouts. We also perform random variations of brightness, rotation and cropping to further increase the diversity of the generated images. The parameters were empirically adjusted through visual inspection, i.e., brightness variation of the pixels within [0.85; 1.15], rotation angles between −5° and 5°, and cropping from −2% to 8% of the LP size. Once these ranges were established, new images were generated using random values within those ranges for each parameter.
Figure 6: Examples of LP images generated using the data augmentation technique proposed in [10]. The images in the first row are the originals and the others were generated automatically.
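The negative-image and random-parameter steps of the augmentation can be sketched as below. This is an illustration under assumptions of ours: images are 8-bit grayscale nested lists, parameters are sampled uniformly, and a negative crop fraction is read as adding a margin. Rotation and cropping themselves are left to an image library and are not shown.

```python
import random

# Ranges reported in the text; uniform sampling is an assumption.
BRIGHTNESS_RANGE = (0.85, 1.15)   # multiplicative factor on pixel values
ROTATION_RANGE = (-5.0, 5.0)      # degrees
CROP_RANGE = (-0.02, 0.08)        # fraction of the LP size


def sample_augmentation(rng):
    """Draw one random value per parameter within the established ranges."""
    return {
        "brightness": rng.uniform(*BRIGHTNESS_RANGE),
        "rotation_deg": rng.uniform(*ROTATION_RANGE),
        "crop_frac": rng.uniform(*CROP_RANGE),
    }


def negative(image):
    """Negative of an 8-bit grayscale image given as rows of pixel values."""
    return [[255 - p for p in row] for row in image]


def adjust_brightness(image, factor):
    """Scale pixel values by factor, clamping the result to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


rng = random.Random(0)            # fixed seed for reproducibility
params = sample_augmentation(rng)
patch = [[0, 255], [100, 30]]
augmented = adjust_brightness(negative(patch), params["brightness"])
```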

EXPERIMENTAL SETUP
All experiments were performed on a computer with an AMD Ryzen Threadripper 1920X 3.5 GHz CPU, 32 GB of RAM and an NVIDIA Titan Xp GPU. The Darknet framework [52] was employed to train and test our networks. However, we used AlexeyAB's version of Darknet [53], which has several improvements over the original, including improved neural network performance by merging two layers into one (convolutional and batch normalization), optimized memory allocation during network resizing, and many other code fixes. For more details on this repository, refer to [53].
We also made use of Darknet's built-in data augmentation, which creates a number of randomly cropped and resized images with changed colors (hue, saturation, and exposure). We manually implemented the flip operation only for the vehicle detection stage, as this operation would probably impair the layout classification and the LP recognition tasks. Similarly, we disabled the color-related data augmentation for the LP detection and layout classification stage (further explained in Section 5.2).

Datasets
The experiments were carried out on eight publicly available datasets: Caltech Cars [46], EnglishLP [47], UCSD-Stills [48], ChineseLP [29], AOLP [49], OpenALPR-EU [30], SSIG-SegPlate [31] and UFPR-ALPR [17]. These datasets are often used to evaluate ALPR systems, contain multiple LP layouts and were collected under different conditions/scenarios (e.g., with variations in lighting, camera position and settings, and vehicle types). An overview of the datasets is presented in Table 5. It is noteworthy that in most of the works in the literature, including some recent ones [12,17,18,33], no more than three datasets were used in the experiments.
The datasets collected in the United States (i.e., Caltech Cars and UCSD-Stills) and in Europe (i.e., EnglishLP and OpenALPR-EU) are relatively simple and have certain characteristics in common, for example, most images were captured with a hand-held camera and there is only one vehicle (generally well-centered) in each image. There are only a few cases in which the LPs are not well aligned. The ChineseLP and AOLP datasets, on the other hand, also contain images where the LP is inclined/tilted, as well as images with more than one vehicle, which may be occluded by others. Lastly, the SSIG-SegPlate and UFPR-ALPR datasets are composed of high-resolution images, enabling LP recognition from distant vehicles. In both datasets, there are several frames of each vehicle and, therefore, redundant information may be used to improve the recognition results.
Most datasets have no annotations or contain labels for a single stage only (e.g., LP detection), despite the fact that they are often used to train/evaluate algorithms in the ALPR context. Therefore, in all images of these datasets, we manually labeled the position of the vehicles (including those in the background where the LP is also legible), LPs and characters, as well as their classes.
In addition to using the training images of the datasets, we downloaded and labeled 772 additional images from the internet to train all stages of our ALPR system. This procedure was adopted to eliminate biases from the datasets employed in our experiments. For example, the Caltech Cars and UCSD-Stills datasets have similar characteristics (e.g., there is one vehicle per image, the vehicle is centered and occupies a large portion of the image, and the resolutions of the images are not high), which are different from those of the other datasets. Moreover, there are many more examples of Brazilian and Taiwanese LPs in our training data (note that the exact number of images used for training, testing and validation in each dataset is detailed in the next section). Therefore, we downloaded images containing vehicles with American, Chinese and European LPs so that there are at least 500 images of LPs of each class/region to train our networks. Specifically, we downloaded 257, 341, and 174 images containing American, Chinese and European LPs, respectively.
In our experiments, we did not make use of two datasets proposed recently: AOLPE [26] (an extension of the AOLP dataset) and Chinese City Parking Dataset (CCPD) [54]. The former has not yet been made available by the authors, who are collecting more data to make it even more challenging. The latter, although already available, does not provide the position of the vehicles and the characters in its 250,000 images and it would be impractical to label them to train/evaluate our networks (Xu et al. [54] used more than 100,000 images for training in their experiments).

Evaluation Protocol
To evaluate the stages of (i) vehicle detection and (ii) LP detection and layout classification, we report the precision and recall rates achieved by our networks. Each metric has its importance since, for system efficiency, all vehicles/LPs must be detected without many false positives. Note that the precision and recall rates are equal in the LP detection and layout classification stage because we consider only one LP per vehicle.
We consider as correct only the detections with IoU greater than 0.5 with the ground truth. This bounding box evaluation, defined in the PASCAL VOC Challenge [24] and employed in previous works [15,18,21], is interesting since it penalizes both over- and under-estimated objects. In the LP detection and layout classification stage, we assess only the predicted bounding box on LPs classified as undefined layout (see Section 3.2). In other words, we consider as correct the predictions in which the LP position is correctly predicted but its layout is not, as long as the LP (and its layout) has not been predicted with a high confidence value (i.e., the confidence is below 0.75).
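The IoU criterion can be written down in a few lines. The sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates; the function names are ours.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold=0.5):
    """A detection counts as correct only if its IoU exceeds the threshold."""
    return iou(pred, gt) > threshold


gt = (0, 0, 10, 10)
print(iou((0, 0, 20, 20), gt))  # 0.25 (over-estimated box)
print(iou((0, 0, 5, 5), gt))    # 0.25 (under-estimated box)
```

Both example predictions fully overlap the ground truth region they share with it, yet both fall below 0.5, which illustrates how the metric penalizes over- and under-estimated boxes alike.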
In the LP recognition stage, we report the number of correctly recognized LPs divided by the total number of LPs in the test set. A correctly recognized LP means that all characters on the LP were correctly recognized, as a single character recognized incorrectly may lead to an incorrect identification of the vehicle [5].
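The metrics used in this protocol reduce to simple counts. The following sketch uses names of our own choosing and represents an LP that was never detected as `None`, which is an assumption about bookkeeping rather than something stated in the text.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def recognition_rate(predicted, ground_truth):
    """Fraction of LPs whose whole string matches the ground truth.

    predicted[i] is the recognized string for the i-th test LP, or None when
    the vehicle/LP was not detected in an earlier stage.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("one prediction (or None) is required per test LP")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# One exact match, one LP with a single wrong character, one undetected LP.
rate = recognition_rate(["ABC1234", "ABC1235", None],
                        ["ABC1234", "ABC1234", "XYZ9876"])
```

Dividing by the full size of the test set, rather than by the number of detected LPs, is what makes the rate an end-to-end figure.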
According to Table 5, only three of the eight datasets used in this work contain an evaluation protocol (defined by the respective authors) that can be reproduced perfectly: UCSD-Stills, SSIG-SegPlate and UFPR-ALPR. Thus, we split their images into training, validation, and test sets according to their own protocols. We randomly divided the other five datasets using the protocols employed in previous works, aiming at a fair comparison with them. In the next paragraph, such protocols (which we also provide for reproducibility purposes) are specified.
We used 80 images of the Caltech Cars dataset for training and 46 for testing, as in [55][56][57]. Then, we employed 16 of the 80 training images for validation (i.e., 20%). The EnglishLP dataset was divided in the same way as in [45], with 80% of the images being used for training and the remainder for testing. Also in this dataset, 20% of the training images were employed for validation. Regarding the ChineseLP dataset, we did not find any previous work in which it was split into training/test sets, that is, all its images were used either to train or to test the methods proposed in [12,19,58,59], often jointly with other datasets. Thus, we adopted the same protocol as the SSIG-SegPlate and UFPR-ALPR datasets, in which 40% of the images are used for training, 40% for testing and 20% for validation. The AOLP dataset is categorized into three subsets, which represent three major ALPR applications: access control (AC), traffic law enforcement (LE), and road patrol (RP). As this dataset has been divided in several ways in the literature, we divided each subset into training and test sets with a 2:1 ratio, following [28,33]. Then, 20% of the training images were employed for validation. Lastly, all images belonging to the OpenALPR-EU dataset were used for testing in [7,60], while other public or private datasets were employed for training. Therefore, we also did not use any image of this dataset for training or validation, only for testing. An overview of the number of images used for training, testing and validation in each dataset can be seen in Table 6.
We discarded a few images from the Caltech Cars, UCSD-Stills, and ChineseLP datasets. Although most images in these datasets are reasonable, there are a few exceptions where (i) it is impossible to recognize the vehicle's LP due to occlusion, lighting or image acquisition problems, etc.; or (ii) the image does not represent real ALPR scenarios, for example, a person holding an LP. Three examples are shown in Figure 6. Such images were also discarded in [60].
It is worth noting that we did not discard any image from the test set of the UCSD-Stills dataset and used the same number of test images in the Caltech Cars dataset. In this way, we can fairly compare our results with those obtained in previous works. In fact, we used fewer images from those datasets to train and validate our networks. In the ChineseLP dataset, on the other hand, we first discard the few images with problems and then split the remaining ones using the same protocol as the SSIG-SegPlate and UFPR-ALPR datasets (i.e., 40/20/40% for training, validation and testing, respectively) since, in the literature, a division protocol has not yet been proposed for the ChineseLP dataset, to the best of our knowledge.
To avoid an overestimation or bias in the random division of the images into the training, validation and test subsets, we report in each stage the average result of five runs of the proposed approach (note that most works in the literature, including recent ones [7,12,15,17,33], report the results achieved in a single run only). Thus, at each run, the images of the datasets that do not have an evaluation protocol were randomly redistributed into each subset (training/validation/test). In the UCSD-Stills, SSIG-SegPlate and UFPR-ALPR datasets, we employed the same division (i.e., the one proposed along with the respective dataset) in all runs.
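The repeated random division can be sketched as follows for a dataset without an official protocol, using the 40/20/40% proportions adopted for ChineseLP. The function names, the use of the run index as random seed, and the `evaluate` callback (standing in for a full train-and-test cycle) are hypothetical.

```python
import random
from statistics import mean


def split_dataset(items, train_frac=0.4, val_frac=0.2, seed=0):
    """Randomly split items into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def average_over_runs(items, evaluate, runs=5):
    """Average the result of evaluate(train, val, test) over several random splits.

    evaluate is a hypothetical callback that trains on train/val and returns
    the rate measured on test; a different seed is used in each run.
    """
    return mean(evaluate(*split_dataset(items, seed=run)) for run in range(runs))
```

For datasets that come with their own protocol, the same split would simply be reused in every run instead of calling `split_dataset` with a new seed.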
As pointed out in Section 4.1, we manually labeled the vehicles in the background of the images in cases where their LPs are legible. Nevertheless, in the testing phase, we considered only the vehicles/LPs originally labeled in the datasets that have annotations to perform a fair comparison with previous works.

RESULTS AND DISCUSSION
In this section, we report the experiments carried out to verify the effectiveness of the proposed ALPR system. We first assess the detection stages separately since the regions used in the LP recognition stage are from the detection results, rather than cropped directly from the ground truth. This is done to provide a realistic evaluation of the entire ALPR system, in which well-performed vehicle and LP detections are essential for achieving outstanding recognition results. Afterward, the entire ALPR system is evaluated and the results achieved are compared with those obtained in previous works and by commercial systems.

Vehicle Detection
In this stage, we employed a confidence threshold of 0.25 (defined empirically) to detect as many vehicles as possible, while avoiding high FP rates and, consequently, a higher cost of the proposed ALPR system. The following parameters were used for training the network: 60K iterations (max batches) and learning rate = [10⁻³, 10⁻⁴, 10⁻⁵] with steps at 48K and 54K iterations.
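The learning rate schedule above is a step schedule: the rate starts at 10⁻³ and is divided by ten at 48K and again at 54K iterations. A minimal sketch follows; the function name is ours, and applying the new rate from the step iteration onward is an assumption about the boundary behavior.

```python
def learning_rate(iteration, base_lr=1e-3, steps=(48_000, 54_000), scale=0.1):
    """Step schedule: multiply the rate by scale at each step already reached."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale
    return lr


MAX_ITERATIONS = 60_000
for it in (0, 47_999, 48_000, 54_000, MAX_ITERATIONS - 1):
    print(it, learning_rate(it))
```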
The vehicle detection results are presented in Table 7. In the average of five runs, our approach achieved a recall rate of 99.92% and a precision rate of 98.37%. It is remarkable that the network was able to correctly detect all vehicles (i.e., recall = 100%) in 5 of the 8 datasets used in the experiments. Some detection results are shown in Figure 7. As can be seen, well-located predictions were attained on vehicles of different types and under different conditions.
Fig. 7. Observe that vehicles of different types were correctly detected regardless of lighting conditions (daytime and nighttime), occlusion, camera distance, and other factors.
To the best of our knowledge, with the exception of the preliminary version of this work [17], there is no other work in the ALPR context where both cars and motorcycles are detected at this stage. This is of paramount importance since motorcycles are one of the most popular transportation means in metropolitan areas, especially in Asia [39]. Although motorcycle LPs may be correctly located by LP detection approaches that work directly on the frames, they can be detected with fewer false positives if the motorcycles are detected first [61].
The precision rates obtained by the network were not higher only because of unlabeled vehicles present in the background of the images, especially in the AOLP and SSIG-SegPlate datasets. Three examples are shown in Figure 8a. In Figure 8b, we show some of the few cases where our network failed to detect one or more vehicles in the image. As can be seen, such cases are challenging since only a small part of each undetected vehicle is visible.
Fig. 8. As can be seen in (a), the predicted FPs are mostly unlabeled vehicles in the background. In (b), one can see that the vehicles not predicted by the network (i.e., the FNs) are predominantly those occluded or in the background.

License Plate Detection and Layout Classification
In Table 8, we report the results obtained by the modified Fast-YOLOv2 network in the LP detection and layout classification stage.
As we consider only one LP per vehicle image, the precision and recall rates are identical. The average recall rate obtained in all datasets was 99.51% when disregarding the vehicles not detected in the previous stage and 99.45% when considering the entire test set. This result is particularly impressive since we considered as incorrect the predictions in which the LP layout was incorrectly classified with a high confidence value, even in cases where the LP position was predicted correctly. According to Figure 9, the proposed approach was able to successfully detect and classify LPs of various layouts, including those with few examples in the training set such as LPs issued in the U.S. states of Connecticut and Utah, or LPs of motorcycles in Taiwan. It should be noted that, in some cases, the LP occupies a very small portion of the original image and therefore the vehicle detection stage is crucial for the effectiveness of our ALPR system.
Some images where our network failed either to detect the LP or to classify the LP layout are shown in Figure 10. As can be seen in Figure 10a, our network failed to detect the LP in cases where there is a textual block very similar to an LP in the vehicle patch, or even when the LP of another vehicle appears within the patch (a single case in our experiments). This is due to the fact that one vehicle can be almost totally occluded by another. Regarding the errors in which the LP layout was misclassified, they occurred mainly in cases where the LP is considerably similar to LPs of other layouts. For example, the left image in Figure 10b shows a European LP (which has exactly the same colors and number of characters as standard Chinese LPs) incorrectly classified as Chinese.
Fig. 9. LPs correctly detected and classified by the proposed approach. Observe the robustness for this task regardless of vehicle type, lighting conditions, camera distance, and other factors.
It is important to note that it is still possible to correctly recognize the characters in some cases where our network has failed at this stage. For example, in the right image in Figure 10a, the detected region contains exactly the same text as the ground truth (i.e., the LP). Moreover, a Brazilian LP classified as European (e.g., the middle image in Figure 10b) can still be correctly recognized in the next stage since the only post-processing rule we apply to European LPs is that they have between 5 and 8 characters.
As mentioned earlier, in this stage we disabled the color-related data augmentation of the Darknet framework. In this way, we eliminated more than half of the layout classification errors obtained when the model was trained using images with changed colors. We believe this is due to the fact that the network leverages color information (which may be distorted with some data augmentation approaches) for layout classification, as well as other characteristics such as the position of the characters and symbols on the LP.
Fig. 10. Some images in which our network failed either to detect the LP or to classify the LP layout.

License Plate Recognition (Overall Evaluation)
As in the vehicle detection stage, we first evaluated different confidence threshold values in the validation set in order to miss as few characters as possible, while avoiding high FP rates. We adopted a 0.5 confidence threshold for all LPs except European ones, where a higher threshold (i.e., 0.65) was adopted since European LPs can have up to 8 characters and several FPs were predicted on LPs with fewer characters when using a lower confidence threshold.
We considered the '1' and 'I' characters as a single class in the assessments performed in the SSIG-SegPlate and UFPR-ALPR datasets, as those characters are identical but occupy different positions on Brazilian LPs. The same procedure was done in [7,17].
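The two evaluation choices described above, a layout-dependent confidence threshold and the merging of '1' and 'I' for Brazilian LPs, can be expressed compactly. This is a sketch with names of our own choosing; the layout labels used as dictionary keys are assumptions.

```python
# Thresholds reported in the text: 0.65 for European LPs, 0.5 otherwise.
DEFAULT_CHAR_THRESHOLD = 0.5
CHAR_THRESHOLD_BY_LAYOUT = {"European": 0.65}


def char_threshold(layout):
    """Confidence threshold applied to character detections of a given layout."""
    return CHAR_THRESHOLD_BY_LAYOUT.get(layout, DEFAULT_CHAR_THRESHOLD)


def same_plate(predicted, ground_truth, merge_1_and_i=False):
    """Compare two LP strings, optionally treating '1' and 'I' as one class."""
    if merge_1_and_i:
        predicted = predicted.replace("I", "1")
        ground_truth = ground_truth.replace("I", "1")
    return predicted == ground_truth


# With merging enabled (SSIG-SegPlate and UFPR-ALPR), 'I' and '1' are not told apart.
print(same_plate("A1B1234", "AIB1234", merge_1_and_i=True))  # True
```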
For each dataset, we compared the proposed ALPR system with state-of-the-art methods that were evaluated using the same protocol as the one described in Section 4.2. In addition, our results are compared with those obtained by Sighthound [60] and OpenALPR [62], which are two commercial systems often used as baselines in the ALPR literature [7,8,10,11,17]. According to the authors, both systems are robust for the detection and recognition of LPs of different layouts. It is important to emphasize that although the commercial systems were not tuned specifically for the datasets employed in our experiments, they are trained in much larger private datasets, which is a great advantage, especially in deep learning approaches.
OpenALPR contains specialized solutions for LPs from different regions (e.g., China, Europe, among others) and the user must enter the correct region before using its API, that is, it requires prior knowledge regarding the LP layout. Sighthound, on the other hand, uses a single model/approach for LPs from different countries/regions, as does the proposed system.
The results obtained in all datasets by the proposed ALPR system, previous works and commercial systems are shown in Table 9. In the average of five runs, across all datasets, our end-to-end system correctly recognized 96.8% of the LPs, outperforming Sighthound and OpenALPR by 9.1% and 6.3%, respectively. More specifically, the proposed system outperformed both previous works and commercial systems in the ChineseLP, OpenALPR-EU, SSIG-SegPlate and UFPR-ALPR datasets, and yielded competitive results to those attained by the baselines in the other datasets.
The proposed system attained results similar to those obtained by OpenALPR in the Caltech Cars dataset (98.7% against 99.1%, which represents a difference of less than one LP per run, on average, as there are only 46 test images), even though our system does not require prior knowledge. Regarding the EnglishLP dataset, our system performed better than the best baseline [45] in 2 of the 5 runs. Although we used the same number of images for testing, in [45] the dataset was divided only once and the images used for testing were not specified. In the UCSD-Stills dataset, both commercial systems reached a recognition rate of 98.3% while our system achieved 98% on average (with a standard deviation of 1.4%). Lastly, in the AOLP dataset, the proposed approach obtained similar results to those reported by [33], even though in their work the LP patches used as input in the LP recognition stage were cropped directly from the ground truth (simplifying the problem, as explained in Section 2); in other words, they did not take into account vehicles or LPs not detected in the earlier stages, nor background noise in the LP patches due to less accurate LP detections.
To evaluate the impact of classifying the LP layout prior to LP recognition (i.e., our main proposal), we also report in Table 9 the results obtained when assuming that all LP layouts were classified as undefined and that a generic approach (i.e., without heuristic rules) was employed in the LP recognition stage. The mean recognition rate was improved by 1.9%. We consider this strategy (layout classification + heuristic rules) essential for accomplishing outstanding results in datasets that contain LPs with fixed positions for letters and digits (e.g., Brazilian and Chinese LPs), as the recognition rates attained in the ChineseLP, SSIG-SegPlate and UFPR-ALPR datasets were improved by 3.6% on average.
The robustness of our ALPR system is remarkable since it achieved recognition rates higher than 95% in all datasets except UFPR-ALPR (where it outperformed the best baseline by 7.5%). The commercial systems, on the other hand, achieved similar results only in the Caltech Cars and UCSD-Stills datasets, which contain exclusively American LPs, and performed poorly (i.e., recognition rates below 85%) in at least two datasets. This suggests that the commercial systems are not so well trained for LPs of other layouts.
Although OpenALPR achieved better results than Sighthound (on average across all datasets), the latter system can be seen as more robust than the former since it does not require prior knowledge regarding the LP layout. In addition, OpenALPR does not support Taiwanese LPs. For this reason, we tried to employ OpenALPR solutions designed for LPs from other countries (including China) in the experiments performed in the AOLP dataset; however, very low detection and recognition rates were obtained.
Figure 11 shows some examples of LPs that were correctly recognized by the proposed approach. As can be seen, our system can generalize well and correctly recognize LPs of different layouts, even when the images were captured under challenging conditions. It is noteworthy that, unlike [17,33,45], the exact same networks were applied to all datasets; in other words, no specific training procedure was used to tune the networks for a given dataset or layout class. Some LPs in which our system failed to correctly detect/recognize all characters are shown in Figure 12. As one may see, the errors occurred mainly in challenging LP images, where even humans can make mistakes since, in some cases, one character might become very similar to another due to the inclination of the LP, the LP frame, shadows, blur, among others. Note that, in this work, we did not apply preprocessing techniques to the LP image in order not to increase the overall cost of the proposed system.
Table 9. Recognition rates (%) obtained by the proposed system, previous works, and commercial systems in all datasets used in our experiments. To the best of our knowledge, in the literature, only algorithms for LP detection and character segmentation were evaluated in the Caltech Cars, UCSD-Stills and ChineseLP datasets. Thus, our approach is compared only with the commercial systems in these datasets. Average of the proposed system across all datasets: 96.8 ± 1.0. * The proposed ALPR system assuming that all LP layouts were classified as undefined (i.e., without layout classification). ** The LP patches for the LP recognition stage were cropped directly from the ground truth in [33].
Fig. 11. Examples of LPs that were correctly recognized by the proposed ALPR system. From top to bottom: American, Brazilian, Chinese, European and Taiwanese LPs.
Fig. 12. Examples of LPs that were incorrectly recognized by the proposed ALPR system. The ground truth is shown in parentheses.
In Table 10, we report the time required for each network in our system to process an input. As in [6,17], the reported time is the average time spent processing all inputs in each stage, assuming that the network weights are already loaded and that there is a single vehicle in the scene. It is remarkable that although a deep CNN model (i.e., YOLOv2 with some modifications) is used for vehicle detection, our system is still able to process images at 73 FPS on a high-end GPU. This is sufficient for real-time usage, as commercial cameras generally record videos at 30 FPS.
It should be noted that practically all images from the datasets used in our experiments contain only one labeled vehicle. However, to perform a more realistic analysis of the execution time, we listed in Table 11 the time required for the proposed system to process images assuming that there is a certain number of vehicles in every image (note that vehicle detection is performed only once, regardless of the number of vehicles in the image). According to the results, our system is able to process more than 30 FPS even when there are 4 vehicles in the scene. This information is relevant since some ALPR approaches, including the one proposed in our previous work [17], can only process frames in real time if there is at most one vehicle in the scene.
The proposed approach achieved an outstanding trade-off between accuracy and speed, unlike others recently proposed in the literature. For example, the methods proposed in [6,8] are capable of processing more images per second than our system but reached poor recognition rates (i.e., below 65%) in at least one dataset in which they were evaluated. On the other hand, impressive results were achieved on different scenarios in [7,12,15]. However, the methods presented in these works are computationally expensive and cannot be applied in real time. The Sighthound and OpenALPR commercial systems do not report the execution time.
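The dependence of the frame rate on the number of vehicles follows from the pipeline structure: vehicle detection runs once per image, whereas LP detection and LP recognition run once per vehicle. The sketch below illustrates this relation only; the per-stage times used in the example are illustrative placeholders chosen by us and are not the values reported in Tables 10 and 11.

```python
def frames_per_second(n_vehicles, vehicle_ms, lp_ms, recognition_ms):
    """FPS when vehicle detection runs once and the other stages once per vehicle.

    All times are in milliseconds per input.
    """
    total_ms = vehicle_ms + n_vehicles * (lp_ms + recognition_ms)
    return 1000.0 / total_ms


# Placeholder stage times (NOT the measured values from the paper).
STAGE_MS = {"vehicle_ms": 8.0, "lp_ms": 3.0, "recognition_ms": 2.0}

for n in range(1, 5):
    print(n, round(frames_per_second(n, **STAGE_MS), 1))
```

Under this model the per-vehicle stages dominate as the scene gets more crowded, which is why keeping the LP detection and recognition networks light matters for real-time operation with several vehicles.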
It is important to highlight the number of experiments carried out to develop the proposed ALPR system. More than 50 models were evaluated (with different input sizes, number of filters and layers) and combined in several ways. It takes about two and a half days to train our networks on an NVIDIA Titan Xp GPU (a single run). In the testing phase, unlike most works in the literature (which report the results achieved in a single run), we reported in each stage the average result of five runs of our approach to avoid an overestimation or bias in the random division of the images into the training, validation and test subsets.

CONCLUSIONS
In this work, we presented an end-to-end, efficient and layout-independent ALPR system using YOLO-based models at all stages. We performed several data augmentation tricks and modified the chosen networks to achieve the best speed/accuracy trade-off at each stage. The proposed system contains a unified approach for LP detection and layout classification to improve the recognition results using post-processing rules. This strategy was essential for accomplishing outstanding results since, depending on the LP layout class, we avoided errors in characters that are often misclassified and also in the number of predicted characters to be considered.
Our system achieved an average recognition rate of 96.8% across eight public datasets used in the experiments, outperforming Sighthound and OpenALPR by 9.1% and 6.3%, respectively. More specifically, the proposed system outperformed both previous works and commercial systems in the ChineseLP, OpenALPR-EU, SSIG-SegPlate and UFPR-ALPR datasets, and yielded results competitive with those attained by the baselines in the other datasets.
We also carried out experiments to measure the execution time. Compared to previous works, our system achieved an impressive trade-off between accuracy and speed. Specifically, even though the proposed approach achieves high recognition rates (i.e., above 95%) in all datasets except UFPR-ALPR (where it outperformed the best baseline by 7.5%), it is able to process images in real time even when there are 4 vehicles in the scene.
Another important contribution is that we manually labeled the position of the vehicles, LPs and characters, as well as their classes, in all datasets used in this work that have no annotations or that contain labels only for part of the ALPR pipeline. Note that the labeling process took a considerable amount of time since there are several bounding boxes to be labeled on each image (precisely, we manually labeled 38,351 bounding boxes on 6,239 images). These annotations are publicly available to the research community, assisting the development and evaluation of new ALPR approaches as well as the fair comparison among published works.
As future work, we intend to explore new CNN architectures to further optimize (in terms of speed) vehicle detection. We also want to explore the vehicle's make and model in the ALPR pipeline since some datasets provide such information. Finally, we plan to correct the alignment of the detected LPs and also rectify them in order to achieve even better results in the LP recognition stage. Some methods have been employed for these tasks in the literature [7,14], generally improving the accuracy of LP recognition. Accordingly, we intend to evaluate the effect of different approaches in our system from both the speed and accuracy points of view, as such approaches can be computationally expensive.