300 GHz Radar Object Recognition based on Deep Neural Networks and Transfer Learning

For high resolution scene mapping and object recognition, optical technologies such as cameras and LiDAR are the sensors of choice. However, for robust future vehicle autonomy and driver assistance in adverse weather conditions, improvements in automotive radar technology, and the development of algorithms and machine learning for robust mapping and recognition are essential. In this paper, we describe a methodology based on deep neural networks to recognise objects in 300GHz radar images, investigating robustness to changes in range, orientation and different receivers in a laboratory environment. As the training data is limited, we have also investigated the effects of transfer learning. As a necessary first step before road trials, we have also considered detection and classification in multiple object scenes.


Introduction
All major car manufacturers are evaluating LiDAR, passive optical and radar sensing capabilities for automotive applications [1], aiming beyond advanced driver-assistance systems (ADAS) such as automatic cruise control, parking assistance and collision avoidance, towards full automotive autonomy. Each technology has benefits and drawbacks, but a key benefit of automotive radar is an operating range of up to 150 m or more, and an ability to function in adverse weather, such as fog, rain or mist. However, radar sensors offer much lower resolution than optical technologies. Current automotive radar systems operate at 24 GHz and 79 GHz, with a typical bandwidth of 4 GHz. They are able to perform low resolution mapping and detection in relatively uncluttered scenes, but object recognition remains very challenging. Deep Neural Networks (DNNs) have proven to be a powerful technique for image recognition on natural images [2][3][4]. In contrast to manual selection of suitable features followed by statistical classification, DNNs optimise the learning process to find a wider range of patterns, achieving better results than before on quite complicated scenarios, including for example the ImageNet challenge first introduced in 2009 [5], which has at the time of writing more than 2000 object categories and 14 million images.
In this paper, we wish to assess the capability of DNNs applied to images of objects acquired by a prospective 300 GHz automotive radar with an operating bandwidth of 20 GHz. Our experiments, conducted in a laboratory setting from late 2017, use a small database of 6 isolated objects to assess the current capability. Later, in August 2019, we trained our neural networks in a more challenging scenario with multiple objects in the same scene, to assess the ability to both detect and classify objects in the presence of both uniform and cluttered background.
The principal contributions of the work are to assess the robustness of the DNNs to variations in viewing angle, range and specific receiver; since we have limited data, we also investigated how transfer learning can improve the results. We also evaluate the performance of the trained neural networks in a more challenging scenario with multiple objects in the same scene. Using DNNs, we classify these objects with minimum domain knowledge about the sensors and objects being sensed. The scenes are static; we do not use range-Doppler spectra to classify images, but perform experiments on the radar power data alone.
Fig. 1: Methodology developed using deep convolutional neural networks to process data acquired by a prototype high resolution 300 GHz short range radar [6]. Steps: 1. Radar signal processing: Cartesian radar image generation. 2. Bounding box annotation to crop the object region. 3. Deep neural network and transfer learning for radar based recognition.

Related Work
Together with scene mapping, object recognition is a necessary capability for autonomous cars. When we create a map of the immediate environment, we also need to identify key actors, such as pedestrians and vehicles, and other street furniture, traffic signs, walls, junctions and so on. For actors, we would also wish to predict their movement in order to create a safe system, and identity is a key component of such prediction. The use of deep convolutional neural networks (DCNNs) [2,7] for large scale image recognition has significantly changed the field of computer vision. Although questions remain on verifiability [8], confidence in the results [9], and on the effects of adversarial examples [10], the best results for correct identification on large image datasets have been dominated by DCNN algorithms. The development of GPUs and large annotated datasets has helped the popularity of deep learning methods in computer vision.
Of course, the results on natural image data such as ImageNet can be replicated to a large extent using automotive data, such as the KITTI benchmarks [11]. However, in adverse weather, those sensors have poor performance, so we wish to examine the potential of radar data for reliable recognition. This is especially challenging; most automotive radars sense in two dimensions only, azimuth and range, although research is underway to develop full 3D radar [12]. Although range resolution can be of the order of centimetres, azimuth resolution is poor, typically 1-2 degrees, although again there is active research to improve this [6]. Natural image recognition relies to a great extent on surface detail, but the radar imaging of surfaces is much less well understood and is variable, and full electromagnetic modelling of complex scenes is extremely difficult.
There has been some recent work in applying deep learning techniques to radar images for automotive applications. Wohler et al. [13][14] have used Long Short-Term Memory neural networks to create a methodology for classifying road actors in the automotive scenario. Lombacher et al. [15] also used deep learning techniques to segment cars against other objects. These examples use only power data. Why not use the motion data readily available from Doppler shift? Rohling et al. [16] used a 24 GHz radar to classify pedestrians by analysing the Doppler spectrum and range profile. Similarly, Bartsch et al. [17] classified pedestrians using the area and shape of the object and Doppler spectrum features. They analysed the probability of each feature and used a simple decision model. They achieved 95% classification rates for optimal scenarios, but this dropped to 29.4% when the pedestrian was in close proximity to cars, due to low resolution from the radar sensors. Likewise, Angelov et al. [18] investigated the capability of different DCNNs to recognise cars, people and bicycles, with test accuracies ranging from 44% to 88% depending on the problem. The conclusion from these studies is that prototypical motion can be a powerful aid to object identification, but with powerful caveats. First, a car is still a car if stationary at traffic lights, and second, for a moving ego-vehicle the whole scene is moving, not just readily separable targets.

Applying Deep Neural Networks to 300 GHz Radar Data

Objective
The main objective of this work is to design and evaluate a methodology for object classification in 300 GHz radar data using DCNNs, as illustrated schematically in Figure 1. This is a prototype radar system; we have limited data, so we have employed data augmentation and transfer learning to examine whether these improve our recognition success. To verify the robustness of our approach, we have assessed recognition rates using different receivers at different positions, and objects at different orientations and ranges. We also evaluated the performance of the method in a more challenging scenario with multiple objects per scene.

300 GHz FMCW Radar
A current, typical commercial vehicle radar uses MIMO technology at 77-79 GHz with up to 4 GHz IF bandwidth, a range resolution of 4.3-35 cm depending on target range (20-80 m), and an azimuth resolution of 15 degrees [19]. This equates to a cross range resolution of ≈ 4 m at 15 m, such that a car will occupy just one cell in the radar image. This is clearly not sensible for object recognition on the basis of radar cross section. In this work, we collected data using an FMCW 300 GHz scanning radar designed at the University of Birmingham [12]. The main advantage of the increased resolution is a better radar image, which may lead to more reliable object classification. The 300 GHz radar used in this work has a bandwidth of 20 GHz, which equates to 0.75 cm range resolution. The azimuth resolution is 1.2°, which corresponds to 20 cm at 10 m. The parameters for the 300 GHz sensor used in this work can be seen in Table 1.
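The resolution figures quoted above follow from two standard relations: range resolution c / (2B) for bandwidth B, and cross range resolution as the arc length subtended by the azimuth beamwidth at a given range. The short sketch below reproduces the arithmetic; the function names are illustrative and not part of the original work.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz):
    """Range resolution of an FMCW radar, c / (2 * B), in metres."""
    return C / (2.0 * bandwidth_hz)


def cross_range_resolution(azimuth_deg, range_m):
    """Arc length (metres) subtended by the azimuth beamwidth at a given range."""
    return math.radians(azimuth_deg) * range_m


print(f"300 GHz radar, 20 GHz bandwidth: {range_resolution(20e9) * 100:.2f} cm")
print(f"1.2 degrees at 10 m: {cross_range_resolution(1.2, 10.0) * 100:.1f} cm")
print(f"15 degrees at 15 m: {cross_range_resolution(15.0, 15.0):.2f} m")
```

These evaluate to roughly 0.75 cm, 21 cm and 3.9 m, in line with the values stated in the text.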
Table 1: 300 GHz FMCW radar parameters for the system described in [12].
The raw data captured by the 300 GHz radar is a time-domain signal at each azimuth direction. To transform the raw signal into an image, two steps were performed. The first step is to apply a Fast Fourier Transform (FFT) to each azimuth signal to create a range profile. In the second step, the original polar image is converted to Cartesian coordinates as shown in Figure 2. Before training the neural network with this data, we applied whitening by subtracting the mean value of the image data, as this helps the stochastic gradient descent (SGD) to converge faster.
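The processing chain described above (FFT per azimuth, polar to Cartesian resampling, mean subtraction) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Hann window, the nearest-neighbour resampling, the grid geometry and all function names are choices made here for the example.

```python
import numpy as np


def range_profiles(raw):
    """raw: (n_azimuth, n_samples) real time-domain signals.
    Returns the power in each range bin for every azimuth direction."""
    window = np.hanning(raw.shape[1])            # assumed window to reduce sidelobes
    spectrum = np.fft.rfft(raw * window, axis=1)
    return np.abs(spectrum) ** 2


def polar_to_cartesian(polar, azimuths_rad, max_range, size):
    """Nearest-neighbour resampling of a polar (azimuth, range) image onto an
    x-y grid. Assumes uniformly spaced, ascending azimuths."""
    n_az, n_rng = polar.shape
    xs = np.linspace(-max_range, max_range, size)    # cross-range axis
    ys = np.linspace(0.0, max_range, size)           # down-range axis
    x, y = np.meshgrid(xs, ys)
    r = np.hypot(x, y)
    theta = np.arctan2(x, y)                         # 0 rad points straight ahead
    rng_idx = np.round(r / max_range * (n_rng - 1)).astype(int)
    az_span = azimuths_rad[-1] - azimuths_rad[0]
    az_idx = np.round((theta - azimuths_rad[0]) / az_span * (n_az - 1)).astype(int)
    valid = (rng_idx < n_rng) & (az_idx >= 0) & (az_idx < n_az)
    out = np.zeros((size, size))                     # cells outside the scan stay 0
    out[valid] = polar[az_idx[valid], rng_idx[valid]]
    return out


def subtract_mean(image):
    """Mean subtraction applied before training."""
    return image - image.mean()


# Synthetic example: one tone per azimuth, standing in for a single target.
n = np.arange(256)
raw = np.tile(np.cos(2 * np.pi * 40 * n / 256), (8, 1))
profiles = range_profiles(raw)
azimuths = np.radians(np.linspace(-30.0, 30.0, 8))
image = subtract_mean(polar_to_cartesian(profiles, azimuths, max_range=10.0, size=64))
```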

Experimental Design and Data Collection
The main objective is to establish whether the proposed methodology has the potential to discriminate between a limited set of prototypical objects in a laboratory scenario, prior to collecting wild data in a scaled down or alternate radar system. We wanted to gain knowledge of what features were important in 300 GHz radar data, and whether such features were invariant to the several possible transformations. The objects we decided to use were a bike, trolley, mannequin, sign, stuffed dog and cone. These objects contain a variety of shapes and materials which to some extent typify the expected roadside radar images that we might acquire from a vehicle.
The equipment for automatic data collection included a turntable to acquire samples every 4 degrees, covering all aspect angles, at two stand-off distances, 3.8 m and 6.3 m. The sensors are shown in Figure 3. In collecting data, we used 300 GHz and 150 GHz radars, a ZED stereo camera and a Velodyne HDL-32e LiDAR, but in this paper only data from the 300 GHz radar is considered. The 300 GHz radar has 1 transmitter and 3 receivers. The 3 receivers were used to compare the object signatures at different heights. We used a carpet below the objects to avoid multi-path and ground reflections. Table 2 summarises how many samples were captured from each object at each range. Since we have 3 receivers, we have 1425 images from each range and 2850 images in total. In Figure 4 we can see sample images from all objects at different ranges using receiver 3. All the collected images were labelled with the correct object identity, irrespective of viewing range, angle and receiver height. A fixed size bounding box of 400 × 400 cells, which corresponds to 3 m², was cropped from the image with the object in the middle.

Neural Network Architecture
We can formalize a neural network as a function with its weights to be learned:
$$y = f(x_l; W_l) \qquad (1)$$
where $y$ is the output, $f$ is the neural network function, $W_l$ is a set of weights at layer $l$ and $x_l$ is the input at layer $l$. The neural network needs to be able to learn weights $W_l$ that generalise to any input. The architecture used has several layers: convolutional layers, rectified linear units (ReLU), max pooling, dropout layers and softmax [20].
Convolution Layer: The main layer developed for deep neural networks when applied to computer vision is the convolutional layer. This learns convolutional masks which are used to extract features. In Eq. 2, M is the mask width, N is the mask height, D is the mask depth, W l is the convolution mask learned, b l is the bias and X is the image.
Rectified Linear Unit: The activation function f is usually a non-linear function that maps the output of the current layer. A simple method that is computationally cheap and approximates more complicated non-linear functions, such as tanh and sigmoid, is the Rectified Linear Unit (ReLU):
$$f(X) = \max(0, X) \qquad (3)$$
Eq. 3 shows the ReLU function, where $X$ is the output from the current layer.
Max Pooling: To reduce the image dimensions, max pooling can be used, simply taking a region and extracting the maximum value. It uses the maximum value since these are the values which have a better activation in the previous layers.
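The three operations described so far (convolution with a learned mask, ReLU and max pooling) can be illustrated with a small NumPy sketch. It is written for clarity rather than speed, and makes assumptions not stated in the source: a single mask, "valid" borders, a stride of 1, and the cross-correlation form of convolution that most deep learning frameworks use. The function names and toy inputs are hypothetical.

```python
import numpy as np


def conv2d_valid(X, W, b):
    """X: (height, width, D) image. W: (N, M, D) mask (height, width, depth).
    b: scalar bias. Returns the 'valid' feature map for one mask."""
    n_rows, n_cols, _ = W.shape
    out_h = X.shape[0] - n_rows + 1
    out_w = X.shape[1] - n_cols + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i:i + n_rows, j:j + n_cols, :]
            out[i, j] = np.sum(patch * W) + b     # weighted sum over the mask plus bias
    return out


def relu(X):
    """Rectified linear unit: negative activations are set to zero."""
    return np.maximum(0.0, X)


def max_pool(X, size=2):
    """Non-overlapping max pooling over size x size regions of a 2-D map."""
    h, w = X.shape[0] // size, X.shape[1] // size
    trimmed = X[:h * size, :w * size]             # drop rows/cols that do not fit
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))


image = np.random.default_rng(0).normal(size=(8, 8, 1))   # toy single-channel input
mask = np.ones((3, 3, 1)) / 9.0                            # toy averaging mask
features = max_pool(relu(conv2d_valid(image, mask, b=0.0)))
print(features.shape)  # (3, 3)
```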
Dropout: The dropout technique was introduced in [2,21]. This technique randomly sets a fraction of the unit activations to 0 during training, forcing the neural network to find alternative paths rather than relying on any single unit. This helps to reduce overfitting.
Softmax: The softmax layer converts the output from a previous layer into pseudo-probabilities. Thus, for each class, it gives the likelihood of that class. Eq. 4 shows the softmax layer,
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \qquad (4)$$
where $x_i$ is the output for the current class and the sum over $j$ runs over the outputs for all classes. Hence, it normalises the output vector so that it sums to 1.
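As a small illustration of the softmax normalisation described above, the sketch below converts a vector of class scores into pseudo-probabilities. Subtracting the maximum before exponentiating is a common numerical-stability step added here; it does not change the result and is not something the source mentions.

```python
import numpy as np


def softmax(x):
    """Convert a 1-D vector of class scores into values that sum to 1."""
    shifted = x - np.max(x)     # stability: avoids overflow in exp for large scores
    e = np.exp(shifted)
    return e / e.sum()


scores = np.array([1.0, 2.0, 3.0])   # toy outputs for three classes
probs = softmax(scores)
print(probs, probs.sum())
```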
The neural network used in this work is A-ConvNet [22], shown in Figure 5. We have re-coded and implemented this network ourselves from the description given in [22] using the Keras framework [23]. The only modification we have made is in the last layer, which has 6 convolutional filters, one for each class in our classification problem. This architecture is fully convolutional and achieved state-of-the-art results for the MSTAR radar dataset [24]. The original A-ConvNet input size is 88 × 88, so our input data was re-sized using bilinear interpolation to fit the original model. There were two principal reasons for using A-ConvNet. First, it achieved excellent results on the radar MSTAR dataset. Second, we wanted to investigate transfer learning using the same network and sharing the initial weights. Our intuition was that such transfer was more likely to be successful in images of the same modality, i.e. radar, even though their sensing specifications were markedly different.
To train our neural network, we used Stochastic Gradient Descent (SGD). SGD updates the weights of the network depending on the gradient of the function that represents the current layer, as in Eq. 5.
In Eq. 5, η is the momentum, α is the learning rate, t is the current time step, W defines the weights of the network and ∇f(x; W) is the derivative of the function that represents the network. To compute the derivative for all layers, we need to apply the chain rule, so that we can compute the gradient through the whole network. The loss function minimised was the categorical cross-entropy (Eq. 6),
$$L(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i \qquad (6)$$
where $\hat{y}$ is the predicted vector from the softmax output and $y$ is the ground truth. The parameters used in all training procedures in all experiments are given in Table 3. For all experiments we used 20% of the training data as validation data, and we used the model with the best results on the validation set to evaluate the performance.
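To make the roles of the momentum η and learning rate α concrete, the following sketch shows one common form of the SGD-with-momentum update together with the categorical cross-entropy loss. The exact form of Eq. 5 is not reproduced here, so the velocity-based update below is an assumption, and the learning rate and momentum values are illustrative rather than those of Table 3.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot ground truth and a softmax output."""
    return float(-np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0))))


def sgd_momentum_step(W, velocity, grad, lr, momentum):
    """One update: the velocity accumulates past gradients, then moves the weights."""
    velocity = momentum * velocity - lr * grad
    return W + velocity, velocity


# Toy problem: minimise 0.5 * ||W||^2, whose gradient is simply W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(200):
    grad = W
    W, v = sgd_momentum_step(W, v, grad, lr=0.1, momentum=0.9)

loss = categorical_cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.8, 0.1]))
print(np.linalg.norm(W), loss)
```

In a real network the gradient would come from backpropagation through all layers, as described in the text, rather than from a closed-form expression.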

Data Augmentation
As shown in Table 2, we have limited training data. Using a restricted dataset, the DCNNs will easily overfit and be biased towards specific artefacts in the dataset. To help overcome this problem, we generated new samples to improve generalisation. The simple technique of random cropping takes as input the image data of size 128 × 128 and creates a random crop of 88 × 88. This random crop ensures that the target is not always fixed at the same location, so that the location of the object should not be a feature. We cropped each sample 8 times and also flipped all the images left to right to increase the size of the dataset and remove positional bias.
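The augmentation described above can be sketched as follows. This is one plausible reading, in which each of the 8 random crops is also mirrored; whether the flip was applied to the crops or to the original images is not specified in the text, and the function name is illustrative.

```python
import numpy as np


def augment(image, rng, crop=88, n_crops=8):
    """Return n_crops random crops of a 2-D image, each followed by its
    left-right mirror, stacked into one array."""
    h, w = image.shape
    out = []
    for _ in range(n_crops):
        top = rng.integers(0, h - crop + 1)       # random top-left corner
        left = rng.integers(0, w - crop + 1)
        patch = image[top:top + crop, left:left + crop]
        out.append(patch)
        out.append(patch[:, ::-1])                # left-right flip
    return np.stack(out)


rng = np.random.default_rng(0)
sample = np.arange(128 * 128, dtype=float).reshape(128, 128)   # toy 128 x 128 image
batch = augment(sample, rng)
print(batch.shape)  # (16, 88, 88)
```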

Experiments: classification of isolated objects
As described in Section 3.3, we used six objects imaged from ninety viewpoints with three receivers at two different ranges (3.8 m and 6.3 m). Four different experiments were performed, shown in Table 4. In each experiment we compared the results with and without Transfer Learning (TL) from MSTAR. In all training scenarios, data augmentation using random crops and image mirroring of the original data was performed. The metric used to evaluate the results is accuracy, i.e. the number of correct classifications divided by the total number of classifications in the test data.
Experiment 1 is the often used, best case scenario, with random selection from all available data to form training and test sets. Intuitively, the assumption is that the dataset contains representative samples of all possible cases. To perform this experiment we randomly selected 70% of the data as training data and 30% as test data. The results are summarised in Table 5.
Table 5: Accuracy for Experiment 1.
Random selection from all data: 99.7%
From Table 5 we conclude that the results are very high across the board, so it is possible to recognise objects in the 300 GHz radar images, with the considerable caveats that the object set is limited, the objects are at short range in an uncluttered environment, and, because the training samples are drawn from all conditions, any test image will with high probability have many near neighbours in the training data.

Experiment 2: Receiver/Height influence
The second experiment was designed to investigate the influence of the receiver antenna characteristics and height (see Figure 3). The potential problem is that the DCNNs may effectively overfit the training data to learn partly the antenna pattern from a specific receiver or a specific reflection from a certain height. All available possibilities were tried, i.e.
• Experiment 2.1 : Receivers 2 and 3 to train and receiver 1 to test
• Experiment 2.2 : Receivers 1 and 3 to train and receiver 2 to test
• Experiment 2.3 : Receivers 1 and 2 to train and receiver 3 to test
Table 6 shows the results for Experiment 2. In comparison with Experiment 1, the results are poorer, but not by an extent that we can determine as significant on a limited trial. This was expected from examination of the raw radar data, since there is not much difference in the signal signatures from the receivers at different heights. If anything, receiver 3, which was closest to the floor and so received more intense reflections, gave poorer results when used as the test case, which implies that the DCNNs did include some measure of receiver or view-dependent characteristics from the learnt data.
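The hold-out protocol of Experiment 2 amounts to splitting the sample indices by receiver, and the same masking idea applies to any other held-out attribute such as range or quadrant. The sketch below shows such a split together with the accuracy metric; the receiver labels and predictions are toy values, and the function names are illustrative.

```python
import numpy as np


def leave_one_receiver_out(receiver_ids, test_receiver):
    """Return (train_indices, test_indices) with one receiver held out for testing."""
    receiver_ids = np.asarray(receiver_ids)
    test_mask = receiver_ids == test_receiver
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


receivers = [1, 2, 3, 1, 2, 3]                       # toy receiver label per image
train_idx, test_idx = leave_one_receiver_out(receivers, test_receiver=3)
print(train_idx, test_idx)
print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))
```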

Experiment 3: Range influence
Clearly, the range of the object influences the return signature to the radar, as the received power will be less due to attenuation, and fewer cells are occupied by the target in the radar image due to degrading resolution over azimuth. Therefore, if the training data set is selected only at a range of 3.8 m, for example, to what extent are the features learnt representative of the expected data at 6.3 m (and vice versa)? Table 7 summarises the results achieved when we used one range to train the network, and the other range to test performance.
• Experiment 3.1 : Train with object at 3.8 m. Test with object at 6.3 m.
• Experiment 3.2 : Train with object at 6.3 m. Test with object at 3.8 m.
The key observation from Table 7 is that if we train the DCNNs at one specific range which has a given cell structure and received power distribution, and then test at a different range, the DCNNs is

Experiment 4: Orientation influence
The final experiment was designed to examine whether the neural network was robust to change of viewing orientation. Here, we used as training sets the objects in quadrants 1 and 3, and as test sets the objects in quadrants 2 and 4. The quadrants are illustrated in Fig. 6.
Fig. 6: Quadrants
The DCNN does not perform as well as in Experiments 1 and 2, with accuracy dropping to 92.5%. However, since we flipped the images left to right as a data augmentation strategy, the network was capable of learning the orientation features, as the objects exhibit near mirror symmetry, and one object, the cone, is identical from all angles. Therefore, we have to be hesitant in drawing conclusions about any viewpoint invariance within the network, as the experiments are limited and all objects have an axis or axes of symmetry (as do many objects in practice).
Together with Experiments 2 and 3, this experiment shows that it is necessary to take into account the differences in the acquisition process using different receivers at different ranges and orientations in training the network. While this is to some extent obvious and equally true for natural images, we would observe that the artefacts introduced by different radar receivers are much less standardised than those introduced by standard video cameras, so the results obtained in future may be far less easy to generalise. Although Experiment 2 only showed limited variation in such a careful context, we would speculate that the effects of multipath and clutter would be far more damaging than in the natural image case, as highlighted in [17].

Transfer Learning
As summarised in the previous section, we have a small dataset and there is the potential to learn characteristics of the restricted dataset rather than of the objects themselves. Therefore, we have investigated the use of transfer learning to help capture more robust features using a pre-existing dataset, i.e. to use prior knowledge from one domain and transfer it to another [25]. To apply transfer learning, we first trained the DCNN on the MSTAR (source) data, then the weights from that network were used as initial weights for a new DCNN trained on our own 300 GHz (target) data.
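The weight hand-over described above can be sketched independently of any deep learning framework: weights trained on the source data are copied into the target network wherever the layer shapes match, and any layer that differs (here the output layer, since MSTAR has 10 classes and the 300 GHz dataset has 6) is initialised afresh. The layer names, shapes and the random initialisation below are hypothetical and chosen only for illustration; they are not the actual A-ConvNet configuration.

```python
import numpy as np


def init_from_source(source_weights, target_shapes, rng, scale=0.01):
    """Build initial target weights: copy a source layer when its name and shape
    match, otherwise draw small random values for that layer."""
    target = {}
    for name, shape in target_shapes.items():
        src = source_weights.get(name)
        if src is not None and src.shape == tuple(shape):
            target[name] = src.copy()                       # transferred layer
        else:
            target[name] = rng.normal(0.0, scale, size=shape)  # re-initialised layer
    return target


rng = np.random.default_rng(0)
# Hypothetical weights after training on the source (MSTAR, 10 classes) data.
source = {
    "conv1": rng.normal(size=(5, 5, 1, 16)),
    "out": rng.normal(size=(3, 3, 128, 10)),
}
# Hypothetical target network: same first layer, 6 output classes.
target_shapes = {"conv1": (5, 5, 1, 16), "out": (3, 3, 128, 6)}
initial = init_from_source(source, target_shapes, rng)
print({name: w.shape for name, w in initial.items()})
```

Training on the target data then proceeds from `initial` with the same SGD procedure as before.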

Fig. 7: MSTAR Dataset
The MSTAR data is different in viewing angle and range compared to our own data, as shown in Figure 7. It was developed to recognise military targets using SAR images. The data contains 10 different military targets and around 300 images per target with similar elevation viewing angles of 15° and 17°. In total MSTAR has around 6000 images and is used widely by the radar community in order to verify classification algorithms.
The DCNN function in the source domain is defined by Eq. 7,
$$y_s = f(x_s; W_s) \qquad (7)$$
where $W_s$ are the weights of the network, and $x_s$ and $y_s$ are the input and output from the source domain. To learn the representation, an optimizer must be used, again stochastic gradient descent (SGD), expressed in Eq. 8,
$$W_s^{i+1} = \mathrm{SGD}(W_s^{i}, x_s, y_s) \qquad (8)$$
where SGD is a function which updates the weights of the neural network, as expressed in Eq. 5. Using the trained weights from our source domain as the initial weights for the target domain is expressed as Eq. 9. It is intended that the initial weights give a better, more robust initial representation which can be adapted to the smaller dataset.
$$W_t^{1} = \mathrm{SGD}(W_s, x_t, y_t), \quad \text{when } i = 0 \qquad (9)$$
We repeated Experiments 1, 2, 3 and 4 using transfer learning. The results are summarised in Table 9. As can be seen, transfer learning gives higher values for accuracy in the majority of cases, but not all. The MSTAR dataset is a much bigger dataset, and although it exhibits some characteristics in common with our own data, it uses a synthetic aperture technique, and there is no significant variation in elevation angle during data collection. However, there are 2 distinguishable strong features, the shape and the reflected power. As these have much in common with our own data, it is possible that the network is able to generalise better.
Fig. 8: (a) t-SNE using raw features. (b) t-SNE without transfer learning. (c) t-SNE with transfer learning.
However, to gain further insight, we also show the confusion matrices from the orientation experiments without and with transfer learning in Tables 10 and 11. The main confusion is between the dog and mannequin, since both have similar clothed material, and between the cone and sign, since they have similar shape. Nevertheless, in these experiments, we can conclude that the neural network approach is robust in maintaining accuracy with respect to sensor hardware, height, range and orientation.

Visualisation of feature clusters
To better understand what is being learned by our network, the t-SNE technique [26] was used to visualise the feature clusters. t-SNE performs nonlinear dimensionality reduction: it builds a probability distribution from the similarity of all pairs of data points, defines a corresponding distribution over points in a lower-dimensional space, and then minimises the KL-divergence between the two with respect to the locations of the points in that space. Figure 8a shows the result from t-SNE clustering of samples using raw image features, in this case for the orientation experiment. Figures 8b and 8c show the t-SNE clusters from the features extracted from the penultimate layer of the trained neural network without and with transfer learning respectively, using a different colour for each object for better visualisation. We can see that the trained neural network was able to cluster similar classes and similar features. Although it is hard to interpret neural networks directly, the t-SNE framework can give some insight into the type of features that have been learned.
The transfer learning cluster shows slight improvement by creating bigger clusters of objects of the same class.

Experiments: Detection and classification within a multiple object scenario
The previous dataset contains one windowed object in each image.
In an automotive or more general radar scenario we must both detect and classify road actors in a scene with many pre-learnt and unknown objects, which is much more challenging. Hence, in the next set of experiments we include multiple objects, which introduces several additional phenomena including occlusion, multi-path and interference between objects, as well as objects which are not included as a learnt object of interest. We use the same object dataset (bike, trolley, cone, mannequin, sign, dog) in different parts of the room with arbitrary rotations and ranges, and the network is trained by viewing the objects in isolation, as before. We also include some within-object variation, using for example different mannequins, trolleys and bikes. The unknown laboratory walls are also very evident in the radar images. This new dataset contains 198 scenes and 648 objects, an average of 3.27 movable objects per scene. Fig. 9 shows examples of 3 scenes in the multiple object dataset. Fig. 11 shows statistical data explaining the number of instances of each learnt object, the number of objects in each scene, and the distribution of ranges of the objects. Fig. 10 illustrates possible problems that can occur in the multiple object dataset.

Methodology
In classical radar terminology, detection is described as "determining whether the receiver output at a given time represents the echo from a reflecting object or only noise" [27]. Conversely, in computer vision, using visible camera imagery to which the vast majority of CNN methods have been applied, detection is the precise location of an object in an image (assuming it is present) containing many other objects, as for example in the pedestrian detection survey of Dollar et al. [28]. Although the image may be noisy, this is generally not the major cause of false alarms.
The extensive literature on object detection and classification using cameras, e.g. [29][30][31][32], can be grouped into one-stage and two-stage approaches. In the one-stage approach, localisation and classification are done within a single step, as with the YOLO [32], RetinaNet [31] and SSD [30] methods. In the two-stage approach, candidate bounding boxes are first proposed to localise the objects, and then a classification is performed within each proposed box. R-CNN [33], Fast R-CNN [34] and Faster R-CNN [29] are examples of the two-stage approach.
For this work we developed a two-stage technique. We first generate bounding boxes based on the physical properties of the radar signal, then the image within each bounding box is classified, similar to R-CNN [33]. Fig. 12 shows the pipeline of the detection methodology developed. For radar echo detection, we simply use Constant False Alarm Rate (CFAR) [27] detection. There are many variations, including Cell Averaging Constant False Alarm Rate (CA-CFAR) and Order Statistics Constant False Alarm Rate (OS-CFAR).
In this work we used the CA-CFAR algorithm to detect potential radar targets. In order to compute the false alarm rate, we measured the background noise level and the power level from the objects, setting a CFAR level of 0.22. After detecting potential cells, we form clusters using the common Density-based spatial clustering of applications with noise (DBSCAN) algorithm [35], which forms clusters from proximal points and removes outliers. For each cluster created we use the maximum and minimum points to create a bounding box of the detected area. The parameters used for DBSCAN were selected empirically: ε = 0.3 m, which is the maximum distance of separation between 2 detected points, and S = 40, where S is the minimum number of points to form a cluster.
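The cell-averaging idea behind CA-CFAR can be illustrated on a single range profile: each cell under test is compared against a threshold proportional to the mean power in surrounding training cells, with guard cells excluded around the cell itself. The sketch below is one-dimensional and its parameters (number of training and guard cells, scale factor) are illustrative assumptions; they are not the settings used in this work, and the scale factor here is not the CFAR level of 0.22 quoted above.

```python
import numpy as np


def ca_cfar_1d(power, n_train=8, n_guard=2, scale=5.0):
    """Cell-averaging CFAR along one range profile.
    Returns a boolean array marking cells declared as detections."""
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = n_train + n_guard
    for i in range(half, n - half):
        leading = power[i - half:i - n_guard]              # n_train cells before
        lagging = power[i + n_guard + 1:i + half + 1]      # n_train cells after
        noise = (leading.sum() + lagging.sum()) / (2 * n_train)
        detections[i] = power[i] > scale * noise           # adaptive threshold
    return detections


profile = np.ones(100)       # flat noise floor
profile[50] = 50.0           # one strong return
hits = ca_cfar_1d(profile)
print(np.flatnonzero(hits))
```

Cells closer to either end of the profile than the window half-width are left undetected in this simple version.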
To compute the proposed bounding boxes with DBSCAN, we use the center of the clusters to generate fixed size bounding boxes of known dimensions, since, in contrast to the application of CNNs to camera data, the radar images are metric and of known size. Hence, the boxes are of size 275 × 275, the same size as the data used to train the neural network for the classification task. The image is resized to 88 × 88 and each box is classified.
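The clustering and box-proposal steps described above can be sketched as follows. To keep the example self-contained it includes a compact, brute-force DBSCAN written for this illustration (quadratic in the number of points); it is not the implementation used in this work, and the toy points, `eps`, `min_pts` and box size are illustrative values only.

```python
import numpy as np


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks outliers."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue                        # not a core point; may become a border point
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # claim unassigned point for this cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])   # expand through core points only
        cluster += 1
    return labels


def fixed_boxes(points, labels, box_size):
    """One fixed-size box (x_min, y_min, x_max, y_max) centred on each cluster."""
    boxes = []
    if len(labels) == 0:
        return boxes
    for c in range(labels.max() + 1):
        cx, cy = points[labels == c].mean(axis=0)
        half = box_size / 2.0
        boxes.append((cx - half, cy - half, cx + half, cy + half))
    return boxes


offsets = np.array([[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1]], dtype=float)
points = np.vstack([offsets, offsets + 5.0, [[10.0, 0.0]]])   # two blobs and one outlier
labels = dbscan(points, eps=0.3, min_pts=3)
boxes = fixed_boxes(points, labels, box_size=2.0)
print(labels, boxes)
```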
To consider the background we randomly cropped 4 boxes which do not intersect with the ground truth bounding boxes containing objects in each scene image from the multiple object dataset and incorporated these in our training set. However, as there are effectively two types of background, that which contains other unknown objects such as the wall, and the floor areas which have low reflected power, we ensured that the random cropping contained a significant number of unknown object boxes. This is not ideal, but we are limited to collect data in a relatively small laboratory area due to the restricted range of the radar sensor and cannot fully model all possible cluttering scenarios.

Results for Multiple Objects
In order to evaluate performance, we have considered 3 different scenarios. In particular, we wish to ascertain how performance is affected by failures in classification assuming a perfect CFAR + DBSCAN pipeline, and to what extent failures in the box detection process lead to mis-classification. Further, we make a distinction between failures due to confusing objects (mainly the lab wall) and those due to system noise from the floor area.
• Perfect Detector : In this scenario we do not use the CFAR + DBSCAN pipeline, we use the ground truth as the detected bounding boxes. Each bounding box is fed to the trained neural network.
• Easy : In this scenario we manually crop the walls and focus on the potential area containing objects of interest. This includes the CFAR + DBSCAN pipeline in an easy scenario, in which removal of static objects is analogous to background subtraction.
• Hard : In this scenario we assume the whole scene has potential targets. Hence, the wall should result in positive detections and is a challenge to the CNN classification.
We also decided to label our scene data depending on the density of objects, since a highly cluttered scene should increase the likelihood of unwanted radar sensing effects, such as multi-path, occlusion, and multiple objects in the same bounding box.
• #Objects < 4 : At low density of objects, it is likely that the scene will suffer less from these effects.
• 4 ≤ #Objects < 7 : At mid density, we will encounter some of the unwanted effects.
• #Objects ≥ 7 : At high density, many of these effects occur.
We also have decided to evaluate performance at different ranges.
• Short Range (Objects < 3.5 m): This scenario is not necessarily the easiest, since coupling between the transmitter and receiver happens at this range [36].
• Mid Range (3.5 m < Objects < 7 m): This is the ideal scenario, as the objects were learnt within these ranges, and the antenna coupling interference is reduced.
• Long Range (Objects > 7 m): This is the most challenging scenario. At more than 7 meters, most of the objects have low power of return, close to the systemic background.
The metric we use for evaluation is average precision (AP), a commonly used standard in the computer vision literature for object detection, classification and localisation [37]. The Intersection over Union (IoU) measures the overlap between two bounding boxes. If the IoU between a predicted and a ground-truth box is greater than 0.5 and the classification is correct, the prediction is a true positive. To compute AP we first compute precision (Eq. 10) and recall (Eq. 11), where TP is the number of true positives, FP the number of false positives and FN the number of false negatives. AP is then the area under the precision-recall curve obtained by varying the confidence level of the prediction of each bounding box, as shown in Eq. 12, where p is precision and r is recall.
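The evaluation described above can be sketched in a few lines of Python. This is an illustrative sketch and not the evaluation code used for the reported results: the box layout and function names are assumptions, and the area under the precision-recall curve is approximated by a simple rectangle rule without interpolation, whereas the benchmark in [37] may use a different interpolation scheme.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_true_positive(pred_box: Box, pred_label: str,
                     gt_box: Box, gt_label: str,
                     threshold: float = 0.5) -> bool:
    """A prediction counts as TP if IoU > threshold and the class is correct."""
    return pred_label == gt_label and iou(pred_box, gt_box) > threshold


def average_precision(detections: List[Tuple[float, bool]],
                      num_ground_truth: int) -> float:
    """Area under the precision-recall curve for one class.

    `detections` holds (confidence, is_true_positive) pairs.
    `num_ground_truth` equals TP + FN, the number of annotated objects.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, hit in ranked:  # lowering the confidence threshold step by step
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap
```

The mean average precision (mAP) is then the mean of `average_precision` over the object classes.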

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$

$$AP = \int_{0}^{1} p(r)\,dr \tag{12}$$

For these experiments we retrained the neural network from the single object dataset using the orientation experiments. For the Easy and Perfect Detector cases, we do not include the background data. For the Hard case we also added 4 background images per scene to our training set. Extensive results for all these scenarios are shown in Tables 12, 13 and 14. As expected, the results from a scene containing many known objects and confusing artefacts are much poorer than when the objects are classified from images of isolated objects. Nevertheless, the results show promise. For example, considering the mid range, Perfect Detector case, there is an overall mean average precision of 61.36%, and for specific easily distinguishable objects such as the trolley it is as high as 97.06% in one instance. Other objects are more confusing; for example, cones usually have low return power and can easily be confused with other small objects. As also expected, the results degrade at long range and in scenes with a higher density of objects.
The Easy case shows performance comparable to, but not as good as, the Perfect Detector; for example, the mean average precision drops to 50.35%. The CFAR + DBSCAN method is a standard option for detecting objects in radar images, but it does introduce some mistakes where, for example, the bounding box is misplaced with respect to the learnt radar patterns.
In the Hard case, the mAP drops significantly to 35.18%. This shows how hard it is to recognise objects in radar images when the scene contains other objects that were neither seen nor learnt during training. Indeed, in the high-density scenes (#Objects ≥ 7), some mAP values for bike, cone and mannequin are actually 0.00, which means that those objects were not recognised under those specific conditions.
Finally, we observe that the trolley is the easiest object to recognise in all cases. The trolley has a very characteristic shape, and strongly reflecting metal corner sections that create a signature distinguishable from all other objects. Because radar datasets, unlike visible camera imagery, are not standardised, one should be careful when interpreting true and false results and when comparing them with other published material.

Conclusions
In this work we evaluated the use of DCNNs applied to images from a 300 GHz radar system to recognise objects in a laboratory setting. Four types of experiments were performed to assess the robustness of the network. These included the optimal scenario, in which all data is available for training and testing, and experiments at different ranges, at different viewing angles, and using different receivers. As expected, the network performs best when all the training and test data are drawn from the same set. This is a valuable experiment as it sets an optimal benchmark, but it is not a likely scenario for any radar system applied in the wild, first because radar data is far less ubiquitous or consistent than camera data, and second because the influence of clutter (really semantic background) and multipath effects is potentially more serious than for optical technology.
Regarding the single object scene data, we should be encouraged by two principal results: first, that the performance was so high for the optimal case, and second, that transfer learning may lead to improvements in other cases. Transfer learning can prevent overfitting to the 300 GHz source data by generalising using more samples from a different radar data set, e.g. increasing from 92.5% to 98.5% in the experiment using Q1 and Q3 to train and Q2 and Q4 to test. This leads to more robust classification.
The multiple object dataset is a very challenging scenario: we achieved mean average precision rates above 60% in the easy case (fewer than 4 objects), but much less, 35.18%, in a highly cluttered scenario. However, the pipeline we have adopted can probably be improved, in particular by using the classification results to feed back into the detection and clustering. To avoid problems with occlusion, object adjacency, and multi-path, further research on high resolution radar images is necessary. We also note that we have not made use of Doppler processing, as this implies motion of the scene, the sensor or both. For automotive radar, there are many stationary objects (e.g. a car at a traffic light), and many different motion trajectories in the same scene, so this too requires further research.
In conclusion, it is very challenging for a deep learning approach to learn features from radar imagery that are robust to sensor height, type, range and orientation. In the wild, by which we mean outside the laboratory and as a vehicle-mounted sensor navigating the road network, we anticipate even more problems due to overall object density and the proximity of targets to other scene objects.