Traffic light and moving object detection for a guide-dog robot

Guide dogs help visually impaired people navigate through the streets. However, training a guide dog is expensive and time consuming. In addition, a guide dog cannot decide when and where to cross a street safely; that decision is left to the human. Here, the authors propose a framework for building a guide dog robot using artificial intelligence and other technologies. The proposed framework is based on an Intel UP Squared board, together with an Intel Movidius Neural Compute Stick, to process the images gathered from a GoPro camera. The MobileNet single-shot detector (SSD) is the main framework used to detect the moving objects in the environment. The final decision is made after fusing the information gathered from all the sources. The authors also use an Amazon Alexa device for voice communication between the guide dog robot and the visually impaired person. A prototype of the proposed system is implemented and tested. Experimental results show that the proposed framework can process the information at a traffic intersection scene and guide a blind person to cross the street safely.


Introduction
There are more than 217 million people in the world who have moderate-to-severe vision impairment, and another 36 million people are blind, according to the World Health Organization [1]. Guide dogs help visually impaired people navigate through traffic and avoid obstacles on the road. However, training a guide dog is time consuming. In addition, in most circumstances a real guide dog cannot help the blind person decide where and when to cross the street safely. Thus, a guide dog robot with visual perception can better help visually impaired people cross the street safely.
A guide dog robot is a mechanical dog built with the help of artificial intelligence and other modern technologies. It can recognise the traffic light, detect moving objects, and measure the distance between the impaired person and obstacles. Chuang et al. [2] used three cameras mounted on three different Raspberry Pi 2B boards, together with an Nvidia Jetson TX1 board for computing the deep convolutional neural network inference. They used two deep learning network models to complete the computing task; the two models are based on CaffeNet [3] and the Train-Net. Three different types of images are used as the input for training the turning decisions, and a series of images is used to train the guide dog's lateral distance and heading angle. Experiments show that their algorithm works properly in an indoor environment.
However, there are two main limitations of that approach. First, CaffeNet is a heavy network whose computation cost is higher than that of a lightweight network, which is why the Jetson TX1 was needed for the task. MobileNet [4], on the other hand, is a lightweight network that can be used in an embedded system. With the help of the Intel Movidius Neural Compute Stick [5], we can achieve the same detection efficiency on cheaper devices, such as the Intel UP Squared board [6]. The second limitation is that their system uses images as end-to-end inputs to decide when to turn right or left, a procedure that depends heavily on the training datasets. Furthermore, their system does not detect the traffic light at the intersection or the moving objects near the visually impaired person, and thus the person cannot perceive the environment.
To solve these problems, we propose a novel guide dog robot framework. The architecture of the system is based on service-oriented computing [7], robot as a service, and the Internet of Things, where both software and hardware components are implemented as services [8, 9]. They communicate through standard interfaces and protocols. The hardware is based on the low-cost Intel UP Squared board platform and the Intel Movidius Neural Compute Stick. The mainboard is mounted on a wheelchair-like four-wheel robot car that accompanies the guide dog robot. A GoPro camera is mounted on the guide dog robot, and images are sent to the UP Squared board through a Wi-Fi connection. After the images are processed by the deep neural network, the system recommends whether the environment is safe for the guide dog and its client to cross. The recommendation is sent to the client through the Amazon Alexa voice system, and the guide dog client can also communicate back to the guide dog robot through Alexa.
In this paper, we focus on how the camera mounted on the head of the dog robot is used to provide visual perception for the guide dog robot. The rest of the paper is organised as follows. Section 2 presents the details of the proposed framework and its components. Section 3 describes how the information gathered by the GoPro camera is processed and how it is used to generate recommendations that guide the impaired person. Section 4 presents the experimental results, and Section 5 concludes the paper.

Guide dog robot framework
The proposed guide dog robot framework has two main parts. The first part is a mechanical dog with a GoPro camera mounted on the dog's head. The second part is based on an Intel UP Squared board [6] with an Intel Movidius Neural Compute Stick plugged into it. The software is developed based on a service-oriented architecture and a workflow-based visual programming environment, which orchestrates all the services and components [10, 11].
The hardware architecture is illustrated in Fig. 1. As this framework is a complex system built by a team, this paper focuses on the visual perception part, marked by the dashed box in Fig. 1; thus, we discuss how we obtain images from the camera, how we process these images, and how we then give advice to the visually impaired client. The GoPro camera is mounted on the head of the mechanical dog, as shown in Fig. 2, and the Intel UP Squared board and an Amazon Echo Dot are mounted on a four-wheel mini car, which serves as a wheelchair.
There is an ultrasonic sensor mounted in front of the mini car, which is used for detecting the distance between the wheelchair and the mechanical dog.
The GoPro camera connects to the mainboard by Wi-Fi through the camera's built-in hotspot. The mechanical dog is connected to the mainboard through Bluetooth, and the Movidius Neural Compute Stick is plugged directly into the mainboard.

Visual perception and processing
We installed a GoPro camera on the head of the mechanical dog. The camera takes several images per second. These images are sent to the mainboard through Wi-Fi. There are traffic lights and a few other moving objects, such as moving toy cars, in our experimental environment.
We have performed a number of experiments. First, the mechanical dog walks forward and the wheelchair, which is a four-wheel mini car, follows. Second, when they are close to the intersection, a decision is made based on the environment: if the traffic light is green and there is no moving car around the visually impaired client, the guide dog robot will cross the street; otherwise, the dog will stop and wait. Third, we change the traffic light from green to red to see whether the guide dog robot reacts correctly.
We implement the MobileNet-SSD V1 deep neural network to complete the object and traffic light recognition task. MobileNet is a lightweight network that runs much faster than heavier detectors such as Fast R-CNN and Faster R-CNN [12]. With the help of the Movidius Neural Compute Stick, we can process one image in less than 90 ms. Thus, when the GoPro sends an image to the mainboard, we detect whether there are traffic lights and objects in the image, and we further detect whether each object is moving. If an object is moving near our client, the system sends a command telling the client to stop. Faster R-CNN can achieve higher accuracy but consumes more time; thus, we use MobileNet-SSD to complete our task.
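To make the detection step concrete, the following is a minimal sketch of running a Caffe MobileNet-SSD model on a single frame using OpenCV's DNN module. The model file names and the 0.5 confidence threshold are illustrative assumptions; in our actual system the model runs on the Movidius Neural Compute Stick rather than on the CPU through OpenCV.

```python
import cv2

# Hypothetical file names for the deployed MobileNet-SSD Caffe model.
PROTOTXT = "MobileNetSSD_deploy.prototxt"
WEIGHTS = "MobileNetSSD_deploy.caffemodel"

net = cv2.dnn.readNetFromCaffe(PROTOTXT, WEIGHTS)

def detect_objects(frame, conf_threshold=0.5):
    """Run MobileNet-SSD on one frame; return (class_id, confidence, box) tuples."""
    h, w = frame.shape[:2]
    # MobileNet-SSD expects 300x300 inputs, scaled and mean-subtracted.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843,
                                 size=(300, 300),
                                 mean=127.5)
    net.setInput(blob)
    detections = net.forward()                     # shape: (1, 1, N, 7)

    results = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence < conf_threshold:
            continue
        class_id = int(detections[0, 0, i, 1])
        # Box coordinates are normalised to [0, 1]; scale back to pixel units.
        box = detections[0, 0, i, 3:7] * [w, h, w, h]
        results.append((class_id, confidence, tuple(box.astype(int))))
    return results
```

Each frame received from the GoPro would be passed through detect_objects, and the resulting boxes feed the traffic light and moving object logic described in the following subsections.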

Traffic light detection and recognition
We build a manually controlled traffic light. It has three lights of different colours, and the lights are switched by a button. Before performing the traffic light detection task, we first train a deep neural network.
We collect 1000 images of the traffic light at different distances and angles. Fig. 3 shows the training images. We manually mark the bounding box of the traffic light in each image and label its colour; the annotation information is stored in separate XML files. To improve the accuracy of our system, we also take some images that do not contain the traffic light. We trained our system on the ASU HPC GPU cluster, and the training process starts from an existing model, so that we can complete the task faster than training from scratch. The existing model comes from [13]. However, a lack of traffic light training images would result in poor detection and recognition quality; thus, we performed extra training to achieve better performance.
In our project, the pre-trained model has 21 classes (20 object classes plus a background class). We have completed all the data collection, training, and recognition steps in implementing a neural network to detect and recognise the traffic light. We present the experimental results in Section 4.
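For reference, a MobileNet-SSD model pre-trained on the PASCAL VOC dataset uses the 20 VOC object classes plus a background class; assuming that standard label set, the mapping from class index to label can be kept in a simple table such as the sketch below.

```python
# Assumed PASCAL VOC label set of the pre-trained MobileNet-SSD model
# (20 object classes plus 'background', i.e. 21 classes in total).
CLASS_NAMES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow", "diningtable", "dog",
    "horse", "motorbike", "person", "pottedplant", "sheep", "sofa",
    "train", "tvmonitor",
]

def label_of(class_id):
    """Map a detected class index to its human-readable label."""
    return CLASS_NAMES[class_id]
```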

Moving object detection and recognition
Moving objects pose the greatest danger to our visually impaired clients. We need to detect objects on the road, especially near the intersection. The key is to recognise whether an object is moving or not.
The MobileNet-SSD model used in this project can detect objects of 20 classes, including 'bicycle', 'dog', 'motorbike', 'bus', 'car', and others; these are the objects most likely to appear on the road. Our moving object detection and recognition procedure is as follows.
Step 1: Use MobileNet-SSD to detect the objects in the image. We denote the detected objects as A_i, i = 1, …, n. As we know the class label of each detected object, we use another sequence C_i, i = 1, …, n, to record the class information.
Step 2: For a given object A_i, we first calculate its area in the image. The image size is fixed, so we can use the detected bounding box to calculate the area of the object, which we define as S_Ai.
Step 3: We calculate the area again after a time interval t; the new area of the object is denoted S′_Ai.
Step 4: We then calculate the difference between the two areas, ΔS_Ai = S′_Ai − S_Ai, over a series of images to track the area changes. If the area increases, we conclude that we are getting closer to the object or that the object is moving towards us. The object area is only a simple cue for the direction of the moving object; we propose another, more precise method in Section 3.3.
Step 5: Apply Steps 1–4 to detect and evaluate all the objects in the images (a code sketch of this procedure is given below).
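The sketch below illustrates Steps 1–5 under two simplifying assumptions of ours: it reuses the detect_objects helper from the earlier sketch, and it matches an object across the two frames by taking the nearest detection of the same class, which is not necessarily the exact matching rule used in the project.

```python
def box_area(box):
    """Area S_Ai of a bounding box (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def box_centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def approaching_objects(frame_t0, frame_t1, conf_threshold=0.5):
    """Return objects whose bounding-box area grows between two frames taken a
    time slot apart (Step 4: area increase => the object is getting closer)."""
    dets_t0 = detect_objects(frame_t0, conf_threshold)   # Step 1
    dets_t1 = detect_objects(frame_t1, conf_threshold)

    approaching = []
    for class_id, _, box0 in dets_t0:
        # Naive matching assumption: same class, closest centre in the later frame.
        candidates = [d for d in dets_t1 if d[0] == class_id]
        if not candidates:
            continue
        cx0, cy0 = box_centre(box0)
        _, _, box1 = min(candidates,
                         key=lambda d: (box_centre(d[2])[0] - cx0) ** 2 +
                                       (box_centre(d[2])[1] - cy0) ** 2)
        delta_s = box_area(box1) - box_area(box0)        # Step 4: ΔS = S' - S
        if delta_s > 0:
            approaching.append((class_id, box0, box1, delta_s))
    return approaching
```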

Commands to guide the visually impaired client
In total, three types of information are sent to the mainboard: the cone information, the traffic light, and the moving objects. We discuss how each type of information affects the command. First, we process the cone information in the image. A cone is considered a barrier only if it appears in the middle of the image; a cone on the left or right side is not considered a barrier. We experimented several times to decide the safe distance for the wheelchair to cross, and then marked this distance in the image. Cones located outside this enclosed area do not affect the wheelchair.
Second, we process the traffic light information. If we are walking and the traffic light is green, the wheelchair continues to move. If the traffic light changes to red, we make the wheelchair stop.
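The cone and traffic-light rules above can be written as a simple guard. The width of the 'middle of the image' band and the label strings below are our own illustrative choices, not the exact thresholds calibrated in our experiments.

```python
def cone_blocks_path(cone_box, image_width, centre_band=0.4):
    """A cone counts as a barrier only when it lies in the middle band of the
    image; cones on the far left or right are ignored."""
    x1, _, x2, _ = cone_box
    cone_centre_x = (x1 + x2) / 2.0
    half_band = centre_band * image_width / 2.0
    return abs(cone_centre_x - image_width / 2.0) <= half_band

def may_cross(light_colour, cone_boxes, image_width):
    """Cross only on a green light with no cone blocking the middle of the image."""
    if light_colour != "green":
        return False
    return not any(cone_blocks_path(b, image_width) for b in cone_boxes)
```

The moving-object condition discussed next would be added as a third guard before the wheelchair is allowed to cross.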
Third, how to react to moving objects is the key issue for our wheelchair. If a moving object is getting close to our visually impaired client, we should make the wheelchair stop. The moving direction of the object is also useful for our system to decide where to go. The calculation procedure is as follows.
Step 1: Assume that the wheelchair stands at point O. We calculate the centre location of the detected object and denote it as C_Ai. After a time step t, we compute the new centre location C′_Ai. The moving direction can be estimated as in Fig. 4.
Step 2: Using the two points C_Ai = (x_1, y_1) and C′_Ai = (x_2, y_2), we calculate the slope k = (y_2 − y_1)/(x_2 − x_1).
Step 3: If k > 0, y_2 > y_1, and x_2 > x_1, then the moving object is moving away from us. If k > 0, y_2 < y_1, and x_2 < x_1, then the moving object is getting closer to us. If k < 0, y_2 > y_1, and x_2 < x_1, then the moving object is moving away from us. If k < 0, y_2 < y_1, and x_2 < x_1, then the moving object is getting closer to us.
Step 4: The size of the time slot affects the result, so we use different time slots to detect the moving direction of the objects (a code sketch of these steps is given below).
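A compact sketch of Steps 1–4 is given below. It classifies an object's motion from the two centre points computed over the chosen time slot, applying the sign rules of Step 3; the handling of vertical and horizontal motion (where the slope is undefined or zero) is our own assumption.

```python
def moving_direction(c_old, c_new):
    """Classify motion from two centre points C_Ai = (x1, y1) and C'_Ai = (x2, y2)
    using the slope k = (y2 - y1) / (x2 - x1) and the sign rules of Step 3."""
    x1, y1 = c_old
    x2, y2 = c_new
    if x2 == x1:
        # Slope undefined (purely vertical motion): handled here as an assumption.
        return "approaching" if y2 < y1 else "moving_away"
    k = (y2 - y1) / (x2 - x1)
    if k > 0:
        # k > 0, y2 > y1, x2 > x1 -> moving away; k > 0, y2 < y1, x2 < x1 -> approaching.
        return "moving_away" if y2 > y1 else "approaching"
    if k < 0:
        # k < 0, y2 > y1, x2 < x1 -> moving away; otherwise -> approaching.
        return "moving_away" if y2 > y1 else "approaching"
    return "unknown"   # k == 0: purely horizontal motion, not covered by Step 3
```

Changing the time slot (Step 4) simply changes how far apart the two frames used for c_old and c_new are sampled.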

Experimental results
In this section, we discuss the parameter settings for the guide dog and analyse the experimental results. The complete experiment is carried out on an Intel UP Squared mainboard, a GoPro camera, a mechanical dog, and a laptop with 8 GB RAM and a Core i3-2310 CPU at 2.8 GHz. All results are sent to a VIPLE program running on the laptop, which in turn orchestrates all the activities.

Traffic light detection result
In the guide dog project, we implement deep learning-based methods to detect the traffic light. We use 1000 images taken with the GoPro at our simulated intersection to fine-tune our model; 600 of these images are selected as the training set. The model is trained in a Caffe environment on one Nvidia GTX 1080 Ti with 11 GB of memory for 20,000 iterations. After around 14 h of training, we obtained the fine-tuned model. Some images showing the detection results are presented in Fig. 6. We use a manually controlled traffic light, i.e., not a real traffic signal; its colour is changed by pressing a button. We train our neural network model using this manually controlled traffic light.
First, we set the traffic light to different colours and collect 340 images of the red light, 340 images of the green light, and 320 images of the yellow light. We then use these 1000 images to fine-tune the existing MobileNet-SSD model.
We draw a bounding box around the traffic light in each image, and the bounding box is recorded in an XML file. The content of such an XML file is shown in Fig. 5.
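Assuming the annotations follow the common Pascal VOC XML layout, with an <object> element containing a <name> colour label and a <bndbox> with xmin/ymin/xmax/ymax coordinates, a small parser such as the sketch below recovers the label and the bounding box from each file; the exact tag layout of our files may differ slightly from this assumption.

```python
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Read a Pascal VOC-style annotation file and return a list of
    (label, (xmin, ymin, xmax, ymax)) pairs, e.g. ('red', (120, 40, 180, 200))."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        label = obj.find("name").text            # traffic light colour label
        box = obj.find("bndbox")
        coords = tuple(int(box.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, coords))
    return objects
```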
From Fig. 6, we can see that the traffic light detection method is robust: the algorithm recognises the correct traffic light even at different angles and distances. The first two images show the green light detection results; the detection confidence in the first image is lower than in the second because the traffic light is farther away. The third and fourth images show the red light results, and all the red lights are recognised. The last four images show the yellow light detection results.

Moving object detection result
Moving direction detection is a key issue in this system. Fig. 7 shows the moving direction detection results. There is a car in the first image of Fig. 7, detected with 80.77% confidence. After a specified time slot, we capture another image and detect the same car, with 42.75% confidence, at a different location. From the locations of the car in the two images, we estimate that the car is getting closer to us; since the traffic light is red, we send a 'stop' command to the client.

Execution time and retrieval accuracy
In this section, we analyse the execution time. We compare three different running environments to measure the time cost of the MobileNet-SSD algorithm: the first is the laptop platform with Ubuntu 16.04, 8 GB RAM, and a Core i3-2310 at 2.8 GHz; the second is the Intel UP Squared board with the Movidius Neural Compute Stick; and the third is the Intel UP Squared board alone. The image size is 300 × 300. The time cost analysis is given in Table 1.
We train the model on 600 images and test the mAP on the remaining 400 images. Table 2 lists the accuracy of different models on the Movidius platform.
Although Faster R-CNN achieves a higher mAP, it consumes more time than MobileNet-SSD, so in this project we select MobileNet-SSD as the backend. In the future, we will try Tiny-YOLO.

Summary
In this paper, we studied traffic light and moving object detection in a guide dog project. A GoPro camera mounted on a mechanical dog was used to acquire images, which were sent to the mainboard. The mainboard ran the object detection algorithm with the help of the Movidius Neural Compute Stick. We implemented MobileNet-SSD to detect the objects appearing in the images, and we employed a novel strategy to detect the direction of the moving objects. All the commands were generated by a VIPLE program on the mainboard and sent to the client through an Amazon Alexa voice system. Experiments demonstrated the efficiency of the proposed framework. The current implementation was conducted in an indoor environment; in the future, we will implement our framework in outdoor environments to obtain more realistic results for practical situations. We will also test other object detection algorithms that are supported by the Movidius platform.