Cargo pallets real-time 3D positioning method based on computer vision

Abstract: In a storage environment, aiming at the problem of locating goods during picking, the pallet is first recognised based on deep learning. Then an algorithm that obtains the pose of the pallet through image processing and a Kinect sensor is proposed in this study. The pallet is recognised and its bounding box is obtained by deep learning. On this basis, the position and angle of the pallet are obtained by image processing, and the RGB-D data then transform the position and posture of the pallet into three-dimensional (3D) coordinates for 3D positioning. The experimental results show that the algorithm obtains the pallet position in real time with a success rate of 81.02%. Thus, the algorithm meets the efficiency and accuracy requirements for locating stored goods during picking.


Introduction
With the development of the economy and of science and technology, logistics has entered a phase of very rapid development. Storage is a key part of logistics and is considered a third source of profit for the corporation, since it reduces overall supply-chain costs and improves the efficiency of goods handling and the integrated service level [1]. At the present stage, one of the main problems in logistics is the identification and sorting of goods, such as warehousing and order picking, which still rely on traditional manual work. In view of the above problems, we identify the location of the goods by visual positioning, and the warehouse robot then carries out the relevant sorting operation.
In recent years, RGB-D sensors have been widely used in real-time positioning and three-dimensional (3D) reconstruction. An RGB-D sensor obtains the depth value of the scene using depth-ranging technology, and thus the 3D information of the scene. The advantages of such sensors are that they obtain 3D information in real time, are easy to carry, and are low cost. Typical RGB-D sensors include the Kinect and the Structure Sensor; the sensor used in this paper is a Microsoft Kinect, with an average accuracy of one millimetre.
In this paper, we propose to detect and identify pallet boxes by deep learning [2], refine the positioning by image processing, and obtain the spatial position of the pallet cargo through 3D reconstruction.

Identification network of pallet cargo based on deep learning
A convolutional neural network is a basic structure that applies convolution operations within a neural network, and it forms the backbone of the detection network used here. The basic network structure is adapted from the SSD [3] detection algorithm. The first part of the network model is the base network VGG16 [4], but the fully connected layers used for classification after VGG16 (the FC6 and FC7 layers) are removed. After VGG16, the following convolution layers are appended: Conv6, Conv7, Conv8_2 (comprising 1×1×256 and 3×3×512-s2) and Conv9_2 (comprising 1×1×128 and 3×3×256-s2). The basic structure is shown in Fig. 1.

Pallet identification
An image of any size is input to the trained neural network. After feature extraction by the convolutional layers, feature maps at four different scales, taken from the VGG16, Conv7, Conv8_2 and Conv9_2 layers, are used to predict the bounding box of the object and the corresponding confidence; finally, the pallet's bounding box is displayed. The detection result is shown in Fig. 2.

Basic process of pallet positioning
The basic process of pallet positioning is shown in Fig. 3.

Edge detection
Edge detection is a common segmentation method based on greyscale discontinuity. The pallet has obvious edge features in the transverse and vertical directions, which is an important characteristic of pallets. In this paper, Canny edge detection is used to extract the edge features of the pallet. Canny edge detection has the characteristics of a low error rate, accurate localisation and a minimal (single) response per edge, and is considered to be the best edge detector. The calculation formula of the Canny operator [5] is

n = ∇(G ∗ I)/|∇(G ∗ I)|

where n is the edge normal direction, ∇G is the gradient vector, and the strength of the edge is determined by the term |∇G ∗ I(x, y)|. In this paper, the image is processed with the OpenCV library; the image after Canny edge detection is shown in Fig. 4.
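The edge-strength computation described above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation (which uses OpenCV's Canny): the Gaussian G is approximated by a 3×3 box blur, and the gradient magnitude |∇(G ∗ I)| is taken with central differences.

```python
# Illustrative sketch of Canny's edge-strength term |grad(G * I)|.
# Assumptions: 3x3 box blur stands in for the Gaussian G; central
# differences approximate the gradient. Not the full Canny pipeline
# (no non-maximum suppression or hysteresis thresholding).

def smooth(img):
    """3x3 box blur as a stand-in for the Gaussian smoothing G."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx]
                        n += 1
            out[y][x] = acc / n
    return out

def edge_strength(img):
    """Gradient magnitude of the smoothed image via central differences."""
    g = smooth(img)
    h, w = len(g), len(g[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (g[y][x + 1] - g[y][x - 1]) / 2.0
            gy = (g[y + 1][x] - g[y - 1][x]) / 2.0
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

# A dark/bright vertical step: the response peaks near the boundary columns.
step = [[0] * 4 + [255] * 4 for _ in range(8)]
mag = edge_strength(step)
```

On this toy step image, the response is large at the columns adjacent to the intensity jump and zero in the flat regions, which is exactly the behaviour the pallet's strong transverse and vertical edges produce in the real images.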

Texture-based image segmentation [6]
To further refine the positioning of the pallet centre, the pallet is analysed: a common pallet can be divided into two parts, where the upper part consists of horizontal straight lines and the lower part is composed of vertical lines, six in total. To obtain the row at which to split the pallet image, the image can be segmented according to the shape and local texture characteristics of the pallet. Since the texture of the lower part of the pallet is regular, a statistical method can count, for each row, the number of jumps from 'black' to 'white' or from 'white' to 'black' [7], and a segmentation threshold can be set; by analysing the pallet, the threshold is set to 12. The specific algorithm is as follows: i. Input the binary image obtained after edge detection; ii. Scan the binary image in sequence from top to bottom and left to right, counting the number of jumps M(i) for each row i. The number of jumps of row i is calculated as

M(i) = Σ_{j=2}^{N} |b(i, j) − b(i, j − 1)|

where b(i, j) ∈ {0, 1} is the binary value of the jth pixel in row i and N is the image width.
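The jump-counting rule above can be sketched in a few lines. This is a hedged illustration: the function names and the toy image are ours, and the rule that the first row exceeding the threshold marks the upper/lower boundary is an assumption consistent with the paper's top-to-bottom scan.

```python
# Sketch of the row-transition count M(i) for texture-based segmentation.
# Rows whose jump count exceeds the threshold (12 in the paper) belong to
# the lower, vertical-slat part of the pallet. Names are illustrative.

def row_transitions(row):
    """Count 0->1 and 1->0 jumps in one binary row: M(i)."""
    return sum(abs(row[j] - row[j - 1]) for j in range(1, len(row)))

def split_row(binary, threshold=12):
    """Scan rows top to bottom; return the first row index whose jump
    count exceeds the threshold (candidate upper/lower boundary)."""
    for i, row in enumerate(binary):
        if row_transitions(row) > threshold:
            return i
    return None

# Toy binary image: three nearly uniform rows, then three slat-like rows.
upper = [0] * 16
lower = [0, 1] * 8          # 15 black/white jumps
img = [upper] * 3 + [lower] * 3
```

Applied to the toy image, the uniform rows score 0 and the slat-like rows score 15, so the split lands at the first slat row.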

Inclination of pallet
The upper part of the pallet image can be used to detect the inclination θ of the pallet. In image processing, the Hough transform is used to detect straight lines, from which the inclination of a line is determined. In this paper, the probabilistic Hough transform [8, 9] is used to detect the inclination θ of the pallet. The Hough transform is a method for finding straight lines: it uses a transformation between two coordinate spaces to map a curve or line of a given shape in one space to a point in the other space, where votes accumulate into a peak, thus turning the problem of detecting an arbitrary shape into the problem of finding a statistical peak.
Through the accumulative probabilistic Hough transform, a straight line on the upper edge of the pallet can be detected. Then, using OpenCV's HoughLinesP function, two points (x1, y1) and (x2, y2) on the line are obtained to calculate the inclination θ:

θ = arctan((y2 − y1)/(x2 − x1))

The results of the cumulative probabilistic Hough transform detection are shown in Fig. 7.
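The inclination formula can be computed directly from the two endpoints HoughLinesP returns. A small sketch (the function name is ours; using atan2 instead of a bare arctan is our choice, to avoid division by zero for vertical lines):

```python
import math

# Sketch: pallet inclination from the two endpoints of a detected line,
# theta = arctan((y2 - y1) / (x2 - x1)). atan2 handles x2 == x1 safely.

def inclination_deg(x1, y1, x2, y2):
    """Return the line's inclination in degrees."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```

For example, endpoints (0, 0) and (10, 10) give 45°, and a horizontal line gives 0°.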

Erosion and dilation:
The lower part of the segmented pallet image can be used to detect the centre point of the pallet; the accumulative probabilistic Hough transform is again used to detect lines there. Analysis shows that the edge lines after Canny processing are not completely connected: there are breaks, which make detection by the accumulative probabilistic Hough transform difficult and affect the accuracy of the results. In this paper, morphological operations are used to connect the broken edges, which reduces interference during line detection and improves the accuracy of the results. A combination of dilation and erosion [10] is adopted to connect the broken lines: the structuring element for dilation is 1 × 3 and that for erosion is 5 × 1. Dilation and erosion are implemented with the OpenCV functions dilate and erode. The results are shown in Fig. 8.
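The gap-bridging effect of dilation followed by erosion (a morphological closing) can be illustrated in one dimension along an edge line. This is a simplified sketch, not the paper's 1 × 3 / 5 × 1 OpenCV pipeline: a symmetric length-3 element is used for both steps, purely to show how a one-pixel break is filled.

```python
# Sketch: binary morphological closing along one image column.
# Dilation fills small gaps; the following erosion restores the original
# thickness. Element sizes/orientations are illustrative, not the paper's.

def dilate1d(col, k):
    """Binary dilation of a 1-D signal with a length-k element."""
    r = k // 2
    n = len(col)
    return [1 if any(col[max(0, i - r):i + r + 1]) else 0 for i in range(n)]

def erode1d(col, k):
    """Binary erosion of a 1-D signal with a length-k element."""
    r = k // 2
    n = len(col)
    return [1 if (i - r >= 0 and i + r < n and all(col[i - r:i + r + 1]))
            else 0 for i in range(n)]

# A vertical edge line with a one-pixel break at index 3:
col = [1, 1, 1, 0, 1, 1, 1]
closed = erode1d(dilate1d(col, 3), 3)
```

After closing, the break at index 3 is bridged (the border pixels are eroded because the erosion here does not pad the signal), so the Hough transform would see one continuous segment instead of two.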

Find the midpoint of the pallet:
After that, the vertical straight lines of the pallet must be extracted and the two middle vertical lines found; the centre position of the pallet is then calculated from these two lines. Since each physical edge may yield more than one detected line, the true lines representing the pallet edges must be identified when looking for the centre, which requires sorting and merging the candidate lines. Based on this idea, this paper proposes the following algorithm: i. Use the HoughLinesP function to detect the edges of the lower part of the pallet; the endpoints (x1, y1) and (x2, y2) of each line are stored in a cv::Vec4i container; ii. Sort the detected lines by insertion sort and store the sorted lines in L; iii. Compare the distances between the lines in L; if a distance is <10 pixels, the two lines i and j are taken to be the same line and are merged; iv. Determine the number of lines in L: if the number is odd, the middle line is the centre line of the pallet, and substituting Y into its equation gives X; if the number is even, the line midway between the two middle lines is the centre line of the pallet, from which the value of X is obtained.
According to the above algorithm, the results are as shown in Fig. 9.
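Steps ii–iv above can be sketched as follows. This is a hedged simplification: each detected vertical line is reduced to a single x coordinate, duplicates within the 10-pixel threshold are averaged, and the centre is taken from the middle line (odd count) or the midpoint of the two middle lines (even count). Function names and the sample data are ours.

```python
# Sketch of the line sorting/merging and centre-line selection.
# xs: x positions of candidate vertical lines from HoughLinesP.

def merge_lines(xs, tol=10):
    """Sort candidate x positions and merge those closer than tol pixels,
    averaging each merged group (assumed merge rule)."""
    merged = []
    for x in sorted(xs):
        if merged and x - merged[-1][-1] < tol:
            merged[-1].append(x)
        else:
            merged.append([x])
    return [sum(g) / len(g) for g in merged]

def centre_x(xs, tol=10):
    """Centre of the pallet from the merged vertical-line positions."""
    m = merge_lines(xs, tol)
    n = len(m)
    if n % 2 == 1:
        return m[n // 2]          # odd: the middle line itself
    return (m[n // 2 - 1] + m[n // 2]) / 2  # even: midway between the two

# Six slats, two of them detected twice (near-duplicate lines):
xs = [10, 12, 60, 110, 160, 210, 212, 260]
```

On this sample, the eight raw detections collapse to six lines, and with an even count the centre falls midway between the third and fourth slats.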

Pallet 3D positioning
The Kinect [11, 12] hardware is mainly divided into three parts: a system-level chip, three optical components, and a four-element microphone array. The three optical components are a colour camera, an infrared emitter, and an infrared CMOS camera. Using the signal projected by the infrared emitter, the infrared CMOS camera captures depth data, and the depth of each pixel is calculated from the infrared image.
With the open-source library OpenNI [13], the x- and y-coordinates of the depth-image coordinate system can be converted to the X- and Y-axis coordinates of the world coordinates in the reference frame of the Kinect camera. The correspondence between the camera coordinate system and the depth-image coordinate system is shown in Fig. 10.
Let a point have world coordinates (x_w, y_w, z_w), camera coordinates (x, y, z), and image coordinates (x_c, y_c). According to the traditional calibration method [14], the conversion from a point in world coordinates to image coordinates is

z [x_c, y_c, 1]^T = K [R | t] [x_w, y_w, z_w, 1]^T, with K = [[f_x, 0, u_0], [0, f_y, v_0], [0, 0, 1]]

where f_x = f/p_x, f_y = f/p_y, f is the focal length of the camera, p_x and p_y are the width and height of a unit pixel, respectively, u_0 and v_0 are the offsets of the imaging centre of the camera along the X- and Y-axes, and R and t are the rotation matrix and translation vector of the camera relative to the world coordinate system, respectively.
It can be seen from Figs. 11 and 12 that the world coordinate system coincides with the camera coordinate system, so the rotation matrix R is the identity matrix and the translation vector t is zero, and the conversion reduces to

x = (x_c − u_0) z / f_x,  y = (y_c − v_0) z / f_y

Since the depth camera and the colour RGB camera shoot the object from different angles, the raw coordinates obtained are not yet calibrated. OpenNI's correction function fixes the parallax caused by the two cameras, so that the target object is aligned to the same position and the subsequent mapping operation is convenient. The coordinates of the pallet centre obtained in Section 3.5 are passed to the mapping function of the Kinect SDK [15]; a code excerpt follows:

hResult = pCoordinateMapper->MapColorFrameToCameraSpace(
    depthWidth * depthHeight, &depthBuffer[0],
    colorWidth * colorHeight, cameraSpacePoints);
if (SUCCEEDED(hResult)) {
    long colorIndex = (long)(r.y * colorWidth + r.x);
    CameraSpacePoint csp = cameraSpacePoints[colorIndex];
}
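The reduced relationship with R = I and t = 0 amounts to a pinhole back-projection: a pixel with known depth is lifted into the camera frame. A minimal sketch (the intrinsic values below are illustrative placeholders, not the Kinect's actual calibration):

```python
# Sketch of the simplified back-projection used when the world frame
# coincides with the camera frame (R = I, t = 0):
#   x = (xc - u0) * z / fx,   y = (yc - v0) * z / fy
# fx, fy, u0, v0 below are assumed example intrinsics.

def pixel_to_camera(xc, yc, z, fx, fy, u0, v0):
    """Back-project an image point with depth z into the camera frame."""
    x = (xc - u0) * z / fx
    y = (yc - v0) * z / fy
    return (x, y, z)

# A pixel at the principal point maps onto the optical axis:
p = pixel_to_camera(320, 240, 1500.0, fx=525.0, fy=525.0, u0=320, v0=240)
```

A pixel 105 focal-length-normalised units off-centre at depth equal to the focal length maps to an equal lateral offset, which is a quick sanity check on the formula.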

Experimental environment
In this experiment, Visual Studio 2013 was used as the development platform, and OpenCV 2.4.9 was used as the image-processing library. The experimental equipment mainly consisted of the Kinect and the experimental pallet (Fig. 13).

Pallet positioning
With the camera and the pallet placed in fixed positions so that the pallet was fully observable, images were gathered frame by frame. Some experimental results are shown in Fig. 14, and the recognition success rate is given in Table 1.
According to prior experience in actual projects, the recognised middle point should lie at the middle of the pallet. In this experiment, the thickness of the pallet was 20 mm, and the allowable error Δ = ±10 mm was obtained by experiment. The ideal location and the experimental error range are shown in Fig. 15.
To verify the accuracy of the method, the experimental data were divided into two groups. In the first group, the accuracy of pallet identification was verified by changing the relative distance between the Kinect camera and the pallet, with the pallet inclination fixed at θ = 0. In the second group, the relative distance between the Kinect camera and the pallet was fixed, and the accuracy was verified by changing the pallet inclination θ. Considering the actual working process of the warehouse robot and the constraints of the storage environment, the first experiment was conducted at distances of 1, 1.5, 2 and 2.5 m. For the second experiment, when the pallet inclination exceeded 45° the storage robot could hardly operate, which does not correspond to actual working conditions; the relative distance was therefore set to 1.5 m, while the inclination of the pallet relative to the camera was set to 0°, ±30° and ±45°, in both downward- and upward-looking views.
As shown in Fig. 16, experimental results were obtained at different inclinations. The experiments show that, within the allowed error range, the presented method adapts well to the storage environment, succeeding 81.02% of the time. Good recognition is also obtained at different distances and pallet inclinations.

Conclusion
In this paper, the positioning problem of the storage robot is studied: an image-processing method is proposed, and a 3D positioning algorithm is built on the Kinect sensor. The image is divided into two parts by edge detection and the texture-jump method; the inclination θ of the upper part is detected directly by the Hough transform, and, after morphological treatment of the lower part, the centre point of the pallet is found by the proposed algorithm. The Kinect sensor then converts the 2D coordinates into 3D coordinates. Analysis of the experimental data shows that the recognition rate of the algorithm is 81.02%, which basically satisfies the requirements of pallet positioning. Analysis of Tables 2 and 3 shows that the algorithm can accurately locate pallets within the error range allowed by the experiments. In the fourth step of finding the middle point of the pallet, described in Section 3.5, if one or both of the middle lines in L is not a true edge line at the middle of the pallet, taking those lines as the basis of the middle point introduces an error into the estimated centre. In future research, we will focus on accurately extracting the middle line of the pallet to improve the accuracy of the detection.