Research on optimisation processing of spatiotemporally correlated temperature and humidity data based on wireless sensor networks in a cigarette factory

Tobacco leaves impose strict requirements on the environment during production and storage, especially on temperature and humidity. To improve the quality of tobacco leaves, the temperature and humidity must be monitored accurately and the parameters involved in their control must be optimised. Based on the temperature and humidity monitoring system of a cigarette factory, the authors optimised the temperature and humidity data obtained by wireless sensors. Data quality evaluation indicators were designed, and the Dixon criterion was used to eliminate gross errors in individual data instances. An abnormal-data detection mechanism was designed to eliminate faulty data across multiple data instances using similarity criteria among neighbouring nodes in an area. To avoid the excessive computation caused by a growing number of nodes, extended rules for judging healthy nodes were designed. Over more than 6 months of actual operation, the algorithm maintained good fault detection capability for different fault models and can support the temperature and humidity data processing of wireless sensor networks (WSNs).


Introduction
Temperature and humidity are very important technical indices in the production process. Their control plays an extremely important role in industries such as food production [1], tobacco processing [2], and biological product manufacturing [3], and directly affects product quality. In the production process, temperature control is usually achieved through a heating, ventilation, air-conditioning and refrigeration (HVAC&R) system [4]. In order to save energy and improve the robustness of the temperature and humidity control system, it is necessary to accurately monitor the temperature and humidity of the production environment and to optimise the temperature and humidity parameters involved in the control.
Nowadays, wireless communication technology has developed rapidly [5]. Short-distance wireless networking technology has been extended to various fields [6], such as smart homes, industrial control, and security, and the wireless sensor network (WSN) has emerged. The monitoring of temperature and humidity is also developing towards wireless networking. By combining short-distance wireless networking technology with the existing backbone network, remote data transmission can easily be realised. WSNs also have the advantages of intelligence, small size, and low energy consumption, so they are widely used in all walks of life. Li and Wang [7] proposed a low-power temperature and humidity monitoring system based on a WSN with a star topology; a communication protocol and frequency hopping (FH) mechanism were used to ensure reliable data transmission and increase the robustness of the system. The results showed that the system had the advantages of portability, flexible arrangement, a large coverage area, low power consumption, and small disturbance. Zhong and Yang [8] presented a WSN-based temperature and humidity monitoring system for a grain depot, whose acquisition and control nodes were composed of a digital temperature and humidity sensor, an RF wireless chip, and a main control chip. Through statistical analysis of the collected real-time data, the monitoring host could raise alarms and apply feedback control to the temperature and humidity of the grain warehouse. The experimental results showed that the system had strong scalability, low power consumption, and high efficiency. Mo and Zhou [9] designed a remote temperature and humidity monitoring system based on radio frequency techniques and a wireless network.
The system consisted of a temperature and humidity WSN and a remote transfer sub-system based on a 3G wireless network. The experimental results showed that the system realised remote transmission of indoor temperature and humidity and achieved high measurement accuracy.
However, in the dynamic environment of temperature and humidity monitoring, it is not sufficient merely to collect, record, and display temperature and humidity. Owing to the complexity of the environment and the openness of the sensing area, the temperature and humidity data obtained by the WSN system may be inconsistent, partially missing, fuzzy, and noisy. In addition, wireless sensors are often deployed in complex, unmanaged areas affected by a variety of factors. Besides common threats such as information forgery, information tampering, and denial of service, the broadcast nature of wireless communications and the openness of the monitoring target area lead to incorrect and invalid temperature and humidity sensing data, which can have fatal consequences for upper-layer applications, such as network failure and information misstatement. Therefore, the temperature and humidity data need to be processed to facilitate better control of the temperature and humidity of the target area [10].
Here, we take the cigarette factory temperature and humidity monitoring system as an example. Based on the improvement of data quality, combined with the data requirements of the temperature and humidity control system in the production process, a data quality evaluation index and a data-centric WSNs anomaly detection algorithm are designed. This method can provide reliable temperature and humidity control parameters for the control system and can provide the basis for enhancing the robustness of the large time-delay temperature and humidity control system.

Experimental building
The experiments were carried out in the roll packing workshop and on the first and second floors of the tobacco alcoholisation warehouse at the cigarette factory. Each floor of the No. 5 tobacco leaf alcoholisation warehouse is a regular rectangular space of 48.3 m × 34.5 m × 5.0 m. A cement wall divides each floor into two areas, area A and area B. Each area has a dehumidifier, which starts at regular times to reduce the indoor humidity. The temperature input of each dehumidifier comes from the thermo-resistance temperature measuring element at its inlet, and there is no humidity input in the control system. The roll packing workshop covers an area of 5,000 square metres and is a regular rectangular space of 98 m × 48 m × 4.8 m. The temperature and humidity wireless sensors are mainly placed in positions conducive to deployment and not easily touched or disturbed, such as walls, stone pillars, and cigarette rolling and packing machines. The sensors in the roll packing workshop and tobacco alcoholisation warehouse are shown in Figs. 1-3.

Analysis of the WSNs temperature and humidity data
Before processing the temperature and humidity data, it is necessary to understand the nature of the data obtained by the sensors. In addition to environmental interference and the WSN itself, the collected temperature and humidity data are also affected by the inherent error of the sensors. In this experiment, the selected temperature and humidity sensor is the SHT15, manufactured by Sensirion, which has strong anti-interference capability and quick response. The accuracy of the temperature and relative humidity measurements is presented in Figs. 4 and 5. The temperature measurement accuracy of the sensor at 25°C is ±0.3°C, and the relative humidity measurement accuracy in the 10-90% RH range is ±2.0% RH. This error is smaller than the temperature and humidity tolerances of the cigarette process, so the sensor meets the demand.
To compensate for the non-linearity of the humidity sensor and obtain accurate humidity data, eq. (1) is used to correct the output value. The humidity conversion coefficients are shown in Table 1.
where SO_RH is the relative humidity output value of the digital sensor; it varies with the humidity resolution of the chip. The humidity sensor has little dependence on the supply voltage. The temperature sensor, based on a PTAT circuit, has excellent linear properties. Eq. (2) is used to convert the digital output to a temperature value, and the temperature conversion coefficients are shown in Table 2.
When the difference between the actual ambient temperature and 25°C is large, the temperature compensation of the humidity sensor needs to be considered. The temperature correction is shown in eq. (3), and the temperature compensation coefficients are displayed in Table 3.
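The conversion chain of eqs. (1)-(3) can be sketched in code. The coefficient values below are taken from the Sensirion SHT1x datasheet for 12-bit humidity and 14-bit temperature readings at roughly 3.3 V supply; they are assumptions standing in for the entries of Tables 1-3 and must be replaced with the values matching the sensor's configured resolution and supply voltage.

```python
# Sketch of the SHT1x digital-to-physical conversion chain (eqs. (1)-(3)).
# Coefficients below follow the Sensirion SHT1x datasheet (12-bit RH,
# 14-bit T, ~3.3 V supply) -- ASSUMED values; substitute Tables 1-3.
C1, C2, C3 = -2.0468, 0.0367, -1.5955e-6   # humidity linearisation, eq. (1)
D1, D2 = -39.7, 0.01                        # temperature conversion, eq. (2)
T1, T2 = 0.01, 0.00008                      # temperature compensation, eq. (3)

def convert(so_rh: int, so_t: int) -> tuple[float, float]:
    """Return (temperature in deg C, compensated relative humidity in %RH)."""
    temp = D1 + D2 * so_t                                    # eq. (2)
    rh_linear = C1 + C2 * so_rh + C3 * so_rh ** 2            # eq. (1)
    rh_true = (temp - 25.0) * (T1 + T2 * so_rh) + rh_linear  # eq. (3)
    return temp, max(0.0, min(100.0, rh_true))               # clamp to 0-100 %RH
```

For example, a raw humidity count of 1500 with a raw temperature count of 6400 yields a reading near 24.3°C and roughly 49% RH under these coefficients.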
The temperature sensor has excellent linear properties, and further compensation algorithms can be used when high accuracy is required under extreme operating conditions. In the normal operating range of −10°C to 50°C, no compensation is required. The measurement resolution of temperature and humidity is 12 bit.
The SHT15 digital temperature and humidity sensor can keep the error within bounds over its measuring range, but temporary signal drift may occur beyond the normal working range. This was verified in tests at the actual installation site: when the supply voltage is switched, the temperature and humidity values jump abnormally and then slowly recover to the calibrated state. In addition to the inherent error of the sensor, the monitoring environment and signal interference introduce errors that are hard to estimate. Therefore, the collected temperature and humidity data need to be processed.

Optimisation of temperature and humidity data
The temperature and humidity WSN monitoring system in the cigarette factory collects the target environmental data in real time. However, a single temperature or humidity value cannot fully represent the temperature and humidity state of an object or a region. In-depth analysis of the dimensionality of the temperature and humidity data shows that these data are related in time and space and have a high degree of spatiotemporal correlation [11]. Combined with temperature and humidity control in the production process, research on the optimisation processing of spatiotemporally correlated data is very important.

Elimination of gross error
Gross errors cause the measured data to deviate significantly from the true value, which has a strongly negative impact on control accuracy, so an efficient gross-error discrimination technique is needed. The 3σ criterion requires a large number of repeated measurements; in a real-time monitoring system, however, the data uploaded by a single sensor per unit time is limited, and because the temperature and humidity of the environment change over time, the criterion is difficult to apply. Here, the Dixon criterion and the range-ratio method were used to study the distribution of the temperature and humidity data sequence. When x_i obeys a normal distribution, the statistic r_ij in eq. (4) is used to determine whether the maximum value x_n contains a gross error: if r_ij is greater than a certain critical value, x_n is considered to contain a gross error. Similarly, the statistic r_ij in eq. (5) determines whether the minimum value x_1 contains a gross error. The critical values of the statistic r_ij are shown in Table 4: select the significance level α, and the critical value r(n, α) is determined by the number of samples (repeated measurements) n.
When x i does not obey the normal distribution, the critical value needs to be calculated according to the distribution characteristics of the data sequence. Table 5 shows the critical value of the common distribution when the significance level α is 5%. When the number of samples is 3, the Dixon criterion is insensitive to the data distribution characteristics.
The process of discovering and removing gross errors by the Dixon criterion is shown in Fig. 6.
Taking into account the large hysteresis of temperature and humidity, the sensor automatically uploads data every 30 s in practical applications. For example, when x_i obeys a normal distribution, we use the Dixon criterion to test the data of the last 2 min with n = 4 and α = 1%, for which the critical value is r(4, 1%) = 0.889. If the temperature changes by 2°C within 2 min, two adjacent data points that differ by more than 2°C × 0.889 = 1.778°C are considered to contain a gross error and are eliminated.
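The Dixon test described above can be sketched as follows. This is a minimal illustration for the r_10 statistic on a window of n = 4 samples (the 2-minute window at one sample every 30 s), with the critical value 0.889 = r(4, 1%) from the text; the helper name is our own.

```python
# Minimal sketch of the Dixon r10 gross-error test (eqs. (4) and (5))
# applied to the last n = 4 readings in a 2-minute window.
# The default critical value 0.889 corresponds to r(4, 1%).
def dixon_r10(samples: list[float], critical: float = 0.889):
    """Return (max_is_gross, min_is_gross) for a small sample."""
    x = sorted(samples)
    spread = x[-1] - x[0]
    if spread == 0:                        # constant data: nothing to reject
        return False, False
    r_max = (x[-1] - x[-2]) / spread       # eq. (4): suspect the maximum
    r_min = (x[1] - x[0]) / spread         # eq. (5): suspect the minimum
    return r_max > critical, r_min > critical
```

Applied to the window [21.0, 21.1, 21.05, 23.2], the last reading is flagged as a gross error because (23.2 − 21.1)/(23.2 − 21.0) ≈ 0.955 exceeds 0.889.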

Analysis of data quality dimension
Data quality measures the degree to which consistency, schema and instance correctness, completeness, and minimality are achieved in a given system [12]. A dimension is a formal description of a specific quality goal [13]. In order to evaluate data quality, it is decomposed into several dimensions; the core of data quality assessment is how to assess each dimension qualitatively and quantitatively, and each dimension includes at least one specific qualitative or quantitative measure. The temperature and humidity data sensed by the WSNs form stream data, and the data stream needs to be segmented by time. The acquisition system is a centralised data processing system with powerful data processing capabilities. Here, we consider data quality in four dimensions: correctness, integrity, data volume, and time correlation.

Correctness dimension:
The correctness dimension describes the proximity between the observed value of a single datum and its true value. From the perspective of error analysis, it can be described in the form of eq. (6).
where t is the current state of the system, f(t, r) is the systematic error, n is the random error, and g is the gross error; r̂ is the observed value and r is its true value.

Integrity dimension:
The integrity dimension describes how much observation data a data set contains. The communication between sensors and wireless networks may cause data loss, and software and hardware failures will cause data not to be reported at sampling time. In this case, the acquisition system will get a lot of empty values, which will cause great trouble for data processing. In this study, in the tobacco alcoholisation warehouse, the null value is replaced by the average value of a set of latest data in the same area. In the roll packing workshop, the null value is replaced by the average of the three latest data.
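The two null-value replacement policies above can be sketched in a few lines. The function name and signature are our own illustration; the policies themselves (regional mean in the warehouse, mean of the node's three latest readings in the workshop) are as described in the text.

```python
# Sketch of the null-value replacement policy for the integrity dimension:
# in the alcoholisation warehouse a missing reading is filled with the mean
# of the latest readings from the same area; in the roll packing workshop
# with the mean of that node's three latest readings.
def fill_null(value, area_latest: list[float],
              node_history: list[float], warehouse: bool) -> float:
    if value is not None:
        return value                       # reading present: keep it
    source = area_latest if warehouse else node_history[-3:]
    return sum(source) / len(source)       # impute with the chosen mean
```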

Data volume dimension:
The data volume dimension describes the amount of raw data required to implement a given business logic. It indirectly reflects the correctness and integrity dimensions of the acquisition system and guarantees the accuracy and stability of the control system. The temperature and humidity data acquisition period is 30 s. For the centralised computing environment, the data volumes are shown in Table 6.

Time correlation dimension:
Much of the literature measures the time correlation dimension through timeliness and volatility. Volatility refers to the duration over which the current data reflect the condition of the observed object. Temperature and humidity exhibit hysteresis, relative continuity, and stability, and therefore have low volatility over a relatively short period. The data uploaded by the sensors constitute a time series, and timeliness has two meanings. On the one hand, a single data instance needs to remain fresh, and the generation timestamp of each data instance can serve as the metric. On the other hand, for overall processing of the data sequence, the data instances generated by different sensors must be aligned in time, and the jitter of the data instance interval must not be too severe.

Data fault detection
The SHT15 temperature and humidity sensor is sensitive to its environment and may upload data with large errors or incorrect sensing data in a complex environment. Such data instances do not reflect the real situation of the monitored environment [14,15]. Therefore, the uploaded faulty data need to be removed to meet the data quality requirements of the air-conditioning control system. We use the simplest fault-detection arrangement: complex data calculations are separated from the sensor nodes and aggregation nodes. The WSN is only responsible for data sensing and communication, while the anomaly detection and data quality algorithms are implemented in the data processing system. This greatly reduces the computing burden and communication load of the sensor network and ensures the stability and communication quality of data acquisition. The data produced by the data processing system are more representative and more accurate.
Each of the four areas of the tobacco leaf alcoholisation warehouse is a 25 m × 35 m rectangle, and the roll packing workshop is a 50 m × 98 m rectangle. The sensor nodes cannot communicate with each other; each sensor node communicates with the aggregation node of its region in a single hop, so the delay is small, but the data must be processed centrally.
It is assumed that the digital temperature and humidity sensor can exhibit the following faults. (a) Outlier fault: in the time series of measured data instances, sporadic instances that deviate significantly from the time-domain model of the data are considered outliers, generally caused by environmental disturbances or other unknown causes. (b) Spike fault: continuous measurement data whose rate of change repeatedly exceeds the normal range, usually caused by power supply or hardware failure. (c) Stuck-at fault: usually caused by sensor hardware failure; its characteristic is that the variance of the continuous measurements is unusually small, and it can be found by comparing the variance of the measurements of the nodes around the sensor. (d) High noise fault: contrary to stuck-at faults, the variance of high-noise faults is particularly large; the causes are similar, such as hardware failure or use of the sensor beyond its measurement range.
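For evaluation purposes, the four fault models (a)-(d) can be injected into a clean reading series. The magnitudes and window lengths below are arbitrary assumptions chosen only to make each pattern visible; they are not parameters from the paper.

```python
import random

# Illustrative generators for the four fault models (a)-(d).
# All magnitudes and window lengths are ASSUMED, for demonstration only.
def inject_outlier(series, i, offset=5.0):
    series[i] += offset                    # (a) one sample deviates sharply

def inject_spike(series, i, n=4, step=2.0):
    for k in range(n):                     # (b) several over-fast changes
        series[i + k] += step * (k + 1)

def inject_stuck_at(series, i, n=8):
    for k in range(n):                     # (c) variance near zero
        series[i + k] = series[i]

def inject_high_noise(series, i, n=8, sigma=3.0):
    for k in range(n):                     # (d) variance abnormally large
        series[i + k] += random.gauss(0.0, sigma)
```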
(i) Similarity measure: the similarity measure evaluates whether two objects represent the same observed object, and the effectiveness of the evaluation depends on the similarity function. The temperature and humidity environment of the tobacco leaf alcoholisation warehouse is relatively stable and interference is very small, so it is well suited to evaluating data faults with this method. The input of the similarity function f(v_1, v_2) ↦ s is two data objects v_1 and v_2, and the output is a measure s. If s is greater than a given threshold, v_1 and v_2 are considered to represent the same observed object; in that case we take the data of the two sensor nodes to have the same characteristics and to contain no faulty data. (ii) Sensing vector: the sensing vector of a sensor node consists of a series of data instances, as shown in eq. (7).
where d_i(t) represents the sample value of node i at time t. In our study, the four readings taken within 2 min at each sensor node constitute a sensing vector; temperature data and humidity data form separate sensing vectors. (iii) Normalisation: normalisation projects the values of the numeric attributes of the sensing data instances onto a specific range, eliminating the similarity measure error caused by different attribute scales. The similarity between sensing sequences is reflected in the similarity measure between sensing vectors, so the sensing vectors are normalised before the similarity measure is computed.
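The normalisation step (iii) can be sketched as a simple min-max projection onto [0, 1]. The use of the sensor's physical range (e.g. −10°C to 50°C for temperature, as given earlier for the operating range) as the projection bounds is our assumption; any fixed per-attribute range serves the same purpose.

```python
# Sketch of min-max normalisation of a sensing vector (step (iii)): each
# attribute is projected onto [0, 1] using a fixed physical range, so
# temperature and humidity vectors become comparable in scale.
def normalise(vector: list[float], lo: float, hi: float) -> list[float]:
    return [(v - lo) / (hi - lo) for v in vector]
```

For instance, a temperature vector normalised over −10°C to 50°C maps 20°C to 0.5.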

Data fault detection algorithm
According to the characteristics of the temperature and humidity data of the sensor network, the generalised Jaccard coefficient, also called the Tanimoto coefficient, was chosen as the similarity function. If k and l denote two adjacent nodes with normalised sensing vectors x_k and x_l, and x_k ⋅ x_l is the dot product of the two vectors, the generalised coefficient can be expressed as
j_kl = (x_k ⋅ x_l) / (‖x_k‖² + ‖x_l‖² − x_k ⋅ x_l)
For real-valued vectors, the range of j_kl is [−1/3, 1].
The algorithm is divided into two steps, taking the tobacco leaf alcoholisation warehouse as an example. In the first step, after the end of each automatic data upload cycle (T = 30 s), the data processing system compares the sensor nodes in each region and calculates the similarity between the current node and its neighbour nodes. In the second step, the health status of the sensor node is determined from the voting results of the nodes in the area or from the propagated status information. The data processing system maintains a sensing vector for each sensor node and also maintains a regional neighbour status table, which saves the similarity measurement results. The neighbour state table of the sensor node with area number 0201 at a certain moment is presented in Table 7, where SensorID stores the neighbour node number and SimiCorr stores the node similarity measurement result. The larger the Jaccard coefficient, the higher the sample similarity. NodeState stores the current health status of the neighbour node; the algorithm defines three health states: UNKNOWN, GOOD, and FAULTY. Under timer control, the neighbour state table is updated periodically. The basic rule is to select the nodes whose health status is GOOD based on the voting mechanism.
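The similarity function above can be written directly from its definition. A minimal sketch of the generalised Jaccard (Tanimoto) coefficient between two normalised sensing vectors:

```python
# Generalised Jaccard (Tanimoto) coefficient between the normalised
# sensing vectors x_k and x_l of two neighbouring nodes:
#   j_kl = (x_k . x_l) / (||x_k||^2 + ||x_l||^2 - x_k . x_l)
def tanimoto(xk: list[float], xl: list[float]) -> float:
    dot = sum(a * b for a, b in zip(xk, xl))
    nk = sum(a * a for a in xk)            # ||x_k||^2
    nl = sum(b * b for b in xl)            # ||x_l||^2
    return dot / (nk + nl - dot)
```

Identical vectors score 1, and orthogonal vectors score 0, matching the rule that a larger coefficient means higher sample similarity.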
In the neighbour state table, a neighbour node whose NodeState is GOOD contributes 3 points to a node in the area, an UNKNOWN node contributes 1 point, and a FAULTY node contributes −2 points. The accumulated score over all neighbour nodes is the score of the whole region on the health status of a node. The specific process is as follows. (a) Check the SimiCorr field of each record in the neighbour status table; if the value is greater than the preset similarity threshold, the status is GOOD. The similarity threshold settings are shown in Table 8. (b) There are 23 sensor nodes in the roll packing workshop, and calculating every node with the above method is time-consuming and wastes computing resources. Therefore, we use an extended rule to find more nodes whose health state is GOOD: in the decision process, when the health state of a node is GOOD and the similarity test result between that node and the current node is >0.9, the current node's health status is also deemed GOOD. This rule has a constraint: the distance between the two neighbouring nodes must be considered. There are few nodes in the tobacco leaf alcoholisation warehouse, so the calculation can be finished with step (a) alone. The roll packing workshop, however, is very large, and sensors far apart are strongly affected by the environment and cannot be judged simply by the extended rule; when judging the health state, the distance between two nodes must be calculated, and the extended rule is applied only when the distance between two adjacent nodes is less than the limit distance. (c) After finding all the nodes whose health status is GOOD, the following decision rules are used to determine the fault nodes. We establish a scoring mechanism to determine whether a node is in the GOOD state; the judgment table is shown in Table 9.
If a neighbour node's health status is GOOD but the similarity test result between the current node and that neighbour is SimiCorr < 0.9, the current node's health status is set to FAULTY. If the results of two consecutive cycles are both FAULTY, the current node is judged to be a fault node. Finally, in each region, we average the temperature and humidity of all sensor nodes whose health status is GOOD and whose data contain no gross errors, and use the averages as the input parameters of the air conditioning control system.
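The regional voting step can be sketched as follows. The scoring weights (GOOD → +3, UNKNOWN → +1, FAULTY → −2) come from the text; the decision threshold of a non-negative total score is our assumption standing in for the judgment rules of Table 9.

```python
# Sketch of the regional voting step: each neighbour's recorded health
# state contributes GOOD -> +3, UNKNOWN -> +1, FAULTY -> -2, and the
# regional total decides the node's state.  The threshold (total >= 0)
# is an ASSUMPTION standing in for the judgment rules of Table 9.
SCORE = {"GOOD": 3, "UNKNOWN": 1, "FAULTY": -2}

def vote(neighbour_states: list[str]) -> str:
    total = sum(SCORE[s] for s in neighbour_states)
    return "GOOD" if total >= 0 else "FAULTY"
```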

Experimental results
For the fault node detection algorithm, there are two important evaluation indicators: the missing alarm rate and the false alarm rate. The ratio of missed faulty nodes to all faulty nodes in the network is called the missing alarm rate; the ratio of non-faulty nodes that are misjudged to all non-faulty nodes is called the false alarm rate. To verify the effectiveness of the algorithm, we chose 5, 10, 15, and 20% of the nodes as fault nodes in the experiment, respectively. The missing alarm rate and false alarm rate of the algorithm for spike faults and stuck-at faults are shown in Figs. 7 and 8. The experimental results show that the missing alarm rate and false alarm rate for detecting spike faults are higher than those for detecting stuck-at faults. This is because a stuck-at fault makes the measured value almost constant, so the difference between the measured value and the actual temperature and humidity is obvious and the algorithm can easily find it with the similarity test. For both faults, the missing alarm rate of the algorithm is <3% and the false alarm rate is <6%. Over more than 6 months of actual system operation, the algorithm has maintained good fault detection capability. Compared with the original temperature and humidity detection data, the data produced by the designed algorithm are more effective and representative.
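The two indicators defined above can be computed directly from the sets of truly faulty nodes and detector-flagged nodes; the function and parameter names are our own illustration.

```python
# Sketch computing the two evaluation indicators: the missing alarm rate
# (faulty nodes the detector failed to flag, over all faulty nodes) and
# the false alarm rate (healthy nodes wrongly flagged, over all healthy
# nodes).
def alarm_rates(true_faulty: set, detected_faulty: set, all_nodes: set):
    healthy = all_nodes - true_faulty
    missing = len(true_faulty - detected_faulty) / len(true_faulty)
    false_alarm = len(detected_faulty & healthy) / len(healthy)
    return missing, false_alarm
```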

Conclusion
This paper studied data optimisation based on the WSN temperature and humidity monitoring system of a cigarette factory. Data quality evaluation indicators were designed, and data quality was evaluated in the dimensions of correctness, integrity, data volume, and time correlation. The Dixon criterion and the range-ratio method were used to eliminate data with gross errors. A data fault detection method was designed, and abnormal data were eliminated by the neighbour node similarity criteria. Through the extended rules for healthy nodes, the problem of excessive growth in computation caused by a large number of nodes was solved. The data processing model can improve data quality, detect data faults, and provide data support for enhancing the robustness of air-conditioning control systems.