IoTHunter: IoT network traffic classification using device specific keywords

With the proliferation of IoT devices, network management and security monitoring are becoming a challenge. Traffic classification methods are used for the timely detection of IoT device status and behaviour. Herein, IoTHunter, a Deep Packet Inspection-based IoT traffic classifier, is described. It extracts unique keywords, such as domain names and device names, to identify flows belonging to a particular device. IoTHunter automates keyword extraction using the frequency of occurrence of words in the flows of different devices. To further enhance performance, IoTHunter combines device-specific keywords with the MAC address of the device for subsequent flow labelling. IoTHunter is evaluated on a publicly available IoT dataset, and good classification accuracy over a range of IoT devices is demonstrated.


| INTRODUCTION
Today's computer networks interconnect a large number of devices other than traditional desktops, servers, printers, and mobile phones [1]. It is common to have multiple Internet of Things (IoT) devices, such as surveillance cameras, home appliances, fire alarms, and temperature sensors, connected to the Internet. Although IoT devices perform very specific tasks, they have a networking stack and can send (and receive) TCP/IP packets to (and from) other devices or servers. In a smart environment comprising homes and offices, a number of unique IoT devices may be installed by different departments. With thousands of IoT devices connected together, it is difficult to identify which devices are connected to the network at a given point in time. Creating and maintaining a database of connected devices manually is an error-prone and time-consuming process. In general, network management with IoT devices is challenging for two important reasons:
1. Security monitoring for policy enforcement: IoT devices are vulnerable to cyber-attacks. In fact, several studies [2][3][4] have shown that these devices have little or no in-built security (due to resource constraints) and are more susceptible to such attacks. Thus, to secure the network from cyber-attacks, it is important to monitor these devices.
2. Providing meaningful connectivity: In a very heterogeneous environment comprising IoT devices, it is important to provide meaningful connectivity to every device for its proper functioning. Different devices may have different requirements in terms of bandwidth, tolerance to packet delays, jitter, etc. For example, a surveillance camera may need more bandwidth for transmitting live video to a server. Other devices, such as fire alarms, may not require high bandwidth but are sensitive to delays and to the reliability of data transmission.
Thus, network administrators require timely visibility into the types of IoT devices connected to the network. This is generally achieved with traffic classification techniques. In traditional networks, traffic classification is handled either by inspecting network flows or by using Deep Packet Inspection [5]. Some previous works [6] have argued that IoT traffic classification is challenging because of the sheer number of such devices. Some IoT traffic classification methods use machine learning algorithms [7] with flow characteristics of devices for their identification. However, identifying unique flow characteristics with thousands of devices is non-trivial, as there may be significant overlap between them with respect to device characteristics and functions [8]. Another possible approach is to explore whether these devices (and their associated network flows) can be identified with Deep Packet Inspection (DPI). Herein, we study device identification and IoT flow classification with DPI. Our proposed method identifies unique keywords from network flows and subsequently uses them to classify IoT network flows. In particular, we make the following contributions.
(i) We propose a method to automatically extract and identify device specific keywords. (ii) We describe a flow classification technique which uses device specific keywords to classify network flows originating from IoT devices in a smart environment. (iii) We present experimental results with an IoT dataset comprising data of 18 IoT devices and report the performance of our system.
The study is organised as follows. In Section 2, we describe prior work related to traffic classification of IoT devices. In Section 3, we explain our proposed system, IoTHunter, which classifies network flows of IoT devices. Experimental results are described in Section 4. A performance comparison of IoTHunter with a machine learning-based approach is elaborated in Section 5. A few limitations and possible enhancements of IoTHunter are mentioned in Section 6. Finally, conclusions are given in Section 7.

| RELATED WORK
In this section, we describe some prior work related to classification of IoT traffic.

| Classification using flow characteristics
Network flows belonging to IoT devices can be identified with flow characteristics. Conventionally, attributes such as the number of packets in a flow, frequency of flows, inter-arrival time between packets within a flow, and packet size are used as unique identifiers for each device type. Supervised [6,7,9,10], unsupervised [11,12], and deep learning [13,14] methods have been used for classification.
1. Supervised machine learning methods: These methods require labelled feature vectors generated from the network flows of each device in order to learn the unique characteristics of each device type. The work described in Ref. [7] is one of the first studies to label flows of IoT devices. It uses different machine learning-based classifiers, such as Gradient Boosting Machine (GBM) and Random Forest, to identify flows pertaining to different devices. Shahid et al. [9] proposed a technique to recognise IoT traffic using feature vectors with six machine learning classifiers. A multi-stage classifier that classifies IoT devices based on network traffic characteristics (port numbers, domain names, cipher suites, flow rate, flow volume, etc.) is proposed in Ref. [6]. The work described in Ref. [15] uses background traffic generated by IoT devices to identify and label the flows generated by them. It uses periodically generated flows and extracts features which are represented as device fingerprints. These fingerprints are later provided to a K-nearest neighbour classifier to identify the device type. Another work described in Ref. [16] [14] used a deep learning-aided capsule network to form a classification mechanism that integrates feature extraction, feature selection, and the classification model. The work described in Ref. [19] provides a framework to automatically extract features and learn traffic classes (IoT, non-IoT) using stacked auto-encoders, which encode a TCP flow into a fixed-size vector. A feature vector in the form of a distribution is generated for every device, and these device-specific distributions are used for classifying flows originating from various IoT devices.
Lei Bai et al. [20] proposed an LSTM-CNN cascade model to classify IoT traffic. For feature collection, the authors segmented the IoT traffic into sub-traffic flows of fixed time intervals. They extracted six features from each interval: user packet count, user packet length average, user packet length peak, control packet count, control packet average, and control packet peak. A packet is considered a user packet if it is UDP/TCP/HTTP or any other application layer protocol. The extracted features are fed as input to the LSTM-CNN cascade model to identify the device category. This implementation uses SVM adopting a one-vs-all strategy. Results reveal that the performance of the classifier degrades as the number of device categories increases.
All the works discussed above use machine learning and deep learning techniques to classify network flows belonging to IoT devices. Although machine learning algorithms are used for IoT traffic classification, Desai et al. [8] argued that the performance of a machine learning algorithm mainly depends on the identified features. Providing all extracted features as input to a machine learning-based model is not always beneficial, as obtaining features adds to the computational cost. Our work, IoTHunter, differs from existing work in that it uses DPI to recognise IoT devices using device-specific keywords rather than depending on a feature vector.

| Alternate schemes
Noguchi et al. [21] use profiles generated from the communication pattern of each device and compare them in real time with communication patterns collected from unknown devices to identify the device type. For comparison, they use a simple Euclidean distance-based metric. The authors argue that this kind of similarity comparison is feasible because the communication pattern of each device is different.
Some existing works perform security assessment of IoT devices [3] or identify these devices using complementary methods, such as the IP addresses to which they connect [22]. For example, Sivaraman et al. [3] proposed a scheme to identify security vulnerabilities in IoT devices based on four parameters, namely confidentiality, integrity and authentication, access control, and the ability to withstand reflective attacks. Although these works do not deal explicitly with classification, they help in understanding which devices are vulnerable.
Hsu [23] proposes identifying the class of an IoT device rather than the specific device name. The idea is motivated by the fact that devices of the same category exhibit similar traffic behaviour. For example, all IP cameras may exhibit similar traffic flow behaviour. The author used three features, namely bytes per second, packet direction, and time since last communication, to identify the device category. Limited testing revealed that for a few device categories, such as IP cameras, the approach achieves better classification, while for others, such as temperature and motion sensors, its prediction accuracy is limited.

| DEEP PACKET INSPECTION-BASED IoT TRAFFIC CLASSIFICATION
In this section, we describe our proposed technique IoTHunter which is used to detect IoT devices connected in a smart environment. IoTHunter is a content-based IoT traffic classifier. Its goal is to label the traffic flows belonging to different IoT devices.
IoTHunter is a DPI-based network flow classifier which requires labelled flows from different devices to extract device-specific keywords. These keywords are subsequently used to label (classify) the flows. Its working principle is based on the idea that many devices use specific keywords in their communication. For example, many IoT devices regularly contact servers maintained by the device manufacturers and also use specific domain names for communication and name resolution. These can be used to identify those devices reliably. A sample snapshot of a payload taken from the Belkin Motion Sensor device is shown in Figure 1. From this figure, we can clearly see that it is contacting the domain belkin.org, which can serve as a unique identifier for devices manufactured by Belkin. Once such device-specific keywords are identified, they can subsequently be used to identify flows belonging to these devices. The two phases, namely extracting keywords from network flows and classifying flows, are elaborated in the next two subsections.

| Extracting device specific keyword
To classify network flows of IoT devices, device-specific keywords are used; hence, the keywords are extracted first. Keyword extraction involves operations such as flow reconstruction, payload extraction, term extraction, filtering of uninteresting terms, normalized term frequency calculation, and keyword selection, as shown in Figure 2. The main operations of this phase are as follows.

1. Flow Reconstruction: IoTHunter classifies flows belonging to different devices; hence, flows are reconstructed by reading packets. Packets belonging to the same flow have five attributes in common: the source IP address, destination IP address, source port number, destination port number, and layer-four protocol (TCP/UDP). The flow reconstruction phase uses these five attributes to reconstruct the bidirectional network flows belonging to one session. For a TCP flow, all packets sharing these attributes between the SYN and FIN/ACK packets are part of one flow. As UDP is a connectionless protocol, a threshold time of inactivity is considered as the end of a flow between the two devices.
2. Term Extraction: Flows generated in the previous phase are used for term extraction. The payloads of all packets belonging to a flow are concatenated to generate a flow payload, and this concatenated payload is used for extracting terms. For this, a set of delimiters such as the newline character, white space, and special characters is used. For example, if a flow $f_i$ has the payload content 'I am the first flow', then the set of terms generated is {I, am, the, first, flow}. These terms are used in the subsequent phases.
3. Filtering Unwanted Terms: Some of the terms may not be good candidates for uniquely identifying devices or classifying flows. For example, some of the flows may contain dates, host names, and other related details within the payload. These terms are filtered to get a more meaningful set of terms for further consideration. This filtering can be automated with a dictionary of negative words or regular expression-based filtering, or it can be done manually. In addition to filtering the above undesirable terms, we also remove terms which are common across devices. The idea is to filter out terms which do not uniquely identify devices. Let the terms of device $D_i$ be $T(D_i)$ and the terms of device $D_j$ be $T(D_j)$; we remove all terms which are in $T(D_i) \cap T(D_j)$. This is repeated for all pairs of device types considered. Removing overlapping terms helps in reducing false positives.
4. Potential Keyword Selection: In this phase, device-specific keywords are extracted and recorded in the database. The terms left after the filtering in the previous stage, which we call candidate terms, are considered here. The frequency of occurrence of each candidate term is calculated by counting how many times it has appeared in the flows. For each candidate term, the normalized term frequency is calculated with respect to the number of flows for that device, as in Equation 1.

$$NTF = \frac{CT_{freq}}{TF} \tag{1}$$

where $CT_{freq}$ = total frequency of the candidate term, and $TF$ = total flows used for generating keywords.
Once the normalized term frequency of every candidate term is calculated, the mean and standard deviation of the frequency values are used to prune the candidate term set and obtain the final set of device-specific keywords. This is done by setting a threshold on the normalized term frequency, as in Equation 2.

$$Threshold_{freq} = \mu + K \times \sigma \tag{2}$$

where μ is the mean normalized frequency over all candidate terms, σ is the standard deviation of the normalized frequency values, and K is a positive constant. All terms in the candidate term set with a normalized frequency above $Threshold_{freq}$ are identified as the final set of keywords for that device.
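The term extraction, filtering, and keyword selection steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the delimiter set, and the sample payloads are hypothetical; each term is assumed to be counted once per flow; the population standard deviation is assumed; and the threshold is assumed to take the form of the mean plus K standard deviations.

```python
import re
import statistics
from collections import Counter

# Hypothetical delimiter set: white space plus a few special characters.
DELIMITERS = re.compile(r"[\s,;:=&?/<>()]+")


def extract_terms(flow_payload):
    """Split a concatenated flow payload into a set of terms."""
    return {t for t in DELIMITERS.split(flow_payload) if t}


def select_keywords(flows_by_device, negative_words, k=1.0):
    """Return {device: set of keywords} from {device: [flow payloads]}."""
    # Count, per device, the number of flows in which each term appears,
    # after dropping terms listed in the negative word dictionary.
    counts = {}
    for device, flows in flows_by_device.items():
        counter = Counter()
        for payload in flows:
            counter.update(extract_terms(payload) - negative_words)
        counts[device] = counter

    keywords = {}
    for device, counter in counts.items():
        # Remove terms shared with any other device (pairwise overlap).
        others = set()
        for other, other_counter in counts.items():
            if other != device:
                others.update(other_counter)
        total_flows = len(flows_by_device[device])
        # Normalized term frequency of each remaining candidate term.
        ntf = {t: n / total_flows for t, n in counter.items() if t not in others}
        if not ntf:
            keywords[device] = set()
            continue
        mu = statistics.mean(ntf.values())
        sigma = statistics.pstdev(ntf.values())
        threshold = mu + k * sigma  # assumed form of the threshold
        keywords[device] = {t for t, f in ntf.items() if f > threshold}
    return keywords


sample_flows = {
    "belkin": [f"GET /{p} Host: belkin.org" for p in ("alpha", "beta", "gamma", "delta")],
    "lifx": [f"GET /{p} Host: lifx.co" for p in ("one", "two", "three", "four")],
}
sample_keywords = select_keywords(sample_flows, negative_words={"GET"}, k=1.0)
print(sample_keywords)
```

In this toy input, "Host" is dropped because it occurs for both devices, and the per-flow path terms fall below the threshold, so only the domain name survives for each device.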

Finding Keyword Offset Range: In the subsequent step, we calculate the byte offset values of the keywords within the reconstructed flow payload. Keyword byte offset calculation can also be combined with the term extraction phase by recording the offset values of terms (this method is adopted in Algorithm 1), which avoids a second pass over the payload.
Let the keyword set of a device be $K = \{k_1, k_2, \ldots, k_n\}$. For each $k_i \in K$, we check whether $k_i$ appears in flow $f_j \in F$ (where $F$ is the set of all flows). If $f_j$ contains $k_i$, the byte offset of $k_i$ (from the beginning of the payload) is recorded. This is repeated for all $k_i$ and $f_j$, resulting in a series of offset values for every $k_i$. Thus, if $k_i$ appears in $m \le |F|$ flows, at most $m$ distinct offset values are generated. The minimum and maximum of these values give the byte offset range of the keyword.
Algorithm 1 summarises the device-specific keyword extraction described above. The input to the algorithm is a set of packets of a device D and the negative word database. At the end of flow processing, it generates a set of keywords to be used subsequently for classification. The algorithm begins with flow reconstruction and, for each flow, extracts terms from the flow payload by parsing it using delimiters. At the same time, it records the byte offset values of the terms. Once all the flows are processed and the terms are extracted, it filters the unwanted terms and selects the final set of keywords.

| Flow classification
Figure 3 shows the various stages of the classification operation. As in the previous phase, flows are reconstructed by reading network packets and words are extracted using the set of delimiters. Flows are classified using one of the following two methods.
1. Identifying Flows with Keywords: A test flow $f_i$ is evaluated against all device keywords. If any keyword $k_j \in K_{D_g}$ is found in $f_i$ within the lower and upper byte offset range, then the flow $f_i$ is labelled as belonging to device type $D_g$. Whenever a flow $f_i$ is classified as a device type, the MAC address of the communicating device is stored in a database, which is subsequently used to classify or identify other flows from the same device.
2. Identifying Flows with Device MAC Address: A flow is identified as belonging to a specific device if an earlier flow sharing the same source MAC address has been identified as belonging to that device. As mentioned previously, this requires recording the source and destination MAC addresses of the packets in the first flow which contains a keyword. Identifying flows with the MAC address has a few advantages:
(i) It can speed up the classification, as parsing the payload is not required. Parsing the payload is a computationally expensive operation.
(ii) Some of the flows may not contain keywords, particularly if the IoT device has more than one set of flows and keywords can be seen in only one set of flows. By using the MAC address, such flows can also be classified.
(iii) Some of the devices may have both encrypted and non-encrypted flows. The flows which are not encrypted may contain keywords. Using these non-encrypted flows, the other (encrypted) flows can also be classified. This enhances the classification coverage.
Algorithms 2 and 3 describe the operations involved in flow classification. Algorithm 2 takes a set of network packets as input and labels the flows as belonging to various IoT devices. It starts with flow reconstruction and consults keyword database for different devices. If any of the keywords in the database is found in the flow within its minimum and maximum offset range, it is labelled as a flow of that device. It also logs the corresponding MAC addresses for subsequent flow classification. In the absence of any keyword, Algorithm 2 invokes Algorithm 3 to check if the MAC addresses of packets in the flow are identified earlier as belonging to a particular device and hence label the flow.
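The two-stage labelling described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Algorithms 2 and 3 themselves: the `Keyword` record, the `classify_flow` function, the in-memory `mac_store` dictionary, and the sample payloads are all hypothetical, and only the source MAC address of a flow is tracked.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    term: bytes        # keyword as it appears in the payload
    device: str        # device type the keyword identifies
    min_offset: int    # smallest byte offset seen during keyword extraction
    max_offset: int    # largest byte offset seen during keyword extraction


def classify_flow(payload, src_mac, keyword_store, mac_store, use_offsets=True):
    """Label a flow by keyword first, then fall back to the MAC store.

    Returns the device label, or None if the flow stays unclassified.
    """
    for kw in keyword_store:
        pos = payload.find(kw.term)
        while pos != -1:
            # Accept the match only inside the recorded offset range,
            # unless the offset constraint is switched off.
            if not use_offsets or kw.min_offset <= pos <= kw.max_offset:
                mac_store[src_mac] = kw.device  # remember the device MAC
                return kw.device
            pos = payload.find(kw.term, pos + 1)
    # No keyword matched: reuse the label of an earlier flow from this MAC.
    return mac_store.get(src_mac)


keyword_store = [Keyword(b"belkin", "Belkin Motion Sensor", 20, 30)]
mac_store = {}
http_flow = b"GET / HTTP/1.1\r\nHost: www.belkin.com\r\n"
first = classify_flow(http_flow, "aa:bb:cc:00:00:01", keyword_store, mac_store)
second = classify_flow(b"\x17\x03\x03 opaque", "aa:bb:cc:00:00:01", keyword_store, mac_store)
third = classify_flow(b"\x17\x03\x03 opaque", "aa:bb:cc:00:00:02", keyword_store, mac_store)
print(first, second, third)
```

The second flow carries no keyword but is labelled through the stored MAC address, while the third flow comes from an unseen MAC address and stays unclassified. Passing `use_offsets=False` matches a keyword anywhere in the payload.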

| Asymptotic complexity
In this subsection, we describe the asymptotic complexity of every module of IoTHunter. Table 1 lists the asymptotic complexities of the different modules for the training and testing parts. The flow reconstruction module reads the header information from each packet and adds the packet to its associated flow. For F flows in the trace (or live collection) and P packets, its complexity is O(F × P). This is because every packet needs to be checked against every flow to decide whether it is part of an ongoing flow or a new flow. The payload parsing and term extraction module reads the entire payload to generate a set of terms. For a payload of size B bytes, the parsing operation takes O(B) time. For F flows and J candidate terms, the frequency generator has a complexity of O(F × J). The normalized term frequency calculator reads the candidate term set once, and for J terms it has a complexity of O(J). Byte offset calculation has a complexity of O(F × SK) for F flows and SK final selected keywords. SK can be as large as the candidate term set, so its complexity can be rewritten as O(F × J). Thus, the total training complexity is the sum of these terms, O((F × P) + B + (F × J)).
The MAC Detect module compares the MAC addresses of a packet against the addresses in the MAC store. It requires scanning the MAC store once and hence has a complexity of O(M), where M is the size of the MAC store. Flow classification requires scanning the keyword store for each flow and, in addition, calls the MAC Detect procedure when no keyword is found. Thus, the overall complexity of this module is O((F × K) + (F × M)), where K is the size of the keyword store and M is the size of the MAC address store.

| EXPERIMENTS
In this section, we describe the experiments performed to evaluate IoTHunter. To assess the classification performance, we use Recall as the metric. Recall is calculated as in Equation 3.

$$Recall = \frac{TP}{TP + FN} \tag{3}$$

where TP denotes the number of network flows correctly classified by IoTHunter and FN denotes the number of flows either misclassified as other device types or not classified. As mentioned previously, we use both keywords and MAC addresses of devices to classify flows. We denote the recall achieved with keywords as $Recall_{kw}$ and the recall achieved with both keywords and MAC addresses as $Recall_{kwMAC}$.
Dataset: For our experiments with IoTHunter, we used a publicly available dataset [24]. This dataset comprises network traffic generated by 28 IoT devices collected over a six-month period. We categorised these IoT devices into three groups based on the type of data they carry.
Pre-processing: The dataset consists of network traffic in the form of .pcap files, with one file generated per day. To obtain the per-device traffic, we filtered traffic based on the MAC addresses listed by the authors of Ref. [6], using the tcpdump tool [25]. Subsequently, all flows belonging to a device type were merged, which gave us device-specific .pcap files.
Implementation: We implemented the keyword extraction and flow classification modules as Java programs with the jNetPcap library, which allows reading network trace files in the form of .pcap files and parsing packets to read their payloads. The keyword search operation, which searches for a keyword in the flows, is implemented through a method provided by the jNetPcap library (payload.contains(keyword)). Using our implementation of IoTHunter, we processed the flows of 18 device types which use partial encryption or no encryption. For IoT devices with low classification accuracy, we use the corresponding device MAC addresses for classification. The MAC addresses of the 18 devices considered in our experiments are shown in Table 2.
Results: For our experiments, we divided the device-specific network flows into two parts in an approximate ratio of 50:50. The first part is used for extracting keywords and the second part is used for testing. Table 3 shows the details of the dataset in terms of the number of flows and the size of the trace files used for the training and testing parts, respectively.
Using the training portion of the dataset, we selected keywords for all 18 devices using the mean and standard deviation of the normalized frequencies of the terms extracted from each device type. Figure 4 shows the normalized frequency histogram and the keywords selected for the Pix-start photo frame device. Figure 4a shows the frequency histogram of a few sample terms (not the complete set) along with their normalized frequency values. The red horizontal line in this figure is the mean frequency value of the terms shown. Figure 4b shows the final set of keywords selected for this device using the standard deviation and the constant K set to two.
Experiment I: We performed the first set of experiments considering all device-specific keywords with their associated byte offset values, as described in Algorithm 1 and Algorithm 2. Figure 5a,b shows the recall rate of IoTHunter for the 18 IoT devices when the keyword byte offsets are considered. As evident from Figure 5a, keyword-based identification has the highest recall rate ($Recall_{kw}$) for the Withings Sleep Sensor, with a value of 0.96. The total number of network flows for this IoT device is 3294 in the testing part. The smallest $Recall_{kw}$ was exhibited by the Netatmo Weather Station, with a value of 0.008. The total number of network flows for this IoT device is 1213. The low recall for this device and a few other devices is due to the following reasons:
a. The keywords identified appear only in a fraction of the flows. This is because more than one type of flow appears for a device type, and the corresponding keywords are found in only one type of flow. IoT devices perform specialised tasks for which they communicate with a limited set of specific domains and servers. IoT traffic, even when encrypted, contains domain names and the server name in plain text, which can be treated as unique to a specific device. However, these keywords may not appear in the payload of every flow, which leads to poor detection performance for keyword-only classification.
b. Some devices use SSL/TLS encryption for privacy reasons, which makes them generate partially encrypted data (with only domain names and/or the device name in plain text).
c. The flows from a particular device are misclassified as flows of another device due to the presence of a keyword from the other device type.
From Figure 5a, we can notice that keyword-only classification does not achieve recall in an acceptable range. To improve the performance (particularly for some devices), we subsequently performed classification using the MAC address along with the keywords, as described in Algorithm 2. Figure 5b shows the recall rate of the classifier based on keywords followed by MAC address-based classification, with the byte offset as a constraint. These results demonstrate that keyword-based classification combined with MAC address-based classification significantly improves the detection performance (we assume that MAC addresses are not spoofed). MAC address-based classification happens only when keyword-based classification fails, that is, when no keyword of any device is found in the payload. Hence, keyword-based classification serves as a baseline classifier which is improved with MAC address-based classification. To understand the extent of improvement in the classification accuracy of IoTHunter, we depict the classification results in the form of confusion matrices for both cases (keyword-only and keyword with MAC address-based classification). Tables 4 and 5 are the confusion matrices for these two cases. In these tables, the column headers include the name of the device, and the number within brackets indicates the number of flows used in testing. For example, in Table 4, the second column, Belkin Motion Sensor (26,197), indicates that there are 26,197 flows in the testing dataset of this device. We can notice from Table 5 that in all cases the number of flows classified correctly has increased, along with a significant reduction in the number of unclassified cases. For example, from Table 4, we can notice that 2644 flows out of 3066 for the Pix-start photo frame are classified as Pix-start photo frame, 389 are identified as Belkin Switch, and the remaining 33 are not classified.
From Table 5, we can notice that the same number of flows (389) are classified as Belkin Switch and 2656 flows are correctly identified as Pix-start photo frame. The number of unclassified cases has come down from 33 to 21. Although the improvement is marginal in this instance, in other cases keyword combined with MAC address-based classification significantly reduces the number of unclassified instances. For example, in the case of Lifx Bulb, 5667 unclassified instances have been reduced to just 34. It is easy to see from the two tables that improvements on a similar scale are made across all the devices. In general, keyword followed by MAC address-based classification makes the classifier robust, as encrypted flows originating from a device which has already been identified can also be classified correctly. The caveat is that MAC-based classification depends on the classification done in the first place being correct. If the first few flows are misclassified and a subsequent flow does not have a keyword, IoTHunter uses the previously stored MAC address for classification, which adds to the misclassification.
Experiment II: To study the classification performance of IoTHunter without byte offset values, we performed a second set of experiments which use only the presence of keywords and not their offset values. The rationale for this experiment is to understand the impact of the byte offset values of keywords on classification. Figure 6 shows the recall rate for keyword-only and keyword plus MAC address-based flow classification in this experiment.
From Figures 5 and 6, we can notice that the offset-based classifier achieved a higher recall rate than the non-offset-based classifier. The reason behind this can be understood from the confusion matrices in Tables 4 and 6 for keyword-only classification and Tables 5 and 7 for keyword plus MAC address-based classification. This is slightly counter-intuitive, as we would expect a higher recall rate when the offset constraint is relaxed. However, except in one case (Samsung smart things), the recall rate decreases in all cases. This decrease is due to misclassifications where one device's keyword is found in another device's flow. With no restriction on the location of the keyword, these flows are classified as the other device type. Further, removing the byte offset constraint increases the number of misclassifications when MAC address-based classification is combined, as previous misclassifications influence subsequent flow classification. This is supported by other entries in the confusion matrices in Tables 4 and 6 for keyword-based classification and Tables 5 and 7 for keyword plus MAC address-based classification, respectively. The importance of the offset-based classifier can be inferred by comparing Figures 5b and 6b. Using the offset-based classifier, flows from the Belkin motion sensor, HP printer, and Withings sleep sensor are classified with recall rates of 0.38, 0.84, and 0.98, respectively. Using the non-offset-based classifier, the same devices exhibit poor recall rates of 0.01, 0.37, and 0.49, respectively. This is because the non-offset-based classifier is less restrictive: a keyword is matched if it appears anywhere in the payload. Relaxing the byte offset constraint therefore increases the false positives and results in lower recall rates.
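As a worked check of the recall metric used throughout this section, the Pix-start photo frame counts quoted in Experiment I (from Tables 4 and 5) can be plugged into the recall definition. The helper name `recall` and the variable names below are illustrative only; misclassified and unclassified flows are both counted as false negatives, as in the definition of FN.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


# Pix-start photo frame, 3066 test flows in total.
# Keyword only: 2644 correct, 389 labelled Belkin Switch, 33 unclassified.
recall_kw = recall(2644, 389 + 33)
# Keyword plus MAC address: 2656 correct, 389 labelled Belkin Switch, 21 unclassified.
recall_kw_mac = recall(2656, 389 + 21)

print(round(recall_kw, 3), round(recall_kw_mac, 3))  # prints 0.862 0.866
```

The small gain for this device matches the marginal improvement noted in the text: only 12 of the 33 previously unclassified flows are recovered through the MAC address.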

| COMPARISON
We compare IoTHunter with a recent work of Bazawada et al. [18]. They described a method to identify IoT devices in a smart environment using a collection of fingerprints representing the behavioural profile of the IoT devices. An unknown device is identified as $D_i$ if a fingerprint corresponding to $D_i$ matches the already captured behavioural profile $B_i$ of the device $D_i$. These behavioural profiles are generated using the communication patterns of devices. In particular, five packets from each device session are collected, and 20 features are extracted from each packet to generate a feature vector of one hundred features. This includes features collected from the packet header and also from the payload. Packet header features are binary features, such as whether the packet is HTTP, HTTPS, or ICMP/ARP, etc. In all, there are 17 such features extracted from every packet header. Payload-based features include the size of the payload, the entropy of the payload, and the TCP window size (set to zero for a UDP packet) of the packet in question. Each feature vector represents a fingerprint, and a series of such vectors is collectively called the behavioural profile of an IoT device. Machine learning algorithms are then trained and tested using the generated feature vectors. The authors experimented with several machine learning algorithms, including K-nearest neighbour, Decision trees, Gradient boosting, and majority voting.
In Ref. [18], the authors claimed that the gradient boosting regression algorithm performed better than others and reported the results for each experiment using the same. We replicated the authors' work on our dataset for classification performance comparison. As mentioned earlier, 20 features are extracted from each packet and five such packets are taken from each session. From our experiments, we noticed that some of the sessions of IoT devices have less than five packets. Table 8 shows the distribution of sessions having one to five or more number of packets for every device. We can notice that