Electric theft detection in advanced metering infrastructure using Jaya optimized combined Kernel-Tree boosting classifier—A novel sequentially executed supervised machine learning approach
Abstract
This paper presents a novel, sequentially executed supervised machine learning-based electric theft detection framework using a Jaya-optimized combined Kernel and Tree Boosting (KTBoost) classifier. During the data pre-processing phase, the XGBoost algorithm is used to estimate the missing values in the acquired dataset. An oversampling algorithm based on the Robust-SMOTE technique is then applied to address the imbalanced data class distribution. Afterward, a few highly significant statistical, temporal, and spectral features are extracted from the acquired kWh dataset so that the complex underlying data patterns can be captured, which enhances the accuracy and detection rate of the classifier. To classify the consumers as “Honest” or “Fraudster,” the ensemble machine learning-based classifier KTBoost, with hyperparameters optimized by the Jaya algorithm, is utilized. Finally, the developed model is re-trained using a reduced set of highly important features to minimize the computational resources without compromising its performance. The results show that the proposed theft detection method achieves the highest accuracy (93.38%), precision (95%), and recall (93.18%) among all the studied methods, which indicates its significance in this area of research.
1 INTRODUCTION
The integration of communication and information technologies with electrical infrastructure has become more prevalent in recent years. Smart grids, the next generation of energy distribution networks, are emerging due to the increasing penetration of modern technology [1, 2]. One of the crucial components of smart grids is the Advanced Metering Infrastructure (AMI), which allows the two-way transfer of data such as the time and quantity of energy used by a customer. With this new bi-directional information flow, AMI enables power companies to model customer energy consumption behaviour accurately [3], including predicting energy usage [4], demand response [5], and real-time pricing [6]. However, despite numerous advantages, threats such as cyber-attacks, smart meter hacking, and malicious data manipulation restrict the wider expansion of AMI [7-9] and jeopardize the grid's security. The most significant consequence of these threats is Non-Technical Losses (NTLs), which comprise power theft, errors in the metering/registering process, and invoicing mistakes [10]. Among all the mentioned NTL causes, electric power theft accounts for the major portion. Power theft not only causes economic loss but also degrades power quality, increases the load on generating stations, and leads to irrational tariffs being imposed on legitimate consumers. Power utilities all over the globe incur significant revenue loss as a result of power theft. In the United States alone, this loss ranges from 0.5 percent to 3.5 percent of their annual income [11]. The case is even worse in underdeveloped nations, where the revenue loss from this type of NTL becomes a significant portion of their gross domestic product [12, 13].
To decrease the NTLs, power utilities check all suspected consumers daily or weekly and then enforce punitive measures for any proven fraud practices. However, this process is time-consuming, expensive, and error-prone. Currently, the majority of the power utilities, especially in under-developed countries, still employ traditional NTL detection systems that are inefficient, laborious, costly, and time-consuming. Nevertheless, in recent years, a significant increase in the deployment of AMI in distribution networks has been witnessed, which provides additional capabilities such as monitoring, storing, and retrieving a broad variety of data at any time. In addition, data-oriented algorithms have been developed as effective automated tools for screening aberrant energy consumption patterns and identifying possible electrical fraud activities. These data-oriented theft detection methods can be broadly grouped into four categories: statistical [14-17], game-theory-based [18, 19], expert-system-based [20, 21], and ML-based [22-25].
1.1 Major and minor contributions of the proposed theft detection system
- The proposed framework begins by substituting the missing entries in the obtained smart meter dataset using a machine learning (ML)-based predictive modelling technique. This technique estimates the missing data records with the XGBoost algorithm in such a manner that the attribute with missing values acts as the target and the rest of the feature set serves as the input for model training. The important aspects of this algorithm include handling various kinds of missing data, adapting to interactions and non-linearity within the dataset, and scaling to large datasets.
- After handling the missing values problem, the data class imbalance issue is addressed by using the robust synthetic minority oversampling approach (robust-SMOTE). The robust-SMOTE technique generates minority samples (i.e., fraud cases) from all minority sample regions present in the dataset, namely those located within the majority class area (healthy cases), those on the borderline of the majority class, and those far away from the majority class samples. Subsequently, to accurately depict the underlying properties of the consumption data, the proposed method extracts features from the statistical, temporal, and spectral domains.
- After collecting the most relevant characteristics, the model training-testing procedure classifies customers into two groups (“Genuine/Healthy” and “Theft/Fraudster”) using the KTBoost algorithm. The KTBoost algorithm combines kernel boosting and tree boosting methods for classification purposes. In each boosting iteration, it adds either a regression tree or a penalized reproducing kernel Hilbert space (RKHS) regression function, that is, a kernel ridge regression function, to the ensemble of base classifiers. Later, to obtain the best possible results, the model's hyperparameters are tuned using a meta-heuristic optimization technique called the Jaya algorithm. The Jaya algorithm is a stochastic population-based optimization technique that iteratively modifies a population of individual solutions, based on the notion that each individual solution strives to move towards the best solution while avoiding the least fit (worst) one.
- Finally, the proposed model is re-trained with a smaller set of highly significant features while maintaining the same degree of accuracy, thus conserving computing resources.
Section 2 discusses the most relevant literature on the challenges encountered during the development of a supervised machine learning (SML) framework. Section 3 discusses data exploration, the missing values imputation approach, the data class balancing method, feature engineering, and the theoretical background of the KTBoost and Jaya algorithms. Section 4 provides the outcomes of the proposed research work. Finally, Section 5 concludes the study.
2 LITERATURE REVIEW
The current research explores an application of a supervised ML-based theft detection framework; therefore, the most relevant literature is reviewed here to better explain the proposed methodology and its significance in the studied field of research. The main challenges encountered while developing such a framework are the following:
- Handling of missing and outlying values in the accumulated raw dataset
- Imbalanced distribution of the target/data classes
- Extraction and selection of relevant features
- The right choice of classification algorithm and its hyperparameters to maximize the prediction accuracy
- Understanding/interpreting the model's predictions.
A number of attempts have been made in the literature to solve these issues; a few prominent research works are discussed below in the order of the problems listed above.
The data from smart meters is often irregular, with several null and outlying readings, mainly due to unstable synchronous transmission between sensors and databases, unexpected device maintenance, storage issues, unreliable or poor-quality networks, incorrect estimates of the sent data, and various unknown environmental factors [26]. Such irregularities in the dataset may jeopardize the learning ability of the SML classifier, resulting in biased and erroneous estimations [27]. To address this issue, two approaches have typically been adopted in the literature: imputation and elimination. In the imputation method, an estimated value is substituted for the missing attribute, while in elimination, the missing entries are removed from the dataset. The imputation process is often used for dealing with missing features since it is based on the concept that if an essential feature is missing for a specific instance, it may be approximated from the already available data [28]. In general, the imputation process is carried out by either statistical or machine learning methods. The statistical techniques include the mean, mode, median, linear interpolation [29], and the autoregressive integrated moving average [30]. These data imputation methods are computationally fast and simple to execute. However, they generally lead to erroneous and skewed results due to the possible presence of outliers (individuals or observations with unusual characteristics) in the data. Furthermore, most classifiers cannot comprehend the complex relationships between the input data variables and the patterns in which missing values occur, which consequently leads to misleading outcomes.
Nevertheless, a few machine learning methods, such as the k-nearest neighbour missing values imputer [31], fuzzy clustering [32], the support vector regressor (SVR) [33], random forest imputation (RFI) [34], and the Bayesian missing values imputer [35], employ efficient predictive modelling techniques for estimating missing data values accurately. However, in the presence of huge amounts of data, such as the high-resolution data from smart meters, the mentioned techniques require enormous computing resources. Another way to deal with missing data is to discard it entirely from the rest of the data. Although “discarding” techniques such as list-wise and pair-wise deletion are easy to implement, a significant loss of information might occur, leading to skewed estimates at the end of the classification process.
Another challenge in NTL detection is the unbalanced data class distribution, that is, the frequency of fraudulent cases is disproportionately low compared to genuine consumer cases. The performance of machine learning classifiers is severely affected by the imbalanced distribution of data classes. Moreover, the over-representation of the majority class (healthy consumers) prevents a classifier from focusing on the minority class (fraudster customers), thus producing irrational results. Various methods based on the concepts of minority oversampling and majority under-sampling have been proposed in the literature to counteract this issue. Two prominent research works that have thoroughly addressed this imbalanced data class distribution problem are Nazmul et al. [36] and Sravan et al. [37]. Both works used the Synthetic Minority Oversampling Technique (SMOTE) to balance the data class distribution in the acquired NTL detection dataset. The SMOTE method randomly generates minority class samples by setting the same sampling rate for all samples of the minority class. The problem associated with this approach is that it causes overfitting and a low generalizing ability of the classifier. In another research work, Madalina et al. [38], an under-sampling method is employed in which a number of data samples from the majority class are eliminated to balance the data class distribution. Such data balancing methods are simple to execute; however, they can cause significant data loss, resulting in a reduction in the accuracy of the developed model. In another article [39], the data class distribution was balanced via the ADAptive SYNthetic (ADASYN) oversampling technique. While the developed approach obtained better generalizing ability, it achieved lower accuracy owing to the underfitting of the developed model.
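The interpolation idea behind SMOTE-style oversampling can be illustrated with a minimal sketch. The code below is not the implementation used in the cited works; the function name `smote`, the neighbourhood size `k`, and the toy minority points are assumptions made purely for illustration. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the point itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature minority (fraud) samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies between two existing minority samples, plain interpolation of this kind never leaves the region already covered by the minority class, which is consistent with the overfitting concern raised above.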
As mentioned earlier in this section, the third major problem in fraud detection techniques is the selection of the most relevant features for the model training process. Because raw smart meter data contains only consumption values and lacks any statistical or supplementary features, it is difficult for the learning classifier to differentiate and understand the complex underlying patterns present in the data. To mitigate this issue, Punmiya et al. [40] and Salman et al. [24] extracted additional features from the raw data by employing simple statistical measures such as the mean, median, standard deviation, minimum, and maximum. However, although these techniques are simple to implement and computationally fast, they produce misleading results in the presence of outliers in the data.
After feature engineering, choosing a suitable classifier for efficiently separating genuine and fraudulent customers is the next challenge in any supervised ML technique. Nagi et al. [39] used a predictive modelling technique based on support vector machines (SVM) to identify abnormal behaviour of the consumers. The SVM-based ML model was developed using customer load profile data and other characteristics such as creditworthiness rating, meter reading data, and fraudulent activity reports. However, the detection hit rate achieved was merely 60%, which is very low, particularly when the number of consumers is in the millions. In one of the most recent studies, a deep Siamese network (DSN) coupled with a convolutional neural network (CNN) and long short-term memory (LSTM) was proposed by Javaid et al. [39] to differentiate the characteristics of genuine and dishonest consumers. The authors achieved a reasonable accuracy; however, the precision and recall rates were comparatively lower. In another study, Paria et al. [41] developed a theft detection framework to identify regions of significant energy theft at the transformer level using data gathered from different distribution transformer meters. The developed methodology achieved a high detection rate (94%); however, since the fraudster consumption patterns introduced in this research work were produced synthetically, they do not precisely depict the profiles of actual fraudster customers, and the attained outcomes may therefore diverge from a realistic scenario.
In one of the recent studies, Oprea et al. [42] utilized feature-engineered light gradient boosting to effectively find irregular consumption patterns in the acquired conventional meter dataset. However, the data class balancing technique employed in the quoted study was the SMOTE algorithm, which is prone to overfitting and often results in a high generalization error. In addition, it may increase noise since it ignores class distributions and selects samples blindly. Sarkar et al. [25] presented a fraud detection framework utilizing ensemble machine learning methods with considerably high accuracy, precision, and recall. However, they did not interpret the outcomes of the developed model, which is crucial for strengthening the ML model further. Interpreting the model's outcomes is beneficial in two ways: first, it helps to concentrate on and fine-tune the characteristics that contributed most to generating positive outcomes. Second, by re-training the model with a smaller set of very important features (based on the feature importance scores assigned by the model), the computational time may be substantially lowered without compromising the accuracy. Table 1 presents a summary of the different techniques utilized in developing SML-based electric theft detection methods.
S. No. | References | Method used | Missing values | Data class imbalance | Feature extraction | Feature selection | Performance metrics utilized |
---|---|---|---|---|---|---|---|
1 | Nizar et al. [43] | Naïve Bayes and Decision tree | – | – | Load profiles | – | Accuracy |
2 | Nagi et al. [44] | Genetic algorithm-SVM | Average values | – | Statistical features | – | Accuracy, detection rate |
3 | Nizar et al. [45] | Extreme learning machine -SVM | – | – | – | – | Accuracy |
4 | Nagi et al. [46] | SVM | Average values | – | Statistical features | – | Accuracy, detection rate |
5 | Ramos et al. [47] | Optimum path forest (OPF) | – | – | Statistical features | – | Accuracy |
6 | Caio et al. [48] | Harmony search algorithm and OPF | – | – | Principal component analysis | Harmony search algorithm | Accuracy |
7 | Carlos et al. [49] | Integrated expert system, rule-based system | Removal | – | Text mining | – | Accuracy |
8 | Faria et al. [50] | Spatial-temporal estimation | – | – | Statistical features | – | Loss probability |
9 | Juan et al. [51] | SVM-DT | – | – | Statistical features | Filter wrapper | Accuracy, recall, precision, and F1 score |
10 | Paria et al. [52] | Consumption pattern-based energy theft detection | – | Different sampling proportions | Statistical features | – | Bayesian detection rate, accuracy, recall, detection rate, and precision |
11 | Selvam et al. [53] | Decision Tree, Random Forest | – | – | – | – | Accuracy, ROC |
12 | Zheng et al. [54] | Wide and deep convolutional neural networks | Average values | – | CNN | – | Accuracy, recall, detection rate, and precision |
13 | Punmiya et al. [40] | Feature engineered extreme gradient boosting machine | – | SMOTE | Statistical features | – | Accuracy, recall, detection rate, and precision |
14 | Salman et al. [13] | Ensemble machine learning | – | – | – | – | Accuracy, recall, detection rate, and precision |
15 | Blazakis et al. [55] | Adaptive Neuro-Fuzzy Inference System | – | – | Statistical features | Neighbourhood component analysis | Accuracy, F1 score, precision, recall, specificity, AUC |
16 | Sravan et al. [25] | Ensemble machine learning | Deletion | SMOTE | – | – | Accuracy, ROC, recall, precision |
17 | Salman et al. [24] | Boosted C5.0 decision tree | – | – | Statistical features | Pearson's Chi-Square | Accuracy, recall, detection rate, and precision |
18 | Zhengwei et al. [56] | Random Forest | – | Kmeans-SMOTE | – | – | Accuracy, TPR, FPR, TNR, G-mean |
19 | Guoying et al. [57] | Autoencoder and Random Forest | – | Undersampling and re-sampling | Stacked autoencoder | – | Probabilistic prediction |
20 | Munwar et al. [58] | Recurrent neural network | Rule-based | – | – | – | Accuracy, recall, detection rate, and precision |
21 | Cheng et al. [59] | Deep learning, random forest | Rule-based | – | CNN | – | Precision, recall, true positive rate, false-positive rate |
22 | This work | Jaya optimized-KTBoost | XGBoost algorithm | Robust-SMOTE | Statistical, temporal, and spectral domain-based features | KTBoost algorithm | Accuracy, detection rate, precision, F1 score, kappa and MCC |
3 PROPOSED METHODOLOGY
A stage-wise representation of the proposed theft detection framework is depicted in Figure 2.
Each of the stages shown in Figure 2 is discussed in detail in the subsequent subsections.
3.1 Exploratory data analysis
In this subsection, the pre-processing of the acquired dataset is explained in detail. The dataset used for this study is real smart meter data obtained from the State Grid Corporation of China (SGCC). The distribution of the acquired dataset is summarized in Table 2. As in most real-world datasets, the number of fraudster consumers in the SGCC kWh data is much lower than that of healthy consumers. Figures 3 and 4 illustrate the consumption patterns of a few randomly selected fraudulent and healthy consumers, respectively.
Parameter description | Parameter value |
---|---|
Number of total consumers | 42,372 |
Number of healthy/genuine consumers | 38,757 or 91.46% of total data |
Number of fraudster/theft consumers | 3,615 or 8.54% of total data |
Number of days of consumption record | 1035 days (January 2014 to December 2016) |
It can be observed from the provided figures that the consumption patterns of the theft customers are highly unpredictable and show low repeatability, while the genuine consumers’ patterns are recurrent and exhibit a relationship among identical periods of subsequent years.
3.2 Missing values and their imputation using XGBoost algorithm
Smart meter data often contains numerous missing entries, mainly due to equipment malfunction, lag in registering or collecting data remotely, accidental deletion, cyber-attacks, or tampering with the smart meter devices. To illustrate the occurrence of missing values in the consumption patterns, the electric power consumption of a few consumers randomly sampled from the acquired data is shown in Figure 5.
From Figure 5, it can be observed that there are several blank spots in between the consumption values. If such an incomplete dataset is fed directly into the ML framework, the ML algorithms within the framework would be unable to comprehend the complicated relationships between the input data variables and the patterns in which missing values occur, thus leading to misleading conclusions. The missing values in the entire dataset are computed and plotted in Figure 6. Figure 6 illustrates the missing values present in each consumer's consumption data, where the x-axis is the time window of the acquired consumption data and the y-axis is the number of consumers present in the data. The darker regions in the figure indicate a higher density of missing entries, and the lighter or dotted areas indicate fewer missing entries. For example, in the time window from 2014 to 2015, the consumption data carries a large number of missing entries, whereas in 2016 the missing entries are comparatively fewer. In addition, the kernel density estimate and histogram of the missing values present in the data are computed and illustrated in Figure 7.
It can be observed from Figure 8 that the estimated missing values (in black colour) coincide with the actual consumption data. Thus, the missing values imputed through this process enhance the performance of the ML classifier and avoid unintentional model bias towards the missing values.
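The predictive-modelling view of imputation, in which the attribute with missing entries is treated as the target and the remaining attributes as inputs, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the helper `impute_column` and the class `NearestNeighbourRegressor` are hypothetical, the toy readings are invented, and a trivial nearest-neighbour regressor stands in for XGBoost so that the example runs without third-party packages. It also assumes that only the target column contains missing entries.

```python
import math


class NearestNeighbourRegressor:
    """Tiny stand-in regressor (hypothetical helper) exposing fit/predict.

    In the setting described in the text, an XGBoost regressor such as
    xgboost.XGBRegressor, which offers the same fit/predict interface,
    would take its place.
    """

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            nearest = min(range(len(self.X)), key=lambda i: math.dist(row, self.X[i]))
            predictions.append(self.y[nearest])
        return predictions


def impute_column(rows, target_col, model):
    """Fill None entries of one column using a model trained on the other columns."""

    def inputs(row):
        return [value for j, value in enumerate(row) if j != target_col]

    observed = [row for row in rows if row[target_col] is not None]
    missing_idx = [i for i, row in enumerate(rows) if row[target_col] is None]
    filled = [list(row) for row in rows]  # copy so the input is left untouched
    if not observed or not missing_idx:
        return filled
    # Rows with an observed target form the training set
    model.fit([inputs(row) for row in observed], [row[target_col] for row in observed])
    # Rows with a missing target are predicted and written back
    predictions = model.predict([inputs(rows[i]) for i in missing_idx])
    for i, value in zip(missing_idx, predictions):
        filled[i][target_col] = value
    return filled


# Hypothetical readings: three daily kWh values per consumer, None marks a gap
readings = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [5.2, 6.1, None],
]
completed = impute_column(readings, target_col=2, model=NearestNeighbourRegressor())
```

Repeating the same call for every column that contains gaps yields a fully populated dataset; the choice of regressor only changes how the relationship between the observed columns and the target is learned.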
3.3 Robust-SMOTE for data class imbalance issue
The performance of an SML-based classifier degrades considerably when the data classes in the acquired dataset are present in unequal proportions [60]. Since the acquired smart meter data is highly unbalanced, class balancing must be performed through an intelligent technique before training and testing the classifier. Figure 9 shows the class distribution of the collected dataset; the red data points represent the theft samples (minority class) and the green points represent the healthy samples (majority class).
It can be observed in Figure 9 that the minority class samples are far scarcer than the majority class samples. ML classifiers trained on such datasets are likely to be biased towards the data class that is present in a greater proportion. Generally, legitimate customers outnumber fraudsters in most smart meter datasets [42]. Therefore, it is essential to balance the distribution of the data classes prior to feeding the data to the ML classifier.
To mitigate this issue, the robust-SMOTE algorithm is used in this study. The robust-SMOTE method addresses all frequently occurring categories of minority data samples, that is, minority points in the majority class region, minority points close to majority class samples, and safe minority points [61]. It first measures the relative density of each minority data point, computed from the local density between its k-nearest heterogeneous neighbours and its k-nearest homogeneous neighbours. Afterward, it divides the minority samples into borderline and safe samples based on the outcome of 2-means clustering of their relative densities. The number of samples generated from each minority data point is re-weighted depending on the number of majority class samples among its k-nearest neighbours, resulting in more synthetic samples close to the safe data points, whereas fewer samples are generated near the disordered samples, which improves the separability of the classification boundary between the classes. The data class distribution of the acquired dataset after implementing robust-SMOTE is illustrated in Figure 10.
It can be observed from Figure 10 that the distribution of the minority class (red data points) and the majority class (green data points) is reasonably balanced. Furthermore, most of the minority class samples are generated from those safe minority samples that are far away from the healthy samples; thus, this method helps the ML classifier to define the classification border more clearly.
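To make the notion of relative density more concrete, the following sketch scores each minority point by comparing its average distance to the nearest majority (heterogeneous) neighbours with its average distance to the nearest minority (homogeneous) neighbours. It is a simplified illustration and not the robust-SMOTE procedure itself: the ratio used in `relative_density`, the fixed threshold of 1.0 (used here in place of 2-means clustering), and the toy points are all assumptions.

```python
import math


def mean_knn_distance(point, others, k):
    """Average distance from point to its k nearest points in others."""
    distances = sorted(math.dist(point, o) for o in others if o is not point)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def relative_density(point, minority, majority, k=2):
    """Assumed score: distance to other-class neighbours over distance to same-class neighbours."""
    return mean_knn_distance(point, majority, k) / mean_knn_distance(point, minority, k)


# Hypothetical two-feature samples: three clustered fraud cases and one
# fraud case sitting inside the region occupied by healthy consumers
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.9, 3.0)]
majority = [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1), (3.1, 3.3), (1.5, 1.5)]

scores = {p: relative_density(p, minority, majority) for p in minority}
# A fixed threshold replaces the 2-means clustering step in this sketch
labels = {p: ("safe" if s > 1.0 else "borderline/noise") for p, s in scores.items()}
```

A score well above 1 means that the point is surrounded by its own class, so it is a good seed for synthetic samples; a score below 1 flags a point whose neighbourhood is dominated by the majority class and which should therefore receive a smaller sampling weight.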
3.4 Feature engineering
The successful development of an ML model is often contingent on the appropriate selection of the input features used during model training [62]. The feature engineering approach is dedicated to that purpose; it assists in summarizing the dynamics of the data and enhances its overall representation by extracting the most important features, while simultaneously improving the performance and detection accuracy of the model [63]. The acquired smart meter dataset consists of only consumption data in kWh and lacks any other descriptive attributes. Therefore, in this study, several statistical, temporal, and spectral domain-based features are extracted from each consumer's consumption data, as presented in Table 3. Since Table 3 lists 39 extracted features, the theoretical and mathematical background of each feature cannot be presented within the scope and length of this article. Nevertheless, interested readers can find all the relevant information in reference [64].
S. No. | Feature | S. No. | Feature | S. No. | Feature |
---|---|---|---|---|---|
1 | Mean | 14 | Zero crossing rate | 27 | Variance |
2 | Median | 15 | Peak to peak distance | 28 | Relative dispersion |
3 | Mode | 16 | Minimum peaks | 29 | Autocorrelation |
4 | Maximum | 17 | Entropy | 30 | Histogram with different bandwidths |
5 | Minimum | 18 | Maximum peaks | 31 | Mel frequency cepstrum coefficients (MFCC) |
6 | Interquartile range | 19 | Histogram | 32 | Spectral variation |
7 | Kurtosis | 20 | Fast Fourier transform | 33 | Centroid |
8 | Skewness | 21 | Spectral centroid | 34 | Positive turning points |
9 | Standard deviation | 22 | Spectral kurtosis | 35 | Negative turning point |
10 | Median absolute deviation | 23 | Median frequency | 36 | Slope |
11 | Mean absolute deviation | 24 | Wavelet entropy | 37 | Mean absolute difference |
12 | Mean absolute differences | 25 | Wavelet energy | 38 | Maximum frequency |
13 | Median absolute differences | 26 | Empirical cumulative distribution | 39 | Median frequency |
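As an illustration of how such features can be derived from a single consumer's kWh series, the sketch below computes a handful of the statistical and temporal features listed in Table 3. The function `extract_features`, the toy series, and the exact definitions (for example, counting zero crossings on the mean-centred series, since consumption values are non-negative) are assumptions; the definitions in reference [64] may differ.

```python
import statistics


def extract_features(kwh):
    """Compute a few statistical and temporal features of one consumption series."""
    mean = statistics.fmean(kwh)
    centred = [value - mean for value in kwh]
    # Differences between consecutive readings
    diffs = [b - a for a, b in zip(kwh, kwh[1:])]
    # Sign changes of the mean-centred series (assumed definition)
    zero_crossings = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
    return {
        "mean": mean,
        "median": statistics.median(kwh),
        "std": statistics.pstdev(kwh),
        "max": max(kwh),
        "min": min(kwh),
        "mean_abs_diff": statistics.fmean([abs(d) for d in diffs]),
        "zero_crossing_rate": zero_crossings / (len(kwh) - 1),
    }


# Hypothetical daily consumption of one consumer in kWh
daily_kwh = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0]
features = extract_features(daily_kwh)
```

Applying the same function to every consumer turns each 1035-day series into a fixed-length feature vector that can be appended to the raw consumption data before model training.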
3.5 Proposed classifier: Jaya optimized KTBoost algorithm
Boosting algorithms are widely used in practical data science and machine learning-based research works due to their outstanding prediction accuracy on highly complex datasets [65]. Boosting algorithms additively chain weak (base) classifiers, consecutively reducing both bias and variance at each boosting iteration. Despite the widespread usage of boosting algorithms, only one type of function is used as a base learner in most cases. In contrast, the KTBoost algorithm adds either a regression tree or a penalized reproducing kernel Hilbert space (RKHS) regression function (a kernel ridge regression function) to the ensemble of base classifiers in each boosting iteration [66]. First, both a regression tree and an RKHS function are learned as candidate base learners by employing a gradient or Newton step as the optimization technique; afterward, the base learner whose inclusion in the ensemble results in the lower empirical risk is chosen. In this way, at each iteration, a base learner is selected from two fundamentally different learners to achieve high predictive accuracy. In addition, this combination facilitates the learning of functions that have different degrees of regularity, such as discontinuities and smooth portions, since discontinuous portions are learned mostly through regression trees and smooth (continuous) portions through RKHS regression functions. The most important hyperparameters of the KTBoost algorithm are given in Table 4.
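The selection rule described above can be summarized compactly. The notation below is assumed for illustration and does not appear in the original text: $F_m$ denotes the ensemble after $m$ iterations, $L$ the loss function, $R$ the empirical risk over the $n$ training samples, $\nu$ the learning rate, and $f_m^{\mathrm{tree}}$ and $f_m^{\mathrm{kernel}}$ the two candidate base learners fitted in iteration $m$.

```latex
\begin{aligned}
R(F) &= \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, F(x_i)\bigr), \\
f_m &= \operatorname*{arg\,min}_{f \,\in\, \{ f_m^{\mathrm{tree}},\; f_m^{\mathrm{kernel}} \}} R\bigl(F_{m-1} + f\bigr), \\
F_m &= F_{m-1} + \nu\, f_m .
\end{aligned}
```

Under this reading, the learning rate only shrinks the contribution of whichever candidate wins the comparison; the comparison itself is made on the empirical risk of the ensemble extended by each candidate.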
Parameter name | Description |
---|---|
learning_rate | Parameter helps in setting weighting factors for the addition of new trees at each iteration to the classifier. |
n_estimators | The number of boosting iterations to be performed. |
subsample | The number of samples to be used for fitting the individual base learners. Optimal selection of this parameter can assist in setting bias and variance values. |
criterion | This is an evaluation metric to compute the quality of split, by default, it is selected as the mean square error (mse) but can be chosen as mean absolute error or Friedman mse. |
min_samples_split | The minimum number of samples required to split an internal node. This parameter controls model overfitting/underfitting related problems. |
min_samples_leaf | The minimum number of samples to be present at the leaf. Controlling this parameter helps in overfitting/underfitting related issues. |
min_weight_leaf | |
max_depth | Parameter helps in building the structure of regression tree. |
max_features | Number of features to be selected when searching for split. |
max_leaf_nodes | Optimal selection of this value facilitates reducing the impurity of the regression trees. |
base_learner | This parameter sets the base learners, in this either trees or kernel or a combination of both can be chosen. |
update_step | This parameter sets how the boosting updates are computed at each iteration. If only trees are chosen as the base learner and the update step is set to hybrid, a gradient step estimates the structure of the trees and a Newton step determines the leaf values. Similarly, if the kernel is chosen as the base learner and the update step is set to hybrid, gradient descent is used as the update step. |
tol | This value enables early stopping if there is no change in the loss. |
kernel | In the case of kernel boosting, the Laplace, radial basis function, and generalized Wendland kernels can be chosen as kernel functions. |
range_adjust | Regularization parameter for RKHS regression function. |
Nystroem | The Nystroem sampling method is used if set to true. In the case of large data set, this parameter helps in reducing computation resources. |
n_components | The number of samples used in Nystroem sampling. |
Unlike previous research works, where these parameters are either selected by an inefficient and time-consuming “trial and error” method or adopted from earlier literature, the current study utilizes a swarm intelligence-based optimization technique called the Jaya algorithm to select the optimal hyperparameters of the KTBoost algorithm. The Jaya algorithm is a gradient-free metaheuristic optimization method for solving constrained and unconstrained optimization problems. It is a stochastic population-based technique that iteratively modifies a population of individual solutions, based on the notion that each individual solution strives to move towards the best solution while avoiding the least fit (worst) solution. One of the important features that distinguishes this algorithm from other swarm intelligence-based optimization methods is that it does not require any algorithm-specific control parameters for its operation. To limit the computational complexity and to achieve the best possible results within a limited number of iterations, only eight of the most important hyperparameters (base_learner, kernel, learning_rate, loss, max_depth, max_leaf_nodes, n_neighbors, update_step) are taken as decision variables in the current research work. The Jaya algorithm proceeds through the following steps:
- Step 1: Initialize the input parameters of Jaya ($N$, $T_{max}$) and of the problem to be optimized ($D$). Here, $N$ is the population size, $T_{max}$ is the maximum number of iterations, and $D$ is the number of design variables of the function to be optimized.
- Step 2: Randomly initialize the population within the predetermined lower and upper boundaries, $lb_j$ and $ub_j$, as given in Equation (2),
$$x_{i,j} = lb_j + r\,(ub_j - lb_j), \quad i = 1, \dots, N, \quad j = 1, \dots, D \tag{2}$$
where $r$ is a random number drawn uniformly from $[0, 1]$.
- Step 3: For each solution vector, estimate the value of the cost function and identify the best and worst solutions, $x_{best}$ and $x_{worst}$.
- Step 4: Update the solutions as follows,
$$x'_{i,j} = x_{i,j} + r_{1}\,(x_{best,j} - |x_{i,j}|) - r_{2}\,(x_{worst,j} - |x_{i,j}|) \tag{3}$$
where $r_{1}$ and $r_{2}$ are random numbers drawn uniformly from $[0, 1]$.
- Step 5: Restrict the updated solutions so that they do not exceed the boundary conditions,
$$x'_{i,j} = \begin{cases} lb_j, & x'_{i,j} < lb_j \\ ub_j, & x'_{i,j} > ub_j \\ x'_{i,j}, & \text{otherwise} \end{cases} \tag{4}$$
- Step 6: To decide whether the updated solution or the existing solution advances to the next iteration, compute the value of the cost function for each updated solution and apply greedy selection. If the updated solution is better than the current solution, it replaces the current solution in the population. Otherwise, the updated solution is discarded and the current solution is retained.
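The steps above can be illustrated with a minimal sketch of the standard Jaya update for a continuous minimization problem. It is not the authors' implementation: the function name `jaya_minimize`, its arguments, and the sphere test function are assumptions made for illustration only. In the setting of this study, the cost function would instead score a KTBoost model for a given hyperparameter set, with categorical hyperparameters encoded numerically.

```python
import random


def jaya_minimize(cost, lower, upper, pop_size=20, max_iter=200, seed=0):
    """Minimal Jaya sketch (hypothetical helper, minimization form)."""
    rng = random.Random(seed)
    dim = len(lower)

    # Step 2: random initial population inside the bounds.
    pop = [[lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(dim)]
           for _ in range(pop_size)]
    costs = [cost(x) for x in pop]

    for _ in range(max_iter):
        # Step 3: best and worst solutions of the current population.
        best = pop[min(range(pop_size), key=costs.__getitem__)]
        worst = pop[max(range(pop_size), key=costs.__getitem__)]

        for i in range(pop_size):
            candidate = []
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Step 4: move toward the best and away from the worst solution.
                value = (pop[i][j]
                         + r1 * (best[j] - abs(pop[i][j]))
                         - r2 * (worst[j] - abs(pop[i][j])))
                # Step 5: clip the updated value to the boundaries.
                candidate.append(min(max(value, lower[j]), upper[j]))

            # Step 6: greedy selection keeps the better of the two solutions.
            candidate_cost = cost(candidate)
            if candidate_cost < costs[i]:
                pop[i], costs[i] = candidate, candidate_cost

    k = min(range(pop_size), key=costs.__getitem__)
    return pop[k], costs[k]


def sphere(x):
    # Toy cost function used only to exercise the sketch.
    return sum(v * v for v in x)


best_x, best_cost = jaya_minimize(sphere, lower=[-5.0, -5.0], upper=[5.0, 5.0])
print(best_x, best_cost)
```

Note that the only settings are the population size and the iteration budget, which reflects the absence of algorithm-specific control parameters mentioned above. To maximize accuracy rather than minimize a cost, one option is to pass the negative accuracy as the cost.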
4 RESULT AND DISCUSSION
At this stage, the dataset developed during the feature engineering process is retrieved for model training and validation purposes. The fetched dataset comprises 1035 days of real consumption data and 39 additional features (mentioned in Table 3). Moreover, the class distribution of the raw input dataset was balanced with the robust-SMOTE method before feeding it to the algorithm for model training. A train-test split is used in which 80% of the data is used for model training and 20% for testing. The proposed theft detection framework uses the KTBoost algorithm for model training, while Jaya algorithm-based metaheuristic optimization is used for its hyperparameter tuning. In this scenario, the objective of the optimization is to maximize the model's accuracy by minimizing the difference between predicted and actual outcomes. Over more than 35 trials/iterations of the Jaya algorithm, the model attained an accuracy of 0.937, as presented in the optimization history plot in Figure 11. The x-axis represents the trial count, while the y-axis shows the accuracy value. The blue dots in the graph show the accuracy value attained with different combinations of hyperparameters.
Furthermore, Figures 12 and 13 show the slice and contour plots of the model's hyperparameter optimization process, illustrating the effect of each hyperparameter's variation on the objective value (accuracy). For example, Figures 12 and 13 show that a learning rate within the range of 1.5 to 2.5 achieves high objective values, but increasing it beyond that range produces a considerable reduction in the objective value. Similarly, max_depth values greater than 1500 yield better accuracy, whereas increasing max_depth still further yields a significant reduction in accuracy, which can be attributed to the model overfitting the training data.
The optimal hyperparameter set, which attained the best accuracy value across the optimization trials, is given in Table 5. As presented in the table, the combined base learner (kernel boosting and tree boosting) and the hybrid update step achieve the best accuracy value.
Hyperparameter | Value |
---|---|
base_learner | Combined (Kernel boosting and tree boosting) |
kernel | GW |
learning_rate | 0.2 |
loss | deviance |
max_leaf_nodes | 34 |
max_depth | 1863 |
n_neighbors | 50 |
update_step | hybrid |
4.1 K-fold cross-validation results of the Jaya optimized-KTBoost model
To effectively implement the proposed Jaya optimized-KTBoost algorithm, the designed model is initially trained on the data developed after the data class balancing and feature engineering stage. Afterward, the tenfold cross-validation (CV) technique employing the mentioned performance metrics (Equations (5)–(12)) is utilized for the performance evaluation of the designed model. This evaluation has produced the following results; as presented in Table 6, the proposed model has achieved a mean accuracy and precision of 0.9338 and 0.9508 with a standard deviation (SD) of 0.0029 and 0.0035, respectively.
No. of folds | Accuracy | Recall | Precision | F1-score | Kappa-value | MCC |
---|---|---|---|---|---|---|
1 | 0.9311 | 0.9216 | 0.9479 | 0.9345 | 0.8891 | 0.8922 |
2 | 0.9354 | 0.9278 | 0.95 | 0.9388 | 0.8705 | 0.9108 |
3 | 0.9354 | 0.9239 | 0.9536 | 0.9385 | 0.8706 | 0.9111 |
4 | 0.9326 | 0.9196 | 0.9524 | 0.9357 | 0.8921 | 0.9123 |
5 | 0.937 | 0.9263 | 0.9542 | 0.94 | 0.8736 | 0.9201 |
6 | 0.939 | 0.9292 | 0.9552 | 0.942 | 0.8777 | 0.9021 |
7 | 0.9285 | 0.9191 | 0.9454 | 0.9321 | 0.8921 | 0.9154 |
8 | 0.9331 | 0.9258 | 0.9476 | 0.9366 | 0.881 | 0.9125 |
9 | 0.9313 | 0.923 | 0.947 | 0.9348 | 0.887 | 0.8926 |
10 | 0.9344 | 0.9206 | 0.9548 | 0.9374 | 0.8891 | 0.9092 |
Mean | 0.9338 | 0.9318 | 0.9508 | 0.9371 | 0.8873 | 0.9077 |
Standard deviation | 0.0029 | 0.0033 | 0.00365 | 0.00292 | 0.0087 | 0.00931 |
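As a quick check of how the summary rows relate to the per-fold values, the following sketch recomputes the mean and standard deviation of the accuracy and precision columns of Table 6. The use of the population standard deviation (`pstdev`, dividing by the number of folds) is an assumption; it appears consistent with the reported accuracy SD of 0.0029, whereas the sample standard deviation would be slightly larger.

```python
from statistics import mean, pstdev

# Per-fold values copied from the Accuracy and Precision columns of Table 6.
accuracy = [0.9311, 0.9354, 0.9354, 0.9326, 0.937,
            0.939, 0.9285, 0.9331, 0.9313, 0.9344]
precision = [0.9479, 0.95, 0.9536, 0.9524, 0.9542,
             0.9552, 0.9454, 0.9476, 0.947, 0.9548]

for name, scores in (("accuracy", accuracy), ("precision", precision)):
    # pstdev divides by the number of folds (population standard deviation).
    print(f"{name}: mean={mean(scores):.4f}, sd={pstdev(scores):.4f}")
```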
4.2 Confusion matrix evaluation of the proposed model
The confusion matrix (CM) is a prominent tool for evaluating classification problems. It may be used for both binary and multiclass classification. The CM tabulates counts of the actual against the predicted values, as illustrated in Figure 14. In this study, TP (true positive) represents the number of theft consumers rightly classified by the classifier, whereas FN (false negative) represents the fraudster consumers misclassified as healthy consumers. Similarly, TN (true negative) represents the number of rightly classified healthy consumers, while FP (false positive) represents the healthy consumers misclassified as fraudster consumers.
The confusion matrix of the proposed model is shown in Figure 15, where “0” represents the actual negative class (healthy consumers) and “1” represents the positive class (fraudster consumers). The values in the CM are normalized to percentages for readability. From the figure, it can be observed that the classifier rightly classified 93.16% of the theft consumers, while 6.84% of actual theft consumers were misclassified as healthy. Similarly, 95.25% of healthy consumers were rightly classified, whereas 4.75% of actual healthy consumers were misclassified as theft.
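To make the four confusion matrix cells and their row-normalized form concrete, here is a minimal sketch with “1” as the fraudster (positive) class and “0” as the healthy (negative) class. The helper names and the toy labels are hypothetical and are not taken from the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN, FP for a binary problem (hypothetical helper)."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # fraudster correctly flagged
            else:
                fn += 1   # fraudster missed (classified as healthy)
        else:
            if predicted == positive:
                fp += 1   # healthy consumer wrongly flagged
            else:
                tn += 1   # healthy consumer correctly cleared
    return tp, fn, tn, fp


def row_normalized(tp, fn, tn, fp):
    """Normalize each actual-class row so that it sums to 1."""
    return {
        "TPR": tp / (tp + fn), "FNR": fn / (tp + fn),
        "TNR": tn / (tn + fp), "FPR": fp / (tn + fp),
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
rates = row_normalized(tp, fn, tn, fp)
print(tp, fn, tn, fp, rates)
```

Multiplying the rates by 100 gives the percentage form used in a normalized confusion matrix, where each row (actual class) sums to 100%.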
4.3 AUC-ROC curve of the proposed model
The area under the ROC curve (AUC) can be computed from the ranks of the samples as

$$\mathrm{AUC} = \frac{\sum_{j \in \text{positive class}} \mathrm{Rank}_j - \frac{P(P+1)}{2}}{P \times N}$$

where $P$ represents the number of positive samples, $N$ the number of negative samples, and $\mathrm{Rank}_j$ the rank value of sample $j$ belonging to the positive class. The AUC value is the likelihood that a randomly selected positive data sample would rank higher than a randomly selected negative data sample. The AUC value varies between 0.5 and 1, where 0.5 indicates that the classifier performs random guessing and 1 indicates that the classifier perfectly separates the healthy and theft consumers.
The ROC curve of the proposed classifier is shown in Figure 16; the x-axis represents the FPR, and the y-axis the TPR. The average AUC value of the proposed classifier is 0.98, which indicates that most of the theft and healthy consumers are rightly classified.
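The rank-based view of the AUC (the sum of the ranks of the positive samples, corrected by the number of positive samples P and divided by P × N, with N the number of negative samples) can be sketched as follows. The function name `rank_auc`, the averaging of tied ranks, and the toy scores are assumptions made for illustration.

```python
def rank_auc(labels, scores):
    """AUC from ascending ranks of the scores; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: one of the four positive-negative pairs is ordered wrongly.
auc = rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

In the toy example, three of the four (positive, negative) pairs are ranked correctly, which matches the interpretation of the AUC as the probability that a random positive sample outranks a random negative one.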
4.4 The learning curve of the proposed theft detection model
A learning curve graphically depicts the relationship between the training score and the cross-validated (CV) test score of a classifier for different numbers of training data instances [68]. The basic notion of this curve is to check the classifier's ability to generalize across different data samples. The learning curve of the proposed classifier is shown in Figure 17. The curves in the graph illustrate the mean scores, while the shaded areas depict the standard deviations above and below the mean over all cross-validations. If the model is flawed because of bias, the training score curve will most likely be more variable than expected. Likewise, if the model is prone to error owing to variance, the cross-validated score will be more unpredictable.
In Figure 17, it can be seen that when the data samples are minimal, the model training score is very high in comparison to the CV-score, which is a result of the high bias of the model. In contrast, as the number of training data samples grows, the training score decreases, while the CV-score increases, albeit with considerable fluctuation due to the model's high variance. Additionally, it is interesting to note from the learning curve that the model's CV-score and accuracy are above 0.9338, implying that the model can accurately distinguish fraudster consumers from healthy consumers.
4.5 Proposed model's outcomes interpretation and their impact on training time
In this section, the proposed model's prediction or outcomes are interpreted. The model's prediction interpretation is the process by which the input data features utilized for model training are evaluated based on their positive influence on predicting the correct result. In this study, the KTBoost algorithm is employed to rank all the given input features in terms of their contribution in predicting the right outcome.
Because the input training data contains over 1200 features, it is not feasible to display the importance score of every feature in a graph; thus, only the ten most important features are displayed in Figure 18 together with their importance scores. The figure shows that a feature from the actual consumption data had the highest importance value, followed by statistical features derived from the actual consumption. To demonstrate the significance of the importance scores assigned by the KTBoost model, the model was re-trained on a much smaller yet essential feature set. Figures 19 and 20 show the computing time required to process the entire collection of data features (1071 features) and the 23 most important data features, respectively. As can be seen in these figures, a substantial decrease in computing time is achieved when the smaller feature set is used.
In addition, Figure 21 depicts the effect of the number of important features on the model's accuracy. The model achieved an accuracy of 80 percent when just the five most important features were supplied. By increasing the number of important features from 5 to 23, the model achieved the same accuracy as when trained with all 1071 features. Thus, it can be concluded from this experiment that, if the model is re-trained with the most important feature set, the required computational resources can be drastically reduced without loss of accuracy.
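The selection of a reduced feature set from importance scores can be sketched as below. The helper `top_k_features`, the feature names, and the scores are all hypothetical placeholders; in practice the scores would come from the trained classifier, and the model would then be re-trained on the selected columns only.

```python
def top_k_features(importances, k):
    """Return the names of the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores, for illustration only.
importances = {
    "day_0412": 0.31,
    "mean": 0.22,
    "std": 0.18,
    "kurtosis": 0.05,
    "spectral_entropy": 0.12,
    "day_0007": 0.01,
}

selected = top_k_features(importances, k=3)
print(selected)
```

Repeating the re-training for increasing values of k, and recording accuracy and training time each time, gives the kind of trade-off curve discussed above.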
4.6 Proposed model's comparison against the latest and traditional methods
This section presents a side-by-side comparison of the proposed theft detection framework with a series of well-known traditional machine learning models and the latest bagging and boosting models under an identical feature set. To assess the performance of all studied classifiers, the ten-fold cross-validation method is used in conjunction with six commonly used performance measures, namely accuracy, recall, precision, F1-score, Kappa value, and MCC value.
The proposed framework is sequentially implemented in the Google Colaboratory environment (Python 3 Google Compute Engine backend, 12-GB RAM, without GPU). The results of the comparison are summarized in Table 7. As summarized in the table, the proposed approach surpasses all other ML techniques in terms of accuracy, recall, precision, F1-score, Kappa value, and MCC value, thus evidencing its efficacy and importance. In particular, the proposed model obtained an accuracy, recall, and precision of 93.38%, 93.18%, and 95%, respectively, which is considerably better than all competing models.
Model | Accuracy | Recall | Precision | F1-score | Kappa-value | MCC |
---|---|---|---|---|---|---|
Proposed model | 0.9338 | 0.9318 | 0.9508 | 0.9371 | 0.8873 | 0.9077 |
XGBoost classifier | 0.9112 | 0.9123 | 0.9012 | 0.912 | 0.867 | 0.875 |
Extra tree classifier | 0.901 | 0.8921 | 0.912 | 0.934 | 0.854 | 0.812 |
SNAP boost algorithm | 0.90 | 0.8912 | 0.9216 | 0.9123 | 0.8412 | 0.845 |
lightGBM | 0.891 | 0.8751 | 0.8631 | 0.8641 | 0.8124 | 0.854 |
Wide-Deep CNN | 0.89 | 0.812 | 0.881 | 0.7921 | 0.812 | 0.8213 |
Gaussian process based boosting | 0.885 | 0.8754 | 0.8698 | 0.8412 | 0.8421 | 0.7892 |
Boosted C5.0 algorithm | 0.881 | 0.8541 | 0.824 | 0.8121 | 0.824 | 0.8245 |
NGBoost algorithm | 0.87 | 0.861 | 0.834 | 0.8251 | 0.834 | 0.8964 |
Random-forest classifier | 0.834 | 0.8123 | 0.8241 | 0.8125 | 0.8453 | 0.831 |
SVM - linear Kernel | 0.823 | 0.7601 | 0.8292 | 0.7928 | 0.6042 | 0.6066 |
AdaBoost classifier | 0.814 | 0.7562 | 0.7213 | 0.745 | 0.751 | 0.761 |
Ridge classifier | 0.795 | 0.7931 | 0.8584 | 0.8244 | 0.6622 | 0.6641 |
Quadratic discriminant analysis | 0.721 | 0.2251 | 0.8911 | 0.3594 | 0.1976 | 0.2974 |
Logistic regression | 0.712 | 0.8063 | 0.8482 | 0.8267 | 0.6619 | 0.6627 |
Linear discriminant analysis | 0.698 | 0.7929 | 0.8583 | 0.8243 | 0.662 | 0.6639 |
K-neighbours classifier | 0.587 | 0.6412 | 0.7606 | 0.8284 | 0.6233 | 0.6356 |
Naive Bayes | 0.54 | 0.3478 | 0.6261 | 0.4472 | 0.1401 | 0.1563 |
5 CONCLUSION
This study presented a novel sequentially executed data-driven approach for identifying electric fraud in a smart meter dataset. Raw smart meter data often contains several null and irregular values, mostly due to equipment malfunction, poor network conditions, or device storage-related issues. Since most machine learning classifiers cannot process null values, this study estimated the missing values using an ensemble machine learning-based predictive modelling technique called XGBoost. Afterward, the robust-SMOTE algorithm was used to balance the class distribution in the acquired data. By considering all regions of minority samples in the dataset, the robust-SMOTE technique produces minority class samples that are less prone to overfitting and noisy sample generation. Once a balanced dataset was obtained, a set of statistical, temporal, and spectral features was extracted from it. These additional features aid the ML classifier in understanding the complicated underlying patterns contained in the data. Finally, to effectively classify the data into “Honest” and “Fraudster” consumers, the Jaya-optimized KTBoost classifier was used. The Jaya-KTBoost technique combines kernel boosting and tree boosting, with its hyperparameters tuned by the Jaya algorithm. The proposed model attained an accuracy of 93.38%, precision of 95%, and recall of 93.18%, which are significantly higher than those of all compared methods.
FUNDING INFORMATION
This work was supported by the Fundamental Research Grant Scheme under Grant R.J130000.7851.5F062 through the Ministry of Higher Education, Malaysia.
CONFLICT OF INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study is publicly available at: https://github.com/henryRDlab/ElectricityTheftDetection.