Electric theft detection in advanced metering infrastructure using Jaya optimized combined Kernel-Tree boosting classifier: A novel sequentially executed supervised machine learning approach
Abstract
This paper presents a novel, sequentially executed supervised machine learning-based electric theft detection framework using a Jaya-optimized combined Kernel and Tree Boosting (KTBoost) classifier. It uses the XGBoost algorithm to estimate the missing values in the acquired dataset during the data preprocessing phase. An oversampling algorithm based on the Robust-SMOTE technique is used to address the imbalanced data class distribution. Afterward, a few highly significant statistical, temporal, and spectral features are extracted from the acquired kWh dataset to capture the complex underlying data patterns and thereby enhance the accuracy and detection rate of the classifier. To classify consumers as “Honest” or “Fraudster,” the ensemble machine learning-based classifier KTBoost, with hyperparameters optimized by the Jaya algorithm, is used. Finally, the developed model is retrained using a reduced set of highly important features to minimize the computational resources without compromising the performance of the developed model. The outcome of this study reveals that the proposed theft detection method achieves the highest accuracy (93.38%), precision (95%), and recall (93.18%) among all the studied methods, underscoring its relevance to the studied area of research.
1 INTRODUCTION
The integration of communication and information technologies with electrical infrastructure has become more prevalent in recent years. Smart grids, the next generation of energy distribution networks, are emerging due to the increasing penetration of modern technology [1, 2]. One of the crucial components of smart grids is the Advanced Metering Infrastructure (AMI), which allows the two-way transfer of data such as the time and quantity of energy used by a customer. With this bidirectional information flow, AMI enables power companies to accurately model customer energy consumption behaviour [3], including predicting energy usage [4], demand response [5], and real-time pricing [6]. However, despite numerous advantages, threats such as cyberattacks, smart meter hacking, and malicious data manipulation restrict the wide expansion of AMI [7-9] and jeopardize the grid's security. The most significant consequence for AMI is Non-Technical Losses (NTL), which comprise power theft, errors in the metering/registering process, and invoicing mistakes [10]. Among these NTL causes, electric power theft accounts for the major portion. Power theft is not only associated with economic loss; it also degrades power quality, increases the load on generating stations, and leads to irrational tariffs being imposed on legitimate consumers. Power utilities all over the globe incur significant revenue loss as a result of power theft. In the United States alone, this loss ranges from 0.5% to 3.5% of annual income [11]. The situation is even worse in underdeveloped nations, where the revenue loss from this type of NTL amounts to a significant portion of gross domestic product [12, 13].
To decrease NTLs, power utilities inspect all suspected consumers daily or weekly and then enforce punitive measures for any proven fraud. However, this process is time-consuming, expensive, and error-prone. Currently, the majority of power utilities, especially in underdeveloped countries, still employ traditional NTL detection systems that are inefficient, laborious, and costly. Nevertheless, recent years have witnessed a significant increase in the deployment of AMI in distribution networks, which provides additional capabilities such as monitoring, storing, and retrieving a broad variety of data at any time. In addition, data-oriented algorithms have been developed as effective automated tools for screening aberrant energy consumption patterns and identifying possible electrical fraud. These data-oriented theft detection methods can be broadly divided into four categories: statistical-based [14-17], game-theory-based [18, 19], expert-system-based [20, 21], and ML-based [22-25].
1.1 Major and minor contributions of the proposed theft detection system

The proposed framework begins by substituting the missing entries in the obtained smart meter dataset using a machine learning (ML)-based predictive modelling technique. This technique estimates the missing data records with the XGBoost algorithm, treating the missing attribute as the target and the rest of the feature set as the input for model training. The important aspects of this algorithm include handling various kinds of missing data, adapting to interactions and nonlinearity within the dataset, and scaling to large datasets.

After the missing values are handled, the data class imbalance issue is addressed using the robust synthetic minority oversampling approach (robust-SMOTE). The robust-SMOTE technique generates minority samples (i.e., fraud cases) from all minority sample regions present in the dataset: those within the majority class area (healthy cases), those on the borderline of the majority class, and those far away from the majority class samples. Subsequently, to accurately depict the underlying properties of the consumption data, the proposed method extracts features from the statistical, temporal, and spectral domains.

After the most relevant characteristics are collected, the model training-testing procedure begins by classifying customers into two groups (“Genuine/Healthy” and “Theft/Fraudster”) using the KTBoost algorithm. The KTBoost algorithm combines kernel boosting and tree boosting methods for classification. In each boosting iteration, it adds either a regression tree or a penalized reproducing kernel Hilbert space (RKHS) regression function, also known as kernel ridge regression, to the ensemble of base classifiers. Later, to obtain the best possible results, the model's hyperparameters are tuned using a metaheuristic-based optimization technique called the Jaya algorithm. The Jaya algorithm is a stochastic population-based optimization technique that iteratively modifies a population of individual solutions on the principle that each solution strives to move towards the best solution while avoiding the least fit (worst) one.

Finally, the proposed model is retrained with a smaller set of highly significant features while maintaining the same degree of accuracy, thus conserving computing resources.
In Section 2, the most relevant literature on the challenges encountered during the development of a supervised machine learning (SML) framework is discussed. Section 3 discusses data exploration, the missing values imputation approach, the data class balancing method, feature engineering, and the theoretical background of the KTBoost and Jaya algorithms. Section 4 provides the outcomes of the proposed research work. Finally, Section 5 of this study contains the conclusion.
2 LITERATURE REVIEW
The current research explores an application of a supervised ML-based theft detection framework; therefore, the most relevant information and literature are highlighted to better explain the proposed methodology and its significance in the studied field of research. The main challenges encountered in developing such a framework are:

(1) Handling of missing and outlying values in the accumulated raw dataset

(2) Imbalanced distribution of the target/data classes

(3) Extraction and selection of relevant features

(4) The right choice of classification algorithm and its hyperparameters to maximize the prediction accuracy

(5) Understanding/interpreting the model's predictions.
A number of attempts have been made in the literature to solve these issues; a few prominent research works are cited below, following the sequence of the above-mentioned problems.
Smart meter data is often irregular, with several null and outlying readings, mainly due to unstable synchronous transmission between sensors and databases, unexpected device maintenance, storage issues, unreliable or poor-quality networks, incorrect estimates of the sent data, and various unknown environmental factors [26]. Such irregularities in the dataset may jeopardize the learning ability of the SML classifier, resulting in biased and erroneous estimations [27]. To address this issue, two approaches have typically been adopted in the literature: imputation and elimination. In imputation, an estimated value is substituted for the missing attribute, while in elimination, the missing entries are removed from the dataset. Imputation is often used for dealing with missing features since it is based on the concept that if an essential feature is missing for a specific instance, it may be approximated from the already available data [28]. In general, imputation is carried out by either statistical or machine learning methods. Statistical estimation techniques include the mean, mode, median, linear interpolation [29], and autoregressive integrated moving average [30]. These imputation methods are computationally fast and simple to execute. However, they generally lead to erroneous and skewed results due to the possible presence of outliers (individuals or observations with unusual characteristics) in the data. Furthermore, most classifiers cannot comprehend the complex relationships between the input data variables and the patterns in which missing values occur, which consequently leads to misleading outcomes.
Nevertheless, a few machine learning methods, such as the k-nearest neighbour missing values imputer [31], fuzzy clustering [32], support vector regressor (SVR) [33], random forest imputation (RFI) [34], and Bayesian missing values imputer [35], employ efficient predictive modelling techniques to estimate missing data values accurately. However, in the presence of huge amounts of data, such as the high-resolution data from smart meters, these techniques require enormous computing resources. Another way to deal with missing data is to discard it entirely from the rest of the data. Although “discarding” techniques such as listwise and pairwise deletion are easy to implement, a significant loss of information might occur, leading to skewed estimates at the end of the classification process.
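To make the statistical baselines discussed above concrete, the following minimal sketch contrasts mean imputation with linear interpolation on a single consumer's daily readings. The function names and the sample readings are hypothetical and chosen only for illustration; missing entries are assumed to be encoded as NaN.

```python
import math

def mean_impute(series):
    """Replace every NaN with the mean of the observed readings."""
    observed = [v for v in series if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in series]

def interpolate_gaps(series):
    """Fill NaN gaps linearly between the nearest observed readings.

    Gaps at either end of the series are filled with the closest observed value.
    """
    known = [i for i, v in enumerate(series) if not math.isnan(v)]
    if not known:
        raise ValueError("series has no observed readings")
    filled = list(series)
    for i, v in enumerate(series):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

nan = float("nan")
daily_kwh = [nan, 4.0, nan, nan, 10.0, 9.0, nan]  # hypothetical readings
by_mean = mean_impute(daily_kwh)
by_interpolation = interpolate_gaps(daily_kwh)
```

Mean imputation assigns the same value to every gap, so a single outlying reading shifts all imputed entries, whereas interpolation only follows the readings adjacent to each gap.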
Another challenge in NTL detection is the unbalanced data class distribution, that is, the frequency of fraudulent cases is disproportionately low compared to genuine consumer cases. The performance of machine learning classifiers is severely affected by an imbalanced distribution of data classes. Moreover, the overrepresentation of the majority class (healthy consumers) prevents a classifier from focusing on the minority class (fraudster customers), thus producing irrational results. Various methods based on the concepts of minority oversampling and majority undersampling have been proposed in the literature to counteract this issue. Two prominent research works that have thoroughly addressed this imbalanced data class distribution problem are Nazmul et al. [36] and Sravan et al. [37]. Both works used the Synthetic Minority Oversampling Technique (SMOTE) to balance the data class distribution in the acquired NTL detection dataset. The SMOTE method randomly generates minority class samples by setting the same sampling rate for all samples of the minority class. The problem associated with this approach is that it causes overfitting and a low generalizing ability of the classifier. In another research work, Madalina et al. [38] employed an undersampling method in which a number of data samples from the majority class are eliminated to balance the data class distribution. Such data balancing methods are simple to execute; however, they can cause significant data loss, reducing the accuracy of the developed model. In another article [39], the data class distribution was balanced using the ADAptive SYNthetic (ADASYN) oversampling technique. While the developed approach obtained better generalizing ability, it achieved lower accuracy owing to underfitting of the developed model.
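The core interpolation step of basic SMOTE described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited works: the function name, the two-dimensional sample points, and the parameters `per_sample` and `k` are assumptions made for the example. Every minority sample receives the same number of synthetic neighbours, which corresponds to the uniform sampling rate mentioned in the text.

```python
import math
import random

def smote(minority, per_sample=1, k=2, seed=0):
    """Basic SMOTE: for every minority sample, create `per_sample` synthetic
    points on the line segment joining it to one of its k nearest minority
    neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for base in minority:
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority (fraud) samples described by two features each.
fraud = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote(fraud, per_sample=2, k=2)
```

Because every synthetic point lies between two existing minority samples, the new samples never leave the region already occupied by the minority class, which is one reason this scheme is associated with overfitting.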
As mentioned earlier in this section, the third major problem in fraud detection techniques is the selection of the most relevant features for the model training process. Because raw smart meter data contains only consumption data and lacks any statistical or supplementary features, it is difficult for the learning classifier to understand the complex underlying patterns present in the data. To mitigate this issue, Punmiya et al. [40] and Salman et al. [24] extracted additional features from the raw data using simple statistical techniques such as the mean, median, standard deviation, minimum, and maximum. However, even though these techniques are simple to implement and computationally fast, they produce misleading results in the presence of outliers in the data.
After feature engineering, choosing a suitable classifier for efficiently separating genuine and fraudulent customers is the next challenge in any supervised ML technique. Nagi et al. [39] used a predictive modelling technique based on support vector machines (SVM) to identify abnormal consumer behaviour. The SVM-based ML model was developed using customer load profile data and other characteristics such as creditworthiness rating, meter reading data, and fraudulent activity reports. However, the detection hit rate achieved was merely 60%, which is very low, particularly when consumers number in the millions. In one of the most recent studies, a deep Siamese network (DSN) coupled with a convolutional neural network (CNN) and long short-term memory (LSTM) was proposed by Javaid et al. [39] to differentiate the characteristics of genuine and dishonest consumers. The authors achieved a reasonable accuracy; however, the precision and recall rates were comparatively lower. In another study, Paria et al. [41] developed a theft detection framework to identify regions of significant energy theft at the transformer level using data gathered from different distribution transformer meters. The developed methodology achieved a high detection rate (94%); however, since the fraudster consumption patterns used in that work were produced synthetically, they do not precisely depict actual fraudster profiles, and the attained outcomes may therefore diverge from a realistic scenario.
In one of the recent studies, Oprea et al. [42] utilized a feature-engineered light gradient boosting method to find irregular consumption patterns in the acquired conventional meter dataset. However, the data class balancing technique employed in that study was the SMOTE algorithm, which is prone to overfitting and often results in a high generalization error. In addition, it may increase noise since it ignores class distributions and selects samples somewhat blindly. Sarkar et al. [25] presented a fraud detection framework utilizing ensemble machine learning methods with considerably high accuracy, precision, and recall. However, they did not interpret the outcomes of the developed model, which is crucial for strengthening the ML model further. Interpreting the model's outcomes is beneficial in two ways: first, it helps to concentrate on and fine-tune the characteristics that contributed most to generating positive outcomes. Second, by retraining the model with a smaller set of very important features (according to the feature importance scores assigned by the model), the computational time may be substantially lowered without compromising the accuracy. Table 1 presents a summary of the different techniques utilized in developing SML-based electric theft detection methods.
S. No.  References  Method used  Missing values  Data class imbalance  Feature extraction  Feature selection  Performance metrics utilized

1  Nizar et al. [43]  Naïve Bayes and Decision tree  –  –  Load profiles  –  Accuracy
2  Nagi et al. [44]  Genetic algorithm-SVM  Average values  –  Statistical features  –  Accuracy, detection rate
3  Nizar et al. [45]  Extreme learning machine SVM  –  –  –  –  Accuracy
4  Nagi et al. [46]  SVM  Average values  –  Statistical features  –  Accuracy, detection rate
5  Ramos et al. [47]  Optimum path forest (OPF)  –  –  Statistical features  –  Accuracy
6  Caio et al. [48]  Harmony search algorithm and OPF  –  –  Principal component analysis  Harmony search algorithm  Accuracy
7  Carlos et al. [49]  Integrated expert system, rule-based system  Removal  –  Text mining  –  Accuracy
8  Faria et al. [50]  Spatial-temporal estimation  –  –  Statistical features  –  Loss probability
9  Juan et al. [51]  SVM-DT  –  –  Statistical features  Filter wrapper  Accuracy, recall, precision, and F1-score
10  Paria et al. [52]  Consumption pattern-based energy theft detection  –  Different sampling proportions  Statistical features  –  Bayesian detection rate, accuracy, recall, detection rate, and precision
11  Selvam et al. [53]  Decision Tree, Random Forest  –  –  –  –  Accuracy, ROC
12  Zheng et al. [54]  Wide and deep convolutional neural networks  Average values  –  CNN  –  Accuracy, recall, detection rate, and precision
13  Punmiya et al. [40]  Feature-engineered extreme gradient boosting machine  –  SMOTE  Statistical features  –  Accuracy, recall, detection rate, and precision
14  Salman et al. [13]  Ensemble machine learning  –  –  –  –  Accuracy, recall, detection rate, and precision
15  Blazakis et al. [55]  Adaptive Neuro-Fuzzy Inference System  –  –  Statistical features  Neighbourhood component analysis  Accuracy, F1-score, precision, recall, specificity, AUC
16  Sravan et al. [25]  Ensemble machine learning  Deletion  SMOTE  –  –  Accuracy, ROC, recall, precision
17  Salman et al. [24]  Boosted C5.0 decision tree  –  –  Statistical features  Pearson's Chi-Square  Accuracy, recall, detection rate, and precision
18  Zhengwei et al. [56]  Random Forest  –  K-means SMOTE  –  –  Accuracy, TPR, FPR, TNR, G-mean
19  Guoying et al. [57]  Autoencoder and Random Forest  –  Undersampling and resampling  Stacked autoencoder  –  Probabilistic prediction
20  Munwar et al. [58]  Recurrent neural network  Rule-based  –  –  –  Accuracy, recall, detection rate, and precision
21  Cheng et al. [59]  Deep learning, random forest  Rule-based  –  CNN  –  Precision, recall, true positive rate, false positive rate
22  This work  Jaya-optimized KTBoost  XGBoost algorithm  Robust-SMOTE  Statistical, temporal, and spectral domain-based features  KTBoost algorithm  Accuracy, detection rate, precision, F1-score, kappa and MCC
3 PROPOSED METHODOLOGY
A stage-wise representation of the proposed theft detection framework is depicted in Figure 2.
Each of the stages shown in Figure 2 is discussed in detail in the subsequent subsections.
3.1 Exploratory data analysis
In this subsection, the preprocessing of the acquired dataset is explained in detail. The dataset used for this study is real smart meter data obtained from the State Grid Corporation of China (SGCC). The distribution of the acquired dataset is summarized in Table 2. As in most real-time datasets, the number of fraudster consumers in the SGCC kWh data is much lower than that of healthy consumers. Figures 3 and 4 illustrate the consumption patterns of a few randomly selected fraudulent and healthy consumers, respectively.
Parameter description  Parameter value 

Number of total consumers  42,372 
Number of healthy/genuine consumers  38,757 or 91.46% of total data 
Number of fraudster/theft consumers  3615 or 8.54% of total data 
Number of days of consumption record  1035 days (January 2014 to December 2016) 
It can be observed from the provided figures that the consumption patterns of the theft customers are highly unpredictable and exhibit low repeatability, while the genuine consumers’ patterns are recurrent and exhibit a relationship among identical periods of subsequent years.
3.2 Missing values and their imputation using XGBoost algorithm
Smart meter data often contains numerous missing entries, mainly due to equipment malfunction, lag in the remote registering/collection of data, accidental deletion, cyberattacks, tampering with the smart meter devices, etc. To illustrate the occurrence of missing values in consumption patterns, the electric power consumption of a few consumers, randomly sampled from the acquired consumption data, is shown in Figure 5.
From Figure 5, it can be observed that there are several blank spots in between the consumption values. If such an incomplete dataset is fed directly into the ML framework, the ML algorithms within the framework would be unable to comprehend the complicated relationships between the input data variables and the patterns in which missing values occur, thus leading to misleading conclusions. The missing values in the entire dataset are computed and plotted in Figure 6. Figure 6 illustrates the missing values present in each consumer's consumption data, where the x-axis is the time window of the acquired consumption data and the y-axis is the number of consumers present in the data. The darker regions in the figure indicate a higher density of missing entries, and the lighter or dotted areas indicate fewer missing entries. For example, in the time window from 2014 to 2015, the consumption data contains many missing entries, whereas in 2016 there are comparatively fewer. In addition, the kernel density estimate and histogram of the missing values present in the data are computed and illustrated in Figure 7.
It can be observed from Figure 8 that the estimated missing values (in black colour) coincide with the actual consumption data. Thus, the missing values imputed through this process enhance the ML classifier performance and avoid unintentional model bias towards the missing values.
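The predictive-modelling scheme used for imputation, in which the incomplete attribute acts as the target and the remaining attributes act as inputs, can be sketched as follows. To keep the example self-contained, a simple k-nearest-neighbour regressor is used here as a stand-in for the XGBoost regressor employed in this study; the function names, the sample readings, and the assumption that the remaining columns are complete are all illustrative.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Stand-in regressor (NOT XGBoost): mean target of the k closest rows."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def impute_column(rows, target_col, k=2):
    """Fill missing entries (None) of one column from the other columns.

    Rows where the target is observed form the training set; rows where it
    is missing are the prediction set. Other columns are assumed complete.
    """
    def features(row):
        return [v for c, v in enumerate(row) if c != target_col]

    observed = [r for r in rows if r[target_col] is not None]
    train_x = [features(r) for r in observed]
    train_y = [r[target_col] for r in observed]

    completed = []
    for r in rows:
        new_row = list(r)
        if new_row[target_col] is None:
            new_row[target_col] = knn_predict(train_x, train_y, features(r), k)
        completed.append(new_row)
    return completed

# Hypothetical data: each row is a consumer, each column a day's kWh reading.
readings = [
    [5.0, 5.2, 5.1],
    [5.1, 5.0, 5.3],
    [9.8, 10.1, 10.0],
    [10.2, 9.9, 10.4],
    [5.0, 5.1, None],
    [10.0, 10.0, None],
]
completed = impute_column(readings, target_col=2, k=2)
```

In this sketch the low-consumption row is completed from the two low-consumption neighbours and the high-consumption row from the two high-consumption neighbours; replacing `knn_predict` with a gradient-boosted regressor would follow the same train-on-observed, predict-on-missing pattern.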
3.3 RobustSMOTE for data class imbalance issue
The performance of an SML-based classifier deviates considerably if the proportions of the data classes present in the acquired dataset differ [60]. Since the acquired smart meter data is highly unbalanced, class balancing must be performed through an intelligent technique before training and testing the classifier. Figure 9 shows the class distribution of the collected dataset; the red data points represent the theft samples (minority class) and the green points the healthy samples (majority class).
It can be observed in Figure 9 that the minority class samples are scarcer than the majority class samples. ML classifiers trained on such datasets are likely to be biased towards the data class that is present in the greater proportion. Generally, legitimate customers outnumber fraudsters in most smart meter datasets [42]. Therefore, it is essential to balance the distribution of the data classes prior to training the ML classifier.
To mitigate this issue, the robust-SMOTE algorithm is used in this study. The robust-SMOTE method addresses all frequently occurring categories of minority data samples, that is, minority points in the majority class region, minority points close to majority class samples, and safe minority points [61]. It first measures the relative density of each minority data point, computed from the local density between its k-nearest heterogeneous neighbours and its k-nearest homogeneous neighbours. Afterward, it divides the minority samples into borderline and safe samples based on the outcome of 2-means clustering of the relative densities. The number of synthetic samples generated from each minority data point is reweighted depending on the number of majority class samples among its k-nearest neighbours, resulting in more samples close to the safe data points. In comparison, fewer samples are generated near the disordered samples to improve the separability of the classification boundary between the classes. The data class distribution of the acquired dataset after applying robust-SMOTE is illustrated in Figure 10.
It can be observed from Figure 10 that the distribution of the minority (red data points) and majority (green data points) classes is reasonably balanced. Furthermore, most of the minority class samples are generated from safe minority samples that are far away from the healthy samples; thus, this method helps the ML classifier define the classification boundary more clearly.
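The three categories of minority samples mentioned above can be illustrated by counting how many majority points fall among the k nearest neighbours of each minority point. The sketch below is a simplified illustration only: it does not reproduce the relative-density and 2-means clustering steps of robust-SMOTE, and the thresholds in `categorise`, the function names, and the sample points are assumptions made for the example.

```python
import math

def majority_neighbour_counts(minority, majority, k=3):
    """For each minority point, count the majority points among its k nearest
    neighbours (searching over both classes, excluding the point itself)."""
    labelled = [(p, "majority") for p in majority] + [(p, "minority") for p in minority]
    counts = []
    for point in minority:
        others = [(q, label) for q, label in labelled if q is not point]
        others.sort(key=lambda item: math.dist(point, item[0]))
        counts.append(sum(1 for _, label in others[:k] if label == "majority"))
    return counts

def categorise(count, k=3):
    """Illustrative thresholds (an assumption, not the robust-SMOTE rule)."""
    if count == 0:
        return "safe"
    if count < k:
        return "borderline"
    return "inside majority region"

# Hypothetical two-feature samples.
healthy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
theft = [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [2.5, 2.5], [0.4, 0.6]]

counts = majority_neighbour_counts(theft, healthy, k=3)
categories = [categorise(c, k=3) for c in counts]
```

A weighting scheme in the spirit of the text would then assign more synthetic samples to the points labelled safe and fewer to the remaining ones.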
3.4 Feature engineering
The successful development of an ML model is often contingent on the appropriate selection of the input features used during model training [62]. The feature engineering approach is dedicated to that purpose; it helps summarize the dynamics of the data and enhances its overall representation by extracting the most important features, while simultaneously improving the performance and detection accuracy of the model [63]. The acquired smart meter dataset consists only of consumption data in kWh and lacks any other statistical descriptors. Therefore, in this study, several statistical, temporal, and spectral domain-based features are extracted from each consumer's consumption data, as presented in Table 3. Since 39 extracted features are presented in Table 3, the theoretical and mathematical background of each feature is not included here owing to the scope and length of the article. Interested readers can find all the relevant information in reference [64].
S. No.  Feature  S. No.  Feature  S. No.  Feature

1  Mean  14  Zero crossing rate  27  Variance
2  Median  15  Peak to peak distance  28  Relative dispersion
3  Mode  16  Minimum peaks  29  Autocorrelation 
4  Maximum  17  Entropy  30  Histogram with different bandwidths 
5  Minimum  18  Maximum peaks  31  Mel frequency cepstrum coefficients (MFCC) 
6  Interquartile range  19  Histogram  32  Spectral variation 
7  Kurtosis  20  Fast Fourier transform  33  Centroid 
8  Skewness  21  Spectral centroid  34  Positive turning points 
9  Standard deviation  22  Spectral kurtosis  35  Negative turning points
10  Median absolute deviation  23  Median frequency  36  Slope 
11  Mean absolute deviation  24  Wavelet entropy  37  Mean absolute difference 
12  Mean absolute differences  25  Wavelet energy  38  Maximum frequency 
13  Median absolute differences  26  Empirical cumulative distribution  39  Median frequency 
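As an illustration of how a handful of the features listed in Table 3 can be computed from a single consumption series, a minimal sketch is given below. The exact definitions used in this study are those of reference [64]; the definitions here (for example, counting zero crossings on the mean-centred series, and expressing the spectral centroid in units of DFT bin index) are simplifying assumptions, and the function names and sample readings are hypothetical.

```python
import cmath
import statistics

def spectral_centroid(kwh):
    """Magnitude-weighted mean DFT bin index over bins 0..n//2."""
    n = len(kwh)
    magnitudes = []
    for f in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * f * t / n) for t, v in enumerate(kwh))
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    return sum(f * m for f, m in enumerate(magnitudes)) / total if total else 0.0

def extract_features(kwh):
    """Return a few statistical, temporal, and spectral features of a series."""
    mean = statistics.fmean(kwh)
    diffs = [b - a for a, b in zip(kwh, kwh[1:])]   # first differences
    centred = [v - mean for v in kwh]               # mean-centred series
    return {
        # Statistical domain
        "mean": mean,
        "median": statistics.median(kwh),
        "standard_deviation": statistics.pstdev(kwh),
        "peak_to_peak": max(kwh) - min(kwh),
        # Temporal domain
        "mean_absolute_difference": statistics.fmean([abs(d) for d in diffs]),
        "zero_crossings": sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0),
        "positive_turning_points": sum(1 for a, b in zip(diffs, diffs[1:]) if a < 0 < b),
        "negative_turning_points": sum(1 for a, b in zip(diffs, diffs[1:]) if a > 0 > b),
        # Spectral domain
        "spectral_centroid": spectral_centroid(kwh),
    }

sample_kwh = [2.0, 4.0, 3.0, 5.0, 1.0]  # hypothetical daily readings
features = extract_features(sample_kwh)
```

Applying `extract_features` to every consumer's series would yield one feature row per consumer, which can then be appended to the raw consumption data before model training.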
3.5 Proposed classifier: Jaya optimized KTBoost algorithm
Boosting algorithms are widely used in practical data science and machine learning-based research works due to their outstanding prediction accuracy on highly complex datasets [65]. Boosting algorithms additively chain weak (base) classifiers, consecutively reducing both bias and variance at each boosting iteration. Despite the widespread usage of boosting algorithms, only one type of function is used as a base learner in most cases. In contrast, the KTBoost algorithm adds either a regression tree or a penalized reproducing kernel Hilbert space (RKHS) regression function (kernel ridge regression) to the ensemble of base classifiers in each boosting iteration [66]. First, both a regression tree and an RKHS function are learned using a gradient or Newton step as the optimization technique; afterward, the base learner whose inclusion in the ensemble results in the lower empirical risk is chosen. In this way, at each iteration, a base learner is selected from two fundamentally different learners to achieve high predictive accuracy. In addition, this combination facilitates learning functions that have different degrees of regularity, such as discontinuities and smooth portions, since discontinuous portions are mostly learned by regression trees while smooth (continuous) portions are learned using RKHS regression functions. The most important hyperparameters of the KTBoost algorithm are given in Table 4.
Parameter name  Description

learning_rate  Sets the weighting factor for the addition of new trees to the classifier at each iteration.
n_estimators  The number of boosting iterations to be performed.
subsample  The fraction of samples to be used for fitting the individual base learners. Optimal selection of this parameter can assist in balancing bias and variance.
criterion  The evaluation metric used to compute the quality of a split. By default, it is the mean square error (mse), but the mean absolute error or Friedman mse can also be chosen.
min_samples_split  The minimum number of samples required to split an internal node. This parameter helps control model overfitting/underfitting.
min_samples_leaf  The minimum number of samples to be present at a leaf. Controlling this parameter helps with overfitting/underfitting related issues.
min_weight_leaf  
max_depth  Helps in building the structure of the regression tree.
max_features  Number of features to be considered when searching for a split.
max_leaf_nodes  Optimal selection of this value facilitates reducing the impurity of the regression trees.
base_learner  Sets the base learners; either trees, kernels, or a combination of both can be chosen.
update_step  Determines how the boosting updates are computed at each iteration. If trees are chosen as the base learner and the update step is set to hybrid, a gradient step determines the structure of the trees and a Newton step determines the leaf values. Similarly, if the kernel is chosen as the base learner and the update step is set to hybrid, gradient descent is used as the update step.
tol  Enables early stopping if there is no change in the loss.
kernel  In the case of kernel boosting, the Laplace, radial basis function, or generalized Wendland kernel can be chosen as the kernel function.
range_adjust  Regularization parameter for RKHS regression function.
Nystroem  The Nystroem sampling method is used if set to true. In the case of a large dataset, this parameter helps in reducing computational resources.
n_components  The number of samples used in Nystroem sampling.
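The base-learner selection rule of KTBoost described before Table 4 (fit both a tree and a kernel ridge function to the current residuals, then keep whichever lowers the empirical risk more) can be sketched on a one-dimensional regression toy problem. This is a didactic sketch under strong simplifying assumptions (squared-error loss, depth-one trees, a Gaussian kernel, gradient steps only) and is not the KTBoost package implementation; all function names, default values, and the toy data are hypothetical.

```python
import math

def fit_stump(x, residual):
    """Depth-one regression tree: best single split on a 1-D input."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda q: left_mean if q <= threshold else right_mean

def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_kernel_ridge(x, residual, penalty=1.0, length_scale=1.0):
    """Penalized RKHS regression: solve (K + penalty * I) alpha = residual."""
    def kernel(a, b):
        return math.exp(-((a - b) / length_scale) ** 2)
    gram = [[kernel(a, b) + (penalty if i == j else 0.0) for j, b in enumerate(x)]
            for i, a in enumerate(x)]
    alpha = solve(gram, residual)
    return lambda q: sum(a * kernel(xi, q) for a, xi in zip(alpha, x))

def boost(x, y, n_iter=15, learning_rate=0.5):
    """At each iteration, add whichever candidate gives the lower squared error."""
    start = sum(y) / len(y)
    pred = [start] * len(y)
    learners, chosen = [], []
    for _ in range(n_iter):
        residual = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        candidates = {"tree": fit_stump(x, residual),
                      "kernel": fit_kernel_ridge(x, residual)}

        def risk(name):
            f = candidates[name]
            return sum((yi - (pi + learning_rate * f(xi))) ** 2
                       for xi, yi, pi in zip(x, y, pred))

        name = min(candidates, key=risk)
        learners.append(candidates[name])
        chosen.append(name)
        pred = [pi + learning_rate * candidates[name](xi) for xi, pi in zip(x, pred)]
    return (lambda q: start + learning_rate * sum(f(q) for f in learners)), chosen

# Toy target with a smooth part (sine) and a discontinuity (jump at x = 3).
x = [i / 2 for i in range(12)]
y = [math.sin(xi) + (2.0 if xi > 3.0 else 0.0) for xi in x]
model, chosen = boost(x, y)

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
fitted_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

The list `chosen` records which learner type was added at each iteration, which makes it possible to inspect how the ensemble divides the work between trees and kernel functions on a given dataset.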
Unlike previous research works, where these parameters are either selected using an inefficient and time-consuming “trial and error” method or adopted from the literature, the current study uses a swarm intelligence-based optimization technique called the Jaya algorithm to select the optimal hyperparameters of the KTBoost algorithm. The Jaya algorithm is a gradient-free metaheuristic optimization method for solving constrained and unconstrained optimization problems. It is a stochastic population-based technique that iteratively modifies a population of individual solutions on the principle that each solution strives to move towards the best solution while avoiding the least fit (worst) solution. One important feature that distinguishes this algorithm from other swarm intelligence-based optimization methods is that it does not require any algorithm-specific control parameters for its operation. To limit the computational complexity and to achieve the best possible results within a limited number of iterations, only eight of the most important hyperparameters (base_learner, kernel, learning_rate, loss, max_depth, max_leaf_nodes, n_neighbors, update_step) are taken as decision variables in the current research work.
 Step 1: Initialize the input parameters of Jaya ($$Po{p_{size}},\ It{r_n})$$ and of the problem which is to optimize ($$Va{r_n})$$. In this the $$Po{p_{size}}\ $$is the population size. $$It{r_n}$$ is the number of maximum iterations to set $$Va{r_n}$$ is the design variables of the function which is to be optimized.
Step 2: Randomly initialize the population within the predetermined lower and upper bounds, as given in Equation (2),
$\begin{equation} S_{i,j} = S_{min,j} + \left( S_{max,j} - S_{min,j} \right) \cdot rand\left( 0,\ 1 \right)\end{equation}$(2)
Step 3: For each solution vector, evaluate the cost function and identify the best and worst solutions.
Step 4: Update the solutions as follows,
$\begin{eqnarray} S_{i,j,m}^{updated} &=& S_{i,j,m} + x_{1,j,m}\left( S_{i,best,m} - \left| S_{i,j,m} \right| \right)\nonumber\\ && -\ x_{2,j,m}\left( S_{i,worst,m} - \left| S_{i,j,m} \right| \right)\end{eqnarray}$(3)
Step 5: Restrict the updated solutions so that they do not exceed the boundary conditions,
$\begin{equation} S_{i,j,m}^{updated} = \left\{ \begin{array}{ll} S_{max,j} & \mathrm{if}\ S_{i,j,m}^{updated} > S_{max,j}\\[10pt] S_{min,j} & \mathrm{if}\ S_{i,j,m}^{updated} < S_{min,j}\\[10pt] S_{i,j,m}^{updated} & \mathrm{otherwise} \end{array} \right. \end{equation}$(4)
Step 6: To decide whether the updated solution or the existing solution advances to the next iteration, compute the value of the cost function for each updated search agent and apply the greedy selection technique. If the updated solution is better than the current solution, it replaces the current solution. Otherwise, the updated solution is discarded and the current solution is retained in the population. Steps 3 to 6 are repeated until the maximum number of iterations $$It{r_n}$$ is reached.
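The six steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jaya_minimize`, the toy `sphere` cost function, and the bounds are hypothetical, and the sketch minimizes a cost, whereas the study maximizes classification accuracy (which is equivalent to minimizing its negative).

```python
import random


def sphere(solution):
    # Hypothetical toy cost function: sum of squares, minimum at the origin.
    return sum(v * v for v in solution)


def jaya_minimize(cost, lower, upper, pop_size=20, n_iter=100, seed=0):
    """Minimal Jaya sketch that minimizes `cost` within box bounds."""
    rng = random.Random(seed)
    n_var = len(lower)

    # Step 2: random initialization within the bounds (Equation (2)).
    pop = [
        [lower[j] + (upper[j] - lower[j]) * rng.random() for j in range(n_var)]
        for _ in range(pop_size)
    ]
    costs = [cost(s) for s in pop]

    for _ in range(n_iter):
        # Step 3: identify the best and worst solutions of this iteration.
        best = pop[costs.index(min(costs))]
        worst = pop[costs.index(max(costs))]

        for i in range(pop_size):
            candidate = []
            for j in range(n_var):
                x1, x2 = rng.random(), rng.random()
                s = pop[i][j]
                # Step 4: move toward the best and away from the worst (Equation (3)).
                updated = s + x1 * (best[j] - abs(s)) - x2 * (worst[j] - abs(s))
                # Step 5: clip to the bounds (Equation (4)).
                candidate.append(min(max(updated, lower[j]), upper[j]))

            # Step 6: greedy selection, keep the candidate only if it improves.
            candidate_cost = cost(candidate)
            if candidate_cost < costs[i]:
                pop[i], costs[i] = candidate, candidate_cost

    k = costs.index(min(costs))
    return pop[k], costs[k]


best_solution, best_cost = jaya_minimize(sphere, [-5.0] * 3, [5.0] * 3)
print(best_solution, best_cost)
```

For hyperparameter tuning, `cost` would wrap model training and return, for example, one minus the validation accuracy; categorical hyperparameters such as `kernel` would additionally need to be mapped to and from numeric decision variables, a design choice that this sketch leaves open.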
4 RESULTS AND DISCUSSION
At this stage, the dataset developed during the feature engineering process is retrieved for model training and validation. The retrieved dataset comprises 1035 days of real consumption data and 39 additional features (listed in Table 3). The class distribution of the raw input dataset was balanced with the robustSMOTE method before the data were fed to the algorithm for model training. A train-test split is used, in which 80% of the data is used for model training and 20% for testing. The proposed theft detection framework uses the KTBoost algorithm for model training, while Jaya algorithm-based metaheuristic optimization is used for its hyperparameter tuning. In this scenario, the objective of the optimization is to maximize the model's accuracy, that is, to minimize the difference between predicted and actual outcomes. After more than 35 trials/iterations of the Jaya algorithm, the model attained an accuracy of 0.937, as presented in the optimization history plot in Figure 11. The x-axis represents the trial count, while the y-axis shows the accuracy value. The blue dots in the graph show the accuracy attained with different combinations of hyperparameters.
Furthermore, Figures 12 and 13 show the slice and contour plots of the hyperparameter optimization process, illustrating the effect of varying each hyperparameter on the objective value (accuracy). For example, Figures 12 and 13 show that a learning rate within the range of 1.5 to 2.5 achieves high objective values, but increasing it beyond that range produces a considerable reduction in objective value. Similarly, a max_depth greater than 1500 yields better accuracy values, but increasing it too far yields a significant reduction in accuracy, which can be attributed to the model overfitting the training data.
The optimal hyperparameter set, which attained the best accuracy value over the optimization trials, is given in Table 5. As presented in the table, the combined base learner (kernel boosting and tree boosting) with the hybrid update step achieves the best accuracy value.
Hyperparameter  Value 

base_learner  Combined (Kernel boosting and tree boosting) 
kernel  GW 
learning_rate  0.2 
loss  deviance 
max_leaf_nodes  34 
max_depth  1863 
n_neighbors  50 
update_step  hybrid 
4.1 K-fold cross-validation results of the Jaya-optimized KTBoost model
To implement the proposed Jaya-optimized KTBoost algorithm, the designed model is first trained on the data produced by the class balancing and feature engineering stages. Afterward, ten-fold cross-validation (CV) with the mentioned performance metrics (Equations (5)–(12)) is used to evaluate the performance of the designed model. As presented in Table 6, the proposed model achieved a mean accuracy and precision of 0.9338 and 0.9508, with standard deviations (SD) of 0.0029 and 0.0035, respectively.
No. of folds  Accuracy  Recall  Precision  F1-score  Kappa value  MCC 

1  0.9311  0.9216  0.9479  0.9345  0.8891  0.8922 
2  0.9354  0.9278  0.95  0.9388  0.8705  0.9108 
3  0.9354  0.9239  0.9536  0.9385  0.8706  0.9111 
4  0.9326  0.9196  0.9524  0.9357  0.8921  0.9123 
5  0.937  0.9263  0.9542  0.94  0.8736  0.9201 
6  0.939  0.9292  0.9552  0.942  0.8777  0.9021 
7  0.9285  0.9191  0.9454  0.9321  0.8921  0.9154 
8  0.9331  0.9258  0.9476  0.9366  0.881  0.9125 
9  0.9313  0.923  0.947  0.9348  0.887  0.8926 
10  0.9344  0.9206  0.9548  0.9374  0.8891  0.9092 
Mean  0.9338  0.9318  0.9508  0.9371  0.8873  0.9077 
Standard deviation  0.0029  0.0033  0.00365  0.00292  0.0087  0.00931 
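As a worked check of how the summary statistics are obtained from the per-fold scores, the short sketch below recomputes the mean and SD of the accuracy and precision columns of Table 6. The use of the population standard deviation (`pstdev`) is an assumption; the text does not state which estimator was used. The helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Per-fold accuracy and precision values copied from Table 6.
accuracy = [0.9311, 0.9354, 0.9354, 0.9326, 0.937,
            0.939, 0.9285, 0.9331, 0.9313, 0.9344]
precision = [0.9479, 0.95, 0.9536, 0.9524, 0.9542,
             0.9552, 0.9454, 0.9476, 0.947, 0.9548]


def summarize(scores):
    # Mean and population standard deviation, rounded to four decimals.
    return round(mean(scores), 4), round(pstdev(scores), 4)


print("accuracy :", summarize(accuracy))    # (0.9338, 0.0029)
print("precision:", summarize(precision))   # (0.9508, 0.0035)
```

Under this assumption, the recomputed values agree with the mean accuracy and precision and the SDs quoted in the text of this subsection.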
4.2 Confusion matrix evaluation of the proposed model
The confusion matrix (CM) is a widely used tool for evaluating classification models. It can be used for both binary and multiclass classification problems. The CM reports counts of the actual versus predicted classes, as illustrated in Figure 14. In this study, $${T^+}$$ represents the number of theft consumers correctly classified by the classifier, whereas $${F^-}$$ represents the fraudster consumers misclassified as healthy consumers. Similarly, $${T^-}$$ represents the number of correctly classified healthy consumers, while $${F^+}$$ represents the healthy consumers misclassified as fraudster consumers.
The confusion matrix of the proposed model is shown in Figure 15, where “0” represents the actual negative class (healthy consumers) and “1” represents the positive class (fraudster consumers). The values in the CM are normalized to percentages for readability. From the figure, it can be observed that the classifier correctly classified 93.16% of the theft consumers, while 6.84% of actual theft consumers were misclassified as healthy. Similarly, 95.25% of healthy consumers were correctly classified, whereas 4.75% of actual healthy consumers were misclassified as theft.
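To make the link between the four confusion matrix entries and the reported performance measures concrete, the sketch below derives all six measures from raw counts. Equations (5)–(12) are not reproduced in this section, so the standard textbook definitions are assumed here. The function name `confusion_metrics` and the counts in the usage example are hypothetical and are not taken from Figure 15.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Derive common binary classification measures from confusion counts.

    tp: theft consumers classified as theft      (T+)
    fn: theft consumers classified as healthy    (F-)
    tn: healthy consumers classified as healthy  (T-)
    fp: healthy consumers classified as theft    (F+)
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)

    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=90, fn=10, tn=95, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because Figure 15 is normalized per class, the raw counts would be needed to reproduce the reported accuracy and precision exactly; only recall can be read off a row-normalized matrix directly.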
4.3 AUC-ROC curve of the proposed model
In the rank-based AUC calculation, $${P_s}$$ represents the number of positive samples, $${N_s}$$ the number of negative samples, and $$Ran{k_j}$$ the rank of sample j belonging to the positive class. The AUC value is the likelihood that a randomly selected positive data sample would rank higher than a randomly selected negative data sample. For a useful classifier, the AUC value lies between 0.5 and 1, where 0.5 indicates that the classifier performs random guessing, and 1 indicates that the classifier perfectly separates the healthy and theft consumers.
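The rank-based AUC described here is commonly written as $$AUC = \frac{\sum_{j \in positive} Ran{k_j} - {P_s}({P_s} + 1)/2}{{P_s} \cdot {N_s}}$$, where samples are ranked in ascending order of predicted score. The sketch below implements this common form; it is an assumption that the study uses exactly this expression. The function name `rank_auc` is hypothetical, and tied scores receive their average rank.

```python
def rank_auc(labels, scores):
    """Rank-based AUC: labels are 0/1, scores are predicted probabilities."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)

    # Assign 1-based ascending ranks, averaging the ranks of tied scores.
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    p_s = sum(1 for y in labels if y == 1)   # number of positive samples
    n_s = len(labels) - p_s                  # number of negative samples
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - p_s * (p_s + 1) / 2) / (p_s * n_s)


# Two positives ranked 2nd and 4th out of four samples.
print(rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In the usage example, three of the four positive-negative pairs are ordered correctly, which gives 0.75 and matches the probabilistic interpretation of the AUC.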
The ROC curve of the proposed classifier is shown in Figure 16; the x-axis represents the FPR, and the y-axis the TPR. The average AUC value of the proposed classifier is 0.98, which indicates that most of the theft and healthy consumers are correctly classified.
4.4 The learning curve of the proposed theft detection model
A learning curve graphically depicts the relationship between the training score and the cross-validated (CV) test score of a classifier for different numbers of training instances [68]. The basic purpose of this curve is to check the classifier's ability to generalize across different data samples. The learning curve of the proposed classifier is shown in Figure 17. The curves in the graph illustrate the mean scores, while the shaded areas depict the standard deviations above and below the mean across all cross-validation folds. If the model suffers from error due to bias, the training score curve will most likely be more variable than expected. Likewise, if the model is prone to error owing to variance, the cross-validated score will be more variable.
In Figure 17, it can be seen that when the number of data samples is small, the training score is very high in comparison to the CV score, which is a result of the high bias of the model. In contrast, as the number of training samples grows, the training score decreases while the CV score increases, albeit with considerable fluctuation due to the model's high variance. Additionally, the learning curve shows that the model's CV score and accuracy are above 0.9338, implying that the model can accurately distinguish fraudster consumers from healthy consumers.
4.5 Proposed model's outcomes interpretation and their impact on training time
In this section, the proposed model's prediction or outcomes are interpreted. The model's prediction interpretation is the process by which the input data features utilized for model training are evaluated based on their positive influence on predicting the correct result. In this study, the KTBoost algorithm is employed to rank all the given input features in terms of their contribution in predicting the right outcome.
Because the input training data contains over 1200 features, it is not feasible to display the importance score of every feature in a single graph; thus, only the ten most important features are displayed in Figure 18, together with their importance scores. The figure shows that a feature from the actual consumption data had the highest importance, followed by statistical features derived from the actual consumption. To demonstrate the significance of the importance scores assigned by the KTBoost model, the model was retrained on a much smaller yet essential feature set. Figures 19 and 20 show the computing time required to process the entire set of data features (1071 features) and the 23 most important features, respectively. As can be seen in these figures, a substantial decrease in computing time is achieved when the smaller feature set is supplied.
In addition, Figure 21 depicts the effect of the number of important features on the model's accuracy. The model achieved an accuracy of 80 percent when only the five most important features were supplied. When the number of important features was increased from 5 to 23, the model achieved the same accuracy as when trained with all 1071 features. It can therefore be concluded from this experiment that, if the model is retrained with only the most important features, the computational resources required can be drastically reduced without loss of accuracy.
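The feature reduction step described in this subsection amounts to ranking features by importance score, keeping the top k, and retraining on the reduced columns. A minimal sketch of the selection part is given below; the helper names `select_top_k` and `reduce_rows`, the feature names, and the importance values are all hypothetical, and the retraining itself is omitted.

```python
def select_top_k(feature_names, importances, k):
    # Rank features by importance (highest first) and keep the first k names.
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def reduce_rows(rows, feature_names, keep):
    # Keep only the columns whose names appear in `keep`, in that order.
    indices = [feature_names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


# Hypothetical feature names and importance scores for illustration only.
names = ["day_1", "day_2", "mean", "std", "skew"]
scores = [0.05, 0.40, 0.30, 0.20, 0.05]
rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

top = select_top_k(names, scores, k=3)
print(top)                             # ['day_2', 'mean', 'std']
print(reduce_rows(rows, names, top))   # [[2, 3, 4], [7, 8, 9]]
```

In the study's setting, k would be swept (for example from 5 to 23) and the model retrained at each value to locate the smallest feature set that preserves accuracy.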
4.6 Proposed model's comparison against the latest and traditional methods
This section presents a side-by-side comparison of the proposed theft detection framework with a series of well-known traditional machine learning models and the latest bagging and boosting models under an identical feature set. To assess the performance of all studied classifiers, the ten-fold cross-validation method is used in conjunction with six commonly used performance measures, namely accuracy, recall, precision, F1-score, Kappa value, and MCC value.
The proposed framework is sequentially implemented in the Google Colaboratory environment (Python 3 Google Compute Engine backend, 12 GB RAM, without GPU acceleration). The comparison's results are summarized in Table 7. As summarized in the table, the proposed approach surpasses all other ML techniques in terms of accuracy, recall, precision, Kappa value, and MCC value, thus evidencing its efficacy and importance. In addition, the proposed model obtained an accuracy, recall, and precision of 93.38%, 93.18%, and 95%, respectively, which is considerably better than all competing models.
Model  Accuracy  Recall  Precision  F1-score  Kappa value  MCC 

Proposed model  0.9338  0.9318  0.9508  0.9371  0.8873  0.9077 
XGBoost classifier  0.9112  0.9123  0.9012  0.912  0.867  0.875 
Extra tree classifier  0.901  0.8921  0.912  0.934  0.854  0.812 
SNAP boost algorithm  0.90  0.8912  0.9216  0.9123  0.8412  0.845 
LightGBM  0.891  0.8751  0.8631  0.8641  0.8124  0.854 
Wide & Deep CNN  0.89  0.812  0.881  0.7921  0.812  0.8213 
Gaussian process based boosting  0.885  0.8754  0.8698  0.8412  0.8421  0.7892 
Boosted C5.0 algorithm  0.881  0.8541  0.824  0.8121  0.824  0.8245 
NGBoost algorithm  0.87  0.861  0.834  0.8251  0.834  0.8964 
Random forest classifier  0.834  0.8123  0.8241  0.8125  0.8453  0.831 
SVM (linear kernel)  0.823  0.7601  0.8292  0.7928  0.6042  0.6066 
AdaBoost classifier  0.814  0.7562  0.7213  0.745  0.751  0.761 
Ridge classifier  0.795  0.7931  0.8584  0.8244  0.6622  0.6641 
Quadratic discriminant analysis  0.721  0.2251  0.8911  0.3594  0.1976  0.2974 
Logistic regression  0.712  0.8063  0.8482  0.8267  0.6619  0.6627 
Linear discriminant analysis  0.698  0.7929  0.8583  0.8243  0.662  0.6639 
K-neighbours classifier  0.587  0.6412  0.7606  0.8284  0.6233  0.6356 
Naive Bayes  0.54  0.3478  0.6261  0.4472  0.1401  0.1563 
5 CONCLUSION
This study presented a novel, sequentially executed data-driven approach for identifying electricity theft in a smart meter dataset. Raw smart meter data often contain null and irregular values, mostly due to equipment malfunction, poor network connectivity, or device storage-related issues. Since most machine learning classifiers cannot process null values, this study estimated the missing values using an ensemble machine learning-based predictive modelling technique called XGBoost. Afterward, the robustSMOTE algorithm was used to balance the class distribution in the acquired data. By considering all regions of minority samples in the dataset, the robustSMOTE technique produces minority class samples that are less prone to overfitting and noisy sample generation. Once a balanced dataset was obtained, a set of statistical, temporal, and spectral features was extracted from it. These additional features help the ML classifier understand the complicated underlying patterns in the data. Finally, to classify the consumers into “Honest” and “Fraudster,” the Jaya-optimized KTBoost classifier was used. The Jaya-KTBoost technique combines kernel boosting and tree boosting, with its hyperparameters tuned by the Jaya algorithm. The proposed model attained an accuracy of 93.38%, precision of 95%, and recall of 93.18%, which are considerably higher than those of all compared methods.
FUNDING INFORMATION
This work was supported by the Fundamental Research Grant Scheme under Grant R.J130000.7851.5F062 through the Ministry of Higher Education, Malaysia.
CONFLICT OF INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study is publicly available at: https://github.com/henryRDlab/ElectricityTheftDetection.