Volume 16, Issue 6, p. 1257-1275
ORIGINAL RESEARCH PAPER
Open Access

Electric theft detection in advanced metering infrastructure using Jaya optimized combined Kernel-Tree boosting classifier—A novel sequentially executed supervised machine learning approach

Saddam Hussain (corresponding author), School of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia. Email: [email protected]
Mohd. Wazir Mustafa, School of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
Khalil Hamdi Ateyeh Al-Shqeerat, Department of Computer Science, College of Computer, Qassim University, Buraydah, Saudi Arabia
Bander Ali Saleh Al-rimy (corresponding author), School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Johor Bahru, Johor 81310, Malaysia. Email: [email protected]
Faisal Saeed, School of Computing and Digital Technology, Birmingham City University, Birmingham, UK

First published: 11 January 2022

Abstract

This paper presents a novel, sequentially executed supervised machine learning-based electric theft detection framework using a Jaya-optimized combined Kernel and Tree Boosting (KTBoost) classifier. It utilizes the intelligence of the XGBoost algorithm to estimate the missing values in the acquired dataset during the data pre-processing phase. An oversampling algorithm based on the robust-SMOTE technique is utilized to avoid the unbalanced data class distribution issue. Afterward, with the aid of a few highly significant statistical, temporal, and spectral features extracted from the acquired kWh dataset, the complex underlying data patterns are comprehended to enhance the accuracy and detection rate of the classifier. To effectively classify consumers into "Honest" and "Fraudster," the ensemble machine learning-based classifier KTBoost, with its hyperparameters optimized by the Jaya algorithm, is utilized. Finally, the developed model is re-trained using a reduced set of highly important features to minimize the computational resources without compromising the performance of the developed model. The outcome of this study reveals that the proposed theft detection method achieves the highest accuracy (93.38%), precision (95%), and recall (93.18%) among all the studied methods, thus signifying its importance in the studied area of research.

1 INTRODUCTION

The integration of communication and information technologies with electrical infrastructure has become increasingly prevalent in recent years. Smart grids, the next generation of energy distribution networks, are emerging due to the growing penetration of modern technology [1, 2]. One of the crucial components of smart grids is the Advanced Metering Infrastructure (AMI), which allows two-way transfer of data such as the time and quantity of energy used by a customer. With this new bi-directional information flow, AMI enables power companies to perform accurate modelling of customer energy consumption behaviour [3], including predicting energy usage [4], demand response [5], and real-time pricing [6]. However, despite numerous advantages, threats such as cyber-attacks, smart meter hacking, and malicious data manipulation restrict the vast expansion of AMI [7-9] and jeopardize the grid's security. The most significant challenge associated with AMI is Non-Technical Losses (NTL), which account for power theft, errors in the metering/registering process, and invoicing mistakes [10]. Among the mentioned NTL causes, electric power theft holds the major share. Power theft is not only associated with economic loss; it also degrades power quality, increases the load on generating stations, and results in irrational tariffs imposed on legitimate consumers. Power utilities all over the globe incur significant revenue loss as a result of power theft. In the United States alone, this loss ranges from 0.5% to 3.5% of annual income [11]. The situation is even worse in underdeveloped nations, where the revenue loss from this type of NTL constitutes a significant portion of their gross domestic product [12, 13].

To decrease NTLs, power utilities check all suspected consumers daily or weekly and then enforce punitive measures for any proven fraudulent practice. However, this process is time-consuming, expensive, and error-prone. Currently, the majority of power utilities, especially in under-developed countries, employ traditional NTL detection systems that are inefficient, laborious, and costly. Nevertheless, in recent years a significant increase in the deployment of AMI in distribution networks has been witnessed, which provides additional features such as monitoring, storing, and retrieving a broad variety of data at any time. In addition, data-oriented algorithms have emerged as an effective automated tool for screening aberrant energy consumption patterns and identifying possible electrical fraud activities. These data-oriented theft detection methods can be broadly divided into four categories: statistical [14-17], game-theory [18, 19], expert-system [20, 21], and ML-based [22-25] methods.

1.1 Major and minor contributions of the proposed theft detection system

This study endeavours to develop a novel supervised machine learning (SML)-based, sequentially executed electricity theft detection framework that effectively detects fraudster consumers in an acquired smart meter dataset. The simplified flowchart of the developed method is illustrated in Figure 1, and a brief explanation of each executed novel stage is as follows:
  1. The proposed framework initiates its operation by substituting the missing entries in the obtained smart meter dataset using the machine learning (ML)-based predictive modelling technique. This technique estimates the missing data records by employing the XGBoost algorithm in such a manner that missing attributes act as the target class and the rest of the feature set as an input for model training. The important aspects of this algorithm include handling various kinds of missing data, being adaptable to interactions and non-linearity within the dataset, and being scalable to large data situations.

  2. After handling the missing values problem, the data class imbalance issue is addressed by using the robust synthetic minority oversampling approach (robust-SMOTE). The robust-SMOTE technique generates the minority samples (i.e., fraud cases) from all minority sample regions present in the dataset, such as those which are present within the majority class area (Healthy cases), on the borderline of the majority class, and the one which is far away from the majority class samples. Subsequently, to accurately depict the underlying properties of consumption data, the proposed method utilizes the statistical, temporal, and spectral domains to extract features from collected consumption data.

  3. After collecting the most relevant characteristics, the model training-testing procedure is commenced by classifying customers into two different groups (“Genuine/Healthy” and “Theft/Fraudster”) using the KTBoost algorithm. The KTBoost algorithm combines kernel boosting and tree boosting methods for classification purposes. In each boosting iteration, it either adds a regression tree or a penalized reproducing kernel Hilbert space RKHS/kernel ridge regression function to the ensemble of base classifiers. Later, to obtain the best possible results, the model's hyperparameters are tuned using a meta-heuristic-based optimization technique called the Jaya algorithm. The Jaya algorithm is a stochastic population-based optimization technique that modifies a population of individual solutions on an ordered basis by keeping the notion that each individual solution strives to attain the best solution while avoiding the least fit/worst one.

  4. Finally, the proposed model is retrained with a smaller set of highly significant features while maintaining the same degree of accuracy, thus conserving computing resources.

FIGURE 1. Proposed Jaya optimized-KTBoost based electric theft detection framework

Section 2 discusses the most relevant literature on the challenges encountered during the development of the SML framework. Section 3 covers data exploration, the missing values imputation approach, the data class balancing method, feature engineering, and the theoretical background of the KTBoost and Jaya algorithms. Section 4 provides the outcomes of the proposed research work. Finally, Section 5 contains the conclusion.

2 LITERATURE REVIEW

The current research explores an application of the supervised ML-based theft detection framework; therefore, the most relevant information and literature are highlighted to better understand the proposed methodology and its significance in the studied field of research.

Typically, SML-based NTL detection techniques encounter five major issues:
  1. Handling of missing and outlying values occurrence in the accumulated raw dataset

  2. Target/data class imbalance distribution

  3. Method for relevant features extraction and selection

  4. The right choice of classification algorithm and its hyperparameters to maximize the prediction accuracy

  5. Understanding/interpreting the model's prediction.

A number of attempts have been made in the literature to solve these issues, of which a few prominent research works are cited below, following the sequence of the above-mentioned problems.

The data from smart meters is often irregular, with several null and outlying readings, mainly due to unstable synchronous transmission between sensors and databases, unexpected device maintenance, storage issues, unreliable/inadequate network quality, incorrect estimates of sent data, and various unknown environmental factors [26]. Such irregularities in the dataset may jeopardize the learning ability of the SML classifier, resulting in biased and erroneous estimations [27]. To address this issue, two approaches have typically been adopted in the literature: imputation or elimination. In the imputation method, an estimated value is substituted for the missing attribute, while in elimination, the missing entries are removed from the dataset. The imputation process is often used for dealing with missing features since it is based on the concept that if an essential feature is missing for a specific instance, it may be approximated from the already available data [28]. In general, imputation is carried out by either statistical or machine learning methods. Statistical estimation techniques include the mean, mode, median, linear interpolation [29], and the autoregressive integrated moving average [30]. These data imputation methods are computationally fast and simple to execute. However, they generally lead to erroneous and skewed results due to the possible presence of outliers (observations with unusual characteristics) in the data. Furthermore, such simple methods cannot comprehend the complex relationships between the input data variables and the missing-values occurrence patterns in the data, which consequently leads to misleading outcomes. Nevertheless, a few machine learning methods, such as the k-nearest neighbour missing-values imputer [31], fuzzy clustering [32], the support vector regressor (SVR) [33], random forest imputation (RFI) [34], and the Bayesian missing-values imputer [35], employ efficient predictive modelling techniques for estimating missing data values accurately. However, in the presence of huge amounts of data, such as high-resolution smart meter data, the mentioned techniques require enormous computing resources. Another way to deal with missing data is to discard/eliminate it entirely. Although "discarding" techniques such as list-wise and pair-wise deletion can be implemented smoothly, a significant loss of information might occur, leading to skewed estimates at the end of the classification process.

Another challenge in NTL detection is the unbalanced data class distribution, that is, the frequency of fraudulent cases is disproportionately low compared to genuine consumer cases. The performance of machine learning classifiers is severely affected by an imbalanced distribution of data classes. Moreover, the over-representation of the majority class (healthy consumers) prevents a classifier from focusing on the minority class (fraudster customers), thus producing irrational results. Various methods based on the concepts of minority oversampling and majority under-sampling have been proposed in the literature to counteract this issue. Two prominent research works that have thoroughly addressed this imbalanced data class distribution problem are Nazmul et al. [36] and Sravan et al. [37]. Both works used the Synthetic Minority Oversampling TEchnique (SMOTE) to balance the data class distribution in the acquired NTL detection dataset. The SMOTE method randomly generates minority class samples by setting the same sampling rate for all samples of the minority class. The problem associated with this approach is that it causes overfitting and lowers the generalizing ability of the classifier. In another research work, Madalina et al. [38] employed an under-sampling method in which a number of data samples from the majority class are eliminated to balance the data class distribution. Such data balancing methods are simple to execute; however, they can cause significant data loss, resulting in a reduction in the accuracy of the developed model. In another article [39], the data class distribution was balanced via the ADAptive SYNthesis (ADASYN) based oversampling technique. While the developed approach obtained better generalizing ability, it achieved lower accuracy owing to underfitting of the developed model.

As mentioned earlier in this section, the third major problem in fraud detection techniques is the selection of the most relevant features for the model training process. Because raw smart meter datasets contain only consumption data and lack any statistical or supplementary features, it becomes difficult for the learning classifier to differentiate/understand the complex underlying patterns present in the data. To mitigate this issue, Punmiya et al. [40] and Salman et al. [24] extracted additional features from raw data employing simple statistical techniques such as the mean, median, standard deviation, minimum, and maximum. However, even though these techniques are simple to implement and computationally fast, they produce misleading results in the presence of outliers in the data.

After feature engineering, choosing a suitable classifier for efficiently separating genuine and fraudulent customers is the next challenge in any supervised ML technique. Nagi et al. [39] used a predictive modelling technique based on support vector machines (SVM) to identify abnormal consumer behaviour. The SVM-based ML model was developed using customer load profile data and other characteristics, such as a creditworthiness rating, meter reading data, and fraudulent activity reports, to identify abnormal consumer behaviour effectively. However, the detection hit rate achieved was merely 60%, which is significantly low, particularly when consumers number in the millions. In one of the most recent studies, a deep Siamese network (DSN) coupled with a convolutional neural network (CNN) and long short-term memory (LSTM) was proposed by Javaid et al. [39] to differentiate the characteristics of genuine and dishonest consumers. The authors achieved a reasonable accuracy; however, the precision and recall rates were comparatively lower. In another study, Paria et al. [41] developed a theft detection framework to identify regions of significant energy theft at the transformer level using data gathered from different distribution transformer meters. The developed methodology achieved a high detection rate (94%); however, since the fraudster consumption patterns introduced in that work were produced synthetically, they do not precisely depict actual fraudster customers' profiles; therefore, the attained outcomes may diverge from a realistic scenario.

In one of the recent studies, Oprea et al. [42] utilized feature-engineered light gradient boosting to effectively find irregular consumption patterns in an acquired conventional meter dataset. However, the data class balancing technique employed in the quoted study used the SMOTE algorithm, which is prone to overfitting and often results in a high generalization error. In addition, it may increase noise since it ignores class distributions and suffers from sample selection blindness. Sarkar et al. [25] presented a fraud detection framework utilizing ensemble machine learning methods with considerably high accuracy, precision, and recall. However, they did not interpret the developed model's outcomes, which is crucial for strengthening the ML model further. Interpreting the model's outcomes benefits in two ways: first, it helps concentrate on and fine-tune the characteristics that contributed most to generating positive outcomes; second, by re-training the model with a smaller set of very important features (based on the feature importance scores assigned by the model), computational time may be substantially lowered without compromising accuracy. Table 1 presents a summary of the different techniques utilized in developing SML-based electric theft detection methods.

TABLE 1. Summary of most widely used techniques in building SML based electric theft detection methods
S. No. References Method used Missing values Data class imbalance Feature extraction Feature selection Performance metrics utilized
1 Nizar et al. [43] Naïve Bayes and Decision tree Load profiles Accuracy
2 Nagi et al. [44] Genetic algorithm-SVM Average values Statistical features Accuracy, detection rate
3 Nizar et al. [45] Extreme learning machine -SVM Accuracy
4 Nagi et al. [46] SVM Average values Statistical features Accuracy, detection rate
5 Ramos et al. [47] Optimum path forest (OPF) Statistical features Accuracy
6 Caio et al. [48] Harmony search algorithm and OPF Principal component analysis Harmony search algorithm Accuracy
7 Carlos et al. [49] Integrated expert system, rule-based system Removal Text mining Accuracy
8 Faria et al. [50] Spatial-temporal estimation Statistical features Loss probability
9 Juan et al. [51] SVM-DT Statistical features Filter wrapper Accuracy, recall, precision, and F1-score
10 Paria et al. [52] Consumption pattern-based energy theft detection Different sampling proportions Statistical features Bayesian detection rate, accuracy, recall, detection rate, and precision
11 Selvam et al. [53] Decision Tree, Random Forest Accuracy, ROC
12 Zheng et al. [54] Wide and deep convolutional neural networks Average values CNN Accuracy, recall, detection rate, and precision
13 Punmiya et al. [40] Feature engineered extreme gradient boosting machine SMOTE Statistical features Accuracy, recall, detection rate, and precision
14 Salman et al. [13] Ensemble machine learning Accuracy, recall, detection rate, and precision
15 Blazakis et al. [55] Adaptive Neuro-Fuzzy Inference System Statistical features Neighbourhood component analysis Accuracy, F1 score, precision, recall, specificity, AUC
16 Sravan et al. [25] Ensemble machine learning Deletion SMOTE Accuracy, ROC, recall, precision
17 Salman et al. [24] Boosted C5.0 decision tree Statistical features Pearson's Chi-Square Accuracy, recall, detection rate, and precision
18 Zhengwei et al. [56] Random Forest Kmeans-SMOTE Accuracy, TPR, FPR, TNR, G-mean
19 Guoying et al. [57] Autoencoder and Random Forest Undersampling and re-sampling Stacked autoencoder Probabilistic prediction
20 Munwar et al. [58] Recurrent neural network Rule-based Accuracy, recall, detection rate, and precision
21 Cheng et al. [59] Deep learning, random forest Rule-based CNN Precision, recall, true positive rate, false-positive rate
22 This work Jaya optimized-KTBoost XGBoost algorithm Robust-SMOTE Statistical, temporal, and spectral domain-based features KTBoost algorithm Accuracy, detection rate, precision, F1-score, Kappa and MCC

3 PROPOSED METHODOLOGY

A stage-wise representation of the proposed theft detection framework is depicted in Figure 2.

FIGURE 2. Proposed Jaya optimized KTBoost based electric theft detection framework

Each of the stages mentioned in Figure 2 is discussed in detail in the subsequent subsections.

3.1 Exploratory data analysis

In this subsection, the pre-processing of the acquired dataset is explained in detail. The dataset used for this study is real smart meter data obtained from the State Grid Corporation of China (SGCC). The acquired dataset's distribution is summarized in Table 2. Like most real-world datasets, the number of fraudster consumers in the SGCC kWh data is far lower than that of healthy consumers. Figures 3 and 4 illustrate the consumption patterns of a few randomly selected fraudulent and healthy consumers, respectively.

TABLE 2. Data statistics of acquired SGCC dataset
Parameter description Parameter value
Number of total consumers 42,372
Number of healthy/genuine consumers 38,757 or 91.46% of total data
Number of fraudster/theft consumers 3615 or 8.54% of total data
Number of days of consumption record 1035 days (January 2014 to December 2016)
FIGURE 3. Electric consumption patterns of fraudster consumers
FIGURE 4. Electric consumption patterns of healthy consumers

It can be observed from the provided figures that the consumption patterns of theft customers are highly unpredictable and exhibit low repeatability, while genuine consumers' patterns are recurrent and exhibit a relationship among identical periods of subsequent years.

3.2 Missing values and their imputation using XGBoost algorithm

The smart meter data often contains numerous missing entries, mainly due to equipment malfunction, lag in registering/collecting data remotely, accidental deletion, or cyber-attacks and fabrication of smart meter devices. To illustrate the occurrence of missing values in consumption patterns, the electric power consumption of a few consumers, randomly sampled from the acquired consumption data, is illustrated in Figure 5.

FIGURE 5. Randomly sampled consumers' consumption data with missing entries

From Figure 5, it can be observed that there are several blank spots in between the consumption values. If such an incomplete dataset is directly fed into the ML framework, the ML algorithms within the framework would be unable to comprehend the complicated relationships between the input data variables and the missing-values occurrence patterns, thus leading to misleading conclusions. The missing values in the entire dataset are computed and plotted in Figure 6, which illustrates the missing values present in each consumer's consumption data; the x-axis is the time window of the acquired consumption data, and the y-axis is the number of consumers present in the data. The darker regions in the mentioned figure demonstrate a higher density of missing entries, and lighter or dotted areas express fewer missing entries. For example, in the time window from 2014 to 2015, consumers' consumption data carries a lot of missing entries, whereas in 2016 these missing entries are comparatively fewer. In addition, the kernel density estimation and histogram plot of the missing values present in the data is computed and illustrated in Figure 7.

FIGURE 6. Missing values occurrence in the acquired smart meter (SGCC) dataset
FIGURE 7. Histogram-kernel density estimation plot of missing values present in the acquired smart meter dataset
It may be noted from Figure 7 that there are more than 7000 consumers whose missing-value count is greater than 700, while for the majority of consumers this count lies between 10 and 200. To address this issue, the proposed framework utilizes a machine learning-based technique that builds a predictive model employing the XGBoost algorithm for estimating the missing attributes present in the data. The XGBoost algorithm belongs to the group of ensemble machine learning algorithms that use decision tree-based boosting to generate highly accurate models/estimators. In addition, it can impute missing entries present in a dataset, is adaptable to interactions and non-linearity within the data, and is scalable to large data situations. Boosting in XGBoost refers to the process of progressively creating multiple models, where each newly created model attempts to fix the errors of the preceding one. XGBoost utilizes the decision tree as a base classifier and progressively builds each subsequent decision tree based on the prediction results of the previous trees. The overall objective function of the XGBoost algorithm is given in Equation (1).
$$\mathrm{Objective\,function}(\theta) = \sum_{j} \mathrm{TrainingLoss}\big(\hat{y}_j,\, y_j\big) + \sum_{i} \Omega(f_i), \quad f_i \in F \qquad (1)$$
where $y_j$ is the actual value and $\hat{y}_j$ is the prediction made by the model. The training loss controls the overall performance of the model. The regularization term $\Omega$ computes the complexity of the model, which further assists in preventing the model from overfitting. $F$ represents the function space containing the set of all possible regression tree functions $f$. The current research work utilizes the intelligence of the XGBoost algorithm for imputing missing entries in the acquired dataset. To visualize the data imputation process, the missing values of two randomly selected samples from the acquired dataset are imputed using the mentioned algorithm. The results attained by the proposed missing-values imputation technique are provided in Figure 8.
FIGURE 8. (A, B) Missing values imputation in consumers' consumption data using the XGBoost algorithm

It can be observed from Figure 8 that the estimated missing values (in black) coincide with the actual consumption data. Thus, the missing values imputed through this process enhance the ML classifier's performance and avoid unintentional model bias towards the missing values.
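As an illustration of this stage, the column-wise predictive imputation described above can be sketched with scikit-learn's IterativeImputer wrapped around an XGBoost regressor; the DataFrame name `df` and the model settings are assumptions for the example, not the exact configuration used in this study.

```python
# A minimal sketch of ML-based missing-value imputation, assuming the SGCC
# data is loaded as a (consumers x days) pandas DataFrame named `df`.
# IterativeImputer treats each column with missing entries as the target and
# the remaining columns as inputs, mirroring the scheme described above.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(
    estimator=XGBRegressor(n_estimators=100, max_depth=4, n_jobs=-1),
    max_iter=5,       # a few passes over the columns; costly for 1035 columns
    random_state=0,
)
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          index=df.index, columns=df.columns)
```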

3.3 Robust-SMOTE for data class imbalance issue

The SML-based classifier's performance deviates largely if the proportions of the data classes present in the acquired dataset vary [60]. Since the acquired smart meter data is highly unbalanced, class balancing must be performed through an intelligent technique before training and testing the classifier. Figure 9 shows the class distribution of the collected dataset; the red data points represent the theft samples (minority class) and the green points the healthy samples (majority class).

FIGURE 9. The unbalanced data class distribution in the obtained smart meter dataset

It can be observed in Figure 9 that the minority class samples are far scarcer than the majority class samples. ML classifiers trained on such datasets are likely to be biased towards the data class present in the greater proportion. Generally, legitimate customers outnumber fraudsters in most smart meter datasets [42]. Therefore, it is essential to balance the distribution of the data classes prior to feeding the ML classifier.

To mitigate this issue, the robust-SMOTE algorithm is used in this study. The robust-SMOTE method addresses all frequently occurring categories of minority data samples, that is, minority points lying within the majority class region, minority points close to the majority class samples (borderline points), and safe minority points [61]. It accomplishes this by first measuring the relative data density, computing the local density of each minority data point between its k-nearest heterogeneous neighbours and its k-nearest homogeneous neighbours. Afterward, it divides the minority samples into borderline and safe samples based on a 2-means clustering of the relative densities. The number of synthetic samples generated from each minority data point is then re-weighted depending on the number of majority-class points among its k-nearest neighbours, so that more samples are generated close to the safe data points, while fewer samples are generated near the disorder (borderline) samples, improving the divisibility of the classification boundary between the classes. The data class distribution of the acquired dataset after implementing robust-SMOTE is illustrated in Figure 10.

FIGURE 10. The balanced data class distribution after the robust-SMOTE algorithm

It can be observed from Figure 10 that the minority (red data points) and majority (green data points) class distributions are now justifiably balanced. Furthermore, most of the minority class samples are generated from the safe minority samples that are far away from the healthy samples; thus, this method aids the ML classifier in defining the classification border more eloquently.
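No widely used Python package implements robust-SMOTE, so the sketch below uses imbalanced-learn's standard SMOTE purely as a stand-in to show where class balancing sits in the pipeline; the robust variant described above additionally re-weights sample generation by relative density.

```python
# Class balancing sketch; X and y are the imputed feature matrix and the
# "Healthy"/"Fraudster" labels (assumed names). Standard SMOTE stands in for
# the paper's robust-SMOTE, which is not available in common packages.
from collections import Counter
from imblearn.over_sampling import SMOTE

X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))   # e.g. {0: 38757, 1: 3615} -> balanced
```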

3.4 Feature engineering

The successful development of an ML model is often contingent on the appropriate selection of the input features used during model training [62]. The feature engineering approach is specifically dedicated to that purpose; it assists in summarizing the dynamics of the data and enhances its overall representation by extracting the most important features, while simultaneously improving the performance and detection accuracy of the model [63]. The acquired smart meter dataset consists only of consumption data in kWh and lacks any other statistical significance. Therefore, in this study, several statistical, temporal, and spectral domain-based features are extracted from each consumer's consumption data, as presented in Table 3. Since Table 3 lists no fewer than 39 extracted features, providing the theoretical and mathematical background of each is beyond the scope and length of this article. Nevertheless, interested readers can find all the relevant information in reference [64].

TABLE 3. Extracted features from time-series data
S. No. Feature S. No. Feature S. No. Feature
1 Mean 14 Zero crossing rate 27 Variance
2 Median 15 Peak to peak distance 28 Relative dispersion
3 Mode 16 Minimum peaks 29 Autocorrelation
4 Maximum 17 Entropy 30 Histogram with different bandwidths
5 Minimum 18 Maximum peaks 31 Mel frequency cepstrum coefficients (MFCC)
6 Interquartile range 19 Histogram 32 Spectral variation
7 Kurtosis 20 Fast Fourier transform 33 Centroid
8 Skewness 21 Spectral centroid 34 Positive turning points
9 Standard deviation 22 Spectral kurtosis 35 Negative turning point
10 Median absolute deviation 23 Median frequency 36 Slope
11 Mean absolute deviation 24 Wavelet entropy 37 Mean absolute difference
12 Mean absolute differences 25 Wavelet energy 38 Maximum frequency
13 Median absolute differences 26 Empirical cumulative distribution 39 Median frequency
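To make the feature engineering stage concrete, a hedged sketch of how a handful of the Table 3 descriptors could be computed for one consumer's kWh series is given below; the full study extracts many more statistical, temporal, and spectral features.

```python
# Illustrative extraction of a few Table 3 features from a 1-D numpy array
# `x` holding one consumer's daily kWh readings (daily sampling assumed).
import numpy as np
from scipy import stats

def extract_features(x):
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    return {
        "mean": np.mean(x),
        "median": np.median(x),
        "std": np.std(x),
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "iqr": stats.iqr(x),
        "peak_to_peak": np.ptp(x),
        # fraction of sign changes around the mean level
        "zero_crossing_rate": np.mean(np.diff(np.sign(x - np.mean(x))) != 0),
        # magnitude-weighted mean frequency of the FFT spectrum
        "spectral_centroid": np.sum(freqs * spectrum) / np.sum(spectrum),
    }
```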

3.5 Proposed classifier: Jaya optimized KTBoost algorithm

Boosting algorithms are widely used in practical data science and machine learning-based research works due to their outstanding prediction accuracy on highly complex datasets [65]. Boosting algorithms additively chain weak (base) classifiers, consecutively reducing both bias and variance at each boosting iteration. Despite the widespread usage of boosting algorithms, only one type of function is used as a base learner in most cases. In contrast, the KTBoost algorithm adds either a regression tree or a penalized reproducing kernel Hilbert space (RKHS) kernel ridge regression function to the ensemble of base classifiers in each boosting iteration [66]. First, candidate base learners are learned from both the regression tree and the RKHS function by employing gradient or Newton steps as the optimization technique; afterward, the base learner whose inclusion in the ensemble results in the lower empirical risk is chosen. In this way, at each iteration a base learner is selected from two fundamentally different learners to achieve high predictive accuracy. In addition, this amalgamation facilitates learning functions that have different degrees of regularity, such as discontinuities and smooth portions: discontinuous portions are mostly learned through the regression trees, while smooth (continuous) portions are captured by the RKHS regression functions. The most important hyperparameters of the KTBoost algorithm are given in Table 4.

TABLE 4. Hyperparameters of the KTBoost classifier
Parameter name Description
learning_rate Sets the weighting factor for the addition of new trees to the classifier at each iteration.
n_estimators The number of boosting iterations to be performed.
subsample The fraction of samples to be used for fitting the individual base learners. Optimal selection of this parameter can assist in balancing bias and variance.
criterion The evaluation metric used to compute the quality of a split; by default the mean square error (mse), but the mean absolute error or Friedman mse can also be chosen.
min_samples_split The minimum number of samples required to split an internal node. This parameter controls model overfitting/underfitting related problems.
min_samples_leaf The minimum number of samples required at a leaf. Controlling this parameter helps with overfitting/underfitting related issues.
min_weight_leaf
max_depth Helps in building the structure of the regression tree.
max_features The number of features to be considered when searching for a split.
max_leaf_nodes Optimal selection of this value facilitates reducing the impurity of regression trees.
base_learner Sets the base learner; either trees, kernels, or a combination of both can be chosen.
update_step Determines how the boosting updates are computed at each iteration. If the base learner is trees and the update step is hybrid, a gradient step estimates the structure of the trees and a Newton step finds the leaf values. Similarly, if the base learner is kernel and the update step is hybrid, gradient descent is used as the update step.
tol Facilitates early stopping if there is no change in the loss.
kernel In the case of kernel boosting, the Laplace, radial basis function, or generalized Wendland kernel can be chosen.
range_adjust Regularization parameter for the RKHS regression function.
Nystroem The Nystroem sampling method is used if set to true. For large datasets, this parameter helps in reducing computational resources.
n_components The number of samples used in Nystroem sampling.
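For orientation, the sketch below instantiates the combined kernel-tree booster with the settings later reported in Table 5, assuming the open-source KTBoost Python package; the parameter names follow Table 4 and the package's scikit-learn-style interface, and should be verified against the installed version.

```python
# A hedged sketch of the combined kernel-tree booster, assuming the `KTBoost`
# package; X_train/y_train are the balanced, feature-engineered data.
import KTBoost.KTBoost as KTBoost

model = KTBoost.BoostingClassifier(
    base_learner="combined",   # kernel boosting and tree boosting combined
    update_step="hybrid",      # gradient step plus Newton step (Table 4)
    kernel="GW",               # generalized Wendland kernel
    loss="deviance",
    learning_rate=0.2,         # values below follow Table 5 of this study
    max_depth=1863,
    max_leaf_nodes=34,
    n_neighbors=50,
)
model.fit(X_train, y_train)
```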

Unlike previous research works, where these parameters are either selected using the inefficient and time-consuming "trial and error" method or adopted from previous literature, the current study utilizes the intelligence of a population-based metaheuristic optimization technique called the Jaya algorithm to select the most optimal hyperparameters of the KTBoost algorithm. The Jaya algorithm is a gradient-free metaheuristic optimization method for solving constrained and unconstrained optimization problems. It is a stochastic population-based technique that modifies a population of individual solutions on an ordered basis, keeping the notion that each individual solution strives to attain the best solution while avoiding the least fit (worst) one. An important feature that distinguishes this algorithm from other population-based optimization methods is that it does not require any algorithm-specific control parameters for its operation. To avoid computational complexity and to achieve the most optimal results within a limited number of iterations, only eight of the most important hyperparameters (base_learner, kernel, learning_rate, loss, max_depth, max_leaf_nodes, n_neighbors, update_step) are taken as decision variables in the current research work.

To achieve the best solution, the Jaya algorithm undergoes the following sequential steps:
  • Step 1: Initialize the input parameters of the Jaya algorithm ($Pop_{size}$, $Itr_n$) and of the problem to be optimized ($Var_n$), where $Pop_{size}$ is the population size, $Itr_n$ is the maximum number of iterations, and $Var_n$ is the number of design variables of the function to be optimized.
  • Step 2: Randomly initialize the population within the predetermined lower and upper boundaries, as given in Equation (2),
    $$S_{ij} = S_{min,j} + \big(S_{max,j} - S_{min,j}\big) \cdot rand(0,\,1) \qquad (2)$$
where $S_{ij}$ is the solution vector $(S_{i1}, S_{i2}, S_{i3}, S_{i4}, \ldots, S_{in})$, $j = 1, 2, 3, \ldots, n$ (the number of design variables), and $i = 1, 2, 3, \ldots, Pop_{size}$ (the total number of search agents). $S_{max,j}$ and $S_{min,j}$ are the upper and lower bounds of design variable $j$.
  • Step 3: For each solution vector, evaluate the cost function and identify the best and worst solutions.
  • Step 4: Update the solutions as follows:
    $$S_{i,j,m}^{updated} = S_{ij,m} + x_{1,j,m}\,\big(S_{i,best,m} - |S_{ij,m}|\big) - x_{2,j,m}\,\big(S_{i,worst,m} - |S_{ij,m}|\big) \qquad (3)$$
where $x_1$ and $x_2$ are two random numbers in (0, 1) that assist in achieving the right balance between exploration and exploitation. The subtracted term $x_{2,j,m}(S_{i,worst,m} - |S_{ij,m}|)$ steers the solution away from the worst solution, whereas $x_{1,j,m}(S_{i,best,m} - |S_{ij,m}|)$ leads it towards the best solution.
  • Step 5: Bound the updated solutions so that they do not exceed the boundary conditions:
    $$S_{i,j,m}^{updated} = \begin{cases} S_{max,j} & \text{if } S_{i,j,m}^{updated} > S_{max,j} \\ S_{min,j} & \text{if } S_{i,j,m}^{updated} < S_{min,j} \\ S_{i,j,m}^{updated} & \text{otherwise} \end{cases} \qquad (4)$$
  • Step 6: To decide whether the updated solution or the existing solution advances to the next iteration, compute the cost function for each search agent and employ greedy selection: if the revised solution is better than the current solution, it replaces the current one; otherwise, the revised solution is discarded and the current solution is retained in the population.
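The update loop in Steps 1-6 is compact enough to sketch directly; the following NumPy implementation is a minimal illustration, assuming `cost` is the objective to minimize (e.g. one minus the cross-validated accuracy of a classifier built from the candidate hyperparameter vector).

```python
# Minimal Jaya optimizer implementing Equations (2)-(4) with greedy selection.
import numpy as np

def jaya(cost, lower, upper, pop_size=10, iters=35, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    S = lower + (upper - lower) * rng.random((pop_size, lower.size))   # Eq. (2)
    f = np.array([cost(s) for s in S])
    for _ in range(iters):
        best, worst = S[f.argmin()], S[f.argmax()]
        x1, x2 = rng.random((2, pop_size, lower.size))
        S_new = S + x1 * (best - np.abs(S)) - x2 * (worst - np.abs(S))  # Eq. (3)
        S_new = np.clip(S_new, lower, upper)                           # Eq. (4)
        f_new = np.array([cost(s) for s in S_new])
        keep = f_new < f                      # greedy selection (Step 6)
        S[keep], f[keep] = S_new[keep], f_new[keep]
    return S[f.argmin()], f.min()
```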

4 RESULT AND DISCUSSION

In this section, the performance of the proposed theft detection framework is evaluated and compared against the latest ML techniques, such as XGBoost, lightGBM, and the Extra Trees classifier, and traditional ML techniques, such as SVM, logistic regression, KNN, the Ridge classifier, the linear discriminant classifier, and the Naive Bayes classifier. In supervised ML, trained classifiers are validated based on their ability to effectively predict and generalize to unlabelled data. Various performance metrics exist to accomplish this task, as mentioned in [10]. However, it is not practical to assess and analyse all of the metrics specified in that study; thus, a few of the most relevant metrics are considered, as noted below.
$$\mathrm{Accuracy} = \frac{T^+ + T^-}{T^+ + T^- + F^+ + F^-} \qquad (5)$$
$$\mathrm{Recall\ or\ detection\ rate} = \frac{T^+}{T^+ + F^-} \qquad (6)$$
$$\mathrm{False\text{-}positive\ rate} = \frac{F^+}{F^+ + T^-} \qquad (7)$$
$$\mathrm{False\text{-}negative\ rate} = \frac{F^-}{F^- + T^+} \qquad (8)$$
$$\mathrm{Precision\ or\ positive\ predictive\ value} = \frac{T^+}{F^+ + T^+} \qquad (9)$$
$$F_1\text{-}\mathrm{score} = \frac{2T^+}{2T^+ + F^+ + F^-} = 2 \times \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (10)$$
$$\mathrm{Matthews\ correlation\ coefficient\ (MCC)} = \frac{T^+ \cdot T^- - F^+ \cdot F^-}{\sqrt{(T^+ + F^+)(T^+ + F^-)(T^- + F^+)(T^- + F^-)}} \qquad (11)$$
$$\mathrm{Kappa\ value} = \frac{\rho_0 - \rho_e}{1 - \rho_e} \qquad (12)$$
where $T^+$ is the true positive count, $T^-$ the true negative, $F^+$ the false positive, and $F^-$ the false negative; $\rho_0$ is the observed agreement and $\rho_e$ the agreement expected by chance.
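All of the above metrics are available off the shelf in scikit-learn; the sketch below assumes `y_test` and `y_pred` hold the true and predicted labels of the held-out set.

```python
# Computing Equations (5)-(12) with scikit-learn.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef)

scores = {
    "accuracy":  accuracy_score(y_test, y_pred),     # Eq. (5)
    "recall":    recall_score(y_test, y_pred),       # detection rate, Eq. (6)
    "precision": precision_score(y_test, y_pred),    # Eq. (9)
    "f1":        f1_score(y_test, y_pred),           # Eq. (10)
    "mcc":       matthews_corrcoef(y_test, y_pred),  # Eq. (11)
    "kappa":     cohen_kappa_score(y_test, y_pred),  # Eq. (12)
}
```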

At this stage, the dataset developed during the feature engineering process is retrieved for model training and validation purposes. The fetched dataset comprises 1035 days of real consumption data and the 39 additional features mentioned in Table 3. Moreover, the raw input dataset's class distribution was balanced with the robust-SMOTE method prior to feeding it to the algorithm for model training. The train-test split method is used, in which 80% of the data is used for model training and 20% for testing purposes. The proposed theft detection framework utilizes the KTBoost algorithm for model training, while the Jaya algorithm-based metaheuristic optimization is used for its hyperparameter tuning. In this scenario, the objective function is to maximize the model's accuracy by minimizing the difference between the predicted and actual outcomes. Over more than 35 trials/iterations of the Jaya algorithm, the model attained an accuracy of 0.937, as presented in the optimization history plot in Figure 11. The x-axis represents the trial count, while the y-axis shows the accuracy value; the blue dots show the accuracy attained at different combinations of hyperparameters.

FIGURE 11. The proposed model's accuracy values against several optimization trials
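The 80/20 split mentioned above corresponds to the following one-liner, assuming `X_fe` and `y_bal` are the feature-engineered matrix and balanced labels.

```python
# Stratified 80/20 train-test split (random_state is an assumption).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_fe, y_bal, test_size=0.2, stratify=y_bal, random_state=0)
```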

Furthermore, Figures 12 and 13 show the slice and contour plots of the model's hyperparameter optimization process, neatly illustrating the effect of the hyperparameters' variation on the objective value (accuracy). For example, Figures 12 and 13 depict that a learning rate within the range of 1.5 to 2.5 achieves high objective values, but increasing it beyond that produces a considerable reduction in the objective value. Similarly, a max_depth greater than 1500 yields better accuracy values, but increasing it further yields a significant reduction in accuracy, which can be attributed to the model overfitting the training data.

FIGURE 12. Slice plot of the proposed model against several optimization trials
FIGURE 13. Contour plot of the proposed model against several optimization trials

The optimal hyperparameter set, which attained the best accuracy value over the several optimization trials, is given in Table 5. As presented in the table, the combined base learner (kernel boosting and tree boosting) and the hybrid update step achieve the best accuracy value.

TABLE 5. Optimal hyperparameters
Hyperparameter Value
base_learner Combined (Kernel boosting and tree boosting)
kernel GW
learning_rate 0.2
loss deviance
max_leaf_nodes 34
max_depth 1863
n_neighbors 50
update_step hybrid

4.1 K-fold cross-validation results of the Jaya optimized-KTBoost model

To effectively implement the proposed Jaya optimized-KTBoost algorithm, the designed model is first trained on the data developed after the data class balancing and feature engineering stages. Afterward, the tenfold cross-validation (CV) technique, employing the mentioned performance metrics (Equations (5)-(12)), is utilized for the performance evaluation of the designed model. As presented in Table 6, the proposed model achieved a mean accuracy and precision of 0.9338 and 0.9508, with standard deviations (SD) of 0.0029 and 0.0035, respectively.

TABLE 6. Jaya optimized-KTBoost model tenfold-cross validation results
No. of folds Accuracy Recall Precision F1-score Kappa-value MCC
1 0.9311 0.9216 0.9479 0.9345 0.8891 0.8922
2 0.9354 0.9278 0.95 0.9388 0.8705 0.9108
3 0.9354 0.9239 0.9536 0.9385 0.8706 0.9111
4 0.9326 0.9196 0.9524 0.9357 0.8921 0.9123
5 0.937 0.9263 0.9542 0.94 0.8736 0.9201
6 0.939 0.9292 0.9552 0.942 0.8777 0.9021
7 0.9285 0.9191 0.9454 0.9321 0.8921 0.9154
8 0.9331 0.9258 0.9476 0.9366 0.881 0.9125
9 0.9313 0.923 0.947 0.9348 0.887 0.8926
10 0.9344 0.9206 0.9548 0.9374 0.8891 0.9092
Mean 0.9338 0.9318 0.9508 0.9371 0.8873 0.9077
Standard deviation 0.0029 0.0033 0.00365 0.00292 0.0087 0.00931
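A tenfold evaluation such as the one in Table 6 can be reproduced with scikit-learn's cross_validate; `model`, `X_fe`, and `y_bal` are assumed to be the tuned classifier and the balanced, feature-engineered data.

```python
# Tenfold cross-validation over the Table 6 metrics.
from sklearn.metrics import make_scorer, cohen_kappa_score, matthews_corrcoef
from sklearn.model_selection import cross_validate

scoring = {"accuracy": "accuracy", "recall": "recall",
           "precision": "precision", "f1": "f1",
           "kappa": make_scorer(cohen_kappa_score),
           "mcc": make_scorer(matthews_corrcoef)}
cv = cross_validate(model, X_fe, y_bal, cv=10, scoring=scoring)
print(cv["test_accuracy"].mean(), cv["test_accuracy"].std())
```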

4.2 Confusion matrix evaluation of the proposed model

The confusion matrix (CM) is a prominent tool for analysing classification problems. It may be used for both binary and multiclass classification problems. The CM tabulates counts of the actual versus the predicted values, as illustrated in Figure 14. In this study, $T^+$ represents the number of theft consumers rightly classified by the classifier, whereas $F^-$ represents the fraudster consumers misclassified as healthy consumers. Similarly, $T^-$ represents the number of rightly classified healthy consumers, while $F^+$ depicts the healthy consumers misclassified as fraudsters.

FIGURE 14. Confusion matrix for a binary classification problem

The confusion matrix of the proposed model is shown in Figure 15, where "0" represents the actual negative class (healthy consumers) and "1" the positive class (fraudster consumers). The values in the CM are normalized as percentages for readability. From the mentioned figure, it can be observed that the classifier rightly classified 93.16% of the theft consumers, while 6.84% of actual theft consumers were misclassified as healthy. Similarly, 95.25% of healthy consumers were rightly classified, whereas 4.75% of actual healthy consumers were misclassified as theft.

FIGURE 15. Confusion matrix of the proposed theft detection model
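A row-normalized confusion matrix like the one in Figure 15 can be obtained directly from scikit-learn, assuming the held-out predictions from the trained model.

```python
# Row-normalized confusion matrix: rows are the actual classes 0 ("Healthy")
# and 1 ("Fraudster"); entries are per-class percentages.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, normalize="true") * 100
print(cm)   # [[T- rate, F+ rate], [F- rate, T+ rate]]
```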

4.3 AUC-ROC curve of the proposed model

The receiver operating characteristic (ROC) is an important performance metric for evaluating binary classification algorithms [67]. It represents the trade-off between the true positive rate and the false-positive rate of the classifier in a bi-dimensional plot. The area under the ROC curve can be computed using Equation (13),
$$\mathrm{Area\ Under\ the\ Curve\ (AUC)} = \frac{\sum_{j \in \mathrm{positiveTarget}} Rank_j - \frac{P_s(1 + P_s)}{2}}{P_s \cdot N_s} \qquad (13)$$

where $P_s$ represents the number of positive samples, $N_s$ the number of negative samples, and $Rank_j$ the rank of sample $j$ belonging to the positive class. The AUC value is the likelihood that a randomly selected positive data sample will rank higher than a randomly selected negative data sample. The AUC value varies between 0.5 and 1, where 0.5 indicates that the classifier performs random guessing and 1 indicates that the classifier perfectly separates the healthy and theft consumers.

The ROC curve of the proposed classifier is shown in Figure 16; the x-axis represents the FPR, and the y-axis the TPR. The average AUC value of the proposed classifier is 0.98, which indicates that most of the theft and healthy consumers are rightly classified.

FIGURE 16. The ROC curve of the KTBoost classifier
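The ROC curve and AUC of Figure 16 follow from the model's class-membership scores, assuming the classifier exposes predict_proba in the scikit-learn style.

```python
# ROC curve and AUC from the positive-class ("Fraudster") probabilities.
from sklearn.metrics import roc_auc_score, roc_curve

proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
print("AUC =", roc_auc_score(y_test, proba))
```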

4.4 The learning curve of the proposed theft detection model

A learning curve graphically depicts the relationship between the training score and the cross-validated (CV) test score of a classifier for different numbers of training instances [68]. The basic notion of this curve is to check the classifier's generalizing ability on different amounts of data. The learning curve of the proposed classifier is shown in Figure 17. The curves in the graph illustrate the mean scores, while the shaded areas depict the standard deviations above and below the mean across all cross-validations. If the model is flawed because of bias, the training score curve will most likely be more variable than expected. Likewise, if the model is prone to error owing to variance, the cross-validated score will be more unpredictable.

FIGURE 17. The learning curve of the proposed theft detection model

In Figure 17, it can be seen that when the number of data samples is minimal, the model's training score is very high in comparison to the CV score, which is a result of the high bias of the model. In contrast, as the number of training data samples grows, the training score decreases while the CV score increases, albeit with considerable fluctuation due to the model's high variance. Additionally, it is interesting to note from the learning curve that the model's CV score and accuracy settle above 0.9338, implying that the model can accurately distinguish fraudster consumers from healthy consumers.
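A curve like Figure 17 can be produced with scikit-learn's learning_curve utility; the training-size grid below is an assumption for illustration.

```python
# Training and cross-validated scores at increasing training-set sizes.
import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, cv_scores = learning_curve(
    model, X_fe, y_bal, cv=10, train_sizes=np.linspace(0.1, 1.0, 5))
print(train_scores.mean(axis=1), cv_scores.mean(axis=1))
```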

4.5 Proposed model's outcomes interpretation and their impact on training time

In this section, the proposed model's predictions (outcomes) are interpreted. Prediction interpretation is the process by which the input data features utilized for model training are evaluated based on their positive influence on predicting the correct result. In this study, the KTBoost algorithm is employed to rank all the given input features in terms of their contribution to predicting the right outcome.

Because the input training data contains over 1000 features, it is not feasible to display the importance score of every feature in a graph; thus, only the ten most important features are displayed in Figure 18, together with their importance scores. The figure shows that the feature from actual consumption had the highest significance value, followed by statistical features derived from actual consumption. In order to demonstrate the significance of the importance scores assigned by the KTBoost model to each feature, the KTBoost model was re-trained with a much smaller yet essential feature set. Figures 19 and 20 show the computing time required to analyse the entire collection of data features (1071 features) and the 23 most important data features, respectively. As can be seen in the mentioned figures, a substantial decrease in computing time is achieved when the smaller feature set is given.

FIGURE 18. Features' importance derived using the KTBoost classifier
FIGURE 19. Computational time and training loss when the entire feature set is provided for model training
FIGURE 20. Computational time and training loss when only the most essential features are provided for model training

In addition, Figure 21 depicts the effect of the important features on the model's accuracy. The model achieved an accuracy value of 80% when just the five most important features were supplied. By increasing the number of important features from 5 to 23, the model achieved the same accuracy as when trained with all 1071 features. Thus, it can be concluded from this experiment that, if the model is retrained with the most important feature set, the required computational resources can be drastically reduced without any degradation in accuracy.

FIGURE 21. Proposed model performance with the essential feature set
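The retraining experiment above can be sketched as follows, assuming the fitted model exposes feature_importances_ (as tree-based ensembles in the scikit-learn API do); the variable names are hypothetical.

```python
# Retrain on the top-k features ranked by the model's importance scores.
import numpy as np
import KTBoost.KTBoost as KTBoost   # assumed package, as in Section 3.5

k = 23
top_idx = np.argsort(model.feature_importances_)[::-1][:k]
model_small = KTBoost.BoostingClassifier(base_learner="combined",
                                         update_step="hybrid")
model_small.fit(X_train[:, top_idx], y_train)
print(model_small.score(X_test[:, top_idx], y_test))
```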

4.6 Proposed model's comparison against the latest and traditional methods

This section presents a side-by-side comparison of the proposed theft detection framework with a series of well-known traditional machine learning models and the latest bagging and boosting models under an identical feature set. To assess the performance of all studied classifiers, the tenfold cross-validation method is used in conjunction with six of the most commonly used performance measures, namely accuracy, recall, precision, F1-score, Kappa value, and MCC value.

The proposed framework is sequentially implemented in the Google Colaboratory environment (Python 3 Google Compute Engine backend, 12 GB RAM, without GPU). The comparison results are summarized in Table 7. As summarized in the table, the proposed approach surpasses all other ML techniques in terms of accuracy, recall, precision, F1-score, Kappa value, and MCC value, thus evidencing its efficacy and importance. In particular, the proposed model obtained an accuracy of 93.38%, a recall of 93.18%, and a precision of 95.08%, which is considerably better than all competing models.

TABLE 7. Proposed model comparison against latest and traditional ML methods
Model Accuracy Recall Precision F1-score Kappa-value MCC
Proposed model 0.9338 0.9318 0.9508 0.9371 0.8873 0.9077
XGBoost classifier 0.9112 0.9123 0.9012 0.912 0.867 0.875
Extra tree classifier 0.901 0.8921 0.912 0.934 0.854 0.812
SNAP boost algorithm 0.90 0.8912 0.9216 0.9123 0.8412 0.845
lightGBM 0.891 0.8751 0.8631 0.8641 0.8124 0.854
Wide-Deep CNN 0.89 0.812 0.881 0.7921 0.812 0.8213
Gaussian process based boosting 0.885 0.8754 0.8698 0.8412 0.8421 0.7892
Boosted C5.0 algorithm 0.881 0.8541 0.824 0.8121 0.824 0.8245
NGBoost algorithm 0.87 0.861 0.834 0.8251 0.834 0.8964
Random-forest classifier 0.834 0.8123 0.8241 0.8125 0.8453 0.831
SVM - linear Kernel 0.823 0.7601 0.8292 0.7928 0.6042 0.6066
AdaBoost classifier 0.814 0.7562 0.7213 0.745 0.751 0.761
Ridge classifier 0.795 0.7931 0.8584 0.8244 0.6622 0.6641
Quadratic discriminant analysis 0.721 0.2251 0.8911 0.3594 0.1976 0.2974
Logistic regression 0.712 0.8063 0.8482 0.8267 0.6619 0.6627
Linear discriminant analysis 0.698 0.7929 0.8583 0.8243 0.662 0.6639
K-neighbours classifier 0.587 0.6412 0.7606 0.8284 0.6233 0.6356
Naive Bayes 0.54 0.3478 0.6261 0.4472 0.1401 0.1563

5 CONCLUSION

This study presented a novel, sequentially executed data-driven approach for identifying electric fraud in a smart meter dataset. Raw smart meter data often contains several null and irregular values, mostly due to equipment malfunction, poor network, or device storage-related issues. Since most machine learning classifiers cannot process the null values present in the data, this study estimated the missing values using an ensemble machine learning-based predictive modelling technique called XGBoost. Afterward, the robust-SMOTE algorithm was used to balance the class distribution in the acquired data. By considering all regions of minority samples in the dataset, the robust-SMOTE technique produces minority class samples that are less prone to overfitting and noisy sample generation. Once a balanced dataset was obtained, a set of statistical, temporal, and spectral features was extracted from it. These additional features aid the ML classifier in understanding the underlying complicated data patterns contained in the data. Finally, in order to effectively classify the data into "Honest" and "Fraudster" consumers, the Jaya optimized KTBoost classifier was used. The Jaya-KTBoost technique combines kernel boosting and tree boosting, with its hyperparameters tuned by utilizing the intelligence of the Jaya algorithm. The proposed model attained an accuracy of 93.38%, a precision of 95.08%, and a recall of 93.18%, which are significantly higher than all compared methods.

FUNDING INFORMATION

This work was supported by the Fundamental Research Grant Scheme under Grant R.J130000.7851.5F062 through the Ministry of Higher Education, Malaysia.

CONFLICT OF INTEREST

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


DATA AVAILABILITY STATEMENT

The data that support the findings of this study are publicly available at: https://github.com/henryRDlab/ElectricityTheftDetection.