A regression model for extreme events and the presence of bimodality with application to energy generation data

Funding information: CNPq; CAPES; Santo Antônio Hydroelectric Plant

Abstract
The application of the theory of extreme values has been growing due to increasing interest in extreme natural events. Many articles on extreme values in data modelling consider unimodal data. This work introduces an appropriate regression for extreme values to detect the presence of bimodality by means of systematic components of two parameters of the odd log-logistic log-normal distribution. The global influence is addressed to verify the model robustness and to find possible influential points. Quantile residuals are proposed to detect distribution deficiencies and outliers in the new regression. A real dataset from the electricity generation area is analysed, namely the Santo Antônio Hydroelectric Plant in the state of Rondônia (Brazil), to illustrate the potential of the new regression. The main results indicate that the proposed regression can identify changes in the means and variability of the power generation between extreme events, that is, between the months of June and December.

Kummer beta GGu distribution. However, none of these distributions are able to model bimodal data, and not all of these authors studied the effect of covariables on the response variable, that is, constructed regression models.
Conscious consumption of electricity is essential for sustainable development. For example, Halvorsen and Larsen (2001) [5] analysed data in order to find factors that influence residential electricity consumption in Norway, Filippini and Pachauri (2004) [6] estimated the price and income elasticities of residential demand for electricity in all urban areas of India, and Arisoy and Ozturk (2014) [7] estimated the price and income elasticities of demand for industrial and residential electricity in Turkey for the period between 1960 and 2008.
This paper analyses the average daily potential to generate electricity (measured in megawatt-hours) by the Santo Antônio Hydroelectric Plant in Rondônia (Brazil). For this purpose, it proposes a new regression for bimodal data. Figure 1a shows that the histogram of the energy data has a bimodal shape. The tests indicate that this regression is appropriate for these data. Figure 1b shows the presence of extreme events with similar behaviour in June and December, and also similarity between July and November. It rains very little in June and July in relation to the previous months, and there is evidence of a large accumulation of water in the reservoir between January and May. Thus, in the months of drought (June and July), the plant continues to generate energy at levels similar to those of November and December, since the rains occur regularly in November with a considerable increase in December.
The proposed regression is also applied to analyse extreme events. In recent years, regression analysis has been investigated carefully in extreme value theory. The location-scale regression (Lawless, 2003) [8] is frequently used for weather events. This paper proposes a regression model under the odd log-logistic log-normal (OLLLN) distribution (Ozel et al., 2018) [9]. In fact, this distribution is very flexible for modelling extreme events and bimodal data. The parameters are estimated via maximum likelihood.
Unless otherwise stated, the results in the paper are new and original. They can encourage further research on the new regression. Monte Carlo simulations are conducted to study the accuracy of the OLLLN regression in terms of variances and mean squared error (MSE) measures.
Some assumptions are verified to detect influential observations in the regression. Quantile residuals (qrs) are defined to check the underlying distribution and to detect outliers in the regression.
The rest of the article is structured as follows. Section 2 reviews some characteristics of the OLLLN distribution. Section 3 defines the OLLLN regression for extreme events with two systematic components and discusses the maximum likelihood estimators (MLEs) of the parameters and the qrs. Section 4 generates data based on selected parameter values for three simulation studies to evaluate the accuracy of the estimators in the OLLLN distribution and its associated regression, and the adequacy of the normal approximation for the qrs. Section 5 contains an empirical demonstration of the utility of the new regression. Finally, Section 6 contains the concluding remarks.

THE OLLLN DISTRIBUTION
The log-normal (LN) distribution has many applications in different areas. It is characterised by a strong positive asymmetry and the occurrence of a large number of low values and a small number of high to very high values. Its cumulative distribution function (cdf) and probability density function (pdf) are (for $y > 0$):
$$G_{\mu,\sigma}(y) = \Phi\!\left(\frac{\log y - \mu}{\sigma}\right) \tag{1}$$
and
$$g_{\mu,\sigma}(y) = \frac{1}{y\,\sigma\sqrt{2\pi}}\exp\!\left\{-\frac{(\log y - \mu)^{2}}{2\sigma^{2}}\right\}, \tag{2}$$
respectively. Here, the location parameter is $\mu \in \mathbb{R}$, the scale parameter is $\sigma > 0$ and $\Phi(\cdot)$ is the standard normal cdf. Let $W \sim \mathrm{LN}(\mu, \sigma)$ be a random variable having density function (2). The mean and variance of $W$ are $E(W) = e^{\mu + \sigma^{2}/2}$ and $V(W) = e^{2\mu + \sigma^{2}}(e^{\sigma^{2}} - 1)$, respectively. In recent years, various methods for generalising or modifying this distribution have been investigated. By integrating the log-logistic density function, Ozel et al. (2018) [9] defined the OLLLN cdf (for $y > 0$):
$$F(y) = \frac{G_{\mu,\sigma}(y)^{\nu}}{G_{\mu,\sigma}(y)^{\nu} + \bar{G}_{\mu,\sigma}(y)^{\nu}}, \tag{3}$$
where $\bar{G}_{\mu,\sigma}(y) = 1 - G_{\mu,\sigma}(y)$, and $\nu > 0$ is a shape parameter. By setting $\nu = 1$ in Equation (3), one obtains the cdf of the LN distribution in Equation (1). Henceforth, $G(y) = G_{\mu,\sigma}(y)$ and the OLLLN density is written as (for $y > 0$):
$$f(y) = \frac{\nu\, g_{\mu,\sigma}(y)\,\{G(y)\,[1 - G(y)]\}^{\nu - 1}}{\{G(y)^{\nu} + [1 - G(y)]^{\nu}\}^{2}}. \tag{4}$$
Henceforth, the random variable $Y$ has the OLLLN density (4). Clearly, the LN distribution follows when $\nu = 1$. The asymmetry and kurtosis of $Y$ are more flexible than those of the LN distribution. Another motivation for the OLLLN distribution is to have bimodality.
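The integration step behind the OLLLN cdf can be made explicit. The lines below are a short sketch, assuming the standard odd log-logistic construction: the log-logistic density with unit scale and shape $\nu$ is integrated up to the odds $G/\bar{G}$ of the LN cdf, where $G = G_{\mu,\sigma}(y)$, $\bar{G} = 1 - G$ and $t$ is a dummy integration variable introduced here for illustration.

```latex
F(y) = \int_{0}^{G/\bar{G}} \frac{\nu\, t^{\nu-1}}{(1 + t^{\nu})^{2}}\, dt
     = \left[ \frac{t^{\nu}}{1 + t^{\nu}} \right]_{0}^{G/\bar{G}}
     = \frac{(G/\bar{G})^{\nu}}{1 + (G/\bar{G})^{\nu}}
     = \frac{G^{\nu}}{G^{\nu} + \bar{G}^{\nu}} .
```

Differentiating the last expression with respect to $y$ gives the OLLLN density, since the numerator of the derivative collapses to $\nu\, g\, (G\bar{G})^{\nu-1}(G + \bar{G})$ with $G + \bar{G} = 1$.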
Some plots of the OLLLN density are given in Figure 2. Figure 2a,b show the behaviour of the pdf of $Y$ when two of the parameters are varied in turn. Figure 2a reveals that the parameter varied there acts as a kind of dispersion parameter. In Figure 2c, the behaviour of the scale parameter is clear. This distribution is more flexible than the LN distribution, especially in terms of bimodality when $0 < \nu < 0.5$. The bimodality property was not investigated by Ozel et al. (2018) [9]. However, the OLLLN distribution can model bimodal and asymmetric data.
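The quantile function (qf) takes the simple form:
$$y = Q(u) = \exp\{\mu + \sigma\,\Phi^{-1}(w_u)\}, \quad u \in (0, 1), \tag{5}$$
where
$$w_u = \frac{u^{1/\nu}}{u^{1/\nu} + (1 - u)^{1/\nu}}$$
and $\Phi^{-1}(\cdot)$ is the standard normal qf. Values of $Y$ can be generated by inversion: draw $u$ from the uniform distribution on $(0, 1)$ and set $y = Q(u)$. Some shapes of the density function (4) for fixed parameters are given in Figure 3. The main characteristic of the OLLLN distribution is that it can model bimodal data. These plots confirm that the generated values agree with the OLLLN distribution.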
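The cdf, density, qf and inversion sampler can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (the paper works in R): the function names `ollln_cdf`, `ollln_pdf`, `ollln_qf` and `ollln_rvs` are hypothetical, the shape parameter is written `nu`, and the parameter values in the last lines are chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm


def ollln_cdf(y, mu, sigma, nu):
    """OLLLN cdf: G^nu / (G^nu + (1 - G)^nu), with G the log-normal cdf."""
    z = (np.log(y) - mu) / sigma
    G, S = norm.cdf(z), norm.sf(z)  # sf = 1 - cdf, accurate in the upper tail
    return G**nu / (G**nu + S**nu)


def ollln_pdf(y, mu, sigma, nu):
    """OLLLN density, i.e. the derivative of the cdf above."""
    z = (np.log(y) - mu) / sigma
    G, S = norm.cdf(z), norm.sf(z)
    g = norm.pdf(z) / (y * sigma)  # log-normal density
    return nu * g * (G * S) ** (nu - 1) / (G**nu + S**nu) ** 2


def ollln_qf(u, mu, sigma, nu):
    """Quantile function: inverts the cdf for u in (0, 1)."""
    u = np.asarray(u, dtype=float)
    a, b = u ** (1 / nu), (1 - u) ** (1 / nu)
    # Use whichever tail probability is small, so it never rounds to 1.
    z = np.where(u < 0.5, norm.ppf(a / (a + b)), norm.isf(b / (a + b)))
    return np.exp(mu + sigma * z)


def ollln_rvs(n, mu, sigma, nu, rng):
    """Inversion sampling: Y = Q(U) with U ~ Uniform(0, 1)."""
    return ollln_qf(rng.uniform(size=n), mu, sigma, nu)


rng = np.random.default_rng(2016)
sample = ollln_rvs(1000, mu=1.0, sigma=0.08, nu=0.17, rng=rng)
```

The qf switches between `norm.ppf` and `norm.isf` because, for small `nu`, the argument of the normal qf can be so close to 1 that it rounds to exactly 1 in floating point; working with the small tail probability avoids infinite draws.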

THE OLLLN REGRESSION
In several real problems from different fields, it is important to verify the relationship of two or more variables. This type of modelling is called regression analysis, and determines how the explanatory variables explain the variability of the response variable. The data collection allows understanding the dependence between variables and carrying out studies for formulation of the regression, estimation and inference of the parameters, diagnostic and residual analysis. Some studies of regression models in extreme events can be found in the literature. For example, Barreto-Souza and Vasconcellos (2011) [10] introduced an extended extreme value regression, Pinheiro [16] studied the odd log-logistic regression based on a generalised inverse Gaussian distribution.
Following these ideas, a regression based on the OLLLN distribution for bimodal data is defined here using likelihood techniques. A new regression is introduced for extreme events by modelling the scale and dispersion parameters. The OLLLN regression is defined by the distribution (4) of the response variable $Y$, where the parameters $\mu_i$ and $\sigma_i$ are related to a known vector $x_i$ of explanatory variables through the two systematic components in Equation (6). Let $(y_1, x_1), \ldots, (y_n, x_n)$ be a sample of $n$ independent observations. The total log-likelihood function for the unknown parameter vector $\theta$, which collects the coefficients of the two systematic components and the shape parameter $\nu$, follows from Equations (4) and (6):
$$l(\theta) = n\log\nu + \sum_{i=1}^{n}\log g_i + (\nu - 1)\sum_{i=1}^{n}\log\{G_i\,(1 - G_i)\} - 2\sum_{i=1}^{n}\log\{G_i^{\nu} + (1 - G_i)^{\nu}\}, \tag{7}$$
where $G_i$ and $g_i$ denote the LN cdf and pdf evaluated at $y_i$ with parameters $\mu_i$ and $\sigma_i$. The MLE $\hat{\theta}$ of $\theta$ can be calculated numerically by maximising (7) using the gamlss package (Stasinopoulos and Rigby, 2007) [17] of the R software. This process should be run from several initial values, since it often leads to more than one maximum. The final estimate corresponds to the largest of the maxima. Confidence intervals for the parameters (in large samples) can be constructed via the normal approximation for the estimates. The global influence and residual analysis are explored after fitting the regression. The generalised Cook distance identifies the influential observations. Another measure is the likelihood distance $LD_i(\theta) = 2\,[\,l(\hat{\theta}) - l(\hat{\theta}_{(i)})\,]$, where the notation "(i)" refers to the deletion of the $i$th observation (Cook, 1977) [18].
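The multi-start maximisation described above can be sketched with SciPy. This is an illustration under stated assumptions, not the gamlss fit used in the paper: the identity link for the location parameter and the log link for the dispersion parameter are assumptions made here, the names `ollln_logpdf`, `neg_loglik` and `fit_ollln` are hypothetical, the design matrix is assumed to carry the intercept in its first column, and the simulated data and true values are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def ollln_logpdf(y, mu, sigma, nu):
    """Log-density of the OLLLN distribution, evaluated on the log scale."""
    z = (np.log(y) - mu) / sigma
    log_G, log_S = norm.logcdf(z), norm.logsf(z)
    log_g = norm.logpdf(z) - np.log(y * sigma)  # log-normal log-density
    return (np.log(nu) + log_g + (nu - 1) * (log_G + log_S)
            - 2 * np.logaddexp(nu * log_G, nu * log_S))


def neg_loglik(theta, y, X):
    """Negative total log-likelihood; theta = (beta1, beta2, log nu)."""
    p = X.shape[1]
    mu = X @ theta[:p]                                      # assumed identity link
    sigma = np.exp(np.clip(X @ theta[p:2 * p], -20, 20))    # assumed log link
    nu = np.exp(np.clip(theta[-1], -20, 20))                # keeps nu positive
    return -np.sum(ollln_logpdf(y, mu, sigma, nu))


def fit_ollln(y, X, log_nu_starts=(-1.0, 0.0, 1.0)):
    """Maximise the likelihood from several starts and keep the best fit."""
    p = X.shape[1]
    beta1, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)   # least-squares start
    resid_sd = np.std(np.log(y) - X @ beta1)
    beta2 = np.r_[np.log(resid_sd), np.zeros(p - 1)]
    fits = [minimize(neg_loglik, np.r_[beta1, beta2, s], args=(y, X),
                     method="Nelder-Mead",
                     options={"maxiter": 5000, "maxfev": 5000})
            for s in log_nu_starts]
    return min(fits, key=lambda r: r.fun)  # largest of the maxima


# Illustrative data: one binary covariate, generated by inversion sampling.
rng = np.random.default_rng(1)
n = 300
x = rng.integers(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
mu_true, sigma_true, nu_true = 0.5 + 1.0 * x, 0.3, 0.5
u = rng.uniform(size=n)
a, b = u ** (1 / nu_true), (1 - u) ** (1 / nu_true)
z = np.where(u < 0.5, norm.ppf(a / (a + b)), norm.isf(b / (a + b)))
y = np.exp(mu_true + sigma_true * z)

fit = fit_ollln(y, X)
beta1_hat, nu_hat = fit.x[:2], np.exp(fit.x[-1])
```

Optimising over `log nu` rather than `nu` keeps the shape parameter positive without explicit bounds. The likelihood distance for observation $i$ could then be obtained by refitting without that observation and taking twice the difference of the maximised log-likelihoods.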
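Residual analysis investigates the suitability of the OLLLN regression based on residuals. Some graphical techniques for residual analysis are: plots of residuals versus fitted values; plots of residuals versus the order of the data; and simulated envelope plots of the residuals. The well-known qrs can be expressed as
$$qr_i = \Phi^{-1}\!\left\{\frac{\hat{G}_i^{\hat{\nu}}}{\hat{G}_i^{\hat{\nu}} + (1 - \hat{G}_i)^{\hat{\nu}}}\right\}, \qquad \hat{G}_i = \Phi\!\left(\frac{\log y_i - \hat{\mu}_i}{\hat{\sigma}_i}\right), \tag{8}$$
where $\Phi^{-1}(\cdot)$ is the standard normal qf and the hats denote MLEs.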
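A quantile residual is the standard normal quantile of the fitted cdf evaluated at the observation. The sketch below is an illustration only: the function name `quantile_residuals` is hypothetical, and the parameter values are illustrative. It evaluates the residuals at the true parameters of simulated data, in which case they coincide with the normal scores of the uniform draws.

```python
import numpy as np
from scipy.stats import norm


def quantile_residuals(y, mu_hat, sigma_hat, nu_hat):
    """qr_i = Phi^{-1}(F(y_i)), with F the fitted OLLLN cdf."""
    z = (np.log(y) - mu_hat) / sigma_hat
    # Log-odds of F: nu * [log G - log(1 - G)], stable in both tails.
    log_odds = nu_hat * (norm.logcdf(z) - norm.logsf(z))
    # F = 1 / (1 + exp(-log_odds)); use the smaller tail probability.
    lower = norm.ppf(np.exp(-np.logaddexp(0.0, -log_odds)))
    upper = norm.isf(np.exp(-np.logaddexp(0.0, log_odds)))
    return np.where(log_odds < 0, lower, upper)


# Illustrative check: simulate by inversion, then compute residuals
# at the true parameter values.
rng = np.random.default_rng(7)
n = 2000
x = rng.binomial(6, 0.5, size=n)
mu, sigma, nu = 0.5 + 1.9 * x, 0.4 + 0.1 * x, 0.2
u = rng.uniform(size=n)
a, b = u ** (1 / nu), (1 - u) ** (1 / nu)
z = np.where(u < 0.5, norm.ppf(a / (a + b)), norm.isf(b / (a + b)))
y = np.exp(mu + sigma * z)

qr = quantile_residuals(y, mu, sigma, nu)
```

With estimated rather than true parameters, the residuals are only approximately standard normal, which is what the normal probability plots with simulated envelopes assess.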

SIMULATION STUDIES
Three Monte Carlo simulation studies are carried out. First, the precision of the estimates in the OLLLN distribution is checked. Second, the behaviour of the estimates for the OLLLN regression defined in the previous section is studied. Third, the normal approximation for the empirical distribution of the residuals is analysed.

The OLLLN distribution
Some properties of the MLEs of the OLLLN distribution are evaluated under a classical analysis using Monte Carlo simulations by setting $\mu = 1$, $\sigma = 0.08$ and $\nu = 0.17$. The simulation process is as follows:
• The sample sizes n = 50, 100, 300 and 1000 correspond to four scenarios;
• Calculate the estimates $\hat{\mu}$, $\hat{\sigma}$ and $\hat{\nu}$;
• Repeat the previous step 1000 times to obtain the average estimates (AEs), biases and MSEs.
The figures of the simulation reported in Table 1 clearly indicate that the MLEs are accurate. In addition, the MSEs decrease to zero when n increases.
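The three summary measures are not written out in the text. Under the usual convention, assumed here, with $R = 1000$ replicates, $\theta$ a generic true parameter value and $\hat{\theta}^{(r)}$ its estimate in replicate $r$ (notation introduced for this illustration), they read:

```latex
\mathrm{AE}(\hat{\theta}) = \frac{1}{R}\sum_{r=1}^{R}\hat{\theta}^{(r)}, \qquad
\mathrm{Bias}(\hat{\theta}) = \mathrm{AE}(\hat{\theta}) - \theta, \qquad
\mathrm{MSE}(\hat{\theta}) = \frac{1}{R}\sum_{r=1}^{R}\bigl(\hat{\theta}^{(r)} - \theta\bigr)^{2}.
```

Under these definitions, the MSE equals the squared bias plus the Monte Carlo variance of the estimates (computed with divisor $R$), so an MSE that shrinks with $n$ implies that both components shrink.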

The OLLLN regression
The second simulation study examines the precision of the estimates in the OLLLN regression for n = 100, 300 and 500. Ten thousand samples are simulated in two scenarios ($\nu = 0.2$ and $\nu = 1.3$) by considering $\mu_i = 0.5 + 1.9\,x_{i1}$ and $\sigma_i = 0.4 + 0.1\,x_{i1}$ (for $i = 1, \ldots, n$). The observations are generated from $Y_i \sim \mathrm{OLLLN}(\mu_i, \sigma_i, \nu)$ and $x_{i1} \sim \mathrm{Binomial}(6, 0.5)$.
The results of the simulations in Table A.1 (Section A) indicate that:
• The MSEs tend to zero when n increases.
• The AEs are in agreement with the true parameters for large n.
So, the finite sample distribution of the estimators can be well approximated by the normal distribution.

Empirical distribution of the residuals
Here, samples of 1000 observations are generated following the same scheme as in the previous simulation study. Figures 4 and 5 provide normal probability plots of the qrs. They indicate that the empirical distribution of these residuals is close to the normal distribution. The normal probability plot with simulated envelopes can therefore be adopted for these residuals.

ENERGY GENERATION DATA
Here, the OLLLN regression is applied to evaluate the mean daily energy generation potential of the Santo Antônio Hydroelectric Plant in Porto Velho (Brazil). It has 50 bulb-type turbines for power generation with a capacity of about 71.6 megawatts (MW) each. It is the fourth largest hydroelectric plant in operation in Brazil and one of the largest in the world. This plant is extremely important for the supply of power to the national grid in Brazil, besides contributing significantly to meeting regional demand for electricity in the states of Rondônia and Acre. The data were collected in 2016 and refer to n = 213 daily averages from June to December. An OLLLN regression is constructed with nonlinear functions for $\mu$ and $\sigma$ in the presence of bimodality. The purpose is to detail the relationship between the daily mean and dispersion of generation by the plant with other explanatory factors, thus shedding some light on the behaviour of the response variable during this time frame. The following variables are considered in this study:
• $y_i$: mean daily electricity generation in megawatt-hours (MWh);
• $d_{ij}$: month (levels: 0 = June to 6 = December). There are six dummy variables for $i = 1, \ldots, 213$, $j = 1, \ldots, 6$.
Table 2 lists some descriptive statistics of these data, which have positive asymmetry and negative kurtosis.
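The coding of the month covariable into six dummy variables with June as the reference level can be sketched as follows. This is an illustration only: the helper `month_dummies` and the short vector of month codes are hypothetical and do not reproduce the plant's data.

```python
import numpy as np


def month_dummies(month_code, n_levels=7):
    """Six 0/1 columns d_1..d_6 for codes 0 (June, reference) to 6 (December)."""
    month_code = np.asarray(month_code)
    # Column j is 1 when the code equals j (j = 1, ..., 6); June gives all zeros.
    return (month_code[:, None] == np.arange(1, n_levels)).astype(int)


codes = np.array([0, 0, 1, 3, 6, 6])  # hypothetical month codes for six days
D = month_dummies(codes)
X = np.column_stack([np.ones(len(codes), dtype=int), D])  # intercept + dummies
```

With this coding, the intercept of each systematic component describes June, and the coefficient of the $j$th dummy measures the difference between month $j$ and June.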
The GGu distribution, very popular for modelling extreme events, is an alternative approach for modelling these data. Its cdf and pdf, given in Equations (9) and (10), respectively, are defined for $y \in \mathbb{R}$ and depend on one real-valued parameter and two positive parameters. For four distributions, the estimates and their standard errors (SEs) (in parentheses) are calculated, along with the Akaike information criterion (AIC), Bayesian information criterion (BIC) and global deviance (GD), which are reported in Tables 3 and 4. The OLLLN distribution includes the LN distribution as a sub-model, which allows their relative comparison. The likelihood ratio (LR) statistic in Table 5 supports the OLLLN distribution for these data.
The plots of the estimated densities and cdfs of the OLLLN, LN, N and GGu distributions, together with the histogram and the empirical cdf (black curve) for the energy data, are shown in Figure 6. Once more, the OLLLN distribution is shown to be suitable for these data. Figure 6a reveals the bimodal shape of the energy generation data, which the LN, N and GGu distributions cannot capture.
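The model comparison measures are not written out in the text. The definitions below are the usual ones, assumed here: $l(\hat{\theta})$ is the maximised log-likelihood, $p$ the number of estimated parameters and $n$ the sample size. In gamlss the BIC is reported as the Schwarz Bayesian criterion (SBC).

```latex
\mathrm{GD} = -2\, l(\hat{\theta}), \qquad
\mathrm{AIC} = \mathrm{GD} + 2p, \qquad
\mathrm{BIC} = \mathrm{GD} + p \log n,
\\[4pt]
\mathrm{LR} = 2\,\bigl[\, l(\hat{\theta}_{\mathrm{OLLLN}}) - l(\hat{\theta}_{\mathrm{LN}}) \,\bigr].
```

Because the LN distribution is the special case $\nu = 1$ of the OLLLN distribution, the LR statistic is compared with a chi-squared distribution with one degree of freedom under the null hypothesis $\nu = 1$. Smaller values of GD, AIC and BIC indicate a better fit.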

Estimation in the OLLLN regression
The parameters $\mu_i$ and $\sigma_i$ are modelled by non-linear equations in the month dummies $d_{i1}, \ldots, d_{i6}$ (for $i = 1, \ldots, 213$). The estimates, SEs and p-values are reported in Table 6. The covariable "month" is significant for both components, in $\mu$ and in $\sigma$, at the 5% significance level. Some final interpretations of the fitted regression are addressed at the end of this section. Table 7 lists the goodness-of-fit statistics for three fitted regressions, showing that the OLLLN regression gives the smallest values of these measures.
The LR statistic for comparing two fitted regressions in Table 8 supports the OLLLN regression. The null model is rejected at the 5% significance level, thus giving clear evidence that the shape parameter $\nu$ is necessary when modelling data of this type. The global influence measures for the fitted regression are displayed in Figure 7. They show that only observation #187 is a possible influential case. Further, Figure 8a displays plots of the residuals for the fitted OLLLN regression, which reveal that they behave randomly and that only observation #187 lies outside the range [−3, 3]. Thus, there are no indications against the current assumptions for the fitted regression. The normal plot for the residuals with simulated envelope in Figure 8b shows only one point outside the envelope. Hence, the OLLLN regression is adequate for the energy data. Figure 8c reports the estimated cdfs based on the fitted OLLLN regression and the empirical cumulative function in specified months.

Interpretations of the systematic component for $\mu$
The covariable month has June as reference for $\mu$.
• The covariable $d_{i6}$ is not significant at the 5% level, so there is no evidence of a difference in the means for June and December.
• A relevant difference exists between June on the one hand and July, August, September, October and November on the other. Since the estimates are negative, there is evidence that the average daily energy generation declined from July to November in relation to June.
• Figure 8c shows the absence of a significant difference among the months of August, September and October. The same applies to the months of July and November as well as June and December. The OLLLN regression manages to detect similar behaviour between June and December and also between July and November (extreme events). Thus, there are three clusters, namely "June and December", "July and November", and "August, September and October".
• Regarding these three clusters, the highest energy generation occurs in June and December, followed by July and November. In turn, the lowest generation is in the cluster "August, September and October".

Interpretations of the systematic part of $\sigma$
For the systematic component of the dispersion parameter $\sigma$, the estimates related to August ($d_{i2}$) and September ($d_{i3}$) are significantly negative, thus indicating a lower variability in energy generation in these two months in relation to June, as noted in Figure 8c. So, the OLLLN regression is accepted for these data.

CONCLUDING REMARKS
This paper presents the OLLLN regression, with two systematic components, for extreme events and the presence of bimodality. The proposed regression can be a valuable tool, especially for bimodal data as well as extreme energy values. The parameters are estimated via maximum likelihood, and influence diagnostics and model performance are investigated based on qrs. The new regression is suitable to explain daily average power generation by the Santo Antônio Hydroelectric Plant located in Rondônia, Brazil.