Application of the mixed effects model for analysing photovoltaic datasets and interpreting into meaningful insights

Energy power studies commonly apply the analysis of variance (ANOVA) method in data analysis context to ﬁnd the signiﬁcant difference among many compared groups or to evaluate the impact of an inﬂuential factor. However, most of the datasets in these studies are time series data or longitudinal data, which are collected from the same object over some periods. Therefore, violating the independent assumption of ANOVA is the error commonly made. This leads to the misinterpretation of the comparison tests. In this article, this problem is solved by providing the application of mixed effects model (MEM) as a viable alternative to the ANOVA. In detail, two models based on MEM are proposed to analyse the time series data of micro-inverter PV stations located in Concord city, Mas-sachusetts, USA. In the ﬁrst scenario, the average models are implemented to compare the seasonal variation of monthly yield taking into account the effects of shading conditions and different orientations. In the second scenario, the linear regression model based on MEM is implemented to estimate and compare the decline rate in a 6-year period of PV stations. The analysis results have shown that the models based on MEM performs better than ANOVA in interpreting raw time-series dataset into meaningful insights.

built the Fourier time series model to predict the one-day-ahead and two-day-ahead power consumption. They used ANOVA to choose the most significant frequency components. In cluster analysis study, Jenny Palm et al. [7] proposed the ANOVA test to classify individual activity sequence, i.e. cooking, doing the laundry, or watching television, from the time series data of energy consumption at residential level. However, most of the above studies have ignored the required assumptions in order to apply ANOVA test. The most crucial assumption to apply ANOVA is that the observed data should be independent [8]. Unfortunately, such time series data as in the aforementioned studies clearly violated this assumption. The reason is that time series data come from a single object (for example, PV station, smart meter), hence data taken between two arbitrary intervals are obviously correlated. Therefore, ANOVA test is an inappropriate statistical technique in longitudinal study. In addition, ANOVA is a nonparametric method, which only gives the answer whether or not there is a significant difference between many groups. In the comparison of more than two groups, ANOVA test has failed to address which group is significantly different from the remains. Moreover, ANOVA cannot give in-depth details about the amount of variation within group or between groups caused by many affecting factors.
In this study, we solve the problem by providing the application of mixed effects model (MEM) as a viable alternative to the ANOVA for longitudinal studies in power energy context. In detail, we propose two models based on MEM to analyse the time-series dataset of micro-inverter PV stations located in Concord city, Massachusetts, USA and show their advantage to ANOVA. Firstly, we implement the average models to compare the seasonal variation of monthly yield taking into account the effects of shading conditions and different orientations. Secondly, we implement the linear regression model based on MEM to estimate and compare the decline rate in a 6-year period of PV stations.
Based on the linear model, MEM has been developed to take into account both the variation in time-series data across individuals (repeated measures context) and the variation across groups (multilevel or hierarchical context). MEM has been widely used in other research fields such as biological, clinical, or medical fields. However, a few scholars have explored the power of MEM for longitudinal studies in power energy [9,10].
In general, the key contributions of our paper are summarized as follows: • Increasing the awareness and understanding of the mixed effect model, as a superior method to ANOVA, to interpret the meaningful insights of time-series dataset in power energy; • Proposing the new average model based on MEM to evaluate the variation of energy yield during a year of PV systems under influences of shading and orientation; and • Proposing the new linear decline model based on MEM to evaluate the monthly energy yield of PV systems under the influence of shading.
The article is structured as follows. Section 2 reviews the related studies that applied mixed effects model for longitudinal analysis in power energy and other research fields. Section 3 presents the data structure of micro-inverter PV stations, followed by a decomposition method to analyse these data. The average model and linear decline model based on MEM are proposed in Section 4. Section 5 provides the analysis results of the proposed models. Finally, Section 6 summarizes our study and discusses future directions.

RELATED WORKS
The mixed effects model has been widely used over the last couple of decades in biological, clinical, or medical studies [11][12][13][14][15][16]. This model is suitable to analyse or evaluate the influence of related factors to an observed variable that is repeatedly measured over time such as hourly, daily, or monthly. Such study and repeated measured data are called longitudinal study and longitudinal data or panel data respectively. In recent years, there is a significant number of studies in PubMed and Google Scholar that applied the mixed effects model in many research fields. For example, in hypertension study, Kaneto Mitsumata et al. [17], explored the association of parental hypertension on longitudinal data of blood pressure and metabolism by mixed effects model. In agriculture, Baey Charlotte et al. [18], proposed mixed effects model to describe the interactions among plants and explain the variability observed in the plant's population. In chemistry, Daniel A. H. et al. [19], applied nonlinear mixed effects model in a longitudinal study of a batch recycle trickle bed reactor system. Furthermore, in order to predict the daily consumption of nature gas, Marek Brabec et al. [20], successfully implemented the nonlinear mixed effects model, which took into account the irregularities of individual customer. In power engineering, the mixed effects model has recently been attracted and explored by scholars. For example, Cameron Roach [9] used mixed effects model to generate the impact profiles for daily energy demand of the commercial buildings based on data from smart meters. His model was able to quantify the changes in energy caused by building characteristics. In addition, Paulo Fernando et al. [10], utilized the power of mixed effects model combined with the first-order moving average model and first-order autoregressive errors model to predict the short-term electric power generation from Brazilian distributing utilities . Our previous study in [21] is the first study that found the average degradation rate of micro-inverter PV system of 3% per year by applying the linear version of mixed effects model. To the best of our knowledge, we are among the first research groups exploring the potential applications of mixed effects model to conduct data analytics in power engineering. Such research efforts have opened the new room of applying mixed effects model in many smart grid data analytics in the future. Table 1 illustrates the list of eighteen PV stations that are used in our study. Those PVs all used monocrystalline solar panels and the micro-inverter configuration from Enphase manufacturer [22]. The measurements of generated energy and power are uploaded every five min from the registration day to PVoutput website [23]. In addition, the technical details of each PV station are the rated PV power; the brand, type of solar panel, total number and power of solar panels; the brand, total number and power of inverters; orientation; shading condition; and tilt degrees of solar panel. The shading situations of PV in Table 1 are classified as follows:

MONTHLY DATASET OF PV STATIONS
• No shading: PV panel does not suffer from shading; • Low shading: The shading covers up to 25% of PV panel area; • Medium shading: The shading covers from 25% to 50% of PV panel area; • High shading: The shading covers higher than 50% of PV panel area; These time series data are useful and reliable to perform longitudinal study since all PV stations share the same climate condition (Concord city) and the measurement data are continuously recorded from 2014 to 2019. The monthly yield m of electric energy is chosen as the variable to investigate the energy performance of a PV station in our study. The calculation of this yield is given in Equation (1), where P pv is the rated power of PV station, M is the total number of recorded days of the month of interest, and E day j is the generated energy from PV station on the j th day. The resulting monthly yield of PV stations in Table 1 is represented in Figure 1.
Plotting time series data as in Figure 1 is the important first step in analysing their various components. In the second step, we need a decomposition model for reducing the time series data into three components: trend, seasonal effects, and random errors. From there, we can start to work on evaluation or comparison study based on these components to yield some valuable insights. If we assume an additive decomposition, then we formulate the monthly energy yield m i of the ith PV station as Equation (2),  [24,25]. Since there are 12 months in a year we need to use a moving average of length twelve. This means that at each data point n (n starts at 7) we average the 6 data points behind and 6 data points in front of the position we want to calculate an average value for (a = 6). However, since the moving average is an even length we have to take the average of two averages as Equation (3).
The result of these moving averages yields the overall trend of monthly energy yield from 2014 to 2019 as shown in Figure 2.
Note that there are 6 month lag and 6 month lead on each end of these trends. This is due to the moving average which requires 6 months of data both behind and in front of each month's data to calculate the average. To extract the seasonal component of monthly energy yield, firstly we subtract the trend component from the original data and calculate the mean value for each month as m i − t i . Then we apply the normalization by subtract m i − t i from the mean value of twelve months in a year as Equation (4) to obtain the seasonal component.
The error component, also referred to as the irregular or random component, is composed of all the leftover signals which are not explained by the combination of the trend and  (5).
The decomposition process of monthly yield of PV stations was implemented using R programming version 3.4.1 [26] and decomposing function in R stats package. Then, the trend component t i (n) and seasonal component s i (n) of PV stations are used as observed variables in the MEMs in our study.

THE MIXED EFFECTS MODEL
Mixed effects models are commonly applied when dataset is longitudinal (repeated measures) and classified into many groups of objects. In linear regression aspect, each mixed effects model is combined from two terms: fixed effects and random effects. While the former describes general characteristic of whole dataset or between-groups variability, the latter term reflects the variability among many objects in a group (withingroup variability) or the variability in repeated measurements of each object (within-object variability). By calculating the fixed effects terms of grouped objects, any differences among groups can be evaluated easily. Further related theories and method of mixed effects models are referenced to [27][28][29].
In general, the comparison of many groups using MEM is based on a hypothesis test as below: • The null hypothesis (H 0 ): There is no difference in values of a variable among groups. If the P value (the probability assuming that the null hypothesis is correct) is larger than 0.05, then H 0 can be assumed; • The alternative hypothesis (H 1 ): There is real difference in values of a variable among groups (P < 0.05).  Table 1 used for average models. Level 1 is the seasonal variation of individual PV station. Level 2 is the shading condition groups, which are coded as j = {0, 1, 2, 3} Figure 3 shows the data structure of the seasonal variation v i j (kWh/kW) of the ith PV station and shading group j , collected from the Equation (6) below:

Average model
Our objectives are to find out the common seasonal variation of all PV stations, compare the seasonal variations under different shading conditions, and evaluate the effect of orientation to seasonal variation. We propose three average models based on MEM in order to obtain the results as follow.

Average model 1: Finding common seasonal variation
In the first average model, the v i j of the ith PV station is treated as the level 1 of model and nested into the level 2 of shading condition as Figure 3. The seasonal variation value of the ith PV station is formulated as Equation (7), and V j is defined as Equation (8): where V j is assumed as the seasonal variation value of shading group j . V 0 is the common variation of all PV stations regardless of the shading conditions. r i j is the residual between the measured variation and estimated variation from model in Equation (7) of the ith station. The residual term r i j is assumed to follow a normal distribution with the variance 2 r , r i j ∼  (0, 2 r ). By substituting Equation (8) to Equation (7), Equation (7) is written as Equation (9) below: The term V 0 in Equation (9) is called the fixed-effect part of model since it represents the common seasonal variation of all PV stations during a 6-year period. The term r i j is called the random-effect since it includes the variant elements in the model. This random-effect explains the variability in v i j from many PV stations.

Average model 2: Interpreting the effect of shading
In the second average model, the seasonal variation model is formulated in the same way as Equation (7). Since we explore whether there is any difference in seasonal variations among shading groups j , the Equation (8) is rewritten as Equation (10) below: where V 0 j is assumed as the amount of difference in V j caused by shading group j = {0, 1, 2, 3}. Then Equation (7) is written as Equation (11) below: The fixed-effect part (V 0 + V 0 j j ) in Equation (11) represents the seasonal variation of PV stations belonging to group j . 0 j is the residual between the variation of PV stations in each shading group and the common variation V 0 . The residual term 0 j is assumed to follow a normal distribution with the variance 2 , 0 j ∼  (0, 2 ).

Average model 3: Interpreting the effect of orientation
In the third average model, the seasonal variation model is the same as Equation (7). However, we assume that there exists an effect of the orientation of PV panel (South (coded as w = 0) or South West (coded as w = 1)) to v i j and this effect is constant regardless of the shading conditions. Hence, the value of V j is formulated as Equation (12) below: The Equation (7) is then rewritten as Equation (13) below: The term (V 0 + V 0w w) in Equation (13) is called the fixed-effect part of model since it represents the seasonal variation plus the effect of orientation on the variation of PV stations. The resulting analysis of three average models are shown in Subsection 5.1.1.  Table 1 used for linear decline model. Station which suffers from shading is coded as S = 1, while non-shading PV station is coded as S = 0 Figure 4 shows the data structure of the trend values t i of the ith PV station during 6-year period. In this scenario, PV stations are classified into only two groups: non-shading (S = 0) and shading (S = 1) since we intend to compare the results to our previous study [21] at yearly scale. As the trend values tend to decrease over time, we propose two linear decline models as follows.

The unconditional model
In the unconditional model, we treat the trend dataset in Figure 2 as only one group regardless of the shading conditions. The trend of the ith PV station is represented as Equation (14): where t i is the trend values for individual PV station i = 1, 2, … , 18, measured every month from July 2014 to June 2019 and represented by the index of month n = 0, 1, 2, .., 71. The i and i are named the initial trend (kWh/kW) and decline rate of the ith PV station, respectively. The meaning of initial trend is the first value of trend that was calculated in July 2014 (n = 0). e i is the residual between the measured trend and estimated trend from model of the ith PV station. The residual e i is assumed to follow a normal distribution with zero mean and variance 2 e , e i ∼  (0, 2 e ). Since all the PV stations are assumed as one group, it is reasonable to imply that they all shared a common initial trend and common decline rate. Therefore, we interpret the parameters i and i as Equation (15), where A and B are the common initial trend and the common decline rate of all PV stations, respectively. The residual terms u and v are assumed to follow a normal distribution with their respective variances u ∼  (0, 2 u ) and v ∼  (0, 2 v ). By substituting Equation (15) to Equation (14), Equation (14) is written as below The term (A + Bn) in Equation (16) is called the fixed-effect part of model since it represents the general trend of all PV stations during a 6-year period. The random-effect term (u + vn) includes the variant elements in the model. This random-effect explains the variability in trends from many PV stations.

The conditional model
In this model, the trend model is formulated in the same way as Equation (14). However, the PV systems are classified into nonshading and shading group as Figure 4 since we explore whether there is any difference in trends between two groups. Therefore, the Equation (15) is rewritten as Equation (17) below: where A 0 and B 0 have the same meaning as A and B in Equation (15). The parameters A 1 and B 1 reflect any differences in the initial trend and in the decline rate between two PV groups respectively. The S i equals to 0 if PV station belongs to nonshading group and to 1 if it belongs to shading group as shown in Figure 4. By substituting Equation (17) to Equation (14), then Equation (14) is written as Equation (18) below: (18) In this case, the fix-effect term is [ Unlike the fix-effect term in unconditional model, it now separately represents the decreasing trends of non-shading PV stations and shading PV stations. In addition, if there exists any significant difference between two groups, then A 1 and B 1 identify how large the difference is. The resulting analysis of both unconditional model and conditional model are shown in Section 5.2.1.

RESULTS AND DISCUSSION
All algorithms and models in Section 4 were implemented using R programming version 3.4.1 [26] and mixed effects analysis from nlme package [30] or lme4 package [31]. The random process used the same number of generators to ensure the reproducibility. Table 2 shows the fix-effect results of three average models, respectively. From these results, we conclude the findings about the seasonal variation of PV stations as follows:

Average models based on MEM
• The common seasonal variation for the entire PV stations regardless of shading conditions and orientation is 2.86 (kWh/kW). Furthermore, this result is significant since the corresponding P value is much smaller than 0.05; (average model 1) • Taking into account the shading conditions, the PV stations which suffer from high shading have the largest seasonal variation of 3.38 (= 2.44 + 0.94) (kWh/kW). On the contrary, the non-shading PV station have the smallest seasonal variation of 2.44 (kWh/kW) and the seasonal variation tends to increase whenever the shading level (low, medium, high) increases; (average model 2) • The seasonal variation of high shading group is significantly different from the non-shading group since the corresponding P value is smaller than 0.05 (<0.0001). In addition, the amount of this difference is about +0.94 (kWh/kW). Meanwhile, there are not any significant difference in the seasonal variation of low shading and medium shading groups since its corresponding P values (0.16 and 0.66, respectively) are larger than 0.05; (average model 2) • The impact of orientation to seasonal variation is not significantly different as we expected. Since the P value of South West group is 0.58, which is larger than 0.05. Therefore, there is no significant difference in seasonal variation between two orientations of South and South West. (average model 3) The mean values and its corresponding 95% confidence intervals (CI) in Figure 5 confirms again our conclusions. The overlapping ranges of 95% CI indicate that any differences among PV groups are not significant, except the non-overlapping ranges between high shading group and non-shading group. Table 3 shows the random-effect results of three average models and the corresponding R-squared. This score, calculated as 2 ∕( 2 + 2 r ), represents the proportion of variance in the seasonal variation that is explained by the model. We interpret the meaningful insights about the effectiveness of models as follows: • The R-squared of the average model 2 is higher than that of the average model 3 indicates that the average model 2 is more fit to the dataset than the average model 3. • Based on both the variance and R-squared results, the impact of shading to the seasonal variation is higher than the impact of orientation. The mean values of seasonal variation and its 95% confidence intervals from average model 2 and average model 3 in Table 2 Finally, the Quantile-Quantile (Q-Q) plot in Figure 6 confirms that the assumption of normal distribution of residual terms is rational since if the two sets have the same distribution, the points should fall approximately along a reference line.  Table 4 shows the analysis results from ANOVA to investigate the effects of shading and orientation. In both cases, ANOVA reports that there are no group differences since the P values are larger than 0.05 (0.11 and 0.82). This result confirms the disadvantage of ANOVA in multiple groups comparison. Even if the P value is smaller than 0.05, we have to take multiple pairwise comparisons by using post-hoc test such as Bonferroni test [32] or Tukey's Test [33] to determine the difference between groups. Tables 5 and 6 show the resulting parameters of the fix-effect and random-effect terms of the decline trend modeling, respectively. From these results, the trend models of each PV group are rewritten as follow:

Linear decline model based on MEM
The Q-Q plots of residuals from average models in Table 3   TABLE 4 The results of ANOVA tests to find any difference in seasonal variation of PV groups in The trend model for all PV stations is: The trend model for non-shading PV group (S = 0): The trend model for shading PV group (S = 1): However, there is no significant difference between trend models in Equation (20) and Equation (21) since the obtained P values of A 1 and B 1 are 0.35 and 0.06 which are both greater than 0.05. In addition, the 95% CI. ranges of A 1 and B 1 in Table 5

FIGURE 7
The Q-Q plots of residuals of unconditional model and conditional model in Table 6 also cross the zero value. Therefore, we conclude that there is no significant difference in trend of monthly energy yield of micro-inverter PV systems under shading and non-shading conditions. This means that the common model in Equation (19) can be used to represent the decline trend of two PV groups. This result has confirmed the ability of the micro-inverter PV system to eliminate the effect of shading. Another result is the common decline rate, counted in a year in percentage (DR PV ) of micro-inverter PV systems calculated as Equation (22)  Comparing to the 95% CI. range of DR PV from 2% to 4.3% in our previous study [21] at yearly scale, the 95% CI. range of DR PV in this study is more accurate since its 95% CI. range is smaller, that is 1.7% (1.4% to 3.1%) as in Equation (23) compares to 2.3% (2% to 4.3%) in [21]. Finally, the Q-Q norm plot in Figure 7 confirms that the assumption of normal distribution of residual terms in linear decline model is rational.

Comparison with linear regression model
As mentioned in Section 1, using ANOVA for the time series datasets violates the independent variable assumption. Therefore, we applied the traditional linear regression model to com-pare the results with our proposed models based on MEM.
The models of single linear regression and multiple variable linear regression are shown in Equation (24) and Equation (25) respectively.
The outcome t is the decline trend of PV stations. , , and are the coefficients of models. The variable n is the index of month. The binary variable s represents shading condition, s = 0 for non-shading group and s = 1 for shading group.
The result analysis in Table 7 shows the same parameters found by the models based on MEM in Table 5. With the P value equals 0.1 (which is larger than 0.05), the multiple linear regression also indicates that there is no significant difference in the decline trend between two groups. Furthermore, Table 8 compares the goodness-of-fit of models for two approaches. Besides the MSE and R-squared score, we also applied the Akaike information criterion (AIC). AIC is a method for evaluating how well a model fits the dataset. AIC can be generated from [34,35]. The lower AIC score, the better the model fits the dataset. Our evaluation are shown as follows: • The MSE scores of two methods are about the same since the MEM is actually extended from the linear regression method; • The R-squared and AIC in Table 8 indicate that the models based on MEM are better fit to the datasets than the traditional linear regression models.

CONCLUSIONS
The mixed effects model has arisen as a suitable approach to handle the influence analysis of longitudinal study instead of using the conventional ANOVA method. In this study, we illustrated the applications of this model on the time-serial data of micro-inverter PV stations to evaluate the effect of shading and orientation to the seasonal variation and decline trend of monthly energy yield. By decomposing the observed variables into fixed-effect and random-effect terms, the analysis results from our proposed models provide more details about the impact of influential factors than ANOVA method without violating the assumption of independent variable. The proposed method via mixed effect model in our study could further be applied to other types of datasets in Smart Grid domain such as power consumption from smart meters or energy production from wind turbines.
In further study, we plan to use the mixed effects model to compare the long-term energy efficiency between microinverter PV and string inverter PV system, taking into account various conditions and regions. We will also integrate the mixed effects model into ensemble learning or machine learning models and evaluate the improvement for the forecasting tasks.