Condition monitoring for solder layer degradation in multi-device system based on neural network

Power semiconductor devices (chips) are usually arranged in parallel to increase the power rating of the modules for high power applications like renewable energy. In multi-device systems, uneven degradation of the devices is inevitable. The uneven solder layer degradation of the parallel chips translates into higher thermal resistances for the degraded chips and, according to the electrothermal properties of the devices, the current sharing and temperature distribution between the devices will be affected. This phenomenon will have implications on the global reliability of the power module. In this study, a two-stage neural network (NN) approach is proposed for the diagnosis of the degradation: the first stage NN estimates the power losses of the parallel devices, whose deviations from the reference values are then applied to the NN in the second stage to classify the health condition. This condition monitoring method has been evaluated in on-state experiments at different constant current values, indicating that it could be a suitable strategy for improving the operational reliability of converters employing multi-chip power modules.


Introduction
The demand for power electronic systems, such as in high-voltage direct current (HVDC), wind power and electrified transport, has pushed the development of power converter units to the MW scale, consisting of several power modules in parallel and several chips packaged in a single power module to increase the current rating [1]. The insulated gate bipolar transistor (IGBT) and diode are the most widely used power semiconductor devices for switching and freewheeling, respectively. For the same voltage/current rating, a diode die is thicker than an IGBT die and the area of a diode die is only about one-half of that of an IGBT chip, leading to the thermal resistance of a diode being about twice as high as that of an IGBT device [2]. Thus the silicon diodes could suffer from higher thermal fatigue because of the higher thermal resistance and inevitable power losses. For example, in the case of a doubly fed induction generator (DFIG) wind turbine working around the synchronous speed, the large reactive current through the machine-side converter may increase the diode junction temperature fluctuations sharply. This will reduce the reliability of the wind power converter and induce large operational expense, particularly in offshore systems.
In traditional solder-based power modules, the solder layer is one of the most vulnerable elements and it would age earlier than the bond wire connection [3]. In the case of multi-chip modules, given that the electrothermal characteristics of the devices cannot be absolutely identical, together with the non-uniformity of the cooling system and working conditions of each chip or module, some devices can be subjected to higher stresses. The solder layers of the more stressed devices will degrade first and the increased thermal resistance would accelerate the aging process, causing the uneven degradation in a multi-device system.
The thermal network of a multi-chip/multi-module converter has heat sources coupled in a complex manner. With a conventional Cauer/Foster thermal network, the resulting high-order system is difficult to solve in real-time applications, but a neural network (NN) is an effective tool for solving non-linear and strongly coupled problems, including the possibility of simulating the complex thermal network of the converter [4]. In order to establish a feasible condition monitoring strategy, which will reduce the maintenance cost for offshore wind converters [5,6], this paper evaluates the possibility of combining NN and pattern recognition methods to track the health state of multi-device systems.

Solder layer degradation
The traditional solder-based power module consists of power semiconductor chips attached to a direct-bonded-copper substrate via a solder layer, which is further attached to a baseplate using a second solder layer. This structure is then mounted on a heatsink using a thermal interface material (TIM) for sufficient cooling capability, as shown in Fig. 1. The multi-device system structure can be divided into two levels: several chips packaged in parallel into one high-power module, and more than one power module integrated in a high-power converter.
The die-attach solder layer or baseplate solder layer can age and degrade because of thermomechanical fatigue during long-term operation. The cracks generated along the solder layers, which are the common failure mechanism for the power device solder material [7], will reduce the effective heat dissipation path from the chips through the packaging structure to the water-cooled heatsink and raise the junction temperature of the aged chip. In addition, the temperature difference between the aged device and outside environment makes the heat more likely to dissipate via the [...] For the evaluation of this failure mechanism, and considering that the majority of heat will still be dissipated through the water-cooled heatsink [4], the thermal performance of the system can be evaluated by analysing the thermal distribution on the top and bottom surfaces of the heatsink, especially at the positions immediately below the centres of the chips, which are more sensitive for evaluating the degradation level [8]. As the thickness of the TIM is usually <100 μm when using grease, the temperature at the top surface of the heatsink can be considered as the case temperature of the power module when using a thermocouple to measure this indicator, as shown in Fig. 1. These temperatures (T c and T h corresponding to each chip), combined with the temperatures of the coolant at inlet (T in ) and outlet (T out ), will be used as the inputs of the NN for detecting the uneven degradation in a multi-device system.

Electro-thermal modelling for parallel devices
Evaluating the datasheets, it was found that most of the commercial Si PiN diodes have a high zero temperature coefficient (ZTC) position, close to the rated current, meaning that these diodes would be usually working in the negative temperature coefficient (NTC) region. A solder degraded device will have a higher junction temperature; hence the diode will conduct a larger current when working in the NTC region. Both the high current and high temperature extend the reverse recovery time and increase the power losses of the aged diode leading to accelerated degradation. On the other hand, the positive temperature coefficient (PTC) region is good for current sharing, but the forward voltage would be higher for all parallel devices meaning higher losses in the whole system. The electrothermal coupling phenomenon is analysed in this paper by simulation and experiments.
The electrothermal simulation is conducted in MATLAB/Simulink, where the electrothermal characteristics of two parallel diodes are extracted from a Semikron power module (SKM50GB12T4) with a rating of 1200 V/60 A. The ZTC point of this diode is at 58 A, slightly below its rated current, as shown in Fig. 2 together with the forward characteristic and reverse recovery energy at 25°C and 125°C.
The half-bridge power module (SKM50GB12T4) consists of one IGBT and one diode chip on each leg, as shown in an opened module in Fig. 3a. In the experiments, only one diode in each module will be used so that the whole module can be considered as a single diode module. In order to emulate the increase of the thermal resistance caused by the solder layer degradation, thermal pads are attached to the module baseplate as TIM, effectively increasing the thermal resistance of the device and enabling the emulation of a certain degradation level. Based on the dimensions and thermal conductivity of the pad from Bergquist (Gap Pad 1500), the increment of thermal resistance for adding one thermal pad can be calculated as:

$$\Delta R_{th} = \frac{h}{\lambda A}$$

where h is the thickness, A is the area and λ is the thermal conductivity of the thermal pad. Different degradation levels (DL) can be emulated by varying the number of layers of the thermal pads attached to a power module to affect the equivalent junction-to-case thermal resistance, as shown in Table 1.
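The pad calculation above is the standard one-dimensional conduction formula for a flat slab, thickness divided by conductivity times area. The following is a minimal sketch of that arithmetic. The pad dimensions and conductivity are placeholder assumptions, not values taken from the paper or from a datasheet, and stacking pads in series ignores any contact resistance between layers.

```python
def pad_thermal_resistance(thickness_m: float, area_m2: float,
                           conductivity_w_mk: float) -> float:
    """Conduction resistance of a flat slab, R = h / (lambda * A), in K/W."""
    if thickness_m <= 0 or area_m2 <= 0 or conductivity_w_mk <= 0:
        raise ValueError("thickness, area and conductivity must be positive")
    return thickness_m / (conductivity_w_mk * area_m2)


def stacked_pad_resistance(n_pads: int, thickness_m: float, area_m2: float,
                           conductivity_w_mk: float) -> float:
    """Added resistance of n identical pads in series (contact resistance ignored)."""
    if n_pads < 0:
        raise ValueError("n_pads must be non-negative")
    return n_pads * pad_thermal_resistance(thickness_m, area_m2, conductivity_w_mk)


# Placeholder values (assumptions for illustration only):
# 0.5 mm thick pad, 30 mm x 30 mm contact area, 1.5 W/(m K) conductivity.
H_PAD = 0.5e-3
A_PAD = 30e-3 * 30e-3
LAMBDA_PAD = 1.5

if __name__ == "__main__":
    for n in range(5):
        added = stacked_pad_resistance(n, H_PAD, A_PAD, LAMBDA_PAD)
        print(f"{n} pad(s): +{added:.3f} K/W")
```

With real pad data substituted, the per-layer increments can be compared against the junction-to-case values listed for each degradation level.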
In order to estimate the power losses of the parallel diodes, the forward characteristic and reverse recovery energy are represented in a look-up table with linear interpolation between different temperatures. In the electrothermal model, a controlled power source is connected to the two parallel diodes, whose currents and junction temperatures are determined using the look-up table. The power losses are exported to the thermal network (which represents the differences between the parallel diodes) for estimating their junction temperatures, while the look-up table also gives the on-state V-I curves of the two diodes at different junction temperatures to evaluate the current sharing, as shown in Fig. 4.
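The look-up-table step can be illustrated with a short sketch: two forward V-I curves tabulated at 25°C and 125°C, linear interpolation in temperature, and a bisection search for the current split at which both parallel diodes see the same forward voltage. The curve values are hypothetical straight lines chosen only so that they cross near 58 A. They are not the datasheet characteristics, and the function names are illustrative rather than those of the Simulink model.

```python
import numpy as np

# Hypothetical forward characteristics (NOT datasheet values), sampled on a
# common current grid and chosen so that the two curves cross near 58 A.
I_GRID = np.linspace(0.0, 160.0, 9)   # current grid, A
V_25C = 1.00 + 0.020 * I_GRID         # forward voltage at 25 degC, V
V_125C = 0.71 + 0.025 * I_GRID        # forward voltage at 125 degC, V


def forward_voltage(current_a: float, tj_c: float) -> float:
    """Look up V_f, interpolating linearly in current and in temperature."""
    v_cold = np.interp(current_a, I_GRID, V_25C)
    v_hot = np.interp(current_a, I_GRID, V_125C)
    weight = (tj_c - 25.0) / (125.0 - 25.0)
    return float(v_cold + weight * (v_hot - v_cold))


def share_current(i_total: float, tj1_c: float, tj2_c: float, iters: int = 60):
    """Split i_total so both diodes have equal forward voltage (bisection)."""
    lo, hi = 0.0, i_total
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mismatch = forward_voltage(mid, tj1_c) - forward_voltage(i_total - mid, tj2_c)
        if mismatch < 0.0:
            lo = mid   # diode 1 voltage too low: give it more current
        else:
            hi = mid
    i1 = 0.5 * (lo + hi)
    return i1, i_total - i1


def conduction_loss(current_a: float, tj_c: float) -> float:
    """On-state power loss P = V_f * I, in W."""
    return forward_voltage(current_a, tj_c) * current_a


if __name__ == "__main__":
    for total in (40.0, 160.0):
        i1, i2 = share_current(total, tj1_c=25.0, tj2_c=125.0)
        print(f"{total:.0f} A total: cold diode {i1:.1f} A, hot diode {i2:.1f} A")
```

With these placeholder curves the hotter diode takes the larger share below the crossing point and the smaller share above it, which is the NTC/PTC behaviour described in the text.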

Electro-thermal coupling of parallel devices with uneven degradation
The temperature distribution and forward on-state current sharing between two parallel diodes as affected by uneven solder layer degradation are analysed in this section. Taking diode 1 as the less degraded device (DL 1 according to Table 1) and assuming that the degradation of diode 2 changes from DL 1 to DL 5, the junction temperatures of the diodes have been computed using the electrothermal model presented in Section 2.2. Fig. 5 presents the temperature differences between the two diodes with uneven degradation levels as a function of the total current, which varies from 20 to 160 A, where DL 1-1 to DL 1-5 represent the five combinations of the uneven degradation conditions. Larger degradation differences lead to a higher temperature of the aged diode, by up to about 50°C in the DL 1-5 combination under 160 A total current. The current sharing between the diodes is presented in Fig. 6a. From an electrical point of view, the current difference reaches up to about 5 A within the NTC region, while it is reduced to −11.5 A under 160 A total current in the seriously uneven degradation condition of DL 1-5. When the parallel devices are working within the NTC region, the aged device will carry a larger current, which leads to a higher conduction power loss, hence increasing the thermal stresses on the aged device. However, it is important to remark that although the aged diode conducts less current in the PTC region, the percentage of the power loss reduction is lower than the percentage of thermal resistance increase, as shown in Fig. 6b. This still increases the temperature of the more degraded diode, so its aging process is accelerated over the whole current range.
The power loss of the aged diode is restrained significantly in the constant-on-state conditions considered in this paper. However, the reverse recovery energy of the high temperature diode would still be larger than that of the low temperature diode [9]. This will cause the thermal distribution to change much more significantly under different degradation levels and will also accelerate the aging process of the already aged devices when they are working in a switched condition as in a real converter.
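The electrothermal coupling discussed in this section can be sketched as a fixed-point iteration: the junction temperatures set the current split, the split sets the conduction losses, and the losses set the temperatures. The sketch below is a deliberately simplified stand-in for the simulation model. It assumes a linear diode model with a hypothetical zero temperature coefficient point at 58 A, a single junction-to-coolant thermal resistance per diode, no thermal cross-coupling, and placeholder resistance values of 0.3 and 0.6 K/W.

```python
def diode_params(tj_c: float):
    """Hypothetical linear model V_f = v0(T) + r(T) * I, with a ZTC point at 58 A."""
    v0 = 1.00 - 0.0029 * (tj_c - 25.0)      # threshold voltage falls with T
    r = 0.020 + 0.00005 * (tj_c - 25.0)     # slope resistance rises with T
    return v0, r


def solve_pair(i_total: float, rth1: float, rth2: float,
               t_ref: float = 20.0, iters: int = 500, damping: float = 0.5):
    """Damped fixed-point iteration for two parallel diodes at steady state."""
    tj1 = tj2 = t_ref
    i1 = i2 = 0.5 * i_total
    p1 = p2 = 0.0
    for _ in range(iters):
        v01, r1 = diode_params(tj1)
        v02, r2 = diode_params(tj2)
        # Equal forward voltage: v01 + r1*i1 = v02 + r2*(i_total - i1)
        i1 = (v02 - v01 + r2 * i_total) / (r1 + r2)
        i1 = min(max(i1, 0.0), i_total)
        i2 = i_total - i1
        p1 = (v01 + r1 * i1) * i1            # conduction loss of diode 1, W
        p2 = (v02 + r2 * i2) * i2            # conduction loss of diode 2, W
        # Single-resistance thermal model: Tj = T_ref + Rth * P
        tj1 += damping * (t_ref + rth1 * p1 - tj1)
        tj2 += damping * (t_ref + rth2 * p2 - tj2)
    return {"i1": i1, "i2": i2, "p1": p1, "p2": p2, "tj1": tj1, "tj2": tj2}


if __name__ == "__main__":
    for total in (40.0, 160.0):
        s = solve_pair(total, rth1=0.3, rth2=0.6)   # diode 2 is the degraded one
        print(f"{total:.0f} A: i1={s['i1']:.1f} A, i2={s['i2']:.1f} A, "
              f"Tj1={s['tj1']:.1f} degC, Tj2={s['tj2']:.1f} degC")
```

Under these assumptions the diode with the higher thermal resistance ends up hotter at both operating points, even though it carries the smaller share of the current at 160 A, which mirrors the qualitative conclusion drawn from Fig. 5 and Fig. 6.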

Recognition of degradation levels based on NN
As analysed above, the power losses of parallel diodes depend on the degradation levels and operating point, and are also reflected in the thermal distribution of the heatsink. Therefore, these electrical and thermal indicators are extracted from constant DC current heating experiments to develop an NN-based condition monitoring method: the thermal distribution is first used to estimate the power losses, and the results are then combined with the operating condition to derive the indicators for pattern recognition of uneven degradation levels.
The experimental setup is shown in Fig. 7. Two power modules are positioned in the middle and rear parts of a water-cooled heatsink (Hi-Contact 416601U), where a tiny hole has been drilled below each of the diode chip positions for measuring the case temperature and the heatsink bottom temperature is also measured using another thermocouple. Small holes were machined in the heatsink for measuring the inlet/outlet water temperatures and precisely modelling the thermal response. The cooling water temperature was controlled using a chiller (Lauda WK 4600). During the experiments, the temperature of the cooling water and flow rate are set at 20°C and 80 l/s, respectively, for efficient cooling performance.
The two diodes are connected in parallel as shown previously in Fig. 4 and a defined constant DC current is used for heating up the diode chips. The current through the diodes and the on-state voltage of the diodes are measured using current/voltage probes, and the temperatures are measured using K-type thermocouples with a data acquisition device (PICO TC-08); all signals are monitored and recorded at a sampling rate of 1/s. After the multi-device system reaches electrical-thermal steady state, the temperature and electrical performance are logged for 170 s, for five degradation combinations (DL 1-1 to DL 1-5, as defined in Section 2.3).

Neural network structure
In order to evaluate the thermal dissipation behaviour, the temperature of the three positions on the heatsink top surface (T c1 , T c2 and T c3 ) and on the bottom surface (T h1 , T h2 and T h3 ) together with temperature of inlet water (T in ), outlet water (T out ) and ambient (T amb ) under the constant-on-state with total current varying from 20 to 160 A are extracted for training a NN with the power losses of two diodes as outputs as shown in Fig. 8.
The number of neurons in the hidden layer is set to the square root of the product of the number of inputs and the number of outputs [10]. As the thermal capacitance will contribute to hysteresis and an accumulated effect on the thermal performance of the multi-device system, the output power losses are also used as two additional inputs (feedback) for this network through a 1 s delay. When the temperature information from a degraded power module is applied to a network trained on a healthy device system, the estimated power loss will be smaller than the value calculated from the voltage/current measurements, because the heat dissipating capability of the aged device is restricted by the increase of thermal resistance. Hence the output error reflects how far the thermal behaviour of the unevenly degraded multi-device system departs from that of the non-degraded system.
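As a small illustration of the sizing rule and the delayed feedback described above, the sketch below computes the hidden-layer size and assembles the network inputs from logged data. Several points are assumptions: the result of the square-root rule is rounded to the nearest integer, the two fed-back losses are counted as inputs (nine temperatures plus two feedback values), and measured losses are used as the feedback signal during training. The paper does not state these details, and the function names are illustrative.

```python
import math

import numpy as np


def hidden_neurons(n_inputs: int, n_outputs: int) -> int:
    """Square-root sizing rule; rounding to the nearest integer is an assumption."""
    return max(1, round(math.sqrt(n_inputs * n_outputs)))


def build_training_set(temps, losses):
    """Pair each temperature sample with the power losses of the previous sample.

    temps:  (N, 9) array, one row per 1 s sample (Tc1-3, Th1-3, Tin, Tout, Tamb)
    losses: (N, 2) array of the two diode power losses
    Returns inputs of shape (N-1, 11) and targets of shape (N-1, 2).
    """
    temps = np.asarray(temps, dtype=float)
    losses = np.asarray(losses, dtype=float)
    if temps.shape[0] != losses.shape[0]:
        raise ValueError("temps and losses must have the same number of samples")
    # With 1 s sampling, a one-sample shift implements the 1 s feedback delay.
    inputs = np.hstack([temps[1:], losses[:-1]])
    targets = losses[1:]
    return inputs, targets


if __name__ == "__main__":
    n_in, n_out = 9 + 2, 2
    print("hidden neurons:", hidden_neurons(n_in, n_out))
```

In operation the network's own previous outputs would take the place of the measured losses in the feedback columns.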
Five NNs for estimating the power losses are established first based on the temperature and electrical measurements, each for one of the combinations DL 1-1 to DL 1-5. These are identified as the NN_DL1-1 to NN_DL1-5 in Fig. 9. The temperature information under each certain degradation level is applied to all of the five networks. Hence the output power loss deviations from the five networks together with the forward voltage, currents and temperature differences between the case and heatsink of the two diodes are used to train a second stage NN for classifying the five degradation levels as shown in Fig. 9.
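The input vector of the second-stage network can be sketched as follows. The five first-stage networks are represented by stub functions, since the trained networks themselves are not available here. The sign convention (estimated minus measured), the ordering of the features and the reading of the operating-point inputs as one forward voltage, two currents and two temperature differences are all assumptions made for illustration.

```python
import numpy as np


def deviation_features(estimators, temps, p_measured, v_forward, currents, delta_t):
    """Concatenate power-loss deviations with operating-point measurements.

    estimators: callables mapping a temperature vector to two estimated losses
    p_measured: measured power losses of the two diodes, W
    delta_t:    case-to-heatsink temperature differences of the two diodes
    """
    p_measured = np.asarray(p_measured, dtype=float)
    deviations = [np.asarray(est(temps), dtype=float) - p_measured
                  for est in estimators]
    return np.concatenate([np.ravel(deviations), [v_forward], currents, delta_t])


def make_stub_estimator(scale):
    """Stand-in for a trained first-stage network (purely illustrative)."""
    return lambda temps: scale * np.array([30.0, 30.0])


# Five stubs standing in for NN_DL1-1 ... NN_DL1-5; the numbers are made up.
estimators = [make_stub_estimator(s) for s in (0.8, 0.9, 1.0, 1.1, 1.2)]
temps = np.full(9, 25.0)
features = deviation_features(estimators, temps, p_measured=[30.0, 30.0],
                              v_forward=1.4, currents=[19.0, 21.0],
                              delta_t=[3.0, 4.5])

if __name__ == "__main__":
    print(features.shape)
```

The network whose training condition matches the measured data produces deviations near zero, and it is this pattern across the five networks that the second stage is trained to recognise.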

Power loss estimation and degradation level recognition using NNs
The NNs of each degradation level combination were firstly trained based on the dataset of the whole current range used for power losses estimation. The output and error of the DL 1-3 network together with the measured power losses of the two diodes are shown in Fig. 10 for three different current levels, namely 60, 120 and 160 A. It can be seen that the accuracies of the output estimating power losses of each diode are better than 96%, despite the few measurement errors caused by the current and voltage probes used in the experiments. In different temperature coefficient conditions, the power loss distribution caused by the current sharing between the diodes agrees well with the modelling results presented in Section 2.
This indicates that the NN can be employed to estimate the power losses of the devices. Therefore, the output errors can be applied to recognise the degradation levels effectively, as shown next.
When the temperature dataset of diode 2 at DL 1-3 under 60 A total current is imported into non-corresponding networks for power loss estimation, the output errors are significant, as shown in Fig. 11. As serious solder degradation reduces the heat dissipation through the baseplate, the lower temperatures of DL 1-3 compared to those of DL 1-1 and DL 1-2 cause NN_DL1-1 and NN_DL1-2 to underestimate the power loss of the diode, leading to errors of more than 20 and 10 W, respectively. In contrast, NN_DL1-4 and NN_DL1-5 overestimate the power loss.
Employing the output power loss deviations from the five power-loss-estimation NNs and the operating point conditions of the whole current range as the input, the degradation levels of this multi-device system are classified by the second stage health state recognition network described in Fig. 9. The success rate of recognition is more than 99% in all the five uneven degradation combinations (DL 1-1 to DL 1-5), as shown in the confusion matrix in Fig. 12. Although the errors from measurement and the NN algorithm cause some misclassifications, the overall success rate is adequately high to monitor the uneven degradation in the multi-device system.
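For reference, a confusion matrix and the per-class and overall success rates can be computed from true and predicted class labels as sketched below. The label sequences are toy values invented for illustration and are not results from the experiments. Classes 0 to 4 stand for DL 1-1 to DL 1-5.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, n_classes: int) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm


def success_rates(cm: np.ndarray):
    """Per-class success rate (diagonal / row sum) and overall success rate.

    Assumes every class appears at least once among the true labels.
    """
    per_class = np.diag(cm) / cm.sum(axis=1)
    overall = np.trace(cm) / cm.sum()
    return per_class, overall


# Toy labels (made up): one DL 1-3 sample is misclassified as DL 1-4.
true_labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
predicted_labels = [0, 0, 1, 1, 2, 3, 3, 3, 4, 4]

cm = confusion_matrix(true_labels, predicted_labels, n_classes=5)
per_class, overall = success_rates(cm)

if __name__ == "__main__":
    print(cm)
    print("per-class:", per_class, "overall:", overall)
```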

Conclusion
This paper proposes a potential condition monitoring methodology for parallel devices in a multi-device system based on a two-stage NN approach. According to the electrical-thermal modelling, the more degraded device will maintain a higher junction temperature than a less degraded one, which accelerates the ageing process of the pre-damaged device because it suffers larger electrothermal stresses. The thermal network state is evaluated by the two-stage NN: the first NN is established to track the power losses of the parallel devices, whose output deviations are then used as indicators to recognise the health states of the multi-device system using the second NN. The success rate of this condition monitoring method is higher than 99% according to constant DC current on-state tests. It may also be possible to extend this method to AC/DC rectifiers and DC/AC inverters utilised for renewable energy applications.