High-efficient Reed–Solomon decoder design using recursive Berlekamp–Massey architecture
Abstract
This study presents a highly efficient Reed–Solomon (RS) decoder based on a recursive enhanced parallel inversionless Berlekamp–Massey algorithm (rePIBMA) architecture. Compared with the conventional enhanced parallel inversionless Berlekamp–Massey algorithm (ePIBMA) architecture, the proposed architecture consists of a single processing element and has very low hardware complexity. It also employs a new initialisation to reduce the latency and uses pipelined Galois field multipliers to improve the clock frequency. In addition, the proposed architecture offers dynamic power saving. The proposed RS (255, 239) decoder has been implemented in SMIC 0.18-μm CMOS technology. The synthesis results show that the decoder requires about 13K gates and can operate at 575 MHz, achieving a data rate of 4.6 Gb/s. The proposed RS (255, 239) decoder is at least 28.15% more efficient than previous related designs.
1 Introduction
Owing to their exceptional capability of correcting both random and burst errors, Reed–Solomon (RS) codes are widely used in communication systems such as wireless systems, space communication links and digital subscriber loops, as well as in memory and data storage systems. The main decoding methods for RS codes are hard-decision decoding (HDD) and algebraic soft-decision decoding (ASD). Much effort [1-6] has been devoted to ASD because of its significant coding gain. However, ASD is considerably more complex to implement than HDD. Hence, HDD has broader practical application than ASD in high-speed, low-complexity designs.
A conventional HDD RS decoder generally consists of three main blocks: the syndrome computation (SC) block, the key equation solver (KES) block and the Chien search and error evaluation (CSEE) block [7]. Fig. 1 illustrates the structure of the pipelined RS decoder. The KES block is the most critical and hardware-intensive block of the RS decoder. It is generally implemented with the modified Euclidean (ME) algorithm [8] or the Berlekamp–Massey (BM) algorithm [9] to find both the error locator and the error evaluator polynomials. The conventional ME architecture includes degree computation and comparison circuits as well as a polynomial computation circuit, which leads to relatively high hardware complexity and long latency. In contrast, the degree-computationless ME (DCME) algorithm [10] uses a different initialisation to remove the degree computation and comparison circuits. The enhanced DCME [11] and the simplified DCME [12] further reduce the complexity. Similarly, many variants of the BM algorithm have been proposed and corresponding architectures [13-15] have been developed. Some of these BM architectures achieve lower complexity and a simpler control unit than the ME architecture at similar throughput.
Recently, Wu [16] presented a new reduced-processing approach to high-speed RS decoding, which is based on a new error evaluation formula derived from the Horiguchi–Koetter formula. He also proposed an enhanced parallel inversionless Berlekamp–Massey algorithm (ePIBMA) architecture, which requires 2t + 1 processing elements (PEs) to determine the error locator after 2t iterations. However, because its latency is much shorter than that of the SC and CSEE blocks, the ePIBMA architecture is idle during most of the decoding cycles, which results in very low hardware efficiency. In this paper, we propose a recursive ePIBMA (rePIBMA) architecture that uses a single PE to achieve a high-speed, area-efficient KES architecture. Moreover, this architecture uses a new initialisation to further reduce the latency.
The rest of this paper is organised as follows. Section 2 reviews the background and the algorithm related to this work. The proposed architecture is presented in Section 3. Implementation results and performance comparisons are described in Section 4. Finally, Section 5 concludes.
2 Background and existing algorithm
The ePIBMA introduces the polynomial and . Compared with the reformulated inversionless Berlekamp–Massey (RiBM) algorithm, the ePIBMA calculates the discrepancy required for the next update in the same way, via the polynomial . The update of has two modes, depending on the condition. Let L_{δ} and L_{θ} be the degrees of and Θ(x), respectively. is updated to when L_{θ} > L_{δ}, while it remains the same at other times. The ePIBMA architecture removes the t PE units that are used to calculate Ω(x), and requires 2t + 1 PEs to solve the key equation. After 2t iterations, the error locator polynomial Λ(x) and the scratch polynomial B(x) can be obtained from and Θ(x). In addition, the separate loop logic z is used to accumulate α^{−(t+e−2)} in the ePIBMA. In the CSEE block, z is multiplied by itself in each cycle to calculate the error evaluation.
In addition, the degree of B(x) is at most t–1 ignoring zeros [16]. On the other hand, when L_{θ} equals t–1, Λ(x) contains all error locations. Therefore, when the number of errors e is less than t, limiting the degree of B(x) to t − 1 can terminate the ePIBMA early, which is power efficient. As a result, after 2e iterations, we can get appropriate Λ(x) and B(x) for further computation.
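As a point of reference for the inversionless update discussed above, the following Python sketch implements the textbook inversionless Berlekamp–Massey iteration. It is not the ePIBMA itself: it omits the ePIBMA-specific polynomials, the early-termination test and the z logic. The primitive polynomial 0x11D, the convention that the syndromes are successive powers of α evaluated on the error pattern, and all function names are assumptions made for illustration.

```python
# GF(2^8) log/antilog tables. The primitive polynomial 0x11D is an assumption;
# the paper does not state which field polynomial its decoder uses.
PRIM_POLY = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(poly, x):
    """Evaluate a polynomial (lowest degree first) at x by Horner's rule."""
    acc = 0
    for coeff in reversed(poly):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def inversionless_bm(syndromes, t):
    """Textbook inversionless BM; returns a scalar multiple of Lambda(x)."""
    lam = [1] + [0] * t          # Lambda(x), lowest degree first
    b = [1] + [0] * t            # scratch polynomial B(x)
    gamma, k = 1, 0
    for r in range(2 * t):
        # Discrepancy: sum over i of lam[i] * s[r - i]
        delta = 0
        for i in range(min(r, t) + 1):
            delta ^= gf_mul(lam[i], syndromes[r - i])
        shifted_b = [0] + b[:-1]  # x * B(x), truncated to degree t
        # Lambda <- gamma * Lambda - delta * x * B (minus equals plus in GF(2^m))
        new_lam = [gf_mul(gamma, l) ^ gf_mul(delta, sb)
                   for l, sb in zip(lam, shifted_b)]
        if delta != 0 and k >= 0:
            b, gamma, k = lam, delta, -k - 1
        else:
            b, k = shifted_b, k + 1
        lam = new_lam
    return lam
```

For an error pattern with at most t errors, the returned coefficients are a non-zero scalar multiple of Λ(x), so evaluating them with `poly_eval` at α^{−p} still identifies each error position p. No division is needed at any step, which is the property the inversionless variants exploit in hardware.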
This paper considers (n, k) RS codes defined over the Galois field GF(2^{m}), where n = 2^{m} − 1 for primitive codes, k is the number of m-bit message symbols and t = (n − k)/2 is the number of correctable symbol errors. The decoder in [16] is a syndrome-based RS decoder pipelined in three stages. The decoding timing schedule is shown in Fig. 2. Compared with the SC and CSEE blocks, the KES block is idle for a large share of the time, which reduces the hardware utilisation.
3 Proposed RS decoder
3.1 SC block
Let R(x) = r_{n−1}x^{n−1} + · · · + r_{1}x + r_{0} be the received polynomial, where r_{n−1}, …, r_{1}, r_{0} are the received symbols. The syndrome values are then calculated as s_{i} = R(α^{i}), 0 ≤ i ≤ 2t − 1, where α is a primitive element of GF(2^{m}). The architecture of the SC block is shown in Fig. 3. After n clock cycles, the 2t syndromes have been computed and are passed to the KES block.
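The syndrome computation can be sketched in Python as 2t accumulators, each updated once per received symbol by Horner's rule. This is a behavioural model only, not the circuit of Fig. 3. The primitive polynomial 0x11D, the choice α = 2 and the function names are assumptions; the paper does not state its field polynomial.

```python
def gf_mul(a, b, prim_poly=0x11D):
    """Shift-and-add multiplication in GF(2^8) (field polynomial assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim_poly
        b >>= 1
    return result


def syndromes(received, t):
    """Return [s_0, ..., s_{2t-1}] with s_i = R(alpha^i).

    received[0] is r_{n-1} (the first symbol to arrive) and
    received[-1] is r_0.
    """
    # Constant multipliers alpha^0 .. alpha^(2t-1), with alpha = 2 (assumed).
    alpha_pows = [1]
    for _ in range(2 * t - 1):
        alpha_pows.append(gf_mul(alpha_pows[-1], 2))

    regs = [0] * (2 * t)
    for symbol in received:  # one pass of this loop models one clock cycle
        regs = [gf_mul(s, a) ^ symbol for s, a in zip(regs, alpha_pows)]
    return regs
```

With this model, an all-zero received word gives all-zero syndromes, and a single error of value v at position p gives s_i = v · α^{ip}. Each accumulator needs one constant multiplier, one adder and one m-bit register.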
3.2 Proposed KES block
As described above, the ePIBMA architecture requires 2t + 1 PEs to determine the error locator after 2t iterations. Hence, the recursive PE (rPE) is divided into 2t + 1 pipelining stages to update coefficients of and Θ(x), and this processing concurrently operates during 2t clock cycles. Therefore, a direct implementation of the ePIBMA using a single rPE requires about 2t × (2t + 1) clock cycles to solve the key equation.
This latency is long and therefore reduces the throughput. Lu and Shieh [17] proposed different initial settings of the BM algorithm that reduce hardware consumption for VLSI implementation. Furthermore, with these settings, the first iteration (r = 0) can easily be removed when s_{0} = 0, since no additional operation is required for updating the relevant polynomials. However, when s_{0} ≠ 0, the second iteration is not unique. Hence, removing an iteration is not appropriate in all situations. In addition, the benefit of the new initialisation is not obvious in the ordinary ePIBMA architecture, because the latency is determined by the longest stage of the pipelined RS decoder and still amounts to n clock cycles. We present a recursive implementation of the ePIBMA architecture together with new initialisations. The proposed rePIBMA architecture removes one iteration and reduces the latency to (2t − 1) × 2t + 1 clock cycles, which increases the achievable throughput.
We reformulate the conventional ePIBMA; the proposed rePIBMA with the new initialisations is described by the pseudo-code in Fig. 4.
The proposed rePIBMA architecture is illustrated in Fig. 5. A pipelined GF multiplier is used to improve the clock frequency. The rPE unit shown in Fig. 5a continuously updates each coefficient of and Θ(x) according to the proposed rePIBMA. At first, initial values are stored in and register arrays. Next, one coefficient of and one coefficient of Θ(x) are updated simultaneously in each clock cycle. The updated coefficients are then stored in and register arrays and fed back to the rightmost registers for the next update. The control signal M2 is used to set to 0 in the last clock cycle of each iteration.
According to the proposed rePIBMA, the initialisation is not unique and depends on s_{0}. Hence, the new initialisation of the rePIBMA architecture uses one clock cycle to select the appropriate initial values of some control signals as well as according to the value of s_{0}. The new initialisation of the proposed KES block is shown in Table 1. We use j in the control unit to check whether s_{0} = 0. Although has different initial values depending on s_{0}, the register arrays always store s_{1} to s_{2t−2}, 0, 1 regardless of s_{0}. When s_{0} ≠ 0, j is set to 1 and in this case choose s_{1} to s_{2t−2}, 0, 1 stored in the register array as initial values. Correspondingly, the initial value of γ is set to s_{0} and the other control signals are initialised as shown in Table 1. Note that the control signal k = L_{δ} − L_{θ}. When s_{0} = 0, j is set to 0, which makes the MUX in the rPE unit set to into 0 as initial values in the first iteration. Moreover, the initial value of γ is set to 1 and the initial values of the other signals are changed accordingly. Hence, the first iteration can be removed and the initialisation takes one clock cycle. Note that bridge registers 1 to 3 are timing registers to make fed to update in the proper time.
(a) rPE unit

register | Bridge 1 to 3 | z |
---|---|---|
initial values | all 0 | |

(b) Control unit

register | | | | |
---|---|---|---|---|
initial values | 1 | 1 | 0 | 1 |
 | −1 | 1 | 0 | |
In addition, the rePIBMA architecture must also provide the value of z for the CSEE block. The z-generating unit is shown in Fig. 5b. In the rePIBMA architecture, the signal M1 controls z so that it is updated once every 2t − 1 clock cycles. The original initial value of z is 1 according to the ePIBMA. With the new initialisation scheme, z should be set to the value of the next update. In the rePIBMA, the updated z remains the same when L_{θ}(r) ≥ t − 1, which cannot happen in the first iteration. Therefore, the new initialisation scheme sets z to α^{−1}.
As noted for the ePIBMA, limiting the degree of B(x) to t − 1 allows the algorithm to terminate early. Hence, in the proposed architecture, once the degree of B(x) reaches t − 1, the discrepancy update invokes only one multiplier, and the adder and the other multiplier are automatically disabled. In other words, the architecture then acts as a shift register array. When e < t, the proposed architecture therefore saves power dynamically. In all cases, the rePIBMA architecture with the new initialisation solves the key equation after 1 + (2t − 1) × 2t = 4t^{2} − 2t + 1 clock cycles.
The control unit illustrated in Fig. 6 generates the required control signals for the rePIBMA architecture. The signals not explained earlier are as follows: MC1 stands for , MC3 stands for L_{θ}(r) ≥ t − 1. In addition, in the proposed rePIBMA, should be set to 0 in order to eliminate the calculation of Ω(x). Hence, in the rePIBMA architecture, the iteration number r and the clock cycle l within each iteration are used to control MC2. MC2 becomes 1 in the previous cycle of each iteration in order to set to 0.
3.3 Pipelined CSEE block
After the KES block finishes its computation for each codeword, the error locator polynomial Λ(x) and the scratch polynomial B(x) are fed into the CSEE block to generate the error values. The CSEE block is adopted from [16]. To implement a high-speed RS decoder, the CSEE block is also pipelined. The pipelined CSEE block is illustrated in Fig. 7. In the first two clock cycles, the signal m_{1} is set to 1 and the signal m_{2} is set to select γ and γ · λ_{0}, respectively. After four clock cycles, the signal m_{3} toggles from 1 to 0. As a result, the unit that calculates the numerator of the error value outputs γ, γ · λ_{0}, γ · λ_{0} · z, γ · λ_{0} · z^{2}, · · · in successive clock cycles. The inverter is also pipelined in five stages to shorten the path delay. Although the pipelined CSEE block introduces a delay of eight clock cycles before the corrected values of the first received codeword are output, this delay can be absorbed by the idle time of the second stage. Hence, the CSEE block still takes only n clock cycles to calculate all error values for each codeword and the latency of our decoder is not increased.
Note that the inputs γ · λ_{0} and z^{2} must be calculated before the CSEE block starts. To avoid extra area and latency, γ · λ_{0} and z^{2} are calculated by the structure in the dashed box of Fig. 5a once the KES block has finished computing the coefficients of Λ(x) and B(x). Hence, the complete KES block delivers its result after 4t^{2} − 2t + 3 clock cycles.
4 Performance comparisons
4.1 Latency and hardware analyses
Fig. 8 illustrates the timing relationship among the three blocks of the proposed RS (255, 239) decoder. Since the total number of clock cycles in the KES unit is 243, which is less than the codeword length 255, there is no performance degradation in the proposed rePIBMA architecture. Compared with the original ePIBMA architecture, the idle time of the KES block could be greatly reduced. This implies that the hardware utilisation of the proposed decoder can be improved significantly.
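The cycle counts quoted above follow directly from t. The short Python sketch below tabulates them for the RS (255, 239) code. The function and key names are illustrative only, and the direct single-rPE figure uses the approximate count 2t × (2t + 1) given in Section 3.2.

```python
def kes_cycle_counts(t):
    """Clock-cycle counts of the KES block as functions of t."""
    # Direct single-rPE implementation of the ePIBMA (approximate count).
    direct = 2 * t * (2 * t + 1)
    # rePIBMA: 1 initialisation cycle plus (2t - 1) * 2t update cycles.
    locator = 1 + (2 * t - 1) * 2 * t
    # Two further cycles, attributed in the text to gamma*lambda_0 and z^2.
    total = locator + 2
    return {"direct": direct, "locator": locator, "total": total}


if __name__ == "__main__":
    n, k = 255, 239
    t = (n - k) // 2
    counts = kes_cycle_counts(t)
    print(counts)
    print("fits within n cycles:", counts["total"] <= n)
    print(f"share of the n-cycle stage in use: {counts['total'] / n:.1%}")
```

For t = 8 the direct count of 272 cycles would exceed the codeword length of 255, whereas the 243 cycles of the complete rePIBMA block fit within one n-cycle pipeline stage.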
Table 2 lists the hardware complexity of each component in the proposed decoder. The Galois field multiplier is described in [18] and the inverter is realised as in [19]. A GF(2^{8}) multiplier and a GF(2^{8}) inverter are implemented with 267 and 395 gates, respectively.
Block | GF(2^{8}) mult | GF(2^{8}) constant mult | GF(2^{8}) adder | Reg (bit) | Mux 2-1 (bit) | GF(2^{8}) inverter |
---|---|---|---|---|---|---|
SC | 0 | 16 | 16 | 128 | 0 | 0 |
KES | 2 | 1 | 1 | 295 | 8 | 0 |
control unit | 0 | 0 | 0 | 57 | 17 | 0 |
CSEE | 4 | 15 | 15 | 308 | 4 | 1 |
Table 3 shows the comparison of the critical path delay, latency and the hardware complexity for the existing KES blocks of the various RS (255, 239) decoders. Lee [20] and Yuan et al. [21] implemented the conventional ME algorithm and the DCME algorithm with pipelined recursive architectures to achieve high speed and low hardware complexity, respectively. The authors in [22-24] presented three differently folded architectures based on the RiBM algorithm and significantly reduced the hardware complexity.
KES architecture | Critical path delay | Latency | Multipliers | Adders | Latches | Muxes |
---|---|---|---|---|---|---|
proposed rePIBMA | 3T_{xor2} + T_{mux2} | 243 | 2 + 1^{a} | 1 | 38 | 8 |
FRiBM [22] | T_{mult} | 224 | 4 | 2 | 61 | 5 |
PF_RiBM [23] | 2T_{xor2} + T_{nand2} + T_{mux2} | 225 | 4 | 2 | 61 | 5 |
rDCME [21] | – | 260 | 4 | 2 | 90 | 24 |
single-mode RiBM[24] | T_{muxt} + T_{mult} + T_{add} | 133 | 8 | 4 | 50 | 25 |
PrME [20] | 3T_{or2} + T_{xor2} + T_{mux2} | 255 | 4 | 2 | 170 | 30 |
^{a} This is a constant multiplier.
It can be observed that the proposed KES block requires fewer multipliers, adders and latches than the existing architectures. In the rePIBMA architecture, the multiplier is pipelined; hence, the critical path of the KES block lies in the z-generating unit. Compared with [22, 24], the proposed KES block has a shorter critical path delay. Compared with [24], the proposed architecture has a longer latency because of its recursive structure, but its hardware complexity is considerably lower. In addition, compared with the architectures of [20-23], which have similar latency, the CSEE block of the proposed architecture has three more multipliers. Although the proposed method reduces the multipliers in the KES block at the expense of more multipliers in the CSEE block, the saving in the KES block is larger. Hence, the proposed decoder is more area-efficient overall.
4.2 Implementation results
Architecture | proposed rePIBMA | FRiBM [22] | PF_RiBM [23] | rDCME [21] | Single-mode RiBM [24] | PrME [20] |
---|---|---|---|---|---|---|
tech. | 0.18 μm | 0.18 μm | 90 nm | 0.18 μm | 0.18 μm | 0.13 μm |
SC | 2,900 | – | 6,950 | 2,900 | 2,900 | 3,000 |
KES | 4,650 | – | 5,632 | 11,400 | 9,566 | 17,000 |
CSEE | 5,400 | – | 8,750 | 4,100 | 4,100 | 4,600 |
total gates | 12,950 | 12,668 | 21,332 | 18,400 | 16,566 | 24,600 |
f_{max}(MHz) | 575 | 425 | 750 | 640 | 400 | 625 |
throughput(Gb/s) | 4.6 | 3.4 | 6 | 5.1 | 3.2 | 5 |
TSNT | 491.83 | 371.62 | 194.72 | 383.78 | 267.46 | 203.25 |
The technology-scaled normalised throughput (TSNT) of the proposed decoder is 28.15%, 32.35%, 83.89%, 141.98% and 152.58% higher than that of the designs in [21], [22], [24], [20] and [23], respectively. The result indicates that our design is more efficient than the other existing designs once they are scaled to the same technology.
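The TSNT row can be reproduced from the throughput, technology and gate-count rows. The Python sketch below assumes TSNT = throughput (Mb/s) × (technology / 0.13 μm) / (total gates / 1000). This formula is not stated in the text, but it matches every tabulated value to two decimal places. The sketch uses 16,566 gates for the single-mode RiBM design, which is the sum of that column's SC, KES and CSEE entries, and all identifiers are hypothetical.

```python
REF_TECH_UM = 0.13  # assumed reference technology for the scaling

# name: (technology in um, total gates, throughput in Gb/s), from the table
DESIGNS = {
    "proposed rePIBMA": (0.18, 12950, 4.6),
    "FRiBM [22]": (0.18, 12668, 3.4),
    "PF_RiBM [23]": (0.09, 21332, 6.0),
    "rDCME [21]": (0.18, 18400, 5.1),
    "single-mode RiBM [24]": (0.18, 16566, 3.2),
    "PrME [20]": (0.13, 24600, 5.0),
}


def tsnt(tech_um, gates, throughput_gbps):
    """Technology-scaled normalised throughput (formula assumed, see text)."""
    throughput_mbps = throughput_gbps * 1000
    return throughput_mbps * (tech_um / REF_TECH_UM) / (gates / 1000)


def improvement_percent(ours, theirs):
    """Relative gain of `ours` over `theirs`, in percent."""
    return (ours / theirs - 1) * 100


if __name__ == "__main__":
    ours = tsnt(*DESIGNS["proposed rePIBMA"])
    for name, params in DESIGNS.items():
        value = tsnt(*params)
        gain = improvement_percent(ours, value)
        print(f"{name:24s} TSNT = {value:7.2f}  gain = {gain:7.2f}%")
```

The throughput row itself is consistent with one 8-bit symbol per clock cycle, for example 575 MHz × 8 bit = 4.6 Gb/s for the proposed decoder.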
5 Conclusion
This paper presents the rePIBMA architecture, which uses a single rPE unit to reduce the hardware complexity. In addition, pipelining gives the RS decoder a very short critical path. A new initialisation is employed to reduce the latency. The design also offers dynamic power saving. As a result, the proposed RS (255, 239) decoder is at least 28.15% more efficient than previous related designs and is better suited to the requirements of modern high-speed, low-complexity communication systems.
6 Acknowledgments
The authors thank the National Natural Science Foundation of China (grant no. 61474080) and the Program for New Century Excellent Talents in University of China for supporting this work.