Volume 10, Issue 4 p. 381-386
Research Articles

High-efficient Reed–Solomon decoder design using recursive Berlekamp–Massey architecture

Wenjie Ji

School of Electronic Information Engineering, Tianjin University, Tianjin, 300072 People's Republic of China

Wei Zhang

School of Electronic Information Engineering, Tianjin University, Tianjin, 300072 People's Republic of China

Xingru Peng

School of Electronic Information Engineering, Tianjin University, Tianjin, 300072 People's Republic of China

Yanyan Liu

Corresponding Author

College of Electronic Information and Optical Engineering, Nankai University, Tianjin, 300071 People's Republic of China

First published: 01 March 2016
Citations: 12


This study presents a highly efficient Reed–Solomon (RS) decoder based on the recursive enhanced parallel inversionless Berlekamp–Massey algorithm (rePIBMA) architecture. Compared with the conventional enhanced parallel inversionless Berlekamp–Massey algorithm (ePIBMA) architecture, the proposed architecture consists of a single processing element and has very low hardware complexity. It employs a new initialisation to reduce the latency and uses pipelined Galois-field multipliers to raise the clock frequency. The proposed architecture also supports dynamic power saving. The proposed RS (255, 239) decoder has been implemented in SMIC 0.18-μm CMOS technology. The synthesis results show that the decoder requires about 13K gates and can operate at 575 MHz to achieve a data rate of 4.6 Gb/s. In terms of technology-scaled normalised throughput, the proposed RS (255, 239) decoder is at least 28.15% more efficient than previous related designs.

1 Introduction

Owing to their exceptional capability of correcting both random and burst errors, Reed–Solomon (RS) codes are widely used in communication systems such as wireless systems, space communication links and digital subscriber loops, as well as in memory and data storage systems. The main decoding methods for RS codes are hard-decision decoding (HDD) and algebraic soft-decision decoding (ASD). Many efforts [1-6] have been devoted to ASD because of its significant coding gain. However, ASD is far more complex to implement than HDD. Hence, HDD has broader practical application than ASD in high-speed, low-complexity systems.

A conventional HDD RS decoder generally consists of three main blocks: the syndrome computation (SC) block, the key equation solver (KES) block and the Chien search and error evaluation (CSEE) block [7]. Fig. 1 illustrates the structure of the pipelined RS decoder. As the most critical and hardware-intensive block of the RS decoder, the KES block is generally implemented with the modified Euclidean (ME) algorithm [8] or the Berlekamp–Massey (BM) algorithm [9] to find both the error locator and the error evaluator polynomials. The conventional ME architecture includes degree computation and comparison circuits as well as a polynomial computation circuit, which leads to high hardware complexity and long latency. In contrast, the degree-computationless ME (DCME) algorithm [10] uses a different initialisation to remove the degree computation and comparison circuits. The enhanced DCME [11] and the simplified DCME [12] further reduce the complexity. Similarly, many variants of the BM algorithm have been proposed and corresponding architectures [13-15] have been developed. Some of these BM architectures achieve lower complexity and a simpler control unit than the ME architecture at a similar throughput.

Fig. 1. RS decoder structure

Recently, Wu [16] presented a reduced-complexity procedure for high-speed RS decoding, based on a new error evaluation formula deduced from the Horiguchi–Koetter formula. He also proposed an enhanced parallel inversionless Berlekamp–Massey algorithm (ePIBMA) architecture, which requires 2t + 1 processing elements (PEs) to determine the error locator after 2t iterations. However, because its latency is much shorter than that of the SC and CSEE blocks, the ePIBMA architecture is idle during most of the decoding cycles, which results in very low efficiency. In this paper, we propose a recursive ePIBMA (rePIBMA) architecture that uses a single PE unit to achieve a high-speed, area-efficient KES architecture. Moreover, this architecture uses a new initialisation to further reduce the latency.

The rest of this paper is organised as follows. Section 2 reviews the background and the algorithm related to this work. The proposed architecture is presented in Section 3. Implementation results and performance comparisons are described in Section 4. Finally, Section 5 concludes.

2 Background and existing algorithm

Define the syndrome polynomial $S(x) = s_0 + s_1 x + s_2 x^2 + \cdots + s_{2t-1} x^{2t-1}$, the error locator polynomial $\Lambda(x) = \Lambda_0 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_e x^e$ and the error evaluator polynomial $\Omega(x) = \Omega_0 + \Omega_1 x + \Omega_2 x^2 + \cdots + \Omega_{e-1} x^{e-1}$, where t is the error correction capability and e is the number of errors. The KES block is used to solve the key equation $\Lambda(x) S(x) \equiv \Omega(x) \pmod{x^{2t}}$. The conventional BM algorithm calculates both the error locator polynomial Λ(x) and the error evaluator polynomial Ω(x), and then passes them to the CSEE block. Starting from the Horiguchi–Koetter formula, Wu [16] defined a new error evaluation approach that avoids the computation of the error evaluator polynomial Ω(x). Through a series of derivations, the computation of the error magnitudes is simplified to an expression (given in [16]) in which the superscript “(·)” denotes the iteration number, $\Lambda_0$ is the first coefficient of Λ(x), $X_i$ is the ith error locator and B(x) is the scratch polynomial. According to the simplified error evaluation formula, the ePIBMA is presented without calculating Ω(x), and the ePIBMA architecture can cooperate with the enhanced parallel Chien search and error evaluation (ePCSEE) architecture to achieve an efficient RS decoder.

The ePIBMA introduces the polynomials urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0006 and urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0008. As in the reformulated inversionless Berlekamp–Massey (RiBM) algorithm, the ePIBMA calculates the discrepancy urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0010 required for the next update via the polynomial urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0012. The update of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0014 has two modes, depending on the conditions. Let Lδ and Lθ be the degrees of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0016 and Θ(x), respectively. urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0018 is updated to urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0020 when Lθ > Lδ, and remains unchanged otherwise. The ePIBMA architecture removes the t PE units that are used to calculate Ω(x), and requires 2t + 1 PEs to solve the key equation. After 2t iterations, the error locator polynomial Λ(x) and the scratch polynomial B(x) can be obtained from urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0022 and Θ(x). In addition, separate loop logic z is used to accumulate $\alpha^{-(t+e-2)}$ in the ePIBMA. In the CSEE block, z is multiplied by itself in each cycle to calculate the error evaluation.

In addition, the degree of B(x) is at most t–1 ignoring zeros [16]. On the other hand, when Lθ equals t–1, Λ(x) contains all error locations. Therefore, when the number of errors e is less than t, limiting the degree of B(x) to t − 1 can terminate the ePIBMA early, which is power efficient. As a result, after 2e iterations, we can get appropriate Λ(x) and B(x) for further computation.
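For orientation, the inversionless BM iteration that this family of algorithms builds on can be sketched in software. The block below is the textbook inversionless BM recursion, not the ePIBMA of [16] nor the rePIBMA of Fig. 4: it only produces a scaled error locator Λ(x) and a scratch polynomial, and it does not reproduce the degree limiting or the z logic described above. The field polynomial 0x11D, the primitive element α = 2 and all function names are assumptions made for illustration; the paper does not state them.

```python
# Textbook inversionless Berlekamp-Massey over GF(2^8).
# Assumptions (not from the paper): field polynomial 0x11D, alpha = 2.

def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a nonzero element to the power n (exponent taken modulo 255)."""
    result = 1
    for _ in range(n % 255):
        result = gf_mul(result, a)
    return result

def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest-degree coefficient first) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc

def inversionless_bm(synd):
    """Return the coefficients of a scaled error locator polynomial."""
    two_t = len(synd)
    lam = [1] + [0] * two_t   # Lambda(x)
    b = [1] + [0] * two_t     # scratch polynomial B(x)
    gamma, L = 1, 0
    for r in range(two_t):
        # Discrepancy: sum over i of Lambda_i * s_(r-i).
        delta = 0
        for i in range(r + 1):
            delta ^= gf_mul(lam[i], synd[r - i])
        shifted_b = [0] + b[:-1]  # x * B(x)
        # Lambda <- gamma * Lambda - delta * x * B (minus equals plus in GF(2^m)).
        new_lam = [
            gf_mul(gamma, lam[i]) ^ gf_mul(delta, shifted_b[i])
            for i in range(two_t + 1)
        ]
        if delta != 0 and 2 * L <= r:
            b, L, gamma = lam, r + 1 - L, delta  # keep the old Lambda as B
        else:
            b = shifted_b
        lam = new_lam
    return lam
```

Feeding the function the 2t syndromes of a word with e ≤ t errors should return a polynomial whose roots are the inverse error locators, so evaluating it with `poly_eval` at α^(−j) for every position j identifies the error positions.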

This paper considers (n, k) RS codes defined over the Galois field GF($2^m$), where $n = 2^m - 1$ for primitive codes and k is the number of m-bit message symbols. The decoder in [16] is a syndrome-based RS decoder that is pipelined in three stages. The decoding timing schedule is shown in Fig. 2. Compared with the SC and CSEE blocks, the KES block has a large amount of idle time, which reduces the hardware utilisation.

Fig. 2. Timing schedule of the syndrome-based RS (n, k) decoder

3 Proposed RS decoder

3.1 SC block

Let $R(x) = r_{n-1} x^{n-1} + \cdots + r_1 x + r_0$ be the received polynomial, where $r_{n-1}, \ldots, r_1, r_0$ are the received symbols. The syndrome values are calculated as $s_i = R(\alpha^i)$, $0 \le i \le 2t - 1$. The architecture of the SC block is shown in Fig. 3. After n clock cycles, the 2t syndromes have been computed and are transmitted to the KES block.
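The syndrome definition above can be illustrated with a short software model. This is a minimal sketch, not the circuit of Fig. 3: it evaluates R(x) at each α^i with Horner's rule, consuming one received symbol per step from the highest-degree coefficient down, which is one common way to arrive at n steps per codeword. The field polynomial 0x11D, the primitive element α = 2 and the function names are assumptions for illustration only.

```python
# Software model of syndrome computation s_i = R(alpha^i) over GF(2^8).
# Assumptions (not from the paper): field polynomial 0x11D, alpha = 2.

def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a nonzero element to the power n (exponent taken modulo 255)."""
    result = 1
    for _ in range(n % 255):
        result = gf_mul(result, a)
    return result

def syndromes(received, two_t):
    """received[j] is the coefficient r_j of x^j; returns [s_0, ..., s_(2t-1)]."""
    result = []
    for i in range(two_t):
        alpha_i = gf_pow(2, i)
        acc = 0
        # Horner's rule: acc <- acc * alpha^i + r_j, from r_(n-1) down to r_0.
        for symbol in reversed(received):
            acc = gf_mul(acc, alpha_i) ^ symbol
        result.append(acc)
    return result
```

For an all-zero received word every syndrome is zero, and a single error of value v at position p yields s_i = v · α^(ip), which gives a quick sanity check of the model.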

Fig. 3. SC block

3.2 Proposed KES block

As described above, the ePIBMA architecture requires 2t + 1 PEs to determine the error locator after 2t iterations. Hence, the recursive PE (rPE) is divided into 2t + 1 pipelining stages to update coefficients of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0024 and Θ(x), and this processing concurrently operates during 2t clock cycles. Therefore, a direct implementation of the ePIBMA using a single rPE requires about 2t × (2t + 1) clock cycles to solve the key equation.

This latency is long and therefore reduces the throughput. Lu and Shieh [17] proposed different initial settings of the BM algorithm that reduce the hardware consumption of a VLSI implementation. Furthermore, with these settings, the first iteration (r = 0) can easily be removed when s0 = 0, since no additional operation is required to update the relevant polynomials. However, when s0 ≠ 0, the second iteration is not unique. Hence, removing an iteration is not appropriate in all situations. In addition, the benefit of the new initialisation is not obvious in the ordinary ePIBMA architecture, because the latency is determined by the longest stage of the pipelined RS decoder and still amounts to n clock cycles. We present a recursive implementation of the ePIBMA architecture together with new initialisations. The proposed rePIBMA architecture removes one iteration and reduces the latency to (2t − 1) × 2t + 1 clock cycles, which increases the achievable throughput.

We restructure the conventional ePIBMA; the proposed rePIBMA with the new initialisations is described by the pseudo-code in Fig. 4.

Fig. 4. Proposed rePIBMA

The proposed rePIBMA architecture is illustrated in Fig. 5. A pipelined GF multiplier is used to improve the clock frequency. The rPE unit shown in Fig. 5a continuously updates each coefficient of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0026 and Θ(x) according to the proposed rePIBMA. At first, the initial values are stored in the urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0028 and urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0030 register arrays. Next, one coefficient of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0032 and one coefficient of Θ(x) are updated simultaneously in each clock cycle. The updated coefficients are then stored in the urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0034 and urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0036 register arrays and fed back to the rightmost registers for the next update. The control signal M2 is used to set urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0038 to 0 in the last clock cycle of each iteration.

Fig. 5. rePIBMA architecture

a rPE unit

b z-generating unit. The expressions of some control signals are M1: l = 2t − 1, urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0040, where l counts the elapsed clock cycles within an iteration, r is the iteration count and urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0042. The remaining signals can be found in the control unit.

According to the proposed rePIBMA, the initialisations are not unique and depend on s0. Hence, the new initialisation of the rePIBMA architecture uses one clock cycle to select the appropriate initial values of some control signals as well as urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0044 according to the value of s0. The new initialisation of the proposed KES block is shown in Table 1. We use j in the control unit to check whether s0 = 0. Although urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0046 has different initial values depending on s0, the urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0048 register arrays always store $s_1$ to $s_{2t-2}$, 0, 1, whatever s0 is. When s0 ≠ 0, j is set to 1 and in this case urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0050 takes $s_1$ to $s_{2t-2}$, 0, 1 stored in the register array as initial values. Correspondingly, the initial value of γ is set to s0 and the other control signals are initialised as shown in Table 1. Note that the control signal k = LδLθ. When s0 = 0, j is set to 0, which makes the MUX in the rPE unit set urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0052 to urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0054 to 0 as initial values in the first iteration. Moreover, the initial value of γ is set to 1 and the initial values of the other signals are also changed. Hence, the first iteration can be removed and the initialisation takes one clock cycle. Note that bridge registers 1 to 3 are timing registers that feed urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0056 to the update of urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0058 at the proper time.

Table 1. New initialisation of the proposed KES block
rPE unit
register urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0060 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0062 Bridge 1 to 3 z
initial values urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0064 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0066 all 0 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0068
Control unit
register urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0070 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0072 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0074
initial values urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0076 1 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0078 1 0 1
urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0080 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0082 urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0084 −1 1 0

In addition, the rePIBMA architecture must also provide the value of z for the CSEE block. The z-generating unit is shown in Fig. 5b. In the rePIBMA architecture, the signal M1 controls z to update once every 2t − 1 clock cycles. The original initial value of z is 1 according to the ePIBMA. With the new initialisation scheme, z should be set to the value of the next update. In the rePIBMA, the updated z remains the same only when Lθ(r) ≥ t − 1, which cannot happen in the first iteration. Therefore, the new initialisation scheme sets z to $\alpha^{-1}$.

From the above ePIBMA, limiting the degree of B(x) to t − 1 can terminate the ePIBMA early. Hence, in the proposed architecture, when the degree of B(x) reaches t − 1, the update of the discrepancy urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0086 invokes only one multiplier; the adder and the other multiplier are automatically disabled. In other words, the architecture then acts as a shift-register array. Thus, when e < t, the proposed architecture saves power dynamically. As a result, the rePIBMA architecture with the new initialisation can always properly solve the key equation after $1 + (2t - 1) \times 2t = 4t^2 - 2t + 1$ clock cycles.

The control unit illustrated in Fig. 6 generates the required control signals for the rePIBMA architecture. The signals not explained earlier are as follows: MC1 stands for urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0088 and MC3 stands for Lθ(r) ≥ t − 1. In addition, in the proposed rePIBMA, urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0090 should be set to 0 in order to eliminate the calculation of Ω(x). Hence, in the rePIBMA architecture, the iteration number r and the clock cycle l within each iteration are used to control MC2. MC2 becomes 1 in the previous cycle of each iteration in order to set urn:x-wiley:17518628:media:cmu2bf01546:cmu2bf01546-math-0092 to 0.

Fig. 6. Control unit

3.3 Pipelined CSEE block

After the KES block finishes its computation for each codeword, the error locator polynomial Λ(x) and the scratch polynomial B(x) are fed into the CSEE block to generate the error values. The CSEE block is taken from [16]. To implement a high-speed RS decoder, the CSEE block is also pipelined. The pipelined CSEE block is illustrated in Fig. 7. In the first two clock cycles, the signal m1 is set to 1 and the signal m2 is set to select γ and γ · λ0, respectively. After four clock cycles, the signal m3 toggles from 1 to 0. As a result, the unit that calculates the numerator of the error value continuously outputs γ, γ · λ0, γ · λ0 · z, γ · λ0 · $z^2$, … in successive clock cycles. The inverter is also pipelined in five stages to shorten the path delay. Although the pipelined CSEE block introduces a delay of eight clock cycles before the corrected values of the first received codeword are output, this delay can be absorbed by the idle time of the second stage. Hence, the CSEE block still takes only n clock cycles to calculate all error values for each codeword, and the latency of our decoder is not increased.

Fig. 7. Pipelined ePCSEE architecture

Note that the inputs γ · λ0 and $z^2$ must be calculated before the CSEE block is executed. To avoid extra area and latency, γ · λ0 and $z^2$ are calculated by the structure in the dashed box of Fig. 5a once the KES block has finished computing the coefficients of Λ(x) and B(x). Hence, the final KES block produces the proper solution after $4t^2 - 2t + 3$ clock cycles.

4 Performance comparisons

4.1 Latency and hardware analyses

Fig. 8 illustrates the timing relationship among the three blocks of the proposed RS (255, 239) decoder. Since the total number of clock cycles in the KES unit is 243, which is less than the codeword length 255, there is no performance degradation in the proposed rePIBMA architecture. Compared with the original ePIBMA architecture, the idle time of the KES block could be greatly reduced. This implies that the hardware utilisation of the proposed decoder can be improved significantly.
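The cycle counts quoted in Sections 3.2 and 3.3 can be checked with a few lines of arithmetic. The sketch below simply evaluates the expressions given in the text for t = 8; the function names are hypothetical and chosen for illustration.

```python
# Cycle counts taken from the expressions in Sections 3.2 and 3.3.

def direct_single_pe_cycles(t):
    """Direct single-rPE ePIBMA: about 2t x (2t + 1) clock cycles."""
    return 2 * t * (2 * t + 1)

def repibma_cycles(t):
    """rePIBMA with the new initialisation: 1 + (2t - 1) x 2t clock cycles."""
    return 1 + (2 * t - 1) * 2 * t

def kes_total_cycles(t):
    """Adds the two extra cycles that bring the total to 4t^2 - 2t + 3."""
    return repibma_cycles(t) + 2

t, n = 8, 255  # RS (255, 239): 2t = n - k = 16
print(direct_single_pe_cycles(t), repibma_cycles(t), kes_total_cycles(t), n)
```

For t = 8 the direct single-rPE implementation would need about 272 cycles, which exceeds the codeword length of 255, whereas the rePIBMA needs 241 cycles and the complete KES block 243, so the KES stage fits within one codeword period.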

Fig. 8. Timing schedule of the rePIBMA architecture

Table 2 lists the hardware complexity of each component in the proposed decoder. The Galois-field multiplier is the one described in [18] and the inverter is realised as in [19]. A GF($2^8$) multiplier and a GF($2^8$) inverter are implemented with 267 and 395 gates, respectively.

Table 2. Hardware complexity of the proposed decoder for RS (255, 239) code
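| Block | GF($2^8$) mult | GF($2^8$) constant mult | GF($2^8$) adder | Reg (bit) | Mux 2-1 (bit) | GF($2^8$) inverter |
|---|---|---|---|---|---|---|
| SC | 0 | 16 | 16 | 128 | 0 | 0 |
| KES | 2 | 1 | 1 | 295 | 8 | 0 |
| control unit | 0 | 0 | 0 | 57 | 17 | 0 |
| CSEE | 4 | 15 | 15 | 308 | 4 | 1 |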

Table 3 compares the critical path delay, latency and hardware complexity of the KES blocks of various existing RS (255, 239) decoders. Lee [20] and Yuan et al. [21] implemented the conventional ME algorithm and the DCME algorithm, respectively, with pipelined recursive architectures to achieve high speed and low hardware complexity. The authors in [22-24] presented three different folded architectures based on the RiBM algorithm and significantly reduced the hardware complexity.

Table 3. Comparison of critical path delay, latency and hardware complexity for KES blocks of RS (255, 239) decoders
| KES architecture | Critical path delay | Latency | Multipliers | Adders | Latches | Muxes |
|---|---|---|---|---|---|---|
| proposed rePIBMA | 3Txor2 + Tmux2 | 243 | 2 + 1 (a) | 1 | 38 | 8 |
| FRiBM [22] | Tmult | 224 | 4 | 2 | 61 | 5 |
| PF_RiBM [23] | 2Txor2 + Tnand2 + Tmux2 | 225 | 4 | 2 | 61 | 5 |
| rDCME [21] | | 260 | 4 | 2 | 90 | 24 |
| single-mode RiBM [24] | Tmuxt + Tmult + Tadd | 133 | 8 | 4 | 50 | 25 |
| PrME [20] | 3Tor2 + Txor2 + Tmux2 | 255 | 4 | 2 | 170 | 30 |

(a) This is a constant multiplier.

It can be observed that the proposed KES block requires fewer multipliers, adders and latches than the existing architectures. In the rePIBMA architecture, the multiplier is pipelined, so the critical path of the KES block lies in the z-generating unit. Compared with [22, 24], the proposed KES block reduces the critical path delay. Compared with [24], the proposed architecture has a longer latency because of its recursive structure, but the reduction in the hardware complexity of the KES block is larger. In addition, compared with the architectures of [20-23], which have similar latency, the CSEE block of the proposed architecture has three more multipliers. Although the proposed method reduces the multipliers in the KES block at the expense of more multipliers in the CSEE block, the saving in the KES block outweighs this increase. Hence, the proposed decoder is more area-efficient.

4.2 Implementation results

The proposed RS (255, 239) decoder has been modelled in Verilog HDL. After functional verification, the decoder was synthesised with the SMIC 0.18-μm CMOS technology library using Synopsys design tools. Table 4 presents the implementation results of various RS (255, 239) decoders. The synthesis results show that the proposed design can operate at up to 575 MHz with a total gate count of 12,950, so the proposed decoder has very low hardware complexity. Table 4 also compares designs implemented in different fabrication technologies using the technology-scaled normalised throughput (TSNT) index [25].
Table 4. Implementation results of RS (255, 239) decoders
| Architecture | proposed rePIBMA | FRiBM [22] | PF_RiBM [23] | rDCME [21] | Single-mode RiBM [24] | PrME [20] |
|---|---|---|---|---|---|---|
| tech. | 0.18 μm | 0.18 μm | 90 nm | 0.18 μm | 0.18 μm | 0.13 μm |
| SC | 2,900 | | 6,950 | 2,900 | 2,900 | 3,000 |
| KES | 4,650 | | 5,632 | 11,400 | 9,566 | 17,000 |
| CSEE | 5,400 | | 8,750 | 4,100 | 4,100 | 4,600 |
| total gates | 12,950 | 12,668 | 21,332 | 18,400 | 16,566 | 24,600 |
| fmax (MHz) | 575 | 425 | 750 | 640 | 400 | 625 |
| throughput (Gb/s) | 4.6 | 3.4 | 6 | 5.1 | 3.2 | 5 |
| TSNT | 491.83 | 371.62 | 194.72 | 383.78 | 267.46 | 203.25 |

The TSNT of the proposed decoder is 28.15%, 32.35%, 83.89%, 141.98% and 152.58% higher than those of [21], [22], [24], [20] and [23], respectively. The result indicates that our design is more efficient than the other existing designs when they are normalised to the same technology.
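The TSNT figures can be reproduced from the other rows of Table 4. The expression used in the sketch below is an assumption inferred from the tabulated numbers, not a formula quoted from [25]: throughput in Mb/s, scaled by the ratio of the design's technology node to a 0.13 μm reference, divided by the gate count in thousands. The function and variable names are hypothetical.

```python
# Reproduces the TSNT row of Table 4 under an inferred formula (assumption):
# TSNT = throughput [Mb/s] * (technology / 0.13 um) / (total gates / 1000).

REFERENCE_TECH_UM = 0.13  # assumed reference node, inferred from Table 4

def tsnt(throughput_gbps, tech_um, total_gates):
    """Technology-scaled throughput per thousand gates."""
    scaled_mbps = throughput_gbps * 1000.0 * tech_um / REFERENCE_TECH_UM
    return scaled_mbps / (total_gates / 1000.0)

# (throughput in Gb/s, technology in um, total gates), copied from Table 4.
designs = {
    "proposed rePIBMA": (4.6, 0.18, 12950),
    "FRiBM [22]": (3.4, 0.18, 12668),
    "PF_RiBM [23]": (6.0, 0.09, 21332),
    "rDCME [21]": (5.1, 0.18, 18400),
    "single-mode RiBM [24]": (3.2, 0.18, 16566),
    "PrME [20]": (5.0, 0.13, 24600),
}

scores = {name: tsnt(*row) for name, row in designs.items()}
proposed = scores["proposed rePIBMA"]
for name, score in scores.items():
    gain = (proposed / score - 1.0) * 100.0
    print(f"{name}: TSNT = {score:.2f}, proposed is {gain:.2f}% higher")
```

Under this assumption the computed values agree with the TSNT row of Table 4 to two decimal places, and the ratios match the percentage improvements quoted above. The tabulated throughputs are also consistent with fmax multiplied by the 8-bit symbol width, for example 575 MHz × 8 bits = 4.6 Gb/s.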

5 Conclusion

This paper presents the rePIBMA architecture, which uses a single rPE unit to reduce the hardware complexity. In addition, pipelining gives the RS decoder a very short critical path, and a new initialisation is employed to reduce the long latency. The design also supports dynamic power saving. As a result, the proposed RS (255, 239) decoder is at least 28.15% more efficient than previous related designs in terms of TSNT and is better suited to the requirements of modern high-speed, low-complexity communication systems.

6 Acknowledgments

The authors thank the National Natural Science Foundation of China (grant no. 61474080) and the Program for New Century Excellent Talents in University of China for supporting this work.