Higher precision range estimation for context-based adaptive binary arithmetic coding

The Lagrangian rate distortion optimisation is widely employed in modern video encoders, such as high-efficiency video coding (H.265/HEVC). In this work, the authors propose a more accurate context-based adaptive binary arithmetic coding (CABAC) look-up table that can enhance compression quality and provide substantially better accuracy of range estimation by employing one more bit with the 64 probability states. For the hardware implementation, they propose a higher precision look-up table instead of the HEVC Test Model (HM) standard table. The authors also define a new finite-state machine to handle probability changes in real time. The BD-RATE gain of the proposed context modelling is up to 6.0% for all-intra mode and 13.0% for inter mode. This finite-state machine offers no divergence from the H.265/HEVC standard and can be used in current systems.


Introduction
Advanced video coding (H.264/AVC) [1] is a widely used video compression system that provides good image quality. The next generation coding standard, high-efficiency video coding (H.265/HEVC) [2], supports enhanced compression efficiency and higher resolution pictures. In order to reduce the flag sign in H.264/AVC, H.265/HEVC makes recursive use of a quad-tree structure to provide flexibility and freedom in selecting the coding modes for each cell or block of a frame. The flexibility of the H.265/HEVC unit size improves coding efficiency when compared with H.264/AVC, achieving better image quality at the same coding bit-rate, but at the expense of greater complexity in the encoding process. Rate distortion optimisation (RDO) is the best way to determine the optimal coding partition for lower computational cost [3]. The role of the video encoder is to select the best possible coding mode in order to maximise image quality and minimise bit rate. Therefore, the optimisation of the hybrid video encoder will minimise the Lagrangian cost function in (1) for all blocks of an entire sequence [4]. This minimisation follows a raster scanning order:
$J = D + \lambda \cdot R$ (1)
where J is the Lagrangian cost, D is the distortion between the coded and the original samples, R is the sum of the generated code bits, and λ is proportional to the quantisation parameter given by an empirical formula [5]. In addition to determining the cost J, video compression also uses context-based adaptive binary arithmetic coding (CABAC) as the next step in encoding the defined syntax elements. CABAC uses binarisation, context modelling and binary arithmetic coding (BAC). The well-known BAC engine (M-coder [6]) includes a regular mode and a bypass mode for encoding bits.
The BAC engine uses an adaptive probability estimator to estimate the probability of the incoming binary bits. Alternatively, the bypass mode uses another model that starts from probability 0.5 for faster coding. The probability P of the least probable symbol (LPS) in CABAC is represented by 64 discrete values with a set of pre-calculated values as illustrated in Fig. 1.
For software implementation, range coders are an alternative to arithmetic coders. They use bytes as output bitstream elements and perform byte renormalisation in one step. CABAC can be used efficiently for video coding as well as data compression. In this work, we define a more accurate CABAC look-up table for enhancing the compression results. The complexity and time used for each sub-process of CABAC are kept unchanged, even though the rate/distortion modules have been replaced by our proposed estimator.

Related work
For the CABAC context modelling, probability estimation is introduced to estimate the probability of the current symbol values [7]. The estimation model affects probability updates and range subdivisions in BAC. The processes of recursive interval subdivision, parameter derivation and the related updates are based on the principle of arithmetic coding. As an alternative to arithmetic coders, range coders use bytes as the output bitstream element and perform byte renormalisation in one step [8]. Much of the literature in this area focuses on how to optimise the BAC. A look-up-table-free method was proposed in [9] because of its coding efficiency: a simple rule with a window length is used to calculate the probability estimate. This is achieved by assigning a specific window length according to the statistical properties of the corresponding binary source.
The main disadvantage of the HEVC approach is that it requires considerable multiplication in one step. A significant improvement proposed in previous work can be achieved by modifying the context modelling in CABAC [10,11]. A better compression rate is obtained by using a look-up-table, index-based entropy coder [7]. In [12] the results show that the adaptive binary range encoder leads to a reduction in computational complexity. However, the range encoder is not efficient for short binary sequences and still uses multiplication in interval partitioning. Hence, there is motivation to find a balance between performance and the accuracy of probability estimation for each binary source [13]. This problem can be solved by state machines with adaptive speeds and probability estimation accuracy [14]. However, this introduces a set of additional look-up tables, leading to increased memory consumption. A further interesting direction for decreasing the computational complexity is the development of a CABAC with parallel data processing [15].
In this paper, we present a range estimation of higher precision for CABAC. Our method aims at low complexity while achieving a better data-compression ratio. Following the CABAC coding procedure, we extend the prediction bits to obtain a more accurate mapping between the original domain and a pre-calculated domain, avoiding multiplication and a critical long operation delay. Moreover, we propose a new finite-state machine and a corresponding state transfer table. This table handles state changes through the current state index. Only addition and shift operations are required in the coding process and the range estimation. These are hardware-efficient operations and so achieve good coding performance. The primary contributions and features of the proposed schemes are listed below:
(i) The experimental results achieve a compression performance better than that of CABAC in H.265/HEVC with low hardware complexity. The BD-RATE gain of the proposed context modelling is 3.0-6.0% for all-intra mode and 3.0-13.0% for inter mode.
(ii) The proposed scheme does not use multiplication, only requiring addition and shift operations. It has low complexity and a hardware-efficient structure.
(iii) The proposed finite-state machine is highly compatible with standard CABAC in H.265/HEVC, achieving a good trade-off between accuracy and speed in range estimation.
This section has discussed the related work in high precision probability estimation for CABAC; the rest of this paper is organised as follows. Section 3 gives the details of the proposed high precision range estimation for CABAC processing, which applies a new look-up table for the new range LPS estimation. Section 4 explains the implementation of these algorithms in the H.265/HEVC reference software HM 16.15. Section 5 presents and discusses the experimental results and compares them with previous work, and finally Section 6 concludes this paper.

Higher precision range estimation
The idea of multiplication-free arithmetic coding is based on the assumption that the estimated probability of each context model can be represented by a finite set of representative values [16]. In its standard form, the received symbol is estimated by two probabilities $P_{MPS}$ and $P_{LPS}$, satisfying $P_{MPS} \ge P_{LPS}$ and $P_{MPS} + P_{LPS} = 1.0$. The update of the probabilities in the context model is based on the rule
$P'_{LPS} = \alpha \cdot P_{LPS}$ if an MPS occurs, and $P'_{LPS} = 1 - \alpha \cdot (1 - P_{LPS})$ if an LPS occurs (2)
There are 64 values used to represent an accurate estimation. Each state is assigned one of 64 representative LPS values from $P_0$ to $P_{63}$ dividing the range [0.01875, 0.5]. The ratio is $\alpha = (P_{63}/P_0)^{1/63} = (0.01875/0.5)^{1/63} \simeq 0.949$. $P_{LPS}$ and $P'_{LPS}$ are the probabilities of LPS before and after bin coding, respectively. The probability update shown in (2) has been implemented in the look-up state tables NextStateLPS and NextStateMPS in the HEVC Test Model (HM). Thus, the state transfer is handled directly by the index update between both state tables, without multiplication.
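As an illustration of the representative values described above, the following minimal Python sketch derives α and the 64 LPS probabilities from the two endpoints 0.5 and 0.01875 given in the text. The function names are hypothetical, the update is an idealised floating-point version of the rule rather than the HM state tables, and clamping the MPS branch at $P_{63}$ is an assumption made here for illustration.

```python
# Idealised sketch of the 64 representative LPS probabilities.
# All names are illustrative; HM uses integer state tables instead.
P_MAX = 0.5        # P_0, largest LPS probability
P_MIN = 0.01875    # P_63, smallest LPS probability
NUM_STATES = 64

# Ratio between consecutive states: alpha = (P_63 / P_0) ** (1 / 63)
ALPHA = (P_MIN / P_MAX) ** (1.0 / (NUM_STATES - 1))


def lps_probabilities():
    """Return P_0 .. P_63 with P_i = P_0 * alpha ** i."""
    return [P_MAX * ALPHA ** i for i in range(NUM_STATES)]


def update_probability(p_lps, bin_is_mps):
    """Floating-point form of the exponential update.

    MPS observed: the LPS probability shrinks by alpha
                  (clamped at P_63, an assumption of this sketch).
    LPS observed: the LPS probability grows to 1 - alpha * (1 - p).
    """
    if bin_is_mps:
        return max(ALPHA * p_lps, P_MIN)
    return 1.0 - ALPHA * (1.0 - p_lps)
```

Calling `lps_probabilities()` yields a geometric sequence that starts at 0.5 and ends at approximately 0.01875, which is the sense in which a state index can stand in for a probability.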

Estimation table construction
In the CABAC implementation, probabilities $0.0 \le P \le 1.0$ are scaled and projected to the integer range [0, 510]. All the probability updates are approximated by an integer called the range, $R \simeq P \cdot 510$, which also satisfies $R_{MPS} \ge R_{LPS}$ and $R_{MPS} + R_{LPS} = 510$. For the R updating, in order to maximise the computational efficiency, these products are also obtained from a look-up table that holds pre-calculated values of $R_{LPS}$. The R of the coding interval is represented by 9 bits ($256 \le R \le 510 < 2^9$) in the CABAC standard, and the products are of limited precision because only the higher-order bits of R (bit-7 and bit-6) are used in the search. Therefore, we propose to extend these available bits to obtain more accurate results.
As shown in Fig. 2, two bits are used for estimation in the CABAC standard, resulting in four possible cases. The R estimation is based on the state transition of a finite-state machine combined with the 64 probability states, so the product results can be pre-calculated in a two-dimensional (2D) look-up table (called LPSTable in HM) of size 64 × 4. Instead, we make use of three bits (bit-7, bit-6 and bit-5) to build a new 2D look-up table that takes into account more of the preceding bin values during estimation and so provides a more accurate prediction for $R_{LPS}$. Accordingly, the interval $2^8 \le R < 2^9$ is subdivided into eight subranges with demarcation points 256, 288, 320, 352, 384, 416, 448, 480, 510 (corresponding to Figs. 2 and 3).
As shown in Fig. 3, the three bits used for estimation produce eight sub-areas. In order to pre-calculate the product and avoid multiplication, we approximate the next $R_{LPS}$ by a summarised value that represents a specific subrange of the area [256, 510]. We specify these values as $c_0 = 272$, $c_1 = 304$, $c_2 = 336$, $c_3 = 368$, $c_4 = 400$, $c_5 = 432$, $c_6 = 464$ and $c_7 = 496$; they are the middle points of each subrange. With three bits there are $2^3 = 8$ possible cases, so with 64 probability states the new 2D look-up table has size 64 × 8, and each table element can be calculated by
$e_{i,j} = \mathrm{roundup}(P_i \cdot c_j)$ (3)
where $e_{i,j}$ is the integer result, $P_i$ with $i \in \{0, 1, \ldots, 63\}$ are the 64 LPS probability states in (2) and $c_j$ with $j \in \{0, 1, \ldots, 7\}$ are the eight summarised values shown in Fig. 3. Our proposed newLPSTable can replace the HM's LPSTable and improves the accuracy of the range estimation for the next $R_{LPS}$. For a complete table illustration, please see the Appendix (Section 9.1). Meanwhile, we have to ensure that the transferring state for $R_{LPS}$ corresponds correctly to the new look-up table. The condition for the transferring state can be written as
$\sigma' = R_{LPS}^{-1}(R)$ (4)
where σ′ and σ are the indices of the next and current states, respectively, and the inverse function $R_{LPS}^{-1}(R)$ finds the state index of the least $R_{LPS}$ that satisfies $R_{LPS} \ge R$. Considering (2), R here can be rewritten as $1 - \alpha \cdot (1 - R_{LPS}(\sigma))$, where $R_{LPS}(\sigma)$ is the range of LPS looked up from state index σ. If the received bin is MPS then we keep using the HM reference table NextStateMPS; otherwise the received bin is LPS and we change to our proposed table newNextStateMPS. This table is a set of pre-calculated indices for the state transfers based on rule (4) (see the Appendix (Section 9.2)), giving a new finite-state machine with the transition rules illustrated in Fig. 4.
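The construction in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "roundup" is interpreted here as the ceiling function, the probabilities are regenerated as $P_i = 0.5 \cdot \alpha^i$, and the function name `build_new_lps_table` is hypothetical. The values in the Appendix table may differ from what this sketch produces.

```python
import math

NUM_STATES = 64
# Ratio between consecutive probability states, from the endpoints in the text.
ALPHA = (0.01875 / 0.5) ** (1.0 / (NUM_STATES - 1))

# Midpoints c_0 .. c_7 of the eight subranges: 272, 304, ..., 496.
MIDPOINTS = [272 + 32 * j for j in range(8)]


def build_new_lps_table():
    """Return a 64 x 8 table with e[i][j] = ceil(P_i * c_j).

    Interpreting 'roundup' as ceil is an assumption of this sketch.
    """
    table = []
    for i in range(NUM_STATES):
        p_i = 0.5 * ALPHA ** i          # LPS probability of state i
        table.append([math.ceil(p_i * c) for c in MIDPOINTS])
    return table
```

Because every entry is pre-computed, the encoder only needs a state index and a three-bit range index at run time; no multiplication takes place in the coding loop.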

Table-based probability and probability state updating
Fig. 5 illustrates the binary arithmetic encoding process for a given bin value bin in the regular mode. The usual parameters are the range R and the lower endpoint L of the current interval. $R_{LPS}$ and $R_{MPS}$ are the estimated ranges used to update to the next $R_{LPS}$. The given binary value bin is encoded in a context with the corresponding state index σ and range index ρ.
As shown in Fig. 5, this interval subdivision process involves three major operations for the current interval, which is subdivided according to the given R estimates:
(i) The current interval R is approximated by the summarised value using an equipartition of the whole range [256, 510] into eight areas. However, instead of using the corresponding representative quantised range values $c_0, c_1, \ldots, c_7$ explicitly, the summarised value is addressed by its quantiser index, which can be efficiently computed by a combination of a shift and a bit-masking operation: $\rho = (R \gg 5) \mathbin{\&} 7$.
(ii) This range index ρ and the state index σ are used as entries in our proposed newLPSTable to determine the (approximate) LPS-related subinterval range $R_{LPS}$. As discussed in the above section, newLPSTable contains all 64 × 8 pre-computed product values given by (3).
(iii) The regular arithmetic encoding process to update the probability states is performed.
Note that the MPS must be renewed when the current state index is at $P_0$. This means that the probabilities of the next predicted LPS and MPS are the same when $P_0 = 0.5$ (defined in (2)). Furthermore, as shown in the left branch of Fig. 5, when the current bin is LPS, the probability of the next predicted LPS is increased. The symbols of LPS and MPS must then be exchanged because, by definition, the probability of LPS cannot be >0.5.
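The index computation in step (i) and the subsequent split of the interval can be sketched as follows. The function names are hypothetical, and `subdivide` shows only the textbook interval update (MPS keeps the lower part, LPS takes the upper part) before renormalisation; it is not a transcription of Algorithm 3.

```python
def range_index(r):
    """Three-bit quantiser index rho = (R >> 5) & 7 for 256 <= R <= 510."""
    return (r >> 5) & 7


def hm_range_index(r):
    """Two-bit index (R >> 6) & 3 used with the 64 x 4 table, for comparison."""
    return (r >> 6) & 3


def subdivide(r, low, r_lps, bin_is_mps):
    """Split the current interval [low, low + r) for one regular-mode bin.

    r_lps is the LPS sub-range looked up from the table.
    Returns the new (range, low) pair before renormalisation.
    """
    r_mps = r - r_lps
    if bin_is_mps:
        return r_mps, low            # MPS: keep the lower sub-interval
    return r_lps, low + r_mps        # LPS: move to the upper sub-interval
```

For example, `range_index(288)` selects the second subrange, whose representative value is $c_1 = 304$, while every R in [256, 287] maps to index 0.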
The following step consists of the renormalisation of the encoded results as described below.

Implementation
In order to guarantee the performance of the approximation, we follow the LPS/MPS concept that maximises the coding performance, independent of the size of the look-up table. (It retains the O(1) time complexity in table searching.) Compared to the reference HM software, our proposed estimation table requires extra memory for the pre-calculated states. The size of newLPSTable is 1.0 kilobyte, whereas the LPSTable, as in the H.265/HEVC standard, is 0.5 kilobyte; this is trivial to implement nowadays.

Table-free renormalisation
There is a renormalisation phase in which the range is multiplied by a factor of 2 as many times as required until it becomes ≥256. One bit is written to the video bitstream for each multiplication step.
The process shown in Algorithm 1 (Fig. 6) is good for readability but lowers performance. In order to avoid the time-consuming while-loop instruction in HM, the number of bit shifts is obtained by a look-up table technique, using the table called RenormTable shown in the Appendix (Section 9.3). This table indexes the number of shifts directly based on the current range R (see Algorithm 2 (Fig. 7)).
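The two variants can be contrasted with a small Python sketch. The loop mirrors the doubling described above; `RENORM_TABLE` is a hypothetical stand-in that is simply indexed by the range value itself, and its layout is an assumption of this sketch rather than the layout of the RenormTable in the Appendix.

```python
def renorm_shifts_loop(r):
    """Count the doublings needed until r >= 256 (requires r >= 1)."""
    shifts = 0
    while r < 256:
        r <<= 1          # multiply the range by 2
        shifts += 1      # one bit is output per doubling
    return shifts


# Hypothetical table indexed directly by the range value 1..510.
# Entry 0 is a placeholder because a range of 0 never occurs.
RENORM_TABLE = [0] + [renorm_shifts_loop(r) for r in range(1, 511)]


def renorm_shifts_table(r):
    """Table-based variant: a single indexed read replaces the loop."""
    return RENORM_TABLE[r]
```

The table trades 511 small entries for the removal of a data-dependent loop, which is the trade-off that motivates the table-free alternative.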
However, such a look-up table requires additional area in hardware, and we would like to reduce the size of the memory allocation. Hence, our implementation introduces a table-free renormalisation technique: modern processors provide a set of bit manipulation instructions that we can use in the latest HM. There is a bit scan instruction called _BitScanReverse() that searches the value from the most significant bit to the least significant bit for a set bit, then returns the index of the leftmost set bit (binary 1) in the current value. This bit scan instruction is a built-in function embedded in modern hardware and requires very few clock cycles. Thus, the number of bit shifts, numBits, can be obtained directly from the returned index, as given in (5). Computation of numBits is easily carried out by a concatenation of a bit shift and bit-masking operations, where the latter can be interpreted as modulo operations.
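A minimal Python sketch of the idea follows. Python has no _BitScanReverse intrinsic, so `int.bit_length()` is used as a stand-in for the index of the most significant set bit. The expression `8 - index` is derived here from the requirement that the renormalised range reaches 256 = 2^8; it is an assumption of this sketch and not necessarily the exact form of (5).

```python
def bit_scan_reverse(x):
    """Index of the most significant set bit of x (x > 0).

    Stand-in for the _BitScanReverse intrinsic mentioned in the text.
    """
    return x.bit_length() - 1


def renorm_shifts_bsr(r):
    """Number of left shifts needed so that (r << n) >= 256, for r >= 1.

    A range already >= 256 has its top bit at index 8 or above,
    so the result is clamped at zero.
    """
    return max(0, 8 - bit_scan_reverse(r))
```

For instance, a range of 6 has its leading bit at index 2, so six shifts are needed, giving 6 << 6 = 384, which lies back in [256, 510].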

Algorithm and pseudocode
In order to embed our modification in the HM reference software, we must follow the LPS/MPS concept used in CABAC. Our approach can be summarised as follows:
(i) For each syntax element, R is updated by bitwise estimation on the relative codewords.
(ii) The look-up index is formed from three bits (bit-7, bit-6 and bit-5) of the range, and the range estimate is obtained from our proposed look-up table.
(iii) The state index is updated following rule (4), which is implemented by two look-up tables called NextStateLPS and newNextStateMPS.
(iv) Renormalisation is performed if R < 256, and the number of bits written out to the bitstream equals the number of bit-shifting steps.
The above shows the outline of the proposed processing, which is a coding method for binary symbols belonging to either MPS or LPS. Note that the results of the bit shift in (5) do not cause an arithmetic overflow in type BYTE, so numBits ≥ 0. The implementation of our proposed method is integrated into Algorithm 3 (Fig. 8).
Corresponding to the three-bit proposal, some parts of the decoder for the new $R_{LPS}$ estimation will also be modified. The decoder has a structure similar to that of the encoder in Algorithm 4 (Fig. 9).
Furthermore, the decoder decodes each bin value by an overlapping condition between the coordinates represented by the codeword and the probability interval. Note that this avoids the use of the division operator, as in the context-adaptive encoder. The R of the decoder is also updated by searching the new look-up table.

Experimental results and discussion
In order to evaluate the gains provided by the proposed method, we encoded the first 64 frames of sequences with different resolutions using the H.265/HEVC reference profile of the HM software, version 16.15. These sequences are coded in All-intra, Inter-IPPPP and Inter-IBBBP format at a frame rate of 32 Hz. For performance evaluation, all the sequences are encoded with QP ∈ {20, 24, 28, 32, 36, 40}, and the coder in H.265/HEVC is used as the benchmark to compute the BD-RATE. These experimental results are for the GOP structures IIIII, IPPPP and IBBBP, respectively.
Table 1 indicates the BD-RATE and BD-PSNR results of the proposed method. On average, the proposed context modelling achieves better coding performance than the reference configurations for sequences of different resolutions. These experiments demonstrate that our method provides, on average, higher Luma gain compared to the standard. The BD-RATE gain of the proposed context modelling is 3.0-6.0% for all-intra mode and 3.0-13.0% for inter mode. As expected, the gain offered by the higher precision range estimation is smaller in intra mode on average. It does not affect video quality much in terms of BD-PSNR for the intra pictures. This is because the inter mode only encodes the motion vector and the difference from the best match in the reference frame, which results in relatively smaller bitstreams after DCT. Thus, the coding of inter P and B frames can be more efficient in terms of inter predictions than the coding of I frames. It should also be noted that the BD-RATE is shown to offer more gain in the higher resolution video sequences coded with inter mode. This is because the reduction of bitrate gained through the number of bit shifts depends on the range estimation: an accurate estimation table can gain a little at each state transfer. Therefore, the percentage of bitrate reduction will decrease when the resolution increases.

Conclusion
This paper has explored some of the potential of the RDO algorithm for video coding. Improvements are achieved by increasing the accuracy of rate estimation and by defining a more accurate range look-up table that can enhance the compression results for the CABAC processes. For accurate bit estimation, a practical formula is proposed to apply the finite-state machine of the higher precision range to the probability estimation model. Simulations show that using these techniques in the RDO results in a better optimised mode decision without incurring a significant complexity overhead. The improvements are shown to offer a more significant gain in the H.265/HEVC encoder, while the proposed algorithms offer no divergence from the H.265/HEVC standard and can be used in current devices or systems.

Acknowledgment
This work was supported by the Macao Science and Technology Development Fund through Project 138/2016/A3.

Fig. 1 Pre-calculated results and transferring for MPS and LPS probabilities in CABAC

Fig. 2 Illustration of the interval subdivision and selection of regular bin of R employed to estimate the probabilities of the symbols

Fig. 4 Proposed finite-state machine. The states are transferred following rule (4) for each received bin: bin = MPS (↻, blue) and bin = LPS (↺, orange)