Method for Viterbi decoder implementation

Application number: EP12195599.1 · Filing date: 2012-12-05 · Publication number: EP2621092A1 · Publication date: 2013-07-31
Applicants: IMEC; Samsung Electronics Co., Ltd. · Inventors: Catthoor, Francky; Naessens, Frederik; Raghavan, Praveen
Abstract

The present invention relates to a method for Viterbi decoder implementation, said implementation being constrained with respect to energy efficiency and having requirements related to throughput and area budget. The method comprises the steps of
- deriving a set of design options from a Viterbi decoder model with given design specification by differentiating one or more design parameters, said one or more design parameters at least comprising a first value for a look-ahead parameter, said look-ahead parameter indicative of the number of trellis stages combined in a single step of the Viterbi decoding process, and evaluating the various design options in a multi-dimensional design space, whereby said evaluating comprises performing a logic synthesis towards a technology dependent description in terms of connected logic gates,
- selecting from the set of design options a design option satisfying the energy efficiency constraint, said selected design option yielding at least a second value for the look-ahead parameter, said second value being greater than or equal to the first value and in agreement with the area budget,
- implementing a Viterbi decoder with the selected design option.
Claims

1. Method for Viterbi decoder implementation, said implementation being constrained with respect to energy efficiency and having requirements related to throughput and area budget, the method comprising the steps of
   - deriving a set of design options from a Viterbi decoder model with given design specification by differentiating one or more design parameters, said one or more design parameters at least comprising a first value for a look-ahead parameter, said look-ahead parameter indicative of the number of trellis stages combined in a single step of the Viterbi decoding process, and evaluating the various design options in a multi-dimensional design space, whereby said evaluating comprises performing a logic synthesis towards a technology dependent description in terms of connected logic gates,
   - selecting from said set of design options a design option satisfying said energy efficiency constraint, said selected design option yielding at least a second value for the look-ahead parameter, said second value being greater than or equal to said first value and in agreement with said area budget,
   - implementing a Viterbi decoder with the selected design option.
2. Method for selecting a Viterbi decoder implementation as in claim 1, wherein said various design options are evaluated in terms of area, throughput and energy.
3. Method for selecting a Viterbi decoder implementation as in claim 2, wherein said set of design options is derived by retaining the best trade-offs in terms of area, throughput and energy when exploring said multi-dimensional design space.
4. Method for selecting a Viterbi decoder implementation as in any of claims 1 to 3, wherein in said evaluating a functional verification is performed on said technology dependent description in terms of connected logic gates.
5. Method for selecting a Viterbi decoder implementation as in any of the previous claims, wherein trace-back length is taken into account as design parameter.
6. Method for selecting a Viterbi decoder implementation as in any of the previous claims, whereby said Viterbi decoder model is a pipeline model.
7. Method for selecting a Viterbi decoder implementation as in claim 6, wherein said pipeline model is part of an application-specific instruction set processor.
8. Method for selecting a Viterbi decoder implementation as in any of the previous claims, wherein in the step of selecting from said set of design options the difference between said second and said first value of the look-ahead parameter is exploited for a clock frequency reduction and/or logic core voltage reduction.
Description

Field of the invention

The present invention is related to the field of Viterbi decoder implementations.

Background of the invention

Convolutional encoding is widely used in many communication standards, e.g. WLAN (Wireless Local Area Network or Wi-Fi or 802.11a/b/g/n). As with other error correction mechanisms, redundancy is added to the data so that the data can be recovered in case it gets corrupted by noise, channel conditions or receiver non-idealities. In a convolutional encoder, an input bit stream is applied to a shift register. Input bits are combined using binary single-bit addition (XOR) with several outputs of the shift register cells. The bit streams obtained at the output form a representation of the encoded input bit stream. Each bit at the input of the convolutional encoder results in n output bits. The coding rate is defined as 1/n (or k/n if k input bits are used). These output bits are a function of the current input bit and the K-1 previous input bits, where K is called the constraint length. In general a convolutional code is identified via the following characteristics:

  • Constraint length K: determines the number of memory elements in the shift register. It is defined as the shift register length plus one.
  • Number of output branches n, whereby each branch outputs one bit.
  • Polynomial Gx for each output branch: defines the relation of the output bit with respect to the current input bit and the K-1 previous input bits. Each output bit is a modulo-2 addition (or XOR operation) of some of the input bits. The polynomial indicates which bits in the input stream have to be added to form the output.

An encoder is completely characterised by n polynomials of degree K.
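
To make the encoder definition concrete, the following Python sketch implements a generic rate-1/n convolutional encoder. The generator polynomials 133 and 171 (octal) of the K=7, rate-1/2 code used in WLAN are taken here merely as an assumed example; the bit-ordering convention is one of several possible choices.

```python
def conv_encode(bits, polys=(0o133, 0o171), K=7):
    """Rate-1/n convolutional encoder (n = len(polys)).

    Each output bit is the modulo-2 sum (XOR) of the taps selected by the
    corresponding generator polynomial, applied to the current input bit and
    the K-1 previous input bits held in the shift register.
    """
    state = 0                            # shift register: K-1 memory elements
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state     # current bit + K-1 previous bits
        for g in polys:
            # parity of (register AND polynomial) = XOR of the selected taps
            out.append(bin(reg & g).count("1") & 1)
        state = reg >> 1                 # shift: the oldest bit drops out
    return out

# Example: 4 input bits yield 8 coded bits (coding rate 1/2)
print(conv_encode([1, 0, 1, 1]))
```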

The encoder can have different states, represented by the K-1 previous input bits stored in the shift register. Every new input bit processed by the encoder leads to a state transition. The state diagram can be unfolded in time to represent the transitions at each stage in time. Such a representation is called a trellis diagram.

In a convolutional encoder the data bits are fed into a delay line (with length K), from which certain branches are XOR-ed and fed to the output. Considering WLAN (Wireless Local Area Network or Wi-Fi or 802.11a/b/g/n) as an example, the throughput requirement is being pushed towards decoder output rates of 600 Mbps (in the IEEE 802.11n standard), while keeping the energy efficiency as high as possible. Of course, there is also the usual need to keep the area footprint as low as possible. For example, a Viterbi decoder implemented in a handheld device should satisfy these requirements.

Viterbi decoding is a well-known method to decode convolutional error codes. Viterbi decoding is a near-optimal decoding of convolutionally encoded data. It has a strongly reduced complexity and memory requirement compared to optimal decoding. In general, during decoding the most probable path over the trellis diagram is reconstructed using the received (soft) bits, which results in determining the original data. Specifically, in Viterbi decoding a window (with a so-called trace-back length) is considered before taking a decision on the most probable path and the corresponding decoded bit. Constraining the decision to a window, instead of the complete data sequence, gives a considerable reduction in complexity without sacrificing decoding performance significantly. A high-level view on the decoding operation is depicted in Fig.1.

Starting from the input Log-Likelihood Ratios (LLRs), path metrics are calculated for each of the S = 2^(K-1) paths. One of these paths is selected to be optimal and the result of this decision is stored into the trace-back memory. Once a number of path metrics equal to the trace-back depth has been calculated, an output bit can be produced for every incoming pair of input LLRs.
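
The recursion just described can be made explicit in a small sketch. The Python functions below show one add-compare-select (ACS) update of the path metrics and a trace-back over the stored decisions; the branch_metric function is assumed to be supplied (it would be derived from the input LLRs and the code polynomials), and the two-predecessor indexing follows the usual binary trellis convention.

```python
def acs_step(pm, branch_metric, S):
    """One add-compare-select update over all S states.

    pm[s] is the accumulated path metric of state s; branch_metric(prev, s)
    returns the metric of the trellis branch prev -> s. Returns the updated
    path metrics and, per state, the surviving predecessor (the decision that
    is written into the trace-back memory).
    """
    new_pm, survivor = [0.0] * S, [0] * S
    for s in range(S):
        # in a binary trellis each state has exactly two predecessors
        preds = [(2 * s) % S, (2 * s + 1) % S]
        cands = [pm[p] + branch_metric(p, s) for p in preds]
        best = 0 if cands[0] >= cands[1] else 1
        new_pm[s], survivor[s] = cands[best], preds[best]
    return new_pm, survivor

def trace_back(survivors, start_state):
    """Walk back through the stored decisions (the trace-back memory)."""
    state, path = start_state, [start_state]
    for surv in reversed(survivors):
        state = surv[state]
        path.append(state)
    return list(reversed(path))
```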

Viterbi decoding is performed in a streaming fashion and the main bottleneck is situated in the state memory update. In order to boost the throughput, this iterative loop needs to be avoided or optimized. The principle of breaking down iterative loops into parallel computations is a known technique and the higher-level concept behind it has been applied in other domains since the 1980s, mainly to DSP algorithms, although some iterative control algorithm kernels have been treated this way as well. The idea of parallelizing Viterbi decoding has already been described extensively in the art. The principle of Viterbi decoding parallelization is also known as radix-2^Z or Z-level look-ahead (LAH) decoding. Look-ahead techniques combine several trellis steps into one trellis step in the time sequence through parallel computation. The number of combined trellis steps defines the look-ahead factor Z.

Based on the techniques explained above, many contributions have been made to offer high-speed Viterbi decoding. Some papers only address solutions for a limited number of states and have a clear focus on boosting performance without taking into account a possible trade-off with area and energy. Other solutions exploit the look-ahead to allow extra pipelining inside the decoding loop, resulting in throughputs that are equal to or lower than a single bit per clock cycle.

The paper "Design Space Exploration of Hard-Decision Viterbi Decoding: Algorithm and VLSI Implementation" (Irfan Habib et al., IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 5, May 2010) presents an extensive design space exploration for performing Viterbi decoding, taking into account area, throughput and power. At a top level a Viterbi decoder consists of three units, namely the branch metric unit, the path metric unit and the survivor memory unit. For each unit the paper explores the design space. The branch metric unit (BMU) calculates the distances from the received (noisy) symbols to all code words. The measure calculated by the BMU can be the Hamming distance in case of the hard input decoding or the Manhattan/Euclidean distance in case of the soft input decoding (e.g., every incoming symbol is represented using several bits). The path metric unit (PMU) accumulates the distances of the single code word metrics produced by the BMU for every state. Under the assumption that zero or one was transmitted, corresponding branch metrics are added to the previously stored path metrics which are initialized with zero values. The resulting values are compared with each other and the smaller value is selected and stored as the new path metric for each state. In parallel, the corresponding bit decision (zero or one) is transferred to the SMU while the inverse decision is discarded. Finally, the survivor memory unit (SMU) stores the bit decisions produced by the PMU for a certain defined number of clock cycles (referred as trace-back depth, TBD) and processes them in a reverse manner called backtracking. Starting from a random state, all state transitions in the trellis will merge to the same state after TBD (or less) clock cycles. From this point on, the decoded output sequence can be reconstructed.

The Habib paper mentions that the PMU is a critical block both in terms of area and throughput. The key problem of the PMU design is the recursive nature of the add-compare-select (ACS) operation (path metrics calculated in the previous clock cycle are used in the current clock cycle). In order to increase the throughput or to reduce the area, optimizations can be introduced at the algorithmic, word or bit level. Word level optimizations work on folding (serialization) or unfolding (parallelization) of the ACS recursion loop. In the folding technique, the same ACS is shared among a certain set of states. This technique trades off throughput for area. It is an area-efficient approach for low-throughput decoders, though in the case of folding the routing of the path metrics becomes quite complex. With unfolding, two or more trellis stages are processed in a single recursion (this is called look-ahead, as already defined above). If the look-ahead is short, the area penalty is not high. Radix-4 look-ahead (i.e. processing two bits at a time, Z=2) is a commonly used technique to increase the decoder's throughput.

Although the Habib paper mentions that look-ahead can be used to enhance throughput, it states in section IV.F that the use of look-ahead is to be discouraged, as the authors consider look-ahead techniques extremely expensive in terms of area and power consumption. Therefore the design space exploration results do not consider the look-ahead option as an optimal trade-off point in the area versus power trade-off dimension. It is important to stress that the Habib paper only considers maximal power consumption and not energy consumption for executing the Viterbi decoder task.

Summary of the invention

It is an object of embodiments of the present invention to provide for an approach wherein a Viterbi decoder implementation is determined as a result of a design space exploration wherein at least a look ahead parameter is considered.

The above objective is accomplished by a method according to the present invention.

In a first aspect the invention relates to a method for Viterbi decoder implementation, whereby the implementation is constrained with respect to energy efficiency and has requirements related to throughput and area budget. The method comprises the steps of

  • deriving a set of design options from a Viterbi decoder model with given design specification by differentiating one or more design parameters, said one or more design parameters at least comprising a first value for a look-ahead parameter, said look-ahead parameter indicative of the number of trellis stages combined in a single step of the Viterbi decoding process, and evaluating the various design options in a multi-dimensional design space, whereby said evaluating comprises performing a logic synthesis towards a technology dependent description in terms of connected logic gates,
  • selecting from the set of design options a design option satisfying the energy efficiency constraint, said selected design option yielding at least a second value for the look-ahead parameter, said second value being greater than or equal to the first value and in agreement with the area budget,
  • implementing a Viterbi decoder with the selected design option.

The problem of selecting an implementation for the Viterbi decoder is solved, given a constraint concerning the energy efficiency and requirements related to area and throughput.

Starting from a Viterbi decoder model and a defined design specification a set of possible implementations is derived by differentiating one or more of the design parameters. These design parameters at least comprise a first value for a look-ahead parameter. Various design options are evaluated in a multi-dimensional design space, whereby said evaluating comprises performing a logic synthesis towards a technology dependent description in terms of connected logic gates. In practice the multi-dimensional design space can be explored by evaluating a set of cost functions.

From the set of design options one selects a design option satisfying the energy efficiency constraint, whereby the selected design option yields at least a second value for the look-ahead parameter, said second value being at least equal to the first value and in agreement with the area budget. It may be that this second value equals the first, in case the search for the optimal design option results in a solution with a look-ahead parameter having said first value.

Finally, a Viterbi decoder is implemented according to the selected design option.

In a most preferred embodiment the various design options are evaluated in terms of area, throughput and energy. In other words, the design space then has as dimensions area, throughput and energy efficiency. The set of design options is then advantageously derived by retaining the best performance trade-offs in terms of area, throughput and energy when exploring said multi-dimensional design space.
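
Retaining the best trade-offs can be read as keeping the Pareto-optimal points of the explored space. The sketch below illustrates this in Python under the assumption that every design option has already been characterised by an (area, energy, throughput) tuple; the field names and the final selection rule are hypothetical, not prescribed by the method.

```python
from dataclasses import dataclass

@dataclass
class DesignOption:
    name: str          # e.g. "LAH=2, TBD=48" (hypothetical label)
    area: float        # logic area after synthesis
    energy: float      # energy per decoded bit
    throughput: float  # achievable throughput

def dominates(a, b):
    """a dominates b if it is no worse in every dimension and better in one."""
    no_worse = (a.area <= b.area and a.energy <= b.energy
                and a.throughput >= b.throughput)
    better = (a.area < b.area or a.energy < b.energy
              or a.throughput > b.throughput)
    return no_worse and better

def pareto_front(options):
    """Retain only the design options not dominated by any other option."""
    return [o for o in options if not any(dominates(p, o) for p in options)]

def select(options, min_throughput, area_budget):
    """Among feasible Pareto points, pick the most energy-efficient one."""
    feasible = [o for o in pareto_front(options)
                if o.throughput >= min_throughput and o.area <= area_budget]
    return min(feasible, key=lambda o: o.energy) if feasible else None
```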

Preferably in the step of evaluating a functional verification is performed on the technology dependent description in terms of connected logic gates.

In an advantageous embodiment next to the look ahead parameter also the trace-back length is taken into account.

In one embodiment the Viterbi decoder model is a pipeline model. The pipeline model is advantageously part of an application-specific instruction-set processor.

In an advantageous embodiment the excess look-ahead factor, i.e. the difference between the second look-ahead value and the first look-ahead value selected to meet the throughput requirement in the first phase, is exploited to reduce the clock frequency and/or logic core voltage, which will lead to further energy efficiency gain.

For purposes of summarizing the invention and the advantages achieved over the prior art, certain objects and advantages of the invention have been described herein above. Of course, it is to be understood that not necessarily all such objects or advantages may be achieved in accordance with any particular embodiment of the invention. Thus, for example, those skilled in the art will recognize that the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other objects or advantages as may be taught or suggested herein.

The above and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

Brief description of the drawings

The invention will now be described further, by way of example, with reference to the accompanying drawings, in which:

Fig.1 represents a high level overview of Viterbi decoding.

Fig.2 represents a view on the Viterbi decoding pipeline.

Fig.3 represents a 4-input max reworked towards multiple 2-input max operations.

Fig.4 represents an area comparison for a number of Viterbi decoder instances.

Fig.5 represents a design exploration flow.

Fig.6 illustrates a trade-off between area/energy efficiency/throughput.

Detailed description of illustrative embodiments

The present invention will be described with respect to particular embodiments and with reference to certain drawings but the invention is not limited thereto but only by the claims.

Furthermore, the terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequence, either temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

It is to be noticed that the term "comprising", used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression "a device comprising means A and B" should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.

Similarly it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.

It should be noted that the use of particular terminology when describing certain features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to include any specific characteristics of the features or aspects of the invention with which that terminology is associated.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

In the present invention multiple Viterbi decoder implementations are derived given a number of constraints with respect to area, throughput and energy efficiency. An optimal Viterbi decoder solution is then selected within a trade-off analysis space with the following dimensions: 1) logic area, 2) achievable throughput and 3) energy efficiency, the 'optimal' implementation being the best implementation from the set of possible implementations corresponding to the specific requirements. In contrast to the Habib paper, look-ahead is not ruled out as a parameter affecting the determination of an optimal implementation, and energy is considered instead of power consumption.

The proposed approach does not use power as a metric, because with a practical throughput requirement (which might be significantly lower than the maximum possible), a maximum power number does not give an objective measure for comparison. In contrast, energy is considered the only viable efficiency metric because it takes into account the maximum achievable throughput (which may be higher than the required throughput) and offers a fair comparison. The proposed solution clearly identifies multiple design options, among which a look-ahead solution with a certain look-ahead parameter value is one.

Many different configuration options are available for implementing a Viterbi decoder. Table 1 illustrates the impact on area, maximum power consumption, throughput and BER performance of increasing values of two possible parameters, namely the look-ahead parameter and the trace-back length.

Table 1

|                        | LAH ↑ | Trace-back depth ↑ |
|------------------------|-------|--------------------|
| Area                   | ↑     | ↑                  |
| Throughput             | ↑     | =                  |
| Max. Power Consumption | ↑     | ↑                  |
| BER performance        | =     | ↑                  |

In the present invention at least the look-ahead parameter (LAH) is available for optimization. Optionally, the trace-back depth and other parameters (e.g. word width quantization, SMU memory organisation, ...) are taken into account in the design space exploration as well. The invention presents a qualitative exploration of the design space for area and throughput, whereby in addition energy efficiency is considered in the trade-off space. In the proposed approach power is clearly not used as a metric, because with a practical throughput requirement a maximum power number does not give an objective measure for comparison. Energy is considered the only viable efficiency metric because it takes into account the achievable throughput and offers a fair comparison.

It is now explained how the various implementation options can be derived. In order to derive various Viterbi decoder implementations and compare them with respect to area, power and energy, the actual decoding is modelled as a processor pipeline. The choice for such a modelling helps to derive multiple implementation instantiations and explore them in the desired trade-off analysis space. Note however that processor pipeline modelling is not mandatory. An alternative may be, e.g., a dedicated Register Transfer Level (RTL) implementation.

The Viterbi decoding pipeline can be modelled for example inside an ASIP architecture. That approach offers the advantage that the pipeline control overhead can be handled automatically by existing tools (e.g. Target). A view on the decoding pipeline is depicted in Fig.2b starting from Fig.2a, which in essence shows a pipeline implementation of the scheme shown in Fig.1. Per clock cycle two input LLRs can be retrieved and a single (hard-decision) output bit is generated. The decoding itself is decomposed into four distinct pipeline stages:

  • Stage 1: Registering of input LLRs
  • Stage 2: Calculate intermediate values of path metrics
  • Stage 3: Deduct path metrics and next state memory value, together with a maximum index which will be used for trace-back memory update
  • Stage 4: Update trace-back memory and determine hard decision output bit

There is no need for an external memory. The required data storage, inside the decoder pipeline, is taken care of by means of flip-flops instead of memory macros. Instead of connecting a program memory to the ASIP, the instruction bits are derived from valid LLR input bits and Viterbi state reset. The instruction set decoding is depicted in Table 2, showing the usage of the control signals with priority for Viterbi state reset.

Table 2: Instruction set decoding for stand-alone Viterbi

| Bits | Syntax   | Semantic                               |
|------|----------|----------------------------------------|
| 10   | vit      | Triggers Viterbi decoding pipeline     |
| X1   | vit_init | Resets state of Viterbi decoder engine |
| 00   | nop      | No operation                           |
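
Purely as an illustration of the stand-alone control described above, the following Python fragment mimics the instruction decoding of Table 2; the function and argument names are assumptions, not the actual ASIP description.

```python
def decode_instruction(valid_llr, state_reset):
    """Derive the instruction from the two control bits of Table 2,
    with priority for the Viterbi state reset (pattern X1)."""
    if state_reset:      # 'X1' -> vit_init
        return "vit_init"
    if valid_llr:        # '10' -> vit
        return "vit"
    return "nop"         # '00' -> nop

# Example: a reset request wins even when a valid LLR pair is present.
print(decode_instruction(valid_llr=True, state_reset=True))   # vit_init
```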

In the case of a look-ahead Viterbi decoder the pipeline structure and instruction set decoding remain fully identical. The only differences come from the fact that the look-ahead implementation is applied and that, if e.g. a look-ahead factor of 2 is considered, four input LLRs are retrieved. Hence, the maximum over four values needs to be derived. The 4-input max is replaced by six 2-input max operations, which can be conducted in parallel and the maximum can be found by logical combinations of these 2-input max operations as depicted in Fig.3.

With this implementation of the 4-input max one tries to keep the latency close to that of the straightforward radix-2 solution (i.e. without look-ahead, Z=1), allowing the throughput to be doubled for the same target clock frequency. Increasing the clock too much for a given technology leads to increased area and power consumption. In order to meet the demanding latency and throughput constraints it is necessary to consider optimization techniques like look-ahead while maintaining the same clock constraints as in the straightforward radix-2 solution.
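
The decomposition of Fig.3 can be mirrored in software as follows: all six pairwise comparisons are independent (in hardware they run in parallel), and the maximum value and its index, needed for the trace-back update, follow from a logical combination of the comparison results. This Python sketch is only a behavioural analogue of the circuit.

```python
from itertools import combinations

def max4_parallel(v):
    """4-input maximum built from six 2-input comparisons."""
    assert len(v) == 4
    # six independent 2-input comparisons (parallel comparators in hardware)
    wins = {(i, j): v[i] >= v[j] for i, j in combinations(range(4), 2)}
    # an index is the maximum if it wins (ties resolving towards the lower
    # index) against the three other inputs
    is_max = [all(wins[(i, j)] if (i, j) in wins else not wins[(j, i)]
                  for j in range(4) if j != i)
              for i in range(4)]
    idx = is_max.index(True)
    return v[idx], idx

print(max4_parallel([3, 9, 9, 1]))   # (9, 1): ties resolve to the lowest index
```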

An advantageous way to implement look-ahead is as follows. As already mentioned, the main critical path is located inside the state memory calculation loop, as indicated in Fig.1. The calculation of the next-stage state memory values can be written as

$$\begin{aligned}
\gamma_{1,k+1} &= \max\bigl(\varphi_{1,1,k} + \gamma_{1,k},\ \varphi_{1,2,k} + \gamma_{2,k}\bigr)\\
\gamma_{2,k+1} &= \max\bigl(\varphi_{2,1,k} + \gamma_{3,k},\ \varphi_{2,2,k} + \gamma_{4,k}\bigr)\\
&\;\;\vdots\\
\gamma_{S,k+1} &= \max\bigl(\varphi_{S,1,k} + \gamma_{S-1,k},\ \varphi_{S,2,k} + \gamma_{S,k}\bigr)
\end{aligned}$$

where

  • $\gamma_{x,y}$: state memory value x at iteration y
  • $\varphi_{n,m,y}$: intermediate path metric value {n,m} at iteration y (itself containing a sum/subtraction of two LLR input values)

When rewriting using the convention max → ⊕ and add → ⊗, one can state

$$\gamma_{1,k+1} = \bigl(\varphi_{1,1,k} \otimes \gamma_{1,k}\bigr) \oplus \bigl(\varphi_{1,2,k} \otimes \gamma_{2,k}\bigr)$$

For a number of states S equal to 64, the matrix-form notation becomes

$$\begin{bmatrix}\gamma_{1}\\\gamma_{2}\\\vdots\\\gamma_{33}\\\gamma_{34}\\\vdots\\\gamma_{64}\end{bmatrix}_{k+1}
=
\begin{bmatrix}
\varphi_{1,1} & \varphi_{1,2} & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & \varphi_{2,1} & \varphi_{2,2} & \cdots & 0 & 0\\
\vdots & & & & & & \vdots\\
\varphi_{33,1} & \varphi_{33,2} & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & \varphi_{34,1} & \varphi_{34,2} & \cdots & 0 & 0\\
\vdots & & & & & & \vdots\\
0 & 0 & 0 & 0 & \cdots & \varphi_{64,1} & \varphi_{64,2}
\end{bmatrix}
\otimes
\begin{bmatrix}\gamma_{1}\\\gamma_{2}\\\vdots\\\gamma_{33}\\\gamma_{34}\\\vdots\\\gamma_{64}\end{bmatrix}_{k}$$

or, more compactly,

$$\Gamma_{k+1} = \Lambda_{k} \otimes \Gamma_{k}$$

which allows

$$\Gamma_{k+2} = \Lambda_{k+1} \otimes \Lambda_{k} \otimes \Gamma_{k}$$

Due to the special form of the Λ matrix, one can write

$$\Lambda_{k+1} \otimes \Lambda_{k} =
\begin{bmatrix}
\beta_{1,1} & \beta_{1,2} & \beta_{1,3} & \beta_{1,4} & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & \beta_{2,1} & \beta_{2,2} & \beta_{2,3} & \beta_{2,4} & \cdots & 0\\
\vdots & & & & & & & & & \vdots\\
\beta_{33,1} & \beta_{33,2} & \beta_{33,3} & \beta_{33,4} & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & \beta_{34,1} & \beta_{34,2} & \beta_{34,3} & \beta_{34,4} & \cdots & 0\\
\vdots & & & & & & & & & \vdots\\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & \beta_{64,3} & \beta_{64,4}
\end{bmatrix}$$

in which every row contains four non-zero entries, so that two trellis stages are merged into a single transition matrix.
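
The ⊕/⊗ reformulation can be checked numerically with a small (max, +) algebra sketch in Python: the 'product' Λ_{k+1} ⊗ Λ_k pre-combines two trellis stages into a single transition matrix, which is then applied once per double step. The construction of Λ from the intermediate path metrics follows the two-predecessor structure discussed above and is included here only for illustration; a hardware implementation would of course exploit the sparsity instead of the dense products used below.

```python
NEG_INF = float("-inf")

def maxplus_matmul(A, B):
    """(max, +) matrix product: (A ⊗ B)[i][j] = max_k (A[i][k] + B[k][j])."""
    n = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def maxplus_matvec(A, x):
    """(max, +) matrix-vector product, i.e. one ACS update over all states."""
    return [max(A[i][k] + x[k] for k in range(len(x))) for i in range(len(A))]

def lambda_matrix(phi, S):
    """Build Λ_k: row s has exactly two finite entries, at its two predecessor
    states (2s mod S and 2s+1 mod S); phi[s] holds the two corresponding
    intermediate path metric values."""
    A = [[NEG_INF] * S for _ in range(S)]
    for s in range(S):
        A[s][(2 * s) % S] = phi[s][0]
        A[s][(2 * s + 1) % S] = phi[s][1]
    return A

# Two-step look-ahead (Z = 2): Γ_{k+2} = (Λ_{k+1} ⊗ Λ_k) ⊗ Γ_k
# combined = maxplus_matmul(lambda_k1, lambda_k)   # every row gets 4 finite entries
# gamma_k2 = maxplus_matvec(combined, gamma_k)
```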

This principle is applicable to multiple levels of look-ahead. A summary of the computational effort, both for the straightforward and the look-ahead implementation, is given in Table 3 below. With increasing look-ahead factor Z, the throughput increases linearly, while the complexity increases quadratically with respect to the number of additions and maximum calculation inputs. Note that one can exploit (part of) the throughput increase to add extra pipelining inside the computational loop.

Table 3

|                               | No look-ahead | Look-ahead factor Z (radix-2^Z) |
|-------------------------------|---------------|---------------------------------|
| Number of LLR inputs          | 2             | 2 × Z                           |
| Additions                     | 2 × S         | 2^Z × S                         |
| Max operations                | S             | S                               |
| Input ports to max operation  | 2             | 2^Z                             |

By way of example, Fig.4 presents an area comparison of four Viterbi decoder instances implemented in a commercial CMOS technology: with a look-ahead factor of 2 or 1 (i.e. without look-ahead, i.e. a radix-2 solution) and a trace-back length of 64 or 48. The figure clearly illustrates that a look-ahead implementation indeed adds complexity to the path metric calculation, which results in an increased area; it roughly requires a doubling in complexity. Note that next to the path metric calculation the trace-back memory also takes a considerable area. The choice of the trace-back length affects the implementation of both the trace-back memory and the path metric calculation.

A high-level flow chart of the approach is depicted in Fig.5. The starting point is the high-level model, which allows instantiation of a more specific Viterbi decoder model identified by the defined design specification (i.e. the polynomials). Next, by taking into account the design parameters (e.g. look-ahead, varying trace-back depth and soft LLR input quantization), different Register Transfer Level (RTL) versions of a Viterbi decoder, defined based on the processor pipeline model, are derived and then verified. Tools can be used for obtaining the set of possible RTL implementations and verifying them functionally. The RTL implementation serves as input for an evaluation of area, throughput and energy. An evaluation of these results within a three-dimensional trade-off space allows determining the 'optimal' Viterbi decoding solution.
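
The derivation of the different RTL versions can be pictured as a simple sweep over the design parameters, after which each point is pushed through the characterisation flow of Fig.5. The Python sketch below is illustrative only: the parameter ranges are hypothetical, and generate_rtl / characterise stand in for the RTL generation and synthesis/simulation tools.

```python
from itertools import product

# Hypothetical parameter ranges; the design specification (polynomials) is fixed.
LOOK_AHEAD = [1, 2]        # Z = 1 means no look-ahead (plain radix-2)
TRACE_BACK = [48, 64]      # trace-back depth in trellis stages
LLR_BITS   = [4, 5, 6]     # soft LLR input quantization

def explore(generate_rtl, characterise):
    """Sweep the design parameters and characterise every resulting RTL
    instance in terms of area, throughput and power, converted to energy."""
    results = []
    for z, tbd, q in product(LOOK_AHEAD, TRACE_BACK, LLR_BITS):
        rtl = generate_rtl(look_ahead=z, trace_back_depth=tbd, llr_bits=q)
        area, throughput_mbps, power_mw = characterise(rtl)
        energy_nj_per_bit = power_mw / throughput_mbps   # mW / Mbps = nJ/bit
        results.append({"LAH": z, "TBD": tbd, "LLR bits": q, "area": area,
                        "throughput": throughput_mbps,
                        "energy": energy_nj_per_bit})
    return results
```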

The necessary steps to characterize a Viterbi decoder implementation within this three-dimensional trade-off space are as follows:

  1) logic synthesis towards a technology dependent net list (resulting in an area and throughput characterization),
  2) functional verification combined with logging of signal toggle information (comprising the signal transitions over time during the decoding operation), and
  3) power estimation using the net list and toggle information obtained in the previous steps, which allows determining the power and calculating the energy afterwards.

These steps are further detailed in the next paragraphs.

In a first step, synthesis towards a commercially available technology allows deriving the area occupation of the investigated solution. This synthesis step transforms the RTL code into a technology dependent net list. The commercial synthesis tool takes, next to the RTL code, also technology library information into account. With the clock frequency as input to the synthesis tool and the timing report as output, one can derive the throughput of the Viterbi decoding implementation. Note that this throughput is determined by the achieved clock frequency on the one hand and by the possible usage of look-ahead on the other hand. For example, to perform the analysis a commercial 40 nm technology with a frequency target of 800 MHz may be selected. The selected frequency target matches what the selected technology maximally has to offer, taking into account the logical depth in the critical path of the architecture. A higher frequency would lead to high area and power penalties, whereas a lower frequency would lead to underutilization of the technology.

Secondly, based on the RTL code a simulation and verification is performed, allowing validation of the functional correctness. Next to the RTL functional verification, data signal logging is performed, which serves as input to the power analysis performed in the next step.

Finally, power estimations are based on simulations taking into account the net list output by the synthesis together with the activity information from the functional simulation. These power estimations are carried out using commercial tools and technology library information. Based on the activity information (which includes the logic transitions), the power consumed inside each of the gates of the net list can be determined. Once the power number is obtained, one can derive the energy by dividing the power by the throughput. As already mentioned, energy is the only viable global metric to objectively compare different Viterbi decoding solutions. Due to the relatively small component of leakage power in the technology nodes targeted here, compared to switching/dynamic power, in this embodiment only the latter is considered for the conversion towards energy. However, the same methodology can be applied in technologies where leakage power is not negligible compared to switching/dynamic power. In this latter case, the only difference is that both components should be computed and added up prior to calculating the energy. This will also influence the exploration step, though, because the design options will be located at different positions. However, the principles of the exploration remain reusable.

During the exploration, design options can be changed and the flow to obtain area, timing and power, as described before, is followed. The conversion from power to energy is performed by dividing the power by the throughput. This energy, together with the area report from the synthesis and the achievable throughput, yields a point in the trade-off analysis space. An example of the Viterbi decoder implementations to be analysed within the trade-off space is described below.

In order to select a Viterbi decoder implementation from the set of design options according to this invention, a first step involves determining a first value for the look-ahead parameter. This is based on the throughput requirement. The following example is given as an illustration. Within the WLAN scenario the maximum throughput requirement per Viterbi decoder instance is equal to 600 Mb/s. An acceptable implementation of a Viterbi decoder (without look-ahead) can achieve one output bit per clock cycle. Within the selected technology node (e.g. 40 nm) this may be achievable even without applying look-ahead techniques. Such a result would not incite the skilled person to further explore the design space with respect to look-ahead. However, as will become apparent below, considering look-ahead in the further design exploration may indeed lead to a more energy efficient solution.
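
The determination of this first look-ahead value amounts to a small calculation, sketched below in Python with the numbers of the example (600 Mb/s requirement, 800 MHz clock, one output bit per cycle per look-ahead level); the function name is of course only illustrative.

```python
import math

def first_look_ahead(required_mbps, clock_mhz, bits_per_cycle=1):
    """Smallest look-ahead factor Z whose throughput
    (clock * bits_per_cycle * Z) meets the required decoder output rate."""
    return max(1, math.ceil(required_mbps / (clock_mhz * bits_per_cycle)))

print(first_look_ahead(600, 800))   # -> 1: the WLAN requirement is met without look-ahead
```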

As already set out above, the design options explored in the considered example involve the look-ahead factor and the trace-back depth. Area numbers mapped onto a commercial 40nmG technology are given for the logic part only. All memories and registers linked to the Viterbi decoding are synthesized; no memory macros are used. For these different Viterbi decoder implementations, with different design options, the analysis described above and summarized in Fig.5 is applied. The steps include RTL generation, synthesis, functional verification and gate-level simulation. This results in an area, throughput and power number for each of the decoder implementations. As mentioned before, energy is the only objective global comparison metric and hence it is derived based on throughput and power consumption. An overview of the results for the different implementation forms, in a commercial 40 nm technology, can be found in Table 4. Here, the clock for each decoder implementation is assumed to be 800 MHz, as motivated earlier for the 40 nm technology used in this illustration.

Table 4

| implementation  | cell area [µm²] | leakage [mW] | dynamic [mW] | throughput [Mbps] | energy [nJ/bit] |
|-----------------|-----------------|--------------|--------------|-------------------|-----------------|
| Vit no LAH TB64 | 34305           | 1.22         | 249.52       | 800               | 0.312           |
| Vit no LAH TB48 | 27454           | 0.97         | 146.27       | 800               | 0.183           |
| Vit LAH TB64    | 55481           | 2.05         | 283.83       | 1600              | 0.177           |
| Vit LAH TB48    | 44406           | 1.61         | 171.09       | 1600              | 0.107           |
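
As a sanity check of the energy column, each entry of Table 4 follows directly from the dynamic power and throughput columns by the division explained earlier (mW divided by Mbps gives nJ/bit):

```python
def energy_nj_per_bit(dynamic_power_mw, throughput_mbps):
    """Energy per decoded bit: power divided by throughput (mW / Mbps = nJ/bit)."""
    return dynamic_power_mw / throughput_mbps

# dynamic power [mW] and throughput [Mbps] taken from Table 4
table4 = {
    "Vit no LAH TB64": (249.52, 800),    # -> 0.312 nJ/bit
    "Vit no LAH TB48": (146.27, 800),    # -> 0.183 nJ/bit
    "Vit LAH TB64":    (283.83, 1600),   # -> 0.177 nJ/bit
    "Vit LAH TB48":    (171.09, 1600),   # -> 0.107 nJ/bit
}
for name, (p, t) in table4.items():
    print(f"{name}: {energy_nj_per_bit(p, t):.3f} nJ/bit")
```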

The results shown in Table 4 can be summarized in a table similar to Table 1, as shown below in Table 5. Applying an increased level of look-ahead is shown to be beneficial with respect to energy. This observation could not be made by only considering power consumption.

Table 5

|                  | LAH ↑ | Trace-back depth ↑ |
|------------------|-------|--------------------|
| Area             | ↑     | ↑                  |
| Throughput       | ↑     | =                  |
| Energy           | ↓     | ↑                  |
| BER performance  | =     | ↑                  |

A graphical representation of the trade-off based on area, energy and throughput with normalized axes is depicted in Fig.6. It is clear from the exploration that there are multiple interesting implementation options present in the solution space. The best implementation may be chosen based on the relative importance of the different optimization criteria, such as area, energy, BER, throughput and flexibility.

When considering the trade-off analysis depicted in Fig.6, some optimal points can be identified. Some trade-offs present in these solutions are now explained. In case a solution is highly area-constrained and the achievable throughput can be reached without look-ahead, the optimal solution may be a traditional streaming radix-2 implementation. In this case applying look-ahead can merely be seen as a possible way to boost the throughput performance. For this particular solution energy is then not of high importance. As an example of the reasoning in a highly area-constrained mode, the results depicted in Fig.6 show that the solution without look-ahead, identified with triangle ABC, gives a better trade-off than the solution with look-ahead, identified with triangle DEF.

The resulting energy efficiency and area are still dependent on the trace-back depth, which is an extra trade-off that can be made depending on the targeted BER performance.

On the other hand, when area can be sacrificed in order to achieve better energy efficiency, a look-ahead implementation is clearly an advantageous option. In case the targeted throughput is achieved anyway, it may not be required to have the look-ahead implementation for the sake of throughput performance. This is clearly shown in Fig.6, where the solution with look-ahead, identified with triangle DEF, has a clear energy advantage over the solution without look-ahead, identified with triangle ABC. Note that look-ahead as a technique was not required merely to enhance the throughput.

The throughput increase offered by the look-ahead principle could be utilized in many forms. One possibility is to employ the throughput increase in order to meet the target standard. Further, the increase can be exploited to shut down the decoder more quickly, saving on leakage. A second possibility is to lower the clock frequency accordingly in case the extra throughput is not needed at all. This however leads to almost identical points in the trade-off analysis space. The synthesis can then be done with a lower target frequency. However, going for a lower target clock would make sense in case of a further parallelization of the decoder architecture and removal of more pipelining to increase the logic between two pipeline stages. The complexity (hence area) would then increase more than linearly. A third option is to lower the frequency target combined with a lower logic voltage. In contrast to the second possibility, the possible underutilization of the technology (through selecting a lower frequency target) is used to apply a lower logic voltage. Clearly the possible area gain will be lower than with the second possibility. Energy efficiency, on the other hand, improves quadratically with the lower logic voltage. Overall, a reduction of the trace-back depth leads to a solution which has a lower area and better energy efficiency, although there is a lower bound for this TBD length based on the desired BER performance.
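
The third option can be quantified with the usual first-order CMOS relations: dynamic power scales with C·V²·f and throughput with f, so the energy per bit scales, to first order, with the square of the logic supply voltage. The Python sketch below is a rough model only; the example voltages are assumptions, not values from any characterised technology.

```python
def scaled_energy(base_energy_nj, v_scale):
    """First-order dynamic-energy model: energy per bit is proportional to the
    square of the logic supply voltage and, to first order, independent of the
    clock frequency (power ~ C*V^2*f, throughput ~ f)."""
    return base_energy_nj * v_scale ** 2

# The excess look-ahead throughput (e.g. 1600 Mbps available vs. 600 Mbps needed)
# allows lowering the clock and, say, the core voltage from a nominal 1.1 V to 0.9 V:
print(scaled_energy(0.107, v_scale=0.9 / 1.1))   # ~0.072 nJ/bit (illustrative)
```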

Viterbi decoding is present in many of the existing wireless standards (with WLAN and WiMAX as two examples). Given the number of standards and modes that need to be supported, flexible implementations are becoming a viable option, for which Application-Specific Instruction-set Processors (ASIPs) are commonly proposed. The selected Viterbi decoder implementation, certainly because of the specific pipeline implementation form, can be part of such a flexible ASIP architecture for a multi-standard, multi-mode error decoder engine.

The above explanation focused on the Viterbi decoder requirements driven by the WLAN standard. However, the proposed approach can readily be generalised, as the skilled person will immediately recognize. Hence, the conclusions are applicable to other Viterbi decoding requirements.

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. The foregoing description details certain embodiments of the invention. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the invention may be practiced in many ways. The invention is not limited to the disclosed embodiments.

Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.
