Wireless Base-Station Receivers
Sridhar Rajagopal Srikrishna Bhashyam Joseph R.Cavallaro Behnaam Aazhang
Rice University
Center for Multimedia Communication
Department of Electrical and Computer Engineering MS-366
6100Main St.,Houston,TX77005-12.
sridhar,skrishna,cavallar,aaz@rice.edu
Abstract
A real-time VLSI architecture is designed for Multiuser Channel Estimation,one of the core baseband processing operations in wireless base-station receivers.Future wireless base-station receivers will need to use sophisticated algorithms to support extremely high data rates and mul-timedia.Several features in these algorithms that can help meet real-time requirements are not utilized effectively in DSPs.These features,such as bit level arithmetic and parallel structure,can be revealed and well exploited by task partitioning the algorithms.We modify the channel estima-tion algorithm for a reduced complexityfixed-point hardware implementation.We show the com-plexity and hardware required for three different area-time tradeoffs:an area-constrained,a time-constrained and an area-time efficient architecture.The area-constrained architecture achieves low data rates with minimum hardware,which may be used in’picocell’base-stations.The time-constrained solution exploits the entire available parallelism and determines the maximum theo-retical data rates.The area-time efficient architecture meets real-time requirements with minimum area overhead.The orders-of-magnitude difference between area and time constrained solutions reveals significant inherent parallelism in the algorithm.All VLSI solutions exhibit better time performance than a previous DSP implementation.
Next generation wireless communication systems[11]have been designed to integrate features such as high data rates varying up to2Mbps,Quality-of-Service(QoS)guarantees and multimedia in the existing communication framework.This requires the implementation of highly sophisti-cated and complex algorithms in real-time.There is a strain on existing hardware resources to meet the requirements of these algorithms.Many algorithms proposed for next generation communica-tion receivers,which are designed to give good performance in terms of error rates,are discarded for a direct implementation as they have high computational complexity.A typical DSP or gen-eral purpose processor implementation[5]is unable to fully exploit the parallelism and bit level computations available in these algorithms.Hence,there is a need to efficiently decompose these algorithms into different tasks and study their mapping on hardware to accelerate their implemen-tation.The algorithms are also computationally expensive,involving subroutines such as matrix inversions,which requirefloating point accuracy.
We develop VLSI architectures for multiuser channel estimation,one of the most computation-ally challenging baseband tasks in the base-station receiver.There have been several hardware implementations for multiuser detection[4].Most implementations either assume perfect channel estimation or assume single user estimation.This assumes that channel estimation can be done in real-time and the data rates are considered to be dependent only on the detector.However,many advanced channel estimation schemes[3,6]have high computational complexity due to matrix in-versions involved and cannot be performed in real-time and typically,simpler single-user sliding correlator structures are used for channel estimation[13].Jointly performing multiuser channel estimation and detection is shown to have lower computational complexity and better error rate performance than performing estimation and detection separately.We modify the previous chan-nel estimation algorithm[15],based on the maximum likelihood principle and develop an iterative scheme[12],which is computationally effective,suitable for afixed point implementation and is equivalent to matrix inversion in terms of error rate performance.
Architectures for the mobile handset have similar algorithms for implementation.’Blind’ver-sions[16]of these algorithms are available for the mobile handset.In this case,the channel is synchronous and only a single user has to be detected.However,the architecture design consider-ations for the mobile handset has to be power-efficient[9]and this also needs to be accounted for in the design as a critical parameter.In this paper,we concentrate on the base-station receiver and neglect power issues in our considerations.
The organization of this paper is as follows.The next section provides an introduction to Multiuser channel estimation and its real-time requirements.The algorithm is modified for a re-duced complexityfixed-point solution,without loss in error rate performance.Section3shows the task partitioning of the algorithm and the various area-time trade-offs.Area-constrained,time-constrained and area-time efficient architectures are presented.A comparison is also made with a previous DSP implementation.The conclusions and future directions are presented in Section4.
The next generation wireless communication systems[1]use a Wideband Code-Division Mul-tiple Access(W-CDMA)scheme as the multiple access protocol for communication.This scheme uses spread spectrum signaling,where each active user uses a unique signature sequence(spread-ing code)to modulate the data bits.The base-station receives a’mixed’version of signals of all active users after they travel through different paths in the channel.These channel paths induce different delays,attenuations and phase-shifts to their signals and the mobility of the users causes these parameters to change over time(called fading).Moreover,the signals from different users in-terfere with each other(Multiple Access Interference)adding to the Additive White Gaussian noise present in the channel.Multiuser channel estimation refers to the joint estimation of these unknown parameters for all users to mitigate these undesirable effects and accurately detect the received bits of different users.
2.1.Real-time requirements
Data transmission in the next generation wireless systems[11]is done in frames of10ms.The data transmission can be done in variable rates depending on the spreading factor(),as shown in Table1.The table gives an example of the number of bits in a frame for spreading factors ofBase-station Receiver
Figure1.Simplified view of the base station receiver
4,32and256chips per data bit.We assume Binary Phase Shift Keying(BPSK)modulation.To support real-time,the number of bits detected per frame should be at the rate of transmission.We implement our design assuming a spreading factor of32chips per data bit.This implies that the real-time requirement of the joint estimation and detection scheme is to detect input data bits at a rate of128Kbps i.e.a bit of every user has to be estimated and detected in less than7.8125s, assuming that the estimation and detection blocks will be pipelined.A spreading factor()of32
can support32users()and we shall use and in our design specifications.
2.2.Channel model
The model for the received signal at the output of the channel[15]can be expressed as
(1) where are the received signal for one bit duration due to the bits of all asynchronous users,spread with a spreading factor,are the bits of users to be detected,is the actual channel,containing information about the spreading codes,attenuation and delays from the various paths,is the noise,which is assumed to be Additive White Gaussian Noise(AWGN)and is the time index.The channel is to be estimated and used for accurately detecting the received data bits of different users.
2.3.Maximum likelihood channel estimation
The channel estimation and detection block in the base-station receiver is shown in Figure1.The channel information is obtained by transmission of a pilot signal,which is a sequence of bits that is known at the receiver.The received pilot signal()is compared with the known bits to form an estimate of the channel.The decisions from the multiuser detection block are fed back to the
Table1.Proposed data rates for next generation communication systems
Spreading Factor Data Rates
(N)(Bits per second)
10240
1280
160channel estimation block along with the received data bits(),delayed by the time required for detection,for tracking the channel estimates when the pilot signal is absent.
The derivation of the joint estimation and detection algorithm is detailed in[15].The multiuser channel estimation algorithm is based on the maximum likelihood principle,where the probability of received input given the transmitted bits is maximized.The computations that occur during the estimation phase[15]are:
(3) where is the length of the pilot sequence,is the cross-correlation matrix between the synchronization bits and the received signal and is the auto-correlation matrix.The correlation matrices are averaged over a window of length.The channel estimate can be obtained by solving
(4) The channel estimate is fed to the detection block for detecting the unknown bits.The detected bits,,which are obtained at the detection stage,are fed back to the estimation block for tracking purposes for a fading channel and to the rest of the processing blocks in the base-station receiver.
It is difficult to maintain numerical stability for matrix inversion,using decomposition tech-niques such as QR or LU,withfixed-point.Also,tracking requires the rebuilding of the correlation matrices and computing the inverse every time.This is computationally inefficient as this implies a matrix inversion for every update.Hence,a better scheme is needed for meeting the real-time requirements.
2.4.Iterative scheme for channel estimation
A direct computation of the exact maximum likelihood channel estimate involves the compu-tation of the correlation matrices and,and then the computation of at the end of the preamble(pilot).The computation of the inverse at the end of the preamble is computationally expensive and delays the start of detection beyond the length of the preamble until the estimate has been computed and this delay limits the data rate.In our iterative algorithm,we approximate the maximum likelihood solution based on the following ideas.
1.The product can be directly approximated using iterative algorithms such as the
gradient descent algorithm[7].
2.The iterative algorithm can be modified to update the estimate as the preamble is being re-
ceived rather than waiting till the end of the preamble.This means that the computation per bit can be reduced by spreading it over the entire preamble.
We present an iterative scheme based on the method of steepest descent for the matrix inversion. The channel estimate is updated iteratively every bit and is available immediately after the end of the pilot sequence.The updating of the estimate is done using the iterative scheme as shown
Tracking Window
Figure 2.Task decomposition of multiuser channel estimation algorithm
below:
(5)
(6)
(7)
This scheme is suitable for tracking,which is shown by the removal of the oldest bit in the window of length L as the new bit is received.Tracking is simpler in this iterative scheme as the channel estimates and correlation matrices are updated iteratively.During the initial pilot phase,tracking is absent and the equations for correlation (equations 6,7)reduce to the previous estimation scheme using inversion (equations 2,3).Another advantage of this scheme is that it lends itself to a simple fixed-point solution,which was difficult to achieve using the previous inversion scheme.There are no divisions except for the multiplication by the convergence parameter,,which can be imple-mented as a right-shift,by making it a power of two.This can be done as the algorithm is not highly dependent on the exact value of .This algorithm shows good convergence behavior as
is a symmetric,positive definite matrix and has a small condition number.The iterative scheme gives the same error rate performance as that of the original scheme and has been verified using simulations [12].
The task partitioning of the algorithm into sub-blocks is carried out for pipelining and for uti-lizing the inherent parallelism present in the algorithm.We implement different mappings of the multiuser channel estimation algorithm to hardware to study the complexity and hardware require-ment tradeoffs.We discuss an area-constrained architecture,a time-constrained architecture and an area-time efficient solution.
3.1.Task decomposition
The task decomposition of the algorithm is as shown in Figure 2.The blocks that are pipelined are shown on the horizontal time axis while the blocks that have coarse-grained parallelism are
b
r
r0
Figure3.Area-constrained VLSI architecture
shown along the vertical axis.Thefigure shows that the correlation matrices can be formed in parallel and this can be pipelined with the iteration of the channel estimate matrix.The two mul-tiplexers shown are for selecting between the known pilot and the received pilot signal during the training mode and the detected bits and the received data signal in the tracking phase.The tracking window is the history buffer and keeps the most recent samples of the bits as well as the re-ceived signal.The sizes of the sub-blocks are shown along with their word lengths in Figure2.The dynamic range of the input is dependent on Signal-to-Noise ratio(SNR),the Multiple Access Inter-ference(MAI)and the number of users in the system.A detailed analysis is required to determine the word-length of the input.For our design,we assume that the received signal is quantized by an A/D converter to have afixed precision word-length of8bits as a similar dynamic range analysis [8,18]for detection shows the input range to be8bits.However,the analysis of the algorithm pre-sented here is independent of the word-length.Also note that the blocks,and are complex-valued while and are real-valued.For the sake of convenience,we henceforth represent the current inputs,as,and,as,.
A typical architecture has window length L=150,spreading gain N=32and the number of users
Table2.Hardware requirements for an area-constrained architecture Blocks Full Adder Cells Total
1*8-
1*8*2
*2
Total Full Adder Cells
Memory/Reg Usage Total
r,r0*2
*2
Net Memory Reqd.(in Bits)N=K=32
Figure4.Time-constrained VLSI architecture block diagram
K=32.All the architectures shown here assume a single-cycle8-bit multiplication and addition. We assume that a Wallace or Dadda multiplier tree[17]is used for multiplication requiring
1-bit Full Adders for a n-bit multiplication.Since the multiplication by in equation7results in truncation of the output and need not be highly accurate for numerical stability,a truncated multiplication using significantly less hardware[14]can be used.The delays of blocks such as multiplexers and gates are assumed to be included in the single-cycle delay.For an area estimate of the architectures,we consider the number of1-bit Full Adder Cells in the design.We also assume all blocks can be pipelined effectively.It can be observed from Figure2that the bottleneck in the pipeline is the matrix multiplication in equation7and we shall concentrate on this part in our architectures.
3.2.Area-constrained architecture
An area-constrained architecture of the multiuser channel estimation scheme is as shown in Fig-ure3.The architecture is shown for computing only the real part of the channel estimate.Since there are no multiplications between two complex numbers,the architecture can be assumed to be replicated for the imaginary part.In this architecture,all matrix elements are computed an el-ement at a time.The word lengths of the various blocks are as shown in Figure3.The dotted lines indicate the parts corresponding to the equations5-7.The left part shows the calculation of the auto-correlation and cross-correlation matrices(equations5-6)whereas the right part shows the calculation of the iteration loop(equation7).
To form the outer product update,we take advantage of the single bit nature of the data and replicate the bits such that for forming the element of,the and bit of are XNORed(multiplication between+1and-1is an XNOR operation)and sent to a counter loaded with the previous value of which increments or decrements by one.The element of the outer product update is calculated,negated and sent to the counter,which again increments or decrements by1(Up/Down).The multiplexer also has an enable signal such that the output is tristated during the pilot phase,when(equation6)is not computed.The matrix is then updated with a store signal.
A MAC(Multiply and Accumulate)unit is used to compute the inner product of the matrix multiplication.If we design a MAC unit such that the multiplication and addition are
b (2K)
Rbb(i,j)
bb T (i,j)
Rbb(i,i)
1Figure 5.Elements in the auto-correlation matrix block
pipelined with the other blocks in the figure,computing an element of takes or
cycles.The corresponding element of is updated similarly with an adder.The multiplication by is then carried out with the help of a right shift and the new element of comes out
of the pipeline every
cycles.The MUX-DEMUX circuit loads from and stores in for every or 128,000cycles (the time taken to compute the entire matrix)and then switches.The hardware requirements for an area-constrained architecture are as shown in Table 2.The design requires an 8-bit counter,an 8-bit multiplier,three 8-bit adders and two 16-bit adders (for the MAC
and the subtraction by
),about 112,000bits of memory and cycles.3.3.Time-constrained architecture
The block diagram of a time-constrained architecture is as shown in Figure 4.In this architecture,
the available parallelism in the algorithm is exploited to the maximum extent.Hence,all the ele-ments needed to perform a parallel multiplication are computed simultaneously and are pipelined.
Now,the entire matrices
and are multiplied using an array of multipliers.The entire prod-uct matrix is subtracted by the auto-correlation matrix,
,shifted and a new channel estimate is formed.Thus,as the time taken by the other computations is pipelined with the time for the multiplication,the output can be formed every or 6cycles.
We exploit the bit-level arithmetic and parallel structure of the correlation matrices to form the correlation matrices simultaneously within a cycle.The sub-blocks for the formation of the auto-correlation matrix and cross-correlation matrix are shown in Figures 5and 6.Since the auto-correlation matrix update is a symmetric matrix and all the diagonal elements are 1’s (
Table3.Hardware requirements for a time-constrained architecture
Blocks Full Adder Cells Total
-
*2
*2
Total Full Adder Cells N=K=32
Memory/Reg Usage Total
r,r0*2
32,000
Total Time(Cycles)6
3.4.Area-Time Efficient Architecture
From comparing the above two architectures in Table5,we see that the area-constrained ar-chitecture does not meet real-time requirements while the time-constrained architecture is highly aggressive in area.So,a tradeoff point in the design space needs to be found,which meets the real-time requirements with minimum additional area.This can be done by observing that the major part of the chip area is used the array of multipliers.Hence,instead of computing the entire matrix product in parallel,the product should be computed element by element by doing the inner product in parallel.This would imply or128multipliers.If this was done row by row or column by column,it would require or multipliers,requiring about3600multipliers,which may not be available just for channel estimation.Since the output is computed element-by-element,this would require or2000cycles for the complete channel estimate.The block diagram of the area-time efficient architecture is shown in Figure7.
The hardware requirements for an efficient area-time architecture are as shown in Table4.This design requires Multipliers to compute an element every cycle and16-bit adders for recursive doubling.This design requires about10,000Full Adder Cells andfinds the estimate in cycles.
3.5.Comparisons with DSPs
An architecture comparison of the different VLSI architectures with a DSP is evaluated in this section.Though DSPs and general purpose processors with MMX-enhanced instruction sets can exploit byte-length parallelism,they are inefficient for bit level parallelism.Storage of bits on such a processor is either inefficient as it is stored as bytes or a large overhead is involved in packing and unpacking these bits.Also,the compiler may not take advantage of the fact that most multiplications are with bits and replace them with additions or subtractions.Using a control structure instead also limits the utilization of available parallelism.Also,formation of bit-level matrix updates as seen in the different VLSI architectures is much more effective and simpler to
b (2K*1)
Rbr (2KN*8)
Rbr(i,j)
b(i)Figure 6.Elements in the cross-correlation matrix block
build in hardware with XNOR gates,giving performance with or 1000XNOR gates,while it may take or 1000cycles on a DSP and takes or 1K bytes in memory.
Assuming a 500MHz clock for the VLSI architectures,the projected time required to compute the channel estimate along with the hardware required for 32users and a spreading code of length 32is as shown in Table 5.This is compared with the implementation of the previously existing algorithm (equation 4),on a TI TMS320C6701Evaluation Module,operating at 166MHz.The DSP implementation of the Multiuser Channel estimation algorithm using the previously existing schemes is shown to require 763401cycles [5],which corresponds to 4.56ms for 15users.Assum-ing that the channel estimate is updated for every block of 10bits,and extending it linearly to 32users,this corresponds to a time requirement of 0.97ms or 1.02Kbps.
The inherent parallelism present in the algorithm can be seen from the ratio of time taken for computation by the area-constrained and time-constrained architectures.The area estimates are compared using the number of Full Adder Cells needed in the design,as shown in Table 5.The time difference between the DSP and the VLSI architectures is due to the improvements in the algorithm modifications and the fact that the bit-level and byte-level parallelism are not exploited on the DSPs and the additional memory references.The difference in the processor speed does not play a major role in the time differences.We can observe that the area-constrained architecture does not satisfy real-time constraints of 7.8125s while the time-constrained architecture is far too aggressive.The area-time efficient architecture meets the next generation real-time constraints by designing the area-time tradeoff in 4s,which is twice the target data rate of 128Kbps.Hence,the clock speed could be reduced by half to 250MHz for power efficiency.From Table 4,it is seen that the time required is directly proportional to the number of users (K)in the system and the spreading factor (N),which are also dependent on each other as seen from Table 1.Hence,the system design also meets real-time requirements for various data rates,such as 1Mbps for 4users with a spreading factor of 4.
We show that a custom VLSI architecture for baseband signal processing in a wireless base-station receiver can be extremely efficient in meeting real-time requirements of the receiver.We discuss three different architectural mappings of multiuser channel estimation,one of the core baseband signal processing algorithms in the receiver.We develop a fixed-point,computationally effective version of the algorithm for a real-time VLSI architecture.The area-constrained architec-ture with minimum hardware could be mapped on 100K gate FPGAs as it requires only 248adder
Figure7.Area-time efficient VLSI architecture
Table4.Hardware requirements for area-time efficient architecture Blocks Full Adder Cells Total
*8-
*8*2
*2
Total Full Adder Cells N=K=32
Memory/Reg Usage Total r,r0*2
*2
Net Memory Reqd.(in Bits)N=K=32Table5.Comparisons between different architectures Architecture Memory(Bytes)Data Rates
2480.262ms
20,000,00012ns
10,0004s
-0.97ms
7.8125s