VLSI Architectures For Communications And Signal


VLSI Architectures for Communications and Signal Processing
Kiran Gunnam
IEEE SSCS Distinguished Lecturer

Outline
Part I
- Trends in Computation and Communication
- Basics: Pipelining and Parallel Processing
- Folding, Unfolding, Retiming, Systolic Architecture Design
Part II
- LDPC Decoder
- Turbo Equalization, Local-Global Interleaver and Queuing
- Error Floor Mitigation (Brief)
- T-EMS (Brief)

VLSI Architectures for Communications and Signal Processing
Design flow: base data processing algorithm -> hardware-friendly algorithm -> VLSI/hardware architecture and microarchitecture.
A systematic design technique is needed to transform communication and signal processing algorithms into practical VLSI architectures.
- Performance of the base algorithm has to be achieved using the new hardware-friendly algorithm.
- Area, power, and speed constraints govern the choice and the design of the hardware architecture.
- Time to design is an increasingly important factor: configurable and run-time programmable architectures.
- More often than not, the design of the hardware-friendly algorithm and the corresponding hardware architecture is an iterative process.
7/21/2013

Communication and Signal Processing Applications
- Wireless personal communication: 3G, B3G, 4G, etc.; 802.16e, 802.11n, UWB, etc.
- Digital video/audio broadcasting: DVB-T/H, DVB-S, DVB-C, ISDB-T, DAB, etc.
- Wired communications: DSL, HomePlug, cable modem, etc.
- Storage: magnetic read channel, flash read channel
- Video compression
- TV set-top box

Convergence of Communications and Semiconductor Technologies
High system performance: increase spectrum efficiency of the modem (in bits/sec/Hz/m^3)
- Multi-antenna diversity
- Beamforming
- Multi-user detection
- Multi-input multi-output (MIMO) systems
- Etc.
High silicon integration
- Moore's Law
- High-performance silicon solutions
- Low power and cost: mobile devices gaining more computation, vision, and graphics capabilities

Challenges in VLSI for Communication and Signal Processing
How to bridge the gap between communication algorithms and IC capabilities: efficient and flexible DSP VLSI methods that take the communication algorithmic requirements into account
- High performance
- Flexibility
- Low energy
- Low cost (design)
- Low cost (area)
While chip performance is increasing, algorithm complexity for new systems is outpacing it.
Courtesy: Ravi Subramanian (Morphics)

Single Processor Performance Trends
FIGURE S.1: Historical growth in single-processor performance and a forecast of processor performance to 2020, based on the ITRS roadmap. The dashed line represents expectations if single-processor performance had continued its historical trend. The vertical scale is logarithmic; a break in the growth rate around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, it has been growing, and is forecast to grow, by a factor of only about 2 per decade. In 2010, this expectation gap for single-processor performance is about a factor of 10; by 2020, it will have grown to a factor of 100.
Note that this graph plots processor clock rate as the measure of processor performance. Other processor design choices also affect performance, but clock rate is a dominant determinant.
Courtesy: NAE Report, "The Future of Computing Performance: Game Over or Next Level?"

Scaling Trends
Courtesy: NAE Report, "The Future of Computing Performance: Game Over or Next Level?"

Why Dedicated Architectures? Energy Efficiency
Courtesy: NAE Report, "The Future of Computing Performance: Game Over or Next Level?"

Why Dedicated Architectures? Area Efficiency
NAE Report recommendation: "Invest in research and development of parallel architectures driven by applications, including enhancements of chip multiprocessor systems and conventional data-parallel architectures, cost effective designs for application-specific architectures, and support for radically different approaches."
Courtesy: NAE Report, "The Future of Computing Performance: Game Over or Next Level?"

Basic Ideas
Parallel processing vs. pipelined processing (figure: four data streams a, b, c, d flowing through four processors P1-P4; colors denote the different types of operations performed).
- Parallel processing: less inter-processor communication, but more complicated processor hardware.
- Pipelined processing: more inter-processor communication, but simpler processor hardware.
Parallel processing and pipelining can be combined: the example would then use 16 processors instead of 4.
Courtesy: Yu Hen Hu
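The trade-off above can be sketched with a toy first-order throughput/latency model (my own illustration, not from the slides): with the same number of units, both techniques buy the same throughput, but they differ in how the work is split.

```python
# Toy first-order model of the 4-processor examples above.
# Assumption: every full operation takes t_op seconds on one processor.

def parallel(n_procs, t_op):
    """n_procs identical processors, each handling its own data stream."""
    throughput = n_procs / t_op      # completed samples per second
    latency = t_op                   # each sample is processed whole
    return throughput, latency

def pipelined(n_stages, t_op):
    """One stream cut into n_stages simpler stages of t_op/n_stages each."""
    t_stage = t_op / n_stages
    throughput = 1 / t_stage         # one sample completes per stage time
    latency = t_op                   # a sample still traverses all stages
    return throughput, latency

# With 4 units and a 100 ns operation, both give 4x throughput:
assert parallel(4, 100e-9)[0] == pipelined(4, 100e-9)[0]
```

The difference the slide points at is not in this arithmetic but in the hardware: the parallel version replicates the full (complicated) processor, while the pipelined version uses simpler per-stage hardware at the cost of inter-stage communication.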

Basic Ideas
Basic micro-architectural techniques: reference architecture (a) and its parallel (b) and pipelined (c) equivalents; reference architecture (d) for time-multiplexing (e). Area overhead is indicated by shaded blocks.
Bora et al., "Power and Area Efficient VLSI Architectures for Communication Signal Processing", ICC 2006

Data Dependence
- Parallel processing requires NO data dependence between processors.
- Pipelined processing will involve inter-processor communication.
Courtesy: Yu Hen Hu

Folding
Concept of folding: (a) time-serial computation, (b) operation folding. Block "Alg" performs some algorithmic operation.
Bora et al., "Power and Area Efficient VLSI Architectures for Communication Signal Processing", ICC 2006

Unfolding
Unfolding transforms a DFG with 1 input and 1 output into a DFG that receives 2 inputs and produces 2 outputs at each time step.
Courtesy: Yu Hen Hu

Block Processing
One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).
- Block vector: [x(3k) x(3k+1) x(3k+2)]
- Clock cycle: can be 3 times longer.
Original (FIR filter):
  y(n) = a x(n) + b x(n-1) + c x(n-2)
Rewrite 3 equations at a time:
  y(3k)   = a x(3k)   + b x(3k-1) + c x(3k-2)
  y(3k+1) = a x(3k+1) + b x(3k)   + c x(3k-1)
  y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)
Define the block vector:
  x(k) = [x(3k) x(3k+1) x(3k+2)]^T
Block formulation:
  y(k) = [a 0 0; b a 0; c b a] x(k) + [0 c b; 0 0 c; 0 0 0] x(k-1)
Courtesy: Yu Hen Hu
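The block formulation above can be checked directly against the sample-by-sample filter. The sketch below (pure Python; the coefficient values and input are illustrative choices, not from the slides) processes one 3-sample block per loop iteration and verifies that the result matches the direct FIR output:

```python
# Block (vectorized) form of y(n) = a*x(n) + b*x(n-1) + c*x(n-2), block size 3.
a, b, c = 0.5, 0.3, 0.2                  # illustrative filter coefficients
A = [[a, 0, 0], [b, a, 0], [c, b, a]]    # multiplies current block x(k)
B = [[0, c, b], [0, 0, c], [0, 0, 0]]    # multiplies previous block x(k-1)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

x = [float(n) for n in range(1, 13)]     # 12 samples = 4 blocks of 3
y_block = []
x_prev = [0.0, 0.0, 0.0]                 # zero initial conditions
for k in range(len(x) // 3):
    xk = x[3 * k:3 * k + 3]
    y_block += [p + q for p, q in zip(matvec(A, xk), matvec(B, x_prev))]
    x_prev = xk

# Reference: direct sample-by-sample FIR with zero initial conditions
xpad = [0.0, 0.0] + x
y_ref = [a * xpad[n + 2] + b * xpad[n + 1] + c * xpad[n] for n in range(len(x))]
assert all(abs(u - v) < 1e-12 for u, v in zip(y_block, y_ref))
```

Each pass of the loop consumes and produces 3 samples at once, which is why the clock cycle of the block implementation can be up to 3 times longer for the same sample rate.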

Systolic Architectures
Figure by Rainier
- Matrix-like rows of data processing units called cells; transport-triggered.
- Example: matrix multiplication C = A*B.
- A is fed in a row at a time from the top of the array and is passed down the array.
- B is fed in a column at a time from the left-hand side of the array and passes from left to right.
- Dummy values are then passed in until each processor has seen one whole row and one whole column.
- The result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.
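The dataflow above can be mimicked in software (a minimal sketch of an output-stationary array, not hardware RTL): the input skewing means cell (i, j) sees operand pair (a[i][k], b[k][j]) at cycle i + j + k, and each cell only accumulates its own element of C.

```python
# Cycle-by-cycle simulation of an output-stationary systolic matrix multiply.
# A's rows flow down, B's columns flow right; skewed inputs make operands
# meet at cell (i, j) at cycle i + j + k. Hypothetical toy example.
def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]      # one accumulator per cell
    for cycle in range(3 * n - 2):       # last useful cycle is 3n - 3
        for i in range(n):
            for j in range(n):
                k = cycle - i - j        # which operand pair arrives now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]
```

The outer `cycle` loop is the global clock: on real hardware all cells fire in the same cycle, which the sequential inner loops only emulate.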

LDPC DECODER

Requirements for Wireless Systems and Storage Systems
Magnetic recording systems:
- Data rates are 3 to 5 Gbps.
- Real-time BER requirement is 1e-10 to 1e-12.
- Quasi-real-time BER requirement is 1e-15 to 1e-18.
- Main channel impairments: ISI, data-dependent noise (jitter), erasures.
- Channel impairments are getting worse with increasing recording densities.
Wireless systems:
- Data rates range from 0.14 Mbps (CDMA2000) to 326.4 Mbps (LTE).
- Real-time BER requirement is 1e-6.
- Main channel impairments: ISI (frequency-selective channel), time-varying fading channel, space-selective channel, deep fades.
- Increasing data rates require MIMO systems and more complex channel estimation and receiver algorithms.
In general, the algorithms used in wireless systems and magnetic recording systems are similar. The increased complexity in magnetic recording systems stems from increased data rates while the SNR requirements are getting tighter.
For ISI channels, the near-optimal solution is turbo equalization using a detector and an advanced ECC such as LDPC.

Introduction to Channel Coding

Some Notation and Terminology
Courtesy: Dr. Krishna Narayanan (Texas A&M)

Shannon Capacity and Channel Codes
- The Shannon limit, or Shannon capacity, of a communications channel is the theoretical maximum information transfer rate of the channel for a particular noise level.
- Random codes with long code lengths achieve channel capacity.
- To construct a random code, pick 2^k codewords of length n at random. The code is guaranteed to be good as k grows large.
- Decoding random codes requires storage of 2^k codewords; there are only about 10^82 (roughly 2^272) atoms in the universe.
- What we want: encoding/decoding complexities that don't increase drastically with k, and storage that does not increase drastically with k.
Randomness vs. structure: random codes are good, but structure is needed to make them practical.
Courtesy: Dr. Krishna Narayanan (Texas A&M)
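The storage blow-up is easy to make concrete. The sketch below (my own illustration; the k and n values are arbitrary) counts the bits a table-lookup decoder for a random code would need:

```python
import math

# Storage for table-lookup decoding of a random code:
# 2**k codewords, each n bits long.
def table_bits(k, n):
    return (2 ** k) * n

# A tiny (20, 10) code is still tractable...
assert table_bits(10, 20) == 20480           # 1024 codewords * 20 bits

# ...but storage doubles with every added information bit, and by
# k = 272 the codeword count alone passes the ~10**82 atoms estimate.
assert math.log10(2 ** 272) > 81
```

This is exactly why the slide asks for structure: structured codes (e.g. LDPC, described next) replace the exponential table with a parity-check matrix and an iterative decoder whose cost grows only polynomially in the block length.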

Coding Theory Advances
There are two kinds of codes: block codes and convolutional codes.
- Block codes: in an (n,k) block code, k bits are encoded into n bits. A block code is specified by a k x n generator matrix G or an (n-k) x n parity-check matrix H. Examples: Hamming, BCH, Reed-Solomon codes. Hard-decision decoding is used; soft decoding is possible, but complex.
- Convolutional codes: can encode an infinite sequence of bits using shift registers. Soft-decision decoding such as Viterbi can achieve optimal maximum-likelihood decoding performance.
- Turbo codes (1993): parallel concatenated convolutional codes.
- Rediscovery: LDPC block codes (1962, 1981, 1998). Near-Shannon-limit codes with efficient iterative soft decoding (message passing).

Progress in Error Correction Systems

LDPC Decoding, Quick Recap
- Variable nodes correspond to the soft information of the received bits.
- Check nodes describe the parity equations of the transmitted bits, e.g. v1 + v4 + v7 = 0; v2 + v5 + v8 = 0 (mod 2), and so on.
- The decoding is successful when all the parity checks are satisfied (i.e., zero).
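The success condition can be stated as a syndrome check. In the sketch below, the H matrix rows are an illustrative toy (only the first two rows follow the example equations above; they are not a code from the slides):

```python
# Syndrome check: decoding succeeds when H * v = 0 over GF(2).
H = [
    [1, 0, 0, 1, 0, 0, 1, 0, 0],   # v1 + v4 + v7 = 0
    [0, 1, 0, 0, 1, 0, 0, 1, 0],   # v2 + v5 + v8 = 0
    [0, 0, 1, 0, 0, 1, 0, 0, 1],   # v3 + v6 + v9 = 0 (illustrative)
]

def syndrome(H, v):
    """All-zero syndrome means every parity check is satisfied."""
    return [sum(h * b for h, b in zip(row, v)) % 2 for row in H]

v_good = [1, 0, 1, 1, 0, 0, 0, 0, 1]   # satisfies all three checks
assert syndrome(H, v_good) == [0, 0, 0]

v_bad = v_good[:]
v_bad[0] ^= 1                          # flip one bit -> first check fails
assert syndrome(H, v_bad) == [1, 0, 0]
```

An iterative decoder evaluates exactly this condition after every iteration and stops early once the syndrome is all-zero.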

LDPC Decoding, Quick Recap
There are four types of LLR messages:
- Message from the channel to the n-th bit node: L_n^(i)
- Message from the n-th bit node to the m-th check node: Q_nm^(i)
- Message from the m-th check node to the n-th bit node: R_mn^(i)
- Overall reliability information for the n-th bit node: P_n^(i)
(Figure: Tanner graph with check nodes m = 0, 1, 2 and bit nodes n = 0..6, showing messages such as R_03, R_23, Q_3 and P_6, with the channel detector feeding L_3.)
Courtesy: Ned Varnica

Decoder Architectures
Parallelization is good, but comes at a steep cost for LDPC.
Fully parallel architecture:
- All the check updates in one clock cycle and all the bit updates in one more clock cycle.
- Huge hardware resources and routing congestion.

Decoder Architectures, Serial
- Check updates and bit updates are done in a serial fashion.
- Huge memory requirement; memory is in the critical path.

Decoder Architectures, Serial

Semi-parallel Architectures
- Check updates and bit updates are done using several units.
- Memory is partitioned by imposing structure on the H matrix.
- A practical solution for most applications.
- Several semi-parallel architectures have been proposed; complexity differs based on architecture and scheduling.

On-the-fly Computation
Our previous research ([1-13]) introduced the following concepts to LDPC decoder implementation:
1. Block-serial scheduling
2. Value reuse
3. Scheduling of layered processing
4. Out-of-order block processing
5. Master-slave router
6. Dynamic state
7. Speculative computation
8. Run-time application compiler (support for different LDPC codes within a class of codes. Class: 802.11n, 802.16e, Array, etc. Off-line reconfigurable for several regular and irregular LDPC codes)
All these concepts are termed on-the-fly computation, as at their core they minimize memory and re-computation by employing just-in-time scheduling. For this presentation, we will focus on concept 4.

Layered Decoder Architecture
Optimized layered decoding with algorithm transformations for reduced memory and computation:

  R_l,n^(0) = 0,  P_n = L_n^(0)    [initialization for each new received data frame]    (9)
  for i = 1, 2, ..., it_max        [iteration loop]
    for l = 1, 2, ..., j           [sub-iteration (layer) loop]
      for n = 1, 2, ..., k         [block column loop]
        [Q_l,n^(i)]^S(l,n) = [P_n]^S(l,n) - R_l,n^(i-1)             (10)
        R_l,n^(i) = f( [Q_l,n'^(i)]^S(l,n'), n' = 1, 2, ..., k )    (11)
        [P_n]^S(l,n) = [Q_l,n^(i)]^S(l,n) + R_l,n^(i)               (12)

Here the vectors R_l,n^(i) and Q_l,n^(i) represent all the R and Q messages in each p x p block of the H matrix, and s(l,n) denotes the shift coefficient for the block in the l-th block row and n-th block column of the H matrix. [Q_l,n^(i)]^S(l,n) denotes that the vector Q_l,n^(i) is cyclically shifted up by the amount s(l,n); a negative sign on s(l,n) indicates a cyclic down shift (equivalent cyclic left shift). k is the check-node degree of the block row. f() denotes the check-node processing, which can be implemented using, for example, the Bahl-Cocke-Jelinek-Raviv ("BCJR") algorithm, sum-of-products ("SP"), or min-sum with scaling/offset.
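A hypothetical toy example may help fix the structure of equations (9)-(12) (my own simplification: no circulant shifts, scaled min-sum as f(), and an illustrative two-layer code and channel LLRs that are not from the slides):

```python
# One layered-decoding iteration following eqs (9)-(12), shifts omitted.

def minsum_check(q_msgs, alpha=0.75):
    """f(): scaled min-sum check-node update (one common choice for f)."""
    out = []
    for i in range(len(q_msgs)):
        others = q_msgs[:i] + q_msgs[i + 1:]
        sign = 1
        for q in others:
            if q < 0:
                sign = -sign
        out.append(alpha * sign * min(abs(q) for q in others))
    return out

def layered_iteration(P, R, layers):
    for l, cols in enumerate(layers):
        Q = [P[n] - R[l][n] for n in cols]       # eq (10): Q = P - R_old
        R_new = minsum_check(Q)                  # eq (11): R_new = f(Q)
        for idx, n in enumerate(cols):
            R[l][n] = R_new[idx]
            P[n] = Q[idx] + R_new[idx]           # eq (12): P = Q + R_new
    return P, R

layers = [[0, 1, 2], [1, 2, 3]]                  # two layers over 4 variables
P = [1.5, -0.5, 2.0, 0.8]                        # channel LLRs (L_n)
R = [{n: 0.0 for n in cols} for cols in layers]  # eq (9): R = 0 initially
P, R = layered_iteration(P, R, layers)
assert all(p > 0 for p in P)
```

Note how P is updated inside the layer loop, so layer 2 already sees the improved reliabilities from layer 1 within the same iteration; this is the source of layered decoding's convergence advantage over flooding schedules.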

Layered Decoder Architecture
Our work proposed this for H matrices with regular mother matrices. Compared to other work, this work has several advantages:
1) No need for a separate memory for P.
2) Only one shifter instead of 2 shifters.
3) Value reuse is effectively used for both Rnew and Rold.
4) Low-complexity data path design, with no redundant data path operations.
5) Low-complexity CNU design.
The data path implements equations (10)-(12): Q = P - Rold, the check-node update f() producing Rnew, and the shifted P update.

Layered Decoder Architecture
Advantages:
1) The Q memory (sometimes called the LPQ memory) can be used to store L/Q/P instead of 3 separate memories; memory is managed at the circulant level, since at any time, for a given circulant, we need only L or Q or P.
2) Only one shifter instead of 2 shifters.
3) Value reuse is effectively used for both Rnew and Rold.
4) Low-complexity data path design, with no redundant data path operations.
5) Low-complexity CNU design.
6) Out-of-order processing at both layer and circulant level for all the processing steps, such as Rnew and PS processing, to eliminate the pipeline and memory-access stall cycles.

Out-of-order Layer Processing for R Selection
Normal practice is to compute Rnew messages for each layer after CNU PS processing. Here, however, we decouple the execution of the Rnew messages of each layer from the execution of the corresponding layer's CNU PS processing. Rather than simply generating Rnew messages per layer, we compute them on the basis of circulant dependencies.
R selection is out-of-order so that it can feed the data required for the PS processing of the second layer. For instance, the Rnew messages for circulant 29, which belongs to layer 3, are not generated immediately after layer 3 CNU PS processing. Rather, Rnew for circulant 29 is computed when PS processing of circulant 20 is done, as circulant 29 is a dependent circulant of circulant 20. Similarly, Rnew for circulant 72 is computed when PS processing of circulant 11 is done, as circulant 72 is a dependent circulant of circulant 11.
Here we execute the instruction/computation at the precise moment when the result is needed!

Out-of-order Block Processing for Partial State
Re-ordering of block processing: while processing layer 2, the blocks which depend on layer 1 are processed last to allow for the pipeline latency. In the above example, the pipeline latency can be 5 (the vector pipeline depth is 5), so no stall cycles are needed while processing layer 2 due to the pipelining. [In other implementations, stall cycles are introduced, which effectively reduces the throughput by a huge margin.]
We also sequence the operations within a layer such that we process first the block whose dependent data has been available for the longest time. This naturally leads to true out-of-order processing across several layers. In practice, we won't do out-of-order partial state processing involving more than 2 layers.

Block Parallel Layered Decoder
Compared to other work, this work has several advantages:
1) Only one memory for holding the P values.
2) Shifting is achieved through memory reads; only one memory multiplexer network is needed instead of 2 to achieve delta shifts.
3) Value reuse is effectively used for both Rnew and Rold.
4) Low-complexity data path design, with no redundant data path operations.
5) Low-complexity CNU design with high parallelism.
6) Smaller pipeline depth.
7) Out-of-order row processing to hide the pipeline latencies.
Here M is the row parallelization (i.e., the number of rows in the H matrix processed per clock).

Cyclic Shifter
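Functionally, the cyclic shifter implements the [.]^S(l,n) operation from the layered-decoding equations: rotating a p-element message vector by the shift coefficient s(l,n). A minimal behavioral sketch (not the hardware structure, which is a multi-stage multiplexer network):

```python
# Behavioral model of a cyclic (barrel) shifter for QC-LDPC circulants.
# Positive s = cyclic up shift; negative s = cyclic down shift.
def cyclic_shift(v, s):
    s %= len(v)                  # handles negative (down) shifts too
    return v[s:] + v[:s]

v = [10, 11, 12, 13, 14]         # one p = 5 message vector (illustrative)
assert cyclic_shift(v, 2) == [12, 13, 14, 10, 11]
assert cyclic_shift(cyclic_shift(v, 2), -2) == v   # down shift undoes it
```

In hardware this is typically built as about log2(p) mux stages (e.g. a Benes or barrel network, as on the next slides) rather than a p-to-p crossbar.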

Benes Network

Proposed Master-slave Router
Gunnam, K. K.; Choi, G. S.; Yeary, M. B.; Atiquzzaman, M.; "VLSI Architectures for Layered Decoding for Irregular LDPC Codes of WiMax," IEEE International Conference on Communications (ICC '07), 24-28 June 2007, pp. 4542-4547

Master-slave router

System Model for Turbo Equalization
(Figure: coded bits u_k are mapped to transmitted symbols x_k; the channel output y_k enters the receiver, where detector and decoder exchange a priori information L_k(x_k) and extrinsic information E_k(x_k) about x_k.)

TURBO EQUALIZATION

Proposed System Level Architecture for Turbo Equalization
(Figure: the received samples y_k feed a SISO detector (NP-MAP/NP-ML, 2x) coupled to a SISO LDPC decoder through an interleaver and de-interleaver. Queues decouple the stages: an FS queue (G/G/1/1), an LE queue (G/G/1/1), an HD queue (G/D/1/1), and a Y queue (D/G/1/1). A hard-decision path with a 1-iteration hard-decision LDPC decoder (D/D/1/1) produces preliminary hard decisions for the timing loops; packet quality metrics come from the front-end signal processing blocks.)
Gunnam et al., "Next generation iterative LDPC solutions for magnetic recording storage," 42nd Asilomar Conference on Signals, Systems and Computers, 2008, pp. 1148-1152

Local-Global Interleaver
Row-column interleavers need memory organized such that it can supply the data samples for both row and column access.
The local-global interleaver is a low-latency, memory-efficient interleaver compared to the traditional row-column interleaver: only one type of access (i.e., row access) is needed for both the detector and the decoder.

Data flow in Local-Global Interleaver

Why Statistical Buffering?
The innovation here is the novel and efficient arrangement of queue structures such that we get the performance of a hardware system configured to run h maximum global iterations (h is set to 20 in the example configuration), while the system complexity is proportional to that of a hardware system that can run only 2 maximum global iterations.
D/G/1/1 is Kendall's notation for a queuing model: the first part represents the input (arrival) process, the second the service distribution, and the third the number of servers. D = deterministic; G = general.

Primary Data Path
(Figure: the same system diagram as before, with the interleaver, LE queue (G/G/1/1), HD queue (G/D/1/1), Y queue (D/G/1/1), FS queue (G/G/1/1), a scheduling processor, an HD ping-pong memory, the SISO detector (NP-MAP/NP-ML, 2x), the SISO LDPC decoder, and the hard-decision path feeding preliminary hard decisions to the timing loops.)
The primary data path contains one SISO LDPC decoder and one SISO detector.
The LDPC decoder is designed such that it can handle the total amount of average iterations in two global iterations for each packet.
The SISO detector is in fact two detector modules that operate on the same packet but on different halves of the packet, thus ensuring one packet can be processed in 50% of the inter-arrival time T. Each detector processes 4 samples per clock cycle.
Thus both the detector and the LDPC decoder can sustain a maximum of two global iterations per packet if no statistical buffering is employed.

Secondary Data Path
The secondary data path contains a low-complexity detector based on the hard-decision Viterbi algorithm, and a hard-decision interleaver followed by a hard-decision LDPC decoder that is sized for doing only one iteration.
The secondary path thus does one reduced-complexity global iteration and operates on the incoming packets immediately:
1) it can generate preliminary hard decisions for the timing loops.

