CS250 VLSI Systems Design Lecture 8: Memory


CS250 VLSI Systems Design
Lecture 8: Memory
John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA)
UC Berkeley, Fall 2010

CMOS Bistable
Cross-coupled inverters are used to hold state in CMOS.
"Static" storage in a powered cell: no refresh needed.
If a storage node leaks or is pushed slightly away from its correct value, the non-linear transfer function of the high-gain inverter removes the noise and recirculates the correct value.
To write new state, the nodes have to be forced to the opposite state.
[Figure: two cross-coupled inverter pairs, storage nodes holding "1"/"0"; flipping state swaps them to "0"/"1"]
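The restoring behavior above can be sketched numerically. In this minimal model (an illustrative assumption, not from the lecture), each inverter is an idealized high-gain sigmoid; recirculating a disturbed storage node through the cross-coupled pair pulls it back to a full rail.

```python
# Sketch: model each inverter as a high-gain sigmoid transfer
# function (gain value is an illustrative assumption) and show that
# cross-coupled inverters restore a slightly disturbed storage node.
import math

VDD = 1.0

def inverter(vin, gain=20.0):
    """Idealized inverter: high-gain sigmoid switching around VDD/2."""
    return VDD / (1.0 + math.exp(gain * (vin - VDD / 2)))

def settle(node, steps=20):
    """Recirculate the node value through both inverters repeatedly."""
    for _ in range(steps):
        node = inverter(inverter(node))
    return node

# A stored "1" disturbed down to 0.7 V is restored toward VDD,
# and a stored "0" disturbed up to 0.3 V is restored toward 0 V.
print(round(settle(0.7), 3))   # close to 1.0
print(round(settle(0.3), 3))   # close to 0.0
```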

CMOS Transparent Latch
The latch is transparent (output follows input) when the clock is high, and holds the last value when the clock is low.
The transmission-gate switch, with both pMOS and nMOS devices, passes both ones and zeros well.
[Figure: schematic with optional input and output buffers; symbols for latches transparent on clock high and transparent on clock low]
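The level-sensitive behavior can be captured in a few lines. This is a behavioral sketch only (class and method names are illustrative): while the clock is high the output follows D, and while it is low the output holds the last sampled value.

```python
# Behavioral sketch of a transparent-high latch (names illustrative).
class TransparentLatch:
    def __init__(self):
        self.q = 0
    def eval(self, clk, d):
        if clk:           # clock high: transparent, output follows input
            self.q = d
        return self.q     # clock low: opaque, hold last sampled value

latch = TransparentLatch()
assert latch.eval(1, 1) == 1   # transparent: Q follows D
assert latch.eval(0, 0) == 1   # holding: Q keeps the last value
assert latch.eval(1, 0) == 0   # transparent again: Q follows D
```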

Latch Operation
[Figure: clock high, latch transparent, Q follows D; clock low, latch holding its last value]

Flip-Flop as Two Latches
A flip-flop is built from two transparent latches on opposite clock phases: a "sample" (master) latch followed by a "hold" (slave) latch, so data advances one stage per clock edge with no race between the phases.
This flip-flop (usually with buffers) is typically provided as a standard cell.
[Figure: schematic of the two-latch flip-flop and its symbol]
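The master-slave composition can be sketched behaviorally. In this sketch (class names illustrative), the sample latch is open on clock low and the hold latch on clock high, so the flip-flop output updates only at the rising edge.

```python
# Sketch: a positive-edge flip-flop composed of two transparent
# latches on opposite clock phases, as on the slide.
class TransparentLatch:
    def __init__(self):
        self.q = 0
    def eval(self, clk, d):
        if clk:
            self.q = d
        return self.q

class FlipFlop:
    """Sample (master) latch open on clock low, hold (slave) latch
    open on clock high."""
    def __init__(self):
        self.master = TransparentLatch()
        self.slave = TransparentLatch()
    def eval(self, clk, d):
        m = self.master.eval(not clk, d)   # opposite clock phase
        return self.slave.eval(clk, m)

ff = FlipFlop()
ff.eval(0, 1)               # clock low: master samples D=1
assert ff.eval(1, 0) == 1   # rising edge: sampled 1 appears at Q
assert ff.eval(1, 0) == 1   # later input changes do not pass through
```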

Small Memories from Stdcell Latches
Data is held in transparent-low latches; writes are performed by clocking the selected latch row.
The read port is combinational logic (synthesized), with an optional read output latch.
Additional ports are added by replicating the read and write port logic (multiple write ports need a mux in front of each latch).
Expensive to add many ports.
[Figure: write address decoder and write data driving the latch array; read address decoder and read mux]

6-Transistor SRAM (Static RAM)
Large on-chip memories are built from arrays of static RAM bit cells, where each bit cell holds a bistable (cross-coupled inverters) plus two access transistors.
Other clocking and access logic is factored out into the periphery.
[Figure: 6T cell with a wordline and a differential Bit/Bit-bar bitline pair]

Intel's 22nm SRAM Cell
0.092 µm² SRAM cell for high-density applications.
0.108 µm² SRAM cell for low-voltage applications.
[Bohr, Intel, Sept 2009]

General SRAM Structure
Bitline loading limits the array to a maximum of 128-256 bits per row or column.
[Figure: Address and Clk drive the row decoder; the columns feed differential read sense amplifiers and differential write drivers, with Write Enable and Write Data in and Read Data out]

Address Decoder Structure
2:4 predecoders convert pairs of address bits into unary 1-of-4 encodings; each word line combines one output from each predecoder, gated by a clocked word-line enable.
[Figure: address bits A0-A3 feeding two 2:4 predecoders, which drive word lines 0-15]

Read Cycle
1) Precharge bitlines and senseamp.
2) Pulse wordlines, develop bitline differential voltage.
3) Disconnect bitlines from senseamp, activate sense pulldown, develop full-rail data signals.
Pulses are generated by internal self-timed signals, often using "replica" circuits representing critical paths.
[Figure: prechargers, storage cells, sense amp, and output set-reset latch, with waveforms for the wordline clock, Bit/Bit-bar differential, sense, and full-rail Data/Data-bar]

Write Cycle
1) Precharge bitlines.
2) Open wordline, pull one bitline down full rail.
Write-enable can be controlled on a per-bit level: if the bit lines are not driven during a write, the cell retains its value (the operation looks like a read to the cell).
[Figure: storage cells with write-data and write-enable drivers on the bitlines]

Column-Muxing at Sense Amp
It is difficult to pitch-match a sense amp to the tight SRAM bit-cell spacing, so often 2-8 columns share one sense amp. This impacts power dissipation, as multiple bitline pairs swing for each bit read.

Building Larger Memories
Large arrays are constructed by tiling multiple leaf arrays, sharing decoders and I/O circuitry (e.g., a sense amp attached to the arrays above and below it).
Leaf arrays are limited in size to 128-256 bits per row/column due to the RC delay of wordlines and bitlines, and also to reduce power by activating only the selected sub-bank.
In larger memories, delay and energy are dominated by the I/O wiring.
[Figure: grid of leaf bit-cell arrays with shared decoders and I/O blocks]

Adding More Ports
Differential read or write ports: replicate the wordline and bitline pair for each port.
Optional single-ended read port: a separate read wordline and read bitline.
[Figure: bit cell with wordlines A and B and bitline pairs BitA/BitB, plus a single-ended read bitline]

Memory Compilers
In an ASIC flow, memory compilers are used to generate layout for the SRAM blocks in a design; there are often hundreds of memory instances in a modern SoC.
Memory generators can also produce built-in self-test (BIST) logic, to speed manufacturing testing, and redundant rows/columns to improve yield.
The compiler can be parameterized by number of words, number of bits per word, desired aspect ratio, number of sub-banks, degree of column muxing, etc.
Area, delay, and energy consumption are a complex function of the design parameters and the generation algorithm, so it is worth experimenting with the design space.
Usually only single-port (one read or write) and dual-port (one read, one write) SRAM generators are available in an ASIC library.

Small Memories
Compiled SRAM arrays usually have a high overhead due to peripheral circuits, BIST, and redundancy.
Small memories are usually built from latches and/or flip-flops in a stdcell flow.
The cross-over point is usually around 1K bits of storage; it is worth trying the design both ways.

Memory Design Patterns

Multiport Memory Design Patterns
Often we require multiple access ports to a common memory.
True Multiport Memory: as described earlier in the lecture, completely independent read and write port circuitry.
Banked Multiport Memory: interleave lesser-ported banks to provide higher bandwidth.
Stream-Buffered Multiport Memory: use a single wider access port to provide multiple narrower streaming ports.
Cached Multiport Memory: use a large single-port main memory, but add a cache to service each access port.

True Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a shared common memory.
Solution: Provide separate read and write ports to each bit cell for each requester.
Applicability: Where unpredictable access latency to the shared memory cannot be tolerated.
Consequences: High area, energy, and delay cost for a large number of ports. Must define the behavior when multiple writes occur to the same word on the same cycle (e.g., prohibit, provide priority, or combine writes).
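One of the "define the behavior" choices above can be sketched concretely. This sketch (policy and names are illustrative, not from the lecture) resolves same-cycle writes to the same word by giving the lowest-numbered port priority.

```python
# Sketch: same-cycle write-conflict resolution by port priority.
def apply_writes(memory, writes):
    """writes: list of (port, addr, data) arriving in one cycle.
    Apply highest-numbered port first so the lowest-numbered port's
    value lands last and wins any conflict."""
    for port, addr, data in sorted(writes, reverse=True):
        memory[addr] = data
    return memory

mem = {0: 0, 1: 0}
apply_writes(mem, [(0, 1, 0xAA), (1, 1, 0xBB)])  # both write addr 1
assert mem[1] == 0xAA   # port 0 has priority
```

Combining or prohibiting conflicting writes would be alternative policies with the same interface.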

True Multiport Example: Itanium-2 Regfile
Intel Itanium-2 [Fetzer et al., IEEE JSSC 2002]

Itanium-2 Regfile Timing
[Fig. 1 from the paper: IEU state and timing diagram]

Banked Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Divide the memory capacity into smaller banks, each of which has fewer ports. Requests are distributed across banks using a fixed hashing scheme. Multiple requesters arbitrate for access to the same bank/port.
Applicability: Requesters can tolerate variable latency for accesses. Accesses are distributed across the address space so as to avoid "hot spots".
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide interconnect between each requester and each bank/port. Can have a greater, equal, or lesser number of banks × ports/bank compared to the total number of external access ports.

Banked Multiport Memory
[Figure: Ports A and B connect through arbitration and a crossbar to Banks 0-3]
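The fixed hashing and arbitration in this pattern can be sketched as follows (bank count, hash, and priority policy are illustrative assumptions): low-order address bits pick the bank, and each cycle at most one requester is granted per bank.

```python
# Sketch: fixed address hash plus per-cycle bank arbitration.
NUM_BANKS = 4

def bank_of(addr):
    return addr % NUM_BANKS         # fixed hash: low-order address bits

def arbitrate(requests):
    """requests: list of (port, addr) for one cycle. Returns granted
    requests; on a bank conflict, the lowest-numbered port wins."""
    granted, busy = [], set()
    for port, addr in sorted(requests):
        b = bank_of(addr)
        if b not in busy:           # bank still free this cycle
            busy.add(b)
            granted.append((port, addr))
    return granted

# Ports 0 and 1 hit different banks; port 2 conflicts with port 0.
assert arbitrate([(0, 8), (1, 5), (2, 12)]) == [(0, 8), (1, 5)]
```

Denied requesters would retry in a later cycle, which is why the pattern requires tolerance of variable latency.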

Banked Multiport Memory Example
Pentium (P5) 8-Kbyte, 8-way interleaved data cache, with two access ports.
[Figure 7 from the paper: dual-access data cache, with dual-ported cache tags and single-ported, interleaved cache data]
[Alpert et al., IEEE Micro, May 1993]

Stream-Buffered Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory, where each requester usually makes multiple sequential accesses.
Solution: Organize the memory to have a single wide port. Provide each requester with an internal stream buffer that holds the width of data returned/consumed by each access. Each requester can access its own stream buffer without contention, but arbitrates with the others for the wide port to refill or drain the buffer.
Applicability: Requesters make mostly sequential requests and can tolerate variable latency for accesses.
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide a stream buffer for each requester. Need sufficient access width to serve the aggregate bandwidth demands of all requesters, but a wide data access can be wasted if not all of it is used by the requester. Have to specify a memory consistency model between ports (e.g., provide stream flush operations).

Stream-Buffered Multiport Memory
[Figure: Ports A and B each access their own stream buffer; the buffers arbitrate for a single wide memory]
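A read-side stream buffer can be sketched in a few lines (the wide-access width and names are illustrative assumptions): one wide access refills the buffer, which then serves several narrow sequential reads without re-arbitrating.

```python
# Sketch: a read stream buffer in front of a wide memory port.
WIDE_WORDS = 4   # narrow words fetched per wide access (assumed)

class StreamBuffer:
    def __init__(self, memory):
        self.memory = memory       # backing store: list of narrow words
        self.base = None           # address of the buffered wide line
        self.line = []
    def read(self, addr):
        base = addr - addr % WIDE_WORDS
        if base != self.base:      # miss: one wide (arbitrated) access
            self.base = base
            self.line = self.memory[base:base + WIDE_WORDS]
        return self.line[addr - base]

mem = list(range(100, 116))        # 16 narrow words
sb = StreamBuffer(mem)
assert [sb.read(a) for a in range(4, 8)] == [104, 105, 106, 107]
# The four sequential reads above caused only one wide memory access.
```

Non-sequential access patterns would miss in the buffer on nearly every read, which is why the pattern applies only to mostly-sequential requesters.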

Stream-Buffered Multiport Examples
IBM Cell microprocessor local store [Chen et al., IBM, 2005]
"The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision."

Cached Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Provide each access port with a local cache of recently touched addresses from the common memory, and use a cache-coherence protocol to keep the cache contents in sync.
Applicability: Request streams have significant temporal locality, and limited communication between different ports.
Consequences: Requesters will experience variable delay depending on the access pattern and the operation of the cache-coherence protocol. Tag overhead in area, delay, and energy per access. Complexity of the cache-coherence protocol.

Cached Multiport Memory
[Figure: Ports A and B each have a local cache; the caches connect through arbitration and interconnect to the common memory]

Replicated State Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a small shared common memory. Cannot tolerate variable access latency.
Solution: Replicate the storage and divide the read ports among the replicas. Each replica has enough write ports to keep all replicas in sync.
Applicability: Many read ports are required, and variable latency cannot be tolerated.
Consequences: Potential increase in latency between some writers and some readers.

Replicated State Multiport Memory
[Figure: Write Ports 0 and 1 drive both Copy 0 and Copy 1; the read ports are divided between the two copies]
Example: Alpha 21264 regfile clusters
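The replication scheme can be sketched as follows (class name and port-splitting rule are illustrative): every write is broadcast to both copies, while each read port is statically assigned to one copy, halving the read-port load per replica.

```python
# Sketch: replicated storage with writes broadcast to all copies
# and read ports divided between them.
class ReplicatedRegfile:
    def __init__(self, nregs):
        self.copies = [[0] * nregs, [0] * nregs]
    def write(self, addr, data):
        for copy in self.copies:   # keep all replicas in sync
            copy[addr] = data
    def read(self, port, addr):
        # Even-numbered read ports use copy 0, odd-numbered use copy 1.
        return self.copies[port % 2][addr]

rf = ReplicatedRegfile(32)
rf.write(5, 0xDEAD)
assert rf.read(0, 5) == 0xDEAD   # served by copy 0
assert rf.read(1, 5) == 0xDEAD   # served by copy 1, same value
```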

Memory Hierarchy Design Patterns
Use a small fast memory together with a large slow memory to provide the illusion of a large fast memory.
Explicitly managed local stores.
Automatically managed cache hierarchies.
