Memory Hierarchy Design - ICL UTK


Chapter 2: Memory Hierarchy Design

CPU vs. Memory: Performance vs. Latency
(figure: processor performance vs. memory latency over time)

Introduction
– Goal: an unlimited amount of memory with low latency
– Fast memory technology is more expensive per bit than slower memory
– Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
– Use the principle of locality (spatial and temporal)
  – Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchy Design Considerations
– Memory hierarchy design becomes more crucial with recent multi-core processors
– Aggregate peak bandwidth grows with the number of cores
  – An Intel Core i7 can generate two data references per core per clock
  – With four cores and a 3.2 GHz clock: 25.6 billion 64-bit data references/second plus 12.8 billion 128-bit instruction references = 409.6 GB/s!
  – DRAM bandwidth is only 6% of this (25 GB/s)
– Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – A shared third-level cache on chip

Memory Hierarchies (typical sizes and access times)

  Level       Server                    Personal Mobile Device
  Registers   1000 B, 300 ps            500 B, 500 ps
  L1 cache    64 kB, 1 ns               64 kB, 2 ns
  L2 cache    256 kB, 3-10 ns           256 kB, 10-20 ns
  L3 cache    2-4 MB, 10-20 ns          (none)
  Memory      4-16 GB, 50-100 ns        256-512 MB, 50-100 ns
  Storage     Disk: 4-16 TB, 5-10 ms    Flash: 4-6 GB, 25-50 µs

(registers, L1, and L2 are on-die; L3 is in-package; memory and storage are off-chip)

Performance and Power for Caches
– High-end microprocessors have 10 MB or more of on-chip cache
  – Consumes a large amount of the area and power budget
  – Both static (idle) and dynamic power are an issue
– Personal/mobile devices have a 20-50x smaller power budget
  – 25%-50% of it is consumed by memory
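As a quick check of the bandwidth arithmetic above, here is a minimal C sketch that reproduces the 409.6 GB/s figure from the Core i7 numbers quoted on the slide (all inputs are from the slide, not assumptions):

#include <stdio.h>

int main(void) {
    double clock_hz = 3.2e9;
    int cores = 4;
    double data_refs  = cores * 2.0 * clock_hz;  /* 25.6e9 64-bit refs/s  */
    double instr_refs = cores * 1.0 * clock_hz;  /* 12.8e9 128-bit refs/s */
    double demand = data_refs * 8 + instr_refs * 16;  /* bytes per second */
    printf("Peak demand: %.1f GB/s\n", demand / 1e9);        /* 409.6    */
    printf("25 GB/s DRAM covers %.0f%%\n", 25e9 / demand * 100); /* ~6%  */
    return 0;
}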

Handling of Cache Misses
– When a word is not found in the cache, a miss occurs:
  – Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  – The lower level may be another cache or the main memory
  – Also fetch the other words contained within the block
    – Takes advantage of spatial locality
  – Place the block into the cache in any location within its set, determined by its address:
    – set = (block address) MOD (number of sets)
– Note that speculative and multithreaded processors may execute other instructions during a miss
  – This reduces the performance impact of misses
  – But it complicates analysis and introduces runtime dependence

Calculating Miss Rate
– Miss rate = Misses / Memory accesses
– Misses / Instruction = Miss rate × (Memory accesses / Instruction count)
– Average memory access time = Hit time + Miss rate × Miss penalty

Cache Associativity and Writing Strategies
– n-way set associative: n blocks per set
  – Direct-mapped cache: one block per set
  – Fully associative: one set
– Writing to the cache: two strategies
  – Write-through: immediately update lower levels of the hierarchy
  – Write-back: only update lower levels of the hierarchy when an updated block is replaced
  – Both strategies use a write buffer to make writes asynchronous

Cache and TLB Mapping Illustrated
(figure: a virtual address space — OS kernel, heap, stack — is translated through the TLB to the physical address space; the physical address splits into tag, set-index, and block-offset fields and indexes a cache with 8 sets)

Miss Rate and Types of Cache Misses
– Miss rate is the fraction of cache accesses that result in a miss
– Reasons for cache misses and their names:
  – Compulsory: the cache block is referenced for the first time
    – Solution: hardware and software prefetchers
  – Capacity: the cache block was discarded for lack of space and later retrieved
    – Solution: build a bigger cache or reorganize the code
  – Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
    – Solution: add padding or stride to the code; change the size of the data
  – Coherency: the block was invalidated by another processor in a multiprocessor system

Cache Design Considerations
– Bigger cache
  – Reduced miss rate
  – Increased hit time; increased static & dynamic power
– Higher associativity
  – Reduced conflict miss rate
  – Increased hit time; increased power
– Larger block size
  – Reduced compulsory misses; slightly reduced static power (smaller tag)
  – Sometimes increased capacity and conflict misses
– More levels of cache
  – Reduced miss penalty
  – Access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
– Prioritize read misses over write misses
  – Reduced miss penalty
  – Needs a write buffer and hazard resolution
– Avoid address translation during indexing of the cache
  – Reduced hit time
  – Limits the size and structure of the cache
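To make the access-time formulas above concrete, here is a minimal C sketch of the two-level average memory access time; the latencies and miss rates plugged in below are made-up illustrative values, not figures from the slides:

#include <stdio.h>

/* AMAT = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * penalty_L2), in cycles */
static double amat(double hit_l1, double mr_l1,
                   double hit_l2, double mr_l2, double penalty_l2) {
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2);
}

int main(void) {
    /* Hypothetical numbers: 1-cycle L1 with 5% misses, 10-cycle L2 with
     * 20% misses, 200-cycle main-memory access. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 10.0, 0.20, 200.0));
    return 0;   /* prints 3.50 */
}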
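And here is a small sketch of the "block address MOD number of sets" placement rule, using the 8-set cache from the figure and an assumed 64-byte block size (the block size is not given on the slides):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define BLOCK_BYTES 64   /* assumed: 6 offset bits */
#define NUM_SETS     8   /* from the figure: 3 index bits */

int main(void) {
    uint64_t addr = 0x7ffdc0de1234ULL;
    uint64_t block_addr = addr / BLOCK_BYTES;     /* drop the offset bits */
    uint64_t set = block_addr % NUM_SETS;         /* placement rule       */
    uint64_t tag = block_addr / NUM_SETS;         /* stored and compared  */
    printf("addr=%#" PRIx64 " -> set=%" PRIu64 " tag=%#" PRIx64 "\n",
           addr, set, tag);
    return 0;
}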

Categories of Metrics for Cache Optimization
– Reduce hit time
  – Small and simple first-level caches; "way prediction"
  – Side effect: reduction in power consumption
– Increase cache bandwidth
  – Pipelined caches, multibanked caches, nonblocking caches
– Reduce miss penalty
  – "Critical word first"; merging write buffers
– Reduce miss rate
  – Compiler optimization
  – Side effect: reduced power
– Reduce miss penalty and/or miss rate via parallelism
  – Hardware prefetching; software and compiler prefetching
  – Side effect: increased power due to unused data

Optimization 1: Small/Simple 1st-level Caches
– The 1st-level cache should match the clock cycle of the CPU
  – Rarely the case in contemporary CPUs
– Cache addressing is a three-step process:
  – Address the tag memory with the index portion of the address
  – Compare the found tag with the address tag
  – Choose the cache set
  – Access time grows 10% or more for each doubling of the set count
– Low associativity
  – Consumes less power
  – Is faster: the tag check can overlap with data transmission
– High associativity helps with:
  – Address aliasing
  – Dealing with TLB and multiprocessing conflicts

Optimization 2: "Way Prediction"
– Idea: predict "the way" — which block within the set will be accessed next
  – The index multiplexor (mux) starts working early
– Implemented as extra bits kept in the cache for each block
– Prediction accuracy (simulated):
  – 90% for two-way associative
  – 80% for four-way associative
  – More effective for instruction caches
– On a misprediction:
  – Try the other block
  – Change the prediction bits
  – Incur a penalty (commonly 1 cycle)
– Examples: first used in the MIPS R10000 in the 1990s; modern use in the ARM Cortex-A8

Optimization 3: Pipelined Cache Access
– History of L1 cache access latencies:
  – Pentium (1993): 1 cycle
  – Pentium Pro (1995): 2 cycles
  – Pentium III (1999): 2 cycles
  – Pentium 4 (2000): 4 cycles
  – Intel Core i7 (2010): 4 cycles
– Improves bandwidth, not latency
– Interaction with branch prediction
  – Increased penalty for mispredicted branches
  – Waiting for the cache pipeline to finish
– Load instructions take longer
– Pipelined cache cycles make high degrees of associativity easier to implement

Optimization 4: Nonblocking Caches
– If one instruction stalls on a cache miss, should the following instruction stall if its data is in the cache?
  – NO — call it "hit under miss"
– What about two outstanding misses?
  – Make it "hit under multiple miss", or "miss under miss"
  – The next-level cache has to be able to handle multiple misses
– But you have to design a nonblocking cache (lockup-free cache)
– Why stop at two? Out-of-order, superscalar… how long before our caches become the same?
  – Moving the CPU innovation into the memory hierarchy?

Optimization 5: Multibanked Caches
– Main memory has been organized in banks for increased bandwidth
  – Caches can do this too
– Cache blocks are spread evenly across the banks (sequential interleaving):
  – Bank 0: blk[0], blk[4]
  – Bank 1: blk[1], blk[5]
  – Bank 2: blk[2], blk[6]
  – Bank 3: blk[3], blk[7]
– Modern use:
  – ARM Cortex-A8: 1-4 banks in the L2 cache
  – Intel Core i7: 4 banks in L1 (2 memory accesses/cycle), 8 banks in L2
– Side effect: reduced power usage
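A tiny C sketch of the sequential interleaving rule above — block k maps to bank k mod NBANKS — which reproduces the bank table on the last slide:

#include <stdio.h>

#define NBANKS 4

int main(void) {
    for (int blk = 0; blk < 8; blk++)
        printf("blk[%d] -> bank %d\n", blk, blk % NBANKS);
    return 0;
}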

Optimization 6: Critical Word First, Early Restart
– Forget cache blocks (lines); deal with words
  – Start executing an instruction when its word arrives, not when the entire block where the word resides does
– Critical word first
  – What if the instruction needs the last word in the cache block?
  – Go ahead and request that word from memory before requesting the others
  – Out-of-order loading of the cache block's words: as soon as the word arrives, pass it on to the CPU, then continue fetching the remaining words
– Early restart
  – No hardware changes to the fetch order required
  – Don't change the order of the words, but supply the missing word as soon as it arrives
  – It won't help if the last word in the block is needed
– Useful for large cache blocks; the benefit depends on the data stream

Optimization 7: Merging Write Buffer (Intro)
– Write buffer basics
  – The write buffer sits between the cache and memory
  – The write buffer stores both the data and its address
  – The write buffer allows the store instruction to finish immediately
    – Unless the write buffer is full
  – Especially useful for write-through caches
    – Write-back caches benefit when a block is replaced

Optimization 7: Merging Write Buffer
– Merging write buffer: a buffer that merges write requests
  – When storing to a block that is already pending in the write buffer, only update the write buffer
– Another way to look at it: a write buffer with merging is equivalent to a larger buffer without merging
– A merging buffer reduces stalls due to the buffer being full
– Should not be used for I/O addresses
(figure: four one-word entries for addresses 100, 108, 116, 124, each holding a single valid word M[100], M[108], M[116], M[124], merge into one entry at address 100 with all four valid bits set)

Optimization 8: Compiler Optimizations
– Two main techniques
– Loop interchange
  – Requires 2 or more loop nests
  – The order of the loops is changed to walk through memory in a more cache-friendly manner
– Loop blocking
  – Additional loop nests are introduced to deal with a small portion of an array (called a block, but also a tile)

Optimization 8: Loop Interchange

/* Before: memory stride 100 */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After: memory stride 1
 * Uses all words in a single cache block */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

Optimization 8: Blocking

/* Before: memory stride: A 1, B N */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        x = 0;
        for (k = 0; k < N; k++)   /* dot product */
            x += A[i][k] * B[k][j];
        C[i][j] = x;
    }

/* After: blocking factor B */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + B, N); j++) {
                x = 0;
                for (k = kk; k < min(kk + B, N); k++)
                    x += A[i][k] * B[k][j];
                C[i][j] += x;   /* accumulate partial dot products */
            }
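The merging decision itself is simple to state in code. Below is a toy C sketch (entirely illustrative, not from the slides; the structure names and the 4-words-per-entry geometry are assumptions) of a write buffer that absorbs a store into a pending entry when it falls in the same block:

#include <stdint.h>
#include <stdio.h>

#define ENTRIES 4
#define WORDS_PER_ENTRY 4        /* assumed: 4 x 8-byte words per block */

struct wb_entry {
    int      used;
    uint64_t block_addr;                 /* address of word 0 of the block */
    uint64_t data[WORDS_PER_ENTRY];
    int      valid[WORDS_PER_ENTRY];
};

static struct wb_entry buf[ENTRIES];

/* Returns 1 if the store was buffered (merged or placed), 0 if full. */
static int wb_store(uint64_t addr, uint64_t value) {
    uint64_t block = addr & ~(uint64_t)(8 * WORDS_PER_ENTRY - 1);
    int word = (int)((addr >> 3) & (WORDS_PER_ENTRY - 1));
    for (int i = 0; i < ENTRIES; i++)    /* merge with a pending entry */
        if (buf[i].used && buf[i].block_addr == block) {
            buf[i].data[word] = value;
            buf[i].valid[word] = 1;
            return 1;
        }
    for (int i = 0; i < ENTRIES; i++)    /* otherwise take a free entry */
        if (!buf[i].used) {
            buf[i].used = 1;
            buf[i].block_addr = block;
            buf[i].data[word] = value;
            buf[i].valid[word] = 1;
            return 1;
        }
    return 0;                            /* buffer full: the CPU stalls */
}

int main(void) {
    /* The slide's example: stores to 100, 108, 116, 124 share one entry. */
    for (uint64_t a = 100; a <= 124; a += 8)
        wb_store(a, a);
    printf("entry 0 used=%d, entry 1 used=%d\n", buf[0].used, buf[1].used);
    return 0;
}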

Optimization 9: Hardware Prefetching
– Prefetch asks for data before it is needed
– Reduction of latency (miss penalty)
  – Overlap the fetch from memory with the execution of instructions
– Principle of locality
  – On a miss: request not just one (cache) block but also the next block
– The prefetched data might not end up in the cache
  – It might go to a special prefetch buffer that is cheaper to access than memory
– The Intel Core i7 can prefetch into both L1 and L2
  – Pairwise prefetching (get the current and the next sequential block)
  – May be turned off in the BIOS

Optimization 9: Hardware Prefetching (contd.)
– Hardware prefetching may waste bandwidth
  – When the prediction accuracy is poor and most of the prefetched data goes unused
– Aggressive and sophisticated prefetchers
  – Use extra silicon area
  – Consume power
  – Increase chip complexity
– Compilers may reorganize the code to increase hardware prefetch accuracy

Optimization 10: Compiler Prefetching
– The compiler may insert (special) instructions that fetch data from memory but do not influence the current code
– Two flavors of prefetching instructions
  – Register prefetches: data is loaded into a register
  – Cache prefetches: data is loaded into the cache
– Interaction with virtual memory:
  – Faulting prefetches cause virtual-address faults
  – Nonfaulting prefetches do not cause such faults
– The most common prefetch is into the cache and nonfaulting
  – The hardware simply ignores a prefetch instruction that would fault
  – It is called nonbinding

Optimization 10: Compiler Prefetch Example

/* Before */
for (i = 0; i < 5000; i++)
    x[i] = 2 * x[i];

/* After */
for (i = 0; i < 5000; i++) {
    // get ready for the next iteration
    prefetch(&x[i+1]);  // better to work in units of cache blocks
    x[i] = 2 * x[i];
}

/* Even better */
for (i = 0; i < 5000; i++) {
    // prefetch the next cache block
    prefetch(&x[i+8]);
    x[i] = 2 * x[i];
}

Memory Technology Optimization
– Memory sits between the caches and the I/O devices
  – It bridges the gap between a cache latency of a few tens of cycles and a disk latency of millions of cycles
– Memory performance is measured with
  – Latency: cache designers care about it because it is the main factor in the miss penalty
  – Bandwidth: ever faster I/O devices (think solid-state storage) and increasing core counts stress the memory bandwidth
– Faster and larger caches cannot eliminate the need for main memory
  – Even with the principle of locality in mind, there are codes that wait for memory
  – Use Amdahl's law to see how important faster memory is

Memory Technology Overview
– Memory responsiveness is measured with
  – Access time: the time between a read request and the data's arrival
    – This could also be called memory latency
  – Cycle time: the minimum time between unrelated memory requests
    – Comes from the way data is stored in a single memory cell
– Two types of memory
  – SRAM: used for caches (since 1975)
  – DRAM: used for main memory (since 1975)
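The prefetch() in the example above is pseudocode. As one concrete rendering (an assumption, not the slides' method): GCC and Clang expose a nonbinding cache prefetch as __builtin_prefetch, so the "even better" loop compiles as:

#include <stddef.h>

#define N 5000

void scale(double *x) {
    for (size_t i = 0; i < N; i++) {
        /* Nonfaulting, nonbinding cache prefetch one block ahead
         * (assuming 8 doubles per 64-byte cache block). Touching past
         * x[N-1] is harmless because a prefetch of an invalid address
         * is ignored by the hardware, per the slide above. */
        __builtin_prefetch(&x[i + 8], 0 /* read */, 3 /* high locality */);
        x[i] = 2 * x[i];
    }
}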

Memory Technology: SRAM
– SRAM = Static RAM
– No need for refresh (unlike DRAM)
– Typically 6 transistors store a single bit
– Minimal power is required to retain the data in standby mode
– Reads do not destroy the stored information
  – Hence, access time = cycle time
– Most SRAM memory (the caches) is now integrated into the processor chip

Memory Technology: DRAM
– DRAM = Dynamic RAM
– Only one transistor is used to store a bit
  – After a read, the information is gone
  – So a read must be followed by a write to restore the state!
  – Hence, cycle time is longer than access time
  – Using multiple banks helps hide the rewriting of the data
– Also, over time (8 ms) the data is lost: periodic refresh is required
  – A whole row is refreshed at once (usually managed by the memory controller)
– The number of pins required to address a single item is an issue
  – Address lines are multiplexed
  – Row Access Strobe (RAS): the first half of the address
  – Column Access Strobe (CAS): the second half of the address
  – Internally, DRAM is a 2D matrix, and each half of the address selects one of the dimensions

Memory Technology: DRAM (contd.)
– Amdahl suggested linear growth of memory capacity with CPU speed
  – 1000 MB for every 1000 MIPS
  – But Moore's law made that impossible: CPU speed grew faster
– The expected capacity growth (from the CPU designer's perspective) is 55% a year
  – Or a fourfold improvement in capacity every three years
  – Capacity grew only about half as fast in recent years
– Memory latency (measured as row access time) improves only 5% per year
– Data transfer time (related to CAS) improves at over 10% a year

Improving Memory Performance
– The main pressure comes from Moore's law
  – Faster single-core CPUs need more data faster
  – More cores per chip need more data faster
– Multiple accesses to the same row
  – Utilize the row buffer that holds the row's data while the column portion of the address arrives
  – Multiple column addresses can nowadays utilize the row buffer
– Synchronous DRAM (SDRAM)
  – The memory controller and the SDRAM use a clock for synchronization (faster than asynchronous operation)
  – SDRAM also has a burst mode
    – In this mode, SDRAM delivers multiple data items without new address requests
    – Possible due to an extra length register in the SDRAM
    – Supports the "critical word first" cache optimization

Improving Memory Performance (contd.)
– DRAM is getting wider
  – Old DDR: 4-bit bus; DDR2 and DDR3: 16-bit bus (2010)
  – Required by ever-increasing density (get more data out at once)
– DDR = Double Data Rate
  – Single data rate: data transferred on one edge of the clock signal
  – Double data rate: data transferred on both edges of the clock signal
  – Effectively doubles the bandwidth with the same clock
– Multiple banks
  – Helps due to interleaving (one word spread across banks)
  – Smaller power consumption
  – Adds delay because a bank has to be opened before access

DDR Standards and Technologies
– DDR is now a standard
  – This helps interoperability and competition, and results in lower prices

  Standard                     Voltage   Clock rates          Feature size
  DDR (2000)                   2.5 V     133, 150, 200 MHz    100 nm
  DDR2 (2004)                  1.8 V     266, 333, 400 MHz    80-90 nm
  DDR3 (2010)                  1.5 V     533, 666, 800 MHz    32-40 nm
  DDR4 (expected late 2013?)   1-1.2 V   1066-1600 MHz        2x nm

– DRAM packaging unit: DIMM
  – Dual Inline Memory Module
  – 8 bytes wide + ECC (Error-Correcting Code)
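Since a DIMM is 8 bytes wide and DDR transfers on both clock edges, peak bandwidth follows directly: 2 transfers/clock × clock rate × 8 bytes. A back-of-the-envelope C helper (my own sketch, not from the slides):

#include <stdio.h>

static double ddr_peak_gbs(double clock_mhz) {
    /* 2 transfers per clock, 8-byte-wide DIMM */
    return 2.0 * clock_mhz * 1e6 * 8.0 / 1e9;
}

int main(void) {
    /* DDR3 at an 800 MHz clock: 1600 MT/s x 8 B = 12.8 GB/s,
     * which is why such modules are sold as PC3-12800. */
    printf("DDR3-1600 peak: %.1f GB/s\n", ddr_peak_gbs(800.0));
    return 0;
}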

GPU Memory
– GDRAM = Graphics DRAM; GSDRAM = Graphics Synchronous DRAM
  – GDDR5 is based on DDR3; earlier GDDRs were based on DDR2
– GPUs demand more bandwidth because of their higher performance due to greater parallelism
– Greater bandwidth is achieved with:
  – A wider interface: 32 bits (vs. 4, 8, or 16 for CPUs)
  – A higher clock rate, allowed by soldering the memory onto the GPU board rather than using snap-in sockets as for CPU DIMMs
– In practice, GDDR is 2-5x faster than DDR

SDRAM Power Optimizations
– A higher clock rate means greater (static and dynamic) power draw of the SDRAM
– Optimizing power becomes important as SDRAMs grow
– Common techniques include:
  – Reduce the operating voltage (from 1.5 to 1.35 V)
  – Introduce multiple banks
    – Only one bank opens at a time to deliver subsequent words
– Recent SDRAMs enter a power-down mode
  – In this mode the memory modules ignore the clock
  – However, the data still needs to be refreshed
    – Usually implemented with internal refresh circuitry

Flash Memory Introduction
– Flash memory is a type of EEPROM
  – Electronically Erasable Programmable Read-Only Memory
  – Read-only under normal operation
  – Erasable when a special voltage is applied
    – The memory module is flash-erased
  – Suitable for long-term storage
– Used in mobile form factors: phones, tablets, laptops
  – May be used as another level of the memory hierarchy due to the small size of RAM in these devices

Flash Memory Properties
– Flash memory must be erased before it is overwritten
  – NAND flash (higher-density flash) erasing is done by blocks
  – This becomes a software (OS kernel) problem: assemble data into blocks and clean up old blocks
– Flash memory is static
  – No continuous refreshes; almost no power draw when inactive
– Flash has a limited number of times (100k) each block can be written
– Cheaper than SDRAM, more expensive than disk
  – Flash: $2/GB, SDRAM: $30/GB, magnetic disk: $0.09/GB
– Much slower than SDRAM, much faster than disk
  – For reads: 4 times slower than SDRAM, 1000x faster than disk
  – Writes to flash are 2x-20x slower than reads

Memory Dependability
– Errors in memory systems
  – Permanent errors during fabrication
  – Dynamic errors during operation: soft errors
    – Cosmic rays (high-energy particles)
– Manufacturing defects are accommodated with spare rows
  – After fabrication, memory modules are configured
  – Defective rows are disabled and spare rows replace them
– Dynamic errors are detected with parity bits and corrected with ECC (Error-Correcting Codes)
  – Instruction caches are read-only, so parity bits suffice
  – A parity bit does not detect multi-bit errors: 1 parity bit per 8 data bits
  – ECC detects 2 errors and corrects 1 error, using 8 ECC bits per 64 data bits

Memory Dependability: Chipkill
– High-end servers and warehouse-scale clusters need Chipkill
  – Developed by IBM; Intel calls it SDDC
– Chipkill is like RAID for disks
  – Parity bits and ECC are not kept together with the data but are distributed
  – A complete failure of a single memory module can be handled
– According to an IBM analysis, a 10,000-processor server with 4 GB/CPU has this many unrecoverable failures in 3 years:
  – 90,000 when only parity is used
  – 3,500 with ECC only
  – 6 with Chipkill
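A minimal C sketch of the even-parity scheme quoted above (1 parity bit per 8 data bits): a single flipped bit changes the parity and is detected, while two flipped bits cancel out and go undetected, which is why ECC is needed for correction.

#include <stdint.h>
#include <stdio.h>

static uint8_t parity8(uint8_t byte) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1;          /* XOR-fold all eight data bits */
    return p;
}

int main(void) {
    uint8_t data = 0xA5;
    uint8_t stored_parity = parity8(data);
    data ^= 0x08;                      /* single-bit upset (soft error) */
    printf("error detected: %s\n",
           parity8(data) != stored_parity ? "yes" : "no");
    return 0;
}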

Virtual Memory Basics
– Virtual memory is
  – Cooperation between hardware (the TLB = translation look-aside buffer, segmented memory, the segmentation-fault mechanism) and software (the OS kernel)
  – The same address space for each running process
  – A protection mechanism to keep each process's memory data private
– A context switch changes the process which the CPU is executing

Hardware Requirements for Virtual Memory
– Provide user and kernel modes of execution
  – Only the kernel is allowed certain operations
– Make some CPU state read-only
  – Read-only: "Am I in kernel mode?" vs. "Change to kernel mode."
– Provide a programmable way of changing modes
  – Entering the kernel usually occurs with a system call and/or a special assembly instruction
  – There is an instruction to return to the previous mode by restoring the register file saved before the system call
– Check memory protection for every memory access

Rationale Behind Virtual Machines
– If complexity is the problem, then a simpler system has a better chance of being secure
– Virtual machines have a smaller "code base" in terms of hardware and software
– The OS kernel no longer has to be trusted
  – In fact, it could be a malicious OS and the VM system will keep it isolated

Virtual Memory Caveats
– Flushing all caches takes too many cycles for each context switch
– This becom… (transcription ends here)
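To make the translation that the TLB caches concrete, here is a small C sketch of how a virtual address splits into a virtual page number (VPN, looked up through the TLB) and a page offset (copied unchanged); the 4 KB page size is an assumption, as the slides do not specify one:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define PAGE_BITS 12   /* assumed 4 KB pages */

int main(void) {
    uint64_t vaddr  = 0x00007f1234567890ULL;
    uint64_t vpn    = vaddr >> PAGE_BITS;                /* TLB lookup key */
    uint64_t offset = vaddr & ((1ULL << PAGE_BITS) - 1); /* passed through */
    printf("vpn=%#" PRIx64 " offset=%#" PRIx64 "\n", vpn, offset);
    return 0;
}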
