Chapter 2: Memory Hierarchy Design (cse.msu.edu)


Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 2: Memory Hierarchy Design
Copyright 2012, Elsevier Inc. All rights reserved.

Introduction
- Programmers want unlimited amounts of memory with low latency
- Fast memory technology is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
  - Entire addressable memory space available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in smaller memories
  - Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchy (figure)

Memory Performance Gap (figure)

Memory Hierarchy Design
- Memory hierarchy design becomes more crucial with recent multi-core processors:
  - Aggregate peak bandwidth grows with # cores:
    - Intel Core i7 can generate two references per core per clock
    - Four cores and 3.2 GHz clock:
      - 25.6 billion 64-bit data references/second
      - 12.8 billion 128-bit instruction references/second
      - = 409.6 GB/s!
  - DRAM bandwidth is only 6% of this (25 GB/s)
  - Requires:
    - Multi-port, pipelined caches
    - Two levels of cache per core
    - Shared third-level cache on chip
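The peak-bandwidth arithmetic above can be checked in a few lines (a sketch; the core count, clock rate, and reference widths are taken from the slide):

```python
# Peak bandwidth demanded by a hypothetical 4-core, 3.2 GHz Core i7
# issuing two data references per core per clock.
cores = 4
clock_hz = 3.2e9
data_refs = cores * clock_hz * 2      # two 64-bit data references per core per clock
inst_refs = cores * clock_hz          # one 128-bit instruction reference per core per clock

# 64-bit = 8 bytes, 128-bit = 16 bytes
peak_gbs = (data_refs * 8 + inst_refs * 16) / 1e9

dram_fraction = 25 / peak_gbs         # 25 GB/s of available DRAM bandwidth

print(peak_gbs)        # 409.6 GB/s
print(dram_fraction)   # roughly 6%
```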

Performance and Power
- High-end microprocessors have 10 MB of on-chip cache
  - Consumes a large amount of the area and power budget

Memory Hierarchy Basics
- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or the main memory
  - Also fetch the other words contained within the block
    - Takes advantage of spatial locality
  - Place the block into the cache in any location within its set, determined by the address
    - block address MOD number of sets
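The block-placement rule can be sketched directly; the block size and set count below are illustrative assumptions, not values from the slides:

```python
# Map a byte address to a cache set using the slide's
# (block address) MOD (number of sets) rule.
BLOCK_SIZE = 64   # bytes per block (assumed)
NUM_SETS = 128    # number of sets (assumed)

def set_index(byte_address):
    block_address = byte_address // BLOCK_SIZE   # strip the block offset
    return block_address % NUM_SETS              # MOD rule picks the set

# Addresses one block apart land in adjacent sets; addresses
# NUM_SETS blocks apart collide in the same set.
```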

Memory Hierarchy Basics
- n sets => n-way set associative
  - Direct-mapped cache: one block per set
  - Fully associative: one set
- Writing to cache: two strategies
  - Write-through: immediately update lower levels of the hierarchy
  - Write-back: only update lower levels of the hierarchy when an updated block is replaced
  - Both strategies use a write buffer to make writes asynchronous
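A minimal sketch contrasting the two write policies (the class names, dictionary-backed "memory", and single-level simplification are all illustrative assumptions):

```python
class WriteThroughCache:
    """On a write, update the cache and the lower level immediately."""
    def __init__(self, memory):
        self.memory = memory   # stands in for the lower level of the hierarchy
        self.data = {}
    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value        # write-through: lower level updated now

class WriteBackCache:
    """On a write, mark the block dirty; update the lower level only on eviction."""
    def __init__(self, memory):
        self.memory = memory
        self.data = {}
        self.dirty = set()
    def write(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)             # lower level NOT updated yet
    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.data[addr]   # write back on replacement
            self.dirty.discard(addr)
        self.data.pop(addr, None)
```

Write-back reduces lower-level traffic when a block is written repeatedly, at the cost of tracking a dirty bit per block.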

Memory Hierarchy Basics
- Miss rate: fraction of cache accesses that result in a miss
- Causes of misses:
  - Compulsory: first reference to a block
  - Capacity: blocks discarded and later retrieved
  - Conflict: program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

Memory Hierarchy Basics
- Note that speculative and multithreaded processors may execute other instructions during a miss
  - Reduces the performance impact of misses

Memory Hierarchy Basics
Six basic cache optimizations:
- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time
- Giving priority to read misses over writes
  - Reduces miss penalty
- Avoiding address translation in cache indexing
  - Reduces hit time

Ten Advanced Optimizations
1. Small and simple first-level caches
- Critical timing path:
  - addressing tag memory, then
  - comparing tags, then
  - selecting the correct set
- Direct-mapped caches can overlap tag compare and transmission of data
- Lower associativity reduces power because fewer cache lines are accessed

Addressing Cache (figure; from Chapter 5 — Large and Fast: Exploiting Memory Hierarchy)

Example: Intrinsity FastMATH (figure)

Set Associative Cache Organization (figure)

L1 Size and Associativity (figure: access time vs. size and associativity)

L1 Size and Associativity (figure: energy per read vs. size and associativity)

Way Prediction
- To improve hit time, predict the way to pre-set the mux
  - Misprediction gives a longer hit time
- Prediction accuracy:
  - > 90% for two-way
  - > 80% for four-way
  - I-cache has better accuracy than D-cache
- First used on the MIPS R10000 in the mid-90s
- Used on the ARM Cortex-A8
- Extend to predict the block as well
  - "Way selection"
  - Increases misprediction penalty
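A toy model of way prediction, where each set remembers the last way that hit and the mux is pre-set to it. The cycle costs (1-cycle hit, +1 cycle on a mispredict) are illustrative assumptions, not figures from the slides:

```python
predicted_way = {}   # per-set predictor: the last way that hit in that set

def access_cycles(set_index, actual_way):
    guess = predicted_way.get(set_index, 0)   # mux pre-set to the predicted way
    predicted_way[set_index] = actual_way     # train the predictor on the outcome
    return 1 if guess == actual_way else 2    # misprediction -> longer hit time
```

Repeated hits to the same way in a set pay the fast 1-cycle path; switching ways pays the extra cycle once, then trains the predictor.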

Pipelining Cache
- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro through Pentium III: 2 cycles
  - Pentium 4 through Core i7: 4 cycles
- Increases branch misprediction penalty
- Makes it easier to increase associativity

Nonblocking Caches
- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
- L2 must support this
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty

Multibanked Caches
- Organize the cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to block address

Critical Word First, Early Restart
- Critical word first:
  - Request the missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart:
  - Request words in normal order
  - Send the missed word to the processor as soon as it arrives
- The effectiveness of these strategies depends on block size and the likelihood of another access to the portion of the block that has not yet been fetched
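The fetch order under critical word first can be sketched as a wrap-around sequence starting at the missed word, so the processor can restart as soon as the first transfer arrives (an 8-word block is assumed here):

```python
def critical_word_first_order(missed_word, words_per_block=8):
    """Return the order in which the block's words are transferred:
    the missed word first, then the rest, wrapping around."""
    return [(missed_word + i) % words_per_block for i in range(words_per_block)]

print(critical_word_first_order(5))   # [5, 6, 7, 0, 1, 2, 3, 4]
```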

Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the write buffer
- Reduces stalls due to a full write buffer
- Do not apply to I/O addresses
(figure: no write buffering vs. write buffering)
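A minimal sketch of the merging behavior (the dictionary-backed buffer and four-entry capacity are illustrative assumptions):

```python
write_buffer = {}   # block address -> pending data

def buffered_store(block_addr, data, capacity=4):
    """Accept a store into the write buffer; merge if the block is already pending."""
    if block_addr in write_buffer:
        write_buffer[block_addr] = data   # merge: update the existing entry in place
        return True
    if len(write_buffer) < capacity:
        write_buffer[block_addr] = data   # allocate a new buffer entry
        return True
    return False                          # buffer full: this store would stall
```

Because repeated stores to the same block coalesce into one entry, the buffer fills more slowly and the processor stalls less often.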

Compiler Optimizations
- Loop interchange:
  - Swap nested loops to access memory in sequential order
- Blocking:
  - Instead of accessing entire rows or columns, subdivide matrices into blocks
  - Requires more memory accesses but improves locality of accesses
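Both transformations can be sketched in a few lines. Python lists stand in for the row-major arrays these optimizations target in C/Fortran; N and the block size B are arbitrary:

```python
N = 4
A = [[i * N + j for j in range(N)] for i in range(N)]   # row-major N x N matrix

# Loop interchange: with row-major storage, making the column index the inner
# loop gives stride-1 accesses instead of stride-N accesses.
total = 0
for i in range(N):          # rows outer
    for j in range(N):      # columns inner: sequential memory order
        total += A[i][j]

# Blocking: tile a transpose into B x B blocks so each tile of A and T
# stays cache-resident while it is being touched.
B = 2
T = [[0] * N for _ in range(N)]
for ii in range(0, N, B):
    for jj in range(0, N, B):
        for i in range(ii, min(ii + B, N)):
            for j in range(jj, min(jj + B, N)):
                T[j][i] = A[i][j]
```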

Hardware Prefetching
- Fetch two blocks on a miss (include the next sequential block)
(figure: Pentium 4 prefetching)

Compiler Prefetching
- Insert prefetch instructions before the data is needed
- Non-faulting: a prefetch doesn't cause exceptions
- Register prefetch: loads data into a register
- Cache prefetch: loads data into the cache
- Combine with loop unrolling and software pipelining

Summary (figure)

Memory Technology
- Performance metrics:
  - Latency is the concern of the cache
  - Bandwidth is the concern of multiprocessors and I/O
- Access time: time between a read request and when the desired word arrives
- Cycle time: minimum time between unrelated requests to memory
- DRAM is used for main memory, SRAM for cache

Memory Technology
- SRAM:
  - Requires low power to retain bits
  - Requires 6 transistors/bit
- DRAM:
  - Must be re-written after being read
  - Must also be periodically refreshed
    - Every 8 ms
    - Each row can be refreshed simultaneously
  - One transistor/bit
  - Address lines are multiplexed:
    - Upper half of address: row access strobe (RAS)
    - Lower half of address: column access strobe (CAS)
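The multiplexed addressing can be sketched as a bit split: the upper half of the address is presented with RAS, then the lower half with CAS. The 8-bit row and column widths are illustrative assumptions:

```python
ROW_BITS = COL_BITS = 8   # assumed widths for a 16-bit DRAM address

def split_dram_address(addr):
    """Split an address into the row half (sent with RAS) and
    the column half (sent with CAS)."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)  # upper half: RAS
    col = addr & ((1 << COL_BITS) - 1)                # lower half: CAS
    return row, col

print(split_dram_address(0xABCD))   # (0xAB, 0xCD)
```

Multiplexing halves the number of address pins at the cost of presenting the address in two steps.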

Memory Technology
- Amdahl: memory capacity should grow linearly with processor speed
  - Unfortunately, memory capacity and speed have not kept pace with processors
- Some optimizations:
  - Multiple accesses to the same row
  - Synchronous DRAM:
    - Added clock to the DRAM interface
    - Burst mode with critical word first
  - Wider interfaces
  - Double data rate (DDR)
  - Multiple banks on each DRAM device

Memory Optimizations (figure)

Memory Optimizations (figure)

Memory Optimizations
- DDR:
  - DDR2:
    - Lower power (2.5 V - 1.8 V)
    - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  - DDR3:
    - 1.5 V
    - 800 MHz
  - DDR4:
    - 1-1.2 V
    - 1600 MHz
- GDDR5 is graphics memory based on DDR3

Memory Optimizations
- Graphics memory:
  - Achieves 2-5x bandwidth per DRAM vs. DDR3
    - Wider interfaces (32 vs. 16 bits)
    - Higher clock rate
      - Possible because they are attached via soldering instead of socketed DIMM modules
- Reducing power in SDRAMs:
  - Lower voltage
  - Low-power mode (ignores clock, continues to refresh)

Memory Power Consumption (figure)

Flash Memory
- Type of EEPROM
- Must be erased (in blocks) before being overwritten
- Nonvolatile
- Limited number of write cycles
- Cheaper than SDRAM, more expensive than disk
- Slower than SDRAM, faster than disk

Memory Dependability
- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected and fixed by error-correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique
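ECC in real memory systems typically uses SECDED codes over 64-bit words; as a minimal stand-in, a Hamming(7,4) code shows how a syndrome both detects and locates a single flipped bit:

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword
    laid out as positions 1..7 = p1, p2, d1, p3, d2, d3, d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4    # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4    # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4    # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Recompute the parity checks; a nonzero syndrome is the 1-based
    position of the flipped bit. Fix it and return the 4 data bits."""
    c = codeword[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1          # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4
```

Any single bit flip (data or parity) is corrected; this is the mechanism behind the slide's "detected and fixed by ECC", scaled down to 4 data bits.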

Virtual Memory
- Protection via virtual memory
  - Keeps processes in their own memory space
- Role of architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of CPU state
  - Provide mechanisms for switching between user mode and supervisor mode
  - Provide mechanisms to limit memory accesses
  - Provide a TLB to translate addresses

Virtual Machines
- Support isolation and security
- Allow sharing a computer among many unrelated users
- Enabled by the raw speed of processors, making the overhead more acceptable
- Allow different ISAs and operating systems to be presented to user programs
  - "System Virtual Machines"
  - SVM software is called a "virtual machine monitor" or "hypervisor"
  - Individual virtual machines running under the monitor are called "guest VMs"

Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables
  - The VMM adds a level of memory between physical and virtual memory called "real memory"
  - The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
    - Requires the VMM to detect the guest's changes to its own page table
    - Occurs naturally if accessing the page table pointer is a privileged operation
