Chapter 2: Memory Hierarchy Design


Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 2: Memory Hierarchy Design
Copyright 2012, Elsevier Inc. All rights reserved.

Introduction

Programmers want unlimited amounts of memory with low latency, but fast memory technology is more expensive per bit than slower memory.

Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory.
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.

Temporal and spatial locality ensure that nearly all references can be found in the smaller memories, giving the illusion of a large, fast memory being presented to the processor.

Memory Hierarchy

[Figure: the levels of the memory hierarchy]

Memory Performance Gap

[Figure: the processor-memory performance gap]

Memory Hierarchy Design

Memory hierarchy design becomes more crucial with recent multi-core processors. Aggregate peak bandwidth grows with the number of cores:
- Intel Core i7 can generate two references per core per clock.
- Four cores and a 3.2 GHz clock:
  - 25.6 billion 64-bit data references/second, plus
  - 12.8 billion 128-bit instruction references/second
  - = 409.6 GB/s!
- DRAM bandwidth is only 6% of this (25 GB/s).
- Requires: multi-port, pipelined caches; two levels of cache per core; a shared third-level cache on chip.

Performance and Power

High-end microprocessors have 10 MB or more of on-chip cache, which consumes a large amount of the area and power budget.

Memory Hierarchy: Terminology

- Hit: data appears in some block in the upper level (example: Block X).
  - Hit Rate: the fraction of memory accesses found in the upper level.
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
- Miss: data needs to be retrieved from a block in the lower level (Block Y).
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
- Hit Time << Miss Penalty (500 instructions on the 21264!)

[Figure: Block X in upper-level memory and Block Y in lower-level memory, moving to/from the processor]

Cache Measures

- Hit rate: fraction of accesses found in that level.
  - Usually so high that we talk about the miss rate instead.
  - Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance — an indirect measure that can mislead.
- Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
- Miss penalty: time to replace a block from the lower level, including time to replace in the CPU:
  - access time: time to reach the lower level = f(latency to lower level)
  - transfer time: time to transfer the block = f(bandwidth between upper and lower levels)

4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
- Set-associative mapping: set = block number modulo number of sets.
- Direct mapped: 12 mod 8 = 4
- 2-way set associative: 12 mod 4 = 0
- Fully associative: anywhere

[Figure: memory block 12 mapped into an 8-block cache under each organization]

Q2: How is a block found if it is in the upper level?

- Tag on each block: no need to check the index or block offset.
- Increasing associativity shrinks the index and expands the tag.

Block Address = | Tag | Index | Block Offset |

Example

Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks. Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture.

Offset: need to specify the correct byte within a block.
- A block contains 4 words = 16 bytes = 2^4 bytes
- So we need 4 bits to specify the correct byte.

Example [contd.]

Index: (index into an "array of blocks") — need to specify the correct row in the cache.
- The cache contains 16 KB = 2^14 bytes; a block contains 2^4 bytes (4 words).
- # rows/cache = # blocks/cache (since there's one block/row)
  = (bytes/cache) / (bytes/row)
  = (2^14 bytes/cache) / (2^4 bytes/row)
  = 2^10 rows/cache
- So we need 10 bits to specify this many rows.

Example [contd.]

Tag: use the remaining bits as the tag.
- tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits
- So the tag is the leftmost 18 bits of the memory address.

Accessing data in cache

Example: 16 KB of data, direct mapped, 4-word blocks. Read 4 addresses:
- 0x00000014, 0x0000001C, 0x00000034, 0x00008014

Memory values (only cache/memory level of the hierarchy shown):
- 0x00000010-0x0000001C: a, b, c, d
- 0x00000030-0x0000003C: e, f, g, h
- 0x00008010-0x0000801C: i, j, k, l

Accessing data in cache [contd.]

4 Addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014.

Divided (for convenience) into Tag, Index, Byte Offset fields:

  Tag (18 bits)        Index (10 bits)  Offset (4 bits)
  000000000000000000   0000000001       0100
  000000000000000000   0000000001       1100
  000000000000000000   0000000011       0100
  000000000000000010   0000000001       0100

16 KB Direct-Mapped Cache, 16 B blocks

- Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid).

[Table: 1024 rows (index 0-1023), each with a Valid bit, Tag, and data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f; all Valid bits 0]

Read 0x00000014

Address fields: Tag = 000000000000000000, Index = 0000000001, Offset = 0100.

[Table: cache state — all entries invalid]

So we read block 1 (index 0000000001).

[Table: cache state — all entries still invalid]

No valid data in block 1.

[Table: cache state — Valid bit of index 1 is 0]

So load that data into the cache, setting the tag and valid bit.

[Table: index 1 now has Valid = 1, Tag = 0, data = a b c d]

Read from the cache at offset 0x4, return word b.

[Table: index 1 has Valid = 1, Tag = 0, data = a b c d]

Read 0x0000001C

Address fields: Tag = 000000000000000000, Index = 0000000001, Offset = 1100.

[Table: index 1 has Valid = 1, Tag = 0, data = a b c d]

Data valid, tag OK — so read offset 0xC, return word d.

[Table: cache state unchanged]

Read 0x00000034

Address fields: Tag = 000000000000000000, Index = 0000000011, Offset = 0100.

[Table: cache state unchanged]

So read block 3 (index 0000000011).

[Table: cache state unchanged]

No valid data in block 3.

[Table: cache state — Valid bit of index 3 is 0]

Load that cache block, return word f.

[Table: index 3 now has Valid = 1, Tag = 0, data = e f g h]

Read 0x00008014

Address fields: Tag = 000000000000000010, Index = 0000000001, Offset = 0100.

[Table: index 1: Valid = 1, Tag = 0, a b c d; index 3: Valid = 1, Tag = 0, e f g h]

So read cache block 1 — the data is valid.

[Table: cache state unchanged]

But the cache block 1 tag does not match (0 != 2).

[Table: cache state unchanged]

Miss, so replace block 1 with the new data and tag.

[Table: index 1 now has Valid = 1, Tag = 2, data = i j k l]

And return word j.

[Table: index 1: Valid = 1, Tag = 2, i j k l; index 3: Valid = 1, Tag = 0, e f g h]

Q3: Which block should be replaced on a miss?

- Easy for direct mapped.
- Set associative or fully associative: Random or LRU (Least Recently Used).

Miss rates by associativity and replacement policy:

  Size     2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
  16 KB    5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
  64 KB    1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
  256 KB   1.15%      1.17%      1.13%      1.13%      1.12%      1.12%

Q3: After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache?

- The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity.
- A randomly chosen block? Easy to implement — how well does it work?

Miss rate for a 2-way set-associative cache:

  Size     LRU     Random
  16 KB    5.2%    5.7%
  64 KB    1.9%    2.0%
  256 KB   1.15%   1.17%

Also, try other LRU approximations.

Q4: What happens on a write?

  Policy                                   Write-Through                  Write-Back
  Data written to cache block...           also written to lower-level    only to the cache; lower level
                                           memory                         updated when a block falls out
                                                                          of the cache
  Debug                                    Easy                           Hard
  Do read misses produce writes?           No                             Yes
  Do repeated writes make it to            Yes                            No
  the lower level?

Additional option: let writes to an un-cached address allocate a new cache line ("write allocate").

Write Buffers for Write-Through Caches

[Figure: Processor <-> Cache <-> Lower-Level Memory, with a Write Buffer between cache and memory holding data awaiting write-through]

- Q. Why a write buffer? A. So the CPU doesn't stall.
- Q. Why a buffer — why not just one register? A. Bursts of writes are common.
- Q. Are Read-After-Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffers before sending the read.

Memory Hierarchy Basics

When a word is not found in the cache, a miss occurs:
- Fetch the word from the lower level in the hierarchy, requiring a higher-latency reference.
- The lower level may be another cache or the main memory.
- Also fetch the other words contained within the block — this takes advantage of spatial locality.
- Place the block into the cache in any location within its set, determined by the address: block address MOD number of sets.

Memory Hierarchy Basics

- n sets → n-way set associative
  - Direct-mapped cache → one block per set
  - Fully associative → one set
- Writing to cache: two strategies
  - Write-through: immediately update lower levels of the hierarchy.
  - Write-back: only update lower levels of the hierarchy when an updated block is replaced.
  - Both strategies use a write buffer to make writes asynchronous.

Memory Hierarchy Basics

Miss rate: fraction of cache accesses that result in a miss.

Causes of misses:
- Compulsory: first reference to a block.
- Capacity: blocks discarded and later retrieved.
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache.

Memory Hierarchy Basics

Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.

Memory Hierarchy Basics

Six basic cache optimizations:
- Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty.
- Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption.
- Higher associativity: reduces conflict misses; increases hit time, increases power consumption.
- Higher number of cache levels: reduces overall memory access time.
- Giving priority to read misses over writes: reduces miss penalty.
- Avoiding address translation in cache indexing: reduces hit time.

Ten Advanced Optimizations

Small and simple first-level caches:
- Critical timing path: addressing tag memory, then comparing tags, then selecting the correct set.
- Direct-mapped caches can overlap tag compare and transmission of data.
- Lower associativity reduces power because fewer cache lines are accessed.

L1 Size and Associativity

[Figure: access time vs. size and associativity]

L1 Size and Associativity

[Figure: energy per read vs. size and associativity]

Way Prediction

To improve hit time, predict the way to pre-set the mux:
- A mis-prediction gives a longer hit time.
- Prediction accuracy: > 90% for two-way, > 80% for four-way; the I-cache has better accuracy than the D-cache.
- First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8.
- Extend to predict the block as well ("way selection"): increases the mis-prediction penalty.

Pipelining Cache

Pipeline cache access to improve bandwidth. Examples:
- Pentium: 1 cycle
- Pentium Pro - Pentium III: 2 cycles
- Pentium 4 - Core i7: 4 cycles

Increases the branch mis-prediction penalty; makes it easier to increase associativity.

Nonblocking Caches

Allow hits before previous misses complete:
- "Hit under miss"
- "Hit under multiple miss"
- L2 must support this.
- In general, processors can hide the L1 miss penalty but not the L2 miss penalty.

Multibanked Caches

Organize the cache as independent banks to support simultaneous access:
- ARM Cortex-A8 supports 1-4 banks for L2.
- Intel i7 supports 4 banks for L1 and 8 banks for L2.

Interleave banks according to block address.

Critical Word First, Early Restart

- Critical word first: request the missed word from memory first; send it to the processor as soon as it arrives.
- Early restart: request words in normal order; send the missed word to the processor as soon as it arrives.

The effectiveness of these strategies depends on block size and the likelihood of another access to the portion of the block that has not yet been fetched.

Merging Write Buffer

- When storing to a block that is already pending in the write buffer, update the write buffer.
- Reduces stalls due to a full write buffer.
- Do not apply to I/O addresses.

[Figure: write buffer entries without and with write merging]

Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4 B blocks, in software.

Instructions:
- Reorder procedures in memory so as to reduce conflict misses.
- Profiling to look at conflicts (using tools they developed).

Data:
- Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays.
- Loop interchange: change the nesting of loops to access data in the order stored in memory.
- Loop fusion: combine 2 independent loops that have the same looping and some variables overlap.
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves temporal locality by reusing a and c while they are still in the cache.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

Two inner loops:
- Read all N x N elements of z[].
- Read N elements of 1 row of y[] repeatedly.
- Write N elements of 1 row of x[].

Capacity misses are a function of N and cache size:
- 2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse).

Idea: compute on a B x B submatrix that fits in the cache.

Blocking Example [contd.]

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1, N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1, N); k = k+1)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

B is called the Blocking Factor. Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2.

Hardware Prefetching

Fetch two blocks on a miss: the requested block and the next sequential block.

[Figure: performance of Pentium 4 pre-fetching]

Compiler Prefetching

- Insert prefetch instructions before the data is needed.
- Non-faulting: a prefetch doesn't cause exceptions.
- Register prefetch: loads data into a register.
- Cache prefetch: loads data into the cache.
- Combine with loop unrolling and software pipelining.

Summary

[Figure: summary of the ten advanced cache optimizations]

Memory Technology

Performance metrics:
- Latency is the concern of the cache.
- Bandwidth is the concern of multiprocessors and I/O.
- Access time: time between a read request and when the desired word arrives.
- Cycle time: minimum time between unrelated requests to memory.

DRAM is used for main memory; SRAM is used for cache.

Memory Technology

SRAM:
- Requires low power to retain bits.
- Requires 6 transistors/bit.

DRAM:
- Must be re-written after being read.
- Must also be periodically refreshed (every 8 ms); each row can be refreshed simultaneously.
- One transistor/bit.
- Address lines are multiplexed: upper half of the address is the row access strobe (RAS), lower half is the column access strobe (CAS).

Memory Technology

Amdahl: memory capacity should grow linearly with processor speed. Unfortunately, memory capacity and speed have not kept pace with processors.

Some optimizations:
- Multiple accesses to the same row.
- Synchronous DRAM: added a clock to the DRAM interface; burst mode with critical word first.
- Wider interfaces.
- Double data rate (DDR).
- Multiple banks on each DRAM device.

Memory Optimizations

[Figure/table: DRAM generations and optimizations]

Memory Optimizations

[Figure/table: DRAM generations and optimizations, continued]

Memory Optimizations

DDR:
- DDR2: lower power (2.5 V → 1.8 V); higher clock rates (266 MHz, 333 MHz, 400 MHz).
- DDR3: 1.5 V; 800 MHz.
- DDR4: 1-1.2 V; 1600 MHz.

GDDR5 is graphics memory based on DDR3.

Memory Optimizations

Graphics memory:
- Achieves 2-5x the bandwidth per DRAM vs. DDR3:
  - Wider interfaces (32 vs. 16 bits).
  - Higher clock rate — possible because the chips are attached via soldering instead of socketed DIMM modules.

Reducing power in SDRAMs:
- Lower voltage.
- Low-power mode (ignores the clock, continues to refresh).

Memory Power Consumption

[Figure: memory power consumption]

Flash Memory

- A type of EEPROM.
- Must be erased (in blocks) before being overwritten.
- Nonvolatile.
- Limited number of write cycles.

