Lecture 6: Chipkill, PCM

Upload by : Pierre Damon

Topics: error correction, PCM basics, PCM writes and errors

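Chipkill
• Chipkill-correct systems can withstand the failure of an entire DRAM chip
• For chipkill correctness, each bit of an ECC word must come from a different chip, so that a failed chip corrupts at most one bit per word: the 72-bit word must be spread across 72 DRAM chips, or a 13-bit word (8-bit data and 5-bit ECC) must be spread across 13 DRAM chips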
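The 13-bit case can be sketched in a few lines. This is only an illustration, assuming a textbook Hamming code plus one overall parity bit (SECDED) as the 5-bit ECC; the slide does not say which code real chipkill systems use. The names `encode` and `decode` are made up for this sketch, and list index i stands for the bit supplied by chip i.

```python
# Assumed layout: index 0 is the overall parity bit, indices 1, 2, 4, 8 are
# Hamming parity bits, and the remaining 8 indices hold the data bits.
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]
PARITY_POSITIONS = [1, 2, 4, 8]


def encode(data: int) -> list[int]:
    """Encode 8 data bits into a 13-bit codeword, one bit per DRAM chip."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[j] for j in range(1, 13) if j & p and j != p) % 2
    word[0] = sum(word[1:]) % 2  # overall parity makes the total even
    return word


def decode(word: list[int]) -> tuple[int, str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for j in range(1, 13):
        if word[j]:
            syndrome ^= j
    overall = sum(word) % 2
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= 12:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. two chips returned wrong bits
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status
```

Because every chip contributes exactly one bit to the word, a dead chip looks like a single-bit error, which this code corrects; two simultaneous bad chips are detected but not corrected.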

RAID-like DRAM Designs
• DRAM chips do not have built-in error detection
• Can employ a 9-chip rank with ECC to detect and recover from a single error; in case of a multi-bit error, rely on a second tier of error correction
• Can do parity across DIMMs (needs an extra DIMM); use ECC within a DIMM to recover from 1-bit errors; use parity across DIMMs to recover from multi-bit errors in 1 DIMM
• Reads are cheap (must only access 1 DIMM); writes are expensive (must read and write 2 DIMMs)
• Used in some HP servers
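The read/write asymmetry of parity across DIMMs can be seen in a small model. This is a sketch under simplifying assumptions: each DIMM is a list of integer "lines", the per-DIMM ECC is not modeled, and the class name `ParityProtectedMemory` and its `accesses` counter are invented for illustration.

```python
class ParityProtectedMemory:
    """Toy model: N data DIMMs plus one parity DIMM holding their XOR."""

    def __init__(self, num_data_dimms: int, lines_per_dimm: int):
        self.dimms = [[0] * lines_per_dimm for _ in range(num_data_dimms)]
        self.parity = [0] * lines_per_dimm
        self.accesses = 0  # DIMM reads plus DIMM writes issued so far

    def read(self, dimm: int, line: int) -> int:
        self.accesses += 1  # a read touches only the one data DIMM
        return self.dimms[dimm][line]

    def write(self, dimm: int, line: int, value: int) -> None:
        old_value = self.dimms[dimm][line]   # read data DIMM
        old_parity = self.parity[line]       # read parity DIMM
        self.dimms[dimm][line] = value       # write data DIMM
        self.parity[line] = old_parity ^ old_value ^ value  # write parity DIMM
        self.accesses += 4

    def reconstruct(self, failed_dimm: int, line: int) -> int:
        """Rebuild a line whose in-DIMM ECC reported a multi-bit error."""
        value = self.parity[line]
        for d, contents in enumerate(self.dimms):
            if d != failed_dimm:
                value ^= contents[line]
        return value
```

Each write costs four DIMM accesses (two reads, two writes) against one for a read, which is the expense the slide refers to.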

RAID-like DRAM (Udipi et al., ISCA’10)
• Add a checksum to every row in DRAM; verified at the memory controller
• Adds area overhead, but provides self-contained error detection
• When a chip fails, can re-construct data by examining a separate parity DRAM chip
• Can control overheads by having a checksum for a large row or one parity chip for many data chips
• Writes are again problematic
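The division of labor between checksum (find the bad chip) and parity (rebuild its data) might look like the sketch below. CRC-32 from Python's `zlib` is used purely as a stand-in checksum; the slide does not specify the checksum function, and `store` and `read_all` are hypothetical helper names.

```python
import zlib
from functools import reduce


def checksum(row: bytes) -> int:
    # Stand-in checksum; the actual function used in hardware is not given.
    return zlib.crc32(row)


def xor_rows(rows: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized rows."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*rows))


def store(data_rows: list[bytes]):
    """Return per-chip (row, checksum) pairs and the parity chip's row."""
    parity = xor_rows(data_rows)
    return [(row, checksum(row)) for row in data_rows], parity


def read_all(stored, parity: bytes) -> list[bytes]:
    """Verify each chip's checksum; rebuild at most one failing chip."""
    bad = [i for i, (row, cs) in enumerate(stored) if checksum(row) != cs]
    if len(bad) > 1:
        raise ValueError("more than one chip failed; parity cannot recover")
    rows = [row for row, _ in stored]
    if bad:
        survivors = [r for i, r in enumerate(rows) if i != bad[0]]
        rows[bad[0]] = xor_rows(survivors + [parity])
    return rows
```

The checksum tells the controller which chip to distrust, so a single XOR parity chip is enough to recover the data.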

SSC-DSD
• The cache line is organized into multi-bit symbols
• Two symbols are required for error detection and 3/4 symbols are used for error correction (can handle complete failure in one symbol, i.e., each symbol is fetched from a different DRAM chip)
• 3-symbol codes are not popular because they lead to non-standard DIMMs
• 4-symbol codes are more popular, but are used as 32+4 so that standard ECC DIMMs can be used (high activation energy and low rank-level parallelism) (16+4 would require a non-standard DIMM)

Virtualized ECC (Yoon and Erez, ASPLOS’10)
• Also builds a two-tier error protection scheme, but does the second tier in software
• The second-tier codes are stored in the regular physical address space (not specialized DRAM chips); software has flexibility in terms of the types of codes to use and the types of pages that are protected
• Reads are cheap; writes are expensive as usual; but the second-tier codes can now be cached, which greatly helps reduce the number of DRAM writes
• Requires a 144-bit datapath (increases overfetch)

LoT-ECC (Udipi et al., ISCA 2012)
• Use checksums to detect errors and parity codes to fix them
• Requires access of only 9 DRAM chips per read, but the storage overhead grows to 26%

Phase Change Memory
• Emerging NVM technology that can replace Flash and DRAM; there are other competing technologies too
• Much higher density; much better scalability; can do multi-level cells
• When materials (GST) are heated (with electrical pulses) and then cooled, they form either crystalline or amorphous materials depending on the intensity and duration of the pulses; crystalline materials have low resistance (1 state) and amorphous materials have high resistance (0 state)
• Non-volatile, fast reads (~50 ns), slow and energy-hungry writes; limited lifetime (~10^8 writes per cell), no leakage

PCM as a Main Memory (Lee et al., ISCA 2009)
• Two main innovations to overcome PCM's drawbacks (slow, energy-hungry writes and limited lifetime):
  • decoupled row buffers and non-destructive PCM reads
  • multiple narrow row buffers (row buffer cache)

Optimizations for Writes (Energy, Lifetime)
• Read a line before writing and only write the modified bits (Zhou et al., ISCA’09)
• Write either the line or its inverted version, whichever causes fewer bit-flips (Cho and Lee, MICRO’09)
• Only write dirty lines in a PCM page (when a page is evicted from a DRAM cache) (Lee et al., Qureshi et al., ISCA’09)
• When a page is brought from disk, place it only in the DRAM cache and place it in PCM upon eviction (Qureshi et al., ISCA’09)
• Wear-leveling: rotate every new page, shift a row periodically, swap segments (Zhou et al., Qureshi et al., ISCA’09)
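The first two optimizations combine naturally: read the old contents, then store whichever of the new word or its inverse changes fewer cells. The sketch below assumes a 32-bit word with one extra flag bit recording whether the stored word is inverted; the word width and the function names are illustrative choices, not taken from the cited papers.

```python
def bit_flips(old: int, new: int) -> int:
    """Number of cells whose value must change to turn `old` into `new`."""
    return bin(old ^ new).count("1")


def flip_n_write(old_data: int, old_flag: int, new_data: int, width: int = 32):
    """Choose between storing new_data or its inverse.

    Returns (stored_word, flag, cells_written), where flag == 1 means the
    stored word is the bitwise inverse of the logical data.
    """
    mask = (1 << width) - 1
    plain_cost = bit_flips(old_data, new_data) + (old_flag != 0)
    inverted = ~new_data & mask
    inverted_cost = bit_flips(old_data, inverted) + (old_flag != 1)
    if inverted_cost < plain_cost:
        return inverted, 1, inverted_cost
    return new_data, 0, plain_cost


def read_word(stored: int, flag: int, width: int = 32) -> int:
    """Undo the inversion, if any, to recover the logical data."""
    mask = (1 << width) - 1
    return (~stored & mask) if flag else stored
```

For example, overwriting all zeros with 0xFFFFFFFE would flip 31 cells if written directly, but only 2 cells (one data bit plus the flag) if the inverse is stored. The two costs always sum to width + 1, so no write changes more than about half the cells.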

Hard Error Tolerance in PCM
• PCM cells will eventually fail; it is important that capacity degrades gradually when this happens
• Pairing: among the pool of faulty pages, pair two pages that have faults in different locations; replicate data across the two pages (Ipek et al., ASPLOS’10)
• Errors are detected with parity bits; replica reads are issued if the initial read is faulty
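A minimal sketch of pairing, assuming each faulty page is described by the set of block offsets that no longer work. The greedy first-fit search and the names `pair_pages` and `read_block` are illustrative only; the cited paper's actual matching policy is not described on the slide, and the fault-set lookup below stands in for the parity check that detects a bad read.

```python
def pair_pages(faulty: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Greedily pair pages whose fault locations do not overlap.

    `faulty` maps a page number to the set of failed offsets in that page.
    Pages that find no compatible partner are left out of the result.
    """
    unpaired = sorted(faulty)
    pairs = []
    while unpaired:
        page = unpaired.pop(0)
        for other in unpaired:
            if faulty[page].isdisjoint(faulty[other]):
                pairs.append((page, other))
                unpaired.remove(other)
                break  # stop iterating right after mutating the list
    return pairs


def read_block(offset: int, primary_faults: set[int],
               primary_data: list[int], replica_data: list[int]) -> int:
    """Read from the primary page; fall back to the replica on a fault."""
    if offset in primary_faults:  # stands in for a failed parity check
        return replica_data[offset]
    return primary_data[offset]
```

Because the two pages of a pair never fail at the same offset, every offset has at least one good copy, so two partially broken pages together behave like one healthy page.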

ECP (Schechter et al., ISCA’10)
• Instead of using ECC to handle a few transient faults in DRAM, use error-correcting pointers to handle hard errors in specific locations
• For a 512-bit line with 1 failed bit, maintain a 9-bit field to track the failed location and another bit to store the value in that location
• Can store multiple such pointers and can recover from faults in the pointers too
• ECC has similar storage overhead and can handle soft errors; but ECC has high entropy and can hasten wearout
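The pointer mechanism can be modeled as below. This is a simplified sketch: failed cells are simulated with a `stuck` map, failures are discovered by a verify step after each write, the limit of six pointers per line is an assumed parameter, and recovery from faults inside the pointers themselves is not modeled. The class name `EcpLine` is invented for this example.

```python
LINE_BITS = 512
POINTER_BITS = 9  # 2**9 == 512, enough to name any bit in the line


class EcpLine:
    """One 512-bit PCM line protected by error-correcting pointers."""

    def __init__(self, stuck: dict[int, int], max_pointers: int = 6):
        self.cells = [0] * LINE_BITS
        self.stuck = stuck        # simulated hardware faults: position -> value
        self.pointers = {}        # failed position -> replacement bit
        self.max_pointers = max_pointers

    def write(self, bits: list[int]) -> None:
        # Program the cells; a stuck cell keeps its stuck value.
        for i, b in enumerate(bits):
            self.cells[i] = self.stuck.get(i, b)
        # Verify: any mismatch reveals a newly failed cell needing a pointer.
        for i, b in enumerate(bits):
            if self.cells[i] != b and i not in self.pointers:
                if len(self.pointers) == self.max_pointers:
                    raise RuntimeError("line has exhausted its pointers")
                self.pointers[i] = b
        # Every existing pointer's replacement bit takes the new data value.
        for i in list(self.pointers):
            self.pointers[i] = bits[i]

    def read(self) -> list[int]:
        out = list(self.cells)
        for i, b in self.pointers.items():
            out[i] = b  # the replacement bit overrides the failed cell
        return out
```

Each pointer costs POINTER_BITS + 1 = 10 bits, matching the slide's 9-bit location field plus one value bit.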

SAFER (Seong et al., MICRO 2010)
• Most PCM hard errors are stuck-at faults (stuck at 0 or stuck at 1)
• Either write the word or its flipped version so that the failed bit is made to store the stuck-at value
• For multi-bit errors, the line can be partitioned such that each partition has a single error
• Errors are detected by verifying a write; recently failed bit locations are cached so multiple writes can be avoided
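A rough sketch of the partition-and-flip idea follows. It assumes groups are formed by selecting a few bits of each cell's position within the line, and it finds a suitable selection by brute-force search; this search, the limit of five selected bits, and all function names are illustrative simplifications rather than the paper's hardware algorithm.

```python
from itertools import combinations


def group_of(pos: int, fields: tuple[int, ...]) -> tuple[int, ...]:
    """Group id of a cell: the selected bits of its position."""
    return tuple((pos >> f) & 1 for f in fields)


def find_partition(fault_positions: list[int], addr_bits: int = 9,
                   max_fields: int = 5):
    """Pick position bits that put every fault in a different group."""
    for n in range(max_fields + 1):
        for fields in combinations(range(addr_bits), n):
            groups = [group_of(p, fields) for p in fault_positions]
            if len(set(groups)) == len(groups):
                return fields
    return None  # no partition found within the assumed limit


def safer_write(data: list[int], stuck: dict[int, int],
                fields: tuple[int, ...]):
    """Return (cell values, per-group flip flags) honoring stuck cells."""
    flips = {}
    for pos, value in stuck.items():
        # Invert the group exactly when the data disagrees with the stuck value.
        flips[group_of(pos, fields)] = data[pos] != value
    cells = [b ^ flips.get(group_of(i, fields), False)
             for i, b in enumerate(data)]
    return cells, flips


def safer_read(cells: list[int], flips: dict, fields: tuple[int, ...]):
    """Undo the per-group inversion to recover the logical data."""
    return [c ^ flips.get(group_of(i, fields), False)
            for i, c in enumerate(cells)]
```

Since each group contains at most one stuck cell, one flip flag per group is always enough to make the stored pattern agree with every stuck value.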

FREE-p (Yoon et al., HPCA 2011)
• When a PCM block (64B) is unusable because the number of hard errors has exceeded the ECC capability, it is remapped to another address; the pointer to this address is stored in the failed block; need another bit per block
• The pointer can be replicated many times in the failed block to tolerate the multiple errors in the failed block
• Requires two accesses when handling failed blocks; this overhead can be reduced by caching the pointer at the memory controller
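Replicating the remap pointer inside the failed block can be sketched with a per-bit majority vote. The 32-bit pointer width and the seven copies are assumptions chosen for illustration (seven 32-bit copies fit in a 512-bit block), and `embed_pointer` and `recover_pointer` are hypothetical names; the slide only says the pointer is replicated many times.

```python
def embed_pointer(pointer: int, stuck: dict[int, int],
                  ptr_bits: int = 32, copies: int = 7) -> list[int]:
    """Write `copies` replicas of the pointer into a block with stuck cells.

    `stuck` maps a cell position in the failed block to its stuck value.
    """
    cells = []
    for c in range(copies):
        for i in range(ptr_bits):
            pos = c * ptr_bits + i
            bit = (pointer >> i) & 1
            cells.append(stuck.get(pos, bit))  # stuck cells ignore the write
    return cells


def recover_pointer(cells: list[int], ptr_bits: int = 32,
                    copies: int = 7) -> int:
    """Majority-vote each pointer bit across the replicas."""
    pointer = 0
    for i in range(ptr_bits):
        votes = sum(cells[c * ptr_bits + i] for c in range(copies))
        if votes * 2 > copies:
            pointer |= 1 << i
    return pointer
```

With seven copies, each pointer bit survives up to three stuck cells among its replicas, which is how the pointer stays readable in a block that is too damaged to hold ordinary data.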
