15-440 Distributed Systems
Lecture 13 – RAID
(Thanks to Greg Ganger and Remzi Arpaci-Dusseau for slides)

Project Announcement: Grading
- Part A: 3/4 of total grade; Part B: 1/4 of total grade
- Actual grade formula: 3/4 * max(A1, 0.7*A2) + 1/4 * max(B_self, 0.9*B_with_staff_code)
  - A1 is the submission for P1a on or before the part A deadline
  - A2 is the version of P1a submitted *any time* before the part B deadline
  - B_self is part B running against your own implementation of LSP
  - B_with_staff_code is part B running against the reference LSP
  (a small worked example follows below)

Replacement Rates
- [Table: component replacement percentages for several deployed systems (HPC1, COM1, ...); hard drives, memory, and power supplies account for the largest shares, with the top entries around 30% each.]

Outline
- Using multiple disks
  - Why have multiple disks? problem and approaches
  - RAID levels and performance
  - Estimating availability

Motivation: Why use multiple disks?

Just a bunch of disks (JBOD)
- Capacity
  - More disks allow us to store more data
- Performance
  - Access multiple disks in parallel
  - Each disk can be working on an independent read or write
  - Overlap seek and rotational positioning time for all
- Reliability
  - Recover from disk (or single-sector) failures
  - Will need to store multiple copies of data to recover
- So, what is the simplest arrangement?
  - [Figure: four independent disks holding blocks A0–A3, B0–B3, C0–C3, D0–D3]
- Yes, it's a goofy name; industry really does sell "JBOD enclosures"
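For concreteness, here is a minimal Python sketch of the grading formula above; the function and argument names are illustrative, not part of the course handout.

```python
def project_grade(a1, a2, b_self, b_staff):
    """Hypothetical helper: combine P1 part A and part B scores per the
    announced formula. a1/a2 are part A scores submitted before the part A /
    part B deadlines; b_self/b_staff are part B scores run against your own
    LSP and the reference LSP, respectively."""
    part_a = max(a1, 0.7 * a2)           # a later part A resubmission counts 70%
    part_b = max(b_self, 0.9 * b_staff)  # runs against the staff LSP count 90%
    return 0.75 * part_a + 0.25 * part_b

# Example: late A resubmission scores 100, B only passes against the staff LSP
print(project_grade(a1=60, a2=100, b_self=0, b_staff=100))  # 0.75*70 + 0.25*90 = 75.0
```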
Disk Subsystem Load Balancing
- I/O requests are almost never evenly distributed
  - Some data is requested more than other data
  - Depends on the apps, usage, time, ...
- What is the right data-to-disk assignment policy?
- Common approach: fixed data placement
  - Your data is on disk X, period!
  - For good reasons too: you bought it, or you're paying more
- Fancy: dynamic data placement
  - If some of your files are accessed a lot, the admin (or even the system) may separate the "hot" files across multiple disks
  - In this scenario, entire file systems (or even files) are manually moved by the system admin to specific disks
- Alternative: disk striping
  - Stripe all of the data across all of the disks

Disk Striping
- Interleave data across multiple disks
  - Large-file streaming can enjoy parallel transfers
  - High-throughput requests can enjoy thorough load balancing
    - If blocks of hot files are equally likely on all disks (really?)
- [Figure: File Foo broken into stripe units (blocks) laid out round-robin across the disks]

Disk striping details
- How disk striping works
  - Break up the total space into fixed-size stripe units
  - Distribute the stripe units among the disks in round-robin order
  - Compute the location of block #B as follows (see the sketch below):
    - disk# = B % N   (% = modulo, N = # of disks)
    - LBN# = B / N    (integer division; the LBN on that disk)

Now, What If A Disk Fails?
- In a JBOD (independent disk) system
  - One or more file systems lost
- In a striped system
  - A part of each file system lost
- Backups can help, but backing up takes time and effort
  - Backup doesn't help recover data lost during that day
- Any data loss is a big deal to a bank or stock exchange

Tolerating and masking disk failures
- If a disk fails, its data is gone
  - May be recoverable, but may not be
- To keep operating in the face of failure
  - Must have some kind of data redundancy
- Common forms of data redundancy
  - Replication
  - Erasure-correcting codes
  - Error-correcting codes

Redundancy via replicas
- Two (or more) copies
  - Mirroring, shadowing, duplexing, etc.
- Write both, read either
- [Figure: blocks 0–3, each stored on two disks]
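The block-location arithmetic in "Disk striping details" above is the part people most often get wrong, so here is a minimal Python sketch of it; the function name and the 4-disk example are assumptions for illustration only.

```python
def stripe_location(block, num_disks):
    """Map a logical block number to (disk#, LBN on that disk) for simple
    block-interleaved striping, as described on the slide."""
    disk = block % num_disks        # round-robin choice of disk
    lbn = block // num_disks        # position of the block within that disk
    return disk, lbn

# Example with N = 4 disks: blocks 0..7 land on disks 0,1,2,3,0,1,2,3
for b in range(8):
    print(b, stripe_location(b, 4))
```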
Mirroring
- Mirroring can be done in either software or hardware
- Software solutions are available in most OSes
  - Windows 2000, Linux, Solaris
- Hardware solutions
  - Could be done in the Host Bus Adaptor(s)
  - Could be done in the disk array controller

Mirroring & Striping
- Mirror 2 virtual drives, where each virtual drive is really a set of striped drives
  - Provides the reliability of mirroring
  - Provides striping performance (with write update costs)

Hardware vs. Software RAID
- Hardware RAID
  - Storage box you attach to the computer
  - Same interface as a single disk, but internally much more
    - Multiple disks
    - More complex controller
    - NVRAM (holding parity blocks)
- Software RAID
  - OS (device driver layer) treats multiple disks like a single disk
  - Software does all the extra work
- Interface for both
  - Linear array of bytes, just like a single disk (but larger)

Lower Cost Data Redundancy
- Single-failure-protecting codes
  - A general single-error-correcting code is overkill
    - A general code must find the error and fix it
  - Disk failures are self-identifying (a.k.a. erasures)
    - Don't have to find the error
- Fact: an N-error-detecting code is also N-erasure-correcting
  - Error-detecting codes can't find an error, they just know it's there
  - But if you independently know where the error is, that allows repair
- Parity is a single-disk-failure-correcting code
  - Recall that parity is computed via XOR
  - It's like the low bit of the sum (a short sketch follows below)

Updating and using the parity
- Simplest approach: parity disk
  - One extra disk
  - All writes update the parity disk
    - Potential bottleneck
- [Figure: a 4+1 array showing four cases — fault-free read, fault-free write (four steps: read old data and old parity, write new data and new parity), degraded read, degraded write]
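To make "parity is computed via XOR" concrete, here is a minimal Python sketch, assuming byte-string blocks and a single parity block per stripe (the helper names are not from the slides). It shows both computing parity and recovering an erased block:

```python
from functools import reduce

def parity(blocks):
    """XOR all blocks together byte-by-byte; the result is the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Recover one erased block: XOR the parity with every surviving block.
    Works because x ^ x = 0, so every known block cancels out of the parity."""
    return parity(surviving_blocks + [parity_block])

# Data bits 0 1 0  ->  parity 1 (so the data plus parity has even parity)
assert parity([b"\x00", b"\x01", b"\x00"]) == b"\x01"

# Erasure recovery: x 0 0 with parity 1  ->  the failed value x must be 1
assert reconstruct([b"\x00", b"\x00"], b"\x01") == b"\x01"
```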
The parity disk bottleneck
- Reads go only to the data disks
  - But, hopefully, load balanced across the disks
- All writes go to the parity disk
  - And, worse, usually result in a read-modify-write sequence
- So, the parity disk can easily be a bottleneck

Solution: striping the parity
- Removes the parity disk bottleneck
- [Figure: the parity block's location rotates from stripe to stripe instead of living on one dedicated disk]

Outline
- Using multiple disks
  - Why have multiple disks? problem and approaches
  - RAID levels and performance
  - Estimating availability

RAID Taxonomy
- Redundant Array of Inexpensive (Independent) Disks
  - Constructed by UC-Berkeley researchers in the late 80s (Garth Gibson)
- RAID 0 – Coarse-grained striping with no redundancy
- RAID 1 – Mirroring of independent disks
- RAID 2 – Fine-grained data striping plus Hamming code disks
  - Uses Hamming codes to detect and correct multiple errors
  - Originally implemented when drives didn't always detect errors
  - Not used in real systems
- RAID 3 – Fine-grained data striping plus parity disk
- RAID 4 – Coarse-grained data striping plus parity disk
- RAID 5 – Coarse-grained data striping plus striped parity
- RAID 6 – Coarse-grained data striping plus 2 striped codes

RAID-0: Striping
- Stripe blocks across disks in a chunk size
- How to pick a reasonable chunk size?
- How to calculate where chunk # lives?
  - Disk? Offset within disk? (a sketch follows below)
- [Figure: blocks 0–15 laid out round-robin across 4 disks: disk 0 holds 0,4,8,12; disk 1 holds 1,5,9,13; disk 2 holds 2,6,10,14; disk 3 holds 3,7,11,15]
- Evaluate for D disks
  - Capacity: how much space is wasted?
  - Performance: how much faster than 1 disk?
  - Reliability: more or less reliable than 1 disk?
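Here is a minimal Python sketch of the "where does chunk # live?" calculation for RAID-0 with a configurable chunk size; the chunked layout generalizes the one-block-per-chunk figure above, and the helper name is just illustrative.

```python
def raid0_address(logical_block, num_disks, chunk_size_blocks):
    """Map a logical block to (disk, block offset within that disk) for
    RAID-0 striping with chunks of `chunk_size_blocks` consecutive blocks."""
    chunk = logical_block // chunk_size_blocks          # which chunk the block is in
    offset_in_chunk = logical_block % chunk_size_blocks
    disk = chunk % num_disks                            # chunks go round-robin over disks
    disk_chunk = chunk // num_disks                     # chunks before it on that disk
    return disk, disk_chunk * chunk_size_blocks + offset_in_chunk

# With 4 disks and 1-block chunks this reproduces the slide's figure:
# blocks 0,4,8,12 -> disk 0 at offsets 0,1,2,3; blocks 1,5 -> disk 1; ...
for b in [0, 4, 8, 12, 1, 5]:
    print(b, raid0_address(b, num_disks=4, chunk_size_blocks=1))
```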
RAID-1: Mirroring
- Motivation: handle disk failures
- Put a copy (mirror or replica) of each chunk on another disk
- [Figure: 4 disks; disks 0 and 1 each hold blocks 0,2,4,6; disks 2 and 3 each hold blocks 1,3,5,7 — every block mirrored on two disks]
- Evaluate: capacity? reliability? performance? (see the comparison table below)

RAID-4: Parity
- Motivation: improve capacity
- Idea: allocate a parity block to encode info about the other blocks
  - Parity checks all other blocks in the stripe across the other disks
  - Parity block = XOR over the others (gives even parity)
  - Example: 0 1 0 -> parity value?
- How do you recover from a failed disk?
  - Example: x 0 0 and parity of 1 -> what is the failed value?
- [Figure: dedicated parity disk — stripe 0: blocks 0,1,2 + P0; stripe 1: 3,4,5 + P1; stripe 2: 6,7,8 + P2; stripe 3: 9,10,11 + P3]
- Capacity? Reliability? Performance? (the parity disk is the bottleneck)
- Reads: served by the data disks
- Writes: how to update the parity block? Two different approaches
  - Small number of disks (or a large, full-stripe write): recompute parity from all the data blocks
  - Large number of disks (or a small write): read-modify-write the old data and old parity

RAID-5: Rotated Parity
- Rotate the location of parity across all disks
- [Figure: same stripes as RAID-4, but P0–P3 placed on a different disk in each stripe]
- Reads: spread across all disks
- Writes: still require 4 I/Os per write, but not always to the same parity disk
  (a sketch of the small-write update follows below)

Comparison (Table 38.7: RAID Capacity, Reliability, and Performance)

                      RAID-0   RAID-1                        RAID-4     RAID-5
  Capacity            N        N/2                           N-1        N-1
  Reliability         0        1 (for sure), N/2 (if lucky)  1          1
  Throughput
    Sequential read   N*S      (N/2)*S                       (N-1)*S    (N-1)*S
    Sequential write  N*S      (N/2)*S                       (N-1)*S    (N-1)*S
    Random read       N*R      N*R                           (N-1)*R    N*R
    Random write      N*R      (N/2)*R                       (1/2)*R    (N/4)*R
  Latency
    Read              T        T                             T          T
    Write             T        T                             2T         2T

  Note (from the textbook discussion of Table 38.7): because RAID-5 is basically identical to RAID-4 except in the few cases where it is better, it has almost completely replaced RAID-4 in the marketplace. The only place where it has not is in systems that know they will never perform anything other than a large write, thus avoiding the small-write problem altogether [HLM94]; in those cases, RAID-4 is sometimes used as it is slightly simpler to build. The comparison omits a number of details to simplify the analysis (for example, when writing in a mirrored system, the average ...).

Advanced Issues
- What happens if there is more than one fault?
  - Example: one disk fails plus a latent sector error on another
  - RAID-5 cannot handle two faults
  - Solution: RAID-6 (e.g., RDP) — add multiple parity blocks
- Why is NVRAM useful?
  - Example: what if we update block 2 but don't update P0 before a power failure (or crash), and then disk 1 fails?
  - NVRAM solution: use it to store the blocks updated in the same stripe
    - If power fails, can replay all writes from NVRAM
  - Software RAID solution: perform a parity scrub over the entire array
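The "4 I/Os per small write" figure comes from patching the parity in place rather than recomputing it from the whole stripe. Below is a minimal Python sketch of that update rule, plus one common way (an assumption; the slides do not pin down the exact layout) to rotate the parity disk per stripe for RAID-5.

```python
def updated_parity(old_parity, old_data, new_data):
    """Small-write parity update: new parity = old parity XOR old data XOR new data.
    This is why a small write costs 4 I/Os: read old data + old parity,
    then write new data + new parity."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

def raid5_parity_disk(stripe, num_disks):
    """One possible RAID-5 rotation (left-symmetric style): the parity block
    moves one disk to the left on each successive stripe."""
    return (num_disks - 1 - stripe) % num_disks

# Flipping one data block from 0x00 to 0xFF flips the same bits in the parity.
assert updated_parity(b"\x0f", b"\x00", b"\xff") == b"\xf0"
# With 4 disks, stripes 0,1,2,3 put parity on disks 3,2,1,0, then the pattern repeats.
assert [raid5_parity_disk(s, 4) for s in range(5)] == [3, 2, 1, 0, 3]
```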
RAID 6 (P+Q Redundancy)
- Protects against multiple failures using Reed-Solomon codes
- Uses 2 "parity" disks
  - P is parity
  - Q is a second code
- It's two equations with two unknowns; just make "bigger bits"
  - Group bits into "nibbles" and apply different coefficients in each equation
    (two independent equations in two unknowns)
- Similar to parity striping
  - De-clusters both sets of parity across all drives
- For small writes, requires 6 I/Os
  - Read old data, old parity1, old parity2
  - Write new data, new parity1, new parity2

The Disk Array Matrix [Gray90]

  Redundancy \ Striping   Independent          Fine Striping   Coarse Striping
  None                    JBOD                 —               RAID 0
  Replication             Mirroring (RAID 1)   —               RAID 0+1
  Parity Disk             —                    RAID 3          RAID 4
  Striped Parity          —                    —               RAID 5

Outline
- Using multiple disks
  - Why have multiple disks? problem and approaches
  - RAID levels and performance
  - Estimating availability

Sidebar: Availability metric
- Fraction of time the system is able to handle requests
- Computed from MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair):

    Availability = MTBF / (MTBF + MTTR)

- [Figure: timeline alternating time-between-failures (TBF) and time-to-repair (TTR) intervals; the system is available during the TBF periods]

How often are failures? Back to Mean Time To Data Loss (MTTDL)
- MTBF (Mean Time Between Failures)
  - MTBF_disk = 1,200,000 hours (~136 years, < 1% per year)
  - Pretty darned good, if you believe the number
- MTBF of a multi-disk system = mean time to first disk failure
  - which is MTBF_disk / (number of disks)
  - For a striped array of 200 drives: MTBF_array = 136 years / 200 drives = 0.65 years
  (a short sketch of this arithmetic follows below)
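A minimal Python sketch of the availability and time-to-first-failure arithmetic above; the MTBF constant is the one quoted on the slides, the 24-hour MTTR is an assumed example value, and the helper names are my own.

```python
HOURS_PER_YEAR = 8766  # ~365.25 days

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system can serve requests."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def mtbf_first_failure(mtbf_disk_hours, num_disks):
    """Mean time until *some* disk in the array fails (no redundancy)."""
    return mtbf_disk_hours / num_disks

mtbf_disk = 1_200_000  # hours, ~136 years
print(availability(mtbf_disk, mttr_hours=24))               # ~0.99998
print(mtbf_first_failure(mtbf_disk, 200) / HOURS_PER_YEAR)  # ~0.68 years, close to the slide's ~0.65
```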
Reliability without rebuild
- 200 data drives with MTBF_drive
  - MTTDL_array = MTBF_drive / 200
- Add 200 drives and do mirroring
  - MTBF_pair = (MTBF_drive / 2) + MTBF_drive = 1.5 * MTBF_drive
  - MTTDL_array = MTBF_pair / 200 = MTBF_drive / 133
- Add 50 drives, each with parity across 4 data disks
  - MTBF_set = (MTBF_drive / 5) + (MTBF_drive / 4) = 0.45 * MTBF_drive
  - MTTDL_array = MTBF_set / 50 = MTBF_drive / 111

Rebuild: restoring redundancy after failure
- After a drive failure
  - Data is still available for access
  - But a second failure is BAD
- So, should reconstruct the data onto a new drive
  - On-line spares are common features of high-end disk arrays
    - Reduce the time to start the rebuild
  - Must balance rebuild rate with foreground performance impact
    - A performance vs. reliability trade-off
- How data is reconstructed
  - Mirroring: just read the good copy
  - Parity: read all remaining drives (including parity) and compute

Reliability consequences of adding rebuild
- No data loss, if the rebuild is fast enough
  - That is, if the first failure is fixed before the second one happens
- The new math is:
  - MTTDL_array = MTBF_firstdrive * (1 / probability of a 2nd failure before repair)
  - ... where the probability of a 2nd failure before repair is MTTR_drive / MTBF_seconddrive
- For mirroring
  - MTBF_pair = (MTBF_drive / 2) * (MTBF_drive / MTTR_drive)
- For 5-disk parity-protected arrays
  - MTBF_set = (MTBF_drive / 5) * ((MTBF_drive / 4) / MTTR_drive)
  (a sketch of these formulas follows below)

Three modes of operation
- Normal mode
  - Everything working; maximum efficiency
- Degraded mode
  - Some disk unavailable
  - Must use degraded-mode operations
- Rebuild mode
  - Reconstructing the lost disk's contents onto a spare
  - Degraded-mode operations plus competition with the rebuild

Mechanics of rebuild
- Background process
  - Use degraded-mode reads to reconstruct the data
  - Then, write it to the replacement disk
- Implementation issues
  - Interference with foreground activity, and controlling the rate
    - Rebuild is important for reliability
    - Foreground activity is important for performance
  - Using the rebuilt disk
    - For the rebuilt part, reads can use the replacement disk
    - Must balance the performance benefit with rebuild interference

Conclusions
- RAID turns multiple disks into a larger, faster, more reliable disk
- RAID-0: Striping
  - Good when performance and capacity really matter, but reliability doesn't
- RAID-1: Mirroring
  - Good when reliability and write performance matter, but capacity (cost) doesn't
- RAID-5: Rotating parity
  - Good when capacity and cost matter, or the workload is read-mostly
  - Good compromise choice
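The MTTDL formulas above are easy to mix up, so here is a minimal Python sketch that reproduces the slides' back-of-the-envelope numbers; these are the simple approximations used in the lecture (not a full reliability model), and the 24-hour MTTR is an assumed example value.

```python
def mttdl_striped(mtbf_drive, num_drives):
    """No redundancy: data is lost as soon as any drive fails."""
    return mtbf_drive / num_drives

def mttdl_mirrored_no_rebuild(mtbf_drive, num_pairs):
    """Each pair loses data when both copies have failed (no repair):
    first failure after MTBF/2, then the survivor after another MTBF."""
    mtbf_pair = mtbf_drive / 2 + mtbf_drive        # = 1.5 * MTBF_drive
    return mtbf_pair / num_pairs

def mttdl_mirrored_with_rebuild(mtbf_drive, mttr_drive, num_pairs):
    """With rebuild: data is lost only if the second copy dies during repair."""
    mtbf_pair = (mtbf_drive / 2) * (mtbf_drive / mttr_drive)
    return mtbf_pair / num_pairs

mtbf = 1_200_000  # hours (~136 years)
print(mttdl_striped(mtbf, 200))                    # 6,000 h (~0.68 years)
print(mttdl_mirrored_no_rebuild(mtbf, 200))        # 9,000 h (~ MTBF_drive / 133)
print(mttdl_mirrored_with_rebuild(mtbf, 24, 200))  # 1.5e8 h (~17,000 years)
```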