From Hyper Converged Infrastructure To Hybrid Cloud Infrastructure - USENIX

From Hyper Converged Infrastructure to Hybrid Cloud Infrastructure
Karan Gupta, Principal Architect, Nutanix

Founders of Hyper Converged Infrastructure
Dheeraj Pandey
- 2009: VP of Engineering at Aster Data (Teradata), a multi-genre advanced analytics solution
- 2007: Managed the storage engine group for Oracle Database: Exadata
Mohit Aron
- 2009: Lead architect at Aster Data
- 2007: Lead developer of the Google File System (GFS)

2009-10: Changing technology landscape
AWS:
- Multiplexing of compute
- Dynamically migrate workloads
- VM high availability
- Operational simplicity
- Scale on demand
Hardware: 25,000 IOPS, 100us latency

Hyper Converged Infrastructure

Foundation technologies

Nutanix Controller VM Services
§ Medusa (Metadata layer)
  § Protocol
  § Performance/Scale enhancements
  § Failure profile
§ Stargate (Data Path)
  § Cache layer
  § NextGen data path
§ Hybrid Cloud

Virtual disk abstraction
[Diagram: VMs 0..n each expose vDisks 0..n, backed by a cluster block storage system spanning Servers 0..N, each with compute (CPUs) and storage (SSD, HDD).]

Metadata index for virtual disks
[Diagram: a metadata index maps each virtual disk block to a physical disk block across the cluster of servers.]

Medusa: A Consistent System under Partitions
- Use distributed hash tables (DHTs) to shard the metadata index across the cluster
- Use log-structured merge-trees (LSM) for durability
- Replicate shards and use Paxos for consistency
- Protocol (FRSM), Performance (TRIAD), Failures (IASO)
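The sharding scheme above can be sketched as follows. This is a minimal illustration, not Nutanix code: the shard count, replication factor, and function names are assumptions, and real DHTs typically use consistent hashing with virtual nodes rather than simple modulo placement.

```python
import hashlib

NUM_SHARDS = 16         # hypothetical shard count
REPLICATION_FACTOR = 3  # each shard's Paxos group size

def shard_for_key(key: str) -> int:
    # Hash the key to a shard, as in a distributed hash table (DHT).
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def replicas_for_shard(shard: int, nodes: list) -> list:
    # Place each shard on REPLICATION_FACTOR consecutive nodes; the
    # replicas of a shard then run Paxos among themselves for consistency.
    return [nodes[(shard + i) % len(nodes)] for i in range(REPLICATION_FACTOR)]

nodes = ["node0", "node1", "node2", "node3"]
s = shard_for_key("vdisk42:block7")
print(s, replicas_for_shard(s, nodes))
```

Each replica would additionally persist its shard in an LSM tree for durability, which is where TRIAD (below) applies.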

Failure Profiles of this Module
- Team starts to grow
- Performance and scale features: leader-only reads, compaction changes, memory management, DirectIO
- Still discovering Day 1 issues: DirectIO/Ext4, leader-only reads, Cassandra skip row

Protocol
Fine-Grained Replicated State Machines for a Cluster Storage System (fRSM), NSDI'20

Key idea: Fine-grained Replicated State Machine (fRSM)
Key:
- Key Id (partition identifier)
- Paxos instance number
- Epoch: generation Id
- Timestamp: advanced by 1 every time the value is updated
Paxos consensus state:
- Promised proposal number
- Accepted proposal number
- Chosen bit
No operation logs: next RSM state = function(current RSM state, operation)
- CAS/Read can support speculative execution
- Stable leader: matches the failure characteristics of the clusters
- The value is not required for Paxos consensus
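A sketch of the per-key state fRSM replicates, with field names taken from the slide (the class and the CAS transition function are illustrative, not the paper's code). The point is that each key carries its own consensus state and the next state is computed directly from the current state and the operation, with no operation log:

```python
from dataclasses import dataclass

@dataclass
class KeyState:
    epoch: int           # generation id
    timestamp: int       # advanced by 1 on every update
    value: object
    promised: int = 0    # promised proposal number
    accepted: int = 0    # accepted proposal number
    chosen: bool = False

def apply_cas(state: KeyState, old_val, new_val) -> KeyState:
    # next RSM state = function(current RSM state, operation):
    # a compare-and-swap succeeds only if the current value matches.
    if state.value != old_val:
        raise ValueError("CAS failed: stale old value")
    return KeyState(state.epoch, state.timestamp + 1, new_val)

s = KeyState(epoch=1, timestamp=5, value="A")
s2 = apply_cas(s, "A", "B")
print(s2.timestamp, s2.value)  # timestamp advanced by 1
```

In the real protocol the promised/accepted proposal numbers and the chosen bit are driven by Paxos rounds per key; only the state transition, not the value itself, needs consensus.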

Required APIs by metadata maps
- Compare-and-Swap (key, old val, new val)
- Create (key, val), Delete (key)
- Read (key): quorum reads, leader-only reads, mutating reads
- Scan (key range)
[1] Maurice Herlihy, Wait-Free Synchronization, ACM Transactions on Programming Languages and Systems, 1991

Delete handling under fRSM
[Animation sequence: client, leader, and followers 0..n]
1. The client sends a delete request; the owner/leader performs an ownership check.
2. The leader performs a CAS update (timestamp t+1), and the delete is acknowledged to the client.
3. A follower may fail to receive the delete; the leader retains a tombstone while the value space is reclaimed.
4. Periodic delete retries propagate the tombstone until every follower acknowledges it.
5. After tombstone and cell acks from all followers, the key itself is removed after 24 hours.
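The tombstone lifecycle above can be sketched as follows. The 24-hour grace period comes from the slides; the `Replica` class, its method names, and the single-node framing are invented for illustration (the real system replicates the tombstone via the CAS protocol first):

```python
TOMBSTONE_TTL = 24 * 3600  # seconds; grace period from the slides

class Replica:
    def __init__(self):
        self.store = {}  # key -> (marker, timestamp)

    def delete(self, key, now):
        # Replace the value with a tombstone: the value space is
        # reclaimed immediately, but the key survives as a marker
        # so lagging replicas can still learn about the delete.
        self.store[key] = ("TOMBSTONE", now)

    def gc(self, now):
        # Remove tombstoned keys once the grace period has elapsed.
        for key, (mark, ts) in list(self.store.items()):
            if mark == "TOMBSTONE" and now - ts >= TOMBSTONE_TTL:
                del self.store[key]

r = Replica()
r.store["k"] = ("VALUE", 0)
r.delete("k", now=100)
r.gc(now=100 + TOMBSTONE_TTL)
print("k" in r.store)  # False: key removed after the grace period
```

The grace period matters because a replica that missed the delete could otherwise resurrect the key; keeping the tombstone until every replica has acknowledged (plus a safety margin) prevents that.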

Ghost Writes: read-after-read inconsistency
[Timeline t1..t4: node X accepts (Px, E, T+1) at t1 while nodes Y and Z still hold (Px, E, T) until t4. A read served from X observes timestamp T+1; a subsequent read served from Y or Z observes the older T. The accepted-but-not-yet-chosen write appears and then seems to vanish: a read-after-read inconsistency.]

Mutating Reads: stronger than linearizability
[Timeline t1..t4: a read observes the accepted-but-not-chosen (Px, E, T+1) on node X and re-proposes it with a higher proposal number Py; by t4 all nodes hold the chosen (Py, E, T+1). A value returned by a read is therefore guaranteed to survive, preventing ghost writes.]
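A simplified model of a mutating read (not the paper's code; replica state is modeled as plain dicts and the quorum machinery is elided): if the read observes a value that was accepted somewhere but not yet chosen, it re-proposes that value and drives it to chosen state before returning, so what the read returns can never later disappear.

```python
def mutating_read(replicas, key):
    # Gather the per-key state from every replica.
    states = [r[key] for r in replicas]
    # The freshest state wins: highest (epoch, timestamp).
    latest = max(states, key=lambda s: (s["epoch"], s["timestamp"]))
    if not latest["chosen"]:
        # Re-propose the observed value so it becomes durable on all
        # replicas before the read returns it (Paxos rounds elided).
        for r in replicas:
            r[key] = dict(latest, chosen=True)
    return latest["value"]

replicas = [
    {"k": {"epoch": 1, "timestamp": 2, "value": "B", "chosen": False}},
    {"k": {"epoch": 1, "timestamp": 1, "value": "A", "chosen": True}},
    {"k": {"epoch": 1, "timestamp": 1, "value": "A", "chosen": True}},
]
print(mutating_read(replicas, "k"))  # "B", now chosen on all replicas
```

Without the re-propose step, the same read sequence could return "B" and then "A", which is exactly the ghost-write anomaly from the previous slide.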

Scale and Performance
TRIAD: Creating Synergies Between Memory, Disk and Log in Log-Structured Key-Value Stores (ATC'17)

TRIAD Goal
- Decrease background ops overhead to increase user throughput
- Reduce write amplification (WA)

Background I/O Overhead
Long and slow background ops slow down user ops.
[Chart: K operations/s for RocksDB vs RocksDB with no background I/O, under uniform 50r-50w and skewed 50r-50w workloads; up to a 3x throughput gap.]

TRIAD
Technique  | Workload          | Improves WA in
TRIAD-MEM  | Skewed workloads  | Flushing and compaction
TRIAD-LOG  | Uniform workloads | Flushing
Three techniques work together and are complementary.

TRIAD-MEM: Hot-cold key separation
Idea: keep hot keys in memory.
- Flush only the cold keys from the memory component (Cm) to the L0 SSTable on disk.
- Keep hot keys in the commit log (CL), so they remain durable without being flushed.
[Diagram: RAM holds Cm with entries K1..Kn; a frequently updated hot key K1 (versions V1^1..V1^n) stays in memory and in the commit log, while cold entries are flushed to L0.]

TRIAD-MEM Summary
✓ Good for skewed workloads.
✓ Reduces flushing WA: less data written from memory to disk.
✓ Reduces compaction WA: avoids repeatedly compacting hot keys.
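The hot-cold separation at flush time can be sketched as below. The threshold, the `Counter`-based bookkeeping, and the function name are assumptions for illustration; the paper's actual hot-key detection differs in detail:

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: updates per flush interval to count as hot

def flush_cold(memtable: dict, update_counts: Counter):
    # Cold keys are written to the L0 SSTable on disk; hot keys stay
    # in the memory component and remain durable via the commit log.
    cold = {k: v for k, v in memtable.items()
            if update_counts[k] < HOT_THRESHOLD}
    hot = {k: v for k, v in memtable.items()
           if update_counts[k] >= HOT_THRESHOLD}
    return cold, hot

mem = {"k1": "v13", "k2": "v2", "k3": "v3"}
counts = Counter({"k1": 5, "k2": 1, "k3": 1})
cold, hot = flush_cold(mem, counts)
print(sorted(cold), sorted(hot))  # ['k2', 'k3'] ['k1']
```

Under a skewed workload most updates hit the hot set, so most of the data volume never reaches L0 and is never re-compacted, which is where the WA savings come from.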

TRIAD-LOG
Improves flushing WA in uniform workloads.

Problem: Flushing with Uniform Workloads
[Diagram: the memory component Cm (K1..Kn) is flushed to an L0 SSTable whose entries already appear in the commit log.]
Insight: flushed data has already been written to the commit log.
Idea: use commit logs as SSTables, avoiding the background I/O due to flushing.

TRIAD-LOG
- Maintain a CL index in memory pointing to the most recent entry of each key in the commit log.
- On flush, write only the CL index and couple it with the current commit log, turning the log into a CL-SSTable.
- Keep the index in memory for further reads.
[Diagram: Cm entries (K1: 3, K2: 2, ..., Kn: n) point into the commit log; flushing emits only the index, not the data.]

TRIAD-LOG Summary
✓ Good for uniform workloads.
✓ Reuses the commit log as the L0 SSTable.
✓ No more flushing of the memory component to disk.
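A minimal sketch of the CL-index idea (structure and names assumed; the real implementation works with byte offsets in an on-disk log, not a Python list): a flush persists only the small key-to-offset index, since the data itself is already durable in the commit log.

```python
commit_log = []  # append-only list of (key, value) records
cl_index = {}    # key -> offset of its most recent entry in the log

def put(key, value):
    commit_log.append((key, value))
    cl_index[key] = len(commit_log) - 1

def flush():
    # Flushing writes only the index; coupled with the current commit
    # log it forms a CL-SSTable. No data is rewritten.
    return dict(cl_index)

def read_from_cl_sstable(index, key):
    # Reads follow the index pointer into the commit log.
    return commit_log[index[key]][1]

put("k1", "v1"); put("k2", "v2"); put("k1", "v1'")
sstable_index = flush()
print(read_from_cl_sstable(sstable_index, "k1"))  # "v1'"
```

The trade-off is that a CL-SSTable is not sorted like a regular SSTable, so the in-memory index must be kept (or rebuilt) to serve reads until compaction rewrites the data in sorted order.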

Production Workloads: Throughput
(higher is better)
[Chart: K ops/s for RocksDB vs TRIAD on Prod Wkld 1, Prod Wkld 2, uniform, and skewed workloads; TRIAD reaches up to 2x RocksDB and sustains stable throughput across workloads.]

Production Workloads: Write Amplification
(lower is better)
[Chart: write amplification for RocksDB vs TRIAD on Prod Wkld 1, Prod Wkld 2, uniform, and skewed workloads; TRIAD's WA is low and uniform, up to 4x lower than RocksDB.]

Fail-Slow Errors
IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services (ATC'19)

Fail-slow frequency
- Frequent: 232 incidents seen across 39,000 nodes over 7 months, almost 1 case per day
- Can take days to be fully resolved

Fail-slow problem space

IASO: Peer-based failure detection
- Detect a fail-slow node via scores from its peers
- Quarantine the faulty node
- Resolve the root cause
[Diagram: a score analyzer collects the set of scores each node's peers report, e.g. n1 {98, 97}, n2 {1, 1}, n3 {1, 1}; n1 is the outlier.]
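The peer-scoring idea can be sketched as below. The scoring scheme, the outlier factor, and the function name are assumptions for illustration, not IASO's actual statistics; the point is that a node whose peer-reported scores are far from the cluster norm is flagged:

```python
from statistics import median

def find_outliers(peer_scores: dict, factor: float = 10.0):
    # Average the scores each node's peers reported about it.
    avg = {n: sum(s) / len(s) for n, s in peer_scores.items()}
    # A node scoring far above the cluster-wide median is a
    # fail-slow suspect and a candidate for quarantine.
    med = median(avg.values())
    return [n for n, a in avg.items() if a > factor * max(med, 1.0)]

scores = {"n1": [98, 97], "n2": [1, 1], "n3": [1, 1]}
print(find_outliers(scores))  # ['n1']
```

Using peer observations rather than self-reporting is the key design choice: a fail-slow node often cannot detect its own degradation, but its peers see their requests to it slow down.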

IASO in Production
- Mitigates cluster outages in 10 minutes
- Catches fail-slow faults with 96.3% accuracy

Us and Publications from Industry and Academia, 2010s
Raft (ATC'14), RIFL (SOSP'15), WPaxos (2017), EPaxos (SOSP'13), Spanner (OSDI'12), Hermes (ASPLOS'20), Physalia (NSDI'20)

Stargate: Data IO Path

Lines of Code over time - Cache
- In-memory caching
- TTL-based eviction
- Scan resistance
- Touch pool adjustment
- Intelligent compressed cache
- O(1) unification of caches
- Accurate warmup of hot data
- Timestamp- and tag-based O(1) subset invalidations
- SSD spillover
- Priority-based caching / auto-disable
- New use cases (OSS)
- New types/watches

TCMalloc Issues
- TCMalloc is designed for performance over garbage collection: thread-local caches, unordered fast reuse from freelists and new pages.
- A single process with multiple, possibly independent, modules shares memory: one shared memory domain/arena for memory and CPU efficiency.
- Bursty writes and reads mean bursty allocations and deallocations in TCMalloc.
- Modules can expand and reduce memory usage over time in a common memory space.
Issues:
- GBs of fragmented memory in the central cache
- Performance impacts and less efficient CPU and memory usage
- Caches have to be pruned to regain memory

New Media Shifts the Bottleneck to Software
550,000 IO/s, 10us latency

Fast Drives: Random Writes vs. Sequential Writes
[Chart: bandwidth (MB/s, higher is better) of random 4K writes vs sequential writes on drives from 2010, 2013, 2016, and 2018; on new hardware random writes approach sequential writes, breaking the existing assumption.]

Accelerating the Data Path with Blockstore + SPDK
Current: in the Controller VM, Stargate issues system calls through the file system and the block subsystem (SCSI) to the SSD.
2HCY20: Stargate uses Blockstore, a user-space filesystem purpose-built for the extent store with efficient filesystem metadata management; built for NVMe but it also benefits SSDs and HDDs.
Future: Blockstore accesses devices through SPDK for NVMe media, bypassing system calls to fully utilize new media performance.

Accelerating Further for the Full Stack (AOS and AHV)
Current: system calls between storage and hypervisor.
Future:
- iSCSI over RDMA (iSER) between AHV (initiator) and Stargate (target)
- Zero-copy DMA operations
- Eliminates system calls
- Shortest data path from app to storage

Nutanix Hybrid Cloud
[Diagram: private cloud data center, hosted private cloud, traditional hosting, and SaaS apps connected across clouds; concerns include integration, data locality, cross-cloud security, license portability, and latency / direct connect.]

Hybrid Multicloud Architecture
1. Click to cloud with existing VPCs, subnets, and accounts
2. Govern and manage costs across all clouds
3. App mobility with programmable infrastructure and portability
[Diagram: the Nutanix Hybrid Multicloud Platform spans the Nutanix Private Cloud, AWS (VPC, EC2 bare metal) and Azure (VNET, dedicated hosts, SQL DB, ExpressRoute, Databricks).]

Active Research and Development Areas
§ Medusa
  § New media like Optane drives, and non-LSM databases (KVell, SOSP'19)
  § S3-based time-series databases
  § SmartNICs: background process offloading, even the protocol
§ Stargate (Data Path)
  § NextGen architecture to support GPUs
  § SmartNICs: disaggregated storage, SmartStorage
  § WAN-optimized storage and data mobility
  § Better memory management
§ Hybrid Cloud

2009-10: Changing technology landscape (recap)
AWS:
- Multiplexing of compute
- Dynamically migrate workloads
- VM high availability
- Operational simplicity
- Scale on demand
Hardware: 25,000 IOPS, 100us latency

2019-20: Changing technology landscape
NVM:
- Multiplexing of compute: VMs, containers, functions
- 10 GBps, 1us latency
- Cloud specializations: geos, functionality, features, and government regulations

Thank You
