Design Patterns For Distributed Non-Relational Databases


Design Patterns for Distributed Non-Relational Databases
aka Just Enough Distributed Systems To Be Dangerous (in 40 minutes)
Todd Lipcon (@tlipcon)
Cloudera
June 11, 2009

- Introduction
- Common Underlying Assumptions
- Design Patterns
  - Consistent Hashing
  - Consistency Models
  - Data Models
  - Storage Layouts
  - Log-Structured Merge Trees
  - Cluster Management
    - Omniscient Master
    - Gossip
- Questions to Ask Presenters

Why We’re All Here
- Scaling up doesn’t work.
- Scaling out with traditional RDBMSs isn’t so hot either:
  - Sharding scales, but you lose all the features that make RDBMSs useful!
  - Sharding is operationally obnoxious.
- If we don’t need relational features, we want a distributed NRDBMS.

Closed-source NRDBMSs
“The Inspiration”
- Google BigTable
  - Applications: webtable, Reader, Maps, Blogger, etc.
- Amazon Dynamo
  - Applications: Shopping Cart, ?
- Yahoo! PNUTS
  - Applications: ?

Data Interfaces
“This is the NOSQL meetup, right?”
- Every row has a key (PK)
- Key/value get/put
- multiget/multiput
- Range scan? With predicate pushdown?
- MapReduce?
- SQL?

Underlying Assumptions

Assumptions - Data Size
- The data does not fit on one node.
- The data may not fit on one rack.
- SANs are too expensive.
Conclusion: The system must partition its data across many nodes.

Assumptions - Reliability
- The system must be highly available to serve web (and other) applications.
- Since the system runs on many nodes, nodes will crash during normal operation.
- Data must be safe even though disks and nodes will fail.
Conclusion: The system must replicate each row to multiple nodes and remain available despite the failure of some nodes and disks.

Assumptions - Performance
“...and the price thereof”
- All systems we’re talking about today are meant for real-time use.
- 95th- or 99th-percentile latency is more important than average latency.
- Commodity hardware and slow disks.
Conclusion: The system needs to perform well on commodity hardware, and maintain low latency even during recovery operations.

Design Patterns

Partitioning Schemes
“Where does a key live?”
- Given a key, we need to determine which node(s) it belongs on.
- If that node is down, we need to find another copy elsewhere.
- Difficulties:
  - Unbounded number of keys.
  - Dynamic cluster membership.
  - Node failures.

Consistent Hashing
Maintaining hashing in a dynamic cluster

Consistent Hashing
Key Placement
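
To make the key-placement picture concrete, here is a minimal sketch of a consistent hash ring in Python. The node names, the choice of MD5, and the one-token-per-node layout are illustrative assumptions, not any particular system’s design; real systems typically assign many virtual tokens per node to smooth out load.

import bisect
import hashlib

def ring_position(key):
    # Hash a key onto the ring as a large integer (illustrative;
    # real systems pick their own hash function and token space).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # One token per node for simplicity; production systems use
        # many virtual tokens per physical node to balance load.
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def nodes_for_key(self, key, replicas=3):
        # Walk clockwise from the key's ring position, collecting the
        # next `replicas` distinct nodes as the replica set.
        idx = bisect.bisect(self.tokens, (ring_position(key), ""))
        owners = []
        while len(owners) < min(replicas, len(self.tokens)):
            _, node = self.tokens[idx % len(self.tokens)]
            if node not in owners:
                owners.append(node)
            idx += 1
        return owners

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for_key("row-42"))   # e.g. ['node-c', 'node-a', 'node-b']

The useful property: when a node joins or leaves, only the keys between it and its ring neighbors move, rather than nearly every key as with naive hash-mod-N placement.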

Consistency Models
- A consistency model determines rules for visibility and apparent order of updates.
- Example:
  - Row X is replicated on nodes M and N.
  - Client A writes row X to node N.
  - Some period of time t elapses.
  - Client B reads row X from node M.
  - Does client B see the write from client A?
- Consistency is a continuum with tradeoffs.

Strict Consistency
- All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to.
- Implies either:
  - All operations for a given row go to the same node (replication for availability), or
  - Nodes employ some kind of distributed transaction protocol (e.g. Two-Phase Commit or Paxos).
- CAP Theorem: strict consistency can’t be achieved at the same time as availability and partition-tolerance.

Eventual Consistency
- As t → ∞, readers will see writes.
- In a steady state, the system is guaranteed to eventually return the last written value.
- For example: DNS, or MySQL slave replication (log shipping).
- Special cases of eventual consistency:
  - Read-your-own-writes consistency (the “sent mail” box)
  - Causal consistency (if you write Y after reading X, anyone who reads Y sees X)
  - Gmail has read-your-own-writes but not causal consistency!

Timestamps and Vector Clocks
Determining a history of a row
- Eventual consistency relies on deciding what value a row will eventually converge to.
- In the case of two writers writing at “the same” time, this is difficult.
- Timestamps are one solution, but they rely on synchronized clocks and don’t capture causality.
- Vector clocks are an alternative method of capturing order in a distributed system.

Vector Clocks
- Definition:
  - A vector clock is a tuple {t1, t2, ..., tn} of clock values, one from each node.
  - v1 < v2 if:
    - for all i, v1[i] ≤ v2[i], and
    - for at least one i, v1[i] < v2[i].
  - v1 < v2 implies a global time ordering of events.
- When data is written from node i, it sets t_i to its clock value.
- This allows eventual consistency to resolve the order of writes on multiple replicas.
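
A minimal sketch of the comparison rule above, in Python. Representing a clock as a dict from node name to counter (missing entries treated as 0) is an assumption for illustration; real systems also prune stale entries and attach the clock to each stored version.

def happens_before(v1, v2):
    """True if clock v1 precedes v2: every component of v1 is <= the
    matching component of v2, and at least one is strictly less."""
    nodes = set(v1) | set(v2)
    all_lte = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    some_lt = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    return all_lte and some_lt

def concurrent(v1, v2):
    # Neither clock precedes the other: the writes conflict, and the
    # application (or a merge rule) must reconcile them.
    return not happens_before(v1, v2) and not happens_before(v2, v1)

a = {"node1": 2, "node2": 1}
b = {"node1": 3, "node2": 1}
c = {"node1": 2, "node2": 2}
print(happens_before(a, b))  # True: b saw everything a saw, plus more
print(concurrent(b, c))      # True: neither dominates, so they conflict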

Data Models
What’s in a row?
- Primary Key → Value
- The value could be:
  - A blob
  - Structured (a set of columns)
  - Semi-structured (a set of column families with arbitrary columns, e.g. linkto:<url> in webtable)
- Each has advantages and disadvantages.
- Secondary indexes? Tables/namespaces?

Multi-Version Storage
Using timestamps for a third dimension
- Each table cell has a timestamp.
- Timestamps don’t necessarily need to correspond to real life.
- Multiple versions (and tombstones) can exist concurrently for a given row.
- Reads may return “most recent”, “most recent before T”, etc. (free snapshots).
- The system may provide optimistic concurrency control with compare-and-swap on timestamps.
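
A sketch of the multi-version cell idea, assuming an in-memory list of (timestamp, value) versions with unique timestamps per cell; the class and method names are hypothetical, not any particular system’s API.

import bisect

class VersionedCell:
    TOMBSTONE = object()   # marks a deletion at some timestamp

    def __init__(self):
        self.versions = []  # (timestamp, value), kept sorted by timestamp

    def put(self, ts, value):
        # Assumes timestamps are unique within a cell.
        bisect.insort(self.versions, (ts, value))

    def delete(self, ts):
        # A delete is just another version: a tombstone marker.
        self.put(ts, VersionedCell.TOMBSTONE)

    def read(self, before_ts=None):
        # Return the most recent value, or the most recent value
        # strictly before `before_ts` (a free snapshot read).
        candidates = [(ts, v) for ts, v in self.versions
                      if before_ts is None or ts < before_ts]
        if not candidates:
            return None
        _, v = candidates[-1]
        return None if v is VersionedCell.TOMBSTONE else v

    def check_and_put(self, expected_ts, new_ts, value):
        # Optimistic concurrency: write only if the newest version is
        # still the one the caller read (compare-and-swap on timestamp).
        latest = self.versions[-1][0] if self.versions else None
        if latest != expected_ts:
            return False
        self.put(new_ts, value)
        return True

cell = VersionedCell()
cell.put(100, "v1")
cell.put(200, "v2")
print(cell.read())               # 'v2'
print(cell.read(before_ts=150))  # 'v1' -- a snapshot at time 150
print(cell.check_and_put(200, 300, "v3"))  # True: no concurrent writer won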

Storage Layouts
How do we lay out rows and columns on disk?
- The layout determines the performance of different access patterns.
- Storage layout maps directly to disk access patterns.
- Fast writes? Fast reads? Fast scans?
- Whole-row access or subsets of columns?

Row-based Storage
- Pros:
  - Good locality of access (on disk and in cache) across different columns.
  - Read/write of a single row is a single IO operation.
- Cons:
  - If you want to scan only one column, you still read all of them.

Columnar Storage
- Pros:
  - Data for a given column is stored sequentially.
  - Scanning a single column (e.g. aggregate queries) is fast.
- Cons:
  - Reading a single row may seek once per column.

Columnar Storage with Locality Groups
- Columns are organized into families (“locality groups”).
- Within a group, you get the benefits of the row-based layout.
- Across groups, you get the column-based benefit: you don’t have to read groups you don’t care about.
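
A toy illustration of the three layouts using plain Python structures; the records and the hot/cold group split are made up for the example. The point is which values end up adjacent, since adjacency on disk is what determines scan cost.

# Some example rows (hypothetical data).
rows = [
    {"key": "r1", "name": "alice", "age": 30, "bio": "long text..."},
    {"key": "r2", "name": "bob",   "age": 25, "bio": "long text..."},
]

# Row-oriented: all columns of a row are adjacent, so fetching one row
# is one contiguous read, but scanning one column drags in every column.
row_layout = [list(r.values()) for r in rows]

# Column-oriented: each column's values are adjacent, so an aggregate
# over `age` reads only the age column.
col_layout = {col: [r[col] for r in rows] for col in rows[0]}
print(sum(col_layout["age"]) / len(col_layout["age"]))  # 27.5, one column read

# Locality groups sit in between: co-locate columns that are read
# together, and keep rarely-read columns (like `bio`) in their own group.
locality_groups = {
    "hot":  {c: col_layout[c] for c in ("key", "name", "age")},
    "cold": {c: col_layout[c] for c in ("bio",)},
}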

Log-Structured Merge Trees
aka “The BigTable model”
- Random IO for writes is bad (and impossible in some DFSs).
- LSM trees convert random writes to sequential writes.
- Writes go to a commit log and to in-memory storage (the Memtable).
- The Memtable is occasionally flushed to disk (as an SSTable).
- The on-disk stores are periodically compacted.

P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996.
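
A compact sketch of the write and read paths just described, assuming a plain dict for the Memtable and sorted lists standing in for SSTables; a real implementation would also append every write to a commit log for durability, index its SSTables instead of scanning them, and compact in the background.

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory, holds the newest writes
        self.sstables = []            # newest-first list of flushed runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real system would first append this write to an on-disk
        # commit log so it survives a crash.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Memtable -> immutable sorted run ("SSTable"): sequential IO only.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then each run newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:          # real SSTables use an index + binary search
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all on-disk runs into one; the newest value per key wins.
        merged = {}
        for run in reversed(self.sstables):  # oldest first, newer overwrites
            merged.update(run)
        self.sstables = [sorted(merged.items())]

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # hits the limit and triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"))               # 3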

LSM Data Layout

LSM Write Path

LSM Read Path

LSM Read Path: Bloom Filters
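
Bloom filters let the read path skip SSTables that definitely do not contain a key, so a get usually touches only the stores that might hold it. A minimal sketch, assuming the classic double-hashing trick (deriving the k probe positions from two halves of one MD5 digest); the sizing parameters here are illustrative, not tuned.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # one big integer used as a bit array

    def _positions(self, key):
        # Derive k probe positions from two halves of a single MD5
        # digest: position_i = h1 + i * h2 (double hashing).
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent" (skip this SSTable entirely);
        # True means "possibly present" (must actually read the store).
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-17")
print(bf.might_contain("row-17"))   # True
print(bf.might_contain("row-99"))   # almost certainly False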

LSM Memtable Flush

LSM Compaction

Cluster Management
- Clients need to know where to find data (consistent hashing tokens, etc.).
- Internal nodes may need to find each other as well.
- Since nodes may fail and recover, a configuration file doesn’t really suffice.
- We need a way of keeping some kind of consistent view of the cluster state.

Omniscient Master
- When nodes join, leave, or change state, they talk to a master.
- That master holds the authoritative view of the world.
- Pros: simplicity; a single consistent view of the cluster.
- Cons: a potential SPOF unless the master is made highly available; not partition-tolerant.

Gossip
- Gossip is one method to propagate a view of cluster status.
- Every t seconds, on each node:
  - The node selects some other node to chat with.
  - The node reconciles its view of the cluster with its gossip buddy’s.
  - Each node maintains a “timestamp” for itself and for the most recent information it has from every other node.
- Information about cluster state spreads in O(log n) rounds (eventual consistency).
- Scalable and no SPOF, but state is only eventually consistent.
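
A toy round-based simulation of the gossip exchange described above, assuming each node keeps a map from node name to a (heartbeat, status) pair and that reconciliation keeps the entry with the higher heartbeat; real protocols randomize timing per node and layer failure detection on top.

import math
import random

def gossip_round(views):
    # One synchronous round: every node reconciles with one random peer
    # (push-pull: both sides end up with the merged view).
    nodes = list(views)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = {}
        for n in set(views[node]) | set(views[peer]):
            # Keep whichever side has the fresher heartbeat for n.
            merged[n] = max(views[node].get(n, (0, None)),
                            views[peer].get(n, (0, None)))
        views[node] = dict(merged)
        views[peer] = dict(merged)
    return views

# Initially each node knows only its own state: (heartbeat, status).
names = [f"node-{i}" for i in range(16)]
views = {n: {n: (1, "alive")} for n in names}

rounds = 0
while any(len(v) < len(names) for v in views.values()):
    views = gossip_round(views)
    rounds += 1
print(f"full dissemination after {rounds} rounds "
      f"(~log2(n) = {math.log2(len(names)):.0f} expected)")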

Gossip - Initial State

Gossip - Round 1

Gossip - Round 2

Gossip - Round 3

Gossip - Round 4

Questions to Ask Presenters

Scalability and Reliability
- What are the scaling bottlenecks? How does the system react when overloaded?
- Are there any single points of failure?
- When nodes fail, does the system maintain availability of all data?
- Does the system automatically re-replicate when replicas are lost?
- When new nodes are added, does the system automatically rebalance data?

Performance
- What’s the goal? Batch throughput or request latency?
- How many seeks for reads? For writes? How many network RTTs?
- What 99th-percentile latencies have been measured in practice?
- How do failures impact serving latencies?
- What throughput has been measured in practice for bulk loads?

Consistency
- What consistency model does the system provide?
- What situations would cause a lapse of consistency, if any?
- Can the consistency semantics be tweaked by configuration settings?
- Is there a way to do compare-and-swap on row contents for optimistic locking? Multi-row?

Cluster Management and Topology
- Does the system have a single master? Does it use gossip to spread cluster management data?
- Can it withstand network partitions and still provide some level of service?
- Can it be deployed across multiple datacenters for disaster recovery?
- Can nodes be commissioned/decommissioned automatically without downtime?
- Are there operational hooks for monitoring and metrics?

Data Model and Storage
- What data model and storage system does the system provide? Is it pluggable?
- What IO patterns does the system cause under different workloads?
- Is the system best at random or sequential access? For read-mostly or write-mostly workloads?
- Are there practical limits on key, value, or row sizes?
- Is compression available?

Data Access Methods
- What methods exist for accessing data? Can I access it from language X?
- Is there a way to perform filtering or selection on the server side?
- Are there bulk-load tools to get data in and out efficiently?
- Is there a provision for data backup/restore?

Real Life Considerations
(I was talking about fake life in the first 45 slides)
- Who uses this system? How big are the clusters it’s deployed on, and what kind of load do they handle?
- Who develops this system? Is it a community project or run by a single organization? Are outside contributions regularly accepted?
- Who supports this system? Is there an active community who will help me deploy it and debug issues? Docs?
- What is the open-source license?
- What is the development roadmap?
