Fast Databases with Fast Durability and Recovery Through Multicore Parallelism


Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
Wenting Zheng and Stephen Tu, Massachusetts Institute of Technology; Eddie Kohler, Harvard University; Barbara Liskov, Massachusetts Institute of Technology
/technical-sessions/presentation/zheng wenting

This paper is included in the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), October 6-8, 2014, Broomfield, CO.
ISBN 978-1-931971-16-4
Open access to the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

Fast Databases with Fast Durability and Recovery Through Multicore Parallelism

Wenting Zheng, MIT*    Stephen Tu, MIT*    Eddie Kohler, Harvard University    Barbara Liskov, MIT
* Currently at the University of California, Berkeley.

Abstract

Multicore in-memory databases for modern machines can support extraordinarily high transaction rates for online transaction processing workloads. A potential weakness, however, is recovery from crash failures. Can classical techniques, such as checkpoints, be made both efficient enough to keep up with current systems' memory sizes and transaction rates, and smart enough to avoid additional contention? Starting from an efficient multicore database system, we show that naive logging and checkpoints make normal-case execution slower, but that frequent disk synchronization allows us to keep up with many workloads with only a modest reduction in throughput. We design throughout for parallelism: during logging, during checkpointing, and during recovery. The result is fast. Given appropriate hardware (three SSDs and a RAID), a 32-core system can recover a 43.2 GB key-value database in 106 seconds, and a 70 GB TPC-C database in 211 seconds.

1 Introduction

In-memory databases on modern multicore machines [10] can handle complex, large transactions at millions to tens of millions of transactions per second, depending on transaction size. A potential weakness of such databases is robustness to crashes and power failures. Replication can allow one site to step in for another, but even replicated databases must write data to persistent storage to survive correlated failures, and performance matters for both persistence and recovery.

Crash resistance mechanisms, such as logging and checkpointing, can enormously slow transaction execution if implemented naively. Modern fast in-memory databases running tens of millions of small transactions per second can generate more than 50 GB of log data per minute when logging either values or operations. In terms of both transaction rates and log sizes, this is up to several orders of magnitude more than the values reported in previous studies of in-memory-database durability [2, 14, 24]. Logging to disk or flash is at least theoretically fast, since log writes are sequential, but sequential log replay is not fast on a modern multicore machine. Checkpoints are also required, since without them, logs would grow without bound, but checkpoints require a walk over the entire database, which can cause data movement and cache pollution that reduce concurrent transaction performance. Recovery of a multi-gigabyte database using a single core could take more than 90 minutes on today's machines, which is a long time even in a replicated system.

Our goal in this work was to develop an in-memory database with full persistence at relatively low cost to transaction throughput, and with fast recovery, meaning we hoped to be able to recover a large database to a transactionally-consistent state in just a few minutes without replication. Starting from Silo [27], a very fast in-memory database system, we built SiloR, which adds logging, checkpointing, and recovery. Using a combination of logging and checkpointing, we are able to recover a 43.2 GB YCSB key-value-style database to a transactionally-consistent snapshot in 106 seconds, and a more complex 70 GB TPC-C database with many tables and secondary indexes in 211 seconds.

Perhaps more interesting than our raw performance is the way that performance was achieved. We used concurrency in all parts of the system.
The log is written concurrently to several disks, and a checkpoint is taken by several concurrent threads that also write to multiple disks. Concurrency was crucial for recovery, and we found that the needs of recovery drove many of our design decisions. The key to fast recovery is using all of the machine's resources, which, on a modern machine, means using all cores. But some designs tempting on the logging side, such as operation logging (that is, logging transaction types and arguments rather than logging values), are difficult to recover in parallel. This drive for fast parallel recovery affected many aspects of our logging and checkpointing designs.

Starting with an extremely fast in-memory database, we show:

- All the important durability mechanisms can and should be made parallel.

- Checkpointing can be fast without hurting normal transaction execution. The fastest checkpoints introduce undesired spikes and sudden drops in concurrent throughput, but through good engineering and by pacing checkpoint production, this variability can be reduced enormously.

- Even when checkpoints are taken frequently, a high-throughput database will have to recover from a very large log. In our experiments, log recovery is the bottleneck; for example, to recover a 35 GB TPC-C database, we recover 16 GB from a checkpoint and 180 GB from the log, and log recovery accounts for 90% of recovery time. Our design allows us to accomplish log replay at roughly the maximum speed of I/O.

- The system built on these ideas can recover a relatively large database quite quickly.

2 Silo overview

We build on Silo, a fast in-memory relational database that provides tables of typed records. Clients issue one-shot requests: all parameters are available when a request begins, and the request does not interact with its caller until it completes. A request is dispatched to a single database worker thread, which carries it out to completion (commit or abort) without blocking. Each worker thread is pinned to a physical core of the server machine. Most cores run workers, but SiloR reserves several cores for logging and checkpointing tasks.

Silo tables are stored in efficient, cache-friendly concurrent B-trees [15]. Each table uses one primary tree and zero or more secondary trees for secondary indexes. Key data is embedded in tree structures, and values are stored in separately-allocated records. All structures are stored in shared memory, so any worker can access the entire database.

Silo uses a variant of optimistic concurrency control (OCC) [11] to serialize transactions. Concurrency control centers on transaction IDs (TIDs). Each record contains the TID of the transaction that most recently modified it. As a worker runs a transaction, it maintains a read-set containing the old TID of each read or written record, and a write-set containing the new state of each written record. On transaction completion, a worker determines whether the transaction can commit. First it locks the records in the write-set (in a global order to avoid deadlock). Then it computes the transaction's TID; this is the serialization point. Next it compares the TIDs of records in the read-set with those records' current TIDs, and aborts if any TIDs have changed or any record is locked by a different transaction. Otherwise it commits and overwrites the write-set records with their new values and the new TID.
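The commit procedure above can be summarized in a short C++ sketch. This is an illustrative simplification rather than Silo's code: the Record and Transaction types, the single-word lock-bit encoding, and the last_committed_tid field are assumptions made for exposition, and the TID computed here omits the epoch component discussed in Section 2.1.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified record: one word stores the TID plus a lock bit.
struct Record {
    std::atomic<uint64_t> tid_word{0};   // (tid << 1) | lock bit
    std::string value;
};

constexpr uint64_t kLockBit = 1;
inline bool     locked(uint64_t w) { return w & kLockBit; }
inline uint64_t tid_of(uint64_t w) { return w >> 1; }

struct Transaction {
    std::map<Record*, uint64_t>    read_set;   // record -> TID observed when read
    std::map<Record*, std::string> write_set;  // record -> new value
    uint64_t last_committed_tid = 0;           // per-worker, carried across transactions
};

bool try_commit(Transaction& tx) {
    // Phase 1: lock the write set in a global order (here: record address) to avoid deadlock.
    std::vector<Record*> writes;
    for (auto& kv : tx.write_set) writes.push_back(kv.first);
    std::sort(writes.begin(), writes.end());
    for (Record* r : writes) {
        uint64_t w = r->tid_word.load();
        while (locked(w) || !r->tid_word.compare_exchange_weak(w, w | kLockBit))
            w = r->tid_word.load();
    }

    // Phase 2: choose the TID: larger than any observed TID and larger than this
    // worker's last committed TID. (Silo additionally places the TID in the current
    // epoch; see Section 2.1.) This is the serialization point.
    uint64_t tid = tx.last_committed_tid;
    for (auto& [rec, observed] : tx.read_set) tid = std::max(tid, observed);
    ++tid;

    // Phase 3: validate the read set. Abort if a record's TID changed or if it is
    // locked by a different transaction.
    for (auto& [rec, observed] : tx.read_set) {
        uint64_t w = rec->tid_word.load();
        bool locked_by_us = tx.write_set.count(rec) != 0;
        if (tid_of(w) != observed || (locked(w) && !locked_by_us)) {
            for (Record* r : writes) r->tid_word.fetch_and(~kLockBit);  // unlock, abort
            return false;
        }
    }

    // Phase 4: install new values, publish the new TID, and release the locks.
    for (Record* r : writes) {
        r->value = tx.write_set[r];
        r->tid_word.store(tid << 1);   // new TID with the lock bit cleared
    }
    tx.last_committed_tid = tid;
    return true;
}
```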
2.1 Epochs

Silo transaction IDs differ in an important way from those in other systems, and this difference impacts the way SiloR does logging and recovery. Classical OCC obtains the TID for a committing transaction by effectively incrementing a global counter. On modern multicore hardware, though, any global counter can become a source of performance-limiting contention. Silo eliminates this contention using time periods called epochs that are embedded in TIDs. A global epoch number E is visible to all threads. A designated thread advances it periodically (every 40 ms). Worker threads use E during the commit procedure to compute the new TID. Specifically, the new TID is (a) greater than any TID in the read-set, (b) greater than the last TID committed by this worker, and (c) in epoch E.

This avoids false contention on a global TID, but fundamentally changes the relationship between TIDs and the serial order. Consider concurrent transactions T1 and T2 where T1 reads a key that T2 then overwrites. The relationship between T1 and T2 is called an anti-dependency: T1 must be ordered before T2 because T1 depends on the absence of T2. In conventional OCC, whose TIDs capture anti-dependencies, our example would always have TID(T1) < TID(T2). But in Silo, there is no communication whatsoever from T1 to T2, and we could find TID(T1) > TID(T2)! This means that replaying a Silo database's committed transactions in TID order might recover the wrong database.

Epochs provide the key to correct replay. On total store order (TSO) architectures like x86-64, the designated thread's update of E becomes visible at all workers simultaneously. Because workers read the current epoch at the serialization point, the ordering of TIDs with different epochs is always compatible with the serial order, even in the case of anti-dependencies. Epochs allow for a form of group commit: SiloR persists and recovers in units of epochs. We describe below how this impacts logging, checkpointing, and recovery.
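A small worked example makes the epoch/TID relationship concrete. The bit layout below (a 32-bit epoch shift, no status bits) is an assumption for illustration only, not Silo's actual TID encoding; the point is that the epoch occupies the high-order bits, so TIDs from different epochs always compare consistently with the serial order, while TIDs within one epoch need not.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration only: high bits hold the epoch, low bits a
// sequence component chosen at commit time. Silo's real TID encoding also
// reserves status bits, which are omitted here.
constexpr int kEpochShift = 32;

inline uint64_t make_tid(uint64_t epoch, uint64_t seq) { return (epoch << kEpochShift) | seq; }
inline uint64_t epoch_of(uint64_t tid)                 { return tid >> kEpochShift; }

int main() {
    // Anti-dependency inside one epoch: T1 reads a key that T2 then overwrites.
    // Both commit in epoch 5; nothing forces TID(T1) < TID(T2), so TID order
    // within an epoch is not a replay order.
    uint64_t t1 = make_tid(5, 90);   // hypothetical sequence numbers
    uint64_t t2 = make_tid(5, 17);
    assert(t1 > t2 && epoch_of(t1) == epoch_of(t2));

    // Across epochs the ordering is reliable: a transaction that commits after the
    // epoch advances always carries a larger epoch, hence a larger TID.
    uint64_t t3 = make_tid(6, 1);
    assert(epoch_of(t3) > epoch_of(t1) && t3 > t1);
    return 0;
}
```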

3 Logging

This section explains how SiloR logs transaction modifications for persistence. Our design builds on Silo, which included logging but did not consider recovery, log truncation, or checkpoints. The SiloR logging subsystem adds log truncation, makes changes related to liveness, and allows more parallelism on replay.

3.1 Basic logging

The responsibility for logging in SiloR is split between workers, which run transactions, and separate logging threads ("loggers"), which handle only logging, checkpointing, and other housekeeping tasks. Workers generate log records as they commit transactions; they pass these records to loggers, which commit the logs to disk. When a set of logs is committed to disk via fsync, the loggers inform the workers. This allows workers to send transaction results to clients.

A log record comprises a committed transaction's TID plus the table, key, and value information for all records modified by that transaction. Each worker constructs log records in disk format and stores them in a memory buffer taken from a per-worker buffer pool. When a buffer fills, or at an epoch boundary, the worker passes the buffer to the logger over a shared-memory queue.

3.2 Value logging vs. operation logging

SiloR uses value logging, not operation or transaction logging. This means that SiloR logs contain each transaction's output keys and values, rather than the identity of the executed operation and its parameters.

The choice of value logging is an example of recovery parallelism driving the normal-case logging design. Value logging has an apparent disadvantage relative to operation logging: for many workloads (such as TPC-C) it logs more data, and therefore might unnecessarily slow transaction execution. However, from the point of view of recovery parallelism, the advantages of value logging outweigh its disadvantages. Value logging is easy to replay in parallel: the largest TID per value wins. This works in SiloR because TIDs reflect dependencies, i.e., the order of writes, and because we recover in units of epochs, ensuring that anti-dependencies are not a problem. Operation logging, in contrast, requires that transactions be replayed in their original serial order. This is always hard to parallelize, but in Silo, it would additionally require logging read-sets (keys and TIDs) to ensure anti-dependencies were obeyed. Operation logging also requires that the initial pre-replay database state be a transactionally consistent snapshot, which value logging does not; and for small transactions value and operation logs are about the same size. These considerations led us to prefer value logging in SiloR.

We solve the problem of value logging I/O by adding hardware until logging is not a bottleneck, and then using that hardware wisely.
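The "largest TID per value wins" rule is simple enough to sketch directly. The code below is a hypothetical illustration, not SiloR's recovery code: it replays logged (TID, key, value) entries into a toy map protected by a single mutex, whereas SiloR replays into its concurrent B-trees. Any number of threads can apply entries in any order and reach the same final state.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// One logged write: the committing transaction's TID plus the key and new value.
struct LogEntry {
    uint64_t    tid;
    std::string key;
    std::string value;
};

// Toy in-memory table for illustration; a single mutex keeps the sketch short.
struct Table {
    std::mutex mu;
    std::map<std::string, std::pair<uint64_t, std::string>> rows;  // key -> (tid, value)

    // Apply one logged write: it takes effect only if its TID is larger than the
    // TID already stored for that key ("largest TID per value wins").
    void apply(const LogEntry& e) {
        std::lock_guard<std::mutex> g(mu);
        auto it = rows.find(e.key);
        if (it == rows.end() || it->second.first < e.tid)
            rows[e.key] = {e.tid, e.value};
    }
};

// Replay several log partitions concurrently; ordering between partitions is irrelevant.
void replay_in_parallel(Table& table, const std::vector<std::vector<LogEntry>>& partitions) {
    std::vector<std::thread> threads;
    for (const auto& part : partitions)
        threads.emplace_back([&table, &part] {
            for (const auto& e : part) table.apply(e);
        });
    for (auto& t : threads) t.join();
}
```

Because the per-key merge keeps only the maximum TID, log files or file chunks can be assigned to replay threads arbitrarily, which is what makes value-log replay parallelize so well.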
3.3 Workers and loggers

Loggers have little CPU work to do. They collect logs from workers, write them to disk, and await durability notification from the kernel via the fsync/fdatasync system call. Workers, of course, have a lot of CPU work to do. A SiloR deployment therefore contains many worker threads and few logger threads. We allocate enough logger threads per disk to keep that disk busy, one per disk in our evaluation system.

But how should worker threads map to logger threads? One possibility is to assign each logger a partition of the database. This might reduce the data written by loggers (for example, it could improve the efficacy of compression), and it might speed up replay. We rejected this design because of its effect on normal-case transaction execution. Workers would have to do more work to analyze transactions and split their updates appropriately. More fundamentally, every worker might have to communicate with every logger. Though log records are written in batches (so the communication would not likely introduce contention), this design would inevitably introduce remote writes or reads: physical memory located on one socket would be accessed, either for writes or reads, by a thread running on a different socket. Remote accesses are expensive and should be avoided when possible.

Our final design divides workers into disjoint subsets, and assigns each subset to exactly one logger. Core pinning is used to ensure that a logger and its workers run on the same socket, making it likely that log buffers allocated on a socket are only accessed by that socket.

3.4 Buffer management

Although loggers should not normally limit transaction execution, loggers must be able to apply backpressure to workers, so that workers don't generate indefinite amounts of log data. This backpressure is implemented by buffer management. Loggers allocate a maximum number of log buffers per worker core. Buffers circulate between loggers and workers as transactions execute, and a worker blocks when it needs a new log buffer and one is not available. A worker flushes a buffer to its logger when either the buffer is full or a new epoch begins, whichever comes first. It is important to flush buffers on epoch changes, whether or not those buffers are full, because SiloR cannot mark an epoch as persistent until it has durably logged all transactions that happened in that epoch. Each log buffer is 512 KB. This is big enough to obtain some benefit from batching, but small enough to avoid wasting much space when a partial buffer is flushed.

We found that log-buffer backpressure in Silo triggered unnecessarily often because it was linked with fsync times. Loggers amplified file system hiccups, such as those caused by concurrent checkpoints, into major dips in transaction rates. SiloR's loggers instead recirculate log buffers back to workers as soon as possible (after a write, rather than after the following epoch change and fsync). We also increased the number of log buffers available to workers, setting this to about 10% of the machine's memory. The result was much less noise in transaction execution rates.
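The following sketch shows one way the buffer circulation described above could look, assuming a hypothetical bounded free list shared by a worker and its logger (names and structure are illustrative, not SiloR's code). Backpressure appears only as the worker blocking in acquire() when the pool is empty; the logger returns buffers immediately after writing them, before the fsync, so file-system hiccups do not directly stall workers.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

constexpr size_t kLogBufferSize = 512 * 1024;   // 512 KB per buffer, as in SiloR

struct LogBuffer {
    uint64_t epoch = 0;             // epoch of the transactions it contains
    std::vector<char> bytes;        // log records already in disk format
    LogBuffer() { bytes.reserve(kLogBufferSize); }
};

// Bounded pool of buffers shared by one worker and its logger. If the worker
// outruns the logger, the pool empties and acquire() blocks: that is the backpressure.
class BufferPool {
public:
    explicit BufferPool(size_t n) {
        for (size_t i = 0; i < n; ++i) free_.push_back(new LogBuffer);
    }
    ~BufferPool() { for (LogBuffer* b : free_) delete b; }

    LogBuffer* acquire() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !free_.empty(); });   // worker blocks here when none are free
        LogBuffer* b = free_.front();
        free_.pop_front();
        return b;
    }

    // Called by the logger as soon as the buffer's bytes have been written (not yet
    // fsynced), so a slow fsync does not immediately translate into worker stalls.
    void release(LogBuffer* b) {
        { std::lock_guard<std::mutex> lk(mu_); b->bytes.clear(); free_.push_back(b); }
        cv_.notify_one();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<LogBuffer*> free_;
};

// Worker-side flush policy: hand the buffer to the logger when it is full or when
// the global epoch has advanced, whichever comes first.
bool should_flush(const LogBuffer& b, uint64_t current_epoch) {
    return b.bytes.size() >= kLogBufferSize || b.epoch != current_epoch;
}
```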

3.5 File management

Each SiloR logger stores its log in a collection of files in a single directory. New entries are written to a file called data.log, the current log file. Periodically (currently every 100 epochs) the logger renames this file to old_data.e, where e is the largest epoch the file contains, then starts a new data.log. Using multiple files simplifies the process of log truncation and, in our measurements, didn't slow logging relative to Silo's more primitive single-file design.

Log files do not contain transactions in serial order. A log file contains concatenated log buffers from several workers. These buffers are copied into the log without rearrangement; in fact, to reduce data movement, SiloR logger threads don't examine log data at all. A log file can even contain epochs out of order: a worker that delays its release of the previous epoch's buffer will not prevent other workers from producing buffers in the new epoch. All we know is that a file old_data.e contains no records with epochs > e. And, of course, a full log comprises multiple log directories stored independently by multiple loggers writing to distinct disks. Thus, no single log contains enough information for recovery to produce a correct database state. It would be possible to extract this information from all logs, but instead SiloR uses a distinguished logger thread to maintain another file, pepoch, that contains the current persistent epoch. The logger system guarantees that all transactions in epochs <= pepoch are durably stored in some log. This epoch is calculated as follows:

1. Each worker w advertises its current epoch, ew, and guarantees that all future transactions it sends to its logger will have epoch >= ew. It updates ew by setting ew <- E after flushing its current log buffer to its logger.

2. Each logger l reads log buffers from workers and writes them to log files.

3. Each logger regularly decides to make its writes durable. At that point, it calculates the minimum of the ew for each of its workers and the epoch number of any log buffer it owns that remains to be written. This is the logger's current epoch, el. The logger then synchronizes all its writes to disk.

4. After this synchronization completes, the logger publishes el. This guarantees that all associated transactions with epoch < el have been durably stored for this logger's workers.

5. The distinguished logger thread periodically computes a persistence epoch ep as min{el} - 1 over all loggers. It writes ep to the pepoch file and then synchronizes that write to disk.

6. Once pepoch is durably stored, the distinguished logger thread publishes ep to a global variable. At that point all transactions with epochs <= ep have become durable and workers can release their results to clients.

This protocol provides a form of group commit. It ensures that the logs contain all information about transactions in epochs <= ep, and that no results from transactions with epoch > ep were released to clients. Therefore it is safe for recovery to recover all transactions with epochs <= ep, and also necessary since those results may have been released to clients. It has one important disadvantage, namely that the critical path for transaction commit contains two fsyncs (one for the log file and one for pepoch) rather than one. This somewhat increases latency.
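Steps 1-6 condense into a short sketch. This is an illustrative simplification, not SiloR's implementation: the file plumbing is reduced to stubbed hypothetical helpers (write_log_buffers, fsync_log, write_and_fsync_pepoch), the shared state is plain atomics, and error handling is omitted.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical I/O helpers standing in for the real file plumbing (stubbed here).
void write_log_buffers(size_t /*logger_id*/) { /* write owned buffers to data.log */ }
void fsync_log(size_t /*logger_id*/)         { /* fsync the log file */ }
void write_and_fsync_pepoch(uint64_t /*ep*/) { /* overwrite and fsync the pepoch file */ }

std::atomic<uint64_t> global_epoch{1};       // E, advanced every ~40 ms
std::atomic<uint64_t> published_pepoch{0};   // gates releasing results to clients

struct Worker { std::atomic<uint64_t> ew{1}; };   // step 1: advertised epoch
struct Logger {
    std::vector<Worker*> workers;                 // the disjoint subset it serves
    uint64_t min_buffered_epoch = UINT64_MAX;     // lowest epoch in buffers not yet written
    std::atomic<uint64_t> el{0};                  // step 4: published logger epoch
};

// Step 1 (worker side): after flushing its current buffer to the logger,
// a worker advances its advertised epoch to the current global epoch.
void worker_after_flush(Worker& w) { w.ew.store(global_epoch.load()); }

// Steps 2-4 (one logger): write, compute the logger epoch, fsync, then publish.
void logger_sync(Logger& lg, size_t id) {
    write_log_buffers(id);
    uint64_t e = lg.min_buffered_epoch;           // epochs still owned but unwritten
    for (Worker* w : lg.workers) e = std::min(e, w->ew.load());
    fsync_log(id);
    lg.el.store(e);    // all of this logger's transactions with epoch < e are now durable
}

// Steps 5-6 (distinguished logger): pepoch = min over loggers, minus one; persist, then publish.
void advance_pepoch(std::vector<Logger*>& loggers) {
    uint64_t ep = UINT64_MAX;
    for (Logger* lg : loggers) ep = std::min(ep, lg->el.load());
    if (ep == 0) return;                          // no logger has published an epoch yet
    ep -= 1;
    write_and_fsync_pepoch(ep);
    published_pepoch.store(ep);   // workers may now release results for epochs <= ep
}
```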
4 Checkpoints

Although logs suffice to recover a database, they do not suffice to recover a database in bounded time. In-memory databases must take periodic checkpoints of their state to allow recovery to complete quickly, and to support log truncation. This section describes how SiloR takes checkpoints.

4.1 Overview

Our main goal in checkpoint production is to produce checkpoints as quickly as possible without disrupting worker throughput. Checkpoint speed matters because it limits the amount of log data that will need to be replayed at recovery. The smaller the distance between checkpoints, the less log data needs to be replayed, and we found the size of the log to be the major recovery expense. Thus, as with log production, checkpointing uses multiple threads and multiple disks.

Checkpoints are written by checkpointer threads, one per checkpoint disk. In our current implementation checkpoints are stored on the same disks as logs, and loggers and checkpointers execute on the same cores (which are separate from the worker cores that execute transactions).

