Decibel: The Relational Dataset Branching System

3y ago

21 Views

2 Downloads

424.02 KB

12 Pages

Last View : 1m ago

Last Download : 3m ago

Upload by : Audrey Hope

Report this link

Download PDF

Transcription

Decibel: The Relational Dataset Branching SystemMichael Maddox , David Goehring , Aaron J. Elmore‡ ,Samuel Madden , Aditya Parameswaran† and Amol Deshpande§ MITCSAILmaddox@mit.edu, dggoeh1@mit.edu, madden@csail.mit.edu† Universityof Illinois (UIUC)adityagp@illinois.eduAbstract—As scientific endeavors and data analysis becomesincreasingly collaborative, there is a need for data managementsystems that natively support the versioning or branching ofdatasets to enable concurrent analysis, cleaning, integration,manipulation, or curation of data across teams of individuals.Common practice for sharing and collaborating on datasetsinvolves creating or storing multiple copies of the dataset, onefor each stage of analysis, with no provenance informationtracking the relationships between these datasets. This resultsnot only in wasted storage, but also makes it challenging totrack and integrate modifications made by different users to thesame dataset. In this paper, we introduce the Relational DatasetBranching System, Decibel, a new relational storage system withbuilt-in version control designed to address these shortcomings.We present our initial design for Decibel and provide a thoroughevaluation of three versioned storage engine designs that focuson efficient query processing with minimal storage overhead. Wealso develop an exhaustive benchmark to enable the rigoroustesting of these and future versioned storage engine designs.I. I NTRODUCTIONWith the rise of “data science”, individuals increasingly findthemselves working collaboratively to construct, curate, andmanage shared datasets. Consider, for example, researchers ina social media company, such as Facebook or Twitter, working with a historical snapshot of the social graph. Differentresearchers may have different goals: one may be developinga textual analysis to annotate each user in the graph with adkeywords based on recent posts; another may be annotatingedges in the graph with weights that estimate the strengthof the relationship between pairs of users; a third may becleaning the way that location names are attached to usersbecause a particular version of the social media client insertedplace names with improper capitalization. These operationsmay happen concurrently, and often analysts want to performthem on multiple snapshots of the database to measure theeffectiveness of some an algorithm or analysis. Ultimately, theresults of some operations may need to be visible to all users,while others need not be shared with other users or mergedback into the main database.Existing mechanisms to coordinate these kinds of operationson shared databases are often ad hoc. For example, severalcomputational biology groups we interviewed at MIT to motivate our work reported that the way they manage such sharedrepositories is to simply make a new copy of a dataset for eachnew project or group member. Conversations with colleaguesin large companies suggest that practices there are not muchbetter. This ad hoc coordination leads to a number of problems,including: Redundant copies of data, which wastes storage.‡ Universityof Chicagoaelmore@cs.uchicago.edu§ Universityof Maryland (UMD)amol@cs.umd.edu No easy way for users to share enhancements or patchesto datasets with other users or merge them into the“canonical” version of the dataset. No systematic way to record which version of a datasetwas used for an experiment. Often, ad hoc directorystructures or loosely-followed filename conventions areused instead. No easy way to share data with others, or to keep trackof who is using a particular dataset, besides using filesystem permissions.One potential solution to this problem is to use an existingdistributed version control system such as git or mercurial.These tools, however, are not well-suited to versioning largedatasets for several reasons. First, they generally requireeach user to “checkout” a separate, complete copy of adataset, which is impractical within large, multi-gigabyte orterabyte-scale databases. We envision instead a hosted solution,where users issue queries to a server to read or modify recordsin a particular version of the database. Second, because theyare designed to store unstructured data (text and arbitrarybinary objects), they have to use general-purpose differencingtools (like Unix diff) to encode deltas and compare versions.In contrast, deltas in databases can often be encoded aslogical modifications to particular records or fields, whichcan be orders of magnitude smaller than the results of thesegeneral differencing tools, and are not sensitive to the order ofthe records. Moreover, version control systems like these donot provide any of the high-level data management featuresand APIs (e.g., SQL) typically found in database systems,relational or otherwise.In this paper, we address the above limitations by presentingDecibel, a system for managing large collections of relationaldataset versions. Decibel allows users to create working copies(i.e., branches) of a dataset based either off the present stateof a dataset or from prior versions. As in existing versioncontrol systems such as git, many such branches or workingcopies can co-exist, and branches may be merged periodicallyby users. Decibel also allows modifications across differentbranches, or within the same branch.We describe our versioning API and the logical data modelwe adopt for versioned datasets, and then describe severalalternative approaches for physically encoding the branchingstructure. Choosing the right physical data layout is criticalfor achieving good performance and storage efficiency from aversioned data store. Consider a naive physical design thatstores each version in its entirety: if versions substantiallyoverlap (which they generally will), such a scheme will behugely wasteful of space. Moreover, data duplication could

prove costly when performing cross-version operations like diffas it sacrifices the potential for shared computation.In contrast, consider a version-first storage scheme whichstores modifications made to each branch in a separate tablefragment (which we physically store as a file) along withpointers to the table fragments comprising the branch’s directancestors. A linear chain of such fragments thus comprisesthe state of a branch. Since modifications to a branch areco-located within single files, it is easier to read the contents ofa single branch or version by traversing its lineage. However,this structure makes it difficult to perform queries that compareversions, e.g., that ask which versions satisfy a certain propertyor contain a particular tuple [1].As an alternative, we also consider a tuple-first scheme whereevery tuple that has ever existed in any version is stored in asingle table, along with a bitmap to indicate the versions eachtuple exists in. This approach is very efficient for queries thatcompare the contents of versions (because such queries can besupported through bitmap intersections), but can be inefficientfor queries that read a single version since data from manyversions is interleaved.Finally, we propose a hybrid scheme that stores recordsin segmented files like in the version-first scheme, but alsoleverages a collection of bitmaps like those in the tuple-firstscheme to track the version membership of records. For theoperations we consider, this system performs as well or betterthan both schemes above, and also affords a natural parallelismacross most query types.For each of these schemes, we describe the algorithmsrequired to implement key versioning operations, includingversion scans, version differencing, and version merging. Thekey contribution of this paper is a thorough exploration of thetrade-offs between these storage schemes across a variety ofoperations and workloads. Besides describing these schemes,this paper makes several other contributions: We provide the first full-fledged integration of modernversion control ideas with relational databases. We describe our versioning API, our interpretation of versioningsemantics within relational systems, and several implementations of a versioned relational storage engine. We describe a new versioning benchmark we have developed, modeled after several workloads we believe are representative of the use cases we envision. These workloadshave different branching and merging structures, designedto stress different aspects of the storage managers. We provide an evaluation of our storage engines, showing that our proposed hybrid scheme outperforms thetuple-first and version-first schemes on our benchmark.Decibel is a key component of DataHub [2], a collaborativedata analytics platform that we’re building. DataHub includesthe version control features provided by Decibel along withother features such as access control, account management, andbuilt-in data science functionalities such as visualization, datacleaning, and integration. Our vision paper on DataHub [2]briefly alluded to the idea of version and tuple-first storage,but did not describe any details, implementation, or evaluation,and also did not describe the hybrid approach presentedhere (which, as we show, outperforms the other approachessignificantly, sometimes by an order-of-magnitude or more.)Also in recent work [3], we presented algorithms to minimizethe storage and recreation costs of a collection of unstructureddatasets, as opposed to building and evaluating an end-to-endstructured dataset version management system like Decibel.We begin by presenting motivating examples, showing howend users could benefit from Decibel. We then provide anoverview of our versioning API and data model in Section II.A detailed overview of the aforementioned physical storageschemes is presented in Section III. We then describe ourversioned benchmarking strategy in Section IV and the experimental evaluation of our storage models on a range ofversioned query types in Section V.A. Versioning Patterns & ExamplesWe now describe two typical dataset versioning patternsthat we have observed across a wide variety of scenarios. Wedescribe how they motivate the need for Decibel, and capturethe variety of ways in which datasets are versioned and sharedacross individuals and teams. These patterns based on ourdiscussions with domain experts, and inspire the workloadsthat we use to evaluate Decibel in Section V.Science Pattern: This pattern is used by data scientist teams.These data scientists typically begin by taking the latest copyof an evolving dataset, then may perform normalization andcleaning (e.g., remove or merge columns, deal with NULLvalues or outliers), annotate the data with additional derivedfeatures, separate into test and training subsets, and run modelsas part of an iterative process. At the same time, the underlyingdataset that the data scientists started with may typicallyevolve, but often analysts will prefer to limit themselvesto the subset of data available when analysis began. UsingDecibel, such scientists and teams can create a private branchin which their analysis can be run without having to make acomplete copy of the data. They can return to this branch whenrunning a subsequent analysis, or create further branches totest and compare different cleaning or normalization strategies,or different models or features, while retaining the ability toreturn to previous versions of their work.This pattern applies to a variety of data science teamsincluding a) The ads team of a startup, analyzing the impactof the ad campaigns on visitors to websites. b) A physicalscientist team, who would like to build and test models andphysical theories on snapshots of large-scale simulation data.c) A medical data analysis team, analyzing patient care andmedical inefficiencies, who are only allowed to access recordsof patients who have explicitly agreed to such a study (thiscan be used to define branches that the team works on).Curation Pattern: This pattern is used by teams collectivelycurating a structured dataset. While the canonical version of thedataset evolves in a linear chain, curators may work on editing,enhancing, or pruning portions of this dataset via branches, andthen apply these fixes back to the canonical version. Whilethis is cumbersome to do via current data management tools,Decibel can easily support multiple individuals simultaneouslycontributing changes to their branches, and then merging thesechanges back to the canonical version. This way, curatorscan “install and test” changes on branches without exposingpartial changes to other curators or production teams using thecanonical version until updates have been tested and validated.This pattern applies to a variety of data curation teamsincluding a) The team managing the product catalog of a business with individuals who manage different product segments,applying updates to their portion of the catalog in tandem. b) Avolunteer team of community users contributing changes toa collaboratively managed map, e.g. OpenStreetMaps, where

#1Version ARVersion ASRS2Version BbranchcommitVersion BRSVersion CRRSSVersion ERSVersion DBranch 1commitRR3SVersion DmergeSVersion FRMaster BranchS4Master Branch(a)(b)Fig. 1: An Example Workflowindividual users may focus on local regions, adding points ofinterest or fixing detailed geometry or metadata (e.g., one wayinformation) of roads. c) A team of botanists collaborativelycontributing to a dataset containing the canonical properties ofplants found in a tropical rainforest.II. D ECIBEL API AND A RCHITECTUREWe begin with a brief overview of the Decibel architecturebefore describing the version control model and API thatDecibel provides.A. ArchitectureWe now briefly summarize the architecture of Decibel.Decibel is implemented in Java, on top of the MIT SimpleDBdatabase. In this paper, we focus on the design of the Decibelstorage engine, which is a new version-optimized data storagesystem able to implement the core operations to scan, filter,difference, and merge branching data sets. Note, however, thatDecibel does support general SQL query plans, but most of ourquery evaluation (joins, aggregates) is done in the (unmodified)SimpleDB query planning layer. The changes we made forDecibel were localized to the storage layer. The storage layerreads in data from one of the storage schemes, storing pages ina fairly conventional buffer pool architecture with (with 4 MBpages), exposing iterators over different single versions of datasets. The buffer pool also encompasses a lock manager used forconcurrency control. In addition to this buffer pool we store anadditional version graph on disk and in memory. In this paperwe focus on the versioned storage manager and versioning datastructures, with support for versioning operations in severaldifferent storage schemes, not the design of the query executor.In the rest of this section, we describe Decibel query language.B. Decibel Model and APIWe first describe the logical data model that we use, andthen describe the version control API in detail. We describethese concepts in the context of Figure 1, where (a) and (b)depict two evolution patterns of a dataset.1) Data Model.: Decibel uses a very flexible logical datamodel, where the main unit of storage is the dataset. Adataset is a collection of relations, each of which consists ofa collection of records. Each relation in each dataset musthave a well-defined primary key; the primary key is used totrack records across different versions or branches, and thusis expected to be immutable (a change to the primary keyattribute, in effect, creates a new record). For the same reason,primary keys should not be reused across semantically distinctrecords; however, we note that Decibel does not attempt toenforce either of these two properties.2) Version Control Model: Decibel uses a version controlmodel that is similar to that of software version control systemslike git. In Decibel, a version consists of a point-in-timeQuery TypeSingle version scan:find all tuples in relationR in version v01Multiple version positive diff:positive diff relation R betweenversions v01 and v02Multiple version join:join tuples in R inversions v01 and v02satisfying Name SamSeveral version scan:find all head versionsof relation RSQL EquivalentSELECT * FROM RWHERE R.Version ‘v01’SELECT * FROM RWHERE R.Version ‘v01’ AND R.idNOT IN (SELECT id from RWHERE R.Version ‘v02’)SELECT * FROM R as R1, R as R2 WHERER1.Version ‘v01’ AND R2.Version ‘v02’AND R1.id R2.id AND R1.Name ‘Sam’SELECT * FROM R WHEREHEAD(R.Version) trueTABLE I: Sample Queriessnapshot of one or more relations that are semantically groupedtogether into a dataset (in some sense, it is equivalent tothe notion of a commit in git/svn). For instance, VersionsA—D in Figure 1(a) all denote versions of a dataset thatcontain two relations, R and S. A version, identified by anID, is immutable and any update to a version conceptuallyresults in a new version with a different version ID (as wediscuss later in depth, the physical data structures are notnecessarily immutable and we would typically not want tocopy all the data over, but rather maintain differences). Newversions can also be created by merging two or more versions(e.g., Version F in Figure 1(b)), or through the application oftransformation programs to one or more existing versions (e.g.,Version B from Version A in Figure 1(a)). The version-levelprovenance that captures these processes is maintained as adirected acyclic graph, called a version graph; the nodes andedges in Figure 1(a) or (b) comprise a version graph.In Decibel, a branch denotes a working copy of a dataset.There is an active branch corresponding to every leaf node orversion in the version graph. Logically, a branch is comprisedof the history of versions that occur in the path from the branchleaf to the root of the version graph. For instance, in Figure 1(a)there are two branches, one corresponding to Version D andone corresponding to C. On the other hand, in Figure 1(b) thereis a single branch corresponding to version F. The initial branchcreated is designated the master branch, which serves as theauthoritative branch of record for the evolving dataset. Thus, aversion can be seen as capturing a series of modifications to abranch, creating a point-in-time snapshot of a branch’s content.The leaf version, i.e., the (chronologically) latest version in abranch is called its head; it is expected that most operationswill occur on the heads of the branches.3) Decibel Operational Semantics: We now describe thesemantics of each of the core operations of the versioncontrol workflow described above as implemented in Decibel.Although the core operations Decibel supports are analogousto operations supported by systems like git, unlike those,Decibel supports in-place modifications to the data and needsto support both version control commands as well as datadefinition and manipulation commands. We will describe theseoperations in the context of Figure 1(a).Users interact with Decibel by opening a connection to theDecibel server, which creates a session. A session captures theuser’s state, i.e., the commit (or the branch) that the operationsthe user issues will read or modify. Concurrent transactions bymultiple users on the same version (but different sessions) areisolated from each other through two-phase locking.Init: The repository is initialized, i.e., the first version (VersionA in the figure) is created, using a special init transaction thatcreates the two tables as well as populates them with initial

data (if needed). At this point, there is only a single Masterbranch with a single version in it (which is also its head).Commit and Checkout: Commits create new versions ofdatasets, adding an extra node to one of the existing branchesin the version graph. Suppose a user increments the valuesof the second column by one for each record in relation R,then commits the change as Version B on the Master branch.This commit in Decibel creates a ne

dataset, which is impractical within large, multi-gigabyte or terabyte-scale databases. We envision instead a hosted solution, where users issue queries to a server to read or modify records in a particular version of the database. Second, because they are designed to store unstructured data (text and arbitrary

Related Documents:

Nonprofit Self-Assessment Checklist

May 02, 2018 · D. Program Evaluation ͟The organization has provided a description of the framework for how each program will be evaluated. The framework should include all the elements below: ͟The evaluation methods are cost-effective for the organization ͟Quantitative and qualitative data is being collected (at Basics tier, data collection must have begun)

1.4K Views

2y ago

Name of thé élément in thé language and script of thé ... - UNESCO

Silat is a combative art of self-defense and survival rooted from Matay archipelago. It was traced at thé early of Langkasuka Kingdom (2nd century CE) till thé reign of Melaka (Malaysia) Sultanate era (13th century). Silat has now evolved to become part of social culture and tradition with thé appearance of a fine physical and spiritual .

106 Views

9m ago

[Kl - Mauritius

On an exceptional basis, Member States may request UNESCO to provide thé candidates with access to thé platform so they can complète thé form by themselves. Thèse requests must be addressed to esd rize unesco. or by 15 A ril 2021 UNESCO will provide thé nomineewith accessto thé platform via their émail address.

464 Views

1y ago

Employee Benefits Event - Schneider Downs Tax Services

̶The leading indicator of employee engagement is based on the quality of the relationship between employee and supervisor Empower your managers! ̶Help them understand the impact on the organization ̶Share important changes, plan options, tasks, and deadlines ̶Provide key messages and talking points ̶Prepare them to answer employee questions

323 Views

1y ago

Study Investigating thè Effect of E- Service Quality on Customer's ...

Dr. Sunita Bharatwal** Dr. Pawan Garga*** Abstract Customer satisfaction is derived from thè functionalities and values, a product or Service can provide. The current study aims to segregate thè dimensions of ordine Service quality and gather insights on its impact on web shopping. The trends of purchases have

117 Views

9m ago

Visual Studio - Continuous Integration

Feature Branching: Branching based on the features to be developed Release Branching: A branch is created to stabilize the release and later merged to main branch after the release. Quality Branching: Branching done for different teams with focus on quality TFS allows merging of code from different branches.

20 Views

2y ago

Kinh Giải Thâm Mật HT. Thích Trí Quang dịch giải

Chính Văn.- Còn đức Thế tôn thì tuệ giác cực kỳ trong sạch 8: hiện hành bất nhị 9, đạt đến vô tướng 10, đứng vào chỗ đứng của các đức Thế tôn 11, thể hiện tính bình đẳng của các Ngài, đến chỗ không còn chướng ngại 12, giáo pháp không thể khuynh đảo, tâm thức không bị cản trở, cái được

1.6K Views

3y ago

RELATIONAL ALGEBRA - California Institute of Technology

The Relational Algebra A procedural query language Comprised of relational algebra operations Relational operations: Take one or two relations as input Produce a relation as output Relational operations can be composed together Each operation produces a relation A query is simply a relational algebra expression Six "fundamental" relational operations

10 Views

5m ago

Recent Views

Consumer Guide to Auto Insurance - csimt.gov

consumer guide to auto insurance contents introduction to auto insurance 1 understanding your auto insurance policy 2 required auto insurance 3 optional types of auto insurance 4-5 getting the right coverage 6 accidents and violations 7 how to shop for auto insurance 8 shopping tips 9 frequently asked questions 10-11 insurance complaints/when you have a problem 12

2y ago

805 Views

your guide to understanding auto ins in nh - New Hampshire

Hampshire Insurance Department does not mandate or set Auto Insurance Rates. Auto Insurance Rates will vary by insurance company. This guide is intended to give New Hampshire consumers basic information on auto insurance. It suggests ways to: Lower the cost of your auto insurance, shop for Auto insurance and, file an auto insurance claim.

1y ago

449 Views

OWNER'S GUIDE - NinjaKitchen

auto auto auto. frozen drinks smoothies puree med high pulse low / dough. auto auto auto. frozen drinks smoothies puree med high pulse low / dough. auto auto auto. frozen drinks smoothies puree med high pulse low / dough. auto auto auto. please keep these important safeguards in mind when using the . appliance: mportant: make sure that the .

1y ago

285 Views

Quotes within Quotes: When Single (') and Double (") Quotes . - SAS

Here the outside double quotes are replaced by a single quote and the apostrophe is replaced by two single quotes. This works because when the parser sees two single (or double) quotes immediately following each other, the parser resolves them into one quote mark after the closing quote has been determined.

1y ago

237 Views

What These Inspirational Quotes Say

Self Motivation Quotes Success Quotes Teacher Quotes And after reading all of these inspirational quotes you’d like to share which quotation is . -- Brian Tracy "You must constantly ask yourself these questions: Who am I around? What are they doing to me? Wha

2y ago

302 Views

Consumer Guide Auto Insurance - Tennessee

Auto insurance doesn't cover paying off your loan if your car is damaged and its market value is less than what you owe. Auto dealers and lenders may offer guaranteed auto protection (GAP) insurance for this purpose. Your auto insurance will cover you if you drive into Canada. To drive into Mexico, however, you'll need to buy Mexican auto .

1y ago

199 Views

NAIC Consumer Shopping Tool for Auto Insurance

Whether you are buying auto insurance for the first time, or shopping to be sure you are getting the best deal, you already know how important auto insurance is. By law in most states, if you own a car, you must have some auto insurance. Remember, there is no such thing as a "full coverage" auto insurance policy. Policies are made up of

1y ago

185 Views

Personal insurance - Car & Business insurance King Price Insurance

The king's insurance options 5 Things you need to know 7 The stuff you need to do 14 How to claim 16 Our commitment to you 20 Car insurance 22 Car warranty 37 Shortfall cover 45 Scratch and dent 46 Tyre and rim 48 Motorbike insurance 53 Trailer and caravan insurance 64 Watercraft insurance 68 Home contents insurance 77 Buildings insurance 89

1y ago

673 Views

REVIEW OF AUTOMOBILE INSURANCE RATES - Consumers' Association of Canada

In the summer of 2003 the Association compiled over 7,000 auto insurance rate quotes from sources across Canada. In the case of those provinces in which private insurers provide auto insurance the study ensured that the rate quotes obtained reflected the range of prices likely to be found in those markets.

1y ago

213 Views

Broadway towing winchester ky

MO 77 Motors: Rock Hill, SC 7th Avenue Auto Salvage: Fargo, ND 81 Auto Parts & Recycling : Salem, VA 82 Auto Wrecking: Brookfield, OH #9 Truck & Auto Parts (No US Shipping) : Tottenham, ON 97 Auto Wrecking Shull's Towing: Brewster , WA 98 Auto Recyclers: Brooksville, FL 99 Auto Dismantler: Stockton, CA A & A Auto & Truck LLC:

2y ago

465 Views

All about auto insurance - Option Consommateurs

of insurance companies with which they have agreements. Insurance agents: agents work for a specific insurance company. Before you decide to do business with either a broker or an agent, check out prices, the products being proposed and the quality of the service. Buying auto insurance 4 All about auto insurance

1y ago

230 Views

A Message from Our President - Fox Valley Corvette

Bob Jass Chev-rolet 630-365-6481 Auto Parts 25% in most cas-es Ron Westphal Chevrolet 630-898-9630 Auto Parts 25% in most cas-es Thomsons Auto Parts 630-879-6363 Auto Parts 10% in most cas-es American Mod-ern Insurance Co. Collector Car Auto Insurance 10% on Collector Auto Polic

2y ago

225 Views

Quotations - Free Website Builder: Create free websites

cards, but sometimes, playing a poor hand well." . 50th Birthday Quotes 60th Birthday Quotes And there are more. Funny Birthday Quotes Cute Birthday Quotes . it a try, itʼs free. Triumph over failure can be a

2y ago

267 Views

The Top 100 Motivational & Inspirational Quotes for 2015

I've spent hours crawling through the web trying to find the best quotes to keep me motivated and inspired all throughout the New Year. I've saved hundreds of quotes on my laptop and figured that words alone could motivate and inspire me. but if I couple the quotes

2y ago

329 Views

Inspirational Quotes - Guideposts

Inspirational Quotes Inspiring quotes are like vitamins for the soul. From the heartfelt to the humorous, the words of wisdom you’ll find here will strengthen your faith, lift your spirits, and even spark a positive change in your life. This collection of some our favorite inspirational quotes from religious figures, world leaders, authors,

2y ago

553 Views

Decibel: The Relational Dataset Branching System

It looks like you're using an ad-blocker