Decibel: The Relational Dataset Branching System

3y ago
21 Views
2 Downloads
424.02 KB
12 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Audrey Hope
Transcription

Decibel: The Relational Dataset Branching SystemMichael Maddox , David Goehring , Aaron J. Elmore‡ ,Samuel Madden , Aditya Parameswaran† and Amol Deshpande§ MITCSAILmaddox@mit.edu, dggoeh1@mit.edu, madden@csail.mit.edu† Universityof Illinois (UIUC)adityagp@illinois.eduAbstract—As scientific endeavors and data analysis becomesincreasingly collaborative, there is a need for data managementsystems that natively support the versioning or branching ofdatasets to enable concurrent analysis, cleaning, integration,manipulation, or curation of data across teams of individuals.Common practice for sharing and collaborating on datasetsinvolves creating or storing multiple copies of the dataset, onefor each stage of analysis, with no provenance informationtracking the relationships between these datasets. This resultsnot only in wasted storage, but also makes it challenging totrack and integrate modifications made by different users to thesame dataset. In this paper, we introduce the Relational DatasetBranching System, Decibel, a new relational storage system withbuilt-in version control designed to address these shortcomings.We present our initial design for Decibel and provide a thoroughevaluation of three versioned storage engine designs that focuson efficient query processing with minimal storage overhead. Wealso develop an exhaustive benchmark to enable the rigoroustesting of these and future versioned storage engine designs.I. I NTRODUCTIONWith the rise of “data science”, individuals increasingly findthemselves working collaboratively to construct, curate, andmanage shared datasets. Consider, for example, researchers ina social media company, such as Facebook or Twitter, working with a historical snapshot of the social graph. Differentresearchers may have different goals: one may be developinga textual analysis to annotate each user in the graph with adkeywords based on recent posts; another may be annotatingedges in the graph with weights that estimate the strengthof the relationship between pairs of users; a third may becleaning the way that location names are attached to usersbecause a particular version of the social media client insertedplace names with improper capitalization. These operationsmay happen concurrently, and often analysts want to performthem on multiple snapshots of the database to measure theeffectiveness of some an algorithm or analysis. Ultimately, theresults of some operations may need to be visible to all users,while others need not be shared with other users or mergedback into the main database.Existing mechanisms to coordinate these kinds of operationson shared databases are often ad hoc. For example, severalcomputational biology groups we interviewed at MIT to motivate our work reported that the way they manage such sharedrepositories is to simply make a new copy of a dataset for eachnew project or group member. Conversations with colleaguesin large companies suggest that practices there are not muchbetter. This ad hoc coordination leads to a number of problems,including: Redundant copies of data, which wastes storage.‡ Universityof Chicagoaelmore@cs.uchicago.edu§ Universityof Maryland (UMD)amol@cs.umd.edu No easy way for users to share enhancements or patchesto datasets with other users or merge them into the“canonical” version of the dataset. No systematic way to record which version of a datasetwas used for an experiment. Often, ad hoc directorystructures or loosely-followed filename conventions areused instead. No easy way to share data with others, or to keep trackof who is using a particular dataset, besides using filesystem permissions.One potential solution to this problem is to use an existingdistributed version control system such as git or mercurial.These tools, however, are not well-suited to versioning largedatasets for several reasons. First, they generally requireeach user to “checkout” a separate, complete copy of adataset, which is impractical within large, multi-gigabyte orterabyte-scale databases. We envision instead a hosted solution,where users issue queries to a server to read or modify recordsin a particular version of the database. Second, because theyare designed to store unstructured data (text and arbitrarybinary objects), they have to use general-purpose differencingtools (like Unix diff) to encode deltas and compare versions.In contrast, deltas in databases can often be encoded aslogical modifications to particular records or fields, whichcan be orders of magnitude smaller than the results of thesegeneral differencing tools, and are not sensitive to the order ofthe records. Moreover, version control systems like these donot provide any of the high-level data management featuresand APIs (e.g., SQL) typically found in database systems,relational or otherwise.In this paper, we address the above limitations by presentingDecibel, a system for managing large collections of relationaldataset versions. Decibel allows users to create working copies(i.e., branches) of a dataset based either off the present stateof a dataset or from prior versions. As in existing versioncontrol systems such as git, many such branches or workingcopies can co-exist, and branches may be merged periodicallyby users. Decibel also allows modifications across differentbranches, or within the same branch.We describe our versioning API and the logical data modelwe adopt for versioned datasets, and then describe severalalternative approaches for physically encoding the branchingstructure. Choosing the right physical data layout is criticalfor achieving good performance and storage efficiency from aversioned data store. Consider a naive physical design thatstores each version in its entirety: if versions substantiallyoverlap (which they generally will), such a scheme will behugely wasteful of space. Moreover, data duplication could

prove costly when performing cross-version operations like diffas it sacrifices the potential for shared computation.In contrast, consider a version-first storage scheme whichstores modifications made to each branch in a separate tablefragment (which we physically store as a file) along withpointers to the table fragments comprising the branch’s directancestors. A linear chain of such fragments thus comprisesthe state of a branch. Since modifications to a branch areco-located within single files, it is easier to read the contents ofa single branch or version by traversing its lineage. However,this structure makes it difficult to perform queries that compareversions, e.g., that ask which versions satisfy a certain propertyor contain a particular tuple [1].As an alternative, we also consider a tuple-first scheme whereevery tuple that has ever existed in any version is stored in asingle table, along with a bitmap to indicate the versions eachtuple exists in. This approach is very efficient for queries thatcompare the contents of versions (because such queries can besupported through bitmap intersections), but can be inefficientfor queries that read a single version since data from manyversions is interleaved.Finally, we propose a hybrid scheme that stores recordsin segmented files like in the version-first scheme, but alsoleverages a collection of bitmaps like those in the tuple-firstscheme to track the version membership of records. For theoperations we consider, this system performs as well or betterthan both schemes above, and also affords a natural parallelismacross most query types.For each of these schemes, we describe the algorithmsrequired to implement key versioning operations, includingversion scans, version differencing, and version merging. Thekey contribution of this paper is a thorough exploration of thetrade-offs between these storage schemes across a variety ofoperations and workloads. Besides describing these schemes,this paper makes several other contributions: We provide the first full-fledged integration of modernversion control ideas with relational databases. We describe our versioning API, our interpretation of versioningsemantics within relational systems, and several implementations of a versioned relational storage engine. We describe a new versioning benchmark we have developed, modeled after several workloads we believe are representative of the use cases we envision. These workloadshave different branching and merging structures, designedto stress different aspects of the storage managers. We provide an evaluation of our storage engines, showing that our proposed hybrid scheme outperforms thetuple-first and version-first schemes on our benchmark.Decibel is a key component of DataHub [2], a collaborativedata analytics platform that we’re building. DataHub includesthe version control features provided by Decibel along withother features such as access control, account management, andbuilt-in data science functionalities such as visualization, datacleaning, and integration. Our vision paper on DataHub [2]briefly alluded to the idea of version and tuple-first storage,but did not describe any details, implementation, or evaluation,and also did not describe the hybrid approach presentedhere (which, as we show, outperforms the other approachessignificantly, sometimes by an order-of-magnitude or more.)Also in recent work [3], we presented algorithms to minimizethe storage and recreation costs of a collection of unstructureddatasets, as opposed to building and evaluating an end-to-endstructured dataset version management system like Decibel.We begin by presenting motivating examples, showing howend users could benefit from Decibel. We then provide anoverview of our versioning API and data model in Section II.A detailed overview of the aforementioned physical storageschemes is presented in Section III. We then describe ourversioned benchmarking strategy in Section IV and the experimental evaluation of our storage models on a range ofversioned query types in Section V.A. Versioning Patterns & ExamplesWe now describe two typical dataset versioning patternsthat we have observed across a wide variety of scenarios. Wedescribe how they motivate the need for Decibel, and capturethe variety of ways in which datasets are versioned and sharedacross individuals and teams. These patterns based on ourdiscussions with domain experts, and inspire the workloadsthat we use to evaluate Decibel in Section V.Science Pattern: This pattern is used by data scientist teams.These data scientists typically begin by taking the latest copyof an evolving dataset, then may perform normalization andcleaning (e.g., remove or merge columns, deal with NULLvalues or outliers), annotate the data with additional derivedfeatures, separate into test and training subsets, and run modelsas part of an iterative process. At the same time, the underlyingdataset that the data scientists started with may typicallyevolve, but often analysts will prefer to limit themselvesto the subset of data available when analysis began. UsingDecibel, such scientists and teams can create a private branchin which their analysis can be run without having to make acomplete copy of the data. They can return to this branch whenrunning a subsequent analysis, or create further branches totest and compare different cleaning or normalization strategies,or different models or features, while retaining the ability toreturn to previous versions of their work.This pattern applies to a variety of data science teamsincluding a) The ads team of a startup, analyzing the impactof the ad campaigns on visitors to websites. b) A physicalscientist team, who would like to build and test models andphysical theories on snapshots of large-scale simulation data.c) A medical data analysis team, analyzing patient care andmedical inefficiencies, who are only allowed to access recordsof patients who have explicitly agreed to such a study (thiscan be used to define branches that the team works on).Curation Pattern: This pattern is used by teams collectivelycurating a structured dataset. While the canonical version of thedataset evolves in a linear chain, curators may work on editing,enhancing, or pruning portions of this dataset via branches, andthen apply these fixes back to the canonical version. Whilethis is cumbersome to do via current data management tools,Decibel can easily support multiple individuals simultaneouslycontributing changes to their branches, and then merging thesechanges back to the canonical version. This way, curatorscan “install and test” changes on branches without exposingpartial changes to other curators or production teams using thecanonical version until updates have been tested and validated.This pattern applies to a variety of data curation teamsincluding a) The team managing the product catalog of a business with individuals who manage different product segments,applying updates to their portion of the catalog in tandem. b) Avolunteer team of community users contributing changes toa collaboratively managed map, e.g. OpenStreetMaps, where

#1Version ARVersion ASRS2Version BbranchcommitVersion BRSVersion CRRSSVersion ERSVersion DBranch 1commitRR3SVersion DmergeSVersion FRMaster BranchS4Master Branch(a)(b)Fig. 1: An Example Workflowindividual users may focus on local regions, adding points ofinterest or fixing detailed geometry or metadata (e.g., one wayinformation) of roads. c) A team of botanists collaborativelycontributing to a dataset containing the canonical properties ofplants found in a tropical rainforest.II. D ECIBEL API AND A RCHITECTUREWe begin with a brief overview of the Decibel architecturebefore describing the version control model and API thatDecibel provides.A. ArchitectureWe now briefly summarize the architecture of Decibel.Decibel is implemented in Java, on top of the MIT SimpleDBdatabase. In this paper, we focus on the design of the Decibelstorage engine, which is a new version-optimized data storagesystem able to implement the core operations to scan, filter,difference, and merge branching data sets. Note, however, thatDecibel does support general SQL query plans, but most of ourquery evaluation (joins, aggregates) is done in the (unmodified)SimpleDB query planning layer. The changes we made forDecibel were localized to the storage layer. The storage layerreads in data from one of the storage schemes, storing pages ina fairly conventional buffer pool architecture with (with 4 MBpages), exposing iterators over different single versions of datasets. The buffer pool also encompasses a lock manager used forconcurrency control. In addition to this buffer pool we store anadditional version graph on disk and in memory. In this paperwe focus on the versioned storage manager and versioning datastructures, with support for versioning operations in severaldifferent storage schemes, not the design of the query executor.In the rest of this section, we describe Decibel query language.B. Decibel Model and APIWe first describe the logical data model that we use, andthen describe the version control API in detail. We describethese concepts in the context of Figure 1, where (a) and (b)depict two evolution patterns of a dataset.1) Data Model.: Decibel uses a very flexible logical datamodel, where the main unit of storage is the dataset. Adataset is a collection of relations, each of which consists ofa collection of records. Each relation in each dataset musthave a well-defined primary key; the primary key is used totrack records across different versions or branches, and thusis expected to be immutable (a change to the primary keyattribute, in effect, creates a new record). For the same reason,primary keys should not be reused across semantically distinctrecords; however, we note that Decibel does not attempt toenforce either of these two properties.2) Version Control Model: Decibel uses a version controlmodel that is similar to that of software version control systemslike git. In Decibel, a version consists of a point-in-timeQuery TypeSingle version scan:find all tuples in relationR in version v01Multiple version positive diff:positive diff relation R betweenversions v01 and v02Multiple version join:join tuples in R inversions v01 and v02satisfying Name SamSeveral version scan:find all head versionsof relation RSQL EquivalentSELECT * FROM RWHERE R.Version ‘v01’SELECT * FROM RWHERE R.Version ‘v01’ AND R.idNOT IN (SELECT id from RWHERE R.Version ‘v02’)SELECT * FROM R as R1, R as R2 WHERER1.Version ‘v01’ AND R2.Version ‘v02’AND R1.id R2.id AND R1.Name ‘Sam’SELECT * FROM R WHEREHEAD(R.Version) trueTABLE I: Sample Queriessnapshot of one or more relations that are semantically groupedtogether into a dataset (in some sense, it is equivalent tothe notion of a commit in git/svn). For instance, VersionsA—D in Figure 1(a) all denote versions of a dataset thatcontain two relations, R and S. A version, identified by anID, is immutable and any update to a version conceptuallyresults in a new version with a different version ID (as wediscuss later in depth, the physical data structures are notnecessarily immutable and we would typically not want tocopy all the data over, but rather maintain differences). Newversions can also be created by merging two or more versions(e.g., Version F in Figure 1(b)), or through the application oftransformation programs to one or more existing versions (e.g.,Version B from Version A in Figure 1(a)). The version-levelprovenance that captures these processes is maintained as adirected acyclic graph, called a version graph; the nodes andedges in Figure 1(a) or (b) comprise a version graph.In Decibel, a branch denotes a working copy of a dataset.There is an active branch corresponding to every leaf node orversion in the version graph. Logically, a branch is comprisedof the history of versions that occur in the path from the branchleaf to the root of the version graph. For instance, in Figure 1(a)there are two branches, one corresponding to Version D andone corresponding to C. On the other hand, in Figure 1(b) thereis a single branch corresponding to version F. The initial branchcreated is designated the master branch, which serves as theauthoritative branch of record for the evolving dataset. Thus, aversion can be seen as capturing a series of modifications to abranch, creating a point-in-time snapshot of a branch’s content.The leaf version, i.e., the (chronologically) latest version in abranch is called its head; it is expected that most operationswill occur on the heads of the branches.3) Decibel Operational Semantics: We now describe thesemantics of each of the core operations of the versioncontrol workflow described above as implemented in Decibel.Although the core operations Decibel supports are analogousto operations supported by systems like git, unlike those,Decibel supports in-place modifications to the data and needsto support both version control commands as well as datadefinition and manipulation commands. We will describe theseoperations in the context of Figure 1(a).Users interact with Decibel by opening a connection to theDecibel server, which creates a session. A session captures theuser’s state, i.e., the commit (or the branch) that the operationsthe user issues will read or modify. Concurrent transactions bymultiple users on the same version (but different sessions) areisolated from each other through two-phase locking.Init: The repository is initialized, i.e., the first version (VersionA in the figure) is created, using a special init transaction thatcreates the two tables as well as populates them with initial

data (if needed). At this point, there is only a single Masterbranch with a single version in it (which is also its head).Commit and Checkout: Commits create new versions ofdatasets, adding an extra node to one of the existing branchesin the version graph. Suppose a user increments the valuesof the second column by one for each record in relation R,then commits the change as Version B on the Master branch.This commit in Decibel creates a ne

dataset, which is impractical within large, multi-gigabyte or terabyte-scale databases. We envision instead a hosted solution, where users issue queries to a server to read or modify records in a particular version of the database. Second, because they are designed to store unstructured data (text and arbitrary

Related Documents:

May 02, 2018 · D. Program Evaluation ͟The organization has provided a description of the framework for how each program will be evaluated. The framework should include all the elements below: ͟The evaluation methods are cost-effective for the organization ͟Quantitative and qualitative data is being collected (at Basics tier, data collection must have begun)

Silat is a combative art of self-defense and survival rooted from Matay archipelago. It was traced at thé early of Langkasuka Kingdom (2nd century CE) till thé reign of Melaka (Malaysia) Sultanate era (13th century). Silat has now evolved to become part of social culture and tradition with thé appearance of a fine physical and spiritual .

On an exceptional basis, Member States may request UNESCO to provide thé candidates with access to thé platform so they can complète thé form by themselves. Thèse requests must be addressed to esd rize unesco. or by 15 A ril 2021 UNESCO will provide thé nomineewith accessto thé platform via their émail address.

̶The leading indicator of employee engagement is based on the quality of the relationship between employee and supervisor Empower your managers! ̶Help them understand the impact on the organization ̶Share important changes, plan options, tasks, and deadlines ̶Provide key messages and talking points ̶Prepare them to answer employee questions

Dr. Sunita Bharatwal** Dr. Pawan Garga*** Abstract Customer satisfaction is derived from thè functionalities and values, a product or Service can provide. The current study aims to segregate thè dimensions of ordine Service quality and gather insights on its impact on web shopping. The trends of purchases have

Feature Branching: Branching based on the features to be developed Release Branching: A branch is created to stabilize the release and later merged to main branch after the release. Quality Branching: Branching done for different teams with focus on quality TFS allows merging of code from different branches.

Chính Văn.- Còn đức Thế tôn thì tuệ giác cực kỳ trong sạch 8: hiện hành bất nhị 9, đạt đến vô tướng 10, đứng vào chỗ đứng của các đức Thế tôn 11, thể hiện tính bình đẳng của các Ngài, đến chỗ không còn chướng ngại 12, giáo pháp không thể khuynh đảo, tâm thức không bị cản trở, cái được

The Relational Algebra A procedural query language Comprised of relational algebra operations Relational operations: Take one or two relations as input Produce a relation as output Relational operations can be composed together Each operation produces a relation A query is simply a relational algebra expression Six "fundamental" relational operations