Management Of Flexible Schema Data In RDBMSs

2y ago
31 Views
2 Downloads
279.58 KB
9 Pages
Last View : 1d ago
Last Download : 3m ago
Upload by : Ciara Libby
Transcription

Management of Flexible Schema Data in RDBMSs- Opportunities and Limitations for NoSQL Zhen Hua Liu, Dieter GawlickOracle Corporation500 Oracle ParkwayRedwood Shores, CA 94065, USA{zhen.liu, dieter.gawlick}@oracle.com1. ABSTRACTRDBMSs are designed to manage well-structured data requiringusers to design a schema before storing and querying data. Thisis the ‘schema first, data later’ approach. However, there aresignificant amount of unstructured data and semi-structured datathat cannot be effectively modelled this way. Even if certainparts of the data can be modelled using schema, the inclusion ofall fields would typically lead to a very large schema with manyoptional fields and with frequent schema evolution as datainstances vary widely and evolve fast. Obviously, these datarequires the ‘data first, schema later/never’ approach. We callthese data Flexible Schema Data (FSD). In this paper, wedescribe the engineering principles and practices to manage FSDin RDBMSs to meet FSD’s unique requirements and challenges.We describe the limitations and issues of current practices andpotential research opportunities. Having a single data platformfor managing both well-structured data and FSD is beneficial tousers; this approach reduces significantly integration, migration,development, maintenance, and operational issues.Categories and Subject DescriptorsH.2.4 [Database Management]: Systems – Relationaldatabases, transaction processing.General TermsAlgorithms, Management, Performance, Design, Languages,Standardization.KeywordsJSON, SQL/JSON, Schema-less, No-SQL, XML, SQL/XML,Flexible Schema, MongoDB2. INTRODUCTIONThe focus on the ‘schema first, data later’ approach has so farprevented RDBMSs from being the ideal platform of managingFSD. Instead, FSD like support has been implemented inspecialized DBMSs due to RDBMS is found to be inadequate tosupport schema evolution [43] to handle data whose structurechanges a lot over time.This article is published under a Creative Commons Attribution /), which permits distribution andreproduction in any medium as well allowing derivative works, provided thatyou attribute the original work to the author(s) and CIDR 2015.7th Biennial Conference on Innovative Data Systems Research (CIDR ’15)January 4-7, 2015, Asilomar, California, USA.For managing unstructured documents, it is common to usecontent management systems that store documents as files withtext index providing keyword search [13, 16]. For managingdocument-oriented semi-structured data, such as XML,MarkLogic NoSQL system [34] is popular with XQuery asquery language. As JSON becomes the data-centric semistructured data format, MongoDB [33] based NoSQL systemswith JSON specific query language become a popular choice formanaging JSON data. Polygolt storage with NoSQL [40] isgetting popular. As the volume of FSD grows at an ever fasterrate, the trend towards using these specialized NoSQL [4] baseddatabase systems accelerates. This trend leads to significantlyincreased complexity of the management of data since userscannot manage all of their data in one platform. When usershave to work with different data platforms, they have to writedata integration code in their applications. Users can not queryall of their data using a single high level declarative querylanguage. Instead, they have to use different query languages forquerying different data and implement their own join algorithmsto join between relational data and FSD. Last, but not the least,many specialized systems lack essential advanced functionality,such as bi-temporality, provenance, and fine grain security thatare standard in modern RDBMSs.In contrary to [2], our goal is to enable RDBMSs to manageFSD along with relational data and thereby leveraging all theadvanced data management services that have been developedover many years for relational data. All leading RDBMSplatforms have supported XML data management using theSQL/XML [10] query language during the last decade so thatXML and relational data can be queried and managed together.Lately, Oracle [21], Vertica [24], TeraData [25], Postgres SQL[7], and Sinew Hadapt System [22] are all supporting JSON datamanagement by extending SQL so that JSON and relational datacan be queried and managed together in an RDBMS. Thebenefits of extending RDBMSs to manage FSD are: Enabling schema-less data application developmentparadigm (data first, schema later/never) in RDBMS. Enable agile style rapid data access with maximum schemaflexibility (Schema on Read but not on Write Paradigm) inRDBMS. Efficient consolidated single data management platform –covering both relational data and FSD to reduce integrationissues, simplify operations, and eliminate migration issues.

Efficient productive declarative application development –by leveraging SQL as a set-oriented query language todeclaratively query domain specific FSD.We understand the rationale of “One Size Does Not Fit All” [31]argument as a way to encourage out-of-box thinking and rearchitecting RDBMSs to handle a variety of new challengingdata management requirements that do not fit the originalrelational data management paradigm that has been establishedmore than four decades ago. However, it is desirable to present asingle system which hides the complexity of multiplearchitectures instead of having users to manage multiplesystems. Therefore, we prefer to build an evolutionary path forextending RDBMSs to support FSD data management.Nevertheless, since FSD management has challenged thefundamental assumption of RDBMSs that require existence ofschema to store, index and query data, we do need new ways ofthinking how to store, query, update and index FSD differentlyfrom relational data. That is, we need to think out-of-theschema. Indeed, management of FSD challenges us to thinkhow to store, query and index data without up-front schemadefinition?To accomplish this goal, we leverage the RDBMS extensibilitytechnology for managing user defined object types, functionsand indexes [12, 14]. Applying extensibility ideas leads us to thecurrent engineering principles and practises for managing FSDin RDBMSs as follows: Storage Principle: Use the document-object-store modelby storing FSD as one object without relying on any staticschema & E/R model to decompose FSD into relationaltables. That is, no schema on write. Flexible schema that isembedded in FSD can be computed as data-guide toprovide schema on read capability. Query and Update Principle: Leverage SQL as adeclarative Set-oriented Query Language. That is, position“NoSQL” to mean Naturally open Set-oriented QueryLanguage to embed FSD domain specific query language.FSD domain language provides query and navigationcapability for both schema and data for each FSD instance. Index Principle: Index FSD using relational table indexand search index. The relational table index derivable fromdata-guide and query workload provides efficient relationalaccess for pre-defined query access patterns. The searchindex using generalized inverted index strategy providesefficient search for ad-hoc query access patterns.The main contribution of this paper is a detailed analysis anddiscussions of these principles to understand the rationale ofwhy we propose to use these principles to manage FSD, theissues and limitations when practising these principles, and thepotential new research challenges and opportunities to manageFSD equally well as that of relational data in an integratedRDBMS platform. Although database extensibility technology iswell-known as RDBMS engineering practise [12,14], we foundthat abstracting this engineering practise using these threeprinciples helps to more adequately address the challenges ofFSD.Outline of the Paper: Section 3, 4, 5 goes into details forstoring, querying, updating and indexing FSD respectively withsection 6 on advanced data management capability for FSD.Section 7 draws conclusion followed by the acknowledgementsin section 8.3. Storing FSD3.1 FSD Storage RequirementsThe relational design leverages the E/R model [5] whichprovides a clean separation between structure and data. Thismethod has been very successful for a large class of applicationsby extracting common structures out of data as schema.Schemata are managed by RDBMSs in a central dictionary.Therefore, in the E/R model, a schema has to be defined beforedata can be loaded.A collection of FSD data, such as JSON objects, XMLdocuments, has typically a small number of common attributescomplemented by a large variety of non-common attributes. Theattributes form hierarchical structural relationships. Thestructure is not easily separable from data content because thestructure varies greatly from instance to instance. ShreddingFSD collections relationally results in a large number of tablesjoined by a large number of primary/foreign key relationshipsand still many tables have many sparsely populated columns [1,3]. Furthermore, constant schema evolution is required as newsparse attributes are detected in new FSD instances or singleoccurrence of an existing attribute is detected to have multipleoccurrences in new FSD instances. Therefore, this is not ascalable solution. Instead, the instance schema is embedded ineach FSD instance so that each FSD instance is self-containedand can be distributed to different tiers. Schemata of a FSDcollection are not managed as central dictionary data but rathercomputable dynamically as data-guide [19] from all FSDinstances stored in a FSD collection.Abstractly, schema based data can be defined as a set of data(which is denoted as 'S') that satisfies the following properties:there exists a set of finite size of dimension (which is denoted as'D') such that every element of S can be expressed as a linearcombination of elements from D.Flexible schema based data is the negation of Schema baseddata. That is, there does NOT exit a set of finite size ofdimension D such that every element of S can be expressed as alinear combination of elements from set D. Intuitively, schemabased data can have unbounded number of elements but has abounded dimensions as schema definition whereas flexibleschema based data has unbounded dimensions.Because schema based data has finite dimensions, therefore,schema based data can be processed by separating the dataaway from its dimension so that an element in a schema baseddata set can be expressed by a vector of values, each of whichrepresents the projection of the element in a particulardimension. All the dimensions are known as schema. Flexibleschema based data cannot be processed by separating the dataaway from its dimension. Each element in a flexible schemabased data has to keep track of its dimensions and thecorresponding value. An element in a flexible schema baseddata is expressed by a vector of dimension and value (name-

value pair). Therefore, flexible schema based data requires store,query and index both schema and data together.3.2 FSD Storage Current PractisesSelf-contained Document-object-store model: The currentpractice for storing FSD is to store FSD instances in a FSDcollection using document-object-store model where bothstructure and data are stored together for each FSD instance sothat it is self-descriptive without relying on a central schemadictionary. New structures can be added on a per-record basiswithout dealing with schema evolution. Aggregated storagesupports full document-object retrieval efficiently without thecost of querying and stitching pieces of data from multiplerelational tables. Each FSD instance can be independentlyimported, exported, distributed without any schema dependency.Table1 shows DDL to create resumeDoc tab collection ofresume XML documents, a shoppingCar tab collection ofshopping cart JSON objects. SQL/XML standard defines XMLas a built-in datatype in SQL. For upcoming SQL/JSONstandard [21], it supports storing JSON in SQL varchar,varbinary, CLOB, BLOB datatype with the new ‘IS JSON’check constraint to ensure the data stored in the column is avalid JSON object. Adding a new domain FSD by storing intoexisting SQL datatype, such as varchar or LOB, without addinga new SQL type allows the new domain FSD to have full dataoperational completeness capability (Transactions, Replication,Partition, Security, Provenance, Export/Export, Client APIs etc)support with minimal development efforts.T1T2CREATE TABLE resumeDoc tab(id number, docEnterDate date,docVerifyDate date, resume XMLType)CREATE TABLE shoppingCar tab(oid number, shoppingCar BLOB check (shoppingCar ISJSON))Table 1 – Document-Object-Store Table ExamplesData-Guide as soft Schema: The data-guide can be computedfrom FSD collections to understand the complete structures ofthe data which helps to form queries over FSD collection. Thatis, FSD management with data-guide supports the paradigm of“storage without schema but query with schema”. For commontop-level scalar attributes that exist in all FSD instances of aFSD collection, they can be automatically projected out asvirtual columns or flexible table view [21, 22, 24]. For nestedmaster-detail hierarchical structures exist in FSD instances,relational table indexes [11] and materialized views [35], aredefined using FSD TABLE() table function (Q4 in Table 2).They can be built as secondary structures on top of the primaryhierarchical FSD storage to provide efficient relational viewaccess of FSD. FSD TABLE() serves as a bridge between FSDdata and relational data. They are flexible because they can becreated on demand. See section 5.2 for how to manageFSD TABLE() and virtual columns as indexing or in-memorycolumnar structures. Furthermore, to ensure data integrity, softschema can be defined as check constraint as verificationmechanism but not storage mechanism.3.3 FSD Storage Limitations and Research ChallengesSingle Hierarchy: The document-object-storage model isessentially a de-normalized storage model with single roothierarchy. When XML support was added into RDBMSs, theIMS hierarchical data model issues were brought up [32].Fundamentally, the hierarchy storage model re-surfaces thesingle root hierarchy problem that relational model has resolvedsuccessfully. In particular, supporting m-n relationship in onehierarchy is quite awkward. Therefore, a research challenge ishow to resolve single hierarchy problem in document-objectstorage mode that satisfies ‘data first, structural later’requirement. Is there an aggregated storage model, other thanE/R model, that can support multi-hierarchy access efficiently?Papers [20, 23] have proposed ideas on approaching certainaspects of this problem.Optimal instance level binary FSD format: The documentobject-storage model is essentially a de-normalized storagewhere master and detail data are stored together as onehierarchical tree structure, therefore, it is feasible to achievebetter query performance than with normalized storage at theexpense of update. Other than storing FSD instances in textualform, they can also be stored in a compact binary form native tothe FSD domain data so that the binary storage format can beused to efficiently process FSD domain specific query language[3, 22]. In particular, since FSD is a hierarchical structurebased, the domain language for hierarchical data is path-driven.The underlying native binary storage form of FSD is treenavigation friendly which improves significant performanceimprovement than text parsing based processing. The challengein designing the binary storage format of FSD instance is tooptimize the format for both query and update. A query friendlyformat typically uses compact structures to achieve ultra queryperformance while leaving no room for accommodating update,especially for the delta-update of a FSD instance involvingstructural change instead of just leaf value change. The currentpractise is to do full FSD instance update physically even thoughlogically only components of a FSD instance need to beupdated. Although typically a FSD instance is of small tomedium size, the update may still cause larger transaction logthan updating simple relational columns. A command levellogging approach [27] can be investigated to see if it is optimalfor high frequent delta-update of FSD instances.Optimal FSD instance size: Although the size of FSDcollections can be scaled to very large number, in practise, eachFSD instances is of small to medium size instead of single largesize. In fact, many vendors have imposed size limit per FSDinstance. This is because each FSD instance provides a logicalunit for concurrency access control, document and Index updateand logging granularity. Supporting single large FSD instancerequires RDBMS locking, logging to provide intra-documentscalability [43] in addition to the current mature inter-documentscalability.4. Querying and Updating FSD4.1 FSD Query and Update RequirementsA FSD collection is stored as a table of FSD instances. A FSDinstance itself is domain specific and typically has its owndomain-specific query language. For FSD of XML documents,the domain-specific query language is XQuery. For FSD ofJSON objects, the domain-specific query language is theSQL/JSON path language as described in [21]. Table 2 showsthe example of SQL/XML[10] and SQL/JSON[21] queries and

DML statements embedding XQuery and SQL/JSON pathlanguage. In general, the domain-specific query languageprovides the following requirements: Capability of querying and navigating document-objectstructures declaratively: A FSD instance is not shreddedinto tables since hierarchies in a FSD can be flexible anddynamic without being modelled as a fixed master-detailjoin pattern. Therefore, it is natural to express hierarchicaltraversal of FSD as path navigation with value predicateconstructs in the FSD domain language. The path namecan contain a wildcard name match and the path step can berecursive to facilitate exploratory query of the FSD data.For example, capabilities of the wildcard tag name matchand recursive descendant tag match in XPath expressionssupport the notation of navigating structures withoutknowing the exact names or the exact hierarchy of thestructures. See ‘.//experience’ XPath expression in Q1 andQ2. Such capability is needed to provide flexibility ofwriting explorative and discovery queries.Capability of doing full context aware text searchdeclaratively: FSD instances can be document centric withmixture of textual content and structures. There is asignificant amount of full text content in FSD that aresubject to full text search. However, unlike plain textualdocument, FSD has text content that is embedded insidehierarchical structure. Full text search can be furtherconfined within a context identified by path navigation intothe FSD instance. Therefore, context aware full text searchis needed in FSD domain languages. See XQuery full textsearch expression in XMLEXISTS() predicate of Q1 andQ2 and path-aware full text search expression inJSON TEXTCONTAINS() predicate of Q3.Capability of projecting, transforming objectcomponent and constructing new document or object:Unlike relational query results which are tuples of scalardata, results of path navigational queries can be fragmentsof FSD. New FSD can be constructed by extractingcomponents of existing FSD and combine them throughconstruction and transformation. Therefore, constructingand transforming FSD instances are required in any FSDlanguage. See XQuery constructor expression in theXMLQUERY() function in Q1.provides the necessary constructs to express set algebraoperators, such as selection, projection, join, group by,aggregation, union, intersection and difference among FSDinstances. SQL is openable to support a set of FSD XXX()functions that can embed FSD domain specific query language.These FSD XXX() functions are used in strategic places in SQLto filter, process, transform and update FSD instances. SeeFigure 1 for details.FSD Filtering: FSD EXISTS() is used as a conditionalexpression in a

support schema evolution [43] to handle data whose structure . by storing FSD as one object without relying on any static schema & E/R model to decompose FSD into relational tables. That is, no schema on write . Flexible schema that is . schema ba

Related Documents:

HS-PORTAL 150 Hebe-Schiebe-Türbeschlag für Holzelemente Schema-Übersicht und allgemeine Hinweise 3.1.2 Ausführbar mit Führungsschiene HH0130-01/-02 KH0130-01 Schema A Schema D Schema G Schema G-2 Schema G-3 Schema H Schema C Schema F Schema K Schema E Schema L Achtung: Die für den SIEGENIA-AUBI-Beschlag HS-PORTAL 150 angegebenen .

The totality of these behaviors is the graph schema. Drawing a graph schema . The best way to represent a graph schema is, of course, a graph. This is how the graph schema looks for the classic Tinkerpop graph. Figure 2: Example graph schema shown as a property graph . The graph schema is pretty much a property-graph.

Our framework of schema management is designed for docu-ment stores, which includes three components as shown in Fig. 2, schema extraction and discovery component, repository compo-nent, and schema consuming component with two functions of Schema Extraction and Discovery This component provi

developers to understand and analyze schema evolution in schema-less NoSQL data stores. Our approach, summarized in Figure 3, is made up of three phases, namely schema . find a particular author based on a given identifier; (2) line 5 . authorQuery object. By analyzing the usage flow of this given

This component is responsible for all the functions related to visualization of the mapping process. At present, one visualization m opened for mapping, one local schema and one global schema. This terminology a from the ETANA-DL domain. A local schema denotes the schema for any new s data must be harvested into the Union Catalog.

figure 8 (also figure 41) : visualization schemas and its relationship to database schema. database schema builds on top of databases and visualization schema builds on top of database schema.10 figure 9 - a user can connect snap to a network or local database in order to begin visualization construction.

XML to JSON converter it might be possible to translate a single XML document into its JSON equivalent, but this will not be an option for the Schema instances: One can not simply convert a given XML Schema document with the hope, that the resulting JSON document will be a valid JSON Schema document as well, because the semantics of both .

Arduinos, Lego Mindstorms, Scratch, Python, Sketchup and much more. Pupils can work on their own projects or follow tutorials to build amazing computing creations! Computing Top-Up* Years 7-11 Tailored to both Computing enthusiasts and technophobes, Computing Top-Up provides a chance to reinforce learning done in lesson time and to explore exciting new avenues of technology. Debating Society .