A Study Of NoSQL Database - IJERT


International Journal of Engineering Research & Technology (IJERT)
ISSN: 2278-0181
Vol. 3 Issue 4, April - 2014
IJERTV3IS041265, www.ijert.org

A Study of NoSQL Database

Biswajeet Sethi (1), Samaresh Mishra (2), Prasant Ku. Patnaik (3)
(1, 2, 3) School of Computer Engineering, KIIT University, Bhubaneswar, India

Abstract: Some applications of Web 2.0 services need to handle big data. This requires the existing relational database to scale horizontally in order to meet the demand for high performance, especially for applications that hold large volumes of user data and serve highly concurrent access. These issues led designers to come up with a new group of databases, popularly known as NoSQL. The growing demand for cloud computing and the development of the Internet motivate the NoSQL movement. This paper deals with the features and data models of NoSQL databases used in cloud computing environments, along with the strengths and limitations of each model. In addition, this paper discusses the classification of NoSQL databases based upon the CAP theorem.

Keywords: NoSQL; CAP; Document Oriented; Column Family; Big Data.

I. INTRODUCTION

Efficient storage and retrieval of data with availability and scalability is the main purpose of NoSQL databases. NoSQL does not stand for "no to SQL"; it means "Not Only SQL" [10]. A NoSQL database is an alternative to the traditional relational database. The database industry has seen the introduction of many non-relational databases such as MongoDB [11], HBase [9] and Neo4j [8] in the last few years. Depending upon business requirements and strategy, a cloud vendor can go with any of these database types. Still, some pro-relational database designers claim that NoSQL databases are not efficient enough in handling data integrity.

This paper is organized as follows: Section 2 describes the importance of NoSQL databases. Section 3 presents the NoSQL data models. Section 4 discusses transactions in NoSQL databases. Section 5 compares NoSQL databases. Finally, we conclude the paper in Section 6.

II. IMPORTANCE OF NOSQL

A. Background

For the last couple of years, SQL vs. NoSQL has been a heated argument on the Internet. The argument "SQL vs. NoSQL" is actually about relational versus non-relational databases. Because of its normalized data model and enforcement of strict ACID properties, the traditional relational database is considered a schema-based, transaction-oriented database. It requires a strict predefined schema prior to storing data. Redefining the schema after data has been inserted into the database is disruptive, whereas in the era of Big Data there is a constant need to add new types of data to enrich applications. The storage solution of a relational database can also have a big impact on speed and scalability. Web services like Amazon and Google have terabytes and petabytes of data stored in their big data centers and have to respond to massive read-write requests without noticeable latency. To scale a relational database, data needs to be distributed over multiple servers. Before it is provided to the application, the desired information has to be collected from many tables and combined. Similarly, a write has to be performed on many tables in a coordinated manner. For any application, handling tables across multiple servers can be a bottleneck. In relational databases the 'join' operation slows the system to a crawl, especially when millions of users are doing lookups against tables with millions of rows of data. Large-scale web services such as Google, Amazon, Yahoo and Facebook found these reasons enough to develop their own non-relational databases in order to meet their scalability and performance needs.

B. Features of NoSQL

NoSQL databases may not require a predefined table schema, typically scale horizontally and usually avoid join operations. Because NoSQL systems are schema-less and typically involve analysis of smaller subsets of data, they can be better described as structured data stores. Three important basic features of NoSQL databases are scale-out, flexible data structure and replication, which are explained as follows.

• Scale-out: Scaling out refers to achieving high performance in a distributed environment by using many general-purpose machines. NoSQL databases allow the data to be distributed over a large number of machines, with the processing load distributed as well. Many NoSQL databases automatically distribute data to new machines when they are added to the cluster. Scale-out is evaluated in terms of scalability and elasticity.

• Flexibility: Flexibility in terms of data structure means that there is no need to define a schema for the database. NoSQL databases do not require a predefined schema. This allows users to store data of various structures in the same database table. However, high-level query languages such as SQL are not supported by most NoSQL databases.

• Data Replication: One of the features of NoSQL databases is data replication. In this process a copy of the data is distributed to different systems in order to achieve redundancy and load distribution. However, there is a chance of losing data consistency among the replicas, although this consistency may be achieved eventually. Consistency and availability are the factors for evaluating replication [3].
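The scale-out and replication features above can be illustrated with a small sketch. It is not taken from any particular NoSQL product: the `Cluster` class, the node names, the modulo placement rule and the sample value are all assumptions made for illustration. Each key is hashed to a primary machine and copied to the following machines, so a read can still be served when one machine is unavailable.

```python
import hashlib


class Cluster:
    """Toy cluster: each key is hashed to a primary node and copied to replicas."""

    def __init__(self, node_names, replication_factor=2):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication_factor must be between 1 and the node count")
        self.node_names = list(node_names)
        self.replication_factor = replication_factor
        # One in-memory dict per machine stands in for that machine's storage.
        self.storage = {name: {} for name in self.node_names}

    def nodes_for(self, key):
        """Return the primary node for `key` followed by its replica nodes."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [
            self.node_names[(start + i) % len(self.node_names)]
            for i in range(self.replication_factor)
        ]

    def put(self, key, value):
        """Write the value to the primary and to every replica."""
        for name in self.nodes_for(key):
            self.storage[name][key] = value

    def get(self, key, failed=()):
        """Read from the first responsible node that is not listed in `failed`."""
        for name in self.nodes_for(key):
            if name not in failed:
                return self.storage[name][key]
        raise KeyError(f"no live replica holds {key!r}")


cluster = Cluster(["node-a", "node-b", "node-c"], replication_factor=2)
cluster.put("Emp102", {"FirstNm": "Asha"})  # hypothetical sample record
primary = cluster.nodes_for("Emp102")[0]
# The read still succeeds when the primary machine is treated as failed.
print(cluster.get("Emp102", failed={primary}))
```

The modulo placement used here would move most keys whenever a machine is added, so it does not show the automatic redistribution mentioned above; it only shows how hashing spreads keys and how replicas provide redundancy.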

III. NOSQL DATA MODELS

The categories of NoSQL database models are discussed as follows [1][2][4].

A. Key-Value Data Stores

Key-value stores are the category of NoSQL designed to handle highly concurrent access to the database. This is the simplest, yet the most powerful, kind of data store. In a key-value store each data item consists of a pair of a unique key and a value. To save data, the application generates a key, associates a value with it and submits the key-value pair to the data store. The values stored in key-value stores can have dynamic sets of attributes attached to them and are opaque to the database management system. Hence the key is the only means of accessing the data values. The type of binding from the key to the value depends on the programming language used in the application. An application needs to provide a key to the data store in order to retrieve data. Many key-value data stores use a hash function: the application hashes the key to find the location of the data in the database. Key-value data stores are row focused, which means that they enable the application to retrieve data for complete entities.

Fig. 1 describes retrieval of data from a key-value database. The application has specified the key 'Emp102' to the data store in order to retrieve data. Using the hash function, the application hashes the key in order to trace the location of the data in the data store. The design of the key should support the most frequent queries fired on the data store. The efficiency of the hash function, the design of the key and the size of the values being stored are the factors which affect the performance of a key-value data store. The operations performed on such data stores are mostly limited to read and write operations. Because of its simplicity, the key-value data store provides users with the fastest means of storing and fetching data. All other categories of NoSQL are built upon the simplicity, scalability and performance of key-value data stores. Redis, Voldemort and Membase are examples of prominent key-value data stores.

Fig. 1. An example of a key-value data store.

B. Document Oriented Data Stores

At an abstract level a document oriented database is similar to a key-value data store. It also holds a value, which an application can read or fetch by using a key. Several document databases automatically generate the unique key while creating a new document. A document in a document database is an entity, which is a collection of named fields. The feature which distinguishes the document oriented database from a key-value data store is the transparency of the data held by the database. Hence querying is not restricted to the key only. Applications that need to query the database not only by key but also by attribute values can switch to document databases. A document needs to be self-describing in a document oriented database. Information is stored in a portable and well understood format such as XML, BSON or JSON.

As shown in Fig. 2, the document database stores data in the form of key-value pairs. But, unlike in key-value databases, the data stored in the database is transparent to the system. The application can query the database not only with the key, i.e. 'Employee ID', but also with the defined fields in the document, i.e. FirstNm, LastNm, age etc. Document data stores are an efficient approach to modeling data based on common software problems. But this comes at the cost of slightly lower performance and scalability in comparison to key-value data stores. A few of the most prominent document stores are Riak, MongoDB [11] and CouchDB.

Fig. 2. An example of a document data store.
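The difference between the two models described in Sections III-A and III-B can be sketched in a few lines. The classes below are illustrative assumptions, not the API of Redis, MongoDB or any other product, and the employee records are made-up sample data that reuse the field names from the text (FirstNm, LastNm, age). The key-value store treats each value as opaque bytes located through a hash of the key, while the document store can look inside its documents and filter on non-key fields.

```python
import json


class KeyValueStore:
    """Values are opaque bytes; the key is the only way to reach them."""

    def __init__(self, bucket_count=8):
        self.buckets = [{} for _ in range(bucket_count)]

    def _bucket(self, key):
        # Python's built-in hash() stands in for the store's hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value: bytes):
        self._bucket(key)[key] = value

    def get(self, key) -> bytes:
        return self._bucket(key)[key]


class DocumentStore:
    """Documents are dicts of named fields that the store itself can inspect."""

    def __init__(self):
        self.documents = {}

    def insert(self, key, document: dict):
        self.documents[key] = document

    def get(self, key) -> dict:
        return self.documents[key]

    def find(self, **criteria):
        """Return the keys of documents whose fields equal all given criteria."""
        return [
            key
            for key, doc in self.documents.items()
            if all(doc.get(field) == wanted for field, wanted in criteria.items())
        ]


employee = {"FirstNm": "Ravi", "LastNm": "Das", "age": 31}  # hypothetical record

kv = KeyValueStore()
kv.put("Emp102", json.dumps(employee).encode("utf-8"))
# The application, not the store, must decode the value to see its fields.
print(json.loads(kv.get("Emp102"))["FirstNm"])

docs = DocumentStore()
docs.insert("Emp102", employee)
docs.insert("Emp103", {"FirstNm": "Mina", "LastNm": "Das", "age": 28})
# The document store can answer a query on fields other than the key.
print(docs.find(LastNm="Das", age=28))
```

The `find` method scans every document, which keeps the sketch short; real document stores typically maintain indexes on queried fields instead.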
C. Column Family Data Stores

Sometimes an application may want to read or fetch a subset of fields, similar to SQL's projection operation. A column family data store enables storing data in a column-centric approach. The column family data store partitions the key space. In NoSQL a key space is considered to be an object which holds all column families of a design together. It is the outermost grouping of the data in the data store. Each partition of the key space is known as a table. Column families are declared by these tables. Each column family consists of a number of columns. A row in a column family is structured as a collection of an arbitrary number of columns. Each column is a map of a key-value pair. In this map, keys are the names of columns and the columns themselves are the values. Each of these mappings is called a cell. Each row in a column family database is identified by a unique row key, defined by the application. Use of these row keys makes data retrieval quicker. In order to avoid overwriting cell values, a few of the popular column-family databases automatically add timestamp information to individual columns. Every time there is an update, a new version is created of the cells which have been affected by the update operation. A reader always reads the value that was last written or committed. A row key, column family, column and timestamp constitute a key. Hence the exact mapping can be represented as

(row key, column family, column, timestamp) → value.

A generalized structure of a column family database is shown in Fig. 3.

Fig. 3. Column-family data store.

This data model has been popularly accepted as a "sparse, distributed, persistent multidimensional sorted map" [4]. An advantage of using a column family data store over a traditional database is in handling NULL values. In a relational database, when a value for an attribute is not applicable for a particular row, NULL gets stored. In a column family database, by contrast, the column can simply be left out of the corresponding row when the data is not available. That is why Google calls it a sparse database. One of the key features of this database is that it can be distributed in billions of cells over thousands of machines. The cells are sorted on the basis of row keys. Sorting of keys allows searching data for a range of keys. Since the data in this kind of model is organized as a set of rows and columns, in terms of representation this database is most similar to the relational database. But unlike a relational database it does not need any predefined schema. At runtime, rows and columns can be added flexibly, but oftentimes the column families have to be predefined, which makes the data store less flexible than key-value or document data stores. Developers should understand the data captured by the application and the query possibilities before deciding on the column families. A well-designed column-family database enables an application to satisfy the majority of its queries by visiting as few column families as possible. Compared to a relational database holding an equivalent amount of data, a column family data store is more scalable and faster. But the performance comes at the price of the database being less generalized than a relational database, as it is designed to support a specific set of queries. The HBase [9] and Hypertable database systems are based on the data model described above. Another database system, Cassandra, differs from this data model in that it adds a new dimension called the super column [1]. As shown in Fig. 4, a super column consists of multiple columns. A collection of super columns along with a row key constitutes a row of a super column family. As with columns, the super column names and the sub column names are sorted. A super column is also a name-value entity, but with no timestamp.

Fig. 4. Column-family data store (Cassandra).

D. Graph Database

Graph databases are considered to be the specialists for highly linked data. They therefore handle data involving a huge number of relationships [8]. There are basically three core abstractions in a graph database: nodes, edges which connect two nodes, and properties. Each node holds information about an entity. The edges represent the existence of a relationship between the entities. Each relationship has a relationship type and is directional, with a start point (node) and an end point. The end point can be a node other than the start node or possibly the same node. Key-value properties are associated not only with the nodes but also with the relationships. The properties of the relationships provide additional information about the relationships. The direction of the relationship determines the traversal path from one node to the other in a graph database.

Fig. 5 represents a part of the 'Employee' database structured as a graph database. Each node in this graph database represents an employee entity. These entities are related to each other through a relationship of relationship type "knows". The property associated with the relationship is "Duration". The key difference between a graph database and a relational database is data querying. Instead of using a cost-intensive process such as the recursive join of a relational database, graph databases use traversal. While querying a graph database, a start node has to be specified by the application. Traversal starts from the start node and progresses via relationships to the nodes connected to the start node, based upon some rule defined by the application logic. The traversal involves only the nodes which are relevant to the application, not the entire data set. Hence, a huge increase in the number of nodes does not affect the traversal rate much. Social networking, data mining, managing networks and calculating routes are a few of the fields where graph databases have been used extensively. Neo4j [8] and GraphDB are popular graph databases in use today.
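The traversal described above can be sketched as a breadth-first walk over directed, typed relationships. This is only an illustration under stated assumptions: the `Graph` class, its method names and the sample employees are hypothetical and do not reflect the API of Neo4j or any other graph database. Only the relationship type "knows" and the property "Duration" come from the text.

```python
from collections import deque


class Graph:
    """Nodes and directed, typed relationships, both carrying key-value properties."""

    def __init__(self):
        self.node_properties = {}
        self.outgoing = {}  # node_id -> list of (rel_type, end_node, properties)

    def add_node(self, node_id, **properties):
        self.node_properties[node_id] = properties
        self.outgoing.setdefault(node_id, [])

    def add_relationship(self, start, end, rel_type, **properties):
        self.outgoing.setdefault(start, []).append((rel_type, end, properties))
        self.outgoing.setdefault(end, [])

    def traverse(self, start, rel_type, max_depth):
        """Follow `rel_type` relationships outward from `start`.

        Returns {node_id: depth} for every node reached within `max_depth`
        hops, excluding the start node. Only nodes reachable from the start
        node are visited, never the whole data set.
        """
        depth_of = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if depth_of[current] == max_depth:
                continue
            for kind, neighbour, _props in self.outgoing[current]:
                if kind == rel_type and neighbour not in depth_of:
                    depth_of[neighbour] = depth_of[current] + 1
                    queue.append(neighbour)
        del depth_of[start]
        return depth_of


graph = Graph()
for emp in ("Emp101", "Emp102", "Emp103"):  # hypothetical employees
    graph.add_node(emp)
graph.add_relationship("Emp101", "Emp102", "knows", Duration="2 years")
graph.add_relationship("Emp102", "Emp103", "knows", Duration="5 years")

# Everyone Emp101 knows directly or through one intermediate colleague.
print(graph.traverse("Emp101", "knows", max_depth=2))
```

The rule "follow only `knows` edges up to a fixed depth" plays the role of the application-defined traversal rule mentioned in the text; other rules could filter on relationship properties such as Duration.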

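Returning to the column-family model of Section III-C, the mapping (row key, column family, column, timestamp) → value can be sketched as a nested dictionary. The `ColumnFamilyStore` class, the logical clock used for timestamps, the column family name "personal" and the sample values are assumptions for illustration; the sketch does not reproduce how HBase, Hypertable or Cassandra store data.

```python
import itertools


class ColumnFamilyStore:
    """Sparse map: (row key, column family, column, timestamp) -> value."""

    def __init__(self):
        # (row_key, column_family, column) -> {timestamp: value}
        self.cells = {}
        self._clock = itertools.count(1)  # logical timestamps 1, 2, 3, ...

    def put(self, row_key, column_family, column, value):
        """Store a new version of the cell and return its timestamp."""
        timestamp = next(self._clock)
        versions = self.cells.setdefault((row_key, column_family, column), {})
        versions[timestamp] = value
        return timestamp

    def get(self, row_key, column_family, column, timestamp=None):
        """Return the latest version, or the version written at `timestamp`."""
        versions = self.cells[(row_key, column_family, column)]
        if timestamp is None:
            timestamp = max(versions)
        return versions[timestamp]

    def read_row(self, row_key, column_family):
        """Latest value of every column the row actually has."""
        return {
            column: versions[max(versions)]
            for (row, family, column), versions in sorted(self.cells.items())
            if row == row_key and family == column_family
        }


store = ColumnFamilyStore()
first = store.put("Emp102", "personal", "FirstNm", "Ravi")
store.put("Emp102", "personal", "FirstNm", "Ravindra")  # update creates a new version
store.put("Emp103", "personal", "LastNm", "Das")        # this row has no FirstNm column

print(store.get("Emp102", "personal", "FirstNm"))         # last written value
print(store.get("Emp102", "personal", "FirstNm", first))  # older version by timestamp
print(store.read_row("Emp103", "personal"))
```

A row that lacks a column simply has no entry for it, which is the sparse behaviour the text contrasts with NULL values in a relational table.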
Fig. 5. An example of graph database.

IV. TRANSACTION IN NOSQL DATABASES

When we talk about SQL vs. NoSQL, the competition is actually not between the databases. The comparison is between the transaction models of the two kinds of database. A transaction is defined as the logical unit of database processing formed by an executing program. Transactions in SQL databases are based upon strict ACID properties, where ACID is the abbreviation for Atomicity, Consistency, Isolation and Durability. But designers of NoSQL databases decided that the ACID properties are too restrictive to meet the demands of big data. Hence Professor Eric Brewer in the year 2000 came up with a new theorem known as the CAP theorem [2]. CAP is the abbreviation for Consistency, Availability and Partition tolerance. The theorem says that designers can achieve any two of these properties at a time in a distributed environment. Designers can ensure consistency and availability at the cost of partition tolerance, i.e. a CA based database. If the designer goes for availability and partition tolerance at the cost of consistency, then it is an AP based database. And if consistency and partition tolerance are ensured at the cost of availability, then the database is CP based. The transaction models of NoSQL can be classified as follows.

• Concerned about consistency and availability (CA): This kind of database system gives priority to data availability and consistency by using a replication approach [2]. Such a database does not concern itself with partition tolerance. If a partition occurs between nodes, the data will go out of sync. Relational databases, Vertica and Greenplum fall under this category of databases.

• Concerned about consistency and partition tolerance (CP): The priority of such a database system is to ensure data consistency, but it does not offer good availability. Data gets stored in distributed nodes [2]. When a node goes down, data becomes unavailable in order to maintain consistency between the nodes. It maintains partition tolerance by preventing resynchronization of data. Hypertable, BigTable and HBase are a few database systems which are concerned about CP.

• Concerned about availability and partition tolerance (AP): The priority of such a database system is primarily to ensure data availability and partition tolerance. Even if there is a communication failure between the nodes, the nodes remain online. Once the partition is resolved, resynchronization of data takes place, but without the guarantee of consistency. Riak, CouchDB and KAI are a few databases which follow this principle.

Afterwards the CAP theorem was expanded into PACELC [1]. PACELC is an abbreviation for partition, availability, consistency, else, latency, consistency: if there is a partition (P), the system trades off availability (A) against consistency (C); else (E), it trades off latency (L) against consistency (C). According to this model the tradeoff between availability and consistency is not only based upon partition tolerance, but is also dependent on the existence of a network partition. It suggests latency to be one of the important factors, since most distributed database systems use replication technology for ensuring availability. Later eBay introduced a new theorem known as the BASE theorem [3]. BASE aims to achieve availability instead of consistency of databases. BASE is the abbreviation for basically available, soft state and eventually consistent.

• Basically Available: Basically available means that even if a part of the database becomes unavailable, other parts of the database continue to function as expected. In case of a node failure, the operation continues on the replica of the data stored in some other node.

• Soft State: Soft state means that, on the basis of user interaction, data may be dependent on time. Such data may also expire after a certain period of time. Hence, to keep the data relevant in a system, it has to be updated or accessed.

• Eventually Consistent: Eventual consistency means that after a data update the data may not immediately become consistent across the entire system, but it will eventually become consistent with time. Therefore, the data is said to be consistent in the future.

V. COMPARISON OF NOSQL DATABASES

There is no hard and fast rule to decide which NoSQL database is best for an enterprise. Business model, strategy, cost and transaction model demands are a few of the important factors that an enterprise should consider while choosing a database. The following facts may help in choosing a database for an enterprise.

• If the applications simply store and retrieve blobs and data items which are opaque to the database management system, using a key as identifier, then a key-value store is the best choice. But if the application needs to query the database with some attribute value other than the key, a key-value store fails. A key-value store also fails when an individual field in a record has to be updated or read.

• When applications are more selective and need to filter records based on non-key fields, or retrieve or update individual fields in a record, then a document database is an efficient solution. Document data stores offer better query possibilities than key-value data stores.

• When the applications need to store records with hundreds or thousands of fields, but retrieve a subset of those fields in most of the queries that they perform, a column-family data store is an efficient choice. Such data stores are suitable for large datasets that scale high.

• If the applications need to store and process information on heavily linked data with highly complex relationships between the entities, a graph database is the best choice. In a graph database, entities and the relationships between the entities are treated with equal importance.

Table 1 presents a list of databases and their corresponding data models, along with the transaction model and query language used by these databases. Cassandra at Facebook, Bigtable at Google (on which HBase [9] is modeled) and DynamoDB at Amazon are a few of the databases which were developed by different companies in order to meet their own demand for high data storage. On the other hand, database systems such as Neo4j, Riak and MongoDB were developed in order to serve other organizations. In terms of transaction model, most of the databases, such as DynamoDB, Riak, Cassandra and Voldemort, give more preference to availability over consistency, whereas Tokyo Cabinet and HBase prefer consistency over availability. NoSQL databases were designed to handle large-volume data processing, excluding some of the support systems of an RDBMS such as ad-hoc queries. Though many of the NoSQL databases mentioned in Table 1 support ad-hoc queries, the level of programming expertise needed to write the queries is much higher than for a relational database.

TABLE I. A COMPARISON OF DIFFERENT NOSQL DATABASES
[Table body not recoverable from this transcription. Per the text above, the table lists each database tool with its data model, transaction model (CAP) and query language. Surviving cell fragments: AP, AP, Voldemort, Tokyo ..., Built in API, CorrugatedIron, No, No, Cloudant, Lucene, BSON based format, Built in, Limited, HIVE, PIG, HIVE, PIG, Cypher.]

VI. CONCLUSION

In the database domain the NoSQL database is considered to be quite new. However, these databases are being developed on known and existing theory. NoSQL database systems still have various limitations. There is neither a common standard nor any common and familiar query language for querying NoSQL databases. Each database behaves in a different way and does things differently. These databases are relatively immature and constantly evolving. NoSQL databases do not support strict ACID properties, hence there is no guarantee that all data will be written successfully to the data store. This paper describes the limitations of the relational database along with the different categories of NoSQL data models. Since there is no evaluation available to find the right tool, this paper compares the strengths and limitations of each data model. The limitations of NoSQL databases and their use in a cloud computing environment are the areas which need detailed research in future.

REFERENCES

[1] Maria Indrawan, "Database Research: Are We At A Crossroad?," 15th International Conference on Network-Based Information Systems, pp. 45-48, 2012.
[2] Jing Han, Haihong E, Guan Le, Jian Du, "Survey on NoSQL Database," IEEE, pp. 363-366, 2011.
[3] Shalini R., Savita G., Subramanian A., "Comparison of Cloud Database: Amazon's SimpleDB and Google's Bigtable," International Conference on Recent Trends in Information Systems, IEEE, pp. 165-168, 2011.
[4] R. Hecht, S. Jablonski, "NoSQL Evaluation," International Conference on Cloud and Service Computing, IEEE, pp. 336-338, 2011.
[5] Alexandru Boicea, Florin Radulescu, Laura Ioana Agapin, "MongoDB vs Oracle - database comparison," Third International Conference on Emerging Intelligent Data and Web Technologies, 2012.
[6] Jing Han, Meina Song and Junde Song, "A Novel Solution of Distributed Memory NoSQL Database for Cloud Computing," 10th IEEE/ACIS International Conference on Computer and Information Science, 2011.
[7] Guoxi Wang, Jianfeng Tang, "The NoSQL Principles and Basic Application of Cassandra Model," IEEE, pp. 1332-1333, 2012.
[8] Neo4j, http://neo4j.org.
[9] Hbase, http://hbase.apache.org.
[10] Mahdi Negahi Shirazi, Ho Chin Kuan, Hossein Dolatabadi, "Design Patterns to Enable Data Portability between Clouds' Databases," 12th International Conference on Computational Science and Its Applications, pp. 117-118, 2012.
[11] Mongodb, http://www.mongodb.org.
