Apache HBase Primer - Programmer-books


Apache HBase Primer
Deepak Vohra


Apache HBase Primer
Deepak Vohra
White Rock, British Columbia
Canada

ISBN-13 (pbk): 978-1-4842-2423-6
ISBN-13 (electronic): 978-1-4842-2424-3
DOI 10.1007/978-1-4842-2424-3
Library of Congress Control Number: 2016959189

Copyright © 2016 by Deepak Vohra

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director: Welmoed Spahr
Lead Editor: Steve Anglin
Technical Reviewer: Massimo Nardone
Editorial Board: Steve Anglin, Pramila Balan, Laura Berendson, Aaron Black, Louise Corrigan, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham, Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Mark Powers
Copy Editor: Mary Behr
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science+Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary materials referenced by the author in this text are available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/. Readers can also access source code at SpringerLink in the Supplementary Material section for each chapter.

Printed on acid-free paper

Contents at a Glance

About the Author
About the Technical Reviewer
Introduction

Part I: Core Concepts
  Chapter 1: Fundamental Characteristics
  Chapter 2: Apache HBase and HDFS
  Chapter 3: Application Characteristics

Part II: Data Model
  Chapter 4: Physical Storage
  Chapter 5: Column Family and Column Qualifier
  Chapter 6: Row Versioning
  Chapter 7: Logical Storage

Part III: Architecture
  Chapter 8: Major Components of a Cluster
  Chapter 9: Regions
  Chapter 10: Finding a Row in a Table
  Chapter 11: Compactions
  Chapter 12: Region Failover
  Chapter 13: Creating a Column Family

Part IV: Schema Design
  Chapter 14: Region Splitting
  Chapter 15: Defining the Row Keys

Part V: Apache HBase Java API
  Chapter 16: The HBaseAdmin Class
  Chapter 17: Using the Get Class
  Chapter 18: Using the HTable Class

Part VI: Administration
  Chapter 19: Using the HBase Shell
  Chapter 20: Bulk Loading Data

Index

Contents

About the Author
About the Technical Reviewer
Introduction

Part I: Core Concepts

  Chapter 1: Fundamental Characteristics
    Distributed
    Big Data Store
    Non-Relational
    Flexible Data Model
    Scalable
    Roles in Hadoop Big Data Ecosystem
    How Is Apache HBase Different from a Traditional RDBMS?
    Summary

  Chapter 2: Apache HBase and HDFS
    Overview
    Storing Data
    HFile Data files - HFile v1
    HBase Blocks
    Key Value Format
    HFile v2
    Encoding
    Compaction
    KeyValue Class
    Data Locality
    Table Format
    HBase Ecosystem
    HBase Services
    Auto-sharding
    The Write Path to Create a Table
    The Write Path to Insert Data
    The Write Path to Append-Only R/W
    The Read Path for Reading Data
    The Read Path Append-Only to Random R/W
    HFile Format
    Data Block Encoding
    Compactions
    Snapshots
    The HFileSystem Class
    Scaling
    HBase Java Client API
    Random Access
    Data Files (HFile)
    Reference Files/Links
    Write-Ahead Logs
    Data Locality
    Checksums
    Data Locality for HBase
    MemStore
    Summary

  Chapter 3: Application Characteristics
    Summary

Part II: Data Model

  Chapter 4: Physical Storage
    Summary

  Chapter 5: Column Family and Column Qualifier
    Summary

  Chapter 6: Row Versioning
    Versions Sorting
    Summary

  Chapter 7: Logical Storage
    Summary

Part III: Architecture

  Chapter 8: Major Components of a Cluster
    Master
    RegionServers
    ZooKeeper
    Regions
    Write-Ahead Log
    Store
    HDFS
    Clients
    Summary

  Chapter 9: Regions
    How Many Regions?
    Compactions
    Region Assignment
    Failover
    Region Locality
    Distributed Datastore
    Partitioning
    Auto Sharding and Scalability
    Region Splitting
    Manual Splitting
    Pre-Splitting
    Load Balancing
    Preventing Hotspots
    Summary

  Chapter 10: Finding a Row in a Table
    Block Cache
    The hbase:meta Table
    Summary

  Chapter 11: Compactions
    Minor Compactions
    Major Compactions
    Compaction Policy
    Function and Purpose
    Versions and Compactions
    Delete Markers and Compactions
    Expired Rows and Compactions
    Region Splitting and Compactions
    Number of Regions and Compactions
    Data Locality and Compactions
    Write Throughput and Compactions
    Encryption and Compactions
    Configuration Properties
    Summary

  Chapter 12: Region Failover
    The Role of the ZooKeeper
    HBase Resilience
    Phases of Failover
    Failure Detection
    Data Recovery
    Regions Reassignment
    Failover and Data Locality
    Configuration Properties
    Summary

  Chapter 13: Creating a Column Family
    Cardinality
    Number of Column Families
    Column Family Compression
    Column Family Block Size
    Bloom Filters
    IN MEMORY
    MAX LENGTH and MAX VERSIONS
    Summary

Part IV: Schema Design

  Chapter 14: Region Splitting
    Managed Splitting
    Pre-Splitting
    Configuration Properties
    Summary

  Chapter 15: Defining the Row Keys
    Table Key Design
    Filters
    FirstKeyOnlyFilter Filter
    KeyOnlyFilter Filter
    Bloom Filters
    Scan Time
    Sequential Keys
    Defining the Row Keys for Locality
    Summary

Part V: Apache HBase Java API

  Chapter 16: The HBaseAdmin Class
    Summary

  Chapter 17: Using the Get Class
    Summary

  Chapter 18: Using the HTable Class
    Summary

Part VI: Administration

  Chapter 19: Using the HBase Shell
    Creating a Table
    Altering a Table
    Adding Table Data
    Describing a Table
    Finding If a Table Exists
    Listing Tables
    Scanning a Table
    Enabling and Disabling a Table
    Dropping a Table
    Counting the Number of Rows in a Table
    Getting Table Data
    Truncating a Table
    Deleting Table Data
    Summary

  Chapter 20: Bulk Loading Data
    Summary

Index

About the Author

Deepak Vohra is a consultant and a principal member of the NuBean software company. Deepak is a Sun-certified Java programmer and Web component developer. He has worked in the fields of XML, Java programming, and Java EE for over seven years. Deepak is the coauthor of Pro XML Development with Java Technology (Apress, 2006). Deepak is also the author of JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE (Packt Publishing). He also served as the technical reviewer on WebLogic: The Definitive Guide (O’Reilly Media, 2004) and Ruby Programming for the Absolute Beginner (Cengage Learning PTR, 2007).

About the Technical Reviewer

Massimo Nardone has more than 22 years of experience in security, web/mobile development, and cloud and IT architecture. His true IT passions are security and Android. He has been programming and teaching how to program with Android, Perl, PHP, Java, VB, Python, C/C++, and MySQL for more than 20 years. Technical skills include security, Android, cloud, Java, MySQL, Drupal, Cobol, Perl, web and mobile development, MongoDB, D3, Joomla, Couchbase, C/C++, WebGL, Python, Pro Rails, Django CMS, Jekyll, Scratch, etc.

He currently works as Chief Information Security Officer (CISO) for Cargotec Oyj. He holds four international patents (PKI, SIP, SAML, and Proxy areas). He worked as a visiting lecturer and supervisor for exercises at the Networking Laboratory of the Helsinki University of Technology (Aalto University). He has also worked as a Project Manager, Software Engineer, Research Engineer, Chief Security Architect, Information Security Manager, PCI/SCADA Auditor, and Senior Lead IT Security/Cloud/SCADA Architect for many years. He holds a Master of Science degree in Computing Science from the University of Salerno, Italy.

Massimo has reviewed more than 40 IT books for different publishing companies, and he is the coauthor of Pro Android Games (Apress, 2015).

Introduction

Apache HBase is an open source NoSQL database based on the wide-column data store model. HBase was initially released in 2008. While many NoSQL databases are available, Apache HBase is the database for the Apache Hadoop ecosystem.

HBase supports most of the commonly used programming languages such as C, C++, PHP, and Java. The implementation language of HBase is Java. HBase provides access support with Java API, RESTful HTTP API, and Thrift.

Some of the other Apache HBase books have a practical orientation and do not discuss HBase concepts in much detail. In this primer-level book, I shall discuss Apache HBase concepts. For practical use of Apache HBase, refer to another Apress book: Practical Hadoop Ecosystem.

PART I
Core Concepts

CHAPTER 1
Fundamental Characteristics

Apache HBase is the Hadoop database. HBase is open source, and its fundamental characteristics are that it is a non-relational, column-oriented, distributed, scalable, big data store. HBase provides schema flexibility. The fundamental characteristics of Apache HBase are as follows.

Distributed

HBase provides two distributed modes. In the pseudo-distributed mode, all HBase daemons run on a single node. In the fully-distributed mode, the daemons run on multiple nodes across a cluster. Pseudo-distributed mode can run against a local file system or an instance of the Hadoop Distributed File System (HDFS). When run against a local file system, durability is not guaranteed: edits are lost if files are not properly closed. The fully-distributed mode can only run on HDFS. Pseudo-distributed mode is suitable for small-scale testing while fully-distributed mode is suitable for production. Running against HDFS preserves all writes.

HBase supports auto-sharding, which means that tables are dynamically split and distributed by the database when they become too large.

Big Data Store

HBase is based on Hadoop and HDFS, and it provides low-latency, random, real-time read/write access to big data. HBase supports hosting very large tables with billions of rows and millions (or billions) of columns. HBase can handle petabytes of data. HBase is designed for queries of massive data sets and is optimized for read performance. Random read access is not an Apache Hadoop feature: with Hadoop alone, a reader can only run batch processing, which means that data is accessed sequentially and a job has to scan the entire dataset to find what it needs.

Non-Relational

HBase is a NoSQL database. NoSQL databases are not based on the relational database model. Relational databases such as Oracle Database, MySQL, and DB2 store data in tables, which have relations between them, and make use of SQL (Structured Query Language) to access and query the tables. NoSQL databases, in contrast, make use of a storage-and-query mechanism that is predominantly based on a non-relational, non-SQL data model. The data storage model used by NoSQL databases is not a fixed data model; it is a flexible-schema data model. The common feature among the NoSQL databases is that the relational and tabular database model of SQL-based databases is not used. Most NoSQL databases make use of no SQL at all, but NoSQL does not imply that absolutely no SQL is used, which is why NoSQL is also read as “not only SQL.”

Flexible Data Model

In 2006 the Google Labs team published a paper entitled “BigTable: A Distributed Storage System for Structured Data” (h.google.com/en//archive/bigtable-osdi06.pdf). Apache HBase is a wide-column data store based on Apache Hadoop and on BigTable concepts. The basic unit of storage in HBase is a table. A table consists of one or more column families, which in turn consist of columns. Columns are grouped into column families. Data is stored in rows. A row is a collection of key/value pairs. Each row is uniquely identified by a row key. The row keys are created when table data is added, and the row keys are used to determine the sort order and for data sharding, which is splitting a large table and distributing its data across the cluster.

HBase provides a flexible schema model in which columns may be added to a table column family as required without predefining the columns. Only the table and its column families are required to be defined in advance. No two rows in a table are required to have the same columns. All columns in a column family are stored in close proximity.

HBase does not support transactions. HBase is not eventually consistent; it is strongly consistent at the record level. Strong consistency implies that the latest data is always served, but at the cost of increased latency. In contrast, eventual consistency can return out-of-date data.

HBase has no notion of data types; all data is stored as arrays of bytes.

Rows in a table are sorted lexicographically by row key, a design feature that makes it feasible to store related rows (or rows that will be read together) next to each other for optimized scans.

Scalable

The basic unit of horizontal scalability in HBase is a region. Rows are shared by regions. A region is a
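
The data model described under “Flexible Data Model” can be sketched in a few lines of plain Python. This is only an illustrative model, not the HBase client API: the class name ToyTable, its put and scan methods, and the sample row keys are all hypothetical, chosen to show three points from the text. Column families are declared up front, column qualifiers are added per row as needed, and rows come back in lexicographic byte order of their row keys.

```python
from typing import Dict, List, Optional, Tuple

# One row: column family -> column qualifier -> value (everything is bytes).
Row = Dict[bytes, Dict[bytes, bytes]]


class ToyTable:
    """Hypothetical in-memory stand-in for an HBase table (not the real API)."""

    def __init__(self, column_families: List[bytes]) -> None:
        # Column families must be declared in advance; qualifiers need not be.
        self.column_families = set(column_families)
        self.rows: Dict[bytes, Row] = {}

    def put(self, row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> None:
        """Store one cell, creating the row and the column on first use."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family!r}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def scan(self, start: bytes = b"", stop: Optional[bytes] = None) -> List[Tuple[bytes, Row]]:
        """Return rows with start <= key < stop in lexicographic byte order."""
        keys = sorted(
            k for k in self.rows if k >= start and (stop is None or k < stop)
        )
        return [(k, self.rows[k]) for k in keys]


table = ToyTable([b"info"])
table.put(b"user-2", b"info", b"name", b"Bo")
table.put(b"user-10", b"info", b"name", b"Al")
table.put(b"user-10", b"info", b"email", b"al@example.com")  # column only this row has
table.put(b"user-1", b"info", b"name", b"Cy")

ordered_keys = [key for key, _ in table.scan()]
print(ordered_keys)  # → [b'user-1', b'user-10', b'user-2']
```

Note that b"user-10" sorts before b"user-2" because keys are compared byte by byte rather than numerically. Under this ordering, numeric identifiers in row keys would typically need zero-padding (for example, user-02 and user-10) for a range scan to return them in numeric order.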

