What Is Big Data And Hadoop?


Big Data refers to data sets so large that they cannot be analyzed with traditional tools; the term covers data associated with large-scale processing architectures. Hadoop is a software framework developed by Apache to support the distributed processing of such data. Hadoop was initially developed in Java, but today many other languages can be used for scripting on Hadoop. Hadoop serves as the core platform for structuring Big Data and helps in performing data analytics.

Table of Contents
Chapter: 1 — Important Definitions
Chapter: 2 — MapReduce
Chapter: 3 — HDFS
Chapter: 4 — Pig vs. SQL
Chapter: 5 — HBase Components
Chapter: 6 — Cloudera
Chapter: 7 — ZooKeeper and Sqoop
Chapter: 8 — Hadoop Ecosystem

Chapter: 1 — Important Definitions

Big Data: Data sets whose size makes it difficult for commonly used data-capturing software tools to interpret, manage, and process them within a reasonable time frame.

Hadoop: An open-source framework built on the Java environment. It assists in the processing of large data sets in a distributed computing environment.

VMware Player: A free software package offered by VMware, Inc., used to create and manage virtual machines.

Hadoop Architecture: Hadoop uses a master-slave architecture, with the NameNode as the master and the DataNodes as the slaves.

HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system that shares some features with other distributed file systems. It is used for storing and retrieving unstructured data.

MapReduce: A core component of Hadoop, responsible for processing jobs in distributed mode.

Apache Hadoop: One of the primary technologies ruling the field of Big Data.

Ubuntu Server: A leading open-source platform for scale-out. Ubuntu helps utilize infrastructure at its optimum level, whether users want to deploy a cloud, a web farm, or a Hadoop cluster.

Pig: Apache Pig is a platform that helps analyze large data sets; it includes the high-level language required to express data analysis programs. Pig is one of the components of the Hadoop ecosystem.

Hive: An open-source data warehousing system used to analyze large data sets stored in Hadoop files. It has three key functions: data summarization, query, and analysis.

SQL: A query language used to interact with relational databases.

Metastore: The component that stores the system catalog and metadata about tables, columns, partitions, etc. It is stored in a traditional RDBMS format.

Driver: The component that manages the lifecycle of a HiveQL statement.

Query compiler: One of the driver components; it is responsible for compiling the Hive script and checking it for errors.

Query optimizer: Optimizes Hive scripts for faster execution. It consists of a chain of transformations.

Execution engine: Executes the tasks produced by the compiler in the proper dependency order.

Hive Server: The main component responsible for providing an interface to the user. It also maintains connectivity across modules.

Client components: Used by developers to perform development in Hive. They include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.

Apache HBase: A distributed, column-oriented database built on top of HDFS (Hadoop Distributed File System). HBase can scale horizontally to thousands of commodity servers and petabytes of data by indexing the storage.

ZooKeeper: Used for performing region assignment. ZooKeeper is a centralized management service for maintaining configuration information and naming, and for providing distributed synchronization and group services.

Cloudera: A commercial tool for deploying Hadoop in an enterprise setup.

Sqoop: A tool that extracts data from non-Hadoop sources and formats it so that the data can later be used by Hadoop.

Chapter: 2 — MapReduce

The MapReduce component of Hadoop is responsible for processing jobs in distributed mode. The features of MapReduce are as follows:

Distributed data processing: MapReduce performs distributed data processing using the MapReduce programming paradigm.

User-defined map phase: You can define your own map phase, a parallel, share-nothing processing of the input.

Aggregation of output: The output of the map phase is aggregated by a user-defined reduce phase that runs after the map process.
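The features above — a share-nothing map phase followed by per-key aggregation in a reduce phase — can be sketched in plain Python. This is a toy word count illustrating the paradigm, not the Hadoop API:

```python
from collections import defaultdict

# User-defined map phase: each input line is processed independently
# (share-nothing), emitting intermediate (key, value) pairs.
def map_phase(line):
    for word in line.split():
        yield (word.lower(), 1)

# User-defined reduce phase: aggregates all values emitted for one key.
def reduce_phase(key, values):
    return (key, sum(values))

def map_reduce(lines):
    # Shuffle step: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

print(map_reduce(["big data needs big tools", "hadoop processes big data"]))
# → {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}
```

In real Hadoop the map tasks run on many nodes and the framework performs the shuffle over the network; the in-memory dictionary here only stands in for that step.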

Chapter: 3 — HDFS

HDFS is used for storing and retrieving unstructured data. The features of Hadoop HDFS are as follows:

Provides access to data blocks: HDFS provides high-throughput access to data blocks. When unstructured data is uploaded to HDFS, it is converted into data blocks of fixed size, so that the data is compatible with commodity hardware's storage.

Helps to manage the file system: HDFS provides a limited interface for managing the file system to allow it to scale. This ensures that you can scale the resources in the Hadoop cluster up or down.

Creates multiple replicas of data blocks: HDFS creates multiple replicas of each data block and distributes them across nodes in the cluster, so that the data remains available even if a node fails.

Chapter: 4 — Pig vs. SQL

The table below shows the differences between Pig and SQL:

Definition
- Pig: Apache Pig is a platform that helps analyze large data sets; it includes the high-level language required to express data analysis programs.
- SQL: A query language used to interact with relational databases.

Example
- Pig:

  customer = LOAD '/data/customer.dat' AS (c_id, name, city);
  sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);
  salesBLR = FILTER customer BY city == 'Bangalore';
  joined = JOIN customer BY c_id, salesBLR BY c_id;
  grouped = GROUP joined BY c_id;
  summed = FOREACH grouped GENERATE group, SUM(joined.salesBLR::amount);
  spenders = FILTER summed BY $1 > 100000;
  sorted = ORDER spenders BY $1 DESC;
  DUMP sorted;

- SQL:

  SELECT c.c_id, SUM(s.amount) AS CTotal
  FROM customers c
  JOIN sales s ON c.c_id = s.c_id
  WHERE c.city = 'Bangalore'
  GROUP BY c.c_id
  HAVING SUM(s.amount) > 100000
  ORDER BY CTotal DESC;
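The dataflow both examples intend — filter by city, join customers to sales, group by customer, sum the amounts, keep totals above 100000, sort descending — can be sketched in plain Python. The sample records below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical data standing in for /data/customer.dat and /data/sales.dat.
customers = [(1, "Asha", "Bangalore"), (2, "Ravi", "Chennai"), (3, "Meena", "Bangalore")]
sales = [(10, 1, "2024-01-05", 90000), (11, 1, "2024-02-01", 40000),
         (12, 2, "2024-01-09", 250000), (13, 3, "2024-03-12", 50000)]

# FILTER: keep only the Bangalore customers.
blr_ids = {c_id for c_id, _, city in customers if city == "Bangalore"}

# JOIN + GROUP + SUM: total sales amount per Bangalore customer.
totals = defaultdict(int)
for _, c_id, _, amount in sales:
    if c_id in blr_ids:
        totals[c_id] += amount

# HAVING + ORDER BY: keep totals over 100000, sorted descending.
spenders = sorted(((c, t) for c, t in totals.items() if t > 100000),
                  key=lambda x: x[1], reverse=True)
print(spenders)  # → [(1, 130000)]
```

The point of the comparison is that Pig expresses this as a step-by-step dataflow (like the Python above), whereas SQL expresses it as a single declarative query.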

Chapter: 5 — HBase Components

Introduction

Apache HBase is a distributed, column-oriented database built on top of HDFS (Hadoop Distributed File System). HBase can scale horizontally to thousands of commodity servers and petabytes of data by indexing the storage. HBase supports random real-time CRUD operations and has linear and modular scalability. It supports an easy-to-use Java API for programmatic access.

HBase is integrated with the MapReduce framework in Hadoop. It is an open-source framework modeled after Google's BigTable, and is a type of NoSQL database.

HBase Components

The components of HBase are the HBase Master and multiple RegionServers:

HBase Master: Responsible for managing the schema, which is stored in the Hadoop Distributed File System (HDFS).

Multiple RegionServers: Act as availability servers, each maintaining a part of the complete data stored in HDFS according to the user's requirements. They do this using the HFile and WAL (Write-Ahead Log) services. The RegionServers always stay in sync with the HBase Master; it is ZooKeeper's responsibility to ensure that they do.
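To make "column-oriented" concrete, here is a minimal Python sketch of an HBase-style table: rows are keyed by a row key, and each cell is addressed by a column family and qualifier. This is an illustration of the data model, not the HBase API:

```python
from collections import defaultdict

class MiniColumnStore:
    """Toy HBase-like table: row key -> {'family:qualifier': value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        # Cells are sparse: only the columns actually written are stored.
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family, qualifier):
        # Missing rows or cells simply return None, as in a sparse table.
        return self.rows.get(row_key, {}).get(f"{family}:{qualifier}")

table = MiniColumnStore()
table.put("row1", "info", "city", "Bangalore")
table.put("row1", "stats", "visits", 42)
print(table.get("row1", "info", "city"))  # → Bangalore
```

In real HBase, rows are additionally partitioned into regions served by the RegionServers, and every write first goes to the WAL before being persisted in HFiles.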

Chapter: 6 — Cloudera

Cloudera is a commercial tool for deploying Hadoop in an enterprise setup. The salient features of Cloudera are as follows:

- It has its own user-friendly Cloudera Manager for system management, Cloudera Navigator for data management, dedicated technical support, etc.
- It uses a 100% open-source distribution of Apache Hadoop and related projects such as Apache Pig, Apache Hive, Apache HBase, and Apache Sqoop.

Chapter: 7 — ZooKeeper and Sqoop

ZooKeeper is an open-source, high-performance coordination service for distributed applications. It offers services such as naming, locks and synchronization, configuration management, and group services.

ZooKeeper Data Model

ZooKeeper has a hierarchical namespace. Each node in the namespace is called a znode. The namespace can be pictured as a tree that follows a top-down approach, where '/' is the root and App1 and App2 reside under the root:

/
├── /App1
│   ├── /App1/db
│   └── /App1/conf
└── /App2

The path to access db is /App1/db. This path is called a hierarchical path.

Sqoop is an Apache Hadoop ecosystem project whose responsibility is to import and export data between relational databases such as MySQL, MSSQL, and Oracle, and HDFS. The following are the reasons for using Sqoop:

- SQL servers are deployed worldwide and are a primary way of accepting data from users.
- Nightly processing has been done on SQL servers for years.
- It is essential to have a mechanism to move data from traditional SQL databases to Hadoop HDFS.
- Transferring the data using ad-hoc scripts is inefficient and time-consuming.
- Traditional databases back the reporting, data visualization, and other applications built in enterprises, but handling large data requires an ecosystem.
- Sqoop satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.
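The hierarchical namespace above can be sketched as a small Python structure: each znode holds its children keyed by name, and a hierarchical path such as /App1/db is resolved by walking the tree. This illustrates the data model only, not the ZooKeeper client API:

```python
class Znode:
    """A node in a toy ZooKeeper-style hierarchical namespace."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def create(self, name):
        # Add a child znode under this node and return it.
        node = Znode(name)
        self.children[name] = node
        return node

def resolve(root, path):
    """Walk a hierarchical path such as '/App1/db' from the root znode."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]
    return node

# Build the namespace from the example: / -> App1 (db, conf) and App2.
root = Znode("/")
app1 = root.create("App1")
app1.create("db")
app1.create("conf")
root.create("App2")

print(resolve(root, "/App1/db").name)  # → db
```

Real znodes also carry data, versions, and watches; the tree walk here only mirrors how a hierarchical path identifies one node in the namespace.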

Chapter: 8 — Hadoop Ecosystem

The Hadoop ecosystem comprises several components. The base of all of them is the Hadoop Distributed File System (HDFS). Above it sits YARN / MapReduce v2, the framework component used for distributed processing in a Hadoop cluster.

The next component is Flume, which is used for collecting logs across a cluster. Sqoop is used for data exchange between a relational database and Hadoop HDFS. The ZooKeeper component is used for coordinating the nodes in a cluster. The next ecosystem component is Oozie, which is used for creating, executing, and modifying the workflow of a MapReduce job. The Pig component is used for scripting MapReduce applications.

The next component is Mahout, which is used for machine learning. R Connectors are used for generating statistics about the nodes in a cluster. Hive is used for interacting with Hadoop through SQL-like queries. The next component is HBase, which is used for slicing large data sets.

The last component is Ambari, which is used for provisioning, managing, and monitoring Hadoop clusters.
