Introduction To Big Data Tools

The tools used for Big Data handling, analysis, and further reporting are called Big Data tools. Examples include Hadoop, Spark, Scala, and Impala.

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. Hadoop is used by Yahoo, IBM, Google, Amazon, and many more. India's Aadhaar scheme uses Hadoop. MapReduce is the simple programming model used in Hadoop.

Main components

1. HDFS (Hadoop Distributed File System): storage
2. MapReduce: processing
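To make the MapReduce model concrete, here is a toy sketch of its map, shuffle, and reduce phases using plain Scala collections. It only mimics the flow of a Hadoop job (the real API uses Mapper/Reducer classes and a Job configuration), and the sample lines are invented for illustration.

```scala
// A toy illustration of the MapReduce idea using plain Scala collections.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data tools", "big data hadoop", "hadoop spark")

    // Map phase: emit a (word, 1) pair for every word in every input line.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle phase: group all pairs by key (the word).
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

    // Reduce phase: sum the counts for each word.
    val counts: Map[String, Int] =
      grouped.map { case (w, ones) => (w, ones.sum) }

    counts.foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```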

HDFS

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity (simple, off-the-shelf) hardware.

HDFS - Hadoop Distributed File System

Features of HDFS:
- Highly fault tolerant: data is replicated on a minimum of 3 nodes
- High throughput: huge volumes of data can be read and processed in a short time
- Suitable for applications with large data sets
- Streaming access to file system data: write once and read many times, e.g. when analyzing logs (a programmatic sketch follows this list)
- Can be built out of commodity hardware
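As a hedged illustration of the write-once/read-many pattern, this Scala sketch talks to HDFS through Hadoop's FileSystem API. It assumes the hadoop-client dependency is on the classpath; the NameNode URI and the file path are placeholders, not details from the original text.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Placeholder NameNode address; substitute your cluster's.
    val fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf)
    val path = new Path("/tmp/example.txt")

    // Write once...
    val out = fs.create(path)
    out.writeBytes("hello hdfs\n")
    out.close()

    // ...then read many times (the streaming access pattern).
    val in = fs.open(path)
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    println(new String(buf, 0, n, "UTF-8"))
    in.close()
    fs.close()
  }
}
```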

Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.

By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms. Spark requires a cluster manager and a distributed storage system.
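A minimal sketch of that idea in Spark's Scala API: the data is loaded once, cached in memory, and then scanned by several actions, as an iterative algorithm would. The file path and the local master URL are placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load and cache; subsequent actions reuse the in-memory partitions
    // instead of re-reading from disk between stages, as MapReduce would.
    val nums = sc.textFile("numbers.txt") // placeholder path (local or hdfs:// URI)
      .map(_.toDouble)
      .cache()

    // Several passes over the same cached data:
    val count = nums.count()
    val mean  = nums.sum() / count
    val maxV  = nums.max()

    println(s"count=$count mean=$mean max=$maxV")
    sc.stop()
  }
}
```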

For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark runs on a single machine with one executor per CPU core.

SCALA

The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. Scala is a programming language for general software applications. Scala has full support for functional programming and a very strong static type system. This allows programs written in Scala to be very concise and thus smaller in size than equivalent programs in other general-purpose programming languages. Many of Scala's design decisions were inspired by criticism of the shortcomings of Java.

Scala source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine. Java libraries may be used directly in Scala code and vice versa. Like Java, Scala is object-oriented, and uses curly-brace syntax reminiscent of the C programming language.
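A short sketch of that interoperability: Scala code instantiating a plain Java collection and calling a static Java method directly, with no bridging layer.

```scala
import java.util.ArrayList
import java.time.LocalDate

object JavaInteropSketch {
  def main(args: Array[String]): Unit = {
    // Instantiate and use a Java collection from Scala.
    val list = new ArrayList[String]()
    list.add("hadoop")
    list.add("spark")
    println(list.size()) // 2

    // Call a static method on a Java class.
    println(LocalDate.now().plusDays(7))
  }
}
```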

Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML, and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higher-order types (but not higher-rank types), and anonymous types. Other features of Scala not present in Java include operator overloading, optional parameters, named parameters, raw strings, and no checked exceptions.
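A small sketch exercising several of the features just listed: type inference, immutability, currying, lazy evaluation, pattern matching, and an algebraic data type modelled as a sealed trait with case classes.

```scala
object ScalaFeaturesSketch {
  // Algebraic data type: a closed set of shape variants.
  sealed trait Shape
  case class Circle(r: Double) extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  // Pattern matching over the ADT.
  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }

  // Currying: the function takes its arguments one list at a time.
  def scale(factor: Double)(x: Double): Double = factor * x
  val double: Double => Double = scale(2.0) // partially applied

  // Lazy evaluation: the body runs only on first access.
  lazy val expensive: Int = { println("computed once"); 42 }

  def main(args: Array[String]): Unit = {
    val shapes = List(Circle(1.0), Rect(2.0, 3.0)) // types inferred, list immutable
    shapes.map(area).foreach(println)
    println(double(21.0)) // 42.0
    println(expensive)    // triggers the lazy initialization
    println(expensive)    // reuses the cached value
  }
}
```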

CLOUDERA IMPALA

Cloudera Impala is an open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.

Impala is integrated with Hadoop to use the same file and data formats, metadata, security, and resource-management frameworks used by MapReduce, Apache Hive, Apache Pig, and other Hadoop software. Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools.
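As a hedged sketch of querying Impala from code: Impala speaks the HiveServer2 wire protocol, so the Hive JDBC driver is one common client. The host, the port (21050 is Impala's customary JDBC port), the auth setting, and the events table are all assumptions for illustration, not details from the original text.

```scala
import java.sql.DriverManager

object ImpalaQuerySketch {
  def main(args: Array[String]): Unit = {
    // Hive JDBC driver (HiveServer2 protocol), which Impala also accepts.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Placeholder host/port/auth; adjust for your cluster's configuration.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://impala-host:21050/default;auth=noSasl")
    try {
      // Hypothetical table used only for illustration.
      val rs = conn.createStatement().executeQuery(
        "SELECT category, COUNT(*) AS n FROM events GROUP BY category")
      while (rs.next())
        println(s"${rs.getString("category")}\t${rs.getLong("n")}")
    } finally conn.close()
  }
}
```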

The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

Features

- Supports HDFS and Apache HBase for storage
- Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet
- Supports Hadoop security (Kerberos authentication)
- Fine-grained, role-based authorization with Apache Sentry
- Uses the metadata, ODBC driver, and SQL syntax of Apache Hive

In early 2013, a column-oriented file format called Parquet was announced for architectures including Impala. In December 2013, Amazon Web Services announced support for Impala. In early 2014, MapR added support for Impala.

Identify gaps in the data and follow up for decision making

There are two broad ways to deal with gaps in data:
1. Missing-data imputation
2. Model-based techniques

In the simplest case, missing values are replaced with an average value or removed (both are sketched below). For the analysis to be sound, we select the variables for modeling based on correlation test results.
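A minimal Scala sketch of the two simple strategies just mentioned, with invented toy values; None marks a missing entry.

```scala
object SimpleImputationSketch {
  def main(args: Array[String]): Unit = {
    val xs: List[Option[Double]] =
      List(Some(2.0), None, Some(4.0), Some(6.0), None)

    val observed = xs.flatten
    val mean = observed.sum / observed.size // 4.0

    // Strategy 1: mean imputation -- fill each gap with the average.
    val imputed = xs.map(_.getOrElse(mean))
    println(imputed) // List(2.0, 4.0, 4.0, 6.0, 4.0)

    // Strategy 2: removal -- analyze only the observed values.
    println(observed) // List(2.0, 4.0, 6.0)
  }
}
```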

Techniques of dealing with missing data

Missing data reduce the representativeness of the sample and can therefore distort inferences about the population. If possible, think about how to prevent data from going missing before the actual data gathering takes place.

Imputation

When analyzing data, it is good to consider imputing the missing values. Imputation can be done in several ways; using multiple imputations (even 5 or fewer) improves the quality of estimation.
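A toy sketch of the multiple-imputation idea: build m completed datasets, estimate on each, and pool the results. Here each missing value is drawn at random from the observed values (a crude hot-deck draw); real multiple imputation samples from a proper predictive model, and the data below are invented.

```scala
import scala.util.Random

object MultipleImputationSketch {
  def main(args: Array[String]): Unit = {
    val xs: Vector[Option[Double]] =
      Vector(Some(1.0), None, Some(3.0), None, Some(5.0))
    val observed = xs.flatten
    val rng = new Random(42)
    val m = 5 // number of imputations

    // One mean estimate per completed dataset.
    val estimates = (1 to m).map { _ =>
      val completed = xs.map(_.getOrElse(observed(rng.nextInt(observed.size))))
      completed.sum / completed.size
    }

    // Pooled point estimate: the average of the per-dataset estimates.
    println(estimates.sum / m)
  }
}
```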

Examples of imputation approaches are listed below.

Partial imputation

The expectation-maximization algorithm is an approach in which the values of the statistics that would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data items are not usually imputed.

Partial deletion

Methods which involve reducing the data available to a dataset having no missing values include (both are sketched below):
- Listwise deletion / casewise deletion
- Pairwise deletion
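A small sketch contrasting the two deletion methods on an invented two-column dataset.

```scala
object DeletionSketch {
  def main(args: Array[String]): Unit = {
    // (x, y) rows; None marks a missing cell.
    val rows: List[(Option[Double], Option[Double])] =
      List((Some(1.0), Some(2.0)), (None, Some(4.0)),
           (Some(3.0), None),      (Some(5.0), Some(6.0)))

    // Listwise (casewise) deletion: drop every row with any missing cell.
    val complete = rows.collect { case (Some(x), Some(y)) => (x, y) }
    println(s"listwise n = ${complete.size}") // 2

    // Pairwise deletion: the mean of x uses every row where x is present,
    // even if y is missing in that row.
    val xsObs = rows.flatMap(_._1)
    println(s"pairwise mean(x) over n = ${xsObs.size}: ${xsObs.sum / xsObs.size}")
  }
}
```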

Full analysis

Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:
- The expectation-maximization algorithm
- Full information maximum likelihood estimation

Interpolation

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points.
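A minimal sketch of the simplest case, linear interpolation between two known points.

```scala
object InterpolationSketch {
  // Estimate y at x, given known points (x0, y0) and (x1, y1).
  def lerp(x0: Double, y0: Double, x1: Double, y1: Double)(x: Double): Double =
    y0 + (y1 - y0) * (x - x0) / (x1 - x0)

  def main(args: Array[String]): Unit = {
    // Known points (1, 10) and (3, 30); construct a new point at x = 2.
    println(lerp(1, 10, 3, 30)(2.0)) // 20.0
  }
}
```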

Model-based techniques

Model-based techniques use tools for testing the missing-data type (MCAR: missing completely at random; MAR: missing at random; MNAR: missing not at random) and for estimating parameters under missing-data conditions.

For example, a test for refuting MAR/MCAR reads as follows: for any three variables X, Y, and Z, where Z is fully observed and X and Y partially observed, the data should satisfy

X ⫫ Ry | (Rx, Z)

In words, the observed portion of X should be independent of the missingness status of Y, conditional on every value of Z.

When data falls into the MNAR category, techniques are available for consistently estimating parameters, provided certain conditions hold in the model. For example, if Y explains the reason for missingness in X and Y itself has missing values, the joint probability distribution of X and Y can still be estimated if the missingness of Y is random.

The estimand in this case will be:

P(X, Y) = P(X | Y) P(Y) = P(X | Y, Rx = 0, Ry = 0) P(Y | Ry = 0)

where Rx = 0 and Ry = 0 denote the observed portions of the respective variables.
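A toy sketch of computing that estimand from partially observed binary data: P(Y | Ry = 0) is taken from all rows where Y is observed, and P(X | Y, Rx = 0, Ry = 0) from the rows where both variables are observed. The rows below are invented for illustration.

```scala
object MnarEstimandSketch {
  def main(args: Array[String]): Unit = {
    // (x, y) rows; None marks a missing cell.
    val rows: List[(Option[Int], Option[Int])] = List(
      (Some(1), Some(1)), (Some(0), Some(1)), (None, Some(0)),
      (Some(0), Some(0)), (None, Some(1)),    (Some(1), None))

    def pXY(x: Int, y: Int): Double = {
      // P(Y = y | Ry = 0): frequency of y among rows with Y observed.
      val yObs = rows.flatMap(_._2)
      val pY = yObs.count(_ == y).toDouble / yObs.size

      // P(X = x | Y = y, Rx = 0, Ry = 0): among fully observed rows with Y = y.
      val both = rows.collect { case (Some(a), Some(b)) if b == y => a }
      val pXgivenY = both.count(_ == x).toDouble / both.size

      pXgivenY * pY
    }

    for (x <- 0 to 1; y <- 0 to 1)
      println(f"P(X=$x, Y=$y) = ${pXY(x, y)}%.3f")
  }
}
```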

