The Hadoop Ecosystem - York University

Upload by : Farrah Jaffe

The Hadoop Ecosystem
EECS 4415 Big Data Systems
Tilemachos Pechlivanoglou
tipech@eecs.yorku.ca

A lot of tools designed to work with Hadoop

HDFS, MapReduce

Hadoop Distributed File System (HDFS)
– Core Hadoop component
– Distributed storage and I/O for Hadoop

MapReduce
– Core Hadoop component
– Software framework for data processing
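To make the MapReduce processing model concrete, here is a minimal single-process sketch of the classic word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would run the map and reduce steps in parallel over blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data systems", "big data tools"]))
```

The programmer writes only the mapper and the reducer; grouping by key between the two phases is the framework's job.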

YARN

Yet Another Resource Negotiator
– Resource allocation and scheduling
– Core Hadoop component

Components: ResourceManager, NodeManager
– ResourceManager:
    receives processing requests
    passes the parts of each request to the corresponding NodeManagers
    has a Scheduler that allocates resources and time based on application requirements
    has an ApplicationsManager that monitors running jobs
– NodeManager:
    handles requests at every DataNode
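The division of labour between the two components can be sketched as a toy model. This is not the YARN API: the class and method names below, the memory-only resource model and the "most free memory first" placement rule are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for one worker node that runs containers."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        # Reserve memory on this node and record the running container.
        self.free_memory_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Toy stand-in for the cluster-wide scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes

    def submit(self, app_id, tasks, memory_mb_per_task):
        """Split a request into tasks and place each one on a node.

        Assumed policy: pick the node with the most free memory.
        A real scheduler would queue the request instead of raising.
        """
        placement = []
        for _ in range(tasks):
            node = max(self.nodes, key=lambda n: n.free_memory_mb)
            if node.free_memory_mb < memory_mb_per_task:
                raise RuntimeError("not enough free memory in the cluster")
            node.launch(app_id, memory_mb_per_task)
            placement.append(node.name)
        return placement


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("node-1", 3072), NodeManager("node-2", 2048)])
    print(rm.submit("app-1", tasks=3, memory_mb_per_task=1024))
```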

Apache Pig

SQL-like command structure in Hadoop
– Much more condensed (10 Pig Latin lines can replace roughly 200 MapReduce lines)
– Allows actions like grouping, filtering, etc.
– Developed by Yahoo

Pig Runtime and Pig Latin language
– Analogy to Java: Pig Runtime is like the JVM, Pig Latin is like Java
– Compiler internally converts Pig Latin to MapReduce
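The grouping and filtering actions mentioned above can be illustrated in plain Python. The comments show the Pig Latin statement each step roughly corresponds to; the relation and field names (`users`, `age`, `city`) and the helper functions are invented for this sketch, and nothing here is executed by Pig.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def group_rows(rows, key):
    """Collect rows into lists that share the same key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def adults_per_city(users):
    # Pig Latin: adults = FILTER users BY age >= 18;
    adults = filter_rows(users, lambda u: u["age"] >= 18)
    # Pig Latin: by_city = GROUP adults BY city;
    by_city = group_rows(adults, lambda u: u["city"])
    # Pig Latin: counts = FOREACH by_city GENERATE group, COUNT(adults);
    return {city: len(members) for city, members in by_city.items()}


if __name__ == "__main__":
    sample = [
        {"name": "ana", "city": "Toronto", "age": 34},
        {"name": "ben", "city": "Toronto", "age": 16},
        {"name": "cho", "city": "Ottawa", "age": 27},
    ]
    print(adults_per_city(sample))
```

In Pig, the three commented statements are the whole program; the compiler turns them into MapReduce jobs.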

Apache Hive

SQL queries in Hadoop:
– Uses Hive Query Language (HQL), very similar to SQL
– Highly scalable, both batch and real-time processing support
– Supports the standard SQL data types and most SQL commands

JDBC/ODBC driver and Hive Command Line:
– Java Database Connectivity (JDBC), Open Database Connectivity (ODBC)
– Used to establish a connection with the data storage

Developed by Facebook
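To show what "very similar to SQL" means in practice, the sketch below runs an ordinary SELECT / WHERE / GROUP BY query. It uses Python's built-in `sqlite3` module purely as a stand-in, since no Hive server is assumed to be available; the table `page_views` and its columns are hypothetical. Queries of this shape belong to the subset that HQL shares with standard SQL, while table creation and data loading differ in Hive.

```python
import sqlite3

# A plain aggregate query; the SELECT/WHERE/GROUP BY/ORDER BY part is
# the kind of statement that reads the same in SQL and in HQL.
QUERY = """
SELECT page, COUNT(*) AS views
FROM page_views
WHERE status = 200
GROUP BY page
ORDER BY page
"""


def views_per_page(rows):
    """Load (page, status) rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQLite-specific setup; Hive would define the table over files in HDFS.
        conn.execute("CREATE TABLE page_views (page TEXT, status INTEGER)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(views_per_page([("/home", 200), ("/home", 200), ("/about", 200), ("/home", 404)]))
```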

Apache Mahout

Machine Learning in Hadoop
– Provides built-in algorithms for machine learning problems
– Executed through a command line

Supported algorithms:
– Collaborative filtering: mines patterns and behaviors to make predictions and recommendations
    e.g. Amazon product recommendations
– Clustering: finds groups of similar data
    e.g. recommending groups in social media
– Classification: classifies and categorizes data
    e.g. identifying objects in image recognition
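As an illustration of collaborative filtering, here is a small item co-occurrence recommender in plain Python. It does not use Mahout and is not claimed to be Mahout's algorithm; the function names and the scoring rule (count how often an unseen item was bought together with items the user already has) are assumptions for this sketch.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(baskets):
    """Count, for every ordered item pair, how many users have both items."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Rank items the user lacks by co-occurrence with items the user has."""
    owned = set(baskets[user])
    scores = Counter()
    for (a, b), n in cooccurrence(baskets).items():
        if a in owned and b not in owned:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    purchases = {
        "u1": ["laptop", "mouse"],
        "u2": ["laptop", "mouse", "bag"],
        "u3": ["laptop", "bag"],
        "u4": ["mouse", "pad"],
    }
    print(recommend("u1", purchases))
```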

Apache Spark

Framework for real-time data analytics
– Executes in-memory computations for high-speed data processing (up to 100x faster than MapReduce)
– Written in Scala, but supports many languages

Contains high-level libraries, with processing based on DataFrames

Apache HBase

Non-relational distributed database (NoSQL)
– Supports all types of data (structured, semi-structured and unstructured)
– Provides fault tolerance and fast retrieval of data
– Open source, based on Google's BigTable

Runs on top of Hadoop, provides BigTable-like capabilities
– Written in Java
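The BigTable-style data model can be pictured as a sorted map from row key to column to timestamped versions of a value. The class below is a toy in-memory sketch of that idea, not the HBase client API; `ToyTable` and its methods are hypothetical names, and values are kept as raw bytes to mirror how such stores treat cell contents.

```python
class ToyTable:
    """In-memory sketch: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell; older versions are kept."""
        cell = self._rows.setdefault(row_key, {}).setdefault(column, {})
        cell[timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the newest version, or the newest one at or before timestamp."""
        versions = self._rows.get(row_key, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order within [start_row, stop_row)."""
        return [key for key in sorted(self._rows) if start_row <= key < stop_row]


if __name__ == "__main__":
    table = ToyTable()
    table.put(b"user#001", "info:email", b"a@old.example", timestamp=1)
    table.put(b"user#001", "info:email", b"a@new.example", timestamp=2)
    print(table.get(b"user#001", "info:email"))
```

Because rows are ordered by key, range scans over a key prefix are cheap, which is one reason row key design matters in this kind of store.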

Apache ZooKeeper, Oozie

ZooKeeper: Hadoop job coordination
– Coordination between different distributed Hadoop jobs/services
– Things like addresses, start-up/shutdown, configurations
– Used at Rackspace, Yahoo, eBay

Oozie: Hadoop clock/alarm
– Oozie Workflow: sequential actions to be performed
– Oozie Coordinator: triggers job execution when data is available
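The difference between an Oozie Workflow and an Oozie Coordinator can be sketched in a few lines. Real Oozie jobs are defined in XML and run on the cluster; the two functions below are hypothetical analogues that only mimic the control flow, and the dataset names are invented.

```python
def run_workflow(actions):
    """Workflow analogue: run actions strictly in order, stop at the first failure.

    Each action is a (name, callable) pair; the callable returns True on success.
    Returns the names of the actions that completed.
    """
    completed = []
    for name, action in actions:
        if not action():
            break
        completed.append(name)
    return completed


def coordinator_check(available_datasets, required_datasets, actions):
    """Coordinator analogue: start the workflow only when all input data exists.

    Returns the workflow result, or None if the data is not yet available.
    """
    if set(required_datasets) <= set(available_datasets):
        return run_workflow(actions)
    return None


if __name__ == "__main__":
    steps = [("ingest", lambda: True), ("aggregate", lambda: True), ("export", lambda: True)]
    print(coordinator_check({"logs/day-1"}, ["logs/day-1"], steps))
```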

Apache Flume, Sqoop

Flume: Unstructured data ingestion
– Handles the entry of data in the system
– Collects, aggregates and moves large amounts of data
– Handles real-time input streams

Sqoop: Import/export structured data
– Also handles data ingestion
– Moves data from RDBMS or enterprise data warehouses to HDFS or vice versa

Apache Solr & Lucene

Searching and indexing
– Used for different data search tasks
– Solr is the application, Lucene is the engine/kernel
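The core data structure behind this kind of search engine is the inverted index, which maps each term to the documents that contain it. The sketch below is a minimal plain-Python version for illustration only; it does not use Lucene or Solr, the function names are made up, and it skips the text analysis, scoring and on-disk storage that a real engine provides.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    documents = {
        1: "Hadoop distributed file system",
        2: "Spark in-memory data processing",
        3: "Hadoop data processing with MapReduce",
    }
    print(search(build_index(documents), "hadoop processing"))
```

Looking up a term is a single dictionary access, so query time depends on the number of matching documents rather than on the size of the whole collection.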

Apache Ambari

Managing the whole ecosystem

Hadoop cluster provisioning
– Step-by-step process for installing Hadoop on many hosts
– Handles Hadoop cluster configurations

Hadoop cluster management
– Provides a central management service for starting, stopping and reconfiguring Hadoop services

Hadoop cluster monitoring
– Dashboard for monitoring cluster health and status
– Ambari Alert Framework for notifying if something is wrong

Honorable mentions

Avro: data serialization (schemas defined in JSON)
Cassandra: reliable NoSQL distributed database
Cloudera: Hadoop environment management, commercial vendor
Chukwa: data collection system
Impala: analytic database
Kafka: Hadoop messaging
Tajo: robust relational and distributed big data warehouse
Tez: generalized data-flow programming framework

An example Hadoop system
[Figure: diagram of an example Hadoop system]

Thank you!

Based p://www.bmc.com/guides/hadoop-ecosystem.html

