WHITE PAPER
StackIQ Enterprise Hadoop: Enterprise Reference Architecture


StackIQ and Hortonworks worked together to bring you world-class reference configurations for Apache Hadoop clusters.

Contents
- The Need for Efficiency and Automation in Hadoop Deployments
- StackIQ Enterprise Hadoop
- Key Benefits of StackIQ Enterprise Hadoop
- Enterprise Hadoop Use Cases
- StackIQ Enterprise Hadoop Reference Architecture
- Summary
- For More Information
- About StackIQ

Abstract

As Big Data applications and the Big Infrastructure to support them have grown in popularity, the complexity of managing these solutions has become more apparent. Nowhere is this more apparent than with Apache Hadoop, the leading software framework for Big Data applications. Until now, one team of administrators has been responsible for installing and configuring the cluster hardware, networking components, software, and middleware in the foundation of a Hadoop cluster. Another team has been responsible for deploying and managing Hadoop software on the cluster infrastructure. These tasks have relied on a variety of tools – both new and legacy – to handle deployment, scalability, and management. Most of the tools require management of changes to configurations and other fixes through the writing of software scripts. Making changes to one appliance or several appliances entails a manual, time-intensive process that leaves the administrator uncertain as to whether the changes have been implemented throughout the application cluster. Using homogeneous, robust hardware/software solutions for Big Data applications from EMC, Teradata, Oracle, and others is another, very expensive, and more limited, alternative.

Now, however, a paradigm shift in the design, deployment, and management of Big Data applications is underway. For the first time in the industry, a best-of-breed Hadoop distribution has been combined with best-of-breed Hadoop and cluster management software. The result is StackIQ Enterprise Hadoop – a complete enterprise solution. Cost, efficiency, agility, and reliability make StackIQ Enterprise Hadoop unique. This white paper describes our reference architecture for Hadoop deployments.

© 2012 StackIQ, Inc. All rights reserved.

The Need for Efficiency and Automation in Hadoop Deployments

The Apache Hadoop software framework has become the leading solution for massive, data-intensive, distributed applications. More mature than other solutions, it has also proven to be better at scaling; more useful, flexible, and affordable as a generic rather than proprietary data platform; excellent at handling structured and unstructured data; and its many connector products have broadened its use beyond other software frameworks used to handle Big Data applications.

The growing popularity of Hadoop has also put a spotlight on its shortcomings – specifically, the complexity of deploying and managing Hadoop infrastructure. Early adopters of Hadoop have found that they lack the tools, processes, and procedures to deploy and manage it efficiently. IT organizations are facing challenges in coordinating the rich clustered infrastructure necessary for enterprise-grade Hadoop deployments.

The current generation of Hadoop products was designed for IT environments in which different groups of skilled personnel are required to deploy them. One group of IT professionals installs and configures the cluster hardware, networking components, software, and middleware that form the foundation of a Hadoop cluster. Another group is responsible for deploying the Hadoop software as part of the cluster infrastructure. Most cluster management products focus on the upper layers of the cluster (i.e., Hadoop products, including the Hadoop Distributed File System, MapReduce, Pig, Hive, HBase, and ZooKeeper), leaving the installation and maintenance of the underlying server cluster to other solutions. Thus the overall Hadoop infrastructure is deployed and managed by a collection of disparate products, policies, and procedures, which can lead to unpredictable and unreliable clusters.

Combining the leading Apache Hadoop software stack with the leading cluster management solution, StackIQ has engineered a revolutionary new solution that makes Hadoop deployments of all sizes much faster, less costly, more reliable, and more flexible.

StackIQ Enterprise Hadoop optimizes and automates the deployment and management of underlying cluster infrastructures of any size while also providing a massively scalable, open source Hadoop platform for storing, processing, and analyzing large data volumes.

With StackIQ Enterprise Hadoop, physical or virtual Hadoop clusters can be quickly provisioned, deployed, monitored, and managed. System administrators can manage the entire system using a single pane of glass. New nodes are also configured automatically from bare metal—with a single command—without the need for complex administrator assistance. If a node needs an update, it will be completely re-provisioned by the system to ensure it boots into a known good state. Since StackIQ Enterprise Hadoop places every bit on every node, administrators have complete control and consistency across the entire infrastructure. Now administrators have the integrated, holistic Hadoop tools and control they need to more easily and swiftly meet their enterprise Big Data application requirements.

StackIQ Enterprise Hadoop

StackIQ Enterprise Hadoop is a complete, integrated Hadoop solution for enterprise customers. For the first time, enterprises get everything they need to deploy and manage Hadoop clusters throughout the entire operational lifecycle in one product (Figure 1). StackIQ Enterprise Hadoop includes:

Hortonworks Data Platform, powered by Apache Hadoop, is an open-source, massively scalable, highly stable and extensible platform based on the most popular and essential Hadoop projects for storing, processing, and analyzing large volumes of structured and unstructured data. Hortonworks Data Platform makes it easier than ever to integrate Apache Hadoop into existing data architectures. Highly recommended for anyone who has encountered difficulties installing and integrating Hadoop projects downloaded directly from Apache, Hortonworks Data Platform is also ideal for solution providers wanting to integrate or extend their solutions for Apache Hadoop.

The platform includes HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper, along with open source technologies that make the Hadoop platform more manageable, open, and extensible. These include HCatalog, a metadata management service for simplifying data sharing between Hadoop and other enterprise information systems, and a complete set of open APIs such as WebHDFS to make it easier for ISVs to integrate and extend Hadoop.

Hortonworks has contributed more than 80% of the code in Apache Hadoop to date and is the main driving force behind the next generation of the software. The team has supported the world's largest Hadoop deployment, featuring more than 42,000 servers. Competitive products offer altered, non-standard versions of Hadoop, often complicating integration with other systems and data sources. Hortonworks is the only platform that is completely consistent with the open source version.

[Figure 1. StackIQ Enterprise Hadoop Components. A management server runs the StackIQ Cluster Manager and StackIQ Hadoop Manager (deployment, configuration, monitoring, and management) on RHEL or CentOS Linux on a commodity server (physical or virtual). Hadoop nodes run the Hortonworks Data Platform stack of HDFS (storage), MapReduce and HCatalog (processing), HBase, and Hive on the same operating system and commodity hardware.]

StackIQ Hadoop Manager manages the day-to-day operation of the Hadoop software running in the clusters, including configuring, launching, and monitoring HDFS, MapReduce, ZooKeeper, HBase, and Hive. A unified single pane of glass—with a command line interface (CLI) or graphical user interface (GUI)—is used to control and monitor all of these, as well as manage the infrastructure components in the cluster.

Easy to use, the StackIQ Hadoop Manager allows for the deployment of Hadoop clusters of all shapes and sizes (including heterogeneous hardware support, parallel disk formatting, and multi-distribution support). Typically, the installation and management of a Hadoop cluster has required a long, manual process. The end user or deployment team has had to install and configure each component of the software stack by hand, causing the setup time for such systems and the ongoing management to be problematic and time-intensive, with security and reliability implications. StackIQ Enterprise Hadoop completely automates the process.

StackIQ Cluster Manager manages all of the software that sits between bare metal and a cluster application, such as Hadoop. A dynamic database contains all of the configuration parameters for an entire cluster. This database is used to drive machine configuration, software deployment (using a unique Avalanche peer-to-peer installer), management, and monitoring. Regarding specific features, the Cluster Manager:

- Provisions and manages the operating system from bare metal, capturing networking information (such as MAC addresses)
- Configures host-based network settings throughout the cluster
- Captures hardware resource information (such as CPU and memory information) and uses this information to set cluster application parameters (see the sketch below)
- Captures disk information and uses this information to programmatically partition disks across the cluster
- Installs and configures a cluster monitoring system
- Provides a unified interface (CLI and GUI) to control and monitor all of this

The StackIQ Cluster Manager for Hadoop is based on StackIQ's open source Linux cluster provisioning and management solution, Rocks, originally developed in 2000 by researchers at the San Diego Supercomputer Center at the University of California, San Diego. Rocks was initially designed to enable end users to easily, quickly, and cost-effectively build, manage, and scale application clusters for High Performance Computing (HPC). Thousands of environments around the world now use Rocks.

In StackIQ Enterprise Hadoop, the Cluster Manager's capabilities have been expanded to handle not only the underlying infrastructure but also the day-to-day operation of the Hadoop software running in the cluster. Other competing products fail to integrate the management of the hardware cluster with the Hadoop software stack. By contrast, StackIQ Enterprise Hadoop operates from a continually updated, dynamic database populated with site-specific information on both the underlying cluster infrastructure and running Hadoop services. The product includes everything from the operating system on up, packaging CentOS Linux or Red Hat Enterprise Linux, cluster management middleware, libraries, compilers, and monitoring tools.
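The third and fourth bullets above describe deriving cluster application parameters from captured hardware facts. The short Python sketch below illustrates that idea under stated assumptions: the inventory format, tuning rules, and mount-point layout are hypothetical stand-ins, not StackIQ's actual schema or code; only the Hadoop 1.x property names are standard.

```python
#!/usr/bin/env python
# Illustrative sketch only: derive Hadoop 1.x settings from hardware facts
# captured in a cluster database. The inventory and rules are hypothetical.

# Hypothetical per-host facts, as a provisioning system might capture them.
INVENTORY = {
    "data-node-0": {"cores": 12, "ram_gb": 96, "disks": 12},
    "data-node-1": {"cores": 12, "ram_gb": 96, "disks": 12},
}

def hadoop_params(facts):
    """Map one host's hardware facts to Hadoop 1.x tuning parameters."""
    # Rule of thumb: about one map slot per core, half as many reduce slots.
    map_slots = facts["cores"]
    reduce_slots = max(1, facts["cores"] // 2)
    # Give each task JVM an equal share of half the RAM; the rest is left
    # for the OS and the DataNode/TaskTracker daemons.
    heap_mb = facts["ram_gb"] * 1024 // (2 * (map_slots + reduce_slots))
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        "mapred.child.java.opts": "-Xmx%dm" % heap_mb,
        # One dfs.data.dir entry per data disk (mount points are illustrative).
        "dfs.data.dir": ",".join("/data%d/dfs" % i
                                 for i in range(facts["disks"])),
    }

for host in sorted(INVENTORY):
    print(host, hadoop_params(INVENTORY[host]))
```

Because every host's facts live in one database, changing a rule changes it consistently for every node the next time it is re-provisioned.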

Key Benefits of StackIQ Enterprise Hadoop
- The first complete, integrated Hadoop solution for the enterprise
- Faster time to deployment
- Automated, consistent, dependable deployment and management
- Simplified operation that can be quickly learned without systems administration experience
- Reduced downtime due to configuration errors
- Reduced total cost of ownership for Hadoop clusters

Enterprise Hadoop Use Cases

Hadoop enables organizations to move large volumes of complex and relational data into a single repository where raw data is always available. With its low-cost, commodity servers and storage repositories, Hadoop enables this data to be affordably stored and retrieved for a wide variety of analytic applications that can help organizations increase revenues by extracting value such as strategic insights, solutions to challenges, and ideas for new products and services. By breaking up Big Data into multiple parts, Hadoop allows for the processing and analysis of each part simultaneously on server clusters, greatly increasing the efficiency and speed of queries; a minimal word-count example after the list below makes this model concrete.

The use cases for Hadoop are many and varied, spanning disciplines from public health, stock and commodities trading, and sales and marketing to product development and scientific research. For the business enterprise, Hadoop use cases include:

- Data Processing: Hadoop allows IT departments to extract, transform, and load (ETL) data from source systems and to transfer data stored in Hadoop to and from a database management system for the performance of advanced analytics; it is also used for the batch processing of large quantities of unstructured and semi-structured data.
- Network Management: Hadoop can be used to capture, analyze, and display data collected from servers, storage devices, and other IT hardware to allow administrators to monitor network activity and diagnose bottlenecks and other issues.
- Customer Influencer Analysis: Social networking data can be mined to determine which customers have the most influence over others within social networks; this helps enterprises determine which are their most important and influential customers.
- Analyzing Customer Experience: Hadoop can be used to integrate data from previously siloed customer interaction channels (e.g., online chat, blogs, call centers) to gain a complete view of the customer experience; this enables enterprises to understand the impact of one customer interaction channel on another in order to optimize the entire customer lifecycle experience.
- Financial Risk Modeling: Financial firms, banks, and others use Hadoop and data warehouses for the analysis of large volumes of transactional data in order to determine risk and exposure of financial assets, prepare for potential "what-if" scenarios based on simulated market behavior, and score potential clients for risk.
- Marketing Campaign Analysis: Marketing departments across industries have long used technology to monitor and determine the effectiveness of marketing campaigns; Big Data allows marketing teams to incorporate higher volumes of increasingly granular data, like click-stream data and call detail records, to increase the accuracy of analysis.
- Retail Fraud: Through monitoring, modeling, and analyzing high volumes of data from transactions and extracting features and patterns, retailers can prevent credit card account fraud.
- Recommendation Engine: Web 2.0 companies can use Hadoop to match and recommend users to one another or to products and services based on analysis of user profile and behavioral data.
- Opinion Mining: Used in conjunction with Hadoop, advanced text analytics tools analyze the unstructured text of social media and social networking posts, including Tweets and Facebook posts, to determine the user sentiment related to particular companies, brands, or products; the focus of this analysis can range from the macro level down to the individual user.
- Research and Development: Enterprises like pharmaceutical manufacturers use Hadoop to comb through enormous volumes of text-based research and other historical data to assist in the development of new products.
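To make the split-and-process model above concrete, here is a minimal word-count mapper and reducer in the Hadoop Streaming style. It is a generic MapReduce illustration rather than anything StackIQ-specific; the final block is a local stand-in for the sort-and-shuffle step that Hadoop performs between the two phases.

```python
#!/usr/bin/env python
# Minimal word count in the Hadoop Streaming style (generic illustration).
import sys
from itertools import groupby

def mapper(lines):
    # Each mapper sees one split of the input and emits (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive grouped by key; sum the counts for each word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for Hadoop's shuffle: sort the mapper output by key.
    # On a cluster, the framework does this between the phases and runs
    # many mapper processes at once, one per input split.
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print("%s\t%d" % (word, total))
```

Run locally with `cat input.txt | python wordcount.py`. On a cluster, Hadoop Streaming launches the mapper on the Data Nodes holding splits of the input, which is why adding Data Nodes speeds up the job.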

StackIQ Enterprise Hadoop Reference Architecture

Table 1 shows the StackIQ Enterprise Hadoop reference architecture hardware, using Dell PowerEdge servers. Using 3 TB drives in 18 data nodes in a single rack, this configuration represents 648 TB of raw storage. Applying HDFS's standard replication factor of 3 yields 216 TB of usable storage.

Table 1. StackIQ Enterprise Hadoop Reference Architecture (Hardware)

                      PowerEdge R410                    PowerEdge R720xd
  Role                Name Node, Secondary Name Node    Data Node
  CPU                 2 x E5620 (4-core)                2 x E5-2640 (6-core)
  RAM                 16 GB                             96 GB
  Disk                2 x 1 TB SATA 3.5"                12 x 3 TB SATA 3.5"
  Storage controller                                    PERC H710
  RAID                RAID 1                            None
  Minimum per pod     1 each                            3*

  * Based on HDFS's standard replication factor of 3

Network (per rack): 1 x Dell PowerConnect 5524 switch, 24 ports, 1 Gb Ethernet
Network (rack interconnection in multi-rack configurations): 1 x Dell PowerConnect 8024F 10 Gb Ethernet switch
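As a quick check of the storage figures quoted above, the arithmetic follows directly from the drive counts in Table 1; the snippet below simply reproduces it.

```python
# Quick check of the storage arithmetic in Table 1 (values from the table).
data_nodes = 18        # data nodes in a fully populated single rack
disks_per_node = 12    # 3.5" SATA drives per PowerEdge R720xd
tb_per_disk = 3        # 3 TB drives
replication = 3        # HDFS standard replication factor

raw_tb = data_nodes * disks_per_node * tb_per_disk   # 18 * 12 * 3 = 648 TB
usable_tb = raw_tb // replication                    # 648 / 3 = 216 TB
print("raw: %d TB, usable: %d TB" % (raw_tb, usable_tb))
```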

Table 2 shows the software components of the StackIQ Enterprise Hadoop reference architecture. The management node is installed with the StackIQ Enterprise Hadoop management software, which automatically installs and configures Hortonworks Data Platform software on the Name Node, Secondary Name Node, and all Data Nodes.

Rolls are pre-packaged software modules that integrate software components for site-specific requirements. They may be selected and automatically configured in StackIQ Enterprise Hadoop and are available from StackIQ at http://www.stackiq.com/download/.

Table 2. StackIQ Enterprise Hadoop Reference Architecture (Software)

  StackIQ Enterprise Hadoop 1.0 ISO Contents
  Hadoop Roll       Hortonworks Data Platform 1.0
  Base Roll         Rocks 6.0.2 Base, Command Line Interface (CLI)
  Kernel Roll       Installation support for latest x86 chipsets
  Core Roll         Rocks 6.0.2 Core, GUI
  OS Roll           CentOS 6.2
  Ganglia Roll      Cluster monitoring
  Web Server Roll   Apache Web Server and WordPress

Single Rack Configuration

In the single rack configuration, there is one Cluster Manager Node, one Name Node, and one Secondary Name Node. This configuration may include between one and 18 Data Nodes, depending upon how much storage is needed for the cluster. The top-of-rack switch connects all of the nodes using Gigabit Ethernet. A sample single-rack configuration of StackIQ Enterprise Hadoop is shown in Figure 2.

[Figure 2. Single Rack Configuration: a top-of-rack switch connects the cluster manager, name node, secondary name node, and up to 18 data nodes over 1 Gigabit Ethernet.]

Multi-Rack Configuration

More racks may be added to build a multi-rack configuration. Each rack may contain between one and 20 Data Nodes, depending upon how much storage is needed for the cluster. A multiport 10 GbE switch should be added to the second rack, with all of the top-of-rack switches connected to it via one of their 10 GbE ports. For simplicity, a step-and-repeat layout is shown in the multi-rack sample configuration in Figure 3.

[Figure 3. Multi-Rack Configuration: each rack's top-of-rack switch connects its nodes over 1 Gigabit Ethernet, and the top-of-rack switches connect to a 10 Gigabit Ethernet interconnect switch. The first rack also houses the cluster manager, name node, and secondary name node.]
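A standard housekeeping step in a layout like Figure 3 (general Hadoop practice, not something this white paper spells out) is to tell HDFS which rack each node occupies so that block replicas are spread across racks. In Hadoop 1.x this is done with an administrator-supplied script named by the topology.script.file.name property; the sketch below is a minimal example, using a hypothetical convention in which the third octet of a node's IP address identifies its rack.

```python
#!/usr/bin/env python
# Minimal HDFS rack-awareness script for a multi-rack layout like Figure 3.
# Hadoop invokes the script with one or more IP addresses or hostnames as
# arguments and expects one rack path per argument on stdout. The
# subnet-to-rack mapping below is hypothetical; real sites supply their own.
import sys

RACKS = {"10.1.1": "/rack1", "10.1.2": "/rack2", "10.1.3": "/rack3"}

def rack_for(node):
    prefix = ".".join(node.split(".")[:3])   # e.g. "10.1.2" from "10.1.2.47"
    return RACKS.get(prefix, "/default-rack")

if __name__ == "__main__":
    print(" ".join(rack_for(arg) for arg in sys.argv[1:]))
```

With that mapping in place, HDFS's default placement policy keeps one replica on the writer's rack and the remaining replicas on a different rack, so losing an entire rack never loses a block; the 10 GbE interconnect carries the resulting cross-rack replication traffic.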

Summary

As the leading software framework for massive, data-intensive, distributed applications, Apache Hadoop has gained tremendous popularity, but the complexity of deploying and managing Hadoop server clusters has become apparent. Early adopters of Hadoop moving from proofs-of-concept in labs to full-scale deployment are finding that they lack the tools, processes, and procedures to deploy and manage these systems efficiently. For reliable, predictable, simplified, automated Hadoop enterprise deployments, StackIQ has created StackIQ Enterprise Hadoop. This powerful, holistic, simplified tool for Hadoop deployment and management combines the leading Apache Hadoop software stack with the leading cluster management solution. StackIQ Enterprise Hadoop makes it easy to deploy and manage consistent Hadoop installations of all sizes, and its automation, powerful features, and ease of use lower the total cost of ownership of Big Data systems.

For More Information

- StackIQ white paper, "Optimizing Data Centers for Big Infrastructure Applications": bit.ly/N4haaL
- Intel Cloud Buyers Guide to Cloud Design and Deployment on Intel Platforms: bit.ly/L3xXWI
- Hadoop Training and ng/
- Why Apache Hadoop?: hortonworks.com/why-hadoop/

About StackIQ

StackIQ is a leading provider of Big Infrastructure management software for clusters and clouds. Based on open-source Rocks cluster software, StackIQ's Rocks product simplifies the deployment and management of highly scalable systems. StackIQ is based in La Jolla, California, adjacent to the University of California, San Diego, where the open-source Rocks Group was founded. Rocks includes software developed by the Rocks Cluster Group at the San Diego Supercomputer Center at the University of California, San Diego, and its contributors. Rocks is a registered trademark of the Regents of the University of California.

4225 Executive Square, Suite 1000
La Jolla, CA 92037
858.380.2020
info@stackIQ.com

StackIQ and the StackIQ Logo are trademarks of StackIQ, Inc. in the U.S. and other countries. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between StackIQ and any other company.

WP.REF-ARH.v1

