Processing Big Data with Hadoop in Azure HDInsight

Lab 1 - Getting Started with HDInsight

Overview
In this lab, you will provision an HDInsight cluster. You will then run a sample MapReduce job on the cluster and view the results.

What You’ll Need
To complete the labs, you will need the following:
- A web browser
- A Microsoft account
- A Microsoft Azure subscription
- A Microsoft Windows, Linux, or Apple Mac OS X computer on which the Azure CLI has been installed
- The lab files for this course

Note: To set up the required environment for the lab, follow the instructions in the Setup document for this course.

Provisioning and Configuring an HDInsight Cluster
The first task you must perform is to provision an HDInsight cluster.

Note: The Microsoft Azure portal is continually improved in response to customer feedback. The steps in this exercise reflect the user interface of the Microsoft Azure portal at the time of writing, but may not match the latest design of the portal exactly.

Provision an HDInsight Cluster
1. In a web browser, navigate to http://portal.azure.com. If prompted, sign in using the Microsoft account that is associated with your Azure subscription.
2. In the Microsoft Azure portal, click All resources, and verify that there are no existing HDInsight clusters in your subscription.
3. In the menu (on the left edge), click New (indicated by a +), and in the Data Analytics category, click HDInsight. Then use the New HDInsight Cluster blade to create a new cluster with the following settings:

- Cluster Name: Enter a unique name (and make a note of it!)
- Subscription: Select your Azure subscription
- Cluster Type:
  - Cluster Type: Hadoop
  - Operating System: Linux
  - Version: Choose the latest version of Hadoop available
  - Cluster Tier: Standard
- Cluster Login Username: Enter a user name of your choice (and make a note of it!)
- Cluster Login Password: Enter a strong password (and make a note of it!)
- SSH Username: Enter another user name of your choice (and make a note of it!)
- SSH Password: Use the same password as the cluster login password
- Resource Group: Create a new resource group: Enter a unique name (and make a note of it!)
- Location: Choose any available data center location
- Storage:
  - Primary storage type: Azure Storage
  - Selection Method: My Subscriptions
  - Create a new storage account: Enter a unique name consisting of lower-case letters and numbers only (and make a note of it!)
  - Default Container: Enter the cluster name you specified previously
- Applications: None
- Cluster Size:
  - Number of Worker nodes: 1
  - Worker Node Size: View all and choose the smallest available size
  - Head Node Size: View all and choose the smallest available size
- Advanced Settings: None

4. After you have clicked Create, wait for the cluster to be provisioned and the status to show as Running (this can take a while, so now is a good time for a coffee break!)

Note: As soon as an HDInsight cluster is running, the credit in your Azure subscription will start to be charged. Free-trial subscriptions include a limited amount of credit that you can spend over a period of 30 days, which should be enough to complete the labs in this course as long as clusters are deleted when not in use. If you decide not to complete this lab, follow the instructions in the Clean Up procedure at the end of the lab to delete your cluster to avoid using your Azure credit unnecessarily.

View Cluster Configuration in the Azure Portal
1. In the Microsoft Azure portal, browse your resources and select your cluster. Then on the HDInsight Cluster blade, view the summary information for your cluster.
2. On the HDInsight Cluster blade, click Scale Cluster, and note that you can dynamically scale the number of worker nodes to meet processing demand.

View the Cluster Dashboard
1. On the HDInsight Cluster blade, click Dashboard, and when prompted, log in using the cluster login username and password you specified when provisioning the cluster (be careful to use the cluster login username, not the SSH username).
2. Explore the dashboard for your cluster. The dashboard is an Ambari web application in which you can view and configure settings for the Hadoop services running in the cluster. When you are finished, close its tab and return to the Azure portal tab.
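If you prefer scripting to clicking through blades, the provisioning above can also be sketched with the cross-platform Azure CLI listed in the prerequisites. Treat this as a minimal sketch rather than the lab's prescribed method: the az hdinsight command group and the parameter names below reflect recent CLI releases and may differ from the CLI version current when this lab was written, and every resource name shown is a hypothetical placeholder.

# Create a resource group for the cluster and its storage account
# (all names and the location here are placeholders - substitute your own)
az group create --name hdilab-rg --location eastus

# Provision a minimal Linux-based Hadoop cluster; verify parameter names
# for your CLI version with: az hdinsight create --help
az hdinsight create --name myhdicluster0123 --resource-group hdilab-rg --type hadoop --http-password 'StrongPassword123!' --ssh-password 'StrongPassword123!' --workernode-count 1 --storage-account myhdistorage0123

# The portal's Scale Cluster blade has a CLI counterpart as well; again,
# confirm the exact parameter name with: az hdinsight resize --help
az hdinsight resize --name myhdicluster0123 --resource-group hdilab-rg --workernode-count 2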

Connecting to an HDInsight Cluster
Now that you have provisioned an HDInsight cluster, you can connect to it and process data for analysis.

If you are using a Windows client computer:
1. In the Microsoft Azure portal, on the HDInsight Cluster blade for your HDInsight cluster, click Secure Shell, and then in the Secure Shell blade, in the hostname list, note the Host name for your cluster (which should be your cluster name-ssh.azurehdinsight.net).
2. Open PuTTY, and in the Session page, enter the host name into the Host Name box. Then under Connection type, select SSH and click Open.
3. If a security warning that the host certificate cannot be verified is displayed, click Yes to continue.
4. When prompted, enter the SSH username and password you specified when provisioning the cluster (not the cluster login username).

If you are using a Mac OS X or Linux client computer:
1. In the Microsoft Azure portal, on the HDInsight Cluster blade for your HDInsight cluster, click Secure Shell, and then in the Secure Shell blade, in the hostname list, select the hostname for your cluster. Then copy the ssh command that is displayed, which should resemble the following command – you will use this to connect to the head node.

ssh sshuser@your cluster name-ssh.azurehdinsight.net

2. Open a new terminal session, and paste the ssh command, specifying your SSH user name (not the cluster login username).
3. If you are prompted to connect even though the certificate can’t be verified, enter yes.
4. When prompted, enter the password for the SSH username.

Note: If you have previously connected to a cluster with the same name, the certificate for the older cluster will still be stored, and a connection may be denied because the new certificate does not match the stored certificate. You can delete the old certificate by using the ssh-keygen command, specifying the path of your certificate file (-f) and the host record to be removed (-R) - for example:

ssh-keygen -f "/home/usr/.ssh/known_hosts" -R clstr-ssh.azurehdinsight.net

Browse Cluster Storage
Now that you have opened an SSH console for your cluster, you can use it to work with the cluster's shared storage system. Hadoop uses a file system named HDFS, which in Azure HDInsight clusters is implemented as a blob container in Azure Storage.

Note: The commands in this procedure are case-sensitive.

1. In the SSH console, enter the following command to view the contents of the root folder in the HDFS file system.
hdfs dfs -ls /
2. Enter the following command to view the contents of the /example folder in the HDFS file system. This folder contains subfolders for sample apps, data, and JAR components.
hdfs dfs -ls /example
3. Enter the following command to view the contents of the /example/data/gutenberg folder, which contains sample text files:
hdfs dfs -ls /example/data/gutenberg
4. Enter the following command to view the text in the davinci.txt file:
hdfs dfs -text /example/data/gutenberg/davinci.txt
5. Note that the file contains a large volume of unstructured text.

Run a MapReduce Job
Hadoop uses MapReduce jobs to distribute the processing of data across nodes in the cluster. Each job is divided into a map phase, during which one or more mappers split the data into key/value pairs, and a reduce phase, during which one or more reducers process the values for each key. (A minimal sketch of this map/reduce pattern appears after this procedure.)

1. Enter the following command to view the sample Java jars stored in the cluster head node:
ls /usr/hdp/current/hadoop-mapreduce-client
2. Enter the following command on a single line to get a list of MapReduce functions in the hadoop-mapreduce-examples.jar:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
3. Enter the following command on a single line to get help for the wordcount function in the hadoop-mapreduce-examples.jar that is stored in the cluster head:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
4. Enter the following command on a single line to run a MapReduce job using the wordcount function in the hadoop-mapreduce-examples.jar jar to process the davinci.txt file you viewed earlier and store the results of the job in the /example/results folder:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/results
5. Wait for the MapReduce job to complete, and then enter the following command to view the output folder, and note that a file named part-r-00000 has been created by the job.
hdfs dfs -ls /example/results
6. Enter the following command to view the results in the output file:
hdfs dfs -text /example/results/part-r-00000
7. Minimize the SSH console window. Then proceed to the next exercise.
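To make the map and reduce phases concrete, here is a minimal word-count sketch written in the style of Hadoop Streaming. It is illustrative only - the lab's wordcount function is a Java class inside hadoop-mapreduce-examples.jar, not this script - and the file names mapper.py and reducer.py are hypothetical. The mapper emits a (word, 1) pair for every word; Hadoop sorts the pairs by key between the two phases, so the reducer can keep a running total for each run of identical words.

# mapper.py - map phase: emit a (word, 1) key/value pair for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py - reduce phase: input arrives sorted by key, so a running
# total can be kept for each run of identical words
import sys

current_word = None
count = 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word = word
        count = 0
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

Because Hadoop sorts mapper output by key before the reduce phase, you can imitate a whole job locally on any text file, with the sort utility standing in for Hadoop's shuffle:

python mapper.py < sometext.txt | sort | python reducer.py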

Uploading and Processing Data Files
In the previous exercise, you ran a MapReduce job on a sample file that is provided with HDInsight. In this exercise, you will use Azure Storage Explorer to upload data to the Azure blob store for processing with Hadoop, and then download the results for analysis on your local computer.

Upload a File to Azure Blob Storage
1. View the contents of the HDILabs\Lab01\reviews folder where you extracted the lab files for this course, and verify that this folder contains a file named reviews.txt. This file contains product review text from a hypothetical web site on which cycles and cycling accessories are sold.
2. Start Azure Storage Explorer, and if you are not already signed in, sign into your Azure subscription.
3. Expand your storage account and the Blob Containers folder, and then double-click the blob container for your HDInsight cluster.
4. In the Upload drop-down list, click Folder. Then upload the reviews folder as a block blob to a new folder named reviews in the root of the container.

Process the Uploaded Data
1. Switch to the SSH console for your HDInsight cluster, and enter the following command on a single line to run a MapReduce job using the wordcount function in the hadoop-mapreduce-examples.jar jar to process the reviews.txt file you uploaded and store the results of the job in the /reviews/results folder:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /reviews/reviews.txt /reviews/results
2. Wait for the MapReduce job to complete, and then enter the following command to view the output folder, and verify that a file named part-r-00000 has been created by the job.
hdfs dfs -ls /reviews/results

Download the Results
1. Switch back to Azure Storage Explorer, and browse to the reviews/results folder (you may need to refresh the root folder to see the reviews folder).
2. Double-click the part-r-00000 text file to download it and open it in a text editor, and view the word counts for the review data (the file is tab-delimited, and if you prefer, you can open it using a spreadsheet application such as Microsoft Excel; a short script for inspecting it programmatically follows this procedure).
3. Close the part-r-00000 file, all command windows, and Azure Storage Explorer.
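If you would rather inspect the downloaded results programmatically than in a spreadsheet, a short script along the following lines lists the most frequent words. It is illustrative only; the file name assumes you saved part-r-00000 to the script's working directory.

# top_words.py - print the ten most frequent words from a downloaded
# wordcount output file (one tab-delimited word<TAB>count pair per line)
counts = []
with open("part-r-00000", encoding="utf-8") as f:
    for line in f:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep:  # skip any line without a tab
            counts.append((int(count), word))

# Sort by count, descending, and show the top ten
for count, word in sorted(counts, reverse=True)[:10]:
    print(str(count) + "\t" + word)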

Clean Up
Now that you have finished this lab, you can delete the HDInsight cluster and storage account. This ensures that you avoid being charged for cluster resources when you are not using them. If you are using a trial Azure subscription that includes a limited free credit value, deleting the cluster maximizes your credit and helps to prevent using it all before the free trial period has ended.

Note: If you are proceeding straight to the next lab, omit this task and use the same cluster in the next lab. Otherwise, follow the steps below to delete your cluster and storage account.

Delete Cluster Resources
1. In the Microsoft Azure portal, click Resource Groups.
2. On the Resource groups blade, click the resource group that contains your HDInsight cluster, and then on the Resource group blade, click Delete. On the confirmation blade, type the name of your resource group, and click Delete.
3. Wait for your resource group to be deleted, and then click All Resources, and verify that the cluster and the storage account that was created with your cluster have both been removed.
4. Close the browser.
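If you provisioned from the command line using the earlier CLI sketch, you can clean up the same way: deleting the resource group removes the cluster and the storage account in one step. The group name below is the hypothetical placeholder used in that sketch.

# Delete the resource group and everything in it (cluster and storage account)
az group delete --name hdilab-rg --yes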
