Hadoop - RIP Tutorial



Table of Contents

About
Chapter 1: Getting started with hadoop
    Remarks
        What is Apache Hadoop?
        Apache Hadoop includes these modules:
        Reference:
    Versions
    Examples
        Installation or Setup on Linux
        Installation of Hadoop on ubuntu
            Creating Hadoop User
            Adding a user
            Configuring SSH
            Add hadoop user to sudoers list
            Disabling IPv6
            Installing Hadoop
        Hadoop overview and HDFS
Chapter 2: Debugging Hadoop MR Java code in local eclipse dev environment
    Introduction
    Remarks
    Examples
        Steps for configuration
Chapter 3: Hadoop commands
    Syntax
    Examples
        Hadoop v1 Commands
            1. Print the Hadoop version
            2. List the contents of the root directory in HDFS
            3. Report the amount of space used and available on the currently mounted filesystem
            4. Count the number of directories, files and bytes under the paths that match the specified file pattern
            5. Run a DFS filesystem checking utility
            6. Run a cluster balancing utility
            7. Create a new directory named "hadoop" below the /user/training directory in HDFS
            8. Add a sample text file from the local directory named "data" to the new directory in HDFS
            9. List the contents of this new directory in HDFS
            10. Add the entire local directory called "retail" to the /user/training directory in HDFS
            11. List your home directory in HDFS
            12. See how much space this directory occupies in HDFS
            13. Delete a file 'customers' from the "retail" directory
            14. Ensure this file is no longer in HDFS
            15. Delete all files from the "retail" directory using a wildcard
            16. To empty the trash
            17. Remove the entire retail directory and all of its contents in HDFS
            18. List the hadoop directory again
            19. Add the purchases.txt file from the local directory "/home/training/" to the hadoop directory in HDFS
            20. View the contents of the text file purchases.txt in your hadoop directory
            21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the local directory "data"
            22. cp is used to copy files between directories present in HDFS
            23. '-get' command can be used alternatively to '-copyToLocal' command
            24. Display the last kilobyte of the file "purchases.txt" to stdout
            25. Use '-chmod' command to change permissions of a file
            26. Use '-chown' to change owner name and group name simultaneously
            27. Use '-chgrp' command to change group name
            28. Move a directory from one location to another
            29. Use '-setrep' command to change the replication factor of a file
            30. Copy a directory from one node in the cluster to another with '-distcp'
            31. Command to make the name node leave safe mode
            32. List all the hadoop file system shell commands
            33. Get hdfs quota values and the current count of names and bytes in use
            34. Last but not least, always ask for help!
        Hadoop v2 Commands
Chapter 4: Hadoop load data
    Examples
        Load data into hadoop hdfs
            hadoop fs -mkdir
            hadoop fs -put
            hadoop fs -copyFromLocal
            hadoop fs
Chapter 5: hue
    Introduction
    Examples
        Setup process
        Installation Dependencies
        Hue Installation in Ubuntu
Chapter 6: Introduction to MapReduce
    Syntax
    Remarks
    Examples
        Word Count Program (in Java & Python)
Chapter 7: What is HDFS?
    Remarks
    Examples
        HDFS - Hadoop Distributed File System
        Finding files in HDFS
        Blocks and Splits HDFS
Credits

About

You can share this PDF with anyone you feel could benefit from it; the latest version can be downloaded from: hadoop

This is an unofficial and free hadoop ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow nor official hadoop. The content is released under Creative Commons BY-SA, and the list of contributors to each chapter is provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct nor accurate. Please send your feedback and corrections to info@zzzprojects.com

Chapter 1: Getting started with hadoop

Remarks

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop includes these modules:

- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Reference:

Apache Hadoop

Versions

Version        Release Notes         Release Date
3.0.0-alpha1                         2016-08-30
2.7.3          Click here - 2.7.3    2016-08-25
2.6.4          Click here - 2.6.4    2016-02-11
2.7.2          Click here - 2.7.2    2016-01-25
2.6.3          Click here - 2.6.3    2015-12-17
2.6.2          Click here - 2.6.2    2015-10-28
2.7.1          Click here - 2.7.1    2015-07-06

Examples

Installation or Setup on Linux

A Pseudo Distributed Cluster Setup Procedure

Prerequisites

- Install JDK 1.7 and set the JAVA_HOME environment variable.
- Create a new user "hadoop":

useradd hadoop

- Set up password-less SSH login to its own account:

su - hadoop
ssh-keygen
(press ENTER for all prompts)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Verify by performing:

ssh localhost

- Disable IPv6 by editing /etc/sysctl.conf with the following:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Check that using:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

(should return 1)

Installation and Configuration:

- Download the required version of Hadoop from the Apache archives using the wget command:

cd /opt/hadoop/
wget http://addresstoarchive/hadoop-2.x.x/xxxxx.gz
tar -xvf hadoop-2.x.x.gz
mv hadoop-2.x.x hadoop
(or)
ln -s hadoop-2.x.x hadoop
chown -R hadoop:hadoop hadoop

- Update .bashrc/.kshrc based on your shell with the environment variables below:

export HADOOP_PREFIX=/opt/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export JAVA_HOME=/java/home/path

export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$JAVA_HOME/bin

- In the $HADOOP_PREFIX/etc/hadoop directory, edit the files below.

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

mapred-site.xml:

Create mapred-site.xml from its template:

cp mapred-site.xml.template mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
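Before formatting anything, you can optionally sanity-check that Hadoop picks up these values. A small verification sketch using the stock hdfs getconf tool (the keys are the ones set above):

hdfs getconf -confKey fs.defaultFS       # expect hdfs://localhost:8020
hdfs getconf -confKey dfs.replication    # expect 1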

Create the parent folder to store the hadoop data:

mkdir -p /home/hadoop/hdfs

Format the NameNode (this cleans up the directory and creates the necessary meta files):

hdfs namenode -format

Start all services:

start-dfs.sh && start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

Alternatively, use start-all.sh (deprecated).

Check all running Java processes:

jps

- Namenode Web Interface: http://localhost:50070/
- Resource manager Web Interface: http://localhost:8088/

To stop the daemons (services):

stop-dfs.sh && stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver

Alternatively, use stop-all.sh (deprecated).
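As a quick smoke test of the new cluster, you can round-trip a small file through HDFS. A hedged sketch (the paths are illustrative, not part of the original procedure):

hdfs dfs -mkdir -p /user/hadoop         # create a home directory for the hadoop user
hdfs dfs -put /etc/hosts /user/hadoop/  # copy a small local file into HDFS
hdfs dfs -ls /user/hadoop               # the file should be listed here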

Installation of Hadoop on ubuntu

Creating Hadoop User:

sudo addgroup hadoop

Adding a user:

sudo adduser --ingroup hadoop hduser001

Configuring SSH:

su - hduser001
ssh-keygen -t rsa -P ""
cat .ssh/id_rsa.pub >> .ssh/authorized_keys

Note: If you get errors [bash: .ssh/authorized_keys: No such file or directory] whilst writing the authorized key, check here.
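If you do hit that "No such file or directory" error, the usual cause is simply that the .ssh directory does not exist yet. A minimal fix (a suggestion of mine, not from the original text):

mkdir -p ~/.ssh    # create the directory that ssh-keygen and cat expect
chmod 700 ~/.ssh   # ssh refuses keys kept in world-readable directories
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys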


Add hadoop user to sudoers list:

sudo adduser hduser001 sudo

Disabling IPv6:
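The original book shows this step only as a screenshot, which is not reproduced here; presumably it performs the same sysctl edits described in the pseudo-distributed setup earlier in this chapter. As a sketch:

# append to /etc/sysctl.conf, then reload with: sudo sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1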

Installing Hadoop:

sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get update
sudo apt-get install hadoop

Hadoop overview and HDFS

Hadoop is an open-source software framework for storage and large-scale processing of data sets in a distributed computing environment. It is sponsored by the Apache Software Foundation. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

History

- Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
- Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
- It was originally developed to support distribution for the search engine project.

Major modules of hadoop

- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Hadoop File System Basic Features

- Highly fault-tolerant.
- High throughput.
- Suitable for applications with large data sets.
- Can be built out of commodity hardware.

Namenode and Datanodes

Master/slave architecture. An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. The DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored in DataNodes. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are replicated for fault tolerance. The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster. The BlockReport contains all the blocks on a DataNode.
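You can see this block layout for yourself: the stock fsck tool prints the blocks and replica locations of a file. A hedged sketch (the path is illustrative):

hdfs fsck /user/hadoop/sample.txt -files -blocks -locations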

Hadoop Shell Commands

Common commands used:

ls: Usage: hadoop fs -ls <path> (dir/file path to list)
cat: Usage: hadoop fs -cat <path-of-file-to-view>

Link for hadoop shell commands: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

Read Getting started with hadoop online: https://riptutorial.com/hadoop/topic/.../getting-started-with-hadoop

Chapter 2: Debugging Hadoop MR Java code in local eclipse dev environment

Introduction

The basic thing to remember here is that debugging a Hadoop MR job is going to be similar to any remotely debugged application in Eclipse.

A debugger or debugging tool is a computer program that is used to test and debug other programs (the "target" program). It is especially useful in a Hadoop environment, where there is little room for error and one small error can cause a huge loss.

Remarks

That is all you need to do.

Examples

Steps for configuration

As you would know, Hadoop can be run in the local environment in 3 different modes:

1. Local Mode
2. Pseudo Distributed Mode
3. Fully Distributed Mode (Cluster)

Typically you will be running your local hadoop setup in Pseudo Distributed Mode to leverage HDFS and Map Reduce (MR). However, you cannot debug MR programs in this mode, as each Map/Reduce task runs in a separate JVM process, so you need to switch back to Local mode, where you can run your MR programs in a single JVM process.

Here are the quick and simple steps to debug this in your local environment:

1. Run hadoop in local mode for debugging so mapper and reducer tasks run in a single JVM instead of separate JVMs. The steps below help you do it.

2. Configure HADOOP_OPTS to enable debugging, so when you run your Hadoop job, it will be waiting for the debugger to connect. Below is the command to debug the same at port 8008:

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"

3. Configure the fs.default.name value in core-site.xml to file:/// from hdfs://. You won't be using hdfs in local mode. (A sketch of the resulting config files appears at the end of this chapter.)

4. Configure the mapred.job.tracker value in mapred-site.xml to local. This will instruct Hadoop to run MR tasks in a single JVM.

5. Create a debug configuration for Eclipse and set the port to 8008 – typical stuff. For that, go to the debugger configurations, create a new Remote Java Application type of configuration, and set the port as 8008 in the settings.

6. Run your hadoop job (it will be waiting for the debugger to connect) and then launch Eclipse in debug mode with the above configuration. Do make sure to put a break-point first.

Read Debugging Hadoop MR Java code in local eclipse dev environment online: https://riptutorial.com/hadoop/topic/.../debugging-hadoop-mr-java-code-in-local-eclipse-dev-environment
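As referenced in step 3, here is a hedged sketch of what the edits from steps 3 and 4 might look like. The property names come from the text above; the surrounding structure assumes a standard classic (MR1-era) configuration file:

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>   <!-- local filesystem instead of hdfs:// -->
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>      <!-- run MR tasks in a single JVM -->
  </property>
</configuration>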

Chapter 3: Hadoop commands

Syntax

- Hadoop v1 commands: hadoop fs -<command>
- Hadoop v2 commands: hdfs dfs -<command>
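Both entry points accept the same file system commands; only the launcher differs. A quick hedged illustration:

hadoop fs -ls /    # v1-style launcher
hdfs dfs -ls /     # v2-style launcher, specific to HDFS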

Examples

Hadoop v1 Commands

1. Print the Hadoop version

hadoop version

2. List the contents of the root directory in HDFS

hadoop fs -ls /

3. Report the amount of space used and available on the currently mounted filesystem

hadoop fs -df hdfs:/

4. Count the number of directories, files and bytes under the paths that match the specified file pattern

hadoop fs -count hdfs:/

5. Run a DFS filesystem checking utility

hadoop fsck /

6. Run a cluster balancing utility

hadoop balancer

7. Create a new directory named "hadoop" below the /user/training directory in HDFS. Since you're currently logged in with the "training" user ID, /user/training is your home directory in HDFS.

hadoop fs -mkdir /user/training/hadoop

8. Add a sample text file from the local directory named "data" to the new directory you created in HDFS during the previous step.

hadoop fs -put data/sample.txt /user/training/hadoop

9. List the contents of this new directory in HDFS.

hadoop fs -ls /user/training/hadoop

10. Add the entire local directory called "retail" to the /user/training directory in HDFS.

hadoop fs -put data/retail /user/training/hadoop

11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you've just added there.

hadoop fs -ls

12. See how much space this directory occupies in HDFS.

hadoop fs -du -s -h hadoop/retail

13. Delete a file 'customers' from the "retail" directory.

hadoop fs -rm hadoop/retail/customers

14. Ensure this file is no longer in HDFS.

hadoop fs -ls hadoop/retail/customers

15. Delete all files from the "retail" directory using a wildcard.

hadoop fs -rm hadoop/retail/*
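Note that -rm normally moves files to the trash rather than freeing space immediately (hence step 16 below). If you want to bypass the trash entirely, a hedged sketch reusing the path from step 15:

hadoop fs -rm -skipTrash hadoop/retail/*   # delete immediately, skipping the .Trash directory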

16. To empty the trash

hadoop fs -expunge

17. Finally, remove the entire retail directory and all of its contents in HDFS.

hadoop fs -rm -r hadoop/retail

18. List the hadoop directory again

hadoop fs -ls hadoop

19. Add the purchases.txt file from the local directory named "/home/training/" to the hadoop directory you created in HDFS

hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/

20. To view the contents of your text file purchases.txt, which is present in your hadoop directory.

hadoop fs -cat hadoop/purchases.txt

21. Copy the purchases.txt file from the "hadoop" directory which is present in HDFS to the directory "data" which is present in your local directory

hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

22. cp is used to copy files between directories present in HDFS

hadoop fs -cp /user/training/*.txt /user/training/hadoop

23. The '-get' command can be used alternatively to the '-copyToLocal' command

hadoop fs -get hadoop/sample.txt /home/training/

24. Display the last kilobyte of the file "purchases.txt" to stdout.

hadoop fs -tail hadoop/purchases.txt

25. Default file permissions are 666 in HDFS. Use the '-chmod' command to change the permissions of a file

hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt

26. Default names of owner and group are training,training. Use '-chown' to change the owner name and group name simultaneously

hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt

27. Default name of group is training. Use the '-chgrp' command to change the group name

hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt

28. Move a directory from one location to another

hadoop fs -mv hadoop apache_hadoop

29. The default replication factor for a file is 3. Use the '-setrep' command to change the replication factor of a file

hadoop fs -setrep -w 2 apache_hadoop/sample.txt

30. Copy a directory from one node in the cluster to another. Use the '-distcp' command to copy, the -overwrite option to overwrite existing files, and the -update option to synchronize both directories

hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

31. Command to make the name node leave safe mode

sudo -u hdfs hdfs dfsadmin -safemode leave

32. List all the hadoop file system shell commands

hadoop fs

33. Get hdfs quota values and the current count of names and bytes in use.

hadoop fs -count -q [-h] [-v] <directory>...<directory>

34. Last but not least, always ask for help!

hadoop fs -help

Hadoop v2 Commands

appendToFile: Append a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and appends to the destination file system.

hdfs dfs -appendToFile [localfile1 localfile2 ..] [/HDFS/FILE/PATH..]

cat: Copies source paths to stdout.

hdfs dfs -cat URI [URI ...]

chgrp: Changes the group association of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.

hdfs dfs -chgrp [-R] GROUP URI [URI ...]

chmod: Changes the permissions of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.
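A few concrete invocations of these v2 commands, as a hedged sketch (the file names and paths are illustrative):

hdfs dfs -appendToFile log1.txt log2.txt /user/hadoop/all-logs.txt  # append two local files to one HDFS file
hdfs dfs -cat /user/hadoop/all-logs.txt                             # print the result to stdout
hdfs dfs -chgrp -R hadoop /user/hadoop                              # recursively change group ownership
hdfs dfs -chmod -R 750 /user/hadoop                                 # recursively change permissions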
