
Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters

by

Sanket Reddy Chintapalli

A project submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
December 12, 2014

Keywords: MapReduce, HDFS, Computing Capability, Heterogeneous

Copyright 2014 by Sanket Reddy Chintapalli

Approved by

Xiao Qin, Associate Professor of Computer Science and Software Engineering
James Cross, Professor of Computer Science and Software Engineering
Jeffrey Overbey, Assistant Professor of Computer Science and Software Engineering

Abstract

Hadoop and the term 'Big Data' go hand in hand. The information explosion caused by cloud and distributed computing has led to the desire to process and analyze massive amounts of data. Such processing and analysis helps an organization add value and derive valuable information.

The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Hadoop relies on its ability to take the computation to the nodes rather than migrating data between nodes, which might cause significant network overhead. This strategy has clear benefits in a homogeneous environment, but it may not be suitable in a heterogeneous one. The time taken to process data on a slower node in a heterogeneous environment can be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy that distributes data based on the processing power of each node. This project explores such a data placement policy and notes the ramifications of the strategy by running a few benchmark applications.

Acknowledgments

I would like to express my deepest gratitude to my adviser Dr. Xiao Qin for shaping my career. I would like to thank Dr. James Cross and Dr. Jeffrey Overbey for serving on my advisory committee.

Auburn University has my regard for making me a better engineer and I would like to thank all my professors and staff for shaping my intellect and personality. I would like to thank my friends and family for supporting me throughout my tenure at Auburn. I had great fun exploring various technologies, meeting interesting people, taking up challenges and expanding my perspective on software development and engineering.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Scope
    1.1.1 Data Distribution
  1.2 Contribution
  1.3 Organization
2 Hadoop
  2.1 Hadoop Architecture
  2.2 MapReduce
    2.2.1 JobTracker and TaskTracker
    2.2.2 YARN
  2.3 HDFS
    2.3.1 NameNode, DataNode and Clients
    2.3.2 Federated Namenode
    2.3.3 Backup Node and Secondary NameNode
    2.3.4 Replica and Block Management
3 Design
  3.1 Data Placement: Profiling
  3.2 Implementation Details
4 Experiment Setup
  4.1 Hardware
  4.2 Software
  4.3 Benchmark: WordCount
  4.4 Benchmark: Grep
5 Performance Evaluation
  5.1 WordCount
  5.2 Grep
  5.3 Summary
Bibliography

List of Figures

2.1 Hadoop Architecture [5]
2.2 MapReduce data flow
2.3 YARN
2.4 HDFS Architecture [3]
2.5 HDFS Federated Namenode
3.1 Motivation
3.2 Computation Ratio Balancer
4.1 Word Count Description
4.2 Grep Description
5.1 Calculating Computation Ratio for WordCount by Running on Individual Nodes
5.2 WordCount Performance Evaluation after Running CRBalancer
5.3 Calculating Computation Ratio for Grep by Running on Individual Nodes
5.4 Grep Performance Evaluation after Running CRBalancer

List of Tables

4.1 Node Information in Cluster
5.1 Computation Ratio

Chapter 1
Introduction

Data intensive applications are growing at a fast pace. The need to store data and process it in order to extract value from raw, skewed data has paved the way for parallel, distributed processing applications like Hadoop.

The ability of such applications to process petabytes of data generated from websites like Facebook, Amazon and Yahoo and search engines like Google and Bing has eventually led to a data revolution, whereby processing every piece of information relating to customers and users can add value, thereby improving core competency.

Hadoop is an open source application first developed at Yahoo, inspired by the Google File System. Hadoop comprises a task execution framework and a data storage unit known as HDFS (Hadoop Distributed File System), based on Google's BigTable [1]. Hadoop has two predominant versions available today, namely Hadoop 1.x and 2.x. The 2.x line improves batch processing by introducing YARN (Yet Another Resource Negotiator). YARN comprises a ResourceManager and NodeManagers. The ResourceManager is responsible for managing and deploying resources, and the NodeManager is responsible for managing the datanode and reporting the status of the datanode to the ResourceManager.

1.1 Scope

1.1.1 Data Distribution

We observed that data locality is a determining factor for MapReduce performance. To balance workload in a cluster, Hadoop distributes data to multiple nodes based on disk space availability. Such a data placement strategy is very practical and efficient for a homogeneous environment, where computing nodes are identical in terms of computing and disk capacity. In homogeneous computing environments, all nodes have identical workloads, indicating that no data needs to be moved from one node to another. In a heterogeneous cluster, however, a high-performance node can complete local data processing faster than a low-performance node. After the fast node finishes processing the data residing on its local disk, it has to handle the unprocessed data of a slow remote node. The overhead of transferring unprocessed data from slow nodes to fast ones is high if the amount of data moved is large. An approach to improving MapReduce performance in heterogeneous computing environments is to significantly reduce the amount of data moved between slow and fast nodes in the cluster. To balance the data load in a heterogeneous Hadoop cluster, we investigate data placement schemes which aim to partition a large data set into small fragments that are distributed across multiple heterogeneous nodes. Unlike other data distribution algorithms, our scheme takes care of replication and network topology before moving data between nodes.

1.2 Contribution

Data placement in HDFS. We develop a data placement mechanism in the Hadoop distributed file system, HDFS, to initially distribute a large data set to multiple computing nodes in accordance with the computing capacity of each node. More specifically, a data reorganization algorithm is implemented in HDFS.

1.3 Organization

This project is organized as follows. Chapter 2 explains in detail Hadoop's architecture, HDFS and the MapReduce framework, along with the new YARN framework. Chapter 3 explains the problem and describes the design of the redistribution mechanism. Chapter 4 describes the experiment setup. Chapter 5 analyzes the results and the performance of the balancer/redistribution algorithm.

Chapter 2
Hadoop

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop enables a computing solution that is scalable, cost effective, flexible and fault tolerant [2] [3].

2.1 Hadoop Architecture

Hadoop is implemented using the Client-Master-Slave design pattern. Currently there are two varied implementations of Hadoop, namely 1.x and 2.x. Hadoop 1.x manages the data and the task parallelism through the NameNode and the JobTracker respectively. Hadoop 2.x, on the other hand, uses YARN (Yet Another Resource Negotiator).

In Hadoop 1.x there are two masters in the architecture, which are responsible for controlling the slaves across the cluster. The first master is the NameNode, which is dedicated to managing HDFS and controlling the slaves that store the data. The second master is the JobTracker, which manages parallel processing of HDFS data on the slave nodes using the MapReduce programming model. The rest of the cluster is made up of slave nodes, which run both the DataNode and TaskTracker daemons. DataNodes obey the commands from their master NameNode and store parts of the HDFS data, decoupled from the metadata held in the NameNode. TaskTrackers, in turn, obey the commands from the JobTracker and do all the computing work assigned by the JobTracker. Finally, client machines are neither masters nor slaves. The role of the client machine is to give jobs to the masters: to load data into HDFS, submit MapReduce jobs describing how that data should be processed, and then retrieve or view the results of the job when it is finished.

YARN uses a ResourceManager, which works with multiple NodeManagers on the nodes of the cluster. The NodeManager is responsible for deploying tasks and reporting the status of its node to the ResourceManager [6].

Figure 2.1: Hadoop Architecture [5]

Figure 2.1 shows the basic organization of the Hadoop cluster. The client machines communicate with the NameNode to add, move, manipulate, or delete files in HDFS. The NameNode in turn calls the DataNodes to store, delete or make replicas of the data being added to HDFS. When the client machines want to process the data in HDFS, they communicate with the JobTracker to submit a job that uses MapReduce. The JobTracker divides the job into map/reduce tasks and assigns them to TaskTrackers to process.

Typically, all nodes in a Hadoop cluster are arranged in air-cooled racks in a data center. The racks are linked with each other by rack switches running TCP/IP.
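To make the client's role concrete, the following minimal sketch shows a client program adding a file to HDFS, listing the directory, and reading the data back. The metadata operations go to the NameNode, while the file contents are streamed to and from DataNodes. The NameNode address and file paths below are placeholders, not values taken from this project's cluster.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // "hdfs://namenode:8020" is a placeholder; use your cluster's fs.defaultFS value.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // Ask the NameNode to create the file entry; the data itself streams to DataNodes.
    fs.copyFromLocalFile(new Path("/tmp/books.txt"),
                         new Path("/user/hadoop/input/books.txt"));

    // Listing is a pure metadata operation served by the NameNode.
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop/input"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }

    // Reads are redirected by the NameNode to a nearby DataNode holding each block.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/hadoop/input/books.txt"))))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}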

2.2 MapReduce

As noted in the Introduction, MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are influenced by functional programming constructs used for processing lists of data. MapReduce fetches its data from HDFS for parallel processing; the data are divided into blocks, as mentioned in the section above.

2.2.1 JobTracker and TaskTracker

The JobTracker is the master to which applications submit MapReduce jobs. The JobTracker creates map tasks based on input splits and assigns tasks to TaskTracker nodes in the cluster. The JobTracker is aware of the data block locations in the cluster and of the machines that are near the data. The JobTracker assigns a task to a TaskTracker that has the data locally and, if it cannot, it schedules the task on the node nearest to the data to optimize network bandwidth. The TaskTracker periodically sends a HeartBeat message to the JobTracker to let it know that it is healthy, and in the message it includes the available memory, CPU frequency, and so on. If the TaskTracker fails to send a HeartBeat, the JobTracker assumes that the TaskTracker is down and schedules the task on another node in the same rack as the failed node [3].

Figure 2.2: MapReduce data flow

Figure 2.2 [4] shows the data flow of MapReduce across a couple of nodes. The steps below explain the flow of MapReduce [4]; a minimal mapper/reducer sketch follows the list.

1. Split the file: First the data in HDFS are split up and read using the specified InputFormat. The InputFormat can be specified by the user, and the chosen InputFormat reads the files in the directory, selects the files to be split into InputSplits, and hands them to a RecordReader that reads the records as (key, value) pairs to be processed in the following steps. Standard InputFormats provided by MapReduce include TextInputFormat and SequenceFileInputFormat [3].

   The InputSplit is the unit of work that comprises a single map task in a MapReduce program. The job submitted by the client is divided into a number of tasks equal to the number of InputSplits. The default InputSplit size is 64 MB / 128 MB and can be configured by modifying the split size parameter. Input splits enable the parallelism of MapReduce by allowing map tasks to be scheduled on different nodes of the cluster at the same time. Because HDFS splits the file into blocks, the task assigned to a node can access its data locally.

2. Read the records in the InputSplit: Although the InputSplit is ready to be processed, it does not yet make sense to the MapReduce program because its contents are not in key-value format. The RecordReader loads the data and converts it into the (key, value) pairs expected by the Mapper task. Each record delivered by the RecordReader results in a call to the map() method of the Mapper.

3. Process the records: When the Mapper gets a key-value pair from the RecordReader, it calls the map() function to process the input key-value pair and output an intermediate key-value pair. While the mappers read their share of data and process it in parallel across the cluster, they do not communicate with each other, as they have no data to share. Along with the key-value pair, the Mapper also receives a couple of objects which indicate where to forward the output and where to report the status of the task [2].

4. Partition and shuffle: The mappers output (key, value) pairs which are the input for the reducers. At this stage the mappers begin exchanging the intermediate outputs; this process is called shuffling. Each reducer processes the intermediate values with the same key, so all intermediate outputs with the same key belong to the same partition. The partitioner determines which partition a given (key, value) pair goes to. The intermediate data are sorted before they are presented to the Reducer.

5. Reduce the mapper's output: For every key in the partition assigned to a reducer, the reduce() function is called. Because the reducer processes all values with the same key together, it iterates over the partition to generate the output. The OutputFormat specifies the format of the output records, the reporter object reports the status, and the RecordWriter writes the data to the file specified by the OutputFormat [2].
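As a concrete illustration of steps 3 and 5 above, here is a minimal WordCount mapper and reducer written against the Hadoop 2.x mapreduce API (WordCount is also one of the benchmarks used in Chapter 4). The class and field names are illustrative and are not taken from this project's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

  // Step 3: the Mapper receives one (byte offset, line) record from the RecordReader
  // and emits an intermediate (word, 1) pair for every token in the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Step 5: after the shuffle, the Reducer sees each word with all of its counts
  // grouped together and writes the final (word, total) pair.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}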

2.2.2 YARN

YARN is the component responsible for managing job scheduling and resource allocation. The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM) [6]. The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. Figure 2.3 gives an overview of the YARN architecture.

Figure 2.3: YARN

The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues and so on. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantees on restarting tasks that fail due to either application failure or hardware failure. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk and network.

The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
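As a small illustration of the ResourceManager's cluster-wide view described above, the sketch below asks the ResourceManager for the node reports that the NodeManagers feed it, using the YarnClient API shipped with Hadoop 2.x. Exact resource accessors vary slightly across 2.x releases, so treat this as a sketch rather than version-specific code.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodeReport {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
    yarnClient.start();

    // Each NodeManager heartbeats its status to the ResourceManager; this call
    // returns the ResourceManager's current view of the running nodes.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId()
          + " containers=" + node.getNumContainers()
          + " capability=" + node.getCapability());
    }
    yarnClient.stop();
  }
}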

2.3 HDFS

The Hadoop Distributed File System (HDFS) is the filesystem designed for Hadoop to store large data sets reliably and to stream those data to user applications at high throughput, rather than providing low-latency access. Hadoop is written in Java, which makes it highly portable across platforms and operating systems. Like other distributed file systems such as Lustre and PVFS, HDFS stores the metadata and the data separately: the NameNode stores the metadata and the DataNodes store the application data. Unlike Lustre and PVFS, however, HDFS stores replicas of the data to provide high-throughput data access from multiple sources, and the data redundancy also increases the fault tolerance of HDFS [5] [3].

When HDFS replicates, it does not replicate the entire file; it divides files into fixed-size blocks, and the blocks are placed and replicated on the DataNodes. The default block size in Hadoop is 64 MB and is configurable.

Figure 2.4: HDFS Architecture [3]

2.3.1 NameNode, DataNode and Clients

Figure 2.4 shows the HDFS architecture in Hadoop, which contains three important entities: the NameNode, the DataNodes and the clients. The NameNode is responsible for storing the metadata and for tracking the storage available and used on all the DataNodes. A client that wants to read data in HDFS first contacts the NameNode. The NameNode then looks up the DataNode holding the block that is nearest to the client and tells the client to access the data from it. Similarly, when a client wants to write a file to HDFS, it requests the NameNode to nominate three DataNodes to store the replicas, and the client writes to them in a pipelined fashion. HDFS works efficiently when it stores files of larger size, at least the size of a block, because HDFS keeps the namespace in RAM. If HDFS held only small files, the inode information would occupy the entire RAM, leaving no room for other operations.

The NameNode registers all the DataNodes at start-up based on the NamespaceID. The NamespaceID is generated when the NameNode formats HDFS. The DataNodes are not allowed to store any blocks of data if their NamespaceID does not match the ID of the NameNode. Apart from registering at start-up, the DataNodes send block reports to the NameNode periodically. A block report contains the block id, the generation stamp and the length of each block that the DataNode holds. Every tenth report sent from the DataNode is a block report, to keep the NameNode updated about all the blocks. A DataNode also sends HeartBeat messages that simply notify the NameNode that it is still healthy and that all the blocks in it are intact. When the NameNode does not receive a heartbeat message from a DataNode for about 10 minutes (the default timeout), it assumes that the DataNode is dead and uses its policies to replicate the data blocks on the dead node to other nodes that are alive.

Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.
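Because the block-to-DataNode mapping lives entirely in the NameNode's metadata, a client can ask where every block of a file is stored. The following sketch prints the replica locations of a file; the NameNode address and the file path are placeholders. This is essentially the information a placement policy such as the one studied in this project has to reason about.

import java.io.IOException;
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLister {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The NameNode address below is an assumption; use your cluster's fs.defaultFS.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/hadoop/input/books.txt");   // illustrative path
    FileStatus status = fs.getFileStatus(file);

    // The NameNode answers this query from its in-memory metadata: for each block
    // it returns the DataNodes that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}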

2.3.2 Federated Namenode

In Hadoop 2.x a new concept known as the federated NameNode has been introduced. In order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other. The DataNodes are used as common storage for blocks by all the NameNodes: each DataNode registers with all the NameNodes in the cluster, sends periodic heartbeats and block reports, and handles commands from the NameNodes. The key benefits of a federated NameNode, listed below, are namespace scalability, performance and isolation. Figure 2.5 gives an overview of the architecture.

Figure 2.5: HDFS Federated Namenode

1. Namespace scalability: HDFS cluster storage scales horizontally, but the namespace does not. Large deployments, or deployments using a lot of small files, benefit from scaling the namespace by adding more NameNodes to the cluster.

2. Performance: File system operation throughput is limited by the single NameNode in the prior architecture. Adding more NameNodes to the cluster scales the file system read/write operation throughput.

3. Isolation: A single NameNode offers no isolation in a multi-user environment. An experimental application can overload the NameNode and slow down production-critical applications. With multiple NameNodes, different categories of applications and users can be isolated to different namespaces.

2.3.3 Backup Node and Secondary NameNode

The NameNode is the single point of failure for the Hadoop cluster, so HDFS periodically copies the namespace held in the NameNode to persistent storage for reliability; this process is called checkpointing. Along with the namespace, it also maintains a log of the actions that change the namespace; this log is called the journal. The checkpoint node copies the namespace and journal from the NameNode and applies the transactions in the journal to the namespace, creating the most up-to-date image of the namespace for the NameNode. The backup node, by contrast, copies the namespace and accepts a stream of journal entries, applying the transactions to the namespace stored in its own storage directory. It also keeps an up-to-date image of the namespace in memory and stays synchronized with it. When the NameNode fails, HDFS picks up the namespace from either the BackupNode or the CheckpointNode [5] [3].

2.3.4 Replica and Block Management

HDFS makes replicas of a block with a strategy that enhances both performance and reliability. By default the replica count is 3: the first replica is placed on the node of the writer, the second is placed in the same rack but on a different node, and the third replica is placed in a different rack. In the end, no DataNode contains more than one replica of a block and no rack contains more than two replicas of the same block. The nodes are chosen on the basis of proximity to the writer when placing the blocks.

There are situations in which blocks become over-replicated or under-replicated. In the case of over-replication, the NameNode deletes replicas within the same rack first, starting with the DataNode that has the least available space. In the case of under-replication, the NameNode maintains a priority queue of blocks to replicate, with the highest priority given to the least-replicated blocks.

There are tools in HDFS to maintain the balance and integrity of the data. The Balancer is a tool that balances the data placement based on node disk utilization in the cluster. The Block Scanner is a tool used to check integrity using checksums. Distcp is a tool used for inter- and intra-cluster copying. The intention of this project is to modify the Balancer to take into consideration the computing capacity of the nodes as opposed to the space.
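The default replica placement rule described at the start of this section can be summarized in a few lines of code. The sketch below is a self-contained illustration of that rule only; it is not Hadoop's actual BlockPlacementPolicy implementation, and the node and rack names are hypothetical.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified placement rule: first replica on the writer's node, second on a
// different node in the same rack, third on a node in a different rack.
public class ReplicaPlacementSketch {

  public static List<String> chooseTargets(String writerNode,
                                           Map<String, String> nodeToRack) {
    List<String> targets = new ArrayList<>();
    targets.add(writerNode);                          // replica 1: local node
    String writerRack = nodeToRack.get(writerNode);

    for (String node : nodeToRack.keySet()) {         // replica 2: same rack, other node
      if (!node.equals(writerNode) && nodeToRack.get(node).equals(writerRack)) {
        targets.add(node);
        break;
      }
    }
    for (String node : nodeToRack.keySet()) {         // replica 3: remote rack
      if (!nodeToRack.get(node).equals(writerRack)) {
        targets.add(node);
        break;
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    Map<String, String> racks = new LinkedHashMap<>();
    racks.put("node1", "rack1");   // writer
    racks.put("node2", "rack1");
    racks.put("node3", "rack2");
    System.out.println(chooseTargets("node1", racks));  // [node1, node2, node3]
  }
}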

Chapter 3
Design

The project focuses on distributing data based on computing capacity. In a homogeneous environment, computing capacity and disk capacity do not differ from node to node. As a result, data is distributed based on the availability of space on the cluster, and transferring data from one node to another merely to fill space would only cause network congestion. Hadoop provides a balancing utility which balances the data before applications run, in the event of severe data accumulation on a few nodes. The replication factor also plays a major role in data movement, and the balancer takes care of this aspect. We name our balancer the CRBalancer, for Computation Ratio Balancer.

In a heterogeneous environment, on the other hand, it makes perfect sense to migrate data from one node to another, as a faster node makes up for the overhead of data migration during task processing. The data placement policy suggested in this project therefore explores the ramifications of migrating a large portion of the data to a faster node in a heterogeneous environment. The data placement algorithm and the implementation details below give an overview of the implementation carried out to achieve this goal.

Figure 3.1: Motivation

3.1 Data Placement: Profiling

Profiling in this scenario is done to compute the computation ratios of the nodes within the cluster. In order to accomplish this task, we need to run a set of benchmark applications, namely Grep and WordCount. The computation ratio is calculated from the processing capacity of the nodes. Let us assume we have three nodes A, B and C, with A being the fastest, then B, and C the slowest. Let us further assume it takes 10 seconds to run an application on A, 20 seconds on B and 30 seconds on C. Then the computation ratio of A would be 6, that of B would be 3 and that of C would be 2, based on the least common multiple principle.
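The least-common-multiple arithmetic used above can be written down directly. The small sketch below reproduces the example: response times of 10, 20 and 30 seconds give an LCM of 60, and hence ratios of 6, 3 and 2 for A, B and C.

import java.util.Arrays;

public class ComputationRatio {
  static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
  static long lcm(long a, long b) { return a / gcd(a, b) * b; }

  // Ratio of node i = LCM(all response times) / response time of node i.
  public static long[] ratios(long[] responseTimesSec) {
    long l = responseTimesSec[0];
    for (long t : responseTimesSec) l = lcm(l, t);
    long[] r = new long[responseTimesSec.length];
    for (int i = 0; i < responseTimesSec.length; i++) r[i] = l / responseTimesSec[i];
    return r;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(ratios(new long[] {10, 20, 30})));  // [6, 3, 2]
  }
}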

3.2 Implementation Details

The CRBalancer is responsible for migrating data from one node to another. Figure 3.2 describes the pattern and methodology with which the CRBalancer migrates the data. It first takes the network topology into consideration, then determines the under-utilized and over-utilized DataNodes. Under-utilized and over-utilized DataNodes are then paired, and blocks are moved concurrently among the nodes. The CRBalancer also takes the replication factor into account, limiting a migration if a replica of the corresponding file's data is already present on the target node. After balancing takes place, the benchmarks are run and the results are examined.

The CRBalancer uses the CRNamenodeConnector component to connect to the NameNode and obtain information about the DataNodes on the fly, in order to decide how much data should be transferred from the over-utilized nodes to the under-utilized nodes. It also uses the CRBalancingPolicy component to keep track of the amount of data on each node.

The CRBalancer can be run in the background while other applications are running on the nodes. The CRBalancer accounts for network overhead while transferring blocks: it steals some bandwidth without interrupting the processing of the running applications. The expectation is that a similar application performs better when it is executed again at some later point in time.

Figure 3.2: Computation Ratio Balancer
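The decision step of the CRBalancer described above can be sketched at a high level as follows. This is a simplified, self-contained illustration written for this report: it computes each node's target share from its computation ratio and pairs over-utilized nodes with under-utilized ones, but it omits the network-topology and replication checks that the actual CRBalancer performs, and none of the class or field names below come from the project's source code.

import java.util.ArrayList;
import java.util.List;

public class CRBalancerSketch {

  static class NodeUsage {
    final String name;
    final int ratio;     // computation ratio obtained from profiling
    long usedBytes;      // bytes of the data set currently stored on this node
    NodeUsage(String name, int ratio, long usedBytes) {
      this.name = name; this.ratio = ratio; this.usedBytes = usedBytes;
    }
  }

  // Plan block moves so that each node ends up holding a share of the data
  // proportional to its computation ratio.
  public static void planMoves(List<NodeUsage> nodes) {
    long totalBytes = nodes.stream().mapToLong(n -> n.usedBytes).sum();
    long totalRatio = nodes.stream().mapToLong(n -> n.ratio).sum();

    List<NodeUsage> over = new ArrayList<>();
    List<NodeUsage> under = new ArrayList<>();
    for (NodeUsage n : nodes) {
      long target = totalBytes * n.ratio / totalRatio;   // bytes this node should hold
      if (n.usedBytes > target) over.add(n);
      else if (n.usedBytes < target) under.add(n);
    }

    // Pair over-utilized sources with under-utilized targets and emit planned moves.
    for (NodeUsage src : over) {
      long surplus = src.usedBytes - totalBytes * src.ratio / totalRatio;
      for (NodeUsage dst : under) {
        long deficit = totalBytes * dst.ratio / totalRatio - dst.usedBytes;
        long move = Math.min(surplus, deficit);
        if (move <= 0) continue;
        System.out.println("move " + move + " bytes from " + src.name + " to " + dst.name);
        src.usedBytes -= move;
        dst.usedBytes += move;
        surplus -= move;
        if (surplus == 0) break;
      }
    }
  }

  public static void main(String[] args) {
    List<NodeUsage> nodes = new ArrayList<>();
    nodes.add(new NodeUsage("A", 6, 400L << 20));   // 400 MB on each node initially
    nodes.add(new NodeUsage("B", 3, 400L << 20));
    nodes.add(new NodeUsage("C", 2, 400L << 20));
    planMoves(nodes);                                // moves data from B and C toward A
  }
}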

Chapter 4
Experiment Setup

4.1 Hardware

The hardware setup consists of the nodes whose configurations are described in Table 4.1 below.

Node        A                    B                    C                    D
Processor   HP Xeon              HP Xeon              HP Celeron           HP Celeron
Speed       4 cores * 2.4 GHz    4 cores * 2.4 GHz    1 core * 2.2 GHz     1 core * 2.2 GHz
Cache       8 MB                 8 MB                 512 KB               512 KB
Storage     142.9 GB             142.9 GB             142.9 GB             142.9 GB

Table 4.1: Node Information in Cluster

4.2 Software

The software used in this experiment comprises Hadoop 2.3.0. A node is assigned the ma

