Challenges For MapReduce In Big Data


Challenges for MapReduce in Big Data

Katarina Grolinger(1), Michael Hayes(1), Wilson A. Higashino(1,2), Alexandra L'Heureux(1), David S. Allison(1,3,4,5), Miriam A.M. Capretz(1)

(1) Department of Electrical and Computer Engineering, Western University, London, ON, Canada N6A 5B9
{kgroling, mhayes34, alheure2, whigashi, dallison, mcapretz}@uwo.ca
(2) Instituto de Computação, Universidade Estadual de Campinas, Campinas, Brazil
(3) CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
(4) Univ de Toulouse, LAAS, F-31400 Toulouse, France
(5) Univ de Toulouse, UT1-Capitole, LAAS, F-31000 Toulouse, France
dallison@laas.fr

Abstract— In the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting continuously increasing demands on computing resources imposed by massive data sets. The reason for this is the high scalability of the MapReduce paradigm, which allows for massively parallel and distributed execution over a large number of computing nodes. This paper identifies MapReduce issues and challenges in handling Big Data with the objective of providing an overview of the field, facilitating better planning and management of Big Data projects, and identifying opportunities for future research in this field. The identified challenges are grouped into four main categories corresponding to Big Data task types: data storage (relational databases and NoSQL stores), Big Data analytics (machine learning and interactive analytics), online processing, and security and privacy. Moreover, current efforts aimed at improving and extending MapReduce to address the identified challenges are presented. Consequently, by identifying the issues and challenges MapReduce faces when handling Big Data, this study encourages future Big Data research.

Keywords— Big Data, Big Data Analytics, MapReduce, NoSQL, Machine Learning, Interactive Analytics, Online Processing, Privacy, Security

I. INTRODUCTION

Recent developments in the Web, social media, sensors and mobile devices have resulted in the explosion of data set sizes. For example, Facebook today has more than one billion users, with over 618 million active users generating more than 500 terabytes of new data each day [1]. Traditional data processing and storage approaches were designed in an era when available hardware, storage and processing requirements were very different than they are today. Thus, those approaches are facing many challenges in addressing Big Data demands.

The term "Big Data" refers to large and complex data sets made up of a variety of structured and unstructured data which are too big, too fast, or too hard to be managed by traditional techniques. Big Data is characterized by the 4Vs [2]: volume, velocity, variety, and veracity. Volume refers to the quantity of data, variety refers to the diversity of data types, velocity refers both to how fast data are generated and how fast they must be processed, and veracity is the ability to trust the data to be accurate and reliable when making crucial decisions.

Enterprises are aware that Big Data has the potential to impact core business processes, provide competitive advantage, and increase revenues [2]. Thus, organizations are exploring ways to make better use of Big Data by analyzing it to find meaningful insights which would lead to better business decisions and add value to their business.

MapReduce is a highly scalable programming paradigm capable of processing massive volumes of data by means of parallel execution on a large number of commodity computing nodes.
It was recently popularized by Google [3], but today the MapReduce paradigm has been implemented in many open source projects, the most prominent being Apache Hadoop [4]. The popularity of MapReduce can be attributed to its high scalability, fault tolerance, simplicity, and independence from the programming language or the data storage system.

In the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting the continuously increasing demands on computing resources imposed by massive data sets. At the same time, MapReduce faces a number of obstacles when dealing with Big Data, including the lack of a high-level language such as SQL, challenges in implementing iterative algorithms, support for iterative ad-hoc data exploration, and stream processing.

This paper aims to identify the issues and challenges faced by MapReduce when confronted with Big Data, with the objectives of: a) providing an overview and categorization of the MapReduce issues and challenges, b) facilitating better planning and management of Big Data projects, and c) identifying opportunities for future research in this field.

Other MapReduce-related surveys have been previously published, but this work has a different focus. Li et al. [5] presented a review of approaches focused on the support of distributed data management and processing using MapReduce. They discussed implementations of database operators in MapReduce and DBMS implementations using MapReduce, while this paper is concerned with identifying MapReduce challenges in Big Data.

Doulkeridis and Nørvåg [6] surveyed the state of the art in improving the performance of MapReduce processing and reviewed generic MapReduce weaknesses and challenges.

TABLE I. AN OVERVIEW OF MAPREDUCE CHALLENGES

Category             | Main challenges                         | Main solution approaches
Data storage         | Schema-free, index-free                 | In-database MapReduce; NoSQL stores: MapReduce with various indexing approaches
Data storage         | Lack of standardized SQL-like language  | Apache Hive: SQL on top of Hadoop; NoSQL stores: proprietary SQL-like languages (Cassandra, MongoDB) or Hive (HBase)
Analytics            | Scaling complex linear algebra          | Use computationally less expensive, though less accurate, algebra
Analytics            | Interactive analysis                    | Map interactive query processing techniques for handling small data to MapReduce
Analytics            | Iterative algorithms                    | Extensions of MapReduce implementations, such as Twister and HaLoop
Analytics            | Statistical challenges for learning     | Data pre-processing using MapReduce
Online processing    | Performance / latency issues            | Direct communication between phases and jobs
Online processing    | Programming model                       | Alternative models, such as MapUpdate and Twitter's Storm
Privacy and security | Auditing                                | Trusted third party monitoring, security analytics
Privacy and security | Access control                          | Optimized access control approach with semantic understanding
Privacy and security | Privacy                                 | Privacy policy enforcement with security to prevent information leakage

Sakr et al. [7] also surveyed approaches to data processing based on the MapReduce paradigm. Additionally, they analyzed systems which provide declarative programming interfaces on top of MapReduce. While the works of Doulkeridis and Nørvåg [6] and Sakr et al. [7] focused on systems built on top of MapReduce, this paper aims to identify the challenges that MapReduce faces in handling Big Data. Moreover, this paper discusses security and privacy issues, while those surveys do not.

The identified MapReduce challenges are grouped into four main categories corresponding to Big Data task types: data storage, analytics, online processing, and security and privacy. An overview of the identified challenges is presented in Table I, while the details of each category are discussed in Sections III to VI. Additionally, this paper presents current efforts aimed at improving and extending MapReduce to address the identified challenges.

The rest of this paper is organized as follows: Section II introduces the MapReduce paradigm. Section III identifies storage-related challenges, while Section IV discusses Big Data analytics issues. Online processing is addressed in Section V, and privacy and security challenges in Section VI. Finally, Section VII concludes the paper.

II. MAPREDUCE OVERVIEW

MapReduce is a programming paradigm for processing large data sets in distributed environments [3]. In the MapReduce paradigm, the Map function performs filtering and sorting, while the Reduce function carries out grouping and aggregation operations. The 'hello world' of MapReduce is the word counting example: it counts the appearance of each word in a set of documents. The Map function splits the document into words, and for each word in a document it produces a (key, value) pair.

function map(name, document)
  for each word in document
    emit(word, 1)

The Reduce function is responsible for aggregating the information received from Map functions. For each key, word, the Reduce function works on the list of values, partialCounts. To calculate the occurrence of each word, the Reduce function groups by word and sums the values received in the partialCounts list.

function reduce(word, List partialCounts)
  sum = 0
  for each pc in partialCounts
    sum += pc
  emit(word, sum)

The final output is the list of words with the count of appearance of each word. A runnable sketch of this example is given below.
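To make the example concrete, the following is a minimal, self-contained Python sketch of the same Map and Reduce logic, with the shuffle step that a MapReduce framework performs between the two phases simulated by grouping intermediate pairs by key. It illustrates the paradigm only; it is not code from Hadoop or any other implementation, and the function names are illustrative.

from collections import defaultdict

def map_fn(name, document):
    # Map: split the document into words and emit a (word, 1) pair for each.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce: sum the partial counts received for one word.
    return (word, sum(partial_counts))

def word_count(documents):
    # Shuffle: group all intermediate values by key (word), as the
    # framework would between the Map and Reduce phases.
    grouped = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            grouped[word].append(count)
    # Apply Reduce to each key and its list of partial counts.
    return dict(reduce_fn(w, counts) for w, counts in grouped.items())

docs = {"d1": "big data and big ideas", "d2": "data beats ideas"}
print(word_count(docs))  # {'big': 2, 'data': 2, 'and': 1, 'ideas': 2, 'beats': 1}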
Figure 1 illustrates the MapReduce flow. One node is elected to be the master responsible for assigning the work, while the rest are workers. The input data is divided into splits, and the master assigns splits to Map workers. Each worker processes the corresponding input split, generates key/value pairs and writes them to intermediate files (on disk or in memory). The master notifies the Reduce workers about the location of the intermediate files, and the Reduce workers read the data, process it according to the Reduce function, and finally write the data to output files.

Figure 1. MapReduce flow

The main contribution of the MapReduce paradigm is scalability, as it allows for highly parallelized and distributed execution over a large number of nodes. In the MapReduce paradigm, the Map or Reduce task is divided into a high number of jobs which are assigned to nodes in the network. Reliability is achieved by reassigning any failed node's job to another node. A well-known open source MapReduce implementation is Hadoop, which implements MapReduce on top of the Hadoop Distributed File System (HDFS).

III. DATA STORAGE

Relational database management systems (RDBMSs) are traditional storage systems designed for structured data and accessed by means of SQL. RDBMSs are facing challenges in handling Big Data and in providing the horizontal scalability, availability and performance required by Big Data applications. In contrast to relational databases, MapReduce provides computational scalability, but it relies on data storage in a distributed file system such as the Google File System (GFS) or the Hadoop Distributed File System (HDFS).

NoSQL and NewSQL data stores have emerged as alternatives for Big Data storage. NoSQL refers to "Not Only SQL", highlighting that SQL is not a crucial objective of those systems. Their main defining characteristics include schema flexibility and effective scaling over a large number of commodity machines. NoSQL horizontal scalability includes data storage scaling as well as scaling of read/write operations. Grolinger et al. [8] analyze features driving the NoSQL systems' ability to scale, such as partitioning, replication, consistency, and concurrency control. NoSQL systems typically adopt the MapReduce paradigm and push processing to the nodes where data is located in order to scale read operations efficiently. Consequently, data analysis is performed via MapReduce jobs.

MapReduce itself is schema-free and index-free; this provides great flexibility and enables MapReduce to work with semi-structured and unstructured data. Moreover, MapReduce can run as soon as data is loaded. However, the lack of indexes on standard MapReduce may result in poor performance in comparison to relational databases. This may be outweighed by MapReduce scalability and parallelization. Database vendors, such as Oracle, provide in-database MapReduce [9], taking advantage of database parallelization. Another example of providing analytics capabilities in the database is the MAD Skills project [10], which implements MapReduce within the database using an SQL runtime execution engine. Map and Reduce functions are written in Python, Perl, or R, and passed to the database for execution.

NoSQL systems from the column-family and document categories adopt the MapReduce paradigm while providing support for various indexing methods. In this approach, MapReduce jobs can access data using the index, and therefore query performance is significantly improved. For example, Cassandra supports primary and secondary indexes [11]. In CouchDB [12], the primary way of querying and reporting is through views, which use the MapReduce paradigm with JavaScript as a query language. A view consists of a Map function and an optional Reduce function. Data emitted by the Map function is used to construct an index, and consequently, queries against that view run quickly; the sketch below illustrates this idea.
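The following is a simplified Python sketch of the view-indexing idea described above: the keys emitted by a Map function are kept in sorted order, so later queries are answered by binary search over the materialized index rather than by rescanning the raw data. It is an illustration of the concept only, not CouchDB's actual implementation (which stores views in B-trees and runs JavaScript Map functions); the field and function names are assumptions.

import bisect

def map_by_city(doc):
    # Hypothetical Map function: index documents by their 'city' field.
    if "city" in doc:
        yield (doc["city"], doc["name"])

def build_view(docs):
    # Materialize the view: collect the (key, value) pairs emitted by
    # Map and keep them sorted by key, as a view index would be.
    return sorted(kv for doc in docs for kv in map_by_city(doc))

def query_view(view, key):
    # Answer a key lookup with binary search over the sorted index
    # instead of rescanning every document. Assumes values sort below
    # the maximal code point used as the upper sentinel.
    lo = bisect.bisect_left(view, (key,))
    hi = bisect.bisect_right(view, (key, chr(0x10FFFF)))
    return [value for _, value in view[lo:hi]]

docs = [{"name": "a", "city": "London"},
        {"name": "b", "city": "Toulouse"},
        {"name": "c", "city": "London"}]
view = build_view(docs)
print(query_view(view, "London"))  # ['a', 'c']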
Another challenge related to MapReduce and data storage is the lack of a standardized SQL-like language. Therefore, one direction of research is concerned with providing SQL on top of MapReduce. An example of this category is Apache Hive [13], which provides an SQL-like language on top of Hadoop. Another Apache effort, Mahout [14], aims to build scalable machine learning libraries on top of MapReduce. Although those efforts provide powerful data processing capabilities, they lack data management features such as advanced indexing and a sophisticated optimizer. NoSQL solutions choose different approaches for providing querying abilities [8]: Cassandra and MongoDB provide proprietary SQL-like querying, while HBase uses Hive.

It is important to point out the efforts on integration between traditional databases, MapReduce, and Hadoop. For example, the Oracle SQL connector for HDFS [15] provides the ability to query data in Hadoop within the database using SQL. The Oracle Data Integrator for Hadoop generates Hive-like queries which are transformed into native MapReduce and executed on Hadoop clusters.

Even though the presented efforts have advanced the state of the art for data storage and MapReduce, a number of challenges remain, such as the lack of a standardized SQL-like query language, the limited optimization of MapReduce jobs, and integration among MapReduce, distributed file systems, RDBMSs and NoSQL stores.

IV. BIG DATA ANALYTICS

A. Machine Learning

The prevalence and pervasiveness of Big Data offers the promise of building more intelligent decision making systems. This is because the typical premise for many decision making algorithms is that more data can better teach the algorithms to produce more accurate outputs. The key to extracting useful information from Big Data lies in the use of Machine Learning (ML) approaches. However, the use of massive datasets for the purpose of analysis and training poses problems and challenges to the very execution of ML algorithms. The arithmetic and computational complexity brought on by the volume component of Big Data renders traditional ML algorithms almost unusable in conventional development environments. This is due to the fact that ML algorithms were designed to be used on much smaller datasets, with the assumption that the entire data could be held in memory [16]. With the arrival of Big Data, this assumption is no longer valid, which greatly impedes the performance of those algorithms. To remedy this problem, distributed processing approaches such as MapReduce were brought forward.

Although some ML algorithms are inherently parallel and can therefore be adapted to the MapReduce paradigm [17], for others the transition is much more complex. The foundation of many ML algorithms relies on strategies directly dependent on in-memory data, and once that assumption is severed, entire families of algorithms are rendered inadequate. The parallel and distributed nature of the MapReduce paradigm is a source of such a disconnect. This is what Parker [17] describes as the curse of modularity. The following families of algorithms are amongst those affected [18]:

- Iterative Graph algorithms: Multiple iterations are required in order to reach convergence, each of which corresponds to a job in MapReduce [18], and jobs are expensive in terms of startup time. Furthermore, skews in the data create stragglers in the Reduce phase, which causes backup execution to be launched, increasing the computational load [3].
- Gradient Descent algorithms: The sequential nature of these algorithms requires a very large number of jobs to be chained (a sketch of one such iteration expressed as a MapReduce job follows this list). It also requires that parameters be updated after each iteration, which adds communication overhead to the process. Both of these steps are therefore expensive in terms of time.
- Expectation Maximization algorithms: Similarly, this family of algorithms also depends on iterations that are implemented as jobs, causing the same performance latencies as above.
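To illustrate why such iterative algorithms map awkwardly onto MapReduce, the Python sketch below expresses one iteration of batch gradient descent for linear regression as a Map step (per-partition gradient contributions) and a Reduce step (summing them). On a real cluster, every iteration would be a separate MapReduce job, with the updated parameters redistributed to all workers and intermediate results written to disk between jobs. The function names and data layout here are illustrative assumptions, not any framework's API.

def map_gradient(partition, w):
    # Map: each worker computes the squared-error gradient contribution
    # of its own data partition for the current weights w.
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def reduce_gradients(partial_grads):
    # Reduce: sum the per-partition gradients into one global gradient.
    return [sum(g) for g in zip(*partial_grads)]

def gradient_descent(partitions, dim, lr=0.1, iterations=500):
    w = [0.0] * dim
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Each loop iteration corresponds to a full MapReduce job: the
        # updated weights must be shipped to every mapper, and the
        # framework pays job startup and disk I/O costs in between.
        partials = [map_gradient(p, w) for p in partitions]
        grad = reduce_gradients(partials)
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

# Two "partitions" of points from y = 1 + 2x, with a bias feature of 1.
parts = [[((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0)],
         [((1.0, 3.0), 7.0), ((1.0, 4.0), 9.0)]]
print(gradient_descent(parts, dim=2))  # converges toward [1.0, 2.0]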

In order to address the shortcomings of MapReduce, alternatives have been developed to function either independently of, or in addition to, existing MapReduce implementations [18]:

- Pregel [19] and Giraph [20] are alternative models based on the Bulk Synchronous Parallel paradigm. They enable all states to be retained in memory, facilitating the iterative process.
- Spark [21] is another alternative, based on the resilient distributed dataset abstraction, which uses memory to update shared states and facilitates implementations such as gradient descent.
- HaLoop [22] and Twister [23] are both extensions designed for Hadoop [4] in order for this MapReduce implementation to better support iterative algorithms.

Each of these tools possesses its own strengths and area of focus, but the difficult integration and potential incompatibilities between the tools and frameworks reveal new research opportunities that would fulfill the need for a uniform ML solution.

When considering the volume component of Big Data, additional statistical and computational challenges are revealed. Regardless of the paradigm used to develop the algorithms, an important determinant of the success of supervised ML approaches is the pre-processing of the data. This step is often critical to obtaining reliable and meaningful results. Data cleaning, normalization, feature extraction and selection [24] are all essential to obtaining an appropriate training set. This poses a massive challenge in the light of Big Data, as the preprocessing of massive numbers of tuples is often not possible.

The variety component of Big Data also introduces heterogeneity and high dimensionality, which in turn introduce the following challenges [25]:

- Noise accumulation may be so great that it overpowers the significant data.
- Spurious or false correlations may be present between different data points although no real relationship exists.
- Incidental endogeneity, meaning that regressors are related to the regression error, which could lead to inconsistencies and false discoveries [26].

In particular, the concept of noise has provided a paradigm shift in the underlying algebra used for ML algorithms. Dalessandro [27] illustrates the usefulness of accepting noise as a given and then using more efficient, but less accurate, learning models. Dalessandro shows that using computationally less expensive algorithms, which are also less accurate during intermediate steps, will define a model which performs equally well in predicting new outputs when trained on Big Data. These algorithms may take more iterations than their computationally more expensive counterparts; however, the iterations are much faster. Due to this, the less expensive algorithms tend to converge much faster, while giving the same accuracy. An example of such an algorithm is stochastic gradient descent [27].

In addition to the challenges mentioned above, having a variety of dissimilar data sources, each storing dissimilar data types, can also affect the performance of ML algorithms. Data preprocessing could alleviate some of those challenges and is particularly important in the MapReduce paradigm, where outliers can greatly influence the performance of algorithms [28]. To remedy these problems, solutions have been developed to implement data preprocessing algorithms using MapReduce [29]; one such preprocessing step is sketched below.
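As a hypothetical illustration of preprocessing expressed in MapReduce terms (not taken from [29]), the following Python sketch computes per-feature minima and maxima in a Map/Reduce pass and then applies min-max normalization in a second, map-only pass. The two-pass structure is exactly what makes it costly to combine preprocessing with the analysis itself.

def map_stats(partition):
    # Map: per-partition (min, max) for each feature.
    # Assumes a non-empty partition of equal-length rows.
    mins, maxs = list(partition[0]), list(partition[0])
    for row in partition[1:]:
        for j, v in enumerate(row):
            mins[j] = min(mins[j], v)
            maxs[j] = max(maxs[j], v)
    return mins, maxs

def reduce_stats(partials):
    # Reduce: combine the per-partition statistics into global ones.
    mins, maxs = map(list, partials[0])
    for pmins, pmaxs in partials[1:]:
        mins = [min(a, b) for a, b in zip(mins, pmins)]
        maxs = [max(a, b) for a, b in zip(maxs, pmaxs)]
    return mins, maxs

def normalize(partition, mins, maxs):
    # Second, map-only pass: rescale every feature into [0, 1].
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in partition]

parts = [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]]
mins, maxs = reduce_stats([map_stats(p) for p in parts])
print([normalize(p, mins, maxs) for p in parts])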
However, it is still necessary to find ways to integrate the analysis and preprocessing phases, which creates new research prospects.

The velocity component of Big Data introduces the idea of concept drift within the learning model. In MapReduce, this problem is aggravated by the necessity to pre-process data, which introduces additional delays. The fast arrival of data, along with potentially long computing times, may cause concept drift, which Yang and Fong define as a "known problem in data analytics, in which the statistical properties of the attributes and their target classes shift over time, making the trained model less accurate" [30]. Thus, accurate concept drift detection constitutes an important research area for ensuring the accuracy of ML approaches with Big Data.

An important subset of ML algorithms is predictive modeling. That is, given a set of known inputs and outputs, can we predict an unknown output with some probability? Being able to construct an accurate prediction model is hugely important in many disparate domains such as credit card fraud detection, user recommendation systems, malicious URL identification, and many others. For example, to predict movies that clients will enjoy, companies such as Yahoo and Netflix collect a large variety of information on their clients to build accurate recommender systems.

From the authors' observation, parallelism techniques for predictive modeling fall into three categories of implementation:
1. Run the predictive algorithm on subsets of the data, and return all the results.
2. Generate intermediate results from subsets of the data, and resolve the intermediate results into a final result.
3. Parallelize the underlying linear algebra.

The two most promising forms of implementation for Big Data are categories 2 and 3. Category 2 is essentially the definition of a MapReduce job: the algorithm generates intermediate results using Map operations and combines these outputs using Reduce operations. Category 3 can also be seen as a MapReduce job if the underlying linear algebra is separable into Map and Reduce operations. Finally, Category 1 is essentially not a valid solution for Big Data, as the results are only indicative of small subsets of the data and not the prediction over the entire dataset. A sketch of a Category 2 job is given below.
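As a hypothetical Category 2 example, the Python sketch below fits a one-variable least-squares line by emitting per-partition sufficient statistics (counts and sums) from the Map step and resolving them into the final slope and intercept in the Reduce step. The helper names are the authors' illustration, not a library API.

def map_partition_stats(partition):
    # Map: per-partition sufficient statistics for least squares.
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)

def reduce_fit(stats):
    # Reduce: combine the statistics and resolve them into the model.
    n, sx, sy, sxx, sxy = (sum(s) for s in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

parts = [[(1.0, 3.1), (2.0, 4.9)], [(3.0, 7.2), (4.0, 8.8)]]
# Data generated near y = 1 + 2x; prints roughly (1.94, 1.15).
print(reduce_fit([map_partition_stats(p) for p in parts]))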

MapReduce with predictive modeling has a major constraint which limits its usefulness when predicting highly correlated data. MapReduce works well in contexts where observations can be processed individually. In this case, the data can be split up, calculated, and then aggregated together. However, if there are correlated observations that need to be processed together, MapReduce offers little benefit over non-distributed architectures. This is because the observations that are correlated will quite commonly be found within disparate clusters, leading to large performance overheads for data communication between clusters. Use cases such as this are commonly found in predicting stock market fluctuations. To allow MapReduce to be used in these types of predictive modeling problems, there are a few potential solutions based on solutions from predictive modeling on traditional data sizes: data reduction, data aggregation, and sampling [31].

B. Interactive Analytics

Interactive analytics can be defined as a set of approaches that allow data scientists to explore data in an interactive way, supporting exploration at the rate of human thought [32]. Interactive analytics on Big Data provides some exciting research areas and unique problems. Most notable, and similar to other data analytic approaches, is the question of how to build scalable systems that query and visualize data at interactive rates. The important difference from other data analytic paradigms is the notion of interactive rates. By definition, interactive analysis requires the user to continually tweak or modify their approach to generate interesting analytics [33].

MapReduce for interactive analytics poses a drastic shift from the classic MapReduce use case of processing batch computations. Interactive analytics involves performing several small, short, and interactive jobs. As interactive analytics begins to move from RDBMSs to Big Data storage systems, some prior assumptions regarding MapReduce are broken, such as uniform data access and the prevalence of large batch jobs. This type of analysis requires a new class of MapReduce workloads to deal with interactive, almost real-time data models. Chen et al. [34] discuss these considerations in their survey of industry solutions, where the authors find that extending MapReduce with querying frameworks such as Pig and Hive is prevalent. Chen et al. note that interactive analysis for Big Data can be seen as an extension of the already well-researched area of interactive query processing. Making this assumption, there exist potential solutions to optimize interactive analytics with MapReduce by mirroring the already existing work in interactive query processing. One open area of future work is finding the best method to bring these solutions to the MapReduce programming paradigm.

MapReduce is one parallelism model for interactive analytics. Another approach, tuned for interactivity, is Google's Dremel system [35], which acts as a complement to MapReduce. Dremel builds on a novel column-family storage format, as well as algorithms that construct the columns and reassemble the original data. Some highlights of the Dremel system are:

- Real-time interactivity for scan-based queries.
- Near linear scalability in the number of clusters.
- Early termination, similar to progressive analytics, to provide speed tradeoffs for accuracy.

Other interactive analytics research has been based on the column-family NoSQL data storage approach [36, 37]. The main benefit of column-based approaches versus row-based, traditional, approaches is that only a fraction of the data needs to be accessed when processing typical queries [8].
However, most of these approaches are specialized for certain types of datasets and certain queries, and they thus leave an open research area for a generalized solution.

C. Data Visualization

A large category of interactive analytics is data visualization. There are two primary problems associated with Big Data visualization. First, many instances of Big Data involve datasets with a large number of features (wide datasets), and building a highly multi-dimensional visualization is a difficult task. Second, as data grow larger vertically (tall datasets), uninformative visualizations are generally produced. For these tall datasets, the resolution of the data must be limited, i.e., through a process that aggregates outputs to ensure that highly dense data can still be deciphered [32]. For highly wide datasets, a preprocessing step to reduce the dimensionality is needed. Unfortunately, this tends to be useful on tens to hundreds of dimensions; for even higher dimensions, a mixed-initiative method, including human intervention, is required to determine subsets of related dimensions [32]. This approach generally requires human input to determine an initial subset of "interesting" features, which is also a difficult task and an open research area.

MapReduce for data visualization currently performs well in two cases: memory-insensitive visualization algorithms, and inherently parallel visualization algorithms. Vo et al. [38] have provided a study on moving existing visualization algorithms to the MapReduce paradigm. One major contribution is empirically proving that MapReduce provides a good solution to large-scale exploratory visualization. The authors argue that this is because scalability is achieved through data reduction tasks, which can be highly parallel; these types of tasks are common in data visualization algorithms, and one is sketched below. Further, visualization algorithms that tend to increase the total amount of data in intermediate steps will perform poorly when mapped to the MapReduce paradigm. Another drawback to MapReduce with visualization is that a typical MapReduce job uses one pass over the data. Therefore, algorithms that require multiple iterations, such as mesh simplification, will suffer from a large overhead when naively mapped to the MapReduce paradigm. This is similar to the problems created for iterative machine learning algorithms discussed in Section IV-A. Therefore, there is potential for research aimed at providing optimized multiple-iteration solutions for MapReduce.
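As a hypothetical instance of such a data reduction task (not taken from Vo et al. [38]), the following Python sketch aggregates a tall dataset into a fixed grid of bins before plotting: Map emits a grid-cell key per point, Reduce counts the points per cell, and only the small binned summary ever reaches the visualization layer.

from collections import Counter

def map_to_bin(point, x_min, x_max, y_min, y_max, bins):
    # Map: assign a 2-D point to a grid cell; the cell index is the key.
    x, y = point
    bx = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
    by = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
    return (bx, by)

def reduce_counts(bin_keys):
    # Reduce: count points per cell. This summary has at most bins*bins
    # entries to render, no matter how tall the original dataset is.
    return Counter(bin_keys)

points = [(0.1, 0.15), (0.11, 0.19), (0.9, 0.8)]
keys = [map_to_bin(p, 0.0, 1.0, 0.0, 1.0, bins=10) for p in points]
print(reduce_counts(keys))  # Counter({(1, 1): 2, (9, 8): 1})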

V. ONLINE PROCESSING

The velocity dimension, as one of the Vs used to define Big Data, brings many new challenges to traditional data processing approaches and especially to MapReduce. Handling Big Data velocity often requires applications with online processing capabilities, which can be broadly defined as the real-time or quasi real-time processing of fast and continuously generated data (also known as data streams). From the business perspective, the goal is normally to obtain insights from these data streams and to enable prompt reaction to them. This instantaneous reaction can bring business value and competitive advantage to organizations, and it has therefore been generating research and commercial interest. Areas such as financial fraud detection and algorithmic trading have been highly interested in this type of solution.

The MapReduce paradigm is not an appropriate solution for this kind of low-latency processing because:

- MapReduce computations are batch processes that start and finish, while computations over streams are continuous tasks that only finish upon user request.
- The inputs of MapReduce computations are snapshots of data stored in files, and the contents of these files do not change during processing. Conversely, data streams are continuously generated and unbounded inputs [39].
- In order to provide fault tolerance, most MapReduce implementations, such as Google's [3] and Hadoop [4], write the results of the Map phase to local files before sending them to the reducers. In addition, these implementations store the output files in distributed and high-overhead file systems (the Google File System [40] and HDFS [4], respectively). This extensive file manipulation adds significant latency to the processing pipelines.
- Not every computation can be
