
Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds

Weiyi Shang†, Zhen Ming Jiang†, Hadi Hemmati†, Bram Adams‡, Ahmed E. Hassan†, Patrick Martin§
†Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University, Kingston, Canada
‡Département de Génie Informatique et Génie Logiciel, Polytechnique Montréal, Montréal, Québec, Canada
§Database Systems Laboratory, School of Computing, Queen's University, Kingston, Canada
{swy, zmjiang, hhemmati, martin, ahmed}@cs.queensu.ca, bram.adams@polymtl.ca

Abstract—Big data analytics is the process of examining large amounts of data (big data) in an effort to uncover hidden patterns or unknown correlations. Big Data Analytics Applications (BDA Apps) are a new type of software application that analyzes big data using massively parallel processing frameworks (e.g., Hadoop). Developers of such applications typically develop them using a small sample of data in a pseudo-cloud environment. Afterwards, they deploy the applications in a large-scale cloud environment with considerably more processing power and larger input data (reminiscent of the mainframe days). Working with BDA App developers in industry over the past three years, we noticed that the runtime analysis and debugging of such applications in the deployment phase cannot be easily addressed by traditional monitoring and debugging approaches. In this paper, as a first step in assisting developers of BDA Apps for cloud deployments, we propose a lightweight approach for uncovering differences between pseudo and large-scale cloud deployments. Our approach makes use of the readily-available yet rarely used execution logs from these platforms. It abstracts the execution logs, recovers the execution sequences, and compares the sequences between the pseudo and cloud deployments. Through a case study on three representative Hadoop-based BDA Apps, we show that our approach can rapidly direct the attention of BDA App developers to the major differences between the two deployments. Knowledge of such differences is essential in verifying BDA Apps when analyzing big data in the cloud. Using injected deployment faults, we show that our approach not only significantly reduces the deployment verification effort, but also produces very few false positives when identifying deployment failures.

Index Terms—Big-Data Analytics Application, Cloud Computing, Monitoring and Debugging, Log Analysis, Hadoop

I. INTRODUCTION

Big Data Analytics Applications (BDA Apps) are a new category of software applications that leverage large-scale data, which is typically too large to fit in memory or even on one hard drive, to uncover actionable knowledge using large-scale parallel-processing infrastructures [1]. The big data can come from sources such as runtime information about traffic, tweets during the Olympic games, stock market updates, usage information of an online game [2], or the data from any other rapidly growing data-intensive software system. For instance, EBAY¹ has deployed BDA Apps to optimize the search of products by analyzing over 5 PBs of data using more than 4,000 CPU cores [3].

¹ www.ebay.com, last checked February 2013.

Over the past three years we have been working closely with BDA App developers in industry. We found that developing BDA Apps brings many new challenges compared to traditional programming and testing practices.
Among all the challenges in the different phases of BDA App development, the deployment phase introduces unique challenges related to verifying and debugging BDA executions, as BDA App developers want to know whether their BDA App will function correctly once deployed. Similar observations were recently noted in an interview of 16 professional BDA App developers at Microsoft [1].

In practice, the deployment of BDA Apps in the cloud follows three steps: 1) developers implement and test the BDA App in a small or pseudo-cloud environment (using virtual or physical machines) with a small data sample, 2) developers deploy the application on a larger cloud with a considerably larger data set and more processing power, to test the application in a real-life setting, and 3) developers verify the execution of the application to make sure all data is processed and all jobs are successful. The traditional approach for deployment verification is to simply search for known error keywords related to unusual executions. However, such verification approaches are very ineffective in large cloud deployments. For instance, a common basic approach for identifying deployment problems is searching for "killed" jobs in the generated execution logs (the output of the internal instrumentation) of the underlying platform hosting the deployed application [4]. However, a simple keyword search leads to false positives, since a platform such as Hadoop may intervene in the execution of a job, killing it and restarting it elsewhere to achieve better performance, or it might start and kill speculative jobs [4]. Given the large amount of data and logs, such false positives rapidly overwhelm the developer of a BDA App.

In this paper, we propose an approach for verifying the runtime execution of BDA Apps after deployment. The approach abstracts the platform's execution logs from both the small (pseudo) and large-scale cloud deployments, groups the related abstracted log lines into execution sequences for both deployments, then examines and reports the differences between the two sets of execution sequences. Ideally, these two sets should be identical for a successful deployment. However, due to differences in framework configuration and data size, the underlying platform may execute the applications differently. Among the delta between these two sets of execution sequences, we filter out sequences that are due to well-known platform-related (in our case study, Hadoop) differences. The remaining sequences are potential deployment failures or anomalies that should be reported and carefully examined.
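To make the comparison concrete, the following sketch (ours, not the authors' prototype) treats each recovered execution sequence as a tuple of abstracted event IDs and reports the symmetric difference between the two deployments, minus a caller-supplied set of known benign platform differences; all names and sequences are illustrative.

```python
# Minimal sketch of the delta-reporting idea (our illustration, not the
# authors' tool). Each execution sequence is a tuple of abstracted event
# IDs, e.g. ("E1", "E2", "E3", "E5", "E6").

def deployment_delta(pseudo_seqs, cloud_seqs, known_platform_diffs=frozenset()):
    """Return sequences seen in only one deployment, minus differences
    known to be caused by the platform itself (e.g., Hadoop speculative
    execution)."""
    pseudo, cloud = set(pseudo_seqs), set(cloud_seqs)
    # Sequences present in exactly one of the two deployments.
    delta = pseudo.symmetric_difference(cloud)
    # Filter out well-known, benign platform-related differences.
    return {seq for seq in delta if seq not in known_platform_diffs}

# An identical pair of deployments yields an empty delta; here two
# sequences differ and are reported for inspection.
pseudo = [("E1", "E2", "E3", "E5", "E6"), ("E1", "E2", "E4", "E6")]
cloud = [("E1", "E2", "E3", "E5", "E6"), ("E1", "E2", "E7", "E6")]
print(deployment_delta(pseudo, cloud))
# {('E1', 'E2', 'E4', 'E6'), ('E1', 'E2', 'E7', 'E6')}
```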

We have implemented our approach as a prototype tool and performed a case study on three representative Hadoop [4] BDA Apps. We chose Hadoop because it is one of the most used platforms for big data analytics in industry today. However, our general idea of using the underlying platform's logs as a means for BDA App monitoring in the cloud is easily extensible to other platforms, such as Microsoft Dryad [5]. The case study results show that our log abstraction and clustering into execution sequences not only significantly reduces the amount of logs that must be verified (by 86% to 97%), but also provides much higher precision for identifying deployment failures and anomalies than a traditional keyword search approach (commonly used in practice today). In addition, practitioners who have used our approach in practice noted that reporting abstracted execution sequences, rather than raw log lines, provides a summarized context that dramatically improves their efficiency in identifying and investigating failures and anomalies.

The rest of this paper is organized as follows. Section II presents a motivating example. Section III presents Hadoop, the platform that we studied. Section IV presents our approach for summarizing logs into execution log sequences. Section V presents the setup of our case studies, and Section VI presents their results. Section VII discusses other features of our approach, and Section VIII discusses its limitations. Section IX presents prior work related to our approach. Finally, Section X concludes the paper.

II. A MOTIVATING EXAMPLE

We now present a hypothetical but realistic motivating example to better explain the challenges of deploying BDA Apps in a cloud environment.

Assume developer Ian developed a BDA App that analyzes user information from a large-scale social network. Ian has thoroughly tested the App on an in-house small-scale cloud environment with a small sample of testing data. Before officially releasing the App, Ian needs to deploy it in a large-scale cloud environment and run it with real-life large-scale data. After the test run of the App in the real cloud setup, Ian needs to verify whether the App behaved as expected in that environment.

Ian followed a traditional approach to examine the behaviour of the App in the cloud environment: he leveraged the logs from the underlying platform (e.g., Hadoop) to look for problematic log lines. After downloading all the logs from the cloud environment, Ian found that the logs are of enormous size, because the cloud environment contains thousands of nodes and the processed real-life data is at the PB scale, which makes manual inspection of the logs impossible. Therefore, Ian performed a simple keyword search on the logs, with keywords based on his own experience of developing BDA Apps. However, the keyword search still returned thousands of problematic log lines. By manually exploring them, Ian found that a large portion of the log lines do not indicate any problematic executions (i.e., they are false positives). For example, the run-time scheduler of the underlying platform often kills remote processes and restarts them locally to achieve better performance. Such kill operations lead to seemingly problematic logs that are retrieved by his keyword search. Moreover, for each log line, Ian must trace through the log files across multiple nodes to gain some context about the generated log lines (and in many instances he discovers that such log lines are expected and not problematic). In short, identifying deployment problems of the BDA App is excessively difficult and time-consuming, and the difficulty increases considerably as the size of the analyzed data and of the cloud grows.

From the above example, we observe that verifying the deployment of BDA Apps in a cloud environment with large-scale data is challenging. Although today developers primarily use grep [6] to locate possibly troublesome instrumentation logs, uncovering the related context of the troublesome logs remains challenging with enormously large data (as noted in recent interviews of BDA App developers [1]).

In the following sections, we present our approach, which summarizes the large amount of platform logs and presents them in tables where developers can easily note troublesome events and view such events in the context of their execution (since the tables show summarized execution sequences).
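For concreteness, the traditional verification that Ian performs amounts to a grep-style scan like the sketch below. The keyword list and log lines are hypothetical; the example deliberately shows how a benign speculative-execution kill gets flagged as a false positive.

```python
# A grep-style keyword scan of platform logs (hypothetical keywords and
# log lines; this illustrates the traditional baseline, not our tool).
SUSPICIOUS_KEYWORDS = ("killed", "failed", "error", "exception")

def keyword_scan(log_lines):
    """Return every line containing a suspicious keyword."""
    return [line for line in log_lines
            if any(kw in line.lower() for kw in SUSPICIOUS_KEYWORDS)]

logs = [
    "time 4, Task Reduce, TaskID 01A",
    # Benign: the scheduler killed a redundant speculative attempt.
    "time 8, Killed speculative attempt, TaskID 077",
]
print(keyword_scan(logs))  # flags the speculative kill: a false positive
```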
III. LARGE-SCALE DATA ANALYSIS PLATFORMS: HADOOP

Hadoop is one of the most widely used platforms for the development of BDA Apps in practice today. We briefly present the programming model of Hadoop, then present the Hadoop logs that we use in our case studies.

A. The MapReduce Programming Model

Hadoop is an open-source distributed platform [4] that is supported by Yahoo! and is used by Amazon, AOL and a number of other companies. To achieve parallel execution, Hadoop implements a programming model named MapReduce. This programming model is implemented by many other cloud platforms as well [5], [7].

MapReduce [8] is a distributed divide-and-conquer programming model that consists of two phases: a massively parallel "Map" phase, followed by an aggregating "Reduce" phase. The input data of MapReduce is broken down into a list of key/value pairs. Mappers (processes assigned to the "Map" phase) accept the incoming pairs, process them in parallel and generate intermediate key/value pairs. All intermediate pairs having the same key are then passed to a specific Reducer (a process assigned to the "Reduce" phase). Each Reducer performs computations to reduce the data to one single key/value pair. The output of all Reducers is the final result of a MapReduce run.

To illustrate MapReduce, consider an example MapReduce process that counts the frequency of word lengths in a book. Mappers take each single word from the book and generate a key/value pair of the form "word length/dummy value". For example, a Mapper generates the key/value pair "5/hello" from the input word "hello". Afterwards, the key/value pairs with the same key are grouped and sent to Reducers. Each Reducer receives the list of all key/value pairs for a particular word length and hence can simply output the size of this list. If a Reducer receives a list with key "5", for example, it counts all the words with length "5". If the size is n, it generates an output pair "5/n", which means there are n words of length "5" in the book.
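The word-length example can be mimicked in a few lines of plain Python. The sketch below is a single-process simulation of the Map, shuffle and Reduce phases for illustration only; it is not Hadoop code.

```python
from collections import defaultdict

def map_phase(words):
    # Each Mapper emits a (word length, dummy value) pair per word,
    # e.g. "hello" -> (5, "hello").
    return [(len(word), word) for word in words]

def shuffle(pairs):
    # Group intermediate pairs by key, as the platform does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each Reducer outputs (word length, number of words of that length).
    return {key: len(values) for key, values in groups.items()}

words = "the quick brown fox jumps over the lazy dog".split()
print(reduce_phase(shuffle(map_phase(words))))  # {3: 4, 5: 3, 4: 2}
```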

B. Components of Hadoop

Hadoop has three types of execution components, each with logging enabled. Such platform logging tracks the operation of the platform itself (i.e., how the platform is orchestrating the MapReduce processing). Today, such logging is enabled in all deployed Hadoop clusters, and it provides a glimpse into the inner working mechanism of the platform, which is impacted by any problems in the cloud on which the platform is executing. The three execution components, with brief examples of the logs they generate, are as follows:

- Job. A Hadoop program consists of one or multiple MapReduce steps running as a pipeline. Each MapReduce step is a Job in Hadoop. A JobTracker is a process initialized by the Hadoop platform to track the status of the Jobs. The information tracked by the JobTracker includes the overall information of the Job (e.g., input data size) and high-level information about the Job's execution, which corresponds to the executions of Map and Reduce. For example, a Job log may say "the Job is split into 100 Map Tasks" and "Map TaskId id is finished at time t1".

- Task. The execution of a Job is divided into multiple Tasks based on the MapReduce programming model. A Task is therefore either a Map Task, corresponding to the Map in the MapReduce programming model, or a Reduce Task. The Hadoop platform groups a set of Map or Reduce executions together to create a Task; hence each Task contains more than one execution of Map or Reduce. Similar to the JobTracker, the TaskTracker monitors the execution of a Task. For example, a Task log may say "received commit of Task Id id".

- Attempt. To support fault tolerance, the Hadoop platform allows each Task multiple trials of execution. Each such execution is an Attempt. Typically, a new Attempt of a Task starts only when a previous Attempt has failed, and this restart process continues until the Task completes successfully or the number of failed Attempts exceeds a threshold. There are exceptions, however, such as "speculative execution", which we discuss later in this paper. The Attempt is also monitored by the TaskTracker, and the detailed execution information of the Attempt, such as "Reading data for Map task with TaskID id", is recorded in the Attempt logs.

The Job, Task and Attempt logs form the source of information used by our approach. We use these platform-level logs instead of application-level logs because they provide information about the inner working of the platform itself, not the application, which is assumed to be correctly implemented for our purposes. In particular, the platform logs provide us with information about any deployment problems.
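As an aside, the Job/Task/Attempt hierarchy is visible in the identifiers that classic (1.x-era) Hadoop embeds in its logs: an attempt ID names the job, the task type (m for Map, r for Reduce), the task number and the attempt number. A small parsing sketch, with an invented sample ID:

```python
import re

# Hadoop 1.x-style attempt IDs encode the Job/Task/Attempt hierarchy
# (the sample ID below is invented for illustration).
ATTEMPT_ID = re.compile(
    r"attempt_(?P<job>\d+_\d+)_(?P<kind>[mr])_(?P<task>\d+)_(?P<attempt>\d+)")

def parse_attempt(attempt_id):
    m = ATTEMPT_ID.fullmatch(attempt_id)
    if m is None:
        raise ValueError(f"not an attempt ID: {attempt_id}")
    return {
        "job": f"job_{m['job']}",
        "task_type": "Map" if m["kind"] == "m" else "Reduce",
        "task": int(m["task"]),
        "attempt": int(m["attempt"]),  # > 0 means the task was retried
    }

print(parse_attempt("attempt_201301011200_0001_m_000003_0"))
# {'job': 'job_201301011200_0001', 'task_type': 'Map', 'task': 3, 'attempt': 0}
```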
IV. APPROACH

The basic idea behind our approach is to cluster the platform logs to improve their comprehensibility, and to help understand and flag differences in run-time behaviour.

As mentioned before, our approach is based on the analysis of the platform logs of BDA Apps. These logs are generated by statements embedded by the platform developers because they consider the logged information to be particularly important. Although rarely explored in full, platform logs contain rich knowledge: they typically record the major system activities and their associated contexts (e.g., operation ids). Such logs are a valuable resource for studying the run-time behaviour of a software system, since they are generated by the internal instrumentation and are readily available. However, previous research shows that logs continuously change and evolve [9], so ad hoc approaches based on keyword search may not always work. We therefore propose an approach that does not rely on particular phrases or formats of logs. Figure 1 shows an overview of our approach.

Our approach compares the run-time behaviour of the underlying platform of a BDA App in a testing environment with a small data sample to its behaviour in a cloud environment with large-scale data. To overcome the enormous amount of logs generated by a BDA platform, and to provide useful context for the developers looking at our results, we recover the execution sequences of the logs.

A. Execution Sequence Recovery

In this step, we recover sequences of the execution logs. The log sequence clustering includes three phases.

1) Log Abstraction: Log files typically do not follow strict formats, but instead contain significant unstructured data. For example, log lines may contain the task type, the execution time stamp and free-form text, making it hard to extract structured information from them. In addition to being free-form, log lines contain static and dynamic information. The static information is specific to each particular event, while the dynamic values describe the event's context. We use a technique proposed by Jiang et al. [10] to abstract logs. This technique is designed to be generalizable, as it does not rely on any particular log format. Using the technique, we first identify the static and dynamic values of the logs based on a small sample of logs. Then we apply the identified static and dynamic parts on the full logs to abstract them into execution events.
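A minimal approximation of this abstraction step, assuming the simple "time t, Task ..., TaskID ..." format of the running example in Figure 2 (and not the full technique of Jiang et al.), could look as follows:

```python
import re

# Normalize dynamic values to placeholders; assign one event ID per
# distinct template. A sketch for the running example's log format only.
DYNAMIC = [(re.compile(r"time \d+"), "time $t"),
           (re.compile(r"TaskID \w+"), "TaskID $id")]

TEMPLATES = {}  # maps each distinct template to a stable event ID

def abstract(line):
    for pattern, placeholder in DYNAMIC:
        line = pattern.sub(placeholder, line)
    event = TEMPLATES.setdefault(line, f"E{len(TEMPLATES) + 1}")
    return event, line

print(abstract("time 1, Task Trying to launch, TaskID 01A"))
# ('E1', 'time $t, Task Trying to launch, TaskID $id')
print(abstract("time 3, Task JVM, TaskID 01A"))
# ('E2', 'time $t, Task JVM, TaskID $id')
```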

[Fig. 1. Overview of our approach: logs from a testing run with a small data sample and from a run with large data in the cloud pass through log abstraction and log linking; the resulting execution sequences are compared to report the delta.]

Figure 2 shows an example of a log file with 11 log lines and how we process it. Each log line contains the execution time stamp, the task type and the task ID. The log lines are abstracted into execution events, as shown in Figure 2-b. The "$id" and "$t" identifiers indicate two dynamic values.

Fig. 2. An example of our approach.

(a) Example log lines:

   #   Log line
   1   time 1, Task Trying to launch, TaskID 01A
   2   time 2, Task Trying to launch, TaskID 077
   3   time 3, Task JVM, TaskID 01A
   4   time 4, Task Reduce, TaskID 01A
   5   time 5, Task JVM, TaskID 077
   6   time 6, Task Reduce, TaskID 01A
   7   time 7, Task Reduce, TaskID 01A
   8   time 8, Task Progress, TaskID 077
   9   time 9, Task Done, TaskID 077
  10   time 10, Task Commit Pending, TaskID 01A
  11   time 11, Task Done, TaskID 01A

(b) Execution events:

  Event   Event template                               Log lines
  E1      time $t, Task Trying to launch, TaskID $id   1, 2
  E2      time $t, Task JVM, TaskID $id                3, 5
  E3      time $t, Task Reduce, TaskID $id             4, 6, 7
  E4      time $t, Task Progress, TaskID $id           8
  E5      time $t, Task Commit Pending, TaskID $id     10
  E6      time $t, Task Done, TaskID $id               9, 11

(c) Execution sequences after linking:

  TaskID   Event sequence
  01A      E1, E2, E3, E3, E3, E5, E6
  077      E1, E2, E4, E6

(d) Final execution sequences after eliminating repetitions:

  TaskID   Event sequence
  01A      E1, E2, E3, E5, E6
  077      E1, E2, E4, E6

2) Log Linking: This phase uses dynamic values, such as "$id", to link log lines into a sequence. The linking heuristic is based on the name of the dynamic value. Similar to log abstraction, we identify the linking IDs based on a small sample of data, then apply the linking on the full data. In our example, TaskID is used for event linking since TaskID contains the string "ID". Therefore, line 1 and line 3 of the input data in Figure 2-a can be linked together, since they contain the same TaskID. Figure 2-c shows the resulting sequences after abstracting the logs and linking them using the TaskID values: Events E1, E2, E3, E5 and E6 are linked together (note that event E3 has been executed three times), and Events E1, E2, E4 and E6 are linked together, since each group shares the same TaskID value.

3) Eliminating Repetitions: There can be event repetitions in the recovered sequences caused by loops. For example, in sequences about reading data from a remote node, there would be repeated events about fetching the data. Without this step, similar log sequences that include different numbers of occurrences of the same event are considered different sequences, although they indicate the same system behaviour in essence. These repeated events need to be suppressed to ease the analysis. We use regular expressions to detect and suppress the repetitions. For the example shown in Figure 2, in the sequence "E1, E2, E3, E3, E3, E5, E6" our technique detects the repetition of E3 and reduces the sequence to "E1, E2, E3, E5, E6". After eliminating looping, the final log sequences are shown in Figure 2-d.
