
DOI: 10.1145/2934664

Apache Spark: A Unified Engine for Big Data Processing

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

BY MATEI ZAHARIA, REYNOLD S. XIN, PATRICK WENDELL, TATHAGATA DAS, MICHAEL ARMBRUST, ANKUR DAVE, XIANGRUI MENG, JOSH ROSEN, SHIVARAM VENKATARAMAN, MICHAEL J. FRANKLIN, ALI GHODSI, JOSEPH GONZALEZ, SCOTT SHENKER, AND ION STOICA

key insights
- A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
- Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
- In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.

THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads [1, 4, 7, 10]. At first, these models were relatively specialized, with new models developed for new workloads; for example, MapReduce [4] supported batch processing, but Google also developed Dremel [13] for interactive SQL queries and Pregel [11] for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm [1] and Impala [9] are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems [18]. Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine.

In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs [25]. Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing [2, 26, 6] (see Figure 1). These implementations use the same optimizations as specialized engines (such as column-oriented processing and incremental updates) and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose.

Rather than being specific to these workloads, we claim this result is more general; when augmented with data sharing, MapReduce can emulate any distributed computation, so it should also be possible to run many other types of workloads [24].

Spark's generality has several important benefits. First, applications are easier to develop because they use a unified API. Second, it is more efficient to combine processing tasks; whereas prior systems required writing the data to storage to pass it to another engine, Spark can run diverse functions over the same data, often in memory. Finally, Spark enables new applications (such as interactive queries on a graph and streaming machine learning) that were not possible with previous systems. One powerful analogy for the value of unification is to compare smartphones to the separate portable devices that existed before them (such as cameras, cellphones, and GPS gadgets). In unifying the functions of these devices, smartphones enabled new applications that combine their functions (such as video messaging and Waze) that would not have been possible on any one device.

Since its release in 2010, Spark has grown to be the most active open source project for big data processing, with more than 1,000 contributors. The project is in use in more than 1,000 organizations, ranging from technology companies to banking, retail, biotechnology, and astronomy. The largest publicly announced deployment has more than 8,000 nodes [22].

Analyses performed using Spark of brain activity in a larval zebrafish: (left) matrix factorization to characterize functionally similar regions (as depicted by different colors) and (right) embedding dynamics of whole-brain activity into lower-dimensional trajectories. Source: Jeremy Freeman and Misha Ahrens, Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA.

As Spark has grown, we have sought to keep building on its strength as a unified engine. We (and others) have continued to build an integrated standard library over Spark, with functions from data import to machine learning. Users find this ability powerful; in surveys, we find the majority of users combine multiple Spark libraries in their applications.

As parallel data processing becomes common, the composability of processing functions will be one of the most important concerns for both usability and performance. Much of data analysis is exploratory, with users wishing to combine library functions quickly into a working pipeline. However, for "big data" in particular, copying data between different systems is anathema to performance. Users thus need abstractions that are general and composable. In this article, we introduce the Spark programming model and explain why it is highly general. We also discuss how we leveraged this generality to build other processing tasks over it. Finally, we summarize Spark's most common applications and describe ongoing development work in the project.

Figure 1. Apache Spark software stack, with specialized processing libraries (Streaming, SQL, ML, Graph) implemented over the core engine.

Programming Model
The key programming abstraction in Spark is RDDs, which are fault-tolerant collections of objects partitioned across a cluster that can be manipulated in parallel. Users create RDDs by applying operations called "transformations" (such as map, filter, and groupBy) to their data.

Spark exposes RDDs through a functional programming API in Scala, Java, Python, and R, where users can simply pass local functions to run on the cluster. For example, the following Scala code creates an RDD representing the error messages in a log file, by searching for lines that start with ERROR, and then prints the total number of errors:

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())

The first line defines an RDD backed by a file in the Hadoop Distributed File System (HDFS) as a collection of lines of text. The second line calls the filter transformation to derive a new RDD from lines. Its argument is a Scala function literal or closure. (The closures passed to Spark can call into any existing Scala or Python library or even reference variables in the outer program; Spark sends read-only copies of these variables to worker nodes.) Finally, the last line calls count, another type of RDD operation called an "action" that returns a result to the program (here, the number of elements in the RDD) instead of defining a new RDD.

Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user's computation. In particular, transformations return a new RDD object representing the result of a computation but do not immediately compute it. When an action is called, Spark looks at the whole graph of transformations used to create an execution plan. For example, if there were multiple filter or map operations in a row, Spark can fuse them into one pass, or, if it knows that data is partitioned, it can avoid moving it over the network for groupBy [5]. Users can thus build up programs modularly without losing performance.
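To make the lazy evaluation model concrete, the short sketch below (our illustration, not from the original figures) chains two transformations and then triggers them with a single action; it assumes the same SparkContext variable as the example above. Nothing is read from storage when the transformations are declared, and toDebugString only prints the recorded lineage, so Spark is free to run the filter and map together in one pass over each block of the file when count() finally executes.

// Transformations only record how to compute a new RDD; no data is read here.
val logLines  = spark.textFile("hdfs://...")                 // path elided, as in the example above
val errLines  = logLines.filter(s => s.startsWith("ERROR"))
val timeField = errLines.map(line => line.split('\t')(3))

// Inspect the recorded transformation graph without executing it.
println(timeField.toDebugString)

// Only this action triggers execution; Spark plans the whole graph at once
// and can fuse the filter and map into a single scan of the input.
println(timeField.count())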
Finally, RDDs provide explicit support for data sharing among computations. By default, RDDs are "ephemeral" in that they get recomputed each time they are used in an action (such as count). However, users can also persist selected RDDs in memory for rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For example, a user searching through a large set of log files in HDFS to debug a problem might load just the error messages into memory across the cluster by calling

errors.persist()

After this, the user can run a variety of queries on the in-memory data:

// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()

// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
      .map(line => line.split('\t')(3))
      .collect()

This data sharing is the main difference between Spark and previous computing models like MapReduce; otherwise, the individual operations (such as map and groupBy) are similar. Data sharing provides large speedups, often as much as 100x, for interactive queries and iterative algorithms [23]. It is also the key to Spark's generality, as we discuss later.
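As a side note on persistence, the persist() call above uses Spark's default in-memory level; the brief sketch below (ours, assuming the standard org.apache.spark.storage.StorageLevel API rather than anything shown in the article) illustrates how a user could instead ask Spark to spill partitions that do not fit in memory to local disk rather than recompute them.

import org.apache.spark.storage.StorageLevel

// The no-argument persist() above corresponds to StorageLevel.MEMORY_ONLY:
// partitions that do not fit in memory are recomputed from lineage on demand.
// An RDD's storage level can only be set once, so the call below would replace
// the earlier persist() rather than follow it.
errors.persist(StorageLevel.MEMORY_AND_DISK)   // spill overflow partitions to disk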

Fault tolerance. Apart from providing data sharing and a variety of parallel operations, RDDs also automatically recover from failures. Traditionally, distributed computing systems have provided fault tolerance through data replication or checkpointing. Spark uses a different approach called "lineage" [25]. Each RDD tracks the graph of transformations that was used to build it and reruns these operations on base data to reconstruct any lost partitions. For example, Figure 2 shows the RDDs in our previous query, where we obtain the time fields of errors mentioning PHP by applying two filters and a map. If any partition of an RDD is lost (for example, if a node holding an in-memory partition of errors fails), Spark will rebuild it by applying the filter on the corresponding block of the HDFS file. For "shuffle" operations that send data from all nodes to all other nodes (such as reduceByKey), senders persist their output data locally in case a receiver fails.

Figure 2. Lineage graph for the third query in our example; boxes represent RDDs, and arrows represent transformations: lines -> filter(line.startsWith("ERROR")) -> errors -> filter(line.contains("PHP")) -> PHP errors -> map(line.split('\t')(3)) -> time fields.

Lineage-based recovery is significantly more efficient than replication in data-intensive workloads. It saves both time, because writing data over the network is much slower than writing it to RAM, and storage space in memory. Recovery is typically much faster than simply rerunning the program, because a failed node usually contains multiple RDD partitions, and these partitions can be rebuilt in parallel on other nodes.

A longer example. As a longer example, Figure 3 shows an implementation of logistic regression in Spark. It uses batch gradient descent, a simple iterative algorithm that computes a gradient function over the data repeatedly as a parallel sum. Spark makes it easy to load the data into RAM once and run multiple sums. As a result, it runs faster than traditional MapReduce. For example, in a 100GB job (see Figure 4), MapReduce takes 110 seconds per iteration because each iteration loads the data from disk, while Spark takes only one second per iteration after the first load.

Figure 3. A Scala implementation of logistic regression via batch gradient descent in Spark.

// Load data into an RDD
val points = sc.textFile(...).map(readPoint).persist()

// Start with a random parameter vector
var w = DenseVector.random(D)

// On each iteration, update param vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

Figure 4. Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of data on 50 m2.4xlarge EC2 nodes. (The chart plots running time in seconds against the number of iterations for Hadoop and Spark.)
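To connect the code in Figure 3 to the underlying mathematics, the per-iteration update it performs is standard batch gradient descent on the logistic loss (our summary, not from the article, with points (x_i, y_i) and labels y_i in {-1, +1}): the map/reduce inside the loop computes the gradient sum below, and w -= gradient applies the update with a unit step size.

L(w) = \sum_i \log\left(1 + e^{-y_i\, w^{\top} x_i}\right), \qquad
\nabla L(w) = \sum_i \left(\frac{1}{1 + e^{-y_i\, w^{\top} x_i}} - 1\right) y_i\, x_i, \qquad
w \leftarrow w - \nabla L(w)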
Integration with storage systems. Much like Google's MapReduce, Spark is designed to be used with multiple external systems for persistent storage. Spark is most commonly used with cluster file systems like HDFS and key-value stores like S3 and Cassandra. It can also connect with Apache Hive as a data catalog. RDDs usually store only temporary data within an application, though some applications (such as the Spark SQL JDBC server) also share RDDs across multiple users [2]. Spark's design as a storage-system-agnostic engine makes it easy for users to run computations against existing data and join diverse data sources.
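Because the engine is agnostic to where data lives, switching storage systems is largely a matter of changing the input URI; the sketch below is our illustration with placeholder paths (the s3a scheme additionally assumes the Hadoop S3A connector is on the classpath) showing one computation that reads from HDFS and S3 together.

// The same RDD operations work regardless of which storage system backs the data.
val fromHdfs = sc.textFile("hdfs://namenode/logs/2016/*")     // placeholder HDFS path
val fromS3   = sc.textFile("s3a://archive-bucket/logs/2015/") // placeholder S3 path
println(fromHdfs.union(fromS3).count())                       // one job over both sources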

Higher-Level Libraries
The RDD programming model provides only distributed collections of objects and functions to run on them. Using RDDs, however, we have built a variety of higher-level libraries on Spark, targeting many of the use cases of specialized computing engines. The key idea is that if we control the data structures stored inside RDDs, the partitioning of data across nodes, and the functions run on them, we can implement many of the execution techniques in other engines. Indeed, as we show in this section, these libraries often achieve state-of-the-art performance on each task while offering significant benefits when users combine them. We now discuss the four main libraries included with Apache Spark.

SQL and DataFrames. One of the most common data processing paradigms is relational queries. Spark SQL [2] and its predecessor, Shark [23], implement such queries on Spark, using techniques similar to analytical databases. For example, these systems support columnar storage, cost-based optimization, and code generation for query execution. The main idea behind these systems is to use the same data layout as analytical databases (compressed columnar storage) inside RDDs. In Spark SQL, each record in an RDD holds a series of rows stored in binary format, and the system generates code to run directly against this layout.

Beyond running SQL queries, we have used the Spark SQL engine to provide a higher-level abstraction for basic data transformations called DataFrames [2], which are RDDs of records with a known schema. DataFrames are a common abstraction for tabular data in R and Python, with programmatic methods for filtering, computing new columns, and aggregation. In Spark, these operations map down to the Spark SQL engine and receive all its optimizations. We discuss DataFrames more later.

One technique not yet implemented in Spark SQL is indexing, though other libraries over Spark (such as IndexedRDDs [3]) do use it.
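To give a feel for the DataFrame operations just described (filtering, computing new columns, and aggregation), here is a brief sketch; the column names, input path, and session setup are placeholders of ours rather than anything taken from the article, and each call maps down to the Spark SQL engine as described above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Placeholder session and input; any tabular source with a known schema works.
val session = SparkSession.builder().appName("tweets-example").getOrCreate()
val tweets  = session.read.json("hdfs://.../tweets.json")

val summary = tweets
  .filter(col("lang") === "en")                      // filtering
  .withColumn("textLength", length(col("text")))     // computing a new column
  .groupBy(col("location"))                          // aggregation
  .agg(avg(col("textLength")).as("avgTextLength"))

summary.show()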
Spark Streaming. Spark Streaming [26] implements incremental stream processing using a model called "discretized streams." To implement streaming over Spark, we split the input data into small batches (such as every 200 milliseconds) that we regularly combine with state stored inside RDDs to produce new results. Running streaming computations this way has several benefits over traditional distributed streaming systems. For example, fault recovery is less expensive due to using lineage, and it is possible to combine streaming with batch and interactive queries.

GraphX. GraphX [6] provides a graph computation interface similar to Pregel and GraphLab [10, 11], implementing the same placement optimizations as these systems (such as vertex partitioning schemes) through its choice of partitioning function for the RDDs it builds.

MLlib. MLlib [14], Spark's machine learning library, implements more than 50 common algorithms for distributed model training. For example, it includes the common distributed algorithms of decision trees (PLANET), Latent Dirichlet Allocation, and Alternating Least Squares matrix factorization.

Combining processing tasks. Spark's libraries all operate on RDDs as the data abstraction, making them easy to combine in applications. For example, Figure 5 shows a program that reads some historical Twitter data using Spark SQL, trains a K-means clustering model using MLlib, and then applies the model to a new stream of tweets. The data types returned by each library (here the historic tweet RDD and the K-means model) are easily passed to other libraries. Apart from compatibility at the API level, composition in Spark is also efficient at the execution level, because Spark can optimize across processing libraries. For example, if one library runs a map function and the next library runs a map on its result, Spark will fuse these operations into a single map. Likewise, Spark's fault recovery works seamlessly across these libraries, recomputing lost data no matter which libraries produced it.

Figure 5. Example combining the SQL, machine learning, and streaming libraries in Spark.

// Load historical data as an RDD using Spark SQL
val trainingData = sql("SELECT location, language FROM old_tweets")

// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)

// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))

Performance. Given that these libraries run over the same engine, do they lose performance? We found that by implementing the optimizations we just outlined within RDDs, we can often match the performance of specialized engines. For example, Figure 6 compares Spark's performance on three simple tasks (a SQL query, streaming word count, and Alternating Least Squares matrix factorization) versus other engines. While the results vary across workloads, Spark is generally comparable with specialized systems like Storm, GraphLab, and Impala. (One area in which other designs have outperformed Spark is certain graph computations [12, 16]. However, these results are for algorithms with low ratios of computation to communication, such as PageRank, where the latency from synchronized communication in Spark is significant. In applications with more computation, such as the ALS algorithm, distributing the application on Spark still helps.) For stream processing, although we show results from a distributed implementation on Storm, the per-node throughput is also comparable to commercial streaming engines like Oracle CEP [26].

Figure 6. Comparing Spark's performance with several widely used specialized systems for SQL, streaming, and machine learning. Data is from Zaharia [24] (SQL query and streaming word count) and Sparks et al. [17] (alternating least squares matrix factorization). (The panels compare response time in seconds on a SQL query against Impala and Redshift, streaming word-count throughput in records/s against Storm, and response time in hours for ALS against Mahout, GraphLab, and MATLAB.)

Even in highly competitive benchmarks, we have achieved state-of-the-art performance using Apache Spark. In 2014, we entered the Daytona GraySort benchmark (http://sortbenchmark.org/) involving sorting 100TB of data on disk, and tied for a new record with a specialized system built only for sorting on a similar number of machines. As in the other examples, this was possible because we could implement both the communication and CPU optimizations necessary for large-scale sorting inside the RDD model.

Applications
Apache Spark is used in a wide range of applications. Our surveys of Spark users have identified more than 1,000 companies using Spark.

... making applications. Published use cases for Spark Streaming include network security monitoring at Cisco, prescriptive analytics at Samsung SDS, and log mining at Netflix. Many of these applications also combine streaming with batch and interactive queries. For example, video company Conviva uses Spark to continuously maintain a model of content distribution server performance, querying it automatically when it moves clients ...
