Machine Learning On Spark - UC Berkeley AMP Camp


Machine Learning on Spark
Shivaram Venkataraman, UC Berkeley

Machine learning: at the intersection of Computer Science and Statistics

Machine learning applications: spam filters, click prediction, recommendations, search ranking

Machine learning techniques: classification, regression, clustering, active learning, collaborative filtering

Implementing Machine Learning
§ Machine learning algorithms are
- Complex, multi-stage
- Iterative
§ MapReduce/Hadoop unsuitable
§ Need efficient primitives for data sharing

Machine Learning using Spark
§ Spark RDDs → efficient data sharing
§ In-memory caching accelerates performance
- Up to 20x faster than Hadoop
§ Easy-to-use high-level programming interface
- Express complex algorithms in ~100 lines

K-Means Clustering using Spark
Focus: implementation and performance

Clustering
Grouping data according to similarity
E.g. an archaeological dig (finds plotted as Distance East vs. Distance North)

K-Means Algorithm
Benefits: popular, fast, conceptually straightforward

K-Means: preliminaries
Data: collection of values (feature vectors)
data = lines.map(line => parseVector(line))
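The `parseVector` helper on the slide is not shown; a minimal plain-Python sketch of what it presumably does (the name `parse_vector` and the whitespace-separated input format are assumptions):

```python
def parse_vector(line):
    """Parse a whitespace-separated line of numbers into a list of floats."""
    return [float(x) for x in line.split()]

# Plain-list analogue of: data = lines.map(line => parseVector(line))
lines = ["1.0 2.0", "3.5 4.5"]
data = [parse_vector(line) for line in lines]
```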

K-Means: preliminaries
Dissimilarity: squared Euclidean distance
dist = p.squaredDist(q)
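The squared Euclidean distance used here is simple to write out; a plain-Python analogue of `p.squaredDist(q)` (the function name is an assumption):

```python
def squared_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors:
    sum of (p_i - q_i)^2 over all components."""
    return sum((a - b) ** 2 for a, b in zip(p, q))
```

Using the squared distance avoids a square root per comparison; the nearest center under squared distance is the same as under true Euclidean distance.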

K-Means: preliminaries
K = number of clusters
Data assignments to clusters: S1, S2, . . ., SK

K-Means Algorithm
Initialize K cluster centers
Repeat until convergence:
- Assign each data point to the cluster with the closest center
- Assign each cluster center to be the mean of its cluster’s data points

K-Means Algorithm
Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
Repeat until convergence:
- Assign each data point to the cluster with the closest center
- Assign each cluster center to be the mean of its cluster’s data points
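`takeSample(false, K, seed)` draws K elements without replacement, seeded for reproducibility; a plain-Python analogue (the helper name `init_centers` is an assumption):

```python
import random

def init_centers(data, k, seed):
    """Pick K distinct points as initial centers, mirroring
    takeSample(false, K, seed): sampling without replacement, seeded."""
    return random.Random(seed).sample(data, k)

points = [[0.0], [1.0], [2.0], [3.0]]
centers = init_centers(points, 2, seed=42)
```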

K-Means Algorithm
Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
Repeat until convergence:
- Assign each data point to the cluster with the closest center:
closest = data.map(p => (closestPoint(p, centers), p))
- Assign each cluster center to be the mean of its cluster’s data points
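The `closestPoint` helper is assumed to return the index of the nearest center; a plain-Python sketch of the map step (the name `closest_point` is an assumption):

```python
def closest_point(p, centers):
    """Index of the center nearest to p under squared Euclidean distance."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sq(p, centers[i]))

# Plain-list analogue of: closest = data.map(p => (closestPoint(p, centers), p))
centers = [[0.0, 0.0], [10.0, 10.0]]
data = [[1.0, 1.0], [9.0, 8.0]]
closest = [(closest_point(p, centers), p) for p in data]
```

Keying each point by its cluster index is what lets the next step group points per cluster.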

K-Means Algorithm
Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
Repeat until convergence:
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()

K-Means Algorithm
Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
Repeat until convergence:
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))
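The `groupByKey` plus `mapValues(average)` pair can be mirrored without Spark; a plain-Python sketch (the `average` helper is an assumption, since the slide does not define it):

```python
from collections import defaultdict

def average(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

closest = [(0, [1.0, 1.0]), (0, [3.0, 3.0]), (1, [10.0, 8.0])]

# groupByKey: collect the points assigned to each cluster index
points_group = defaultdict(list)
for idx, p in closest:
    points_group[idx].append(p)

# mapValues(average): the new center is the mean of each group
new_centers = {idx: average(ps) for idx, ps in points_group.items()}
```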

K-Means Algorithm
Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
Repeat until convergence: while(dist(centers, newCenters) > ɛ)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))

K-Means Source
centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters.map( )
}
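The whole loop above can be mirrored in plain Python (no Spark) as a sketch of the algorithm; the `average` and `squared_dist` helpers and the keep-old-center fallback for an emptied cluster are assumptions not spelled out on the slide:

```python
import random

def squared_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(data, k, seed=0, eps=1e-6):
    """Plain-Python k-means mirroring the Spark source above."""
    centers = random.Random(seed).sample(data, k)
    d = float("inf")
    while d > eps:
        # Assign each point to its closest center (the map step)
        groups = {}
        for p in data:
            i = min(range(k), key=lambda j: squared_dist(p, centers[j]))
            groups.setdefault(i, []).append(p)
        # Recompute each center as the mean of its points
        # (groupByKey + mapValues(average)); keep the old center
        # if a cluster received no points (an assumption).
        new_centers = [average(groups[i]) if i in groups else centers[i]
                       for i in range(k)]
        # Total center movement decides convergence
        d = sum(squared_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
    return centers
```

On two well-separated 1-D clumps such as `[[0.0], [0.2], [10.0], [10.2]]` with `k=2`, the centers settle at the two clump means regardless of which points are sampled first.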

Ease of use
§ Interactive shell: useful for featurization and pre-processing data
§ Lines of code for K-Means:
- Spark: 90 lines (part of the hands-on tutorial!)
- Hadoop/Mahout: 4 files, 300 lines

Performance
[Figure: iteration time (s) for Logistic Regression and K-Means vs. number of machines (25, 50, 100), comparing Hadoop, HadoopBinMem, and Spark; Zaharia et al.]

Conclusion
§ Spark: framework for cluster computing
§ Fast and easy machine learning programs
§ K-Means clustering using Spark
§ Hands-on exercise this afternoon!
Examples and more: www.spark-project.org
