Scala And The JVM For Big Data: Lessons From Spark


Scala and the JVM for Big Data: Lessons from Spark. Dean Wampler, lightbend.com, @deanwampler. Copyright 2014-2019, All Rights Reserved.

Spark

A Distributed Computing Engine on the JVM

[Diagram: a cluster of nodes, each node holding one partition of a Resilient Distributed Dataset (RDD).]

Productivity? Very concise, elegant, functional APIs: Scala, Java, Python, R, and SQL!

Productivity? Interactive shell (REPL): Scala, Python, R, and SQL.

Notebooks: Jupyter, Spark Notebook, Zeppelin, Beaker, Databricks.


Example: Inverted Index

[Diagram: Web Crawl: crawled pages such as wikipedia.org/hadoop ("Hadoop provides MapReduce and HDFS ...") and wikipedia.org/hbase ("HBase stores data in HDFS ...") are written as (path, contents) records to an index.]

[Diagram: Compute Inverted Index: the crawl index of (path, contents) records is transformed ("Miracle!!") into an inverse index of (word, (path, count)) records, e.g. (hive, (./hive, 1), ...).]

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inverted Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                              // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word, id), 1))   // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word, id), n) => (word, (id, n))
  }.groupByKey.
  mapValues { seq =>
    sortByCount(seq)   // Sort the value seq by count, descending (sortByCount not shown).
  }.saveAsTextFile("/path/to/output")

The same code, stepped through on the following slides with the intermediate RDD types called out:
After textFile: RDD[String], e.g. "./hadoop, Hadoop provides ..."
After flatMap and reduceByKey: RDD[((String, String), Int)], e.g. ((Hadoop, ./hadoop), 20)
After groupByKey: RDD[(String, Iterable[(String, Int)])], e.g. (Hadoop, Seq((./hadoop, 20), ...))

Productivity? Intuitive API: a dataflow of steps (textFile, map, flatMap, reduceByKey, groupByKey, saveAsTextFile), inspired by the Scala collections and functional programming.

Performance? Lazy API: combines steps into "stages" and can cache intermediate data.
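To make the laziness concrete, here is a minimal sketch (the input path and app name are placeholders, not from the talk): transformations only record the dataflow, nothing runs until an action such as count or take, and cache() keeps the intermediate RDD in memory so the second action reuses it.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo"))   // master set via spark-submit

// Transformations: nothing executes yet, Spark only builds the lineage of stages.
val counts = sc.textFile("/path/to/input")
  .flatMap(line => line.split("""\W+"""))
  .map(word => (word.toLowerCase, 1))
  .reduceByKey(_ + _)
  .cache()                       // keep this intermediate RDD in memory once materialized

// Actions trigger execution; the second action reuses the cached data.
println(counts.count())
counts.take(10).foreach(println)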


Higher-Level APIs

SQL: Datasets/DataFrames

Example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Queries").getOrCreate()

val flights = spark.read.parquet("./flights")
val planes  = spark.read.parquet("./planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 = flights.join(planes,
  flights("tailNum") === planes("tailNum")).limit(100)

The next slides repeat this example to highlight two points. First, both the SQL query and the fluent join return another Dataset. Second, the join condition flights("tailNum") === planes("tailNum") is not an "arbitrary" anonymous function, but a "Column" instance, which Spark can analyze and optimize.
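A quick way to convince yourself that both forms land in the same optimizer (this check is not part of the original slides) is to ask each Dataset for its plans:

// Both the SQL string and the Column-based join go through the Catalyst optimizer.
planes_for_flights1.explain(true)   // prints parsed, analyzed, optimized, and physical plans
planes_for_flights2.explain(true)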

Performance? The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!

join(right: Dataset[_], joinExprs: Column): DataFrame
groupBy(cols: Column*): RelationalGroupedDataset
orderBy(sortExprs: Column*): Dataset[T]
select(cols: Column*): Dataset[...]
where(condition: Column): Dataset[T]
limit(n: Int): Dataset[T]
intersect(other: Dataset[T]): Dataset[T]
sample(withReplacement: Boolean, fraction, seed)
drop(col: Column): DataFrame
map[U](f: T => U): Dataset[U]
flatMap[U](f: T => Traversable[U]): Dataset[U]
foreach(f: T => Unit): Unit
take(n: Int): Array[Row]
count(): Long
distinct(): Dataset[T]
agg(exprs: Map[String, String]): DataFrame
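As a small illustration of chaining a few of these methods on the flights DataFrame from the earlier example (the column names here are hypothetical, not taken from the talk's data):

import org.apache.spark.sql.functions._

// Delayed-flight counts per carrier, expressed entirely with Column expressions.
val delayedByCarrier = flights
  .where(col("arrDelay") > 15)
  .groupBy(col("carrier"))
  .agg(count("*").as("numDelayed"))
  .orderBy(col("numDelayed").desc)

delayedByCarrier.take(5).foreach(println)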


Structured Streaming

[Diagram: DStream (discretized stream): events arriving over time are batched into RDDs (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD), with sliding windows of 3 RDD batches (#1, #2).]
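The diagram describes the DStream (Spark Streaming) model; a sketch of windowing in that API, with placeholder host, port, and durations (not from the talk):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("windowed-counts")
val ssc  = new StreamingContext(conf, Seconds(10))    // one RDD per 10-second batch

val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source

// Word counts over a sliding window of 3 batches (30s), sliding one batch at a time.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()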

ML/MLlib

K-Means. Machine Learning requires: iterative training of models, and good linear algebra performance.
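For example, a hedged sketch with Spark's spark.ml API (it assumes a DataFrame called dataset that already has a "features" vector column; the names and parameters are illustrative, not from the talk):

import org.apache.spark.ml.clustering.KMeans

// Iteratively fit k = 3 cluster centers; each iteration is a distributed linear-algebra pass.
val kmeans = new KMeans().setK(3).setMaxIter(20).setSeed(1L)
val model  = kmeans.fit(dataset)

model.clusterCenters.foreach(println)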

GraphX

PageRank. Graph algorithms require: incremental traversal, and efficient edge and node representations.
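As a sketch of what that looks like in GraphX (assuming a SparkContext sc; the edge-list path is a placeholder):

import org.apache.spark.graphx.GraphLoader

// Load a graph from a "srcId dstId" edge list and run PageRank to convergence.
val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
val ranks = graph.pageRank(0.0001).vertices   // (vertexId, rank) pairs

ranks.sortBy(_._2, ascending = false).take(10).foreach(println)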

Foundation: The JVM

20 years of DevOps. Lots of Java devs.

Tools and Libraries: Akka, Breeze, Algebird, Spire & Cats, Axle, ...

Big Data Ecosystem

But it's not perfect.

Richer data libraries in Python & R.

Garbage Collection

GC Challenges: Typical Spark heaps are 10s-100s of GB, uncommon for "generic", non-data services.

GC Challenges: Too many cached RDDs leads to huge old-generation garbage. Billions of objects means long GC pauses.

Tuning GC. Best for Spark: -XX:+UseG1GC, -XX:-ResizePLAB, -Xms..., -Xmx..., -XX:InitiatingHeapOccupancyPercent=..., -XX:ConcGCThreads=... (see the Databricks blog post on tuning Java garbage collection for Spark applications).
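For executors these flags are normally passed through Spark's configuration rather than directly on the JVM command line; a sketch with placeholder values (tune them for your own heap and workload):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("gc-tuned-job")
  .set("spark.executor.memory", "32g")   // executor heap size is set here, not via -Xms/-Xmx
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20")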

JVM Object Model

Java Objects? "abcd": 4 bytes for raw UTF8, right? In fact it takes 48 bytes for the Java object: a 12-byte header, 8 bytes for the hash code, 20 bytes of array overhead, and 8 bytes for the UTF-16 chars.
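You can check this kind of overhead yourself with the SizeEstimator utility that shows up later in this talk; the exact numbers depend on the JVM version and flags such as compressed oops.

import org.apache.spark.util.SizeEstimator

println(SizeEstimator.estimate("abcd"))                    // one small String: tens of bytes
println(SizeEstimator.estimate(Array.fill(1000)("abcd")))  // an array of references to it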

[Diagram: Arrays: val myArray: Array[String]; each slot holds a reference to a separately allocated String object ("zeroth", ...).]

[Diagram: Class instances: val person: Person, with fields name: String ("Buck Trends"), age: Int (29), and addr: Address, the references pointing to other heap objects.]

[Diagram: Hash Maps: an array of hash-code buckets (h/c1 ... h/c4) whose entries point to separately allocated key and value objects ("a key", "a value").]

Improving Performance. Why obsess about this? Spark jobs are CPU bound: Improve network I/O? 2% better. Improve disk I/O? 20% better.

What changed? Faster hardware (compared to 2000): 10Gb/s networks, SSDs.

What changed? Smarter use of I/O: pruning unneeded data sooner, caching more effectively, efficient formats like Parquet.

What changed? But more CPU use today: more serialization, more compression, more hashing (joins, group-bys).

Improving Performance. To improve performance, we need to focus on the CPU: better algorithms, sure, but also optimizing the use of memory.

Project Tungsten: an initiative to greatly improve Dataset/DataFrame performance.

Goals

Reduce References: fewer, bigger objects to GC, and fewer cache misses. [Diagram: the same Array[String], Person instance, and hash map layouts shown earlier, with their many small objects and reference hops highlighted as the overhead to eliminate.]

Less Expression Overhead. sql("SELECT a + b FROM table"). Evaluating expressions billions of times incurs: virtual function calls, boxing/unboxing, branching (if statements, etc.).

Implementation

Object Encoding. New CompactRow type: a null bit set (1 bit/field), values (8 bytes/field), and variable-length data, with the fixed-width slots holding offsets to the variable-length data. Compute hashCode and equals on the raw bytes.

[Diagram: Compare the val person: Person object graph (name: String "Buck Trends", age: Int 29, addr: Address, each reference pointing to another heap object) with the compact row layout: null bit set (1 bit/field), values (8 bytes/field), offsets to variable-length data.]

[Diagram: BytesToBytesMap: hash codes (h/c1 ... h/c4) index into a Tungsten memory page where the keys and values (k1, v1, k2, v2, ...) are packed contiguously. Compare with the standard hash map, where each key ("a key") and value ("a value") is a separately allocated object reached through references.]

Memory Management: some allocations are off heap, via sun.misc.Unsafe.
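A rough sketch of what off-heap allocation through sun.misc.Unsafe looks like (illustrative only, not Spark's actual internals; outside the JDK the instance has to be obtained reflectively):

import sun.misc.Unsafe

// Unsafe.getUnsafe() is restricted to bootstrap classes, so fetch the singleton via reflection.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val address = unsafe.allocateMemory(16)   // 16 bytes outside the garbage-collected heap
unsafe.putLong(address, 42L)
println(unsafe.getLong(address))
unsafe.freeMemory(address)                // manual lifetime management, no GC involvement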

Less Expression Overhead. sql("SELECT a + b FROM table"). Solution: generate custom byte code. Spark 1.X: for subexpressions. Spark 2.0: for whole queries.


No Value Types (planned for Java 9 or 10).

case class Timestamp(epochMillis: Long) {
  override def toString: String = { ... }
  def add(delta: TimeDelta): Timestamp = { /* return new shifted time */ }
  ...
}

Don't allocate on the heap; just push the primitive Long on the stack. (scalac does this now.)
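The Scala feature alluded to here is value classes: a sketch (simplified to a Long delta instead of the TimeDelta type on the slide) where scalac erases the wrapper and passes the underlying Long directly in most call sites.

// A value class: exactly one val parameter, extends AnyVal.
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = Timestamp(epochMillis + deltaMillis)
}

val t = Timestamp(1514764800000L).add(60 * 1000)   // typically no Timestamp object is allocated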

Long operations aren't atomic, according to the JVM spec.

No Unsigned Types. What's factorial(-1)?

Arrays Indexed with Ints: byte arrays are limited to 2GB!

scala> val N = 1100*1000*1000
N: Int = 1100000000   // 1.1 billion
scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)
scala> import org.apache.spark.util.SizeEstimator
scala> SizeEstimator.estimate(array)
res3: Long = 2200000016   // 2.2GB

scala> val b = sc.broadcast(array)
b: org.apache.spark.broadcast.Broadcast[Array[Short]] = ...
scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until 100000).
         map(i => b.value(i))
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...

But wait. I actually lied to you.

Spark handles large broadcast variables by breaking them into blocks.

Scala REPL

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  at java.io.ByteArrayOutputStream.write(...)
  at java.io.ObjectOutputStream.writeObject(...)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(...)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(...)
  at org.apache.spark.rdd.RDD.map(...)

Reading the trace from the bottom up: we pass the closure i => b.value(i) to RDD.map, which "cleans" the closure, i.e. checks that it is serializable, which it does by serializing it to a byte array, which requires copying an array. What array? val array = Array.fill[Short](N)(0).

Why did this happen?

You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).map(i => b.value(i))

Scala compiles this REPL session into nested wrapper objects:

class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}

So this closure over "b" sucks in the whole wrapper object, including the original array!

Lightbend is investigating re-engineering the REPL.

Workarounds

Transient is often all you need:
scala> @transient val array = Array.fill[Short](N)(0)
scala> ...

object Data {   // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB                                    // local ref!
    val rdd = sc.parallelize(...).map(i => b.value(i))   // only needs b
    rdd.take(10).foreach(println)
  }
}

Why Scala? See the longer version of this talk at polyglotprogramming.com/talks

polyglotprogramming.com/talks

lightbend.com, @deanwampler. Questions?

Bonus Material: You can find an extended version of this talk with more details at polyglotprogramming.com/talks

