Spark and Spark SQL - Department of Computer Science, University of Oxford


Spark and Spark SQL
Amir H. Payberah (amir@sics.se), SICS Swedish ICT
June 29, 2016

What is Big Data?

"Big Data ... everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." - Dan Ariely

"Big data is the data characterized by 4 key attributes: volume, variety, velocity and value." - Oracle


Big Data in Simple Words

The Four Dimensions of Big Data
- Volume: data size
- Velocity: data generation rate
- Variety: data heterogeneity
- The 4th V vacillates between Veracity, Variability, and Value

How to Store and Process Big Data?

Scale Up vs. Scale Out


The Big Data Stack

Data Analysis

Programming Languages

Platform - Data Processing

Platform - Data Storage

Resource Management

Spark Processing Engine

Why Spark?


Motivation (1/4)
- Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
- Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
- E.g., MapReduce

Motivation (2/4)
- The MapReduce programming model has not been designed for complex operations, e.g., data mining.

Motivation (3/4)
- MapReduce is very expensive (slow): it always goes to disk and HDFS.

Motivation (4/4)
- Spark extends MapReduce with more operators.
- Support for advanced data flow graphs.
- In-memory and out-of-core processing.


Spark vs. MapReduce (1/2)


Spark vs. MapReduce (2/2)


Challenge: How to design a distributed memory abstraction that is both fault tolerant and efficient?

Solution: Resilient Distributed Datasets (RDD)

Resilient Distributed Datasets (RDD) (1/2)
- A distributed memory abstraction.
- Immutable collections of objects spread across a cluster, like a LinkedList<MyObjects>.


Resilient Distributed Datasets (RDD) (2/2)
- An RDD is divided into a number of partitions, which are atomic pieces of information (see the sketch below).
- Partitions of an RDD can be stored on different nodes of a cluster.
- Built through coarse-grained transformations, e.g., map, filter, join.
- Fault tolerance via automatic rebuild (no replication).
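As a small illustration (not from the slides, and assuming an existing SparkContext named sc), the number of partitions can be chosen when an RDD is created and inspected afterwards:

// Create an RDD backed by 8 partitions; the data and partition count are arbitrary.
val nums = sc.parallelize(1 to 1000, numSlices = 8)
// Each partition can be stored on a different node; here we only count them.
println(nums.partitions.length) // 8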

Programming Model


Spark Programming Model
- A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs.
- Operators are higher-order functions that execute user-defined functions in parallel.
- Two types of RDD operators: transformations and actions.

RDD Operators (1/2)
- Transformations: lazy operators that create new RDDs.
- Actions: launch a computation and return a value to the program or write data to external storage (a small sketch follows).
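A minimal sketch of the difference, assuming an existing SparkContext sc: the transformation only records what to do, while the action triggers the computation.

val lines = sc.parallelize(Seq("spark", "sql", "rdd"))
// Transformation: lazily builds a new RDD; nothing runs yet.
val upper = lines.map(_.toUpperCase)
// Action: launches the computation and returns a value to the driver.
println(upper.count()) // 3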

RDD Operators (2/2)


RDD Transformations - Map
- All pairs are independently processed.

// Passing each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // {1, 4, 9}

// Selecting those elements for which func returns true.
val even = squares.filter(_ % 2 == 0) // {4}


RDD Transformations - Reduce
- Pairs with identical key are grouped.
- Groups are independently processed.

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.groupByKey() // {(cat, (1, 2)), (dog, (1))}
pets.reduceByKey((x, y) => x + y) // or pets.reduceByKey(_ + _)
// {(cat, 3), (dog, 1)}


RDD Transformations - Join
- Performs an equi-join on the key.
- Join candidates are independently processed.

val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("h", "Home"), ("a", "About")))
visits.join(pageNames)
// ("h", ("1.2.3.4", "Home"))
// ("h", ("1.3.3.1", "Home"))
// ("a", ("3.4.5.6", "About"))


RDD Transformations - CoGroup
- Groups each input on key.
- Groups with identical keys are processed together.

val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("h", "Home"), ("a", "About")))
visits.cogroup(pageNames)
// ("h", (("1.2.3.4", "1.3.3.1"), ("Home")))
// ("a", (("3.4.5.6"), ("About")))

RDD Transformations - Union and Sample
- Union: merges two RDDs and returns a single RDD using bag semantics, i.e., duplicates are not removed.
- Sample: similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records (sketched below).
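A short sketch of both operators, assuming an existing SparkContext sc; the data are made up.

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))
// Bag semantics: the duplicate 3 is kept.
a.union(b).collect() // Array(1, 2, 3, 3, 4)
// Sample roughly half of the elements without replacement; the stored seed makes re-runs deterministic.
a.union(b).sample(withReplacement = false, fraction = 0.5, seed = 42).collect()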


Basic RDD Actions (1/2)
- Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
- Return an array with the first n elements of the RDD.
nums.take(2) // Array(1, 2)
- Return the number of elements in the RDD.
nums.count() // 3


Basic RDD Actions (2/2)
- Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y) // or nums.reduce(_ + _)
// 6
- Write the elements of the RDD as a text file.
nums.saveAsTextFile("hdfs://file.txt")

SparkContext
- Main entry point to Spark functionality.
- Available in the shell as the variable sc.
- Only one SparkContext may be active per JVM.

// master: the master URL to connect to, e.g.,
// "local", "local[4]", "spark://master:7077"
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)


Creating RDDs
- Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))
- Load a text file from the local FS, HDFS, or S3.
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")

Example 1
val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


Example 2
val textFile = sc.textFile("hdfs://...")
val sics = textFile.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)

// The same count, written more concisely:
val textFile = sc.textFile("hdfs://...")
val count = textFile.filter(_.contains("SICS")).count()

Execution Engine

Spark Programming Interface
- A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster (a minimal example follows).
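For illustration, a minimal, hypothetical driver program with this structure; the application name, master URL, and data are made up.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext and runs parallel operations on the cluster.
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sum = sc.parallelize(1 to 10).reduce(_ + _)
    println(sum) // 55
    sc.stop()
  }
}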

Lineage
- Lineage: the transformations used to build an RDD.
- RDDs are stored as a chain of objects capturing the lineage of each RDD (see the sketch below).

val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)
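As a hedged aside (not in the slides), the lineage recorded for an RDD can be printed with toDebugString; the exact output format depends on the Spark version.

// Prints the chain of RDDs (and their dependencies) used to build ones.
println(ones.toDebugString)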

RDD Dependencies (1/3)
- Two types of dependencies between RDDs: Narrow and Wide.

RDD Dependencies: Narrow (2/3)
- Narrow: each partition of a parent RDD is used by at most one partition of the child RDD.
- Narrow dependencies allow pipelined execution on one cluster node, e.g., a map followed by a filter (sketched below).
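For illustration, and assuming an existing SparkContext sc, a chain with only narrow dependencies; both steps can be pipelined within a single stage on the same partitions.

val nums = sc.parallelize(1 to 100)
// map and filter each consume exactly one parent partition,
// so Spark can pipeline them without moving data between nodes.
val pipelined = nums.map(_ * 2).filter(_ % 3 == 0)
println(pipelined.count())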

RDD Dependencies: Wide (3/3)
- Wide: each partition of a parent RDD is used by multiple partitions of the child RDD (sketched below).
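By contrast, a hedged sketch of a wide dependency, again assuming sc: grouping values by key needs records from all parent partitions, which forces a shuffle.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
// Values for the same key may sit in different parent partitions,
// so reduceByKey introduces a wide (shuffle) dependency.
val sums = pairs.reduceByKey(_ + _)
sums.collect() // Array((a, 4), (b, 2)), order may vary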

Job Scheduling (1/2)
- When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph.
- A stage contains as many pipelined transformations with narrow dependencies as possible.
- The boundaries of a stage: shuffles for wide dependencies, and already computed partitions.

Job Scheduling (2/2)
- The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD.
- Tasks are assigned to machines based on data locality: if a task needs a partition that is available in the memory of a node, the task is sent to that node.

RDD Fault Tolerance
- Logging lineage rather than the actual data.
- No replication.
- Recompute only the lost partitions of an RDD.

Spark SQL

Spark and Spark SQL

DataFrame
- A DataFrame is a distributed collection of rows.
- Homogeneous schema.
- Equivalent to a table in a relational database.

Adding Schema to RDDs
- Spark RDD: functional transformations on partitioned collections of opaque objects.
- SQL DataFrame: declarative transformations on partitioned collections of tuples (compare the sketch below).
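To make the contrast concrete, a hedged sketch of the same filter written both ways, assuming an existing SparkContext sc and SQLContext sqlContext; the Person data are made up.

case class Person(name: String, age: Int)
val peopleRdd = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19)))

// RDD style: a functional transformation; the predicate is an opaque Scala function.
val adultsRdd = peopleRdd.filter(p => p.age > 21)

// DataFrame style: a declarative expression over a named column,
// which Spark SQL can inspect and optimize.
import sqlContext.implicits._
val peopleDf = peopleRdd.toDF()
val adultsDf = peopleDf.filter(peopleDf("age") > 21)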

Creating DataFrames
- The entry point into all functionality in Spark SQL is the SQLContext.
- With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json(...)

DataFrame Operations (1/2)
- Domain-specific language for structured data manipulation.

// Show the content of the DataFrame.
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format.
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column.
df.select("name").show()
// name
// Michael
// Andy
// Justin

DataFrame Operations (2/2)
- Domain-specific language for structured data manipulation.

// Select everybody, but increment the age by 1.
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21.
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age.
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

Running SQL Queries Programmatically
- SQL queries can be run programmatically, returning the result as a DataFrame.
- Use the sql function on a SQLContext.

val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

Converting RDDs into DataFrames
- Inferring the schema using reflection.

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(...).map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext
  .sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

Data Sources
- Spark SQL supports a variety of data sources.
- A DataFrame can be operated on as a normal RDD or registered as a temporary table.
- Registering a DataFrame as a table allows you to run SQL queries over its data (sketched below).
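A short sketch of that workflow, assuming an existing SQLContext sqlContext; the Parquet file name is a placeholder.

// Load a DataFrame from a data source, here a hypothetical Parquet file.
val users = sqlContext.read.parquet("users.parquet")
// Register it as a temporary table so SQL queries can be run over it.
users.registerTempTable("users")
sqlContext.sql("SELECT name FROM users").show()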

Advanced Programming

Shared Variables
- When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
- Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.
- General read-write shared variables across tasks would be inefficient (see the sketch below).
- Two types of shared variables: accumulators and broadcast variables.
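As an illustration not taken from the slides, a sketch of why a plain driver-side variable does not work as shared state: each task gets its own copy of the closure, so updates made on the workers never reach the driver. Accumulators, described next, are the supported pattern for this.

var counter = 0
val data = sc.parallelize(1 to 100)
// Each task increments its own serialized copy of counter, not the driver's variable.
data.foreach(x => counter += 1)
// On a cluster this typically still prints 0: the per-task updates are lost.
println(counter)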

Accumulators (1/2)
- Aggregating values from worker nodes back to the driver program. Example: counting events that occur during job execution.
- Worker code can add to the accumulator with its += method.
- The driver program can access the value by calling the value property on the accumulator.

scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...

scala> accum.value
res2: Int = 10

Accumulators (2/2)
- How many lines of the input file were blank?

val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
// Create an Accumulator[Int] initialized to 0.
val blankLines = sc.accumulator(0)

val callSigns = file.flatMap(line => {
  if (line == "") {
    blankLines += 1 // Add to the accumulator.
  }
  line.split(" ")
})

Broadcast Variables (1/4)
- The broadcast values are sent to each node only once, and should be treated as read-only variables.
- Tasks using a broadcast variable access its value with the value property.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

Broadcast Variables (2/4)

// Load RDD of (URL, name) pairs.
val pageNames = sc.textFile("pages.txt").map(...)
// Load RDD of (URL, visit) pairs.
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.join(pageNames)

Broadcast Variables (3/4)

// Load RDD of (URL, name) pairs.
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
// Load RDD of (URL, visit) pairs.
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (pageMap(v._1), v._2)))

Broadcast Variables (4/4)

// Load RDD of (URL, name) pairs.
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
val bc = sc.broadcast(pageMap)
// Load RDD of (URL, visit) pairs.
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

Summary

- Dataflow programming
- Spark: RDD
- Two types of operations: transformations and actions
- Spark execution engine
- Spark SQL: DataFrame

Questions?

