Spark SQL: Relational Data Processing In Spark

Michael Armbrust†, Reynold S. Xin†, Cheng Lian†, Yin Huai†, Davies Liu†, Joseph K. Bradley†, Xiangrui Meng†, Tomer Kaftan‡, Michael J. Franklin†‡, Ali Ghodsi†, Matei Zaharia†
Databricks Inc. · MIT CSAIL · AMPLab, UC Berkeley

ABSTRACT

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

Categories and Subject Descriptors: H.2 [Database Management]: Systems

Keywords: Databases; Data Warehouse; Machine Learning; Spark; Hadoop

1 Introduction

Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance. As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive, Dremel and Shark [29, 36, 25, 38] all take advantage of declarative queries to provide richer automatic optimizations.

While the popularity of relational systems shows that users often prefer writing declarative queries, the relational approach is insufficient for many big data applications. First, users want to perform ETL to and from various data sources that might be semi- or unstructured, requiring custom code.
Second, users want to perform advanced analytics, such as machine learning and graph processing, that are challenging to express in relational systems. In practice, we have observed that most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms. Unfortunately, these two classes of systems (relational and procedural) have until now remained largely disjoint, forcing users to choose one paradigm or the other.

This paper describes our effort to combine both models in Spark SQL, a major new component in Apache Spark [39]. Spark SQL builds on our earlier SQL-on-Spark effort, called Shark. Rather than forcing users to pick between a relational or a procedural API, however, Spark SQL lets users seamlessly intermix the two.

Spark SQL bridges the gap between the two models through two contributions. First, Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections. This API is similar to the widely used data frame concept in R [32], but evaluates operations lazily so that it can perform relational optimizations. Second, to support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.

The DataFrame API offers rich relational/procedural integration within Spark programs. DataFrames are collections of structured records that can be manipulated using Spark's procedural API, or using new relational APIs that allow richer optimizations. They can be created directly from Spark's built-in distributed collections of Java/Python objects, enabling relational processing in existing Spark programs. Other Spark components, such as the machine learning library, take and produce DataFrames as well. DataFrames are more convenient and more efficient than Spark's procedural API in many common situations. For example, they make it easy to compute multiple aggregates in one pass using a SQL statement, something that is difficult to express in traditional functional APIs. They also automatically store data in a columnar format that is significantly more compact than Java/Python objects. Finally, unlike existing data frame APIs in R and Python, DataFrame operations in Spark SQL go through a relational optimizer, Catalyst.

To support a wide variety of data sources and analytics workloads in Spark SQL, we designed an extensible query optimizer called Catalyst. Catalyst uses features of the Scala programming language, such as pattern-matching, to express composable rules in a Turing-complete language. It offers a general framework for transforming trees, which we use to perform analysis, planning, and runtime code generation. Through this framework, Catalyst can also be extended with new data sources, including semi-structured data such as JSON and "smart" data stores to which one can push filters (e.g., HBase); with user-defined functions; and with user-defined types for domains such as machine learning. Functional languages are known to be well-suited for building compilers [37], so it is perhaps no surprise that they made it easy to build an extensible optimizer. We indeed have found Catalyst effective in enabling us to quickly add capabilities to Spark SQL, and since its release we have seen external contributors easily add them as well.

Spark SQL was released in May 2014, and is now one of the most actively developed components in Spark. As of this writing, Apache Spark is the most active open source project for big data processing, with over 400 contributors in the past year. Spark SQL has already been deployed in very large scale environments. For example, a large Internet company uses Spark SQL to build data pipelines and run queries on an 8000-node cluster with over 100 PB of data. Each individual query regularly operates on tens of terabytes. In addition, many users adopt Spark SQL not just for SQL queries, but in programs that combine it with procedural processing. For example, 2/3 of customers of Databricks Cloud, a hosted service running Spark, use Spark SQL within other programming languages. Performance-wise, we find that Spark SQL is competitive with SQL-only systems on Hadoop for relational queries. It is also up to 10× faster and more memory-efficient than naive Spark code in computations expressible in SQL.

More generally, we see Spark SQL as an important evolution of the core Spark API. While Spark's original functional programming API was quite general, it offered only limited opportunities for automatic optimization. Spark SQL simultaneously makes Spark accessible to more users and improves optimizations for existing ones. Within Spark, the community is now incorporating Spark SQL into more APIs: DataFrames are the standard data representation in a new "ML pipeline" API for machine learning, and we hope to expand this to other components, such as GraphX and streaming.

We start this paper with a background on Spark and the goals of Spark SQL (§2). We then describe the DataFrame API (§3), the Catalyst optimizer (§4), and advanced features we have built on Catalyst (§5). We evaluate Spark SQL in §6. We describe external research built on Catalyst in §7. Finally, §8 covers related work.

2 Background and Goals

2.1 Spark Overview

Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning [6]. Released in 2010, it is to our knowledge one of the most widely-used systems with a "language-integrated" API similar to DryadLINQ [20], and the most active open source project for big data processing. Spark had over 400 contributors in 2014, and is packaged by multiple vendors.

Spark offers a functional programming API similar to other recent systems [20, 11], where users manipulate distributed collections called Resilient Distributed Datasets (RDDs) [39]. Each RDD is a collection of Java or Python objects partitioned across a cluster. RDDs can be manipulated through operations like map, filter, and reduce, which take functions in the programming language and ship them to nodes on the cluster.
For example, the Scala code below counts lines starting with "ERROR" in a text file:

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(s => s.contains("ERROR"))
    println(errors.count())

This code creates an RDD of strings called lines by reading an HDFS file, then transforms it using filter to obtain another RDD, errors. It then performs a count on this data.

RDDs are fault-tolerant, in that the system can recover lost data using the lineage graph of the RDDs (by rerunning operations such as the filter above to rebuild missing partitions). They can also explicitly be cached in memory or on disk to support iteration [39].

One final note about the API is that RDDs are evaluated lazily. Each RDD represents a "logical plan" to compute a dataset, but Spark waits until certain output operations, such as count, to launch a computation. This allows the engine to do some simple query optimization, such as pipelining operations. For instance, in the example above, Spark will pipeline reading lines from the HDFS file with applying the filter and computing a running count, so that it never needs to materialize the intermediate lines and errors results. While such optimization is extremely useful, it is also limited because the engine does not understand the structure of the data in RDDs (which is arbitrary Java/Python objects) or the semantics of user functions (which contain arbitrary code).

2.2 Previous Relational Systems on Spark

Our first effort to build a relational interface on Spark was Shark [38], which modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine. While Shark showed good performance and good opportunities for integration with Spark programs, it had three important challenges. First, Shark could only be used to query external data stored in the Hive catalog, and was thus not useful for relational queries on data inside a Spark program (e.g., on the errors RDD created manually above). Second, the only way to call Shark from Spark programs was to put together a SQL string, which is inconvenient and error-prone to work with in a modular program. Finally, the Hive optimizer was tailored for MapReduce and difficult to extend, making it hard to build new features such as data types for machine learning or support for new data sources.

2.3 Goals for Spark SQL

With the experience from Shark, we wanted to extend relational processing to cover native RDDs in Spark and a much wider range of data sources. We set the following goals for Spark SQL:

1. Support relational processing both within Spark programs (on native RDDs) and on external data sources using a programmer-friendly API.
2. Provide high performance using established DBMS techniques.
3. Easily support new data sources, including semi-structured data and external databases amenable to query federation.
4. Enable extension with advanced analytics algorithms such as graph processing and machine learning.

3 Programming Interface

Spark SQL runs as a library on top of Spark, as shown in Figure 1. It exposes SQL interfaces, which can be accessed through JDBC/ODBC or through a command-line console, as well as the DataFrame API integrated into Spark's supported programming languages. We start by covering the DataFrame API, which lets users intermix procedural and relational code. However, advanced functions can also be exposed in SQL through UDFs, allowing them to be invoked, for example, by business intelligence tools. We discuss UDFs in Section 3.7.
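As a brief preview of that mechanism, the sketch below shows what registering a Scala function as a UDF might look like. It is an illustration rather than an excerpt from the paper: it assumes the udf.register method available on a SQL context in recent Spark releases, and the function name strLen and the people table are invented.

    // Hypothetical sketch: expose a Scala function to SQL as a UDF.
    // Assumes ctx is a SQLContext/HiveContext as in the paper's other examples;
    // "strLen" and the "people" table are invented names.
    ctx.udf.register("strLen", (s: String) => s.length)

    // The registered UDF can then be called from SQL, for example by a BI tool
    // connected over JDBC/ODBC:
    ctx.sql("SELECT name, strLen(name) FROM people")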

[Figure 1: Interfaces to Spark SQL, and interaction with Spark. The figure shows JDBC, the console, and user programs in Java, Scala and Python on top of the DataFrame API and Catalyst optimizer, which run over Spark and its resilient distributed datasets.]

3.1 DataFrame API

The main abstraction in Spark SQL's API is a DataFrame, a distributed collection of rows with a homogeneous schema. A DataFrame is equivalent to a table in a relational database, and can also be manipulated in similar ways to the "native" distributed collections in Spark (RDDs).¹ Unlike RDDs, DataFrames keep track of their schema and support various relational operations that lead to more optimized execution.

DataFrames can be constructed from tables in a system catalog (based on external data sources) or from existing RDDs of native Java/Python objects (Section 3.5). Once constructed, they can be manipulated with various relational operators, such as where and groupBy, which take expressions in a domain-specific language (DSL) similar to data frames in R and Python [32, 30]. Each DataFrame can also be viewed as an RDD of Row objects, allowing users to call procedural Spark APIs such as map.²

Finally, unlike traditional data frame APIs, Spark DataFrames are lazy, in that each DataFrame object represents a logical plan to compute a dataset, but no execution occurs until the user calls a special "output operation" such as save. This enables rich optimization across all operations that were used to build the DataFrame.

To illustrate, the Scala code below defines a DataFrame from a table in Hive, derives another based on it, and prints a result:

    val ctx = new HiveContext()
    val users = ctx.table("users")
    val young = users.where(users("age") < 21)
    println(young.count())

In this code, users and young are DataFrames. The snippet users("age") < 21 is an expression in the data frame DSL, which is captured as an abstract syntax tree rather than representing a Scala function as in the traditional Spark API. Finally, each DataFrame simply represents a logical plan (i.e., read the users table and filter for age < 21). When the user calls count, which is an output operation, Spark SQL builds a physical plan to compute the final result. This might include optimizations such as only scanning the "age" column of the data if its storage format is columnar, or even using an index in the data source to count the matching rows.

We next cover the details of the DataFrame API.

¹ We chose the name DataFrame because it is similar to structured data libraries in R and Python, and designed our API to resemble those.
² These Row objects are constructed on the fly and do not necessarily represent the internal storage format of the data, which is typically columnar.

3.2 Data Model

Spark SQL uses a nested data model based on Hive [19] for tables and DataFrames. It supports all major SQL data types, including boolean, integer, double, decimal, string, date, and timestamp, as well as complex (i.e., non-atomic) data types: structs, arrays, maps and unions. Complex data types can also be nested together to create more powerful types. Unlike many traditional DBMSes, Spark SQL provides first-class support for complex data types in the query language and the API.
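To make the nested data model concrete, the following sketch (not taken from the paper) builds a schema that mixes atomic and complex types using the public org.apache.spark.sql.types API; the field names are invented for the example.

    import org.apache.spark.sql.types._

    // Hypothetical schema combining atomic types with structs, arrays and maps.
    val addressType = StructType(Seq(
      StructField("city", StringType),
      StructField("zip", StringType)))

    val userSchema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType),
      StructField("addresses", ArrayType(addressType)),          // array of structs
      StructField("attributes", MapType(StringType, StringType)) // string-to-string map
    ))

Nested fields defined this way can be referenced directly from both SQL and the DataFrame DSL.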
In addition, Spark SQL also supports user-defined types, as described in Section 4.4.2.

Using this type system, we have been able to accurately model data from a variety of sources and formats, including Hive, relational databases, JSON, and native objects in Java/Scala/Python.

3.3 DataFrame Operations

Users can perform relational operations on DataFrames using a domain-specific language (DSL) similar to R data frames [32] and Python Pandas [30]. DataFrames support all common relational operators, including projection (select), filter (where), join, and aggregations (groupBy). These operators all take expression objects in a limited DSL that lets Spark capture the structure of the expression. For example, the following code computes the number of female employees in each department.

    employees
      .join(dept, employees("deptId") === dept("id"))
      .where(employees("gender") === "female")
      .groupBy(dept("id"), dept("name"))
      .agg(count("name"))

Here, employees is a DataFrame, and employees("deptId") is an expression representing the deptId column. Expression objects have many operators that return new expressions, including the usual comparison operators (e.g., === for equality test, > for greater than) and arithmetic ones (+, -, etc.). They also support aggregates, such as count("name"). All of these operators build up an abstract syntax tree (AST) of the expression, which is then passed to Catalyst for optimization. This is unlike the native Spark API that takes functions containing arbitrary Scala/Java/Python code, which are then opaque to the runtime engine. For a detailed listing of the API, we refer readers to Spark's official documentation [6].

Apart from the relational DSL, DataFrames can be registered as temporary tables in the system catalog and queried using SQL. The code below shows an example:

    users.where(users("age") < 21)
      .registerTempTable("young")
    ctx.sql("SELECT count(*), avg(age) FROM young")

SQL is sometimes convenient for computing multiple aggregates concisely, and also allows programs to expose datasets through JDBC/ODBC. The DataFrames registered in the catalog are still unmaterialized views, so that optimizations can happen across SQL and the original DataFrame expressions. However, DataFrames can also be materialized, as we discuss in Section 3.6.

3.4 DataFrames versus Relational Query Languages

While on the surface, DataFrames provide the same operations as relational query languages like SQL and Pig [29], we found that they can be significantly easier for users to work with thanks to their integration in a full programming language. For example, users can break up their code into Scala, Java or Python functions that pass DataFrames between them to build a logical plan, and will still benefit from optimizations across the whole plan when they run an output operation. Likewise, developers can use control structures like if statements and loops to structure their work. One user said that the DataFrame API is "concise and declarative like SQL, except I can name intermediate results," referring to how it is easier to structure computations and debug intermediate steps.

To simplify programming in DataFrames, we also made the API analyze logical plans eagerly (i.e., to identify whether the column names used in expressions exist in the underlying tables, and whether their data types are appropriate), even though query results are computed lazily. Thus, Spark SQL reports an error as soon as the user types an invalid line of code instead of waiting until execution. This is again easier to work with than a large SQL statement.

3.5 Querying Native Datasets

Real-world pipelines often extract data from heterogeneous sources and run a wide variety of algorithms from different programming libraries.
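As noted in Section 3.1, DataFrames can also be constructed from existing RDDs of native objects. The following hypothetical sketch illustrates the pattern; the Employee case class and its data are invented, and it assumes the createDataFrame method on a SQL context together with a SparkContext named sc, as in the earlier examples.

    // Hypothetical example: turn an RDD of native Scala objects into a DataFrame.
    // The schema is inferred by reflection from the case class fields.
    case class Employee(name: String, deptId: Int, gender: String)

    val rdd = sc.parallelize(Seq(
      Employee("Alice", 1, "female"),
      Employee("Bob", 2, "male")))

    val employees = ctx.createDataFrame(rdd)

    // The resulting DataFrame supports the same relational operators as before.
    employees.where(employees("gender") === "female").count()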
