Spark SQL - Tutorialspoint


About the Tutorial

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations, which includes interactive queries and stream processing. This is a brief tutorial that explains the basics of Spark SQL programming.

Audience

This tutorial has been prepared for professionals aspiring to learn the basics of Big Data analytics using the Spark framework and become a Spark developer. In addition, it would be useful for analytics professionals and ETL developers as well.

Prerequisite

Before proceeding with this tutorial, we assume that you have prior exposure to Scala programming, database concepts, and any of the Linux operating system flavors.

Copyright & Disclaimer

Copyright 2015 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish any contents or a part of contents of this e-book in any manner without written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisite
Copyright & Disclaimer
Table of Contents

1. SPARK SQL – INTRODUCTION
   Apache Spark
   Evolution of Apache Spark
   Features of Apache Spark
   Spark Built on Hadoop
   Components of Spark

2. SPARK SQL – RDD
   Resilient Distributed Datasets
   Data Sharing is Slow in MapReduce
   Iterative Operations on MapReduce
   Interactive Operations on MapReduce
   Data Sharing using Spark RDD
   Iterative Operations on Spark RDD
   Interactive Operations on Spark RDD

3. SPARK SQL – INSTALLATION
   Step 1: Verifying Java Installation
   Step 2: Verifying Scala Installation
   Step 3: Downloading Scala
   Step 4: Installing Scala
   Step 5: Downloading Apache Spark
   Step 6: Installing Spark
   Step 7: Verifying the Spark Installation

4. SPARK SQL – FEATURES AND ARCHITECTURE
   Features of Spark SQL
   Spark SQL Architecture

5. SPARK SQL – DATAFRAMES
   Features of DataFrame
   SQLContext
   DataFrame Operations
   Running SQL Queries Programmatically
   Inferring the Schema using Reflection
   Programmatically Specifying the Schema

6. SPARK SQL – DATA SOURCES
   JSON Datasets
   DataFrame Operations
   Hive Tables
   Parquet Files

1. SPARK SQL – INTRODUCTION

Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of waiting time between queries and waiting time to run the program.

Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.

Contrary to common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in their respective systems, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has the following features:
- Speed: Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.

- Supports multiple languages: Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes up with 80 high-level operators for interactive querying.

- Advanced analytics: Spark not only supports 'map' and 'reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment, as explained below.

- Standalone: Spark standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce will run side by side to cover all Spark jobs on the cluster.

- Hadoop YARN: Hadoop YARN deployment means, simply, Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.

- Spark in MapReduce (SIMR): Spark in MapReduce is used to launch a Spark job in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
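The "advanced analytics" point above can be sketched in a few lines of Scala. This is a minimal, hypothetical example (the sample data and table name are made up for illustration) using the SQLContext-era API that this tutorial targets; it assumes a Spark 1.x distribution is on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process, with no cluster manager required.
    val conf = new SparkConf().setAppName("sql-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables .toDF on RDDs of tuples

    // Build a DataFrame from an in-memory collection and register it
    // so it can be queried with plain SQL.
    val people = sc.parallelize(Seq(("Alice", 29), ("Bob", 35))).toDF("name", "age")
    people.registerTempTable("people")

    // The same engine that runs map/reduce-style jobs answers SQL queries.
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

    sc.stop()
  }
}
```

Because the query runs over data already held in memory, repeated queries against the `people` table avoid the disk round-trips that a MapReduce-based SQL layer would incur.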

End of ebook preview.

