Getting Started With Spark On Theta


Getting Started with Spark on Theta
Xiao-Yong Jin
Oct 3, 2019
ALCF Simulation, Data, and Learning Workshop

Spark or MPI?
- Use MPI or even lower-level APIs to get the absolute best performance
- Use Apache Spark if human time costs more than machine time
- Ease of use: parallelize quickly in Java, Scala, Python, R, and SQL
- Built-in libraries: SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming
Argonne Leadership Computing Facility

PySpark examples on Theta:
/projects/SDL_Workshop/training/GettingStartedWithSparkOnTheta

Getting Started with Spark on Theta (noninteractive)

Run a bundled Spark example:
    /soft/datascience/Spark_Job/submit-spark.sh \
        -A SDL_Workshop -t 10 -n 2 -q training \
        run-example SparkPi

Run your own Spark (Java/Scala) application:
    /soft/datascience/Spark_Job/submit-spark.sh \
        -A SDL_Workshop -t 10 -n 2 -q training \
        --class YOUR.SPARK.APP.CLASS \
        local:///ABSPATH/TO/YOUR/SPARK/APP.jar [EXTRA ARGS ...]

Run a PySpark script:
    /soft/datascience/Spark_Job/submit-spark.sh \
        -A SDL_Workshop -t 10 -n 2 -q training \
        PATH/TO/YOUR/PYSPARK/SCRIPT.py [EXTRA ARGS ...]

Getting Started with Spark on Theta (Jupyter)

thetalogin> /soft/datascience/Spark_Job/submit-spark.sh \
    -A SDL_Workshop -t 60 -n 2 -q training -I
... SPARKJOB_JOBID=325700 ...
# Spark is now running (SPARKJOB_JOBID=325700) on:
#   nid03835
#   nid03836
declare -x SPARK_MASTER_URI="spark://nid03835:7077"
# Spawning bash on host: nid03835

nid03835> export PYSPARK_DRIVER_PYTHON=jupyter
nid03835> export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip nid03835 --port 8008"
nid03835> /soft/datascience/apache_spark/bin/pyspark \
    --master $SPARK_MASTER_URI

local> ssh -L 8008:localhost:8008 theta
theta> ssh -L 8008:nid03835:8008 thetamom1

[Diagram: Spark cluster architecture. The driver on the master node talks to slave/worker nodes over sockets; a Resilient Distributed Dataset (RDD) is distributed to executors and acted upon by tasks.]
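The data flow in the diagram can be sketched in plain Python. This is a hypothetical toy model, not how Spark is actually implemented: a dataset is split into partitions, a "task" processes each partition, and the "driver" reduces the partial results.

```python
# Toy model of an RDD map/reduce flow: partitions are processed by
# per-partition "tasks", then partial results are combined by the driver.
# Illustrative sketch only; Spark's real machinery is far more involved.

def partition(data, n_partitions):
    """Split data round-robin into n_partitions lists (one per executor)."""
    parts = [[] for _ in range(n_partitions)]
    for i, x in enumerate(data):
        parts[i % n_partitions].append(x)
    return parts

def run_task(part, func):
    """A 'task' applies func to every element of one partition."""
    return [func(x) for x in part]

def collect_sum(parts):
    """The 'driver' reduces the per-partition results."""
    return sum(sum(p) for p in parts)

data = list(range(10))
parts = partition(data, n_partitions=4)
mapped = [run_task(p, lambda x: x * x) for p in parts]
total = collect_sum(mapped)  # sum of squares of 0..9
print(total)  # → 285
```

In PySpark the same computation would be `sc.parallelize(range(10)).map(lambda x: x * x).sum()`, with the partitioning and task scheduling handled by the cluster.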

Theta Reminder
[Diagram: you log in to a login node; service/MOM nodes (thetamom1, thetamom2, thetamom3) launch jobs onto compute nodes such as nid03835.]

SPARK_JOB (script for working with COBALT)
- Installed under /soft/datascience/Spark_Job
- Designed to minimize the changes required for deploying on Theta
- Check out the readme file: /soft/datascience/Spark_Job/readme
- Look in the example directory: /soft/datascience/Spark_Job/example
- Under heavy development; guaranteed interface: submit-spark.sh
- For absolute stability, use an explicit version number, e.g.: /soft/datascience/Spark_Job_v1.1.0

Spark_Job [submit-spark.sh] usage

submit-spark.sh [options] [JOBFILE [arguments ...]]

JOBFILE (optional) can be:
    script.py          pyspark scripts
    bin.jar            java binaries
    run-example CLASS  run spark example CLASS
    scripts            other executable scripts (requires -s)

Required options:
    -A PROJECT    Allocation name
    -t WALLTIME   Max run time in minutes
    -n NODES      Job node count
    -q QUEUE      Queue name

Optional options:
    -o OUTPUTDIR  Directory for COBALT output files (default: current dir)
    -s            Enable script mode
    -m            Master uses a separate node
    -p 2|3        Python version (default: 3)
    -I            Start an interactive ssh session
    -w WAITTIME   Time to wait for prompt in minutes (default: 30)
    -h            Print this help message

Environment Variables (Information)

The scripts set a few environment variables for informational purposes and for controlling the behavior. Information (taken from the command line, the job scheduler, and the system):

SPARKJOB_HOST="theta"
SPARKJOB_INTERACTIVE="1"
SPARKJOB_JOBID="242842"
SPARKJOB_PYVERSION="3"
SPARKJOB_SCRIPTMODE="0"
SPARKJOB_SCRIPTS_DIR="/lus/theta-fs0/projects/datascience/xyjin/Spark_Job"
SPARKJOB_SEPARATE_MASTER="0"
SPARKJOB_OUTPUT_DIR="/lus/theta-fs0/projects/datascience/xyjin/Spark_Job/example"
SPARK_MASTER_URI=spark://nid03838:7077
MASTER_HOST=nid03838
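A job script can pick these variables up from the environment. Here is a minimal sketch (variable names from this slide; the defaults chosen below are illustrative assumptions, not documented behavior of Spark_Job):

```python
import os

def sparkjob_settings(environ=os.environ):
    """Collect SPARKJOB_* variables, with assumed defaults for common ones."""
    defaults = {
        "SPARKJOB_HOST": "theta",
        "SPARKJOB_INTERACTIVE": "0",
        "SPARKJOB_PYVERSION": "3",
        "SPARKJOB_SCRIPTMODE": "0",
    }
    settings = dict(defaults)
    for key, value in environ.items():
        if key.startswith("SPARKJOB_"):
            settings[key] = value
    return settings

# Example: pretend the launcher exported these two, as on this slide.
env = {"SPARKJOB_JOBID": "242842", "SPARKJOB_INTERACTIVE": "1"}
s = sparkjob_settings(env)
print(s["SPARKJOB_JOBID"], s["SPARKJOB_INTERACTIVE"], s["SPARKJOB_PYVERSION"])
# → 242842 1 3
```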

Environment Variables (Customizable)

SPARK_HOME="/soft/datascience/apache_spark"
SPARK_CONF_DIR="/lus/theta-fs0/projects/datascience/xyjin/Spark_Job/example/242842/conf"
PYSPARK_PYTHON="…thon"
SPARKJOB_WORKING_DIR="/lus/theta-fs0/projects/datascience/xyjin/Spark_Job/example/242842"
SPARKJOB_WORKING_ENVS="/lus/theta-fs0/projects/datascience/xyjin/Spark_Job/example/242842/envs"
SPARKJOB_DELAY_BASE=15
SPARKJOB_DELAY_MULT=0.125

- The above is the environment set up when running a job under OUTPUTDIR /projects/datascience/xyjin/Spark_Job/example
- The variable SPARKJOB_OUTPUT_DIR contains the directory path, on which SPARKJOB_WORKING_DIR and SPARKJOB_WORKING_ENVS depend
- SPARKJOB_DELAY_BASE and SPARKJOB_DELAY_MULT control how much time in seconds we wait before starting the Spark slave processes
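How the two delay knobs might combine: the slide does not show the exact formula the scripts use, so the sketch below assumes a simple linear stagger (base delay plus a per-worker increment) purely to illustrate the roles of the defaults 15 and 0.125.

```python
def slave_start_delay(worker_index, base=15.0, mult=0.125):
    """Hypothetical staggered delay (seconds) before starting worker i.

    Assumed formula: base + mult * worker_index. The real Spark_Job
    scripts may combine SPARKJOB_DELAY_BASE and SPARKJOB_DELAY_MULT
    differently; this only illustrates what the two knobs control.
    """
    return base + mult * worker_index

delays = [slave_start_delay(i) for i in range(4)]
print(delays)  # → [15.0, 15.125, 15.25, 15.375]
```

Staggering worker start-up like this avoids all slaves hammering the master at the same instant on a large allocation.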

Customizable Variables in env_local.sh

See /soft/datascience/Spark_Job/example/env_local.sh. You can use SPARKJOB_HOST to detect the running system:

if [[ $SPARKJOB_HOST == theta ]]; then
    module rm intelpython36
    module load miniconda-3
    export PYSPARK_PYTHON="$(which python)"
fi

On Cooley, interactive Spark jobs set up an IPython notebook by default. You can change that here, along with setting up the rest of your Python environment:

unset PYSPARK_DRIVER_PYTHON
unset PYSPARK_DRIVER_PYTHON_OPTS

Customizable Variables in env_local.sh

Create a spark-defaults.conf file to affect Spark jobs submitted under the current directory where the file resides, c.f. SPARK_CONF_DIR. The parameters require tuning depending on the machine and workload:

[[ -s $SPARK_CONF_DIR/spark-defaults.conf ]] ||
cat > "$SPARK_CONF_DIR/spark-defaults.conf" <<EOF
spark.driver.extraJavaOptions    -XX:+UseParallelGC -XX:ParallelGCThreads=8
spark.executor.extraJavaOptions  -XX:+UseParallelGC -XX:ParallelGCThreads=8
EOF

Spark on Theta
- Don't run Spark on the MOM node!
- Should the master share one node with the slaves?
- How many workers per node?
- How many executors per worker?
- How many tasks per executor?
- Is thread affinity useful?
- It all depends on your workload.

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

… 28g
spark.driver.extraJavaOptions    -XX:+UseParallelGC -XX:ParallelGCThreads=8
spark.executor.extraJavaOptions  -XX:+UseParallelGC -XX:ParallelGCThreads=8

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

spark.task.cpus 4
spark.rpc.netty.dispatcher.numThreads 48

- The JVM sees 256 cores on each Theta node
- By default, the JVM launches 256 tasks simultaneously if memory allows
- spark.task.cpus makes the JVM count each task as using 4 cores
- spark.rpc.netty.dispatcher.numThreads limits the netty thread pool
- Applies to PySpark applications, too
- More thread tuning: [SPARK-26632][Core] Separate Thread Configurations
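The effect of spark.task.cpus on per-node concurrency is simple arithmetic, shown here with the numbers from this slide (256 cores visible to the JVM on a Theta node):

```python
def max_concurrent_tasks(cores_seen_by_jvm, task_cpus):
    """Tasks that can run at once on one node: cores / cores-per-task."""
    return cores_seen_by_jvm // task_cpus

# Default: each task counts as 1 core, so up to 256 simultaneous tasks.
print(max_concurrent_tasks(256, 1))  # → 256
# With spark.task.cpus 4, each task reserves 4 cores: 64 tasks at once.
print(max_concurrent_tasks(256, 4))  # → 64
```

Fewer, fatter tasks reduce oversubscription of the KNL's hardware threads at the cost of scheduling granularity.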

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

… 4000s
… 1
… 100000

- Wait for resources to come on-line to avoid a performance impact in the beginning
- Depends on your resource type
- Increase if you see related warnings; they happen if you use a large number of nodes

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

…timeout 24000 4000s 12000s 24000s
spark.locality.wait 6000s

- Extra overhead compared to your MPI programs
- Impacts the fault-tolerance capability
- Transferring data over the network incurs a larger overhead than waiting

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

…g 128g

- You absolutely must set these to some large number
- The default of 1g is too small unless you run multiple workers/executors
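When choosing values like 128g, it helps to sanity-check them against the node's physical memory. A small sketch that parses Spark-style size strings, assuming (for illustration only) 192 GB of DDR4 per Theta node:

```python
def parse_size(s):
    """Parse a Spark-style size string like '1g' or '128g' into bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}
    s = s.strip().lower()
    if s[-1] in units:
        return int(float(s[:-1]) * units[s[-1]])
    return int(s)

node_memory = 192 * 1024**3  # assumed per-node DDR4 for this sketch
for setting in ("1g", "128g"):
    b = parse_size(setting)
    print(setting, b, "fits:", b <= node_memory)
```

The 1g default is a tiny fraction of a node; 128g still leaves headroom for the OS, the worker JVM, and Python workers.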

Tuning parameters (spark-defaults.conf)
Tune these numbers for your workload:

spark.driver.extraJavaOptions    -XX:+UseParallelGC -XX:ParallelGCThreads=8
spark.executor.extraJavaOptions  -XX:+UseParallelGC -XX:ParallelGCThreads=8

- Depends on your application
- Tuning GC is another art in itself
- Make sure GC time does not dominate

Access the Web Interface
- Find the driver node ID, nid0NNNN
- Use SSH LocalForward:

ssh -L 8080:localhost:8080 -L 4040:localhost:4040 -t theta \
    ssh -L 8080:nid0NNNN:8080 -L 4040:nid0NNNN:4040 thetamom1

- Go to http://localhost:8080 on your local machine

Other things to consider
- Number of partitions for your RDD
- Point spark.local.dir to the local SSD
- Do not use "Dynamic Allocation" unless you have a strong reason
- Beyond the scope of this presentation: shuffle, other cluster managers, etc.
- Please contact us; we are interested in Spark usage in scientific applications
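A common starting point for the number of RDD partitions is a small multiple of the total core count, so that every core has a few tasks to work through. The factor of 3 below is a general rule of thumb, not an official Theta recommendation:

```python
def suggested_partitions(nodes, cores_per_node, factor=3):
    """Rule-of-thumb partition count: a few tasks per core in the cluster."""
    return nodes * cores_per_node * factor

# e.g. a 2-node job with 64 physical cores per node (Theta KNL):
print(suggested_partitions(2, 64))  # → 384
```

In a PySpark job this would be applied with something like `rdd.repartition(suggested_partitions(nodes, cores))`, then adjusted empirically for your workload.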

Overhead Dominated Weak Scaling (Preliminary)
- https://arxiv.org/abs/1904.11812
- https://github.com/SparkHPC
- Memory-bandwidth-limited operation
- No shuffle, no disk, minimal network

C_i = (1 / (N_block S_block)) Σ_{n=1}^{N_block} Σ_{s=1}^{S_block} V_{n,s,i},  for i ∈ {0, 1, 2}, where V_n is an RDD
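The averaged quantity on this slide can be written out in plain Python. Assuming the formula is a mean of V[n][s][i] over all blocks n and samples s (the slide's transcription is partly garbled), the sketch below uses a tiny made-up V in place of the RDD:

```python
# Sketch: C_i as the mean of V[n][s][i] over N_block blocks and
# S_block samples, for i in {0, 1, 2}. V here is a small made-up
# nested list standing in for the distributed RDD.

N_block, S_block = 2, 3
V = [[[float(n + s + i) for i in range(3)] for s in range(S_block)]
     for n in range(N_block)]

C = [
    sum(V[n][s][i] for n in range(N_block) for s in range(S_block))
    / (N_block * S_block)
    for i in range(3)
]
print(C)  # → [1.5, 2.5, 3.5]
```

In the benchmark itself this reduction runs over a distributed RDD, which is why it stresses memory bandwidth with no shuffle or disk traffic.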

DON'T PANIC
Contact: xjin@anl.gov

