High Performance Distributed Deep Learning - Nvidia


Latest version of the slides can be obtained from
http://www.cse.ohio-state.edu/~panda/s9501.pdf

High Performance Distributed Deep Learning: A Beginner’s Guide
Tutorial at GTC ’19
by
Dhabaleswar K. (DK) Panda, The Ohio State University
  E-mail: panda@cse.ohio-state.edu
  http://www.cse.ohio-state.edu/~panda
Ammar Ahmad Awan, The Ohio State University
  E-mail: awan.10@osu.edu
  http://www.cse.ohio-state.edu/~awan.10
Hari Subramoni, The Ohio State University
  http://www.cse.ohio-state.edu/~subramon

Outline
• Introduction
  – The Past, Present, and Future of Deep Learning
  – What are Deep Neural Networks?
  – Diverse Applications of Deep Learning
  – Deep Learning Frameworks
• Overview of Execution Environments
• Parallel and Distributed DNN Training
• Latest Trends in HPC Technologies
• Challenges in Exploiting HPC Technologies for Deep Learning
• Solutions and Case Studies
• Open Issues and Challenges
• Conclusion
Network Based Computing Laboratory, GTC ’19

Brief History of Deep Learning (DL)
Courtesy: y/

Milestones in the Development of Neural Networks
Courtesy: 23/deep learning 101 part1.html

Understanding the Deep Learning Resurgence
• Deep Learning is a subset of Machine Learning
  – But, it is perhaps the most radical and revolutionary subset
  – Automatic feature extraction vs. hand-crafted features
• Deep Learning
  – A renewed interest and a lot of hype!
  – Key success: Deep Neural Networks (DNNs)
  – Everything was there since the late 80s except the “computability of DNNs”
Courtesy: l

Deep Learning, Many-cores, and HPC
• NVIDIA GPUs are the main driving force for faster training of DL models
  – The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – Deep Neural Networks (DNNs) like AlexNet, GoogLeNet, and VGG are used
  – A natural fit for DL due to the throughput-oriented nature
• In the High Performance Computing (HPC) arena
  – 126/500 Top HPC systems use NVIDIA GPUs (Nov ’18)
  – CUDA-Aware Message Passing Interface (MPI)
  – NVIDIA Fermi, Kepler, and Pascal architecture
  – DGX-1 (Pascal) and DGX-2 (Volta) Dedicated DL
* 09/07/imagenet/
Figure: Performance Share (www.top500.org)

Deep Learning Use Cases and Growth Trends

So what is a Deep Neural Network?
• Example of a 3-layer Deep Neural Network (DNN) (input layer is not counted)
Courtesy: http://cs231n.github.io/neural-networks-1/

Graphical/Mathematical Intuitions for DNNs
Figure panels: Drawing of a Biological Neuron; The Mathematical Model
Courtesy: http://cs231n.github.io/neural-networks-1/
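The mathematical model of a neuron can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which a neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function; the sigmoid activation, the function names, and the sample values are illustrative choices, not taken from the slides.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """One artificial neuron: activation(sum_i w_i * x_i + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer(inputs, weight_rows, biases, activation=sigmoid):
    """A fully connected layer is a list of neurons sharing the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


x = [1.0, 2.0]
# Weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
out = neuron(x, weights=[0.5, -0.25], bias=0.0)
hidden = layer(x, weight_rows=[[0.5, -0.25], [1.0, 1.0]], biases=[0.0, -3.0])
```

Stacking several such layers, each feeding the next, gives the multi-layer network shown on the previous slide.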

Key Phases of Deep Learning
• Deep Learning has two major tasks
  1. Training of the Deep Neural Network
  2. Inference (or deployment) that uses a trained DNN
• DNN Training
  – Training is a compute/communication intensive process – can take days to weeks
  – Faster training is necessary!
• Faster training can be achieved by
  – Using Newer and Faster Hardware – But, there is a limit!
  – Can we use more GPUs or nodes?
• The need for Parallel and Distributed Training

DNN Training and Inference
Courtesy: fied v24.pdf

TensorFlow playground (Quick Demo)
• To actually train a network, please visit: http://playground.tensorflow.org

Caption Generation, Translation, Style Transfer, and many more.
Courtesy: lications-deep-learning/
Courtesy: -translate-squeezes-deep.html

Google Translate

Self Driving Cars
Courtesy: s-musk/

Why do we need DL frameworks?
• Deep Learning frameworks have emerged
  – hide most of the nasty mathematics
  – focus on the design of neural networks
• Distributed DL frameworks are being designed
  – We have saturated the peak potential of a single GPU/CPU/KNL
  – Parallel (multiple processing units in a single node) and/or Distributed (usually involves multiple nodes) frameworks are emerging
• Distributed frameworks are being developed along two directions
  – The HPC Eco-system: MPI-based Deep Learning
  – Enterprise Eco-system: BigData-based Deep Learning
Figure: Statement and its dataflow fragment. The data and computing vertexes with different colors reside on different processes.
Courtesy: https://web.stanford.edu/~rezab/nips2014workshop/submits/minerva.pdf

DL Frameworks and GitHub Statistics
• AI Index report offers very detailed trends about AI and ML
• It also provides interesting statistics about open source DL frameworks and related GitHub statistics
Courtesy: http://cdn.aiindex.org/2017-report.pdf

Are Define-by-run frameworks easier than Define-and-run?
• Define-and-run: TensorFlow, Caffe, Torch, Theano, and others
• Define-by-run
  – PyTorch and Chainer
  – TensorFlow 1.5 introduced Eager Execution (Define-by-run) mode
Courtesy: tworks-made-easy-by-chainer
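The difference between the two styles can be illustrated without any framework. The sketch below is a toy, not the API of TensorFlow, PyTorch, or Chainer: the `Node` class and both models are hypothetical and exist only to show that define-and-run builds a graph first and executes it later, while define-by-run executes ordinary code immediately.

```python
# Define-and-run: build a symbolic graph first, execute it later with data.
class Node:
    OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def run(self, feed):
        if self.op == "input":
            return feed[self.name]
        a, b = (node.run(feed) for node in self.inputs)
        return self.OPS[self.op](a, b)


x = Node("input", name="x")
w = Node("input", name="w")
graph = Node("add", (Node("mul", (x, w)), x))      # y = x * w + x; nothing is computed yet
static_result = graph.run({"x": 3.0, "w": 2.0})    # the graph is executed only here


# Define-by-run: plain Python runs immediately, so control flow can depend on data.
def dynamic_model(x, w):
    y = x * w + x
    if y > 5.0:            # data-dependent branch decided while the model runs
        y = y * 0.5
    return y


dynamic_result = dynamic_model(3.0, 2.0)
```

In the first style the graph is fixed before any data arrives, which leaves room for whole-graph optimization; in the second, the "graph" is whatever the program happens to execute, which is what makes debugging and dynamic architectures feel natural.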

Google TensorFlow (Most Popular)
• The most widely used framework open-sourced by Google
• Replaced Google’s DistBelief [1] framework
• Runs on almost all execution platforms available (CPU, GPU, TPU, Mobile, etc.)
• Very flexible but performance has been an issue
• Certain Python peculiarities like variable scope etc.
• https://github.com/tensorflow/tensorflow
Courtesy: https://www.tensorflow.org/
[1] Jeffrey Dean et al., “Large Scale Distributed Deep dia/research.google.com/en//archive/large deep networks nips2012.pdf

Facebook Torch/PyTorch - Catching up fast!
• Torch was written in Lua
  – Adoption wasn’t widespread
• PyTorch is a Python adaptation of Torch
  – Gaining a lot of attention
• Several contributors
  – Biggest support by Facebook
• There are (or may be) plans to merge the PyTorch and Caffe2 efforts
• Key selling point is ease of expression and the “define-by-run” approach
Courtesy: http://pytorch.org

Preferred Networks Chainer/ChainerMN
• ChainerMN provides multi-node parallel/distributed training using MPI
  – MVAPICH2 MPI library is being used by Preferred Networks
  – http://mvapich.cse.ohio-state.edu
• ChainerMN is geared towards performance
  – Uses Define-by-run (Chainer, PyTorch) approach instead of Define-and-run (Caffe, TensorFlow, Torch, Theano) approach
  – https://github.com/chainer/chainer
  – Focus on Speed as well as multi-node Scaling
  – Beats CNTK, MXNet, and TensorFlow for training ResNet-50 on 128 GPUs [1]

Many Other DL Frameworks
• Keras - https://keras.io
• MXNet - http://mxnet.io
• Theano - http://deeplearning.net/software/theano/
• Blocks - https://blocks.readthedocs.io/en/latest/
• Intel BigDL - stributed-deep-learningon-apache-spark
• The list keeps growing and the names keep getting longer and weirder ;-)
  – Livermore Big Artificial Neural Network Toolkit (LBANN) - https://github.com/LLNL/lbann
  – Deep Scalable Sparse Tensor Network Engine (DSSTNE) - https://github.com/amzn/amazon-dsstne

Outline
• Introduction
• Overview of Execution Environments
• Parallel and Distributed DNN Training
• Latest Trends in HPC Technologies
• Challenges in Exploiting HPC Technologies for Deep Learning
• Solutions and Case Studies
• Open Issues and Challenges
• Conclusion

So where do we run our DL framework?
• Early (2014) frameworks used a single fast GPU
  – As DNNs became larger, faster and better GPUs became available
  – At the same time, parallel (multi-GPU) training gained traction as well
• Today
  – Parallel training on multiple GPUs is being supported by most frameworks
  – Distributed (multiple nodes) training is still upcoming
    • A lot of fragmentation in the efforts (MPI, Big-Data, NCCL, Gloo, etc.)
  – On the other hand, DL has made its way to Mobile and Web too!
    • Smartphones - OK Google, Siri, Cortana, Alexa, etc.
    • DrivePX – the computer that drives NVIDIA’s self-driving car
    • Deeplearn.js – a DL framework in a web-browser
    • TensorFlow playground - http://playground.tensorflow.org/

Conventional Execution on GPUs and CPUs
• My framework is faster than your framework!
• This needs to be understood in a holistic way.
• Performance depends on the entire execution environment (the full stack)
• Isolated view of performance is not helpful
Figure (execution stack, top to bottom):
  DL Applications (Image Recognition, Speech Processing, etc.)
  DL Frameworks (Caffe, TensorFlow, etc.)
  Convolution layers: Generic Convolution Layer; MKL Optimized Convolution Layer; cuDNN Optimized Convolution Layer
  BLAS Libraries: ATLAS; OpenBLAS; MKL 2017; cuDNN/cuBLAS; Other BLAS Libraries
  Hardware: Multi-/Many-core (Xeon, Xeon Phi); Many-core GPU (Pascal P100); Other Processors
A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.

DL Frameworks and Underlying Libraries
• BLAS Libraries – the heart of math operations
  – Atlas/OpenBLAS
  – NVIDIA cuBLAS
  – Intel Math Kernel Library (MKL)
• Most compute intensive layers are generally optimized for a specific hardware
  – E.g. Convolution Layer, Pooling Layer, etc.
• DNN Libraries – the heart of Convolutions!
  – NVIDIA cuDNN (already reached its 7th iteration – cudnn-v7.5)
  – Intel MKL-DNN (MKL 2018) – recent but a very promising development

Where does the Performance come from?
• The full landscape: Forward and Backward Pass
• Faster Convolutions, Faster Training
• Performance of Intel KNL vs. NVIDIA P100 for AlexNet Training – Volta is in a different league!
• Most performance gains are based on improvements in layer conv2 and conv3 for AlexNet
A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.

The Need for Parallel and Distributed Training
• Why do we need Parallel Training?
• Larger and Deeper models are being proposed
  – AlexNet to ResNet to Neural Machine Translation (NMT)
  – DNNs require a lot of memory
  – Larger models cannot fit a GPU’s memory
• Single GPU training became a bottleneck
• As mentioned earlier, community has already moved to multi-GPU training
• Multi-GPU in one node is good but there is a limit to Scale-up (8 GPUs)
• Multi-node (Distributed or Parallel) Training is necessary!!

Batch-size, Model-size, Accuracy, and Scalability
• Increasing model-size generally increases accuracy
• Increasing batch-size requires tweaking hyperparameters to maintain accuracy
  – Limits for batch-size
  – Cannot make it infinitely large
  – Over-fitting
• Large batch size generally helps scalability
  – More work to do before the need to synchronize
• Increasing the model-size (no. of parameters)
  – Communication overhead becomes bigger so scalability decreases
  – GPU memory is precious and can only fit finite model data
Courtesy: ng-of-neural-networks
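The two scalability claims above can be made concrete with a toy cost model. Everything in this sketch is an assumption made for illustration: the per-sample and per-parameter costs are invented constants, compute is assumed to divide perfectly across GPUs, and gradient exchange is assumed to cost time proportional to the number of parameters regardless of GPU count.

```python
def iteration_time(batch_size, num_params, num_gpus,
                   sec_per_sample=1e-3, sec_per_param=1e-8):
    """Toy model: compute shrinks with more GPUs, gradient exchange does not."""
    compute = (batch_size / num_gpus) * sec_per_sample
    communicate = num_params * sec_per_param if num_gpus > 1 else 0.0
    return compute + communicate


def scaling_efficiency(batch_size, num_params, num_gpus):
    """Speedup over one GPU divided by the number of GPUs (1.0 is ideal)."""
    t1 = iteration_time(batch_size, num_params, 1)
    tn = iteration_time(batch_size, num_params, num_gpus)
    return t1 / (num_gpus * tn)


small_batch = scaling_efficiency(batch_size=256, num_params=60_000_000, num_gpus=4)
large_batch = scaling_efficiency(batch_size=2048, num_params=60_000_000, num_gpus=4)
large_model = scaling_efficiency(batch_size=2048, num_params=120_000_000, num_gpus=4)
```

Under these assumptions `large_batch` is higher than `small_batch` (more work between synchronizations), and `large_model` is lower than `large_batch` (more bytes to exchange per step), which matches the qualitative trends on the slide.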

Benefits of Distributed Training: An Example with Caffe
• Strong scaling CIFAR10 Training with OSU-Caffe (1 – 4 GPUs) – Batch Size 2K
• Large batch size is needed
• Adding more GPUs will degrade the scaling efficiency
Run Command (change np from 1 to 4):
  mpirun_rsh -np np ./build/tools/caffe train -solver examples/cifar10/cifar10_quick_solver.prototxt -scal strong
Figure: CIFAR-10 Training with OSU-Caffe, Time (seconds) by number of GPUs
Output: I0123 21:49:24.289763 75582 caffe.cpp:351] Avg. Time Taken: 142.101
Output: I0123 21:54:03.449211 97694 caffe.cpp:351] Avg. Time Taken: 74.6679
Output: I0123 22:02:46.858219 20659 caffe.cpp:351] Avg. Time Taken: 39.8109
OSU-Caffe is available from the HiDL project page (http://hidl.cse.ohio-state.edu)
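The reported averages can be turned into speedup and scaling-efficiency figures. This sketch assumes that the three "Avg. Time Taken" values correspond to 1, 2, and 4 GPUs in the order they appear; the slide text as extracted does not label them, so that mapping is an assumption.

```python
# Assumption: the three reported averages map to 1, 2, and 4 GPUs, in order.
avg_time = {1: 142.101, 2: 74.6679, 4: 39.8109}


def strong_scaling(times):
    """Return {gpus: (speedup, efficiency)} relative to the single-GPU time."""
    base = times[1]
    return {n: (base / t, base / (t * n)) for n, t in sorted(times.items())}


for gpus, (speedup, efficiency) in strong_scaling(avg_time).items():
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

Under that assumption the speedup is a little under 2x on two GPUs and a little under 4x on four, with efficiency dropping as GPUs are added, which is the strong-scaling behaviour the slide warns about.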

Parallelization Strategies
• What are the Parallelization Strategies
  – Model Parallelism
  – Data Parallelism (Received the most attention)
  – Hybrid Parallelism
  – Automatic Selection
Figure panels: Model Parallelism; Data Parallelism; Hybrid (Model and Data) Parallelism
Courtesy: ng-of-neural-networks
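Data parallelism, the strategy that has received the most attention, can be sketched as follows. The one-parameter model `y = w * x`, the mean-squared-error loss, and the sample batch are illustrative assumptions; the point is only that each worker holds the full model, processes a shard of the batch, and the averaged shard gradients match the full-batch gradient when shards are equal in size.

```python
def gradient(w, samples):
    """Mean-squared-error gradient dL/dw for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gradient(w, samples, num_workers):
    """Split the batch across workers, compute local gradients, then average."""
    shard = len(samples) // num_workers           # assumes an evenly divisible batch
    shards = [samples[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local = [gradient(w, s) for s in shards]      # each worker: full model, part of the data
    return sum(local) / num_workers               # the averaging step (an allreduce in practice)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_workers=2)
```

Model parallelism would instead split the parameters themselves across workers, which this one-parameter toy cannot show.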

Communication in Distributed Frameworks
• What are the Design Choices for Communication?
  – Established paradigms like Message Passing Interface (MPI)
  – Develop specific communication libraries like NCCL, Gloo, Baidu-allreduce, etc.
  – Use Big-Data frameworks like Spark, Hadoop, etc.
    • Still need some form of external communication for parameters (RDMA, IB, etc.)
• Focus on Scale-up and Scale-out
  – What are the challenges and opportunities?

Scale-up and Scale-out
• Scale-up: Intra-node Communication
  – Many improvements like:
    • NVIDIA cuDNN, cuBLAS, NCCL, etc.
    • CUDA 9 Co-operative Groups
• Scale-out: Inter-node Communication
  – DL Frameworks – most are optimized for single-node only
  – Distributed (Parallel) Training is an emerging trend
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
Figure: Scale-up Performance vs. Scale-out Performance, placing NCCL2, cuDNN, MPI, MKL-DNN, gRPC, and Hadoop, with a region marked “Desired”

Data Parallel Deep Learning and MPI Collectives
• Major MPI Collectives involved in Designing distributed frameworks
• MPI_Bcast – required for DNN parameter exchange
• MPI_Reduce – needed for gradient accumulation from multiple solvers
• MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
Figure: training loop over GPU 0 to GPU 3, each holding Params and layers L1, L2, ..., Ln with forward (F) and backward (B) passes:
  1. Data Propagation: MPI_Bcast (GPU 0) of packed_comm_buff
  2. Forward Backward Pass
  3. Gradient Aggregation: MPI_Reduce (GPU 0) of the Gradients in packed_reduce_buff, followed by ApplyUpdates
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
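The relationship between the three collectives can be sketched with plain Python lists standing in for per-GPU buffers. These functions only simulate the semantics of the collectives on one process; they are not MPI calls, and the gradient values are made up for illustration.

```python
def bcast(buffers, root=0):
    """Every rank receives a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce_sum(buffers, root=0):
    """Only the root receives the element-wise sum; other ranks keep their data."""
    total = [sum(values) for values in zip(*buffers)]
    return [total if rank == root else list(buf) for rank, buf in enumerate(buffers)]


def allreduce_sum(buffers):
    """Every rank receives the element-wise sum."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


# Hypothetical local gradients on four GPUs after the backward pass.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

two_step = bcast(reduce_sum(local_grads, root=0), root=0)   # Reduce, then Broadcast
one_step = allreduce_sum(local_grads)                       # single Allreduce
```

Both paths leave every rank with the same summed gradients, which is why a single Allreduce can replace the Reduce plus Broadcast pair.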

Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Figure panels:
  Multi-/Many-core Processors
  High Performance Interconnects - InfiniBand: 1usec latency, 100Gbps Bandwidth
  Accelerators: high compute density, high performance/watt, 1 TFlop DP on a chip
  SSD, NVMe-SSD, NVRAM
Systems pictured: Summit, Sierra, Sunway TaihuLight, K - Computer

HPC Technologies
• Hardware
  – Interconnects – InfiniBand, RoCE, Omni-Path, etc.
  – Processors – GPUs, Multi-/Many-core CPUs, Tensor Processing Unit (TPU), FPGAs, etc.
• Communication Middleware
  – Message Passing Interface (MPI)
    • CUDA-Aware MPI, Many-core Optimized MPI runtimes (KNL-specific optimizations)
  – NVIDIA NCCL

Overview of High Performance Interconnects
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand (IB)
  – Omni-Path
  – High Speed Ethernet 10/25/40/50/100 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance
  – Low latency (few micro seconds)
  – High Bandwidth (200 Gb/s with HDR InfiniBand)
  – Low CPU overhead (5-10%)
• OpenFabrics software stack with IB, Omni-Path, iWARP and RoCE interfaces are driving HPC systems
• Many such systems in Top500 list

Network Speed Acceleration with IB and HSE

  Technology (first available)          Speed
  Ethernet (1979 - )                    10 Mbit/sec
  Fast Ethernet (1993 - )               100 Mbit/sec
  Gigabit Ethernet (1995 - )            1000 Mbit/sec
  ATM (1995 - )                         155/622/1024 Mbit/sec
  Myrinet (1993 - )                     1 Gbit/sec
  Fibre Channel (1994 - )               1 Gbit/sec
  InfiniBand (2001 - )                  2 Gbit/sec (1X SDR)
  10-Gigabit Ethernet (2001 - )         10 Gbit/sec
  InfiniBand (2003 - )                  8 Gbit/sec (4X SDR)
  InfiniBand (2005 - )                  16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
  InfiniBand (2007 - )                  32 Gbit/sec (4X QDR)
  40-Gigabit Ethernet (2010 - )         40 Gbit/sec
  InfiniBand (2011 - )                  54.6 Gbit/sec (4X FDR)
  InfiniBand (2012 - )                  2 x 54.6 Gbit/sec (4X Dual-FDR)
  25-/50-Gigabit Ethernet (2014 - )     25/50 Gbit/sec
  100-Gigabit Ethernet (2015 - )        100 Gbit/sec
  Omni-Path (2015 - )                   100 Gbit/sec
  InfiniBand (2015 - )                  100 Gbit/sec (4X EDR)
  InfiniBand (2018 - )                  200 Gbit/sec (4X HDR)

100 times in the last 17 years
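The "100 times in the last 17 years" figure can be checked with a short calculation. This sketch assumes the claim compares the first InfiniBand entry (2 Gbit/sec, 2001) with the latest one (200 Gbit/sec, 2018); the slide does not say which two rows are meant, and the derived annual growth factor and doubling time are illustrative consequences of that assumption.

```python
import math

# Assumption: the claim compares InfiniBand 1X SDR (2001) with 4X HDR (2018).
start_year, start_gbps = 2001, 2.0
end_year, end_gbps = 2018, 200.0

years = end_year - start_year                 # 17
growth = end_gbps / start_gbps                # 100x overall
annual_factor = growth ** (1.0 / years)       # compound growth per year
doubling_years = math.log(2.0) / math.log(annual_factor)

print(f"{growth:.0f}x over {years} years, "
      f"about {annual_factor:.2f}x per year, "
      f"doubling roughly every {doubling_years:.1f} years")
```

Read this way, the table corresponds to link speed growing by roughly 30% per year.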

Intel Neural Network Processor (NNP)
• Intel Nervana Neural Network Processors (NNP)
  – formerly known as “Lake Crest”
• Recently announced as part of Intel’s strategy for next-generation AI systems
• Purpose built architecture for deep learning
• 1 TB/s High Bandwidth Memory (HBM)
• Spatial Architecture
• FlexPoint format
  – Similar performance (in terms of accuracy) to FP32 while using 16 bits of storage
Courtesy: processor-architecture-update/

GraphCore – Intelligence Processing Unit (IPU)
• New processor that’s the first to be specifically designed for machine intelligence workloads – an Intelligence Processing Unit (IPU)
  – Massively parallel
  – Low-precision floating-point compute
  – Higher compute density
• UK-based Startup
• Early benchmarks show 10-100x speedup over GPUs
  – Presented at NIPS 2017
Courtesy: r-a-range-of-machine-learning-applications


Parallel Programming Models Overview
Figure panels (processes P1, P2, P3 in each):
  Shared Memory Model (SHMEM, DSM): all processes attached to one Shared Memory
  Distributed Memory Model (MPI, Message Passing Interface): each process has its own Memory
  Partitioned Global Address Space (PGAS) (OpenSHMEM, UPC, Chapel, X10, CAF, …): each process has its own Memory under a Logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance

Allreduce Collective Communication Pattern
• Element-wise sum of data from all processes, with the result sent to all processes

int MPI_Allreduce (const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op operation, MPI_Comm comm)

Input-only Parameters
  Parameter   Description
  sendbuf     Starting address of send buffer
  count       Number of elements in the buffers
  datatype    Data type of buffer elements
  operation   Reduction operation to be performed (e.g. sum)
  comm        Communicator handle

Input/Output Parameters
  Parameter   Description
  recvbuf     Starting address of receive buffer

Example with four tasks (T1 to T4):
  Sendbuf (Before): T1 = T2 = T3 = T4 = [1, 2, 3, 4]
  Recvbuf (After):  T1 = T2 = T3 = T4 = [4, 8, 12, 16]
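The before and after buffers in this example can be reproduced with a small single-process simulation. This is not an MPI program: the `allreduce` function below is a hypothetical stand-in that mimics the semantics of the collective on a list of per-task buffers, with the reduction passed in the way the `operation` argument is in MPI.

```python
from functools import reduce
import operator


def allreduce(sendbufs, operation=operator.add):
    """Combine element i of every task's send buffer; every task gets the result."""
    count = len(sendbufs[0])
    result = [reduce(operation, (buf[i] for buf in sendbufs)) for i in range(count)]
    return [list(result) for _ in sendbufs]


sendbufs = [[1, 2, 3, 4] for _ in range(4)]    # T1..T4 before the call
recvbufs = allreduce(sendbufs)                 # element-wise sum on every task
maxbufs = allreduce(sendbufs, operation=max)   # a different reduction operation

for task, buf in enumerate(recvbufs, start=1):
    print(f"T{task}: {buf}")
```

With sum as the operation, every task ends up with [4, 8, 12, 16], matching the Recvbuf (After) values in the example.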

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
  – MVAPICH2-X (MPI + PGAS), Available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  – Support for Virtualization (MVAPICH2-Virt), Available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,975 organizations in 86 countries
  – More than 528,000 (over 0.5 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov ‘18 ranking)
    • 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 14th, 556,104 cores (Oakforest-PACS) in Japan
    • 17th, 367,024 cores (Stampede2) at TACC
    • 27th, 241,108-core (Pleiades) at NASA and many others
  – Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the upcoming TACC Frontera System

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GDR
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from GPU with RDMA transfers
At Sender:
  MPI_Send(s_devbuf, size, …);
At Receiver:
  MPI_Recv(r_devbuf, size, …);
(data movement is handled inside MVAPICH2)
High Performance and High Productivity

Optimized MVAPICH2-GDR Design
Figure: GPU-GPU inter-node panels comparing MV2-(NO-GDR) with MV2-GDR 2.3, including GPU-GPU Inter-node Latency; axes: Latency (us) and Bandwidth (MB/s) against Message Size (Bytes)
Test platform:
  Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores
  NVIDIA Volta V100 GPU
  Mellanox Connect-X4 EDR HCA
  CUDA 9.0
  Mellanox OFED 4.0 with GPU-Direct-RDMA

NCCL Communication Library
• Collective Communication with a caveat!
  – GPU buffer exchange
  – Dense Multi-GPU systems (Cray CS-Storm, DGX-1)
  – MPI-like – but not MPI standard compliant
• NCCL (pronounced Nickel)
  – Open-source Communication Library by NVIDIA
  – Topology-aware, ring-based (linear) collective communication library for GPUs
  – Divide bigger buffers to smaller chunks
  – Good performance for large messages
    • Kernel-based threaded copy (Warp-level Parallel) instead of ...
Courtesy: ll/fast-multi-gpu-collectives-nccl/
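The ring-based, chunked approach can be illustrated with a single-process simulation of a ring allreduce. This is a generic textbook-style sketch, not NCCL's implementation: it assumes each buffer divides evenly into one chunk per rank, and it models a step of simultaneous sends by collecting all messages before applying them.

```python
def ring_allreduce(buffers):
    """Simulate a ring allreduce (sum) over equally sized per-rank buffers."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): after n - 1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r - step) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):
            for i, value in zip(span(c), payload):
                data[dst][i] += value

    # Phase 2 (allgather): completed chunks travel once around the ring.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):
            for i, value in zip(span(c), payload):
                data[dst][i] = value
    return data


ranks = [[rank * 10 + i for i in range(8)] for rank in range(4)]
reduced = ring_allreduce(ranks)
expected = [sum(values) for values in zip(*ranks)]
```

Each rank only ever talks to its ring neighbour and sends one chunk per step, so the data sent per rank stays close to twice the buffer size regardless of the number of ranks, which is one reason this pattern suits large messages.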

Broad Challenge: Exploiting HPC for Deep Learning
How to efficiently scale-out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

Research Challenges to Exploit HPC Technologies
1. What are the fundamental issues in designing DL frameworks?
  – Memory Requirements
  – Computation Requirements
  – Communication Overhead
2. Why do we need to support distributed training?
  – To overcome the limits of single-node training
  – To better utilize hundreds of existing HPC Clusters
Figure (layered stack, top to bottom):
  Deep Learning and Machine Learning Frameworks
  Major Computation and Communication Phases in DL Frameworks (Model Propagation, ...)
  Communication Runtimes to support Distributed Training
  HPC Platforms (CPU, InfiniBand, GPU)

Research Challenges to Exploit HPC Technologies (Cont’d)
3. What are the new design challenges brought forward by DL frameworks for Communication runtimes?
  – Large Message Collective Communication and Reductions
  – GPU Buffers (CUDA-Awareness)
4. Can a Co-design approach help in achieving Scale-up and Scale-out?
  – Co-Design the support at Runtime level and Exploit it at the DL Framework level
  – What performance benefits can be observed?
  – What needs to be fixed at the communication runtime layer?
Figure (layered stack, top to bottom):
  Deep Learning and Machine Learning Frameworks (Caffe/OSU-Caffe, CNTK, TensorFlow, MXNet)
  Major Computation and Communication Phases in DL Frameworks (Model Propagation, ...)
  Co-Design Opportunities
  Communication Runtimes (CUDA-Awareness, Large-message Collectives)
  HPC Platforms (CPU, InfiniBand, GPU)

Solutions and Case Studies: Exploiting HPC for DL
• NVIDIA NCCL/NCCL2
• Baidu-allreduce
• Facebook Gloo
• Co-design MPI runtimes and DL Frameworks
• Distributed Training for TensorFlow
• Scaling DNN Training on Multi-/Many-core CPUs
• PowerAI DDL
Figure (layered stack, top to bottom):
  Deep Learning and Machine Learning Frameworks (Caffe/OSU-Caffe, Caffe2, CNTK, TensorFlow, MXNet)
  Major Computation and Communication Phases in DL Frameworks
  Co-Design Opportunities
  Communication Runtimes (MPI/NCCL/Gloo/MLSL): Point-to-Point Operations, CUDA-Awareness, Large-message Collectives
  HPC Platforms (CPU, InfiniBand, GPU)

NVIDIA NCCL
• NCCL is a collective communication library
  – NCCL 1.x is only for Intra-node communication on a single-node
• NCCL 2.0 supports inter-node communication as well
• Design Philosophy
  – Use Rings and CUDA Kernels to perform efficient communication
• NCCL is optimized for dense multi-GPU systems like the DGX-1 and DGX-1V
Courtesy: es-gpu-acceleration-next-level/

NCCL 2: Multi-node GPU Collectives
Courtesy: tion/s7155-jeaugey-nccl.pdf

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
Figure: two panels of Latency (us) against Message Size (Bytes), annotated “3X better” and “1.2X better”
*Available since MVAPICH2-GDR 2.3

