Parallel Computing: A Brief Discussion - UCL


Parallel Computing: a brief discussion
Marzia Rivi
Astrophysics Group, Department of Physics & Astronomy
Research Programming Social – Nov 10, 2015

What is parallel computing?

Traditionally, software has been written for serial computation:
– To be run on a single computer having a single core.
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– A problem is broken into discrete parts that can be solved concurrently.
– Instructions from each part are executed simultaneously on different cores.

Why parallel computing?

Save time and/or money:
– in theory, the more resources we use, the shorter the time to finish, with potential cost savings.

Solve larger problems:
– when problems are so large and complex that it is impossible to solve them on a single computer, e.g. "Grand Challenge" problems requiring PetaFLOPS and PetaBytes of computing resources (en.wikipedia.org/wiki/Grand Challenge).
– Many scientific problems can be tackled only by increasing processor performance.
– Highly complex or memory-greedy problems can be solved only with greater computing capabilities.

Limits to serial computing: physical and practical reasons.

Who needs parallel computing?

Researcher 1: has a large number of independent jobs (e.g. processing video files, genome sequencing, parametric studies); uses serial applications.
Researcher 2: developed serial code and validated it on small problems; to publish, needs some "big problem" results.
Researcher 3: needs to run large parallel simulations fast (e.g. molecular dynamics, computational fluid dynamics, cosmology).

High Throughput Computing (HTC) (many computers):
– dynamic environment
– multiple independent small-to-medium jobs
– large amounts of processing over long time
– loosely connected resources (e.g. grid)

High Performance Computing (HPC) (single parallel computer):
– static environment
– single large-scale problems
– tightly coupled parallelism

How to parallelise an application?

Automatic parallelisation tools:
– compiler support for vectorisation of operations (SSE and AVX) and thread parallelisation (OpenMP)
– specific tools exist but have limited practical use
– all successful applications require intervention and steering

Parallel code development requires:
– programming languages (with support for parallel libraries, APIs)
– parallel programming standards (such as MPI and OpenMP)
– compilers
– performance libraries/tools (both serial and parallel)

But, more than anything, it requires understanding:
– the algorithms (program, application, solver, etc.)
– the factors that influence parallel performance

How to parallelise an application?

First, make it work!
– analyse the key features of your parallel algorithms:
  parallelism: the type of parallel algorithm that can use parallel agents
  granularity: the amount of computation carried out by parallel agents
  dependencies: algorithmic restrictions on how the parallel work can be scheduled
– re-program the application to run in parallel and validate it

Then, make it work well!
– Pay attention to the key aspects of an optimal parallel execution:
  data locality (computation vs. communication)
  scalability (linear scaling is the holy grail: execution time is inversely proportional to the number of processors)
– Use profilers and performance tools to identify problems.

Task parallelism

Thread (or task) parallelism is based on partitioning the operations of the algorithm. If an algorithm is implemented as a series of independent operations, these can be spread across the processors, which concurrently execute different parts of the algorithm.

Features:
– Different independent sets of instructions applied to a single set (or multiple sets) of data.
– May lead to work imbalance and may not scale well; performance is limited by the slowest process.

[Diagram: tasks 1–4 between begin and end, each assigned to one of cpu1–cpu4.]
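As an illustration (a minimal sketch, not from the original slides), independent tasks can be expressed with OpenMP sections in C; the task() function is a hypothetical placeholder for real work:

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical independent task; each could be any self-contained piece of work. */
    static void task(int id) {
        printf("task %d executed by thread %d\n", id, omp_get_thread_num());
    }

    int main(void) {
        /* Each section is an independent task that may run on a different thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task(1);
            #pragma omp section
            task(2);
            #pragma omp section
            task(3);
            #pragma omp section
            task(4);
        }
        return 0;
    }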

Data parallelism

Data parallelism means spreading the data to be computed across the processors.

Features:
– The same set of instructions is applied to different (parts of the) data.
– The processors execute merely the same operations, but on the data assigned to them, and communicate only when necessary; this often means distribution of array elements across the computing units.
– Easy to program, scales well.
– Inherent in program loops.

[Diagram: a loop over array[4], with the data distributed to cpu1–cpu4.]
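A minimal sketch (not from the slides) of the distribution idea: each of P workers applies the same operation to its own contiguous block of array elements. N, P and the sin operation are illustrative assumptions; in a real data-parallel code the chunks would be processed concurrently.

    #include <math.h>
    #include <stdio.h>

    #define N 16   /* total array elements (illustrative) */
    #define P 4    /* number of workers     (illustrative) */

    int main(void) {
        double x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = (double)i;

        int chunk = (N + P - 1) / P;               /* elements per worker, rounded up */
        for (int rank = 0; rank < P; rank++) {
            int lo = rank * chunk;                 /* first element owned by this worker */
            int hi = (lo + chunk < N) ? lo + chunk : N;
            for (int i = lo; i < hi; i++)
                y[i] = sin(x[i]);                  /* same operation, different data */
        }

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }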

Granularity

Coarse:
– parallelise large amounts of the total workload
– in general, the coarser the better
– minimal inter-processor communication
– can lead to imbalance

Fine:
– parallelise small amounts of the total workload (e.g. inner loops)
– can lead to unacceptable parallel overheads (e.g. communication)

Dependencies

Data dependence dictates the order of operations, imposes limits on parallelism and requires parallel synchronisation.

In the following example the i index loop can be parallelised:

Fortran:
    DO I = 1, N
      DO J = 1, N
        A(J,I) = A(J-1,I) + B(J,I)
      END DO
    END DO

C/C++:
    for (i = 1; i < n; i++) {
      for (j = 1; j < n; j++) {
        a[j][i] = a[j-1][i] + b[j][i];
      }
    }

In this loop parallelisation depends on the K value:

Fortran:
    DO I = M, N
      A(I) = A(I-K) + B(I)/C(I)
    END DO

C/C++:
    for (i = m; i <= n; i++) {
      a[i] = a[i-k] + b[i]/c[i];
    }

If K > N-M or K < M-N parallelisation is straightforward.
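For the first case, a hedged sketch (not in the original slides) of how the independent i loop could be parallelised with an OpenMP directive; the array size and values are illustrative:

    #include <stdio.h>

    #define N 8

    int main(void) {
        static double a[N][N], b[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[j][i] = 1.0; b[j][i] = 2.0; }

        /* Columns (index i) are independent of each other, so the outer loop can be
           distributed across threads; the j loop stays serial because iteration j
           depends on iteration j-1. */
        #pragma omp parallel for
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++)
                a[j][i] = a[j-1][i] + b[j][i];

        printf("a[N-1][N-1] = %f\n", a[N-1][N-1]);
        return 0;
    }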

Parallel computing models

Shared memory:
– Each processor has direct access to common physical memory (e.g. multi-processors, cluster nodes).
– Agent of parallelism: the thread (program = collection of threads).
– Threads exchange information implicitly by reading/writing shared variables.
– Programming standard: OpenMP.

Distributed memory:
– Memory is physically distributed, not common: local processor memory is invisible to all other processors and memory access is network-based (e.g. computer clusters).
– As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
– Agent of parallelism: the process (program = collection of processes).
– Exchanging information between processes requires communications.
– Programming standard: MPI.

OpenMP

http://www.openmp.org

API instructing the compiler what can be done in parallel (high-level programming). Consisting of:
– compiler directives
– runtime library functions
– environment variables

[Diagram: directives, environment variables and the runtime library sitting on top of threads in the operating system.]

– Supported by most compilers for Fortran and C/C++.
– Usable as serial code (threading ignored by serial compilation).
– By design, suited for data parallelism.

OpenMP

Execution model: threads are generated automatically at runtime and scheduled by the OS. Thread creation/destruction has an overhead: minimise the number of times parallel regions are entered/exited.

Before the parallel region only the master thread runs; inside it, all threads run; after it, the master thread only again.

Fortran syntax:
    !$OMP PARALLEL
      ... parallel region ...
    !$OMP END PARALLEL

C syntax:
    #pragma omp parallel
    {
      ... parallel region ...
    }
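A minimal C sketch (not from the slides) of this fork-join behaviour, assuming an OpenMP-capable compiler:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("before the parallel region: master thread only\n");

        /* Fork: a team of threads executes this block. */
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* Join: implicit barrier, then the master thread continues alone. */

        printf("after the parallel region: master thread only\n");
        return 0;
    }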

OpenMP - example

Objective: vectorise a loop, mapping the sin operation to vector x in parallel.
Idea: instruct the compiler on what to parallelise (the loop) and how (private and shared data) and let it do the hard work.

In C:
    #pragma omp parallel for shared(x, y, J) private(j)
    for (j = 0; j < J; j++) {
      y[j] = sin(x[j]);
    }

In Fortran:
    !$omp parallel do shared(x, y, J) private(j)
    do j = 1, J
      y(j) = sin(x(j))
    end do
    !$omp end parallel do

The number of threads is set by the environment variable OMP_NUM_THREADS or programmed using the RTL function omp_set_num_threads().
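For completeness, a self-contained version of the C fragment above (a sketch; the vector length and thread count are illustrative assumptions):

    #include <math.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { J = 1000 };              /* illustrative vector length */
        double x[J], y[J];
        for (int i = 0; i < J; i++) x[i] = 0.001 * i;

        omp_set_num_threads(4);         /* alternative to setting OMP_NUM_THREADS */

        int j;
        #pragma omp parallel for shared(x, y) private(j)
        for (j = 0; j < J; j++) {
            y[j] = sin(x[j]);
        }

        printf("computed %d elements with up to %d threads\n", J, omp_get_max_threads());
        return 0;
    }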

MPI - Message Passing Interface

http://www.mpi-forum.org/

MPI is a specification for a distributed-memory API designed by a committee for the Fortran, C and C++ languages.

Two versions:
– MPI 1.0, quickly and universally adopted (most used and useful)
– MPI 2.0, a superset of MPI 1.0 (adding parallel I/O, dynamic process management and direct remote memory operations), but not so popular.

Many implementations:
– open software (MPICH, MVAPICH, OpenMPI)
– vendor (HP/Platform, SGI MPT, Intel).

MPI implementation components

– Libraries covering the functionality specified by the standard.
– Header files, specifying interfaces, constants etc.
  – C/C++: mpi.h
  – Fortran: mpif.h
– Tools to compile and link MPI applications (wrappers around serial compilers)
  – Fortran: mpif77, mpif90
  – C: mpicc
  – C++: mpicxx, mpiCC
– An MPI application launcher (mapping processes to CPUs):
  mpirun -np n_processes executable

MPI - overview

– Processes (MPI tasks) are mapped to processors (CPU cores).
– Start/stop mechanisms:
  – MPI_Init() to initialise processes
  – MPI_Finalize() to finalise and clean up processes
– Communicators:
  – a communicator is a collection (network) of processes
  – the default is MPI_COMM_WORLD, which is always present and includes all processes requested by mpirun
  – only processes included in a communicator can communicate
– Identification mechanism:
  – process id: MPI_Comm_rank()
  – communicator size (number of processes): MPI_Comm_size()
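A minimal sketch (not from the slides) combining these calls:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);                    /* start the MPI processes       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id             */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* number of processes requested */

        printf("process %d of %d\n", rank, size);

        MPI_Finalize();                            /* clean up before exiting       */
        return 0;
    }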

MPI - communication

Inter-process communication (the cornerstone of MPI programming):
– one-to-one communication (send, receive)
– one-to-many communication (broadcast, scatter)
– many-to-one communication (gather)
– many-to-many communication (allgather)
– reduction, e.g. global sums, global max/min (a special many-to-one!)
– process synchronisation (barriers)

Typical usage patterns: domain decomposition, master-slave.
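As an illustration of a collective operation (a sketch, not from the slides), a global sum with MPI_Reduce; the per-process values are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes its own value; the reduction sums them on rank 0. */
        int local = rank + 1;
        int global_sum = 0;
        MPI_Reduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 1..%d = %d\n", size, global_sum);

        MPI_Finalize();
        return 0;
    }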

MPI - example

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int my_rank, numprocs;
        char message[100];
        int dest, source, tag = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        if (my_rank != 0) {
            sprintf(message, "Greetings from process %d!", my_rank);
            dest = 0;
            MPI_Send(message, sizeof(message), MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        } else {
            for (source = 1; source <= (numprocs - 1); source++) {
                MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
                printf("%s\n", message);
            }
        }
        MPI_Finalize();
        return 0;
    }

If the program is executed with two processes, the output is:
    Greetings from process 1!

If the program is executed with four processes, the output is:
    Greetings from process 1!
    Greetings from process 2!
    Greetings from process 3!

Distributed vs shared memory

Parallelisation:
– shared memory / OpenMP: easy, incremental (parallelising small parts of the code at a time); mostly parallelise loops.
– distributed memory / MPI: relatively difficult (tends to require an all-or-nothing approach); can be used in a wider range of contexts.

Scaling (hardware view):
– shared memory / OpenMP: both expensive (few vendors provide scalable solutions) and cheap (multicore workstations).
– distributed memory / MPI: relatively cheap (most vendors provide systems with 1000's of cores); runs on both shared and distributed systems.

Scaling (programming view):
– shared memory / OpenMP: small/simple programs are easy and fast to implement.
– distributed memory / MPI: even small/simple programs involve large programming complexity.

Maintainability:
– shared memory / OpenMP: code is relatively easy to understand and maintain.
– distributed memory / MPI: code is relatively difficult to understand.

Readability:
– shared memory / OpenMP: small increase in code size, readable code.
– distributed memory / MPI: tends to add a lot of extra coding for message handling; code readable with difficulty.

Debugging:
– shared memory / OpenMP: requires special compiler support; debuggers are extensions of serial ones.
– distributed memory / MPI: no special compiler support (just libraries); specialised debuggers.

Distributed vs shared memory paradigm

Which problems are suited to distributed memory processing?
– Embarrassingly parallel problems (independent tasks), e.g. Monte Carlo methods.
– Computation-bound problems (heavy local computation with little data exchange between processes):
  – models with localised data, e.g. PDEs solved using finite elements/volumes (CFD, CHMD, etc.)
  – other models with distributed data: molecular dynamics, etc.

Which problems are suited to shared memory processing?
– Communication-bound problems (much data shared between threads):
  – models with non-local data, e.g. Newtonian particle dynamics
  – Fourier transforms, convolutions.

Accelerators - motivation

Moore's Law (1965): the number of transistors in a CPU design doubles roughly every 2 years. Backed by clock speed increases, this correlated with exponentially increasing CPU performance for at least 40 years. This meant the same old (single-threaded) code just ran faster on newer hardware.

No more! While the "law" still holds, the clock frequency of general purpose CPUs was "frozen" in 2004 at around 2.5-3.0 GHz and design has gone multi-core. Performance improvements are now coming from the increase in the number of cores on a processor.

Accelerators – different philosophies

CPU: design optimised for sequential code and coarse-grained parallelism:
– multi-core
– sophisticated control logic unit
– large cache memories to reduce access latencies.

Accelerator: design optimised for numerically intensive computation via massive fine-grained parallelism:
– many cores (several hundreds)
– lightweight threads and high execution throughput
– large number of threads to overcome long-latency memory accesses.

Accelerators - examples

NVIDIA Tesla K20X GPU:
– 2688 cores
– 6 GB GDDR5 memory
– 250 GB/s memory bandwidth
– 3.95 Tflops/s of peak SP

Intel Xeon Phi 5110 MIC:
– 60 cores
– 8 GB GDDR5
– 320 GB/s memory bandwidth
– 240 HW threads (4 per core)
– 512-bit wide SIMD capability

Accelerators – programming model

Applications should use both the CPUs and the accelerator, where the latter is exploited as a coprocessor:
– Serial sections of the code are performed by the CPU (host).
– The parallel ones (those that exhibit a rich amount of data parallelism) are performed by the accelerator (device).
– Host and device have separate memory spaces: data must be transferred in a manner similar to "one-sided" message passing.

Several languages/APIs:
– GPU: CUDA, pyCUDA, OpenCL, OpenACC
– Xeon Phi: OpenMP, Intel TBB, Cilk

Example – CUDA

One thread per iteration.

DEVICE: the kernel increment_gpu runs on the accelerator.

HOST:
    void main()
    {
      ...
      cudaMemcpy(..., cudaMemcpyHostToDevice);
      increment_gpu<<<4,4>>>( ... );
      ...
    }

OpenACC

http://www.openacc-standard.org/

GPU directive-based API (corresponds to "OpenMP" for CPU parallel programming).
– Supported by CRAY and PGI (slightly different implementations, but converging) and soon GCC.
– "Easier" code development – supports incremental development.
– Possible performance loss – about 20% compared to CUDA.
– Can be "combined" with CUDA code.
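To illustrate the directive style (a minimal sketch, not from the slides; the loop, data clauses and array size are illustrative assumptions):

    #include <math.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        float x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = 0.001f * i;

        /* The directive asks the compiler to offload the loop to the accelerator,
           copying x in and y out; without an OpenACC compiler it is simply ignored. */
        #pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = sinf(x[i]);

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }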

Accelerators programming

– Accelerators are suitable for massively parallel algorithms and require low-level (architecture-bound) programming to achieve good performance.
– They can effectively help in reducing the time to solution. However, the effectiveness is strongly dependent on the algorithm and the amount of computation.
– The effort to get codes running efficiently on accelerators is, in general, big, irrespective of the programming model adopted. However, portability and maintainability of the code push towards directive-based approaches (at the expense of some performance).
– All the (suitable) computationally demanding parts of the code should be ported.
– Data transfer should be minimised or hidden. Host-device overlap is hard to achieve.

Hybrid parallel programming

Hybrid programming (MPI + OpenMP, MPI + CUDA) is a growing trend.
– The hybrid model: several SMP (Symmetric Multiprocessor) nodes connected by an interconnection network; on each node (at least) one MPI process is mapped, together with several OpenMP threads.
– Takes the positives of all models.
– Suits the memory hierarchy on "fat nodes" (nodes with large memory and many cores).
– Scope for better scaling than pure MPI (less inter-node communication) on modern clusters.
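A minimal hybrid sketch (not from the slides), assuming both MPI and OpenMP are available; MPI_THREAD_FUNNELED is one common choice of thread support level:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        /* Ask for an MPI library that tolerates threads; FUNNELED means only the
           master thread of each process makes MPI calls. */
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Typically one MPI process per node, several OpenMP threads inside each process. */
        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }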

Questions?
