Parallel Computing: A Brief Discussion - UCL


Parallel Computing: a brief discussion
Marzia Rivi
Astrophysics Group, Department of Physics & Astronomy
Research Programming Social – Nov 10, 2015

What is parallel computing?

Traditionally, software has been written for serial computation:
– To be run on a single computer having a single core.
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– A problem is broken into discrete parts that can be solved concurrently.
– Instructions from each part are executed simultaneously on different cores.

Why parallel computing?

Save time and/or money:
– in theory, the more resources we use, the shorter the time to finish, with potential cost savings.

Solve larger problems:
– when problems are so large and complex that it is impossible to solve them on a single computer, e.g. "Grand Challenge" problems requiring PetaFLOPS and PetaBytes of computing resources (en.wikipedia.org/wiki/Grand Challenge).
– Many scientific problems can be tackled only by increasing processor performance.
– Highly complex or memory-greedy problems can be solved only with greater computing capabilities.

Limits to serial computing: physical and practical reasons.

Who needs parallel computing?

Researcher 1: has a large number of independent jobs (e.g. processing video files, genome sequencing, parametric studies); uses serial applications.
Researcher 2: developed serial code and validated it on small problems; to publish, needs some "big problem" results.
Researcher 3: needs to run large parallel simulations fast (e.g. molecular dynamics, computational fluid dynamics, cosmology).

High Throughput Computing (HTC) (many computers):
– dynamic environment
– multiple independent small-to-medium jobs
– large amounts of processing over long time
– loosely connected resources (e.g. grid)

High Performance Computing (HPC) (single parallel computer):
– static environment
– single large-scale problems
– tightly coupled parallelism

How to parallelise an application?

Automatic parallelisation tools:
– compiler support for vectorisation of operations (SSE and AVX) and thread parallelisation (OpenMP)
– specific tools exist but have limited practical use
– all successful applications require intervention and steering

Parallel code development requires:
– programming languages (with support for parallel libraries, APIs)
– parallel programming standards (such as MPI and OpenMP)
– compilers
– performance libraries/tools (both serial and parallel)

But, more than anything, it requires understanding:
– the algorithms (program, application, solver, etc.)
– the factors that influence parallel performance

How to parallelise an application?

First, make it work!
– analyse the key features of your parallel algorithms:
  parallelism: the type of parallel algorithm that can use parallel agents
  granularity: the amount of computation carried out by parallel agents
  dependencies: algorithmic restrictions on how the parallel work can be scheduled
– re-program the application to run in parallel and validate it

Then, make it work well!
– Pay attention to the key aspects of an optimal parallel execution:
  data locality (computation vs. communication)
  scalability (linear scaling is the holy grail: execution time is inversely proportional to the number of processors)
– Use profilers and performance tools to identify problems.

Task parallelism

Thread (or task) parallelism is based on partitioning the operations of the algorithm. If an algorithm is implemented as a series of independent operations, these can be spread across the processors, which concurrently execute different parts of the algorithm.

Features:
– Different independent sets of instructions applied to a single set (or multiple sets) of data.
– May lead to work imbalance and may not scale well; performance is limited by the slowest process.

[Diagram: tasks 1–4 between begin and end, each assigned to one of cpu1–cpu4.]
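As an illustration (a minimal sketch, not from the original slides), independent tasks can be expressed with OpenMP sections in C; the task() function is a hypothetical placeholder for real work:

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical independent task; each could be any self-contained piece of work. */
    static void task(int id) {
        printf("task %d executed by thread %d\n", id, omp_get_thread_num());
    }

    int main(void) {
        /* Each section is an independent task that may run on a different thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task(1);
            #pragma omp section
            task(2);
            #pragma omp section
            task(3);
            #pragma omp section
            task(4);
        }
        return 0;
    }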

Data parallelism

Data parallelism means spreading the data to be computed across the processors.

Features:
– The same set of instructions is applied to different (parts of the) data.
– The processors execute merely the same operations, but on the data assigned to them, and communicate only when necessary; this often means distribution of array elements across the computing units.
– Easy to program, scales well.
– Inherent in program loops.

[Diagram: a loop over array[4], with the data distributed to cpu1–cpu4.]
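A minimal sketch (not from the slides) of the distribution idea: each of P workers applies the same operation to its own contiguous block of array elements. N, P and the sin operation are illustrative assumptions; in a real data-parallel code the chunks would be processed concurrently.

    #include <math.h>
    #include <stdio.h>

    #define N 16   /* total array elements (illustrative) */
    #define P 4    /* number of workers     (illustrative) */

    int main(void) {
        double x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = (double)i;

        int chunk = (N + P - 1) / P;               /* elements per worker, rounded up */
        for (int rank = 0; rank < P; rank++) {
            int lo = rank * chunk;                 /* first element owned by this worker */
            int hi = (lo + chunk < N) ? lo + chunk : N;
            for (int i = lo; i < hi; i++)
                y[i] = sin(x[i]);                  /* same operation, different data */
        }

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }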

Granularity

Coarse:
– parallelise large amounts of the total workload
– in general, the coarser the better
– minimal inter-processor communication
– can lead to imbalance

Fine:
– parallelise small amounts of the total workload (e.g. inner loops)
– can lead to unacceptable parallel overheads (e.g. communication)

Dependencies

Data dependence dictates the order of operations, imposes limits on parallelism and requires parallel synchronisation.

In the following example the i index loop can be parallelised:

Fortran:
    DO I = 1, N
      DO J = 1, N
        A(J,I) = A(J-1,I) + B(J,I)
      END DO
    END DO

C/C++:
    for (i = 1; i < n; i++) {
      for (j = 1; j < n; j++) {
        a[j][i] = a[j-1][i] + b[j][i];
      }
    }

In this loop parallelisation depends on the K value:

Fortran:
    DO I = M, N
      A(I) = A(I-K) + B(I)/C(I)
    END DO

C/C++:
    for (i = m; i <= n; i++) {
      a[i] = a[i-k] + b[i]/c[i];
    }

If K > N-M or K < M-N parallelisation is straightforward.
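For the first case, a hedged sketch (not in the original slides) of how the independent i loop could be parallelised with an OpenMP directive; the array size and values are illustrative:

    #include <stdio.h>

    #define N 8

    int main(void) {
        static double a[N][N], b[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[j][i] = 1.0; b[j][i] = 2.0; }

        /* Columns (index i) are independent of each other, so the outer loop can be
           distributed across threads; the j loop stays serial because iteration j
           depends on iteration j-1. */
        #pragma omp parallel for
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++)
                a[j][i] = a[j-1][i] + b[j][i];

        printf("a[N-1][N-1] = %f\n", a[N-1][N-1]);
        return 0;
    }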

Parallel computing models

Shared memory:
– Each processor has direct access to common physical memory (e.g. multi-processors, cluster nodes).
– Agent of parallelism: the thread (program = collection of threads).
– Threads exchange information implicitly by reading/writing shared variables.
– Programming standard: OpenMP.

Distributed memory:
– Memory is physically distributed, not common: local processor memory is invisible to all other processors and memory access is network-based (e.g. computer clusters).
– As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
– Agent of parallelism: the process (program = collection of processes).
– Exchanging information between processes requires communications.
– Programming standard: MPI.

OpenMP

http://www.openmp.org

API instructing the compiler what can be done in parallel (high-level programming). Consisting of:
– compiler directives
– runtime library functions
– environment variables

[Diagram: directives, environment variables and the runtime library sitting on top of threads in the operating system.]

– Supported by most compilers for Fortran and C/C++.
– Usable as serial code (threading ignored by serial compilation).
– By design, suited for data parallelism.

OpenMP

Execution model: threads are generated automatically at runtime and scheduled by the OS. Thread creation/destruction has an overhead: minimise the number of times parallel regions are entered/exited.

Before the parallel region only the master thread runs; inside it, all threads run; after it, the master thread only again.

Fortran syntax:
    !$OMP PARALLEL
      ... parallel region ...
    !$OMP END PARALLEL

C syntax:
    #pragma omp parallel
    {
      ... parallel region ...
    }
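A minimal C sketch (not from the slides) of this fork-join behaviour, assuming an OpenMP-capable compiler:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("before the parallel region: master thread only\n");

        /* Fork: a team of threads executes this block. */
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* Join: implicit barrier, then the master thread continues alone. */

        printf("after the parallel region: master thread only\n");
        return 0;
    }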

OpenMP - example

Objective: vectorise a loop, mapping the sin operation to vector x in parallel.
Idea: instruct the compiler on what to parallelise (the loop) and how (private and shared data) and let it do the hard work.

In C:
    #pragma omp parallel for shared(x, y, J) private(j)
    for (j = 0; j < J; j++) {
      y[j] = sin(x[j]);
    }

In Fortran:
    !$omp parallel do shared(x, y, J) private(j)
    do j = 1, J
      y(j) = sin(x(j))
    end do
    !$omp end parallel do

The number of threads is set by the environment variable OMP_NUM_THREADS or programmed using the RTL function omp_set_num_threads().
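For completeness, a self-contained version of the C fragment above (a sketch; the vector length and thread count are illustrative assumptions):

    #include <math.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { J = 1000 };              /* illustrative vector length */
        double x[J], y[J];
        for (int i = 0; i < J; i++) x[i] = 0.001 * i;

        omp_set_num_threads(4);         /* alternative to setting OMP_NUM_THREADS */

        int j;
        #pragma omp parallel for shared(x, y) private(j)
        for (j = 0; j < J; j++) {
            y[j] = sin(x[j]);
        }

        printf("computed %d elements with up to %d threads\n", J, omp_get_max_threads());
        return 0;
    }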

MPI - Message Passing Interface

http://www.mpi-forum.org/

MPI is a specification for a distributed-memory API designed by a committee for the Fortran, C and C++ languages.

Two versions:
– MPI 1.0, quickly and universally adopted (most used and useful)
– MPI 2.0, a superset of MPI 1.0 (adding parallel I/O, dynamic process management and direct remote memory operations), but not so popular.

Many implementations:
– open software (MPICH, MVAPICH, OpenMPI)
– vendor (HP/Platform, SGI MPT, Intel).

MPI implementation components

– Libraries covering the functionality specified by the standard.
– Header files, specifying interfaces, constants etc.
  – C/C++: mpi.h
  – Fortran: mpif.h
– Tools to compile and link MPI applications (wrappers around serial compilers)
  – Fortran: mpif77, mpif90
  – C: mpicc
  – C++: mpicxx, mpiCC
– An MPI application launcher (mapping processes to CPUs):
  mpirun -np n_processes executable

MPI - overview

– Processes (MPI tasks) are mapped to processors (CPU cores).
– Start/stop mechanisms:
  – MPI_Init() to initialise processes
  – MPI_Finalize() to finalise and clean up processes
– Communicators:
  – a communicator is a collection (network) of processes
  – the default is MPI_COMM_WORLD, which is always present and includes all processes requested by mpirun
  – only processes included in a communicator can communicate
– Identification mechanism:
  – process id: MPI_Comm_rank()
  – communicator size (number of processes): MPI_Comm_size()
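A minimal sketch (not from the slides) combining these calls:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);                    /* start the MPI processes       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id             */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* number of processes requested */

        printf("process %d of %d\n", rank, size);

        MPI_Finalize();                            /* clean up before exiting       */
        return 0;
    }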

MPI - communication

Inter-process communication (the cornerstone of MPI programming):
– one-to-one communication (send, receive)
– one-to-many communication (broadcast, scatter)
– many-to-one communication (gather)
– many-to-many communication (allgather)
– reduction, e.g. global sums, global max/min (a special many-to-one!)
– process synchronisation (barriers)

Typical usage patterns: domain decomposition, master-slave.
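As an illustration of a collective operation (a sketch, not from the slides), a global sum with MPI_Reduce; the per-process values are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes its own value; the reduction sums them on rank 0. */
        int local = rank + 1;
        int global_sum = 0;
        MPI_Reduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 1..%d = %d\n", size, global_sum);

        MPI_Finalize();
        return 0;
    }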

MPI - example

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int my_rank, numprocs;
        char message[100];
        int dest, source, tag = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        if (my_rank != 0) {
            sprintf(message, "Greetings from process %d!", my_rank);
            dest = 0;
            MPI_Send(message, sizeof(message), MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        } else {
            for (source = 1; source <= (numprocs - 1); source++) {
                MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
                printf("%s\n", message);
            }
        }
        MPI_Finalize();
        return 0;
    }

If the program is executed with two processes, the output is:
    Greetings from process 1!

If the program is executed with four processes, the output is:
    Greetings from process 1!
    Greetings from process 2!
    Greetings from process 3!

Distributed vs shared memory

Parallelisation:
– shared memory / OpenMP: easy, incremental (parallelising small parts of the code at a time); mostly parallelise loops.
– distributed memory / MPI: relatively difficult (tends to require an all-or-nothing approach); can be used in a wider range of contexts.

Scaling (hardware view):
– shared memory / OpenMP: both expensive (few vendors provide scalable solutions) and cheap (multicore workstations).
– distributed memory / MPI: relatively cheap (most vendors provide systems with 1000's of cores); runs on both shared and distributed systems.

Scaling (programming view):
– shared memory / OpenMP: small/simple programs are easy and fast to implement.
– distributed memory / MPI: even small/simple programs involve large programming complexity.

Maintainability:
– shared memory / OpenMP: code is relatively easy to understand and maintain.
– distributed memory / MPI: code is relatively difficult to understand.

Readability:
– shared memory / OpenMP: small increase in code size, readable code.
– distributed memory / MPI: tends to add a lot of extra coding for message handling; code readable with difficulty.

Debugging:
– shared memory / OpenMP: requires special compiler support; debuggers are extensions of serial ones.
– distributed memory / MPI: no special compiler support (just libraries); specialised debuggers.

Distributed vs shared memory paradigm

Which problems are suited to distributed memory processing?
– Embarrassingly parallel problems (independent tasks), e.g. Monte Carlo methods.
– Computation-bound problems (heavy local computation with little data exchange between processes):
  – models with localised data, e.g. PDEs solved using finite elements/volumes (CFD, CHMD, etc.)
  – other models with distributed data: molecular dynamics, etc.

Which problems are suited to shared memory processing?
– Communication-bound problems (much data shared between threads):
  – models with non-local data, e.g. Newtonian particle dynamics
  – Fourier transforms, convolutions.

Accelerators - motivation

Moore's Law (1965): the number of transistors in a CPU design doubles roughly every 2 years. Backed by clock speed increases, this correlated with exponentially increasing CPU performance for at least 40 years. This meant the same old (single-threaded) code just ran faster on newer hardware.

No more! While the "law" still holds, the clock frequency of general purpose CPUs was "frozen" in 2004 at around 2.5-3.0 GHz and design has gone multi-core. Performance improvements are now coming from the increase in the number of cores on a processor.

Accelerators – different philosophies

CPU: design optimised for sequential code and coarse-grained parallelism:
– multi-core
– sophisticated control logic unit
– large cache memories to reduce access latencies.

Accelerator: design optimised for numerically intensive computation via massive fine-grained parallelism:
– many cores (several hundreds)
– lightweight threads and high execution throughput
– large number of threads to overcome long-latency memory accesses.

Accelerators - examples

NVIDIA Tesla K20X GPU:
– 2688 cores
– 6 GB GDDR5 memory
– 250 GB/s memory bandwidth
– 3.95 Tflops/s of peak SP

Intel Xeon Phi 5110 MIC:
– 60 cores
– 8 GB GDDR5
– 320 GB/s memory bandwidth
– 240 HW threads (4 per core)
– 512-bit wide SIMD capability

Accelerators – programming model

Applications should use both the CPUs and the accelerator, where the latter is exploited as a coprocessor:
– Serial sections of the code are performed by the CPU (host).
– The parallel ones (those that exhibit a rich amount of data parallelism) are performed by the accelerator (device).
– Host and device have separate memory spaces: data must be transferred in a manner similar to "one-sided" message passing.

Several languages/APIs:
– GPU: CUDA, pyCUDA, OpenCL, OpenACC
– Xeon Phi: OpenMP, Intel TBB, Cilk

Example – CUDA

One thread per iteration.

DEVICE: the kernel increment_gpu runs on the accelerator.

HOST:
    void main()
    {
      ...
      cudaMemcpy(..., cudaMemcpyHostToDevice);
      increment_gpu<<<4,4>>>( ... );
      ...
    }

OpenACC

http://www.openacc-standard.org/

GPU directive-based API (corresponds to "OpenMP" for CPU parallel programming).
– Supported by CRAY and PGI (slightly different implementations, but converging) and soon GCC.
– "Easier" code development – supports incremental development.
– Possible performance loss – about 20% compared to CUDA.
– Can be "combined" with CUDA code.
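To illustrate the directive style (a minimal sketch, not from the slides; the loop, data clauses and array size are illustrative assumptions):

    #include <math.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        float x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = 0.001f * i;

        /* The directive asks the compiler to offload the loop to the accelerator,
           copying x in and y out; without an OpenACC compiler it is simply ignored. */
        #pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = sinf(x[i]);

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }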

Accelerators programming

– Accelerators are suitable for massively parallel algorithms and require low-level (architecture-bound) programming to achieve good performance.
– They can effectively help in reducing the time to solution. However, the effectiveness is strongly dependent on the algorithm and the amount of computation.
– The effort to get codes running efficiently on accelerators is, in general, big, irrespective of the programming model adopted. However, portability and maintainability of the code push towards directive-based approaches (at the expense of some performance).
– All the (suitable) computationally demanding parts of the code should be ported.
– Data transfer should be minimised or hidden. Host-device overlap is hard to achieve.

Hybrid parallel programming

Hybrid programming (MPI + OpenMP, MPI + CUDA) is a growing trend.
– The hybrid model: several SMP (Symmetric Multiprocessor) nodes connected by an interconnection network; on each node (at least) one MPI process is mapped, together with several OpenMP threads.
– Takes the positives of all models.
– Suits the memory hierarchy on "fat nodes" (nodes with large memory and many cores).
– Scope for better scaling than pure MPI (less inter-node communication) on modern clusters.
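A minimal hybrid sketch (not from the slides), assuming both MPI and OpenMP are available; MPI_THREAD_FUNNELED is one common choice of thread support level:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        /* Ask for an MPI library that tolerates threads; FUNNELED means only the
           master thread of each process makes MPI calls. */
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Typically one MPI process per node, several OpenMP threads inside each process. */
        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }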

Questions?
