An Overview of Parallel Computing
Rahman Tashakkori and Darren Greene
January 20, 2007
Department of Computer Science, Appalachian State University
A Consortium to Promote Computational Science and High Performance Computing

Most of the material and figures in this presentation have come from:
Reference - Lawrence Livermore National Lab, Parallel Computing (rallel comp/)
Basics of Parallel Programming
Parallel programming requires some understanding of parallel memory architectures and programming models.
It also requires some skill in designing parallel programs.
It often deals with converting existing serial programs to their parallelized versions.

Why Parallel Programming?
- Save time - do computations faster
- Solve larger, more complex problems
- Provide concurrency - do multiple things concurrently
- Take advantage of non-local resources - use available compute resources on a wide area network, or even the Internet, when local compute resources are limited.
- Cost savings - use multiple less expensive compute resources instead of paying for time on a supercomputer.
- Overcome memory constraints - single computers have very finite memory resources. For large problems, using the memories of multiple computers may overcome this obstacle.
Note: The cost of memory goes up very fast with its size.
Traditional Computation
- Programs are written for serial computation.
- These programs run on a single Central Processing Unit (CPU) on a single computer.
- A problem is broken into a discrete series of instructions.
- Instructions are executed one after another.
- Only one instruction may execute at any moment in time.
Why don't we build a very, very, very fast single-CPU computer to perform computations faster?
Limitations: transmission speed, hard to make it smaller, and not economical.

Traditional Computation
- Both program instructions and data are stored in memory.
- Program instructions are coded data telling the computer to do something.
- Data is the information to be used by the program.
- A central processing unit (CPU) gets instructions and/or data from memory, decodes the instructions, and then sequentially executes them.
Traditional Computation
What is the simplest way of achieving parallelization for the above task?
- Give many of the same computations to many CPUs
- Give the smaller instructions (t1, t2, ...) to different CPUs

Simple Parallel Computation
- Simultaneously use multiple compute resources to solve a computational problem.
- Run using multiple CPUs.
- Break the problem into discrete parts that can be solved concurrently on different CPUs.
- Further break down the smaller instructions.
- Execute instructions from each part simultaneously on different CPUs.
Question - Can all problems be parallelized?
Question - Can all problems be solved on a parallel machine?
One Possibility of Parallel Computation

Parallel Computing Compute Resources
- A single computer with multiple processors;
- An arbitrary number of computers connected by a network;
- A combination of both.
One of our 8-node clusters (UNCG Cluster)
When can we do it?
Parallel computing is possible or practical when:
- The problem at hand can be broken into discrete pieces that can be solved simultaneously
- The compute resource allows execution of multiple program instructions at any moment in time
- The problem is such that it can be solved in less time with multiple compute resources than with a single compute resource.
[Figure: water-container analogy - 10 gal, 100 gal, 10 gal]
Architectures
Parallelization can be done with respect to two independent dimensions: Instruction and Data.
So we may have:
- SISD - Single Instruction, Single Data
- SIMD - Single Instruction, Multiple Data
- MISD - Multiple Instruction, Single Data
- MIMD - Multiple Instruction, Multiple Data

SISD
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, until recently, the most prevalent form of computer architecture
Examples: most PCs, single-CPU workstations and mainframes
SIMD
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
- Synchronous (lockstep) and deterministic execution
- Two varieties: Processor Arrays and Vector Pipelines
Examples: Processor Arrays: Connection Machine CM-2, Maspar MP-1 and MP-2; Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
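To make the single-instruction/multiple-data idea concrete, here is a purely illustrative C sketch (not from the original slides; the array names and sizes are made up). The loop body is the "single instruction"; on a SIMD machine each processing element would apply that same operation to its own array element in lockstep during the same clock cycle.

/* Illustrative only: one operation, applied element-wise to many data items.
   On a SIMD machine each iteration maps to a separate processing element. */
#include <stdio.h>

#define N 8

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    for (int i = 0; i < N; i++)   /* same instruction ... */
        c[i] = a[i] + b[i];       /* ... different data element */

    for (int i = 0; i < N; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}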
MISD
- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
Some conceivable uses might be:
- multiple frequency filters operating on a single signal stream
- multiple cryptography algorithms attempting to crack a single coded message.
MIMD
- Currently, the most common type of parallel computer. Most modern computers fall into this category.
- Multiple Instruction: every processor may be executing a different instruction stream
- Multiple Data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer "grids", and multi-processor SMP computers - including some types of PCs.
Summary
[Figure: the four classifications - SISD, SIMD, MISD, MIMD]

Some Definitions
Task - A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.
Parallel Task - A task that can be executed by multiple processors safely (yields correct results).
Serial Execution - Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel tasks will have sections of a parallel program that must be executed serially.
Parallel Execution - Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time.
Some Definitions
Shared Memory - From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.
Non-shared model - Flow of file data from server to client.

Importance of Shared Memory in the Client-Server Model
In a non-shared model:
- The server reads the data from the input file, requiring a copy from the kernel into the server's buffer.
- The server writes the data to the IPC channel, copying it from the server's buffer back into the kernel.
- The client reads the data from the IPC channel, normally requiring the data to be copied from the kernel to the process.
- Finally, the data is copied from the client's buffer, the second argument to the write function, to the output file.
A total of four copies of the data are normally required. Additionally, these four copies are done between the kernel and a process, often an expensive copy (more expensive than copying data within the kernel, or copying data within a single process). The figure shown on the previous page depicts this movement of the data between the client and server, through the kernel.
Some Definitions
Distributed Memory - In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Copying file data from server to client using shared memory
- The server gets access to a shared memory object using (say) a semaphore.
- The server reads from the input file into the shared memory object. The second argument to the read, the address of the data buffer, points into the shared memory object.
- When the read is complete, the server notifies the client, using a semaphore.
- The client writes the data from the shared memory object to the output file.
As the figure shows, the data is copied only twice - from the input file into shared memory and from shared memory to the output file. One dashed box encloses the client and the shared memory object, and another dashed box encloses the server and the shared memory object, to reinforce that the shared memory object appears in the address space of both the client and the server.
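A minimal sketch of the shared-memory mechanism described above, using the POSIX shm_open/mmap interface. This is not from the original slides: the object name and size are arbitrary, error handling is omitted, and the semaphore handshaking between server and client is left out for brevity.

/* Create a shared memory object and map it, so that two cooperating
   processes can read and write the same physical pages directly.
   Object name and size are arbitrary; error checks omitted. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

#define SHM_NAME "/demo_shm"   /* hypothetical object name */
#define SHM_SIZE 4096

int main(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);

    /* Both server and client would mmap the same object; data written
       by one process is visible to the other without extra copies. */
    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    strcpy(buf, "data placed directly in shared memory");

    munmap(buf, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}

On Linux this typically links with -lrt; as the slide notes, a semaphore (or similar synchronization) is still needed so the client does not read before the server has finished writing.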
Some Definitions
Communications - Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
Synchronization - The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task (or tasks) reaches the same or a logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall-clock execution time to increase.
Demo both.

Some Definitions
Granularity - In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
- Coarse: relatively large amounts of computational work are done between communication events
- Fine: relatively small amounts of computational work are done between communication events.
Demo.
Speedup - Observed speedup of a code which has been parallelized, defined as:
speedup = (wall-clock time of serial execution) / (wall-clock time of parallel execution)
Parallel Overhead - The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
1) Task start-up time
2) Synchronizations
3) Data communications
4) Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
5) Task termination time
Some Definitions
Massively Parallel - Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently we seem to be talking about six digits.
Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:
- Hardware - particularly memory-CPU bandwidths and network communications
- Application algorithm
- Parallel overhead
- Characteristics of your specific application and coding

Common Parallel Computer Memory Architectures
Shared Memory Architecture
Distributed Memory Architecture
Common Parallel Computer Memory Architectures
Hybrid Distributed-Shared Memory

Parallel Programming Models
Parallel programming models exist as an abstraction above hardware and memory architectures. There are several parallel programming models in common use:
- Shared Memory
- Threads
- Message Passing
- Data Parallel
- Hybrid
Message Passing Model
- A set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
- Tasks exchange data through communications by sending and receiving messages.
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

Implementation of the Message Passing Model
- From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.
- Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
- In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
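As a minimal sketch of the cooperative send/receive style described above (not taken from the original slides), here is a small MPI program in C: rank 0 sends an array, and rank 1 posts the matching receive.

/* Minimal MPI point-to-point sketch: rank 0 sends, rank 1 receives.
   Compile with an MPI wrapper (e.g., mpicc) and run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* the send must be matched by a receive on the other task */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}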
Implementation of the Message Passing Model
- Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at: www.mcs.anl.gov/Projects/mpi/standard.html.
- MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. Most, if not all, of the popular parallel computing platforms offer at least one implementation of MPI. A few offer a full implementation of MPI-2.
- For shared memory architectures, MPI implementations usually don't use a network for task communications. Instead, they use shared memory (memory copies) for performance reasons.

Data Parallel Model
- Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
- A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
- Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
Data Parallel Model
On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.

Implementation of Data Parallelization
- Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs.
- The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler.
- Fortran 90 and 95 (F90, F95)
- High Performance Fortran (HPF)
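As a small illustration of the "add 4 to every array element" example above (written in C rather than the Fortran dialects just listed, and with made-up array and task counts): every task runs the same code on its own partition of the array. Here the tasks are simulated by an outer loop; in a real data-parallel program each partition would be processed concurrently.

/* Sketch of the data-parallel idea: the same operation, "add 4 to every
   element", applied by each task to its own block of the array. */
#include <stdio.h>

#define N 16
#define NTASKS 4

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    int chunk = N / NTASKS;               /* block partitioning */
    for (int task = 0; task < NTASKS; task++) {
        int start = task * chunk;
        int end   = start + chunk;
        for (int i = start; i < end; i++)
            a[i] += 4;                    /* same operation, own partition */
    }

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}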
Understanding Parallel Programming
Whether the intention of your work is to parallelize an existing program or to start a new parallel program, you have to make sure that the problem has a solution that is parallelizable.

Non-parallelizable
Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) using the formula:
F(k+2) = F(k+1) + F(k)
Why is this not parallelizable?
This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones. The calculation of the k+2 value uses those of both k+1 and k. These three terms cannot be calculated independently and therefore not in parallel.
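A short illustrative snippet (not from the slides) makes the dependence visible in code: each new term cannot be computed until the two preceding terms exist, so the iterations cannot run independently.

/* Each iteration reads the results of the two previous iterations,
   so the iterations cannot execute independently (and thus not in parallel). */
#include <stdio.h>

int main(void) {
    long f[20] = {1, 1};
    for (int k = 0; k + 2 < 20; k++)
        f[k + 2] = f[k + 1] + f[k];   /* depends on both earlier results */
    for (int k = 0; k < 20; k++)
        printf("%ld ", f[k]);
    printf("\n");
    return 0;
}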
Understanding Parallel Programming
Identify the program's hotspots:
- Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places.
- Profilers and performance analysis tools can help here.
- Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.

Understanding Parallel Programming
Identify bottlenecks in the program:
- Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down.
- If possible, restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas.
Understanding Parallel Programming
- Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
- Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.

Partitioning for Parallel Programming
- One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.
- There are two basic ways to partition computational work among parallel tasks: Domain Decomposition (dealing with the data) and Functional Decomposition (dealing with the problem).
Partitioning for Parallel Programming
Domain Decomposition
In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data. There are different ways to partition the data: by block or by cycle (a small sketch of both appears after the checklist below). Example: images in image processing.
Functional Decomposition
In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

Important things to consider
There are a number of important factors to consider when designing your program's inter-task communications:
- Cost of communications
- Latency vs. bandwidth
- Visibility of communications
- Synchronous vs. asynchronous communications
- Scope of communications
- Efficiency of communications
- Overhead and complexity
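Returning to the two ways of assigning data mentioned under domain decomposition, here is a hypothetical sketch (array size and task count chosen only for illustration) of how block and cyclic partitioning divide the same index range among tasks.

/* Sketch: two common domain decompositions of an N-element array
   over NTASKS tasks. Values are arbitrary. */
#include <stdio.h>

#define N 12
#define NTASKS 3

int main(void) {
    /* Block: each task owns one contiguous chunk of the index range. */
    int chunk = N / NTASKS;
    for (int t = 0; t < NTASKS; t++)
        printf("block : task %d owns indices %d..%d\n",
               t, t * chunk, (t + 1) * chunk - 1);

    /* Cyclic: indices are dealt out like cards, one per task in turn. */
    for (int t = 0; t < NTASKS; t++) {
        printf("cyclic: task %d owns indices", t);
        for (int i = t; i < N; i += NTASKS)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}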
Synchronization
Types of synchronization:
- Barrier
- Lock / semaphore
- Synchronous communication operations

Designing Parallel Programs
We have to pay close attention to the following:
- A dependence exists between program statements when the order of statement execution affects the results of the program.
- A data dependence results from multiple uses of the same location(s) in storage by different tasks.
- Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
Designing Parallel Programs
- Loop-carried data dependence
      DO 500 J = MYSTART, MYEND
         A(J) = A(J-1) * 2.0
  500 CONTINUE
The value of A(J-1) must be computed before the value of A(J); therefore A(J) exhibits a data dependency on A(J-1). Parallelism is inhibited.
- Loop-independent data dependence
      task 1          task 2
      ------          ------
      X = 2           X = 4
        .               .
      Y = X**2        Y = X**3
Parallelism is inhibited. Why?

Designing Parallel Programs
Although all data dependencies are important to identify when designing parallel programs, loop-carried dependencies are particularly important since loops are possibly the most common target of parallelization efforts.
Designing Parallel Programs
How to handle data dependencies:
- Distributed memory architectures - communicate required data at synchronization points.
- Shared memory architectures - synchronize read/write operations between tasks.

Designing Parallel Programs
Load Balancing:
- Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time. It can be considered a minimization of task idle time.
- Load balancing is important to parallel programs for performance reasons. For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance.
Designing Parallel Programs
Load Balancing:
- Equally partition the work each task receives
- Use dynamic work assignment

Designing Parallel Programs
Granularity: Computation / Communication Ratio
- Fine-grain parallelism
- Coarse-grain parallelism
- Best practices
Fine-grain Parallelism
- Relatively small amounts of computational work are done between communication events
- Low computation-to-communication ratio
- Facilitates load balancing
- Implies high communication overhead and less opportunity for performance enhancement
- If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.

Coarse-grain Parallelism
- Relatively large amounts of computational work are done between communication/synchronization events
- High computation-to-communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently
Best Granularity Practices
- The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs.
- In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity.
- Fine-grain parallelism can help reduce overheads due to load imbalance.

Limits and Costs of Parallel Programming
Amdahl's Law: Potential program speedup is defined by the fraction of code (P) that can be parallelized:
speedup = 1 / (1 - P)
A modified version includes the number of processors:
speedup = 1 / (P/N + S)
where P = parallel fraction, N = number of processors, and S = serial fraction.
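As an illustrative calculation (the numbers are chosen only for this example): suppose 90% of the code can be parallelized (P = 0.90, S = 0.10) and 8 processors are used. Then
speedup = 1 / (P/N + S) = 1 / (0.90/8 + 0.10) = 1 / 0.2125, or about 4.7,
and no matter how many processors are added, the speedup can never exceed 1 / (1 - P) = 1 / 0.10 = 10.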
Measure of Complexity of a Program
The costs of complexity are measured in programmer's time in virtually every aspect of the software development cycle:
- Design
- Coding
- Debugging
- Tuning
- Maintenance

Example
Improving the intensity of an image by multiplying all pixels by a constant larger than one:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        f[i][j] = c * f[i][j];   /* c > 1 is the intensity constant */
    }
}
The computation for each pixel is independent of the others. This computation must be computationally intensive to make the parallelization worth the effort.
One possible parallelization:
/* Task 1 handles rows start1 .. end1-1 */
for (int i = start1; i < end1; i++) {
    for (int j = 0; j < M; j++) {
        f[i][j] = c * f[i][j];
    }
}
/* Task 2 handles rows start2 .. end2-1 */
for (int i = start2; i < end2; i++) {
    for (int j = 0; j < M; j++) {
        f[i][j] = c * f[i][j];
    }
}
[Figure: an 8-by-8 image partitioned among Task 1, Task 2, and Task 3]

Some Quick and Valuable Resources:
- Dr. Li's lecture material for the Parallel Computing course (next section)
- The Lawrence Livermore Parallel Computing Tutorial
- Barry Wilkinson's Parallel Computing book