
Scheduling Multithreaded Computations by Work Stealing

Robert D. Blumofe, The University of Texas at Austin
Charles E. Leiserson, MIT Laboratory for Computer Science

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T_1/P + O(T_∞), where T_1 is the minimum serial execution time of the multithreaded computation and T_∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S_1 P, where S_1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(P T_∞ (1 + n_d) S_max), where S_max is the size of the largest activation record of any thread and n_d is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

This research was supported in part by the Advanced Research Projects Agency under Contract N00014-94-1-0985. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.

1 Introduction

For efficient execution of a dynamically growing "multithreaded" computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large.

Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs [16] and Halstead's implementation of Multilisp [30]. These authors point out the heuristic benefits of work stealing with regards to space and communication. Since then, many researchers have implemented variants on this strategy [11, 21, 23, 29, 34, 37, 46]. Rudolph, Slivkin-Allalouf, and Upfal [43] analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang [33] analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski [48] have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling "fully strict" (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations [33, 48] and divide-and-conquer computations [47], as well as dataflow computations [2] in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model similar to the atomic message-passing model of [36] in which concurrent accesses to the same data structure are serially queued by an adversary.

Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T_1/P + O(T_∞), where T_1 is the minimum serial execution time of the multithreaded computation and T_∞ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most S_1 P, where S_1 is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers [10], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [10]. We also prove that the expected total communication of the execution is at most O(P T_∞ (1 + n_d) S_max), where S_max is the size of the largest activation record of any thread and n_d is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung [47] for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication.
Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.
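
To give a feel for how a work-stealing scheduler operates, here is a minimal, illustrative Python sketch. It follows the deque discipline described informally above and in Section 4 (each processor works at the bottom of its own deque of ready threads, and an idle processor steals from the top of a randomly chosen victim's deque), but the class and method names are ours, and the handling of spawns, stalls, and activation frames is omitted; it is a sketch of the idea, not the algorithm as analyzed in this paper.

```python
import random
from collections import deque

class Processor:
    """Toy model of one processor and its deque of ready threads."""
    def __init__(self, pid):
        self.pid = pid
        self.ready = deque()

    def push(self, thread):
        self.ready.append(thread)        # owner pushes and pops at the bottom

    def pop(self):
        return self.ready.pop() if self.ready else None

    def steal_from(self, victim):
        # A thief removes work from the top (oldest end) of the victim's deque.
        return victim.ready.popleft() if victim.ready else None

def run_step(procs):
    """One synchronous step: each processor works locally or attempts one random steal."""
    for p in procs:
        thread = p.pop()
        if thread is None:
            victim = random.choice([q for q in procs if q is not p])
            thread = p.steal_from(victim)    # idle processors take the initiative
        if thread is not None:
            thread()                          # execute one unit of work

if __name__ == "__main__":
    procs = [Processor(i) for i in range(4)]
    for j in range(8):                        # seed all work on processor 0
        procs[0].push(lambda j=j: print("executed unit", j))
    for _ in range(5):
        run_step(procs)
```

Note that as long as every processor's deque is nonempty, no thread migrates at all, which is the intuition behind the low communication cost of work stealing.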

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind [19] and Ruggiero and Sargeant [44] give heuristics for limiting the space required by dataflow programs. Burton [14] shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton [15] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2 we review the graph-theoretic model of multithreaded computations introduced in [10], which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This "busy-leaves" algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5 we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial "balls and bins" game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument [41] in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system [8, 25], as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in [10]. We also define what it means for computations to be "fully strict." We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent [13] and Graham [27, 28] on dag scheduling.

A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread Γ_5 of this example contains 3 instructions: v_10, v_11, and v_12. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A P-processor execution schedule for a multithreaded computation determines which processors of a P-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number P of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires.

Figure 1: A multithreaded computation. This computation contains 23 instructions v_1, v_2, ..., v_23 and 6 threads Γ_1, Γ_2, ..., Γ_6.

In this way, threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread Γ_1 has two children, Γ_2 and Γ_6, and thread Γ_2 has three children, Γ_3, Γ_4, and Γ_5. Threads Γ_3, Γ_4, Γ_5, and Γ_6, which have no children, are leaf threads.

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge (v_6, v_7), instruction v_7 cannot be executed until after the spawning instruction v_6. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.

An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be consumed by another instruction. Such a producer/consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue; the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution; the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In implementation, resolution and detection can be accomplished using mechanisms such as join counters [8], futures [30], or I-structures [2].

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it.

This assumption is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall. It continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic. That is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.

A given multithreaded program when run on a given input can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.
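
To make the dag model concrete, the following Python sketch (our own illustration, not code from the paper) represents instructions and their dependency edges and implements the readiness test just defined: an instruction is ready once all of its predecessors in the dag have executed. The instruction names and edges form a small hypothetical fragment, not the actual Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class Instruction:
    name: str
    preds: set = field(default_factory=set)   # predecessors via continue, spawn, or join edges

class ComputationDag:
    """Toy representation of a multithreaded computation as a dag of instructions."""
    def __init__(self):
        self.instructions = {}

    def add(self, name):
        self.instructions[name] = Instruction(name)

    def edge(self, u, v):
        # Dependency edge u -> v: v may not execute until u has executed.
        self.instructions[v].preds.add(u)

    def ready(self, name, executed):
        """An instruction is ready if it has not executed but all its predecessors have."""
        return name not in executed and self.instructions[name].preds <= executed

# Hypothetical fragment: a parent thread spawns a child, and the child's last
# instruction has a join edge back into the parent.
dag = ComputationDag()
for v in ["p1", "p2", "p3", "p4", "c1", "c2"]:
    dag.add(v)
dag.edge("p1", "p2")   # continue edge within the parent thread
dag.edge("p2", "p3")   # continue edge within the parent thread
dag.edge("p3", "p4")   # continue edge within the parent thread
dag.edge("p2", "c1")   # spawn edge from the parent to the child's first instruction
dag.edge("c1", "c2")   # continue edge within the child thread
dag.edge("c2", "p4")   # join edge from the child back to its parent

executed = {"p1", "p2"}
print(dag.ready("c1", executed))   # True: its only predecessor p2 has executed
print(dag.ready("p4", executed))   # False: predecessors p3 and c2 have not executed
```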
Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently [10], we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at Γ_2 is the spawn edge (v_2, v_3). Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a "well-structured" computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root.
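
The strictness conditions amount to simple structural checks on the spawn tree and the join edges. The sketch below (our own illustration with made-up thread names, summarizing join edges at thread granularity) tests whether every join edge goes to an ancestor of the originating thread (strict) or to its parent (fully strict).

```python
def ancestors(parent, t):
    """Proper ancestors of thread t in the spawn tree, given a child -> parent map."""
    out = []
    while parent.get(t) is not None:
        t = parent[t]
        out.append(t)
    return out

def is_strict(parent, join_edges):
    # Strict: every join edge from a thread goes to an ancestor of that thread.
    return all(dst in ancestors(parent, src) for src, dst in join_edges)

def is_fully_strict(parent, join_edges):
    # Fully strict: every join edge from a thread goes to the thread's parent.
    return all(parent.get(src) == dst for src, dst in join_edges)

# Hypothetical spawn tree (child -> parent), loosely in the shape of Figure 1.
spawn_tree = {"A": None, "B": "A", "F": "A", "C": "B", "D": "B", "E": "B"}

# Join edges recorded as (thread the edge leaves, thread it enters).
joins_to_parents = [("C", "B"), ("D", "B"), ("E", "B"), ("B", "A"), ("F", "A")]
joins_with_grandparent = joins_to_parents + [("E", "A")]

print(is_strict(spawn_tree, joins_to_parents), is_fully_strict(spawn_tree, joins_to_parents))
# True True
print(is_strict(spawn_tree, joins_with_grandparent), is_fully_strict(spawn_tree, joins_with_grandparent))
# True False: a join to a grandparent is strict but not fully strict
```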

The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation [5].

We quantify and bound the execution time of a computation on a P-processor parallel computer in terms of the computation's "work" and "critical-path length." We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let T(X) denote the time to execute the computation using P-processor execution schedule X, and let T_P = min_X T(X) denote the minimum execution time with P processors, the minimum being taken over all P-processor execution schedules for the computation. Then T_1 is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and T_∞ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have T_P ≥ T_1/P, because P processors can execute only P instructions per time step, and of course, we must have T_P ≥ T_∞.

Early work on dag scheduling by Brent [13] and Graham [27, 28] shows that there exist P-processor execution schedules X with T(X) ≤ T_1/P + T_∞. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in [10, 20], extends these results minimally to show that this upper bound on T_P can be obtained by greedy schedules: those in which at each step of the execution, if at least P instructions are ready, then P instructions execute, and if fewer than P instructions are ready, then all execute.

Theorem 1 (The greedy-scheduling theorem) For any multithreaded computation with work T_1 and critical-path length T_∞, and for any number P of processors, any greedy P-processor execution schedule X achieves T(X) ≤ T_1/P + T_∞.

Generally, we are interested in schedules that achieve linear speedup, that is, T(X) = O(T_1/P). For a greedy schedule, linear speedup occurs when the parallelism, which we define to be T_1/T_∞, satisfies T_1/T_∞ = Ω(P).

To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by S_1 the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let S(X) denote the space used by a P-processor execution schedule X of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, S(X) = O(S_1 P), which is existentially optimal to within a constant factor [10].
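
As a concrete check on these definitions, the short Python sketch below (our own illustration on a made-up dag, not the dag of Figure 1) computes the work T_1 as the number of unit-time instructions, the critical-path length T_∞ as the number of instructions on a longest directed path, and the greedy-schedule bound T_1/P + T_∞ for a few processor counts.

```python
from functools import lru_cache

# A small made-up dag: instruction -> list of successor instructions.
dag = {
    "a": ["b", "e"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
    "e": ["f"],
    "f": ["d"],
}

def work(dag):
    """Work T_1: the total number of unit-time instructions."""
    return len(dag)

def critical_path_length(dag):
    """Critical-path length T_inf: instructions on a longest directed path in the dag."""
    @lru_cache(maxsize=None)
    def longest_from(v):
        succs = dag[v]
        return 1 + (max(longest_from(w) for w in succs) if succs else 0)
    return max(longest_from(v) for v in dag)

t1 = work(dag)                    # 6 instructions
tinf = critical_path_length(dag)  # 4 instructions, e.g. along a -> b -> c -> d
for p in (1, 2, 4):
    print(f"P={p}: any greedy schedule finishes within T_1/P + T_inf = {t1 / p + tinf:.2f} steps")
```

For the paper's example computation, with work T_1 = 23 and critical-path length T_∞ = 10, the same bound guarantees that a greedy schedule on P = 4 processors completes within 23/4 + 10 = 15.75 steps.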

3 The busy-leaves property

Once a thread Γ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at Γ, even if no other progress is made on other parts of the computation. In other words, from the time the thread Γ is spawned until the time Γ dies, there is always at least one thread from the subcomputation rooted at Γ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves "busy." By combining this "busy-leaves" property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and at most linear expansion of space.

