Dynamo: A Transparent Dynamic Optimization System


Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia*
Hewlett-Packard Labs, 1 Main Street, Cambridge, MA

*The author is presently with InCert Corporation, Cambridge, MA.

Abstract

We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.

1. Introduction

Recent trends in software and hardware technologies appear to be moving in directions that are making traditional performance delivery mechanisms less effective. The use of object-oriented languages and techniques in modern software development has resulted in a greater degree of delayed binding, limiting the size of the scope available for static compiler analysis. Shrink-wrapped software is being shipped as a collection of DLLs rather than a single monolithic executable, making whole-program optimization at static compile time virtually impossible. Even in cases where powerful static compiler optimizations can be applied, computer system vendors have to rely on the ISV (independent software vendor) to enable them. This puts computer system vendors in the uncomfortable position of not being able to control the very keys that unlock the performance potential of their own machines. More importantly, the use of dynamic code generation environments (like Java JITs and dynamic binary translators) makes the applicability of heavyweight static compiler optimization techniques impractical. Meanwhile, on the hardware side, technology is moving toward offloading more complexity from the hardware logic to the software compiler, as evidenced by the CISC to RISC to VLIW progression.

The problem with this trend is that the static compiler is taking on an increasingly greater performance burden while the obstacles to traditional static compiler analysis are continuing to increase. This will inevitably lead to either very complex compiler software that provides only modest performance gains on general purpose applications, or highly customized compilers that are tailored for very narrow classes of applications.

The Dynamo project was started in 1996 to investigate a technology that can complement the static compiler's traditional strength as a static performance improvement tool with a novel dynamic performance improvement capability [3]. In contrast to the static compiler, Dynamo offers a client-side performance delivery mechanism that allows computer system vendors to provide some degree of machine-specific performance without the ISV's involvement.

Dynamo is a dynamic optimization system (i.e., the input is an executing native instruction stream), implemented entirely in software. Its operation is transparent: no preparatory compiler phase or programmer assistance is required, and even legacy native binaries can be dynamically optimized by Dynamo. Because Dynamo operates at runtime, it has to focus its optimization effort very carefully. Its optimizations have to not only improve the executing native program, but also recoup the overhead of Dynamo's own operation.

The input native instruction stream to Dynamo can come from a statically prepared binary created by a traditional optimizing compiler, or it can be dynamically generated by an application such as a JIT. Clearly, the runtime performance opportunities available to Dynamo can vary significantly depending on the source of this input native instruction stream. The experiments reported in this paper only discuss the operation of Dynamo in the more challenging situation of accelerating the execution of a statically optimized native binary. The performance data presented here thus serve as an indicator of the limits of the Dynamo system, rather than its potential. The data demonstrate that even in this extreme test case, Dynamo manages to speed up many applications, and comes close to breaking even in the worst case.

Section 2 gives an overview of how Dynamo works. The following sections highlight several key innovations of the Dynamo system. Section 3 describes Dynamo's startup mechanism, Section 4 gives an overview of the hot code selection, optimization and code generation process, Section 5 describes how different optimized code snippets are linked together, Section 6 describes how the storage containing the dynamically optimized code is managed, and Section 7 describes signal handling. Finally, Section 8 summarizes the experimental data to evaluate Dynamo's performance. Dynamo is a complex system that took several years to engineer. This paper only provides an overview of the whole system. Further details are available in [2] and on the Dynamo project website (www.hpl.hp.com/cambridge/projects/Dynamo).

[Figure 1. How Dynamo works. Flowchart of Dynamo's control loop: (A) interpret until taken branch; (B) look up branch target in fragment cache (hit/miss); (C) start-of-trace condition?; (D) increment counter associated with branch target address; (E) counter value exceeds hot threshold?; (F) jump to top of fragment in the fragment cache; (G) interpret + codegen until taken branch; (H) end-of-trace condition?; (I) create new fragment and optimize it; (J) emit into cache, link with other fragments and recycle the associated counter; (K) signal handler for O/S signals; a context switch separates the fragment cache from Dynamo's loop.]

2. Overview

From a user's perspective, Dynamo looks like a PA-8000 software interpreter that itself runs on a PA-8000 processor (the hardware interpreter). Interpretation allows Dynamo to observe execution behavior without having to instrument the application binary. Since software interpretation is much slower than direct execution on the processor, Dynamo only interprets the instruction stream until a "hot" instruction sequence (or trace) is identified. At that point, Dynamo generates an optimized version of the trace (called a fragment) into a software code cache (called the fragment cache). Subsequent encounters of the hot trace's entry address during interpretation will cause control to jump to the top of the corresponding cached fragment. This effectively suspends the interpreter and allows the cached code to execute directly on the processor without incurring any further interpretive overhead. When control eventually exits the fragment cache, Dynamo resumes interpreting the instruction stream, and the process repeats itself.

Figure 1 illustrates this flow of control in more detail. Dynamo starts out by interpreting the input native instruction stream until a taken branch is encountered (A). If the branch target address corresponds to the entry point of a fragment already in the fragment cache (B), control jumps to the top of that fragment, effectively suspending Dynamo, and causing execution of the cached fragments to occur directly on the underlying processor (F). Otherwise, if the branch target satisfies a "start-of-trace" condition (C), a counter associated with the target address is incremented (D). Our current prototype defines start-of-trace as targets of backward taken branches (likely loop headers) and fragment cache exit branches (exits from previously identified hot traces). If the counter value exceeds a preset hot threshold (E), the interpreter toggles state and goes into "code generation mode" (G). When interpreting in this mode, the native instruction sequence being interpreted is recorded in a hot trace buffer, until an "end-of-trace" condition is reached (H). At that point the hot trace buffer is processed by a fast, lightweight optimizer (I) to create an optimized single-entry, multi-exit, contiguous sequence of instructions called the fragment (1). Our current prototype defines end-of-trace as backward taken branches or taken branches whose targets correspond to fragment entry points in the fragment cache (i.e., fragment cache hits). A trace may also be truncated if its length exceeds a certain number of instructions. The fragment generated by the optimizer is emitted into the fragment cache by a linker (J), which also connects fragment exit branches to other fragments in the fragment cache if possible. Connecting fragments together in this manner minimizes expensive fragment cache exits to the Dynamo interpretive loop. The new fragment is tagged with the application binary address of the start-of-trace instruction.

(1) A fragment is similar to a superblock, except for the fact that it is a dynamic instruction sequence, and can cross static program boundaries like procedure calls and returns.

As execution proceeds, the application's working set gradually materializes in the fragment cache, and the Dynamo overhead (time spent in the Dynamo interpretive loop / time spent executing in the fragment cache) begins to drop. Assuming that the majority of an application's execution time is typically spent in a small portion of its code, the performance benefits from repeated reuse of the optimized fragments can be sufficient to offset the overhead of Dynamo's operation. On the SpecInt95 benchmarks, the average Dynamo overhead is less than 1.5% of execution time. Dynamo's interpreter-based hot trace selection process (A-H) dominates this overhead, with the optimizer and linker components (I, J) contributing a relatively insignificant amount.
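The control flow of Figure 1 can be summarized as an interpreter dispatch loop. The C sketch below is illustrative only: the helper names (interpret_until_taken_branch, lookup_fragment, execute_fragment, record_optimize_and_emit), the AppContext type and the threshold value are hypothetical stand-ins for Dynamo's internals, which operate on PA-RISC machine state rather than a simplified context structure.

    #include <stdint.h>
    #include <stddef.h>

    #define HOT_THRESHOLD 50   /* hypothetical preset hot threshold (box E) */

    typedef struct Fragment   Fragment;    /* optimized code cached in the fragment cache */
    typedef struct AppContext AppContext;  /* saved application register state (app-context) */

    /* Hypothetical helpers standing in for Dynamo internals. */
    extern uintptr_t interpret_until_taken_branch(AppContext *ctx, uintptr_t pc); /* box A */
    extern Fragment *lookup_fragment(uintptr_t target);                           /* box B */
    extern uintptr_t execute_fragment(Fragment *f, AppContext *ctx);              /* box F */
    extern int       is_start_of_trace(uintptr_t target);                         /* box C */
    extern unsigned *counter_for(uintptr_t target);                               /* box D */
    extern uintptr_t record_optimize_and_emit(AppContext *ctx, uintptr_t start);  /* boxes G-J */

    void dynamo_main_loop(AppContext *ctx, uintptr_t pc)
    {
        for (;;) {
            /* (A) Interpret from pc until a taken branch; get its target address. */
            pc = interpret_until_taken_branch(ctx, pc);

            /* (B) Is the target already the entry point of a cached fragment? */
            Fragment *f = lookup_fragment(pc);
            if (f != NULL) {
                /* (F) Execute cached, optimized code directly on the processor;
                   when control exits the fragment cache, resume interpreting at
                   the returned exit target. */
                pc = execute_fragment(f, ctx);
                continue;
            }

            /* (C)/(D) Count executions of potential trace heads only. */
            if (is_start_of_trace(pc)) {
                unsigned *ctr = counter_for(pc);
                if (++*ctr > HOT_THRESHOLD) {
                    /* (E)-(J) Hot: record the tail while interpreting, optimize it,
                       emit the fragment, and continue after the trace end. */
                    pc = record_optimize_and_emit(ctx, pc);
                }
            }
        }
    }

Because the linker (J) connects fragment exit branches directly to other fragments, this loop is entered only on fragment cache misses once the working set has materialized, which is why the measured overhead on SpecInt95 stays below 1.5%.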

3. Startup and Initialization

Dynamo is provided as a user-mode dynamically linked library (shared library). The entry point into this library is the routine dynamo_exec. When dynamo_exec is invoked by an application, the remainder of the application code after return from the dynamo_exec call will execute under Dynamo control.

As outlined in Figure 2, dynamo_exec first saves a snapshot of the application's context (i.e., the machine registers and stack environment) to an internal app-context data structure. It then swaps the stack environment so that Dynamo's own code uses a custom runtime stack allocated separately for its use. Dynamo's operation thus does not interfere with the runtime stack of the application running on it. The interpreter (box A in Figure 1) is eventually invoked with the return-pc corresponding to the application's dynamo_exec call. The interpreter starts interpreting the application code from this return-pc, using the context saved in app-context. The interpreter never returns to dynamo_exec (unless a special bailout condition occurs, which is discussed later), and Dynamo has gained control over the application. From this point onwards, an application instruction is either interpreted, or a copy of it is executed in the fragment cache. The original instruction is never executed in place the way it would have been if the application were running directly on the processor.

    Application crt0 code:
      ...                                   // app runs natively
      push stack frame;
      spill caller-save regs;
      call dynamo_exec;
      restore caller-save regs;
      pop stack frame;
      ...                                   // app runs under Dynamo

    Dynamo library code:
      dynamo_exec:
        save callee-save regs to app-context;
        copy caller-save regs from stack frame to app-context;
        save stackptr to app-context;
        return-pc = value of link reg;
        swap Dynamo & application stack;    // stackptr now points to Dynamo stack
        initialize internal data structures;
        call interpreter(return-pc, app-context);
        // control does not return here!

    Figure 2. How Dynamo gains control over the application

We provide a custom version of the execution startup code crt0.o that checks to see if the Dynamo library is installed on the system and, if it is, invokes dynamo_start prior to the jump to start (the application's main entry point). Application binaries that are linked with this version of crt0.o will transparently invoke Dynamo if Dynamo is installed on the system; otherwise they will execute normally. The application binary itself remains unchanged whether or not it is run under Dynamo. This strategy allows Dynamo to preserve the original mapping of the application's text segment, a key requirement for transparent operation.

As part of the initialization done in dynamo_exec prior to actually invoking the interpreter, Dynamo mmaps a separate area of memory that it manages itself. All dynamically allocated objects in Dynamo code are created in this area of memory. Access to this area is protected to prevent the application from inadvertently or maliciously corrupting Dynamo's state.
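The paper states that Dynamo mmaps a private memory area for its own allocations and protects it against application accesses, but it does not spell out the mechanism. The following is a minimal sketch of one plausible realization using standard POSIX calls; the arena size, the protection-toggling points, and the use of mprotect to drop write permission while application code runs are all assumptions, not Dynamo's documented implementation.

    #include <sys/mman.h>
    #include <stdlib.h>
    #include <stddef.h>

    #define DYNAMO_ARENA_BYTES (4u * 1024u * 1024u)    /* hypothetical arena size */

    static void  *dynamo_arena;                        /* Dynamo-private memory */
    static size_t dynamo_arena_size = DYNAMO_ARENA_BYTES;

    /* Reserve a private, anonymous region for all of Dynamo's own allocations,
       separate from the application's heap and stack. */
    void dynamo_arena_init(void)
    {
        dynamo_arena = mmap(NULL, dynamo_arena_size,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (dynamo_arena == MAP_FAILED)
            abort();                                   /* cannot run without it */
    }

    /* Drop write permission while application code runs, so a stray or malicious
       application store into Dynamo's state faults instead of corrupting it.
       (The exact protection scheme is an assumption; the paper only says the
       area is protected.) */
    void dynamo_arena_protect(void)
    {
        mprotect(dynamo_arena, dynamo_arena_size, PROT_READ);
    }

    /* Restore write access whenever control re-enters Dynamo's own code. */
    void dynamo_arena_unprotect(void)
    {
        mprotect(dynamo_arena, dynamo_arena_size, PROT_READ | PROT_WRITE);
    }

The key point from the text is only that Dynamo's dynamically allocated objects live in a region the application cannot write; finer-grained schemes are equally compatible with that description.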
4. Fragment Formation

Due to the significant overheads of operating at runtime, Dynamo has to maximize the impact of any optimization that it performs. Furthermore, since the objective is to complement, not compete with, the compiler that generated the instruction stream, Dynamo primarily looks for performance opportunities that tend to manifest themselves in the runtime context of the application. These are generally redundancies that cross static program boundaries like procedure calls, returns, virtual function calls, indirect branches and dynamically linked function calls. Another performance opportunity is instruction cache utilization, since a dynamically contiguous sequence of frequently executing instructions may often be statically non-contiguous in the application binary.

Dynamo's unit of runtime optimization is a trace, defined as a dynamic sequence of consecutively executed instructions. A trace starts at an address that satisfies the start-of-trace condition and ends at an address that satisfies the end-of-trace condition. Traces may extend across statically or dynamically linked procedure calls/returns, indirect branches and virtual function calls. Dynamo first selects a "hot" trace, then optimizes it, and finally emits relocatable code for it into the fragment cache. The emitted relocatable code is contiguous in the fragment cache memory, and branches that exit this code jump to corresponding exit stubs at the bottom of the code. This code is referred to as a fragment. The trace is a unit of the application's dynamic instruction stream (i.e., a sequence of application instructions whose addresses are application binary addresses) whereas the fragment is a Dynamo internal unit, addressed by fragment cache addresses. The following subsections outline the trace selection, trace optimization and fragment code generation mechanisms of Dynamo.
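As a rough illustration of the trace/fragment distinction, here is a hypothetical C data-structure sketch, not Dynamo's actual internal representation: a trace is addressed by application binary addresses, while a fragment is a Dynamo-internal unit addressed by fragment cache addresses, tagged with the start-of-trace application address and carrying exit stubs at the bottom of its code.

    #include <stdint.h>
    #include <stddef.h>

    /* One recorded application instruction: its application binary address plus
       the native instruction word captured while interpreting. */
    typedef struct {
        uintptr_t app_addr;     /* application binary address */
        uint32_t  encoding;     /* native (PA-RISC) instruction word */
    } TraceInsn;

    /* A trace is a unit of the application's dynamic instruction stream. */
    typedef struct {
        uintptr_t  start_addr;  /* address satisfying the start-of-trace condition */
        size_t     length;
        TraceInsn *insns;       /* consecutively executed instructions, in order */
    } Trace;

    /* A fragment is Dynamo-internal: a single-entry, multi-exit, contiguous block
       of optimized code at fragment cache addresses, with exit stubs at the bottom
       that either jump to a linked fragment or return control to Dynamo. */
    typedef struct Fragment Fragment;

    typedef struct {
        uintptr_t  app_target;  /* application address this exit branches to */
        Fragment  *linked_to;   /* linked fragment, or NULL to exit to Dynamo */
    } ExitStub;

    struct Fragment {
        uintptr_t  tag;         /* application binary address of the start-of-trace */
        void      *code;        /* entry point inside the fragment cache */
        size_t     code_bytes;
        ExitStub  *exits;       /* exit stubs emitted at the bottom of the code */
        size_t     num_exits;
    };

A lookup keyed by the tag (the application binary address of the start-of-trace instruction) is what step (B) of Figure 1 consults when it checks whether a branch target already has a cached fragment.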

4.1 Trace selection

Since Dynamo operates at runtime, it cannot afford to use elaborate profiling mechanisms to identify hot traces (such as [14][4]). Moreover, most profiling techniques in use today have been designed for offline use, where the gathered profile data is collated and analyzed post-mortem. The objective here is not accuracy, but predictability. If a particular trace is very hot over a short period of time, but its overall contribution to the execution time is small, it may still be an important trace to identify. Another concern for Dynamo is the amount of counter updates and counter storage required for identifying hot traces, since this adds to the overhead and memory footprint of the system.

As discussed in Section 2, Dynamo uses software interpretation of the instruction stream to observe runtime execution behavior. Interpretation is expensive but it avoids the need to instrument the application binary or otherwise perturb it in any way. Interpretation is preferable to statistical PC sampling because it does not interfere with applications that use timer interrupts. Also, as we will elaborate shortly, interpretation allows Dynamo to select hot regions directly without having to collate and analyze point statistics like the kind produced by PC sampling techniques. Another important advantage of interpretation is that it is a deterministic trace selection scheme, which makes the task of engineering the Dynamo system much easier.

It is worth noting that the "interpreter" here is a native instruction interpreter and that the underlying CPU is itself a very fast native instruction interpreter implemented in hardware. This fact can be exploited on machines that provide fast breakpoint traps (e.g., through user-mode accessible breakpoint window registers) to implement the Dynamo interpreter very efficiently [2]. On the PA-8000, however, breakpoint traps are very expensive, and it was more efficient to implement the interpreter by using emulation. The higher the interpretive overhead, the earlier Dynamo has to predict the hot trace in order to keep the overheads low. In general, the more speculative the trace prediction scheme, the larger we need to size the fragment cache, to compensate for the larger number of traces picked as a result. Thus, the interpretive overhead has a ripple effect throughout the rest of the Dynamo system.

Dynamo uses a speculative scheme we refer to as MRET (for most recently executed tail) to pick hot traces without doing any path or branch profiling. The MRET strategy works as follows. Dynamo associates a counter with certain selected start-of-trace points, such as the target addresses of backward taken branches. The target of a backward taken branch is very likely to be a loop header, and thus the head of several hot traces in the loop body. If the counter associated with a certain start-of-trace address exceeds a preset threshold value, Dynamo switches its interpreter to a mode where the sequence of interpreted instructions is recorded as they are being interpreted. Eventually, when an end-of-trace condition is reached, the recorded sequence of instructions (the most recently executed tail starting from the hot start-of-trace) is selected as a hot trace.

The insight behind MRET is that when an instruction becomes hot, it is statistically likely that the very next sequence of executed instructions that follows it is also hot. Thus, instead of profiling the branches in the rest of the sequence, we simply record the tail of instructions following the hot start-of-trace and optimistically pick this sequence as a hot trace. Besides its simplicity and ease of engineering, MRET has the advantage of requiring much smaller counter storage than traditional branch or path profiling techniques. Counters are only maintained for potential loop headers. Furthermore, once a hot trace has been selected and emitted into the fragment cache, the counter associated with its start-of-trace address can be recycled. This is possible because all future occurrences of this address will cause the cached version of the code to be executed and no further profiling is required.

Subsequent hot traces that also start at the same start-of-trace address will be selected when control exits the first selected trace for that start-of-trace address. Exits from previously selected hot traces themselves satisfy the start-of-trace condition (see Section 2), so the tails executed after such exits can in turn be counted and selected as new hot traces.
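A minimal sketch of the MRET bookkeeping described above, assuming a simple direct-mapped counter table and interpreter callbacks at every interpreted instruction and taken branch. The threshold, table size, and helper names (is_backward_taken, fragment_cache_lookup, record_insn, optimize_and_emit) are hypothetical; only the overall counting-then-recording behavior follows the prose.

    #include <stdint.h>
    #include <stddef.h>

    #define HOT_THRESHOLD   50      /* hypothetical preset hot threshold */
    #define MAX_TRACE_INSNS 1024    /* traces are truncated beyond this length */
    #define NUM_COUNTERS    4096    /* hypothetical counter table size */

    /* Counters are kept only for potential trace heads (likely loop headers), so
       a small direct-mapped table suffices; a conflicting address simply takes
       over the slot (an assumption made for brevity). */
    typedef struct { uintptr_t addr; unsigned count; } HeadCounter;
    static HeadCounter counters[NUM_COUNTERS];

    static HeadCounter *counter_for(uintptr_t addr)
    {
        HeadCounter *c = &counters[(addr >> 2) % NUM_COUNTERS];
        if (c->addr != addr) { c->addr = addr; c->count = 0; }
        return c;
    }

    /* Hypothetical hooks provided by the interpreter and the fragment cache. */
    extern int  is_backward_taken(uintptr_t branch_pc, uintptr_t target);
    extern int  fragment_cache_lookup(uintptr_t target);   /* nonzero on hit */
    extern void record_insn(uintptr_t pc);                 /* append to hot trace buffer */
    extern void optimize_and_emit(uintptr_t start_addr);   /* boxes I and J in Figure 1 */

    static int       recording;     /* in "code generation mode"? */
    static size_t    trace_len;
    static uintptr_t trace_head;

    /* Called by the interpreter for every instruction it interprets. */
    void mret_on_interpreted_insn(uintptr_t pc)
    {
        if (recording) {
            record_insn(pc);
            trace_len++;
        }
    }

    /* Called by the interpreter at every taken branch, with the branch's own pc
       and its target address. */
    void mret_on_taken_branch(uintptr_t branch_pc, uintptr_t target)
    {
        if (recording) {
            /* End-of-trace: a backward taken branch, a branch into an existing
               fragment, or an overly long trace. */
            if (is_backward_taken(branch_pc, target) ||
                fragment_cache_lookup(target) ||
                trace_len >= MAX_TRACE_INSNS) {
                recording = 0;
                optimize_and_emit(trace_head);  /* head counter can now be recycled */
            }
            return;
        }

        /* Start-of-trace: target of a backward taken branch (likely loop header).
           Fragment cache exit branches are treated the same way (not shown). */
        if (is_backward_taken(branch_pc, target)) {
            HeadCounter *c = counter_for(target);
            if (++c->count > HOT_THRESHOLD) {
                recording  = 1;      /* switch interpreter to recording mode */
                trace_head = target;
                trace_len  = 0;
            }
        }
    }

As in the prose, no branch or path profile is maintained: a counter per potential loop header, a recording flag, and the trace buffer are the only bookkeeping, and the counter can be recycled once its fragment has been emitted.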

