Developing Optimizations

2y ago
796.43 KB
30 Pages
Last View : 23d ago
Last Download : 6m ago
Upload by : Camille Dion

DevelopingOptimizationsProgrammable Optimization AndEmpirical Tuning7/11/2014DragonStar 2014 - Qing Yi1

Outlinep Programmable composition ofoptimizationsn Pattern-based optimization compositionp p E.g., dense matrix codes, stencils, graphsEmpirical tuningn 7/11/2014Searching the optimization spaceDragonStar 2014 - Qing Yi2

Programmable Optimization and Empirical Tuningp 7/11/2014Goal: customizable programoptimization environmentn Analysis engine(compiler) interacts withdevelopersp Use the ROSE C/C compilern Analysis resultsexpressed in POETp An open-sourceprogramtransformationscripting qyi/poetp Programmable controlby developersn POET transformationsempirically tunedDragonStar 2014 - Qing Yi3

What is POET?p It is a scripting language forn n n p Applying parameterized program transformationsProgrammable control of compiler optimizationsAd-hoc translation between arbitrary languagesDeveloped since 2007 at UTSA and UCCSn n Open source (BSD license)Language documentation and download availableatp qyi/poetFeedback welcome and appreciated7/11/2014DragonStar 2014 - Qing Yi4

Use Cases Of POETp Parameterization of Optimizations forEmpirical Tuningn n p Programmable control of compileroptimizationsn p Lightweight portable program transformationengineParameterized at the finest granularityFlexible composition of independently defined optsDomain-specific code generation/ad-hoctranslationn 7/11/2014Source-to-source translator among arbitrarylanguagesDragonStar 2014 - Qing Yi5

An example POET scriptinclude opt.pip parameter inFile message "input file"/ parameter outFile message "output file"/ parameter p threads default (2) / parameter b factor default (32) / parameter p block default (256 ) / trace inputCode,Nest1,Nest2,Nest3,Nest4 / Can be autogenerated fromhigher-leveltransformationspecifications input from inFile syntax "Cfront.code"to inputCode / eval BlockLoops[factor p block](Nest2 grp3[Nest.body],Nest2 grp3); / output to outFile syntax "Cfront.code"from inputCode / 7/11/2014DragonStar 2014 - Qing Yin n n Invokes predefinedoptimizations in thePOET librarySimple input/outputcommandsDynamically tracep p n Input code fragmentsResult oftransformationsFlexible compositionof optimizations6

Example Annotated Input Codevoid dgemm test(const int M,const int N,const int K,const double alpha,const double*A,const int lda,const double *B,const int ldb,const double beta,double *C,const int ldc){int I,j,l;/*@; BEGIN(nest1 Nest) @*/for (j 0; j -1 N; j 1) {/*@; BEGIN(nest3 Nest) @*/for (i 0; i -1 M; i 1) {C[(j * ldc) i] (beta * (C[(j * ldc) i]));/*@; BEGIN(nest2 Nest) @*/for (l 0; l -1 K; l 1) {C[(j * ldc) i] ((C[(j * ldc) i]) ((alpha * (A[(l * lda) i])) * (B[(j * ldb) l])));} } } }p Each loop to be optimized is given a handle namen n 7/11/2014POET transformations can directly operate on these handlesEach transformation modifies the handles to containequivalent optimized codeDragonStar 2014 - Qing Yi7

Domain-Specific Optimizationp Utilize domain knowledge from the HPCn n n p E.g., strategies for stencil code and dense matricesTrace key components of input code (e.g., loops)Apply optimizations known to be beneficialSupport small annotation and specificationlanguagesn Quickly translate between ad-hoc languagesp n Map multiple languages to a single ASTp p 7/11/2014E.g., tester/timer generation, dependence oroptimization specificationInput: read in the AST using one syntaxOutput: unparse the AST using a different syntaxDragonStar 2014 - Qing Yi8

Empirical Tuning of POET ScriptsFinal optconfig.SearchengineRuntimefeedbackp DeveloperPOET scriptInput code POET scriptParameter valuesPOETxformengineOptimizedcode driverExecutableVentor compiler(gcc)Used POET to parse parameter declarations and construct searchspace description7/11/2014DragonStar 2014 - Qing Yi9

The POET Optimization Libraryp p Defined in POET/lib/ (interface in opt.pi)Loop optimizationsn Targeting multi-core architecturesp n Targeting memory performancep n Loop blocking, interchange, fusion, fission, skewingTargeting register-level performancep p OpenMP loop parallelizationLoop unroll&jam, unrolling, SSE vectorizationData layout optimizationsn Reducing the cost of array referencesp 7/11/2014Array copying, scalar replacement, strength reductionDragonStar 2014 - Qing Yi10

Required Xform Parametersp Single loop xforms: Op [optional params] (loop)n Operate on a given loop xp p p Loop nest xforms : Op [optional params] (inner,outer)n Operate between an inner body n and an outer loop xp p p ParallelizeLoop(x): OpenMP loop parallelizationCleanupBlockedNests(x): generate cleanup codeUnrollLoops(n,x)/UnrollJam(n,x): Loop ,x): loop blocking/interchangeOther xforms: opt[optional params](config, loop)n Operate on input x based on various configurationsp p p p p 7/11/2014DistributeLoops(bodiesToDist,x): distribute loop xFuseLoops(nestsToFuse,pivot): replace pivot with fused loopVectorizeLoop(r, x): loop vectorization using register assign rCopyRepl(ref,dim, x): copy memory accessed by arraysScalarRepl(ref,dim,x): use scalars to substitute memoryDragonStar 2014 - Qing Yi11

Optional Xform Parametersp Configuration parametersn factor: a list of integer blocking/unrolling factorsp n cleanup (1/0/-1): whether to generate cleanup code.p p p p p Default values are set to commonly used onescleanup 1: generate cleanup code now;cleanup -1: there is no need for cleanup codecleanup 0: will generate cleanup later (not now)By default, cleanup code is generated now (i.e., cleanup 1)Side-effects parameters: handles used to save resultsn n n n n 7/11/2014trace: result handle to save transformations to inputtrace cleanup: result handle for generated cleanup codetrace decl: result handle for inserting variable declarationstrace include: result handle for adding new include files;trace mod: trace the modification of a list of expressionsDragonStar 2014 - Qing Yi12

POET Data Typesp Atomic types and associated operationsn p Integers and stringsCompound types and associated operationsn Lists: a singly linked listp p n Tuples: a static finite sequence of valuesp p n p p Construction: Loop[maxIter 100]#(“I”,0,”m”,1), Nest#(c, b)Accessing components: n[Nest.ctrl], c[Loop.I]Handles: special variables used to track AST transformsp 7/11/2014Construction: MAP{“a” 1,”b” 2} or MAP(type1, type2)Accessing components: m[“a”] 3, b m[“b”]Code templates(ASTs): user defined types in POETp n Construction: (a,b,c,d)Accessing components: t[index] where index is an integerMaps: associate pairs of related valuesp n Construction: (a b c), a::bAccessing components: HEAD(l), TAIL(l)Can be modified by xform routines via side effectsDragonStar 2014 - Qing Yi13

Example: Loop Unrollinginclude opt.pi parameter out default "" message "output file location" / parameter ur parse INT default 2 message "Loop unrolling factor for target"/ trace inputCode,target/ input from "mgrid.f" syntax "Ffront.code" to inputCode/ eval UnrollLoops[factor ur;trace inputCode](target[Nest.body],target); / output to out syntax "Ffront.code" from (inputCode)/ p p Unroll the loop tagged by target in “mgrid.f”To tune optimizationn 7/11/2014pcg -pout “out.f” -pur 4 opt unroll.ptDragonStar 2014 - Qing Yi14

Writing Your Own Optimizationsp POET xforms are oblivious of language syntaxn p Each optimization is a routine (global function)whichn n n p Traverses the AST to collect informationModifies the AST using built-in operationsMaintains the consistency of embedded handlesPOET support for building optimizationsn n n p Operate on ASTs shared by different languagesPattern matching and pattern-based traversalC-like control flow and recursive functionsHandle aware transformation operationsGetting started:n 7/11/2014Use PRINT and DEBUG operations for debuggingDragonStar 2014 - Qing Yi15

Example: Collecting Information xform FindLoopsInNest pars (inner, input) if (input : inner) {“” } * reaching inner, stopelse if (input : Nest#(loop,body))? {innerloops FindLoopsInNest(inner, body); * recursion(innerloops "")? loop : loop::innerloops; * concatenate}else ERROR(“Did not find inner body: ” inner); /xform p Find all the loops outside inner and inside inputn p Pattern matching:n n p Recursively navigates down input until reaching innerx:yDoes x equal to y or match the structure of y?input : Nest#(loop,body) : is input a Nest? if yes, use loop and body tosave its childrenList concatenation: x :: yn 7/11/2014Build a new list with x as the first element followed by yDragonStar 2014 - Qing Yi16

Example: Pattern-based Traversal xform FindStmtsOutsideNest pars (nest, input) res NULL;foreach (input : (cur (nest ExpStmt)) : TRUE) {if (cur ! nest) res cur::res;}ReverseList(res) /xform p Find all ExpStmts that are outside nest and inside inputn p The last expression ReverseList(res) is returned as resultAST traversal loop: evaluates body for each matching ASTp n n n succ true : do not traverse inside the matching ASTssucc false : continue traversal inside each matching ASTTo traverse input in reverse orderp 7/11/2014foreach (input : pattern : succ) bodyforeach r (input : pattern : succ) bodyDragonStar 2014 - Qing Yi17

Using Maps To Save Information xform MapLoopsInNest pars (input, map) foreach (input : (cur Nest#((CLEAR loop), )): FALSE){if (map[loop] "") map[loop] cur;else map[loop] cur :: map[loop];} /xform p Map each loop control in input to the whole loopn Among all POET compound data structures, Maps arethe only type of value that can be modifiedp p E.g., you can build a new list, but not modify an existing one,as different lists may share internal componentsPattern specifiersn n n 7/11/2014cur pattern: use cur to save the matched ASTCLEAR var : uninitialize var so that it matches anarbitray value and then saves the matched value: matches an arbitrary value (without saving it)DragonStar 2014 - Qing Yi18

Developing Program Xformsp A program transformation takes an input ASTand returns a new onen n For optimization purposes, the new code must beequivalent to the original oneMay want to modify the original AST directlyp p E.g., to keep a single version of working ASTEach POET transformation is an operationthatn n Takes an AST and returns the transformed oneModifies the input AST if it contains result handlesp 7/11/2014An AST cannot be directly modified as different ASTsmay share common componentsDragonStar 2014 - Qing Yi19

Support Of AST Transformationsp Built-in AST transforming operations (input AST: e)n n REPLACE(c1,c2,e): replace all occurrences of c1 with c2REPLACE(((o1 ,r1 ).(om, rm)), e)p p n REBUILD(e) : rebuild the input ASTp n Each copy replacing c1 by a different component in c2PERMUTE( (I 1 , I 2 , ., I m),e): reorder the input listp p p Invoke an associated rebuild routine for each AST nodeDUPLICATE(c1,c2,e): replicate input ASTp n Locate and replace each oi (i 1,.,m) with riMust encounter o1, ,om in order in pre-order traversal of eThe input must be a list of AST nodesThe j th (j 1,.,m) element is located at I j in the resultAll transformations return a single list/AST as resultn 7/11/2014Modify trace handles inside the input AST if appropriateDragonStar 2014 - Qing Yi20

Handles In POETp A special kind of global variablesn n p Lifetime span all POET files in a programCan be embedded inside ASTs to tracetransformationsHandles can be declared in groupsp n n 7/11/2014 trace inputCode, nest1, nest3, nest2/ They are encountered in order in a pre-ordertraversal of inputThey are used as input/output of POET xformroutinesDragonStar 2014 - Qing Yi21

Handle Operationsp Insertion and removal of handles (input AST: e)n INSERT (x, e): insert handle x inside input AST ep p n ERASE(x, e): remove all occurrences of handle x from ep n p All handles following x in the same group are also insertedAll handles must already contain fragments of e as valuesDoes not affect other handles in the same groupCOPY(e): remove all handles in e and return the resultMust save modification result unless input is itselfa handlen 7/11/2014In which case the input handle is modified to containresultDragonStar 2014 - Qing Yi22

Example: Erase Handles xform EraseTraceHandle pars (handle, trace) nested handle TRUErepl " keep” trace ERASE(handle, trace);for (origvlaue ERASE(handle); * return the value of handle * origvalue : VAR && nested handle; origvalue ERASE(origvalue)){ trace ERASE(origvalue, trace); }if (repl ! " keep") trace REPLACE(origvalue, repl, trace);(origvalue, trace) /xform p Erase a handle from an input AST tracen n Return the value of handle and modified traceTrace handles may directly nest inside one anotherp p Erase nested handles holding the same valueUse tuple to directly return multiple values to the caller7/11/2014DragonStar 2014 - Qing Yi23

Example: Modify Handles xform ModifyTraceHandle pars (handle, newvalue)trace "” nested handle TRUE (handle1, newvalue) EraseTraceHandle(handle, newvalue);REPLACE(handle1, newvalue, handle) /xform p Modify handle with a new valuen n p trace: handle containing the overall ASTReturn the modified handle or traceGoal: avoid creating any cycles in ASTn n 7/11/2014The input newvalue may contain handle ascomponentsNeed to erase handle from trace or newvaluebefore replacementDragonStar 2014 - Qing Yi24

POET Code Templatesp Code templates are user-defined data types thatn Can be used to build compound acyclic data structuresp n n p To avoid cycles, internal data members cannot be modifiedCan be associated with concrete syntaxes for parsing/unparsingCan be used to automatically build ASTs for arbitrarylanguagesFor example code GraphEdge pars (from:GraphNode,to:GraphNode) "@from@"- "@to@" /code n n Data members of data structure: from and toSyntax of data structure: the body of GraphEdgep p n Type annotations for data members: GraphNodep 7/11/2014Used to automatically convert GraphEdge to/from stringsThe @ @ sign: used to surround a POET expressionUsed to specify how to parse/unparse each data memberDragonStar 2014 - Qing Yi25

Example: From C2F.code code VoidType subroutine /code code IntType pars (name:"char" "int" "unsigned" "long") @(switch(name){case "char": "integer*1"case ("int" "unsigned"): "integer"case "long" : "integer*4"})@ /code p Map C concepts to Fortrann n 7/11/2014Redefine syntax for common conceptsMay need to use global variables or tables to saveinformationDragonStar 2014 - Qing Yi26

Example: Loop Permutation xform PermuteLoops pars (inner,input) order 0 trace "" if (order 0) { input }else if (! (input : Nest#(loop,body)) ) {ERROR("Input is not a loop nest!")}else {loops FindLoopsInNest(inner, input);if (LEN(loops) ! LEN(order))ERROR("Incorrect reordering indices: " order "\n Loops are: " loops);nloops PERMUTE (order, loops);res BuildNest(nloops, inner);res TraceNestedLoops[trace input](nests, res);if (trace : VAR) REPLACE(ERASE(input), res, trace);else { res } } /xform p Key: modify handles in AST with correct values7/11/2014DragonStar 2014 - Qing Yi27

Domain-specific Code Generationp Use code templates to represent domainspecific conceptsn n n p Define a compound data type for each conceptSpecify how to parse and unparse the data typeNo need to express everything using statementsExample: generating testing driversn Code templates could be defined forp n n 7/11/2014Allocate buffer, parameter initialization, initialize timer,reading timing, The generated timer could be in C, Fortran, or anyother languageJust like translating programs from one language toanotherDragonStar 2014 - Qing Yi28

Example: Timer Generation code StaticBufferAllocate pars (type,name,size,align,nrep) @name@ size @TimerAlignSize#(size,align)@; @ (if (nrep 1) { @@name@ rep CacheSZ / @name@ size 1; @})@ /code code Static2DBufferAllocate pars (type,name,size,size2,align,nrep) @name@ size @TimerAlignSize#(size,align)@; @ (if (nrep 1) { @@name@ rep CacheSZ / @name@ size 1; @})@@name@ size2 @TimerAlignSize#(size2,align)@; /code code TimerBufferInitialize pars (name, nrep, value, valueIncr) /code "p No details of the underlying language7/11/2014DragonStar 2014 - Qing Yi29

Summaryp Compose optimizations using POETn n p Writing your own analysis or optimizationn n n n p Invoke routines from the opt libraryFrom high-level specifications to POET scriptsDefinition of code templatesUse of built-in typesTraversing the ASTModifying the AST through handlesTuning of optimizationsn 7/11/2014Navigating meaningful combinations of parametervaluesDragonStar 2014 - Qing Yi30

E.g., tester/timer generation, dependence or optimization specification " Map multiple languages to a single AST ! Input: read in the AST using one syntax ! Output: unparse the AST using

Related Documents:

Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector, Multimedia extended ISA, GPU, loop level parallelism, Chapter4 slides you may also refer to chapter3-ilp.ppt starting with slide #114 3.

Static analysis and compile-time optimizations For the next few lectures. Agenda Static analysis and compile-time optimizations For the next few lectures Intraprocedural Data Flow Analysis. Agenda

FFmpeg ME-only for VR Quality 2017 SDK 8.0 10-bit transcode 10/12-bit decode OpenGL Dec. optimizations WP, AQ, Enc. Quality Q1 2018 SDK 8.1 B-as-ref QP/emphasis map 4K60 HEVC encode Reusable classes & new sample apps Q3 2018 SDK 8.2 Decode inference optimizations SDK 9.0 Turing Multi-NV

Which hardware platform does Cisco WAAS software run on? A. ACE module B. DRE appliance C. WAE D. CSM module . Answer: C . Question: 66 . Which Cisco WAAS requirement results in a greater support for optimizations? A. deployment of WCCP B. deployment in two or more sites C. deployment of multiple ACE linecards D. transparent optimizations Answer: B

Building openSUSE with link-time optimizations . Outlilne . (from linker’s resolution data most symbols become “static”) . IP analysis streaming out Symbol & type streaming in merging Inter-procedural (whole program) Opts: Dead symbol ellim. Symbol promotion profile analysis

number of optimizations available it is not possible for an application developer or even the compiler writer to create an exhaustive set of scenarios that could benefit

- 4 - CS 4300 - Computer Architecture Machine Independent Opt. ResultsMachine Independent Opt. Results Optimizations Reduce function calls and memory references within loop Performance Anomaly Computing FP product of all elements exceptionally slow. Very large speedup when accumulate in temporary Caused by quirk of IA32 floating point zMemory uses 64-bit format, register use 80

ßHow do we take something running in software and make it fast? ßAlgorithmic changes ßBy FAR, the most significant ßArchitecture choices ßImplementation choices ßSynthesis Optimizations ßPlace and Route Optimizations. 2 Rincon Research Corporation - FPGA Development 5 Algorithmic Changes