CS 152 Computer Architecture and Engineering, Lecture 13


CS 152 Computer Architecture and Engineering
Lecture 13 - VLIW Machines and Statically Scheduled ILP

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

March 8, 2012
CS152, Spring 2012

Last time in Lecture 12

Unified physical register file machines remove data values from ROB
  – All values only read and written during execution
  – Only register tags held in ROB
  – Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
  – Issue window can be separated from ROB and made smaller than ROB (allocate in decode, free after instruction completes)
  – Free resources on commit
Speculative store buffer holds store values before commit to allow load-store forwarding
Can execute later loads past earlier stores when addresses known, or predicted no dependence

Superscalar Control Logic Scaling

[Figure: an issue group of width W checked against previously issued instructions over lifetime L]

Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
For in-order machines, L is related to pipeline latencies and check is done during issue (interlocks or scoreboard)
For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB), and check is done by broadcasting tags to waiting instructions at write back (completion)
As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)
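To make the scaling argument concrete, here is a minimal Python sketch of the W*(W*L) model from this slide. The function names and the assumption that L grows as k*W (with an arbitrary k = 4) are illustrative choices, not part of the lecture.

```python
def dependence_checks(width: int, lifetime: int) -> int:
    """Each of W issuing instructions checks against W*L in-flight instructions."""
    return width * (width * lifetime)


def checks_with_growing_window(width: int, k: int = 4) -> int:
    """Hypothetical assumption: lifetime L grows linearly with W (L = k*W),
    which turns the W^2 * L term into roughly W^3 growth."""
    return dependence_checks(width, k * width)
```

Under that assumption, doubling W from 2 to 4 multiplies the check count by eight, which is the faster-than-W^2 growth the slide points to.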

Out-of-Order Control Complexity: MIPS R10000

[Figure: MIPS R10000 with the control logic labeled. SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck

Sequential source code: a = foo(b); for (i = 0, i < ...
  -> Superscalar compiler: find independent operations, schedule operations
  -> Sequential machine code
  -> Superscalar processor: check instruction dependencies, schedule execution

VLIW: Very Long Instruction Word

Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
  – Two Integer Units, Single Cycle Latency
  – Two Load/Store Units, Three Cycle Latency
  – Two Floating-Point Units, Four Cycle Latency

Multiple operations packed into one instruction
Each operation slot is for a fixed function
Constant operation latencies are specified
Architecture requires guarantee of:
  – Parallelism within an instruction => no cross-operation RAW check
  – No data use before data ready => no data interlocks
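The fixed-function slot layout can be sketched in a few lines of Python. The slot names, the mnemonic-to-unit table, and the `make_instruction` helper are hypothetical illustrations; only the six slots and the 1/3/4-cycle latencies come from the slide.

```python
# Six fixed-function slots, as on the slide; the names are illustrative.
SLOT_KIND = {"int1": "int", "int2": "int",
             "mem1": "mem", "mem2": "mem",
             "fp1": "fp", "fp2": "fp"}

# Constant operation latencies in cycles, from the slide.
LATENCY = {"int": 1, "mem": 3, "fp": 4}

# Assumed mapping from mnemonic to functional-unit kind (not from the slide).
OP_KIND = {"add": "int", "bne": "int", "fld": "mem", "fsd": "mem", "fadd": "fp"}


def make_instruction(**ops):
    """Build one VLIW instruction; every unfilled slot becomes an explicit NOP."""
    instr = {slot: "nop" for slot in SLOT_KIND}
    for slot, op in ops.items():
        if slot not in SLOT_KIND:
            raise ValueError(f"unknown slot {slot!r}")
        mnemonic = op.split()[0]
        if OP_KIND.get(mnemonic) != SLOT_KIND[slot]:
            raise ValueError(f"{mnemonic!r} cannot issue in slot {slot!r}")
        instr[slot] = op
    return instr
```

For example, `make_instruction(int1="add x1, 8", mem1="fld f1, 0(x1)")` fills two slots and pads the other four with NOPs, while placing an `fadd` in a memory slot is rejected.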

Early VLIW Machines

FPS AP120B (1976)
  – scientific attached array processor
  – first commercial wide instruction machine
  – hand-coded vector math libraries using software pipelining and loop unrolling
Multiflow Trace (1987)
  – commercialization of ideas from Fisher’s Yale group including “trace scheduling”
  – available in configurations with 7, 14, or 28 operations/instruction
  – 28 operations packed into a 1024-bit instruction word
Cydrome Cydra-5 (1987)
  – 7 operations encoded in 256-bit instruction word
  – rotating register file

VLIW Compiler Responsibilities

Schedule operations to maximize parallel execution
Guarantees intra-instruction parallelism
Schedule to avoid data hazards (no interlocks)
  – Typically separates operations with explicit NOPs

Loop Execution

for (i = 0; i < N; i++)
    B[i] = A[i] + C;

Compile:

loop: fld f1, 0(x1)
      add x1, 8
      fadd f2, f0, f1
      fsd f2, 0(x2)
      add x2, 8
      bne x1, x3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx; cycle numbers follow from the 3-cycle load and 4-cycle FP latencies):

  cycle 1: add x1 | fld
  cycle 4: fadd
  cycle 8: add x2 | bne | fsd

How many FP ops/cycle?
1 fadd / 8 cycles = 0.125

Loop Unrolling

for (i = 0; i < N; i++)
    B[i] = A[i] + C;

Unroll inner loop to perform 4 iterations at once:

for (i = 0; i < N; i += 4)
{
    B[i]   = A[i]   + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
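The cleanup loop mentioned on this slide is easy to get wrong, so here is a minimal Python sketch of 4-way unrolling with a cleanup loop. The function names are illustrative; Python is used only to show the control structure, not to suggest any speedup.

```python
def add_constant(a, c):
    """Reference loop: B[i] = A[i] + C."""
    return [x + c for x in a]


def add_constant_unrolled(a, c):
    """Same computation, unrolled 4 ways with a final cleanup loop."""
    n = len(a)
    b = [0] * n
    main_end = n - (n % 4)   # largest multiple of 4 that is <= n
    i = 0
    while i < main_end:      # unrolled body: four iterations per pass
        b[i] = a[i] + c
        b[i + 1] = a[i + 1] + c
        b[i + 2] = a[i + 2] + c
        b[i + 3] = a[i + 3] + c
        i += 4
    while i < n:             # cleanup loop: the remaining 0 to 3 elements
        b[i] = a[i] + c
        i += 1
    return b
```

Both functions return the same result for any N, including values that are not multiples of four.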

Scheduling Loop Unrolled Code

Unroll 4 ways:

loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      fsd f8, 24(x2)
      add x2, 32
      bne x1, x3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):

  fld f1, fld f2, fld f3, fld f4 issue first, with add x1 alongside the loads
  fadd f5, fadd f6, fadd f7, fadd f8 follow as the loaded values become ready
  fsd f5, fsd f6, fsd f7 follow as the sums become ready
  add x2 | bne | fsd f8 form the final instruction

How many FLOPS/cycle?
4 fadds / 11 cycles = 0.36
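The 8-cycle and 11-cycle figures can be reproduced with a small greedy scheduler. This is a sketch under stated assumptions: two memory units, a single FP adder (the FP+ slot), the 3-cycle load and 4-cycle FP latencies from the VLIW slide, and integer operations ignored because they fit in otherwise empty integer slots. The names `schedule_length`, `UNIT`, and `CAPACITY` are illustrative, and the exact slot placement may differ from the slide's table even though the schedule length matches.

```python
LATENCY = {"fld": 3, "fadd": 4}                  # cycles until the result is usable
UNIT = {"fld": "mem", "fsd": "mem", "fadd": "fp_add"}
CAPACITY = {"mem": 2, "fp_add": 1}               # assumed units available per cycle


def schedule_length(unroll: int) -> int:
    """Greedily schedule `unroll` independent fld -> fadd -> fsd chains.

    Returns the cycle in which the last store issues (cycles start at 1).
    """
    used = {}  # (cycle, unit) -> number of operations already issued

    def issue(op: str, earliest: int) -> int:
        unit = UNIT[op]
        cycle = earliest
        while used.get((cycle, unit), 0) >= CAPACITY[unit]:
            cycle += 1                           # slot full: try the next cycle
        used[(cycle, unit)] = used.get((cycle, unit), 0) + 1
        return cycle

    loads = [issue("fld", 1) for _ in range(unroll)]
    adds = [issue("fadd", t + LATENCY["fld"]) for t in loads]
    stores = [issue("fsd", t + LATENCY["fadd"]) for t in adds]
    return max(stores)


def flops_per_cycle(unroll: int) -> float:
    """One fadd per original loop iteration."""
    return unroll / schedule_length(unroll)
```

With no unrolling the model gives 8 cycles (0.125 FLOPS/cycle); with 4-way unrolling it gives 11 cycles, about 0.36 FLOPS/cycle, because the single FP adder serializes the four fadds.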

Software Pipelining

Unroll 4 ways first:

loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      add x2, 32
      fsd f8, -8(x2)
      bne x1, x3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):

  prolog: fld f1 to f4 (with add x1) for the first iterations, then a second group of loads overlapped with fadd f5 to f8
  loop (iterate): four instructions, each issuing one fld, one fadd, and one fsd from different iterations; add x2, add x1, and bne fill the integer slots
  epilog: the remaining fadd f5 to f8 and fsd f5 to f8 drain, with the final add x2 and bne

How many FLOPS/cycle?
4 fadds / 4 cycles = 1
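The prolog / steady-state / epilog structure can be mimicked at source level. The Python sketch below pipelines the load, add, and store stages of B[i] = A[i] + C by one iteration each; it leaves out the 4-way unrolling and the real latencies, and the function name and the short-input fallback are illustrative assumptions.

```python
def add_constant_pipelined(a, c):
    """Software-pipelined form of B[i] = A[i] + C with three stages."""
    n = len(a)
    if n < 2:                      # too short to fill the pipeline
        return [x + c for x in a]
    b = [None] * n

    # Prolog: start iterations 0 and 1 without storing anything yet.
    loaded = a[0]                  # load, iteration 0
    summed = loaded + c            # add,  iteration 0
    loaded = a[1]                  # load, iteration 1

    # Steady state: each pass touches three different iterations.
    # The three statements only read values produced by an earlier pass,
    # so a VLIW could issue them in the same instruction.
    for i in range(2, n):
        b[i - 2] = summed          # store, iteration i-2
        summed = loaded + c        # add,   iteration i-1
        loaded = a[i]              # load,  iteration i

    # Epilog: drain the two iterations still in flight.
    b[n - 2] = summed
    b[n - 1] = loaded + c
    return b
```

The prolog and epilog run once per loop, which is why the startup and wind-down cost is not paid on every pass through the loop body.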

Software Pipelining vs. Loop Unrolling

[Figure: performance vs. time. Loop unrolled: startup overhead and wind-down overhead appear in every loop iteration. Software pipelined: performance stays high across loop iterations.]

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

CS152 Administrivia

Lab 3 due date pushed back two days
Now due on same day as Quiz 3, Thursday Mar 22

What if there are no loops?

[Figure: a basic block within a larger control-flow graph]

Branches limit basic block size in control-flow intensive irregular code
Difficult to find ILP in individual basic blocks

Trace Scheduling [Fisher, Ellis]

Pick string of basic blocks, a trace, that represents most frequent branch path
Use profiling feedback or compiler heuristics to find common branch paths
Schedule whole “trace” at once
Add fixup code to cope with branches jumping out of trace
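The first step, picking a trace from profile data, can be sketched as follows. This is a deliberately simplified, forward-only selection: it starts at a given block and always follows the most probable successor. The CFG representation and the name `pick_trace` are assumptions for illustration, and scheduling and fixup-code generation are not modeled.

```python
def pick_trace(cfg, entry):
    """Follow the most likely successor from `entry` until the path ends or loops.

    `cfg` maps each basic block name to a dict {successor: profiled probability}.
    """
    trace = []
    block = entry
    while block is not None and block not in trace:
        trace.append(block)
        successors = cfg.get(block, {})
        # Choose the successor with the highest profiled probability, if any.
        block = max(successors, key=successors.get) if successors else None
    return trace


# Hypothetical profile for a diamond-shaped control-flow graph.
example_cfg = {
    "b0": {"b1": 0.9, "b2": 0.1},
    "b1": {"b3": 1.0},
    "b2": {"b3": 1.0},
    "b3": {},
}
```

For `example_cfg`, the selected trace is b0, b1, b3; block b2 lies off the trace, so a branch into or out of it is where fixup code would be needed.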

Problems with “Classic” VLIW

Object-code compatibility
  – have to recompile all code for every machine, even for two machines in same generation
Object code size
  – instruction padding wastes instruction memory/cache
  – loop unrolling/software pipelining replicates code
Scheduling variable latency memory operations
  – caches and/or memory bank conflicts impose statically unpredictable variability
Knowing branch probabilities
  – Profiling requires a significant extra step in build process
Scheduling for statically unpredictable branches
  – optimal schedule varies with branch path

VLIW Instruction Encoding

[Figure: instruction stream divided into Group 1, Group 2, Group 3]

Schemes to reduce effect of unused fields
  – Compressed format in memory, expand on I-cache refill
      » used in Multiflow Trace
      » introduces instruction addressing challenge
  – Mark parallel groups
      » used in TMS320C6x DSPs, Intel IA-64
  – Provide a single-op VLIW instruction
      » Cydra-5 UniOp instructions
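One way to picture the compressed-in-memory scheme is a presence mask plus only the non-NOP operations, expanded back to full width on refill. The sketch below is a generic illustration under that assumption; it is not the actual Multiflow Trace format, and `compress`, `expand`, and the six-slot width are hypothetical.

```python
NUM_SLOTS = 6
NOP = "nop"


def compress(instr):
    """Return (mask, ops): bit i of mask is set when slot i holds a real operation."""
    mask = 0
    ops = []
    for i, op in enumerate(instr):
        if op != NOP:
            mask |= 1 << i
            ops.append(op)
    return mask, ops


def expand(mask, ops):
    """Rebuild the full-width instruction, re-inserting NOPs in the empty slots."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << i) else NOP for i in range(NUM_SLOTS)]
```

Because compressed instructions have variable length, the address of the next instruction is no longer a fixed stride away, which is the addressing challenge the slide mentions.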

Intel Itanium, EPIC IA-64

EPIC is the style of architecture (cf. CISC, RISC)
  – Explicitly Parallel Instruction Computing (really just VLIW)
IA-64 is Intel’s chosen ISA (cf. x86, MIPS)
  – IA-64 = Intel Architecture 64-bit
  – An object-code-compatible VLIW
Merced was first Itanium implementation (cf. 8086)
  – First customer shipment expected 1997 (actually 2001)
  – McKinley, second implementation shipped in 2002
  – Recent version, Poulson, eight cores, 32nm, announced 2011

Eight Core Itanium “Poulson” [Intel 2011]

8 cores
1-cycle 16KB L1 I&D caches
9-cycle 512KB L2 I-cache
8-cycle 256KB L2 D-cache
32 MB shared L3 cache
544 mm^2 in 32nm CMOS
Over 3 billion transistors

Cores are 2-way multithreaded
6 instruction/cycle fetch
  – Two 128-bit bundles
Up to 12 insts/cycle execute

IA-64 Instruction Format

128-bit instruction bundle: | Instruction 2 | Instruction 1 | Instruction 0 | Template |

Template bits describe grouping of these instructions with others in adjacent bundles
Each group contains instructions that can execute in parallel

[Figure: bundles j-1, j, j+1, j+2 shown against groups i-1, i, i+1, i+2]
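The bundle layout can be shown with a short pack/unpack sketch. The slide gives only the 128-bit total; the 5-bit template and 41-bit instruction slots used below are the commonly documented IA-64 field widths and should be checked against the architecture manual. The function names are illustrative, and template meanings are not modeled.

```python
TEMPLATE_BITS = 5      # assumed width, per the IA-64 bundle format
SLOT_BITS = 41         # assumed width of each of the three instruction slots
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("expected three 41-bit instruction slots")
    bundle = template                          # template sits in the low bits
    for i, slot in enumerate(slots):           # instruction 0 is next to the template
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
                  for i in range(3))
    return template, slots
```

Note that 5 + 3 * 41 = 128, so the four fields exactly fill the bundle.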

IA-64 Registers

128 General Purpose 64-bit Integer Registers
128 General Purpose 64/80-bit Floating Point Registers
64 1-bit Predicate Registers
GPRs “rotate” to reduce code size for software pipelined loops
  – Rotation is a simple form of register renaming allowing one instruction to address different physical registers on each iteration
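Register rotation can be modeled as adding a rotating base to every register number. The class below is a simplified sketch: it rotates the whole file rather than only a rotating region, and the names `RotatingRegisterFile` and `rrb` are illustrative assumptions, not the IA-64 definition.

```python
class RotatingRegisterFile:
    """Toy register file in which logical names shift by one on each rotation."""

    def __init__(self, size):
        self.regs = [0] * size
        self.rrb = 0                       # rotating register base

    def _physical(self, reg):
        return (reg + self.rrb) % len(self.regs)

    def read(self, reg):
        return self.regs[self._physical(reg)]

    def write(self, reg, value):
        self.regs[self._physical(reg)] = value

    def rotate(self):
        """Called once per loop iteration: a value written as r[k] is now r[k+1]."""
        self.rrb = (self.rrb - 1) % len(self.regs)
```

A software-pipelined loop can then use the same instruction text, for example "load into r0, add from r1", on every iteration while each iteration's value lives in a different physical register, so the loop body does not need to be unrolled just to rename registers.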

IA-64 Predicated Execution

Problem: Mispredicted branches limit ILP
Solution: Eliminate hard to predict branches with predicated execution
  – Almost all IA-64 instructions can be executed conditionally under predicate
  – Instruction becomes NOP if predicate register false

Before predication (four basic blocks):

b0: Inst 1            (if)
    Inst 2
    br a==b, b2
b1: Inst 3            (else)
    Inst 4
    br b3
b2: Inst 5            (then)
    Inst 6
b3: Inst 7
    Inst 8

After predication (one basic block):

    Inst 1
    Inst 2
    p1,p2 <- cmp(a==b)
    (p1) Inst 3 || (p2) Inst 5
    (p1) Inst 4 || (p2) Inst 6
    Inst 7
    Inst 8

Mahlke et al, ISCA95: On average 50% branches removed
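If-conversion can be illustrated in Python by replacing a branch with operations guarded by complementary predicates. The helper `predicated_op` and the two example computations are hypothetical; the sketch models only the rule that an operation becomes a NOP when its predicate is false.

```python
def branchy(a, b, x):
    """Original control flow: one of two paths runs."""
    if a == b:
        x = x + 1          # "then" path
    else:
        x = x * 2          # "else" path
    return x


def predicated_op(pred, op, old_value, *args):
    """Commit op(*args) when pred is true; otherwise act as a NOP."""
    return op(*args) if pred else old_value


def predicated(a, b, x):
    """Branch-free version: a compare sets two complementary predicates."""
    p_then = (a == b)
    p_else = not p_then
    x = predicated_op(p_then, lambda v: v + 1, x, x)
    x = predicated_op(p_else, lambda v: v * 2, x, x)
    return x
```

Exactly one of the two guarded operations takes effect, so both versions compute the same result without a branch in the predicated form.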

Fully Bypassed Datapath

[Figure: fully bypassed pipeline datapath, including PC, instruction memory, register file, and data memory]

Where does predication fit in?

IA-64 Speculative Execution

Problem: Branches restrict compiler code motion
Solution: Speculative operations that don’t cause exceptions

Original code:

    Inst 1
    Inst 2
    br a==b, b2

    Load r1
    Use r1
    Inst 3

Can’t move load above branch because might cause spurious exception

With speculative load:

    Load.s r1
    Inst 1
    Inst 2
    br a==b, b2

    Chk.s r1
    Use r1
    Inst 3

Load.s: Speculative load never causes exception, but sets “poison” bit on destination register
Chk.s: Check for exception in original home block jumps to fixup code if exception detected

Particularly useful for scheduling long latency loads early
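The poison-bit mechanism can be sketched with a sentinel value. In this toy model, memory is a Python dict and a missing key stands in for a faulting address; `POISON`, `load_s`, and `chk_s` are illustrative names, not IA-64 semantics in full.

```python
POISON = object()          # stands in for the "poison" bit on a register


def load_s(memory, addr):
    """Speculative load: never raises; a fault yields POISON instead."""
    try:
        return memory[addr]
    except KeyError:       # a missing address models a faulting access
        return POISON


def chk_s(value, fixup):
    """Check in the load's home block: run the fixup code only if poisoned."""
    return fixup() if value is POISON else value
```

A typical use is `r1 = load_s(memory, addr)` hoisted above the branch, followed by `r1 = chk_s(r1, fixup)` in the original block; if the branch goes the other way, a poisoned value is simply never checked and no spurious exception is raised.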

IA-64 Data Speculation

Problem: Possible memory hazards limit code scheduling
Solution: Hardware to check pointer hazards

Original code:

    Inst 1
    Inst 2
    Store

    Load r1
    Use r1
    Inst 3

Can’t move load above store because store might be to same address

With data speculative load:

    Load.a r1
    Inst 1
    Inst 2
    Store

    Load.c
    Use r1
    Inst 3

Load.a: Data speculative load adds address to address check table
Store: Store invalidates any matching loads in address check table
Load.c: Check if load invalid (or missing), jump to fixup code if so

Requires associative hardware in address check table
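The address check table can be modeled as a small map from destination register to load address. This is a sketch under simplifying assumptions: exact address matching, unbounded capacity, and a fixup that just reloads the value. The class and method names are illustrative.

```python
class AddressCheckTable:
    """Toy model of the table consulted by Load.a, Store, and Load.c."""

    def __init__(self):
        self._entries = {}                 # destination register -> load address

    def load_a(self, memory, reg, addr):
        """Data speculative load: record the address, then load the value."""
        self._entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and invalidate any matching load entries."""
        memory[addr] = value
        self._entries = {r: a for r, a in self._entries.items() if a != addr}

    def load_c(self, memory, reg, addr, speculative_value):
        """Check: keep the speculative value if its entry survived, else reload."""
        if self._entries.pop(reg, None) == addr:
            return speculative_value
        return memory[addr]                # fixup path, modeled as a plain reload
```

The dictionary scan in `store` corresponds to the associative lookup the slide says the hardware needs: every store must be compared against all outstanding speculative loads.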

Limits of Static Scheduling

Unpredictable branches
Variable memory latency (unpredictable cache misses)
Code size explosion
Compiler complexity

Despite several attempts, VLIW has failed in general-purpose computing arena (so far).
  – More complex VLIW architectures close to in-order superscalar in complexity, no real advantage on large complex apps.
Successful in embedded DSP market
  – Simpler VLIWs with more constrained environment, friendlier code.

Acknowledgements

These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from course CS252
