Five Instruction Execution Steps

Transcription

CS/COE1541: Introduction to Computer Architecture
Pipelining
Sangyeun Cho, Computer Science Department, University of Pittsburgh

Five instruction execution steps
1. Instruction fetch
2. Instruction decode and register read
3. Execution, memory address calculation, or branch completion
4. Memory access or R-type instruction completion
5. Write-back
Instruction execution takes 3-5 cycles!

Single-cycle vs. multi-cycle
The two designs can be compared along several dimensions (for each, consider the single-cycle and the multi-cycle case):
- Resource usage
- Control design
- Performance (CPI × CCT)

Pipelining
- Goal of pipelining is throughput!
- Observation: instructions follow "steps"
  - Some steps are common (e.g., fetch, decode)
  - Because we "program" the ALU, the execution step is also common
- Basic idea: overlap instruction execution
  - While instruction #1 is in the decode stage, fetch instruction #2
  - Provide more resources such that overlapping is not hindered by sharing of resources
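The CPI × CCT trade-off above can be made concrete with a back-of-the-envelope calculation. The per-step latencies and the multi-cycle average CPI below are illustrative assumptions, not figures from the slides:

```python
# Execution time = instruction count x CPI x clock cycle time (CCT).
# Assumed per-step latencies (ps) for a MIPS-like datapath -- illustrative only.
steps = {"fetch": 200, "decode": 100, "execute": 200, "memory": 200, "writeback": 100}

# Single-cycle: CPI = 1, but the cycle must cover the longest instruction
# (lw exercises all five steps).
single_cct = sum(steps.values())   # 800 ps
single_cpi = 1

# Multi-cycle: CCT covers only the slowest step; each instruction takes 3-5 cycles.
multi_cct = max(steps.values())    # 200 ps
multi_cpi = 3.8                    # assumed average over an instruction mix

n_insns = 1_000_000
single_time = n_insns * single_cpi * single_cct
multi_time = n_insns * multi_cpi * multi_cct
print(single_time, multi_time)
```

Note the design point this exposes: with these numbers, multi-cycle wins only because its average CPI (3.8) is below the ratio of the longest instruction to the longest step (800/200 = 4).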

Laundry analogy
- Each task's latency is not reduced!
- Because instruction throughput is improved, program latency is reduced!

Pipelining instruction execution
- Consider the instruction execution steps:
  - Fetch instruction from memory
    - Separate instruction memory (Harvard architecture) vs. single memory (von Neumann)
  - Decode instruction
  - Read operands from register file
  - Perform specified computation
  - Access memory if required
    - Need to compute the effective address beforehand
  - Store result to specified register if needed
- Make sure different pipeline stages can simultaneously work on different instructions
  - Pipelined datapath
  - Pipelined control

Five pipeline stages
This is a "vanilla" design; in fact, some commercial processors follow it: the early MIPS designs and the ARM9 (a very popular embedded processor core).
- Fetch (F): fetch instruction
- Decode (D): decode instruction and read operands
- Execute (X): ALU operation, or address calculation in the case of memory access
- Memory access (M)
- Write back (W): update register file

Pipelining instruction execution
(Figure: five instructions flow through the pipeline, one entering per cycle)

          cc1  cc2  cc3  cc4  cc5  cc6  cc7  cc8  cc9
  ADD     IF   ID   EX   MEM  WB
  SUB          IF   ID   EX   MEM  WB
  LOAD              IF   ID   EX   MEM  WB
  AND                    IF   ID   EX   MEM  WB
  OR                          IF   ID   EX   MEM  WB
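The cycle-by-cycle diagram above can be generated mechanically: in an ideal pipeline with one issue per cycle and no stalls, instruction i occupies stage s during cycle i + s + 1. A minimal sketch:

```python
# Ideal 5-stage pipeline: instruction i occupies stage s during cycle i + s + 1.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timeline(instructions):
    """Map each instruction to {cycle: stage}, assuming one issue per cycle, no stalls."""
    return {
        insn: {i + s + 1: stage for s, stage in enumerate(STAGES)}
        for i, insn in enumerate(instructions)
    }

t = timeline(["ADD", "SUB", "LOAD", "AND", "OR"])
print(t["LOAD"][3])   # LOAD is fetched in cycle 3 -> IF
print(t["OR"][9])     # OR writes back in cycle 9 -> WB
```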

Multi-cycle to pipelined datapath
(Figures: pipeline registers inserted between the stages)
- These registers are flip-flops; inputs are captured on each clock edge

lw in the "F" stage
- Read an instruction from instruction memory; the address is in the PC; compute (PC + 4) to update the PC

lw in the "D" stage
- Read operands from the register file; sign-extend the immediate field

lw in the "X" stage
- Add the base register value and the immediate value to form the memory access address

lw in the "M" stage
- Read a value from memory

lw in the "W" stage
- Update the register file

Pipeline control
- There are multiple instructions in flight (in different pipeline stages)
- Hence, control signals for an instruction should flow through the pipeline stages with the instruction
- Alternatively, with the instruction information kept in each pipeline register, one can generate control signals by decoding that information
- Pipeline control becomes more complex than in previous designs because of potential dependences between instructions in flight
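The five lw stages above can be traced with a toy walkthrough that carries lw's values from stage to stage, mirroring what the pipeline registers would hold. The register and memory contents are made up for illustration:

```python
# Toy walkthrough of  lw $t0, 8($t1)  through the five stages.
# Register and memory contents are hypothetical.
regs = {"$t1": 0x1000, "$t0": 0}
mem = {0x1008: 42}
pc = 0x400000

# F: fetch the instruction, compute PC + 4
insn = {"op": "lw", "rt": "$t0", "base": "$t1", "imm": 8}
pc = pc + 4

# D: read the base register, sign-extend the immediate
base_val = regs[insn["base"]]
imm = insn["imm"]              # already a signed Python int

# X: effective address = base + immediate
addr = base_val + imm          # 0x1008

# M: read the value from data memory
load_val = mem[addr]           # 42

# W: update the register file
regs[insn["rt"]] = load_val

print(hex(addr), regs["$t0"])
```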

Pipeline control (cont.)
(Figure: the control logic generates the control signals during decode; the control signals for X, M, and W travel with the instruction through the pipeline registers, each stage consuming its own signals)

Pipelining performance
- Time between instructions
  - Pipelined: 1 cycle (roughly the time for one instruction divided by the # of pipeline stages)
  - Non-pipelined:
    - Single-cycle: 1 cycle (but it is LONG)
    - Multi-cycle: N cycles (depending on the instruction)
- Pipelining does NOT improve the latency of a single instruction
  - In fact, instruction execution latency can be even longer (why?)
- Pipelining improves "throughput" (what is throughput?)
- It improves the program execution time (why?)

MIPS ISA and pipelining
- Fixed instruction length (4 bytes): allows simple fetching (cf. IA32)
- Few instruction formats; source register fields are fixed: allows quick register operand fetching from the register file
- Load/store architecture: no need to place an arithmetic operation after a memory access (cf. IA32)
- Memory operands are aligned

Pipelining facts summary
- Pipelining increases throughput, not (instruction) latency
- Clock cycle time is limited by the longest pipeline stage
- Potential speedup is roughly the # of pipeline stages
- Time to fill and time to drain the pipeline can affect performance
- Can you explain the trade-offs between the different processor implementation strategies: single-cycle vs. multi-cycle vs. pipelining?
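The "fill and drain" caveat above is easy to quantify: N instructions on a k-stage pipeline with no stalls take k + (N - 1) cycles, so speedup over a k-cycles-per-instruction non-pipelined machine approaches k only for large N. A sketch:

```python
def pipeline_speedup(n_insns, n_stages):
    """Speedup over a non-pipelined machine spending n_stages cycles per instruction,
    assuming equal stage latencies and no stalls."""
    nonpipelined = n_insns * n_stages
    pipelined = n_stages + (n_insns - 1)   # fill, then one completion per cycle
    return nonpipelined / pipelined

print(round(pipeline_speedup(5, 5), 2))       # short run: well below 5
print(round(pipeline_speedup(10_000, 5), 3))  # long run: approaches 5
```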

Pipeline hazards
- Hazards are conditions that hinder seamless instruction execution through the pipeline stages
- Three types of hazards:
  - Structural: the hardware can't support a particular sequence of instructions (due to lack of resources)
  - Data: an instruction depends on a prior instruction (to produce its result) that is still in execution
    - E.g., lw followed by an add instruction using the loaded value
  - Control: we can't decide whether this instruction should be executed, due to a prior branch instruction still in execution

Tackling hazards
- Hardware approach: pipeline "interlock"
  - Detection: continuously check the conditions that lead to a hazard
  - Treatment: insert a "bubble" in the pipeline to delay instruction execution until the condition disappears
    - The bubble is also called a "pipeline stall"
- Software approach
  - Detection: the compiler inspects the generated code and sees if there is an instruction sequence that will lead to a pipeline hazard
  - Treatment: insert a "NOP" instruction to avoid the hazard condition
  - The compiler must have knowledge of the hardware (pipeline)
- Trade-offs?

Structural hazard
- Register file example: let's assume we decided to support an (R + R) addressing mode in memory instructions
  - E.g., lw $4, ($5 + $3) and sw $4, ($5 + $3)
  - In the case of the store instruction above, we need to read from the register file three times
  - If we have only two read ports, what will happen?
  - What about some other addressing mode, like post-increment?
- ALU example: can we use the ALU for branch target computation?

Data hazard
- Consider the following code sequence:

  1       lw   $4, 16($5)
  2 BACK: lw   $5, 4($16)
  3       add  $6, $6, $5
  4       addi $16, $16, 4
  5       bne  $16, $4, BACK

- add (in line 3) has $5 as an input operand, while lw has $5 as its output
  - add can execute only after lw produces its result in $5
- In pipelined execution, when add is in its execution stage, lw is in its memory stage and has not yet produced its result!
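The lw/add dependence above can be found mechanically: an instruction has a read-after-write (RAW) hazard if one of its source registers is the destination of a recent, still-in-flight instruction. A minimal detector sketch, assuming a 5-stage pipeline with no forwarding (the tuple encoding of the loop body is my own):

```python
# Each instruction: (dest, [sources]); None means no register destination.
code = [
    ("$5",  ["$16"]),          # lw   $5, 4($16)
    ("$6",  ["$6", "$5"]),     # add  $6, $6, $5
    ("$16", ["$16"]),          # addi $16, $16, 4
    (None,  ["$16", "$4"]),    # bne  $16, $4, BACK
]

def raw_hazards(code, window=2):
    """Report (producer, consumer) index pairs within `window` instructions.
    window=2 models write-back in stage 5 / register read in stage 2 with a
    same-cycle write-then-read, i.e. two bubbles for back-to-back dependences."""
    hazards = []
    for i, (dest, _) in enumerate(code):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(code))):
            if dest in code[j][1]:
                hazards.append((i, j))
    return hazards

# (0, 1): add needs lw's $5 too soon; (2, 3): bne needs addi's $16 too soon.
print(raw_hazards(code))
```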

Data hazard (cont.)
- How many bubbles do we need for these dependent pairs?

  lw  $5, 4($16)
  add $6, $6, $5

  add $6, $5, $4
  sub $3, $6, $8

- With NOPs inserted by software, the second pair becomes:

  add $6, $5, $4
  nop
  nop
  sub $3, $6, $8

Tackling data hazards
- The source of the problem is the dependency between nearby instructions
- Solutions:
  - Software approach: reorder instructions in the program so that dependent instructions are far apart; insert NOPs if reordering is not sufficient
  - Hardware approach: detect the dependence and stall the second (dependent) instruction in the decode stage until the data becomes ready in the register file
- What are the trade-offs between these two approaches?

Tackling data hazards (cont.)
- The severity of the problem is determined by the difference between the timing of the register file update (first instruction) and the timing of the register file read (second instruction)
  - In our 5-stage design, the register update is in the 5th stage and the register read is in the 2nd stage
  - If a register update and a read can be done within a single cycle, we basically need two bubbles (unused instruction execution slots!) for two dependent instructions back-to-back
- This severity can be tackled by passing the result of the first instruction to the second directly, without going through the register file
  - This is called data forwarding (or sometimes bypassing)

Data forwarding
(Figure: the first instruction's result is forwarded directly to the later instructions that read R1, instead of waiting for the result to reach the register file)
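Under the write/read timing described above, the number of bubbles depends only on the distance between producer and consumer and on whether forwarding exists. A small model (the stage numbering follows the 5-stage design in the text; the lw special case is deliberately left out):

```python
def bubbles_needed(distance, forwarding):
    """Bubbles between an ALU producer and a consumer `distance` instructions later.
    - No forwarding: the result is written in stage 5 and read in stage 2
      (same-cycle write-then-read), so back-to-back needs 2 bubbles.
    - With forwarding: an ALU result reaches the next instruction's X stage
      directly, so 0 bubbles (a load consumer is a separate case, not modeled).
    """
    if forwarding:
        return 0
    return max(0, 3 - distance)

# add $6, $5, $4  followed immediately by  sub $3, $6, $8
print(bubbles_needed(1, forwarding=False))  # 2 -> the two nops on the slide
print(bubbles_needed(1, forwarding=True))   # 0
```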

Data forwarding (cont.)
(Figures: "Back to the future!" The result is forwarded to the instructions that need R1 as soon as it is produced; for a value feeding a store, the forwarding target is the M stage, not the X stage)

Data forwarding hardware
- In fact, the datapath design is straightforward: larger muxes and wires!
- But what about control?

Control hazard
- Consider the following code sequence:

  1      beq $4, $5, FOO
  2      add $6, $6, $10
  3      j   BAR
  4 FOO: sub $6, $6, $10
  5 BAR: ...

- After fetching beq (in line 1), do we know where to go, say to line 4 or to line 2?
  - Whether it will branch or not is not known before beq reaches its execution stage to evaluate ($4 == $5)
- What shall we do?

Control hazard (cont.)
(Figure: a branch flows through F D X M W; the instructions I1, I2, I3 fetched behind it are mis-fetched, and the correct instruction is fetched only after the branch resolves)

Impact of branch stalls
- If CPI = 1, 20% of instructions are branches, and each branch costs 3 stall cycles:
  CPI = 1 + 0.2 × 3 = 1.6
- Solution: determine early whether the branch is taken, and compute the branch target address early
- Example:
  - Move the zero test to the ID stage
  - Add a separate adder to calculate the new PC in the ID stage

Control hazard (cont.)
(Figure: with the branch resolved in D, only one instruction (I1) might have been mis-fetched before the correct instruction is fetched)

Modified pipeline
(Figure: the datapath with the added hardware, the zero test and the target adder, in the ID stage)
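The CPI arithmetic above generalizes directly: effective CPI = base CPI + branch fraction × stall cycles. Moving branch resolution from X/M into D shrinks the stall from 3 cycles to 1:

```python
def effective_cpi(base_cpi, branch_frac, stall_cycles):
    """CPI including branch stalls: every branch adds `stall_cycles` bubbles."""
    return base_cpi + branch_frac * stall_cycles

print(effective_cpi(1.0, 0.20, 3))  # 1.6, as on the slide (branch resolved late)
print(effective_cpi(1.0, 0.20, 1))  # 1.2 with the zero test moved to ID
```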

Pipeline depth vs. branch penalty
- Today's processors employ a deep pipeline (possibly more than 20 stages!) to increase the clock rate
  - More stages means a smaller amount of work per stage, hence a shorter time per stage and a higher clock rate!
- But what about the branch penalty?
  - The penalty depends on the pipeline length!
  - Branches represent 15-20% of all instructions executed
  - 67% of MIPS branches are taken on average
- The situation is compounded by the increased issue bandwidth (superscalar processors)
  - Lost instruction execution opportunities are branch penalty × issue width
- An accurate branch prediction mechanism is needed

Branch handling strategies
- Stall until the branch direction is known
- Predict "NOT TAKEN"
  - Execute the fall-through instructions that follow
  - Squash (or cancel) the instructions in the pipeline if the branch is actually taken
  - (PC + 4) is already computed, so use it to get the next instruction
- Predict "TAKEN"
  - Start fetching from the taken path as soon as the target address becomes available
- Delayed branch

Delayed branch
- Assume that we have "N" fetch slots after a branch before fetching the branch target
  - The instructions in the slots are executed regardless of the branch outcome
  - N is set to the number of cycles needed to resolve the branch
- It's the compiler's job to find instructions to fill the slots
  - If you don't have instructions to fill the slots, put NOPs there

Delayed branch (cont.)
- Simple hardware: no branch interlock
- Possibly more performance: if the slots are filled with useful instructions
- Code size will increase
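The superscalar point above, that lost opportunities are branch penalty × issue width, can be illustrated with a quick model. The "resolve stage minus one" penalty is my own simplifying assumption:

```python
def lost_slots(resolve_stage, issue_width):
    """Instruction slots squandered per stalled/mispredicted branch:
    (cycles until the branch resolves after fetch) x (instructions per cycle)."""
    penalty_cycles = resolve_stage - 1   # fetched in stage 1, resolved later
    return penalty_cycles * issue_width

print(lost_slots(4, 1))   # classic 5-stage, branch resolved in MEM: 3 slots
print(lost_slots(20, 4))  # deep 20-stage pipeline, 4-wide issue: 76 slots
```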

Branch prediction
- Goal: predict the branch outcome (taken or not taken) and the branch target (when taken, where should we go?)
  - In short, just figure out: what's the next PC?
- When do we predict? We predict the next PC while we are fetching from the current PC
- How? We will study this

What to predict: T/NT
- Let's first focus on predicting taken (T) / not taken (NT)
- Static prediction: associate a hint with each branch
  - Always taken
  - Always not taken
  - Forward not taken, backward taken
  - Compiler hint
- Dynamic prediction
  - Simple 1-bit predictor: remember the last behavior (taken/not taken)
  - 2-bit predictor: a bias is added
  - Combined: choose between two predictors

1-bit predictor
- Remember the last behavior of the branch
- Assume the loop-closing test below is a branch:

  for (i = 0; i < 100; i++) {
    A[i] = B[i] * C[i];
    D[i] = E[i] / F[i];
  }

- How many prediction hits and misses when the loop runs repeatedly, as below? What is the prediction accuracy?

  for (j = 0; j < 10; j++) {
    for (i = 0; i < 100; i++) {
      A[i] = B[i] * C[i];
      D[i] = E[i] / F[i];
    }
  }

- Any shortcoming in this scheme?

2-bit predictor
(Figure: 2-bit saturating counter state machine)
- Requires two consecutive mispredictions to flip the predicted direction
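The loop question above can be answered by simulating both predictors. Assume the inner loop-closing branch executes 100 times per outer iteration: 99 taken, then 1 not taken at loop exit. A sketch:

```python
def simulate_1bit(outcomes, init=True):
    """1-bit predictor: predict the last observed direction."""
    state, misses = init, 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken
    return misses

def simulate_2bit(outcomes, init=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    ctr, misses = init, 0
    for taken in outcomes:
        if (ctr >= 2) != taken:
            misses += 1
        ctr = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
    return misses

# Loop-closing branch: 99 taken then 1 not taken, 10 outer iterations.
history = ([True] * 99 + [False]) * 10
print(simulate_1bit(history))   # 19: the 1-bit scheme misses twice per outer loop
print(simulate_2bit(history))   # 10: only the loop-exit branch is mispredicted
```

This is the shortcoming the slide asks about: the 1-bit predictor pays two mispredictions per outer iteration (the exit and the re-entry), while the 2-bit counter pays only one.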

Branch prediction buffer
- A special cache keeping 2^N 2-bit counters, indexed with (part of) the PC
(Figure: N bits of the PC index the table of 2-bit counters; the selected counter yields taken/not taken)

2-bit predictor performance
(Figure: measured performance of the 2-bit predictor)

Correlating predictor
- The behavior of a branch can be correlated with other branches:

  if (aa == 2)
    aa = 0;
  if (bb == 2)
    bb = 0;
  if (aa != bb) { ... }

- Branch history: an N-bit vector keeping the last N branches' outcomes (it's a shift register)
  - 11001101 = TTNNTTNT (the rightmost bit being the most recent)
  - Also called a "global" predictor
- How shall we utilize this branch history?

(m,n) predictor
- Considers the last m branches (an m-bit global history)
- Chooses from 2^m separate BP arrays, each of which has n-bit predictors
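The (m,n) scheme above is easy to sketch: the last m global outcomes select one of 2^m counter tables, and low PC bits select a counter within it. The class below is a minimal model with made-up sizes, shown predicting an alternating branch that a plain 2-bit counter handles poorly:

```python
class CorrelatingPredictor:
    """(m, n) predictor: an m-bit global history selects one of 2**m tables
    of n-bit saturating counters, indexed by the low bits of the PC."""
    def __init__(self, m=2, n=2, pc_bits=4):
        self.m, self.max_ctr = m, (1 << n) - 1
        self.history = 0
        # Initialize every counter to weakly taken.
        self.tables = [[self.max_ctr // 2 + 1] * (1 << pc_bits)
                       for _ in range(1 << m)]
        self.pc_mask = (1 << pc_bits) - 1

    def predict(self, pc):
        ctr = self.tables[self.history][pc & self.pc_mask]
        return ctr > self.max_ctr // 2

    def update(self, pc, taken):
        t, i = self.tables[self.history], pc & self.pc_mask
        t[i] = min(t[i] + 1, self.max_ctr) if taken else max(t[i] - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

p = CorrelatingPredictor()
misses = 0
for k in range(100):
    taken = (k % 2 == 0)          # strictly alternating T, N, T, N, ...
    if p.predict(pc=0x40) != taken:
        misses += 1
    p.update(pc=0x40, taken=taken)
print(misses)   # after a brief warm-up, the history makes it fully predictable
```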

Another way of utilizing history
- Form the index from the PC and the global history (GH)
(Figure: N bits of the PC and the global history pass through a combining function to index the array of 2-bit counters)

(2,2) predictor performance
(Figure: measured performance of the (2,2) predictor)

Combined predictor
- Question: "Is a local predictor or a global predictor better for a specific branch?"
- Choose between a local and a global predictor
- Selector: itself a predictor, used to choose between the two predictors
  (Figure: selector state machine with transitions labeled X/Y, where X means predictor 1 was right and Y means predictor 2 was right)
- Alpha 21264 example
  - A 4K-entry global predictor (2-bit counters), indexed by a 12-bit global history
  - A hierarchical local predictor: a 1K-entry table of 10-bit patterns, indexing 1K-entry 3-bit counters
  - A tournament predictor (selector): 4K-entry 2-bit counters
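The X/Y transition labels above describe a 2-bit selector that moves only when the two component predictors disagree on the outcome. A sketch of that tournament mechanism, with the components stubbed out as plain functions (the Alpha 21264 table sizes are not modeled):

```python
class Tournament:
    """2-bit selector: counter values 0-1 favor predictor A, 2-3 favor B.
    The counter moves only when exactly one component was right."""
    def __init__(self, pred_a, pred_b):
        self.a, self.b, self.sel = pred_a, pred_b, 1   # start weakly favoring A

    def predict(self, pc):
        return self.a(pc) if self.sel < 2 else self.b(pc)

    def update(self, pc, taken):
        a_right = self.a(pc) == taken
        b_right = self.b(pc) == taken
        if a_right and not b_right:
            self.sel = max(self.sel - 1, 0)
        elif b_right and not a_right:
            self.sel = min(self.sel + 1, 3)

# Stub components: A always predicts taken, B always predicts not taken.
t = Tournament(lambda pc: True, lambda pc: False)
for _ in range(4):                 # a stream of not-taken branches...
    t.update(0x40, taken=False)    # ...so B keeps being right
print(t.predict(0x40))             # selector now favors B -> False
```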

Local vs. global?
(Figure: local vs. global predictor accuracy; there is a "sweet spot")

Combined predictor performance
(Figure: combined predictor performance results)

What to predict: target
- Remember: the goal of branch prediction is to determine the next PC (the target of fetching) every cycle
- Requirements:
  - When fetching a branch, we need to predict (simultaneously with fetching) whether it's going to be taken or not; we talked about this
  - At the same time, we need to determine the target of the branch if the branch is predicted taken; we are going to talk about this

Target prediction
- It's much more difficult to "predict" the target
  - Taken/not taken: just two cases
  - A 32-bit target has 2^32 possibilities!
- But the taken target remains the same! So just remember the last target

Target prediction w/ BTB
- BTB = Branch Target Buffer
- Use the "PC" to look it up; why the PC?
- A "match" means it's a branch for sure
  - All-bit matching is needed; why?
- If there is a match and the entry says Predicted Taken, use the stored target for the next PC
- When there is no match but the instruction is a branch (detected later):
  - Use some other prediction
  - Assume it's not taken
- After processing a branch, update the BTB with the correct information

BTB performance
- Two bad cases:
  - A wrong prediction (or "misprediction")
  - A branch not found in the BTB
- Example: prediction accuracy is 90%, the hit rate in the BTB is 90%, the penalty is 2 cycles, and 60% of branches are taken
  - Probability(branch in buffer & mispredicted) = 90% × 10% = 0.09
  - Probability(branch not in buffer & taken) = 10% × 60% = 0.06
  - Branch penalty = (0.09 + 0.06) × 2 = 0.30 (cycles)
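The BTB arithmetic above generalizes to: average penalty per branch = (P(hit) × P(mispredict) + P(miss) × P(taken)) × per-event cycles. As a check of the slide's example:

```python
def btb_branch_penalty(hit_rate, accuracy, taken_frac, penalty_cycles):
    """Average stall cycles per branch with a BTB. The two bad cases:
    - BTB hit but the direction/target was mispredicted
    - BTB miss on a branch that turns out taken (fall-through was fetched)."""
    p_bad = hit_rate * (1 - accuracy) + (1 - hit_rate) * taken_frac
    return p_bad * penalty_cycles

print(round(btb_branch_penalty(0.9, 0.9, 0.6, 2), 2))   # 0.3 cycles, as on the slide
```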

What about "indirect jumps"?
- Indirect jumps: the branch target is not unique
  - E.g., jr $31
  - The BTB has a single target PC per entry, so it can't store multiple targets
  - Even with multiple targets stored, how would we choose the right one?
- Fortunately, most indirect jumps are function returns
  - The return target can be predicted using a stack: the Return Address Stack (RAS)
  - The basic idea is to keep storing all the return addresses in a Last-In-First-Out manner

RAS performance
(Figure: RAS performance results)

Performance of branch prediction
(Figure: results on a hypothetical "64-issue" superscalar processor model with a 2k-instruction window)

Some key points
- Hazards result in pipeline bubbles
  - Performance degradation: non-ideal, higher-than-1 CPI
- Data hazards require a dependent instruction to wait for its operands to become available
  - Forwarding helps reduce the penalty
  - Stalls are still required in some cases
- Control hazards require control-dependent instructions to wait for the branch to be resolved
- Branch prediction is an integral mechanism found in all high-performance processors
  - Program performance is very sensitive to branch prediction accuracy in modern pipelined processors
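The LIFO idea above is just a stack maintained at fetch time: push the return address when a call is fetched, pop when a return is fetched. A minimal sketch with a fixed depth; dropping the oldest entry on overflow is an assumed policy, not something the slides specify:

```python
class ReturnAddressStack:
    """Predicts the targets of function returns (e.g., MIPS `jr $31`)."""
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # assumed policy: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x400104)   # main calls f; push the instruction after the call
ras.on_call(0x400210)   # f calls g
print(hex(ras.predict_return()))   # 0x400210: g returns into f
print(hex(ras.predict_return()))   # 0x400104: f returns into main
```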

What is instruction level parallelism?
- Execute independent instructions in parallel
  - Provide more hardware function units (e.g., adders, cache ports)
  - Detect instructions that can be executed in parallel (in hardware or software)
  - Schedule instructions onto multiple function units (in hardware or software)
- The goal is to improve instruction throughput
- Static, compile-time vs. dynamic, run-time approaches
- Data dependence and control dependence place high bars

Key questions
- How does it differ from general parallel processing? How does it differ from pipelining?
- How do we find parallel instructions?
- What is the role of the ISA for ILP packaging?
  - VLIW approach (the static, compiler approach)
  - Superscalar approach (the dynamic, hardware approach)
- How can we e

