MIPS Pipeline - Oregon State University

2y ago
43 Views
2 Downloads
5.31 MB
84 Pages
Last View : 5m ago
Last Download : 3m ago
Upload by : Amalia Wilborn
Transcription

MIPS Pipelinen Five stages, one step per stage1.2.3.4.5.IF: Instruction fetch from memoryID: Instruction decode & register readEX: Execute operation or calculate addressMEM: Access memory operandWB: Write result back to registerChapter 4 — The Processor — 1

Pipeline Performancen Assume time for stages isn n n 100ps for register read or write200ps for other stagesCompare pipelined datapath with single-cycledatapathInstrInstr fetch RegisterreadALU opMemoryaccessRegisterwriteTotal timelw200ps100 ps200ps200ps100 ps800pssw200ps100 ps200ps200psR-format200ps100 ps200psbeq200ps100 ps200ps700ps100 ps600ps500psChapter 4 — The Processor — 2

Pipeline PerformanceSingle-cycle (Tc 800ps)Pipelined (Tc 200ps)Chapter 4 — The Processor — 3

Pipeline Speedupn If all stages are balancedn n n n i.e., all take the same timeTime between instructionspipelined Time between instructionsnonpipelinedNumber of stagesIf not balanced, speedup is lessSpeedup due to increased throughputn Latency (time for each instruction) does notdecreaseChapter 4 — The Processor — 4

Pipelining and ISA Designn MIPS ISA designed for pipeliningn All instructions are 32-bitsn n n Few and regular instruction formatsn n Can decode and read registers in one stepLoad/store addressingn n Easier to fetch and decode in one cyclec.f. x86: 1- to 17-byte instructionsCan calculate address in 3rd stage, access memoryin 4th stageAlignment of memory operandsn Memory access takes only one cycleChapter 4 — The Processor — 5

Hazardsn n Situations that prevent starting the nextinstruction in the next cycleStructure hazardsn n Data hazardn n A required resource is busyNeed to wait for previous instruction tocomplete its data read/writeControl hazardn Deciding on control action depends onprevious instructionChapter 4 — The Processor — 6

Structure Hazardsn n Conflict for use of a resourceIn MIPS pipeline with a single memoryn n Load/store requires data accessInstruction fetch would have to stall for thatcyclen n Would cause a pipeline “bubble”Hence, pipelined datapaths requireseparate instruction/data memoriesn Or separate instruction/data cachesChapter 4 — The Processor — 7

Data Hazardsn An instruction depends on completion ofdata access by a previous instructionn addsub s0, t0, t1 t2, s0, t3Chapter 4 — The Processor — 8

Forwarding (aka Bypassing)n Use result when it is computedn n Don’t wait for it to be stored in a registerRequires extra connections in the datapathChapter 4 — The Processor — 9

Load-Use Data Hazardn Can’t always avoid stalls by forwardingn n If value not computed when neededCan’t forward backward in time!Chapter 4 — The Processor — 10

Code Scheduling to Avoid Stallsn n Reorder code to avoid use of load result inthe next instructionC code for A B E; C B F;stallstalllwlwaddswlwaddsw t1, t2, t3, t3, t4, t5, t5,0( t0)4( t0) t1, t212( t0)8( t0) t1, t416( t0)13 cycleslwlwlwaddswaddsw t1, t2, t4, t3, t3, t5, t5,0( t0)4( t0)8( t0) t1, t212( t0) t1, t416( t0)11 cyclesChapter 4 — The Processor — 11

Control Hazardsn Branch determines flow of controln n Fetching next instruction depends on branchoutcomePipeline can’t always fetch correct instructionn n Still working on ID stage of branchIn MIPS pipelinen n Need to compare registers and computetarget early in the pipelineAdd hardware to do it in ID stageChapter 4 — The Processor — 12

Stall on Branchn Wait until branch outcome determinedbefore fetching next instructionChapter 4 — The Processor — 13

Branch Predictionn Longer pipelines can’t readily determinebranch outcome earlyn n Predict outcome of branchn n Stall penalty becomes unacceptableOnly stall if prediction is wrongIn MIPS pipelinen n Can predict branches not takenFetch instruction after branch, with no delayChapter 4 — The Processor — 14

MIPS with Predict Not TakenPredictioncorrectPredictionincorrectChapter 4 — The Processor — 15

More-Realistic Branch Predictionn Static branch predictionn n Based on typical branch behaviorExample: loop and if-statement branchesn n n Predict backward branches takenPredict forward branches not takenDynamic branch predictionn Hardware measures actual branch behaviorn n e.g., record recent history of each branchAssume future behavior will continue the trendn When wrong, stall while re-fetching, and update historyChapter 4 — The Processor — 16

Pipeline SummaryThe BIG Picturen Pipelining improves performance byincreasing instruction throughputn n n Subject to hazardsn n Executes multiple instructions in parallelEach instruction has the same latencyStructure, data, controlInstruction set design affects complexity ofpipeline implementationChapter 4 — The Processor — 17

§4.6 Pipelined Datapath and ControlMIPS Pipelined DatapathMEMRight-to-leftflow leads tohazardsWBChapter 4 — The Processor — 18

Pipeline registersn Need registers between stagesn To hold information produced in previous cycleChapter 4 — The Processor — 19

Pipeline Operationn Cycle-by-cycle flow of instructions throughthe pipelined datapathn “Single-clock-cycle” pipeline diagramn n n c.f. “multi-clock-cycle” diagramn n Shows pipeline usage in a single cycleHighlight resources usedGraph of operation over timeWe’ll look at “single-clock-cycle” diagramsfor load & storeChapter 4 — The Processor — 20

IF for Load, Store, Chapter 4 — The Processor — 21

ID for Load, Store, Chapter 4 — The Processor — 22

EX for LoadChapter 4 — The Processor — 23

MEM for LoadChapter 4 — The Processor — 24

WB for LoadWrongregisternumberChapter 4 — The Processor — 25

Corrected Datapath for LoadChapter 4 — The Processor — 26

EX for StoreChapter 4 — The Processor — 27

MEM for StoreChapter 4 — The Processor — 28

WB for StoreChapter 4 — The Processor — 29

Multi-Cycle Pipeline Diagramn Form showing resource usageChapter 4 — The Processor — 30

Multi-Cycle Pipeline Diagramn Traditional formChapter 4 — The Processor — 31

Single-Cycle Pipeline Diagramn State of pipeline in a given cycleChapter 4 — The Processor — 32

Pipelined Control (Simplified)Chapter 4 — The Processor — 33

Pipelined Controln Control signals derived from instructionn As in single-cycle implementationChapter 4 — The Processor — 34

Pipelined ControlChapter 4 — The Processor — 35

n Consider this sequence:subandoraddswn 2, 1, 3 12, 2, 5 13, 6, 2 14, 2, 2 15,100( 2)We can resolve hazards with forwardingn §4.7 Data Hazards: Forwarding vs. StallingData Hazards in ALU InstructionsHow do we detect when to forward?Chapter 4 — The Processor — 36

Dependencies & ForwardingChapter 4 — The Processor — 37

Detecting the Need to Forwardn Pass register numbers along pipelinen n ALU operand register numbers in EX stageare given byn n e.g., ID/EX.RegisterRs register number for Rssitting in ID/EX pipeline registerID/EX.RegisterRs, ID/EX.RegisterRtData hazards when1a. EX/MEM.RegisterRd ID/EX.RegisterRs1b. EX/MEM.RegisterRd ID/EX.RegisterRt2a. MEM/WB.RegisterRd ID/EX.RegisterRs2b. MEM/WB.RegisterRd ID/EX.RegisterRtFwd fromEX/MEMpipeline regFwd fromMEM/WBpipeline regChapter 4 — The Processor — 38

Detecting the Need to Forwardn But only if forwarding instruction will writeto a register!n n EX/MEM.RegWrite, MEM/WB.RegWriteAnd only if Rd for that instruction is not zeron EX/MEM.RegisterRd 0,MEM/WB.RegisterRd 0Chapter 4 — The Processor — 39

Forwarding PathsChapter 4 — The Processor — 40

Forwarding Conditionsn EX hazardn n n if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0)and (EX/MEM.RegisterRd ID/EX.RegisterRs))ForwardA 10if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0)and (EX/MEM.RegisterRd ID/EX.RegisterRt))ForwardB 10MEM hazardn n if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0)and (MEM/WB.RegisterRd ID/EX.RegisterRs))ForwardA 01if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0)and (MEM/WB.RegisterRd ID/EX.RegisterRt))ForwardB 01Chapter 4 — The Processor — 41

Double Data Hazardn Consider the sequence:add 1, 1, 2add 1, 1, 3add 1, 1, 4n Both hazards occurn n Want to use the most recentRevise MEM hazard conditionn Only fwd if EX hazard condition isn’t trueChapter 4 — The Processor — 42

Revised Forwarding Conditionn MEM hazardn n if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0)and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0)and (EX/MEM.RegisterRd ID/EX.RegisterRs))and (MEM/WB.RegisterRd ID/EX.RegisterRs))ForwardA 01if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0)and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0)and (EX/MEM.RegisterRd ID/EX.RegisterRt))and (MEM/WB.RegisterRd ID/EX.RegisterRt))ForwardB 01Chapter 4 — The Processor — 43

Datapath with ForwardingChapter 4 — The Processor — 44

Load-Use Data HazardNeed to stallfor one cycleChapter 4 — The Processor — 45

Load-Use Hazard Detectionn n Check when using instruction is decodedin ID stageALU operand register numbers in ID stageare given byn n Load-use hazard whenn n IF/ID.RegisterRs, IF/ID.RegisterRtID/EX.MemRead and((ID/EX.RegisterRt IF/ID.RegisterRs) or(ID/EX.RegisterRt IF/ID.RegisterRt))If detected, stall and insert bubbleChapter 4 — The Processor — 46

How to Stall the Pipelinen Force control values in ID/EX registerto 0n n EX, MEM and WB do nop (no-operation)Prevent update of PC and IF/ID registern n n Using instruction is decoded againFollowing instruction is fetched again1-cycle stall allows MEM to read data for lwn Can subsequently forward to EX stageChapter 4 — The Processor — 47

Stall/Bubble in the PipelineStall insertedhereChapter 4 — The Processor — 48

Stall/Bubble in the PipelineOr, moreaccurately Chapter 4 — The Processor — 49

Datapath with Hazard DetectionChapter 4 — The Processor — 50

Stalls and PerformanceThe BIG Picturen Stalls reduce performancen n But are required to get correct resultsCompiler can arrange code to avoidhazards and stallsn Requires knowledge of the pipeline structureChapter 4 — The Processor — 51

n If branch outcome determined in MEM§4.8 Control HazardsBranch HazardsFlush theseinstructions(Set controlvalues to 0)PCChapter 4 — The Processor — 52

Reducing Branch Delayn Move hardware to determine outcome to IDstagen n n Target address adderRegister comparatorExample: branch taken36:40:44:48:52:56:72:subbeqandoraddslt.lw 10, 1, 12, 13, 14, 15, 4, 3, 2, 2, 4, 6, 87 5 6 2 7 4, 50( 7)Chapter 4 — The Processor — 53

Example: Branch TakenChapter 4 — The Processor — 54

Example: Branch TakenChapter 4 — The Processor — 55

Data Hazards for Branchesn If a comparison register is a destination of2nd or 3rd preceding ALU instructionadd 1, 2, 3IFadd 4, 5, 6 beq 1, 4, targetn IDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBCan resolve using forwardingChapter 4 — The Processor — 56

Data Hazards for Branchesn If a comparison register is a destination ofpreceding ALU instruction or 2nd precedingload instructionn lwNeed 1 stall cycle 1, addrIFadd 4, 5, 6beq stalledbeq 1, 4, targetIDEXMEMWBIFIDEXMEMWBIFIDIDEXMEMWBChapter 4 — The Processor — 57

Data Hazards for Branchesn If a comparison register is a destination ofimmediately preceding load instructionn lwNeed 2 stall cycles 1, addrIFbeq stalledbeq stalledbeq 1, 0, targetIDEXIFIDMEMWBIDIDEXMEMWBChapter 4 — The Processor — 58

Dynamic Branch Predictionn n In deeper and superscalar pipelines, branchpenalty is more significantUse dynamic predictionn n n n Branch prediction buffer (aka branch history table)Indexed by recent branch instruction addressesStores outcome (taken/not taken)To execute a branchn n n Check table, expect the same outcomeStart fetching from fall-through or targetIf wrong, flush pipeline and flip predictionChapter 4 — The Processor — 59

1-Bit Predictor: Shortcomingn Inner loop branches mispredicted twice!outer: inner: beq , , inner beq , , outern n Mispredict as taken on last iteration ofinner loopThen mispredict as not taken on firstiteration of inner loop next time aroundChapter 4 — The Processor — 60

2-Bit Predictorn Only change prediction on two successivemispredictionsChapter 4 — The Processor — 61

Calculating the Branch Targetn Even with predictor, still need to calculatethe target addressn n 1-cycle penalty for a taken branchBranch target buffern n Cache of target addressesIndexed by PC when instruction fetchedn If hit and instruction is branch predicted taken, canfetch target immediatelyChapter 4 — The Processor — 62

n “Unexpected” events requiring changein flow of controln n Different ISAs use the terms differentlyExceptionn Arises within the CPUn n e.g., undefined opcode, overflow, syscall, Interruptn n §4.9 ExceptionsExceptions and InterruptsFrom an external I/O controllerDealing with them without sacrificingperformance is hardChapter 4 — The Processor — 63

Handling Exceptionsn n In MIPS, exceptions managed by a SystemControl Coprocessor (CP0)Save PC of offending (or interrupted) instructionn n In MIPS: Exception Program Counter (EPC)Save indication of the problemn n In MIPS: Cause registerWe’ll assume 1-bitn n 0 for undefined opcode, 1 for overflowJump to handler at 8000 00180Chapter 4 — The Processor — 64

Handler Actionsn n n Read cause, and transfer to relevant handlerDetermine action requiredIf restartablen n n Take corrective actionuse EPC to return to programOtherwisen n Terminate programReport error using EPC, cause, Chapter 4 — The Processor — 65

Exceptions in a Pipelinen n Another form of control hazardConsider overflow on add in EX stageadd 1, 2, 1n Prevent 1 from being clobberedn Complete previous instructionsn Flush add and subsequent instructionsn Set Cause and EPC register valuesn Transfer control to handlern Similar to mispredicted branchn Use much of the same hardwareChapter 4 — The Processor — 66

Speculationn “Guess” what to do with an instructionn n Start operation as soon as possibleCheck whether guess was rightn n n n If so, complete the operationIf not, roll-back and do the right thingCommon to static and dynamic multiple issueExamplesn Speculate on branch outcomen n Roll back if path taken is differentSpeculate on loadn Roll back if location is updatedChapter 4 — The Processor — 67

Compiler/Hardware Speculationn Compiler can reorder instructionsn n n e.g., move load before branchCan include “fix-up” instructions to recoverfrom incorrect guessHardware can look ahead for instructionsto executen n Buffer results until it determines they areactually neededFlush buffers on incorrect speculationChapter 4 — The Processor — 68

Static Multiple Issuen Compiler groups instructions into “issuepackets”n n n Group of instructions that can be issued on asingle cycleDetermined by pipeline resources requiredThink of an issue packet as a very longinstructionn n Specifies multiple concurrent operations Very Long Instruction Word (VLIW)Chapter 4 — The Processor — 69

Scheduling Static Multiple Issuen Compiler must remove some/all hazardsn n n Reorder instructions into issue packetsNo dependencies with a packetPossibly some dependencies betweenpacketsn n Varies between ISAs; compiler must know!Pad with nop if necessaryChapter 4 — The Processor — 70

MIPS with Static Dual Issuen Two-issue packetsn n n One ALU/branch instructionOne load/store instruction64-bit alignedn n ALU/branch, then load/storePad an unused instruction with nopAddressInstruction typePipeline StagesnALU/branchIFIDEXMEMWBn 4Load/storeIFIDEXMEMWBn 8ALU/branchIFIDEXMEMWBn 12Load/storeIFIDEXMEMWBn 16ALU/branchIFIDEXMEMWBn 20Load/storeIFIDEXMEMWBChapter 4 — The Processor — 71

MIPS with Static Dual IssueChapter 4 — The Processor — 72

Hazards in the Dual-Issue MIPSn n More instructions executing in parallelEX data hazardn n Forwarding avoided stalls with single-issueNow can’t use ALU result in load/store in same packetn n n Load-use hazardn n add t0, s0, s1load s2, 0( t0)Split into two packets, effectively a stallStill one cycle use latency, but now two instructionsMore aggressive scheduling requiredChapter 4 — The Processor — 73

Scheduling Examplen Schedule this for dual-issue MIPSLoop: lwadduswaddibneLoop:n t0, t0, t0, s1, s1,0( s1) t0, s20( s1) s1,–4 zero, Loop##### t0 array elementadd scalar in s2store resultdecrement pointerbranch s1! 0ALU/branchLoad/storecyclenoplw1addi s1, s1,–4nop2addu t0, t0, s2nop3bnesw s1, zero, Loop t0, 0( s1) t0, 4( s1)4IPC 5/4 1.25 (c.f. peak IPC 2)Chapter 4 — The Processor — 74

Loop Unrollingn Replicate loop body to expose more parallelismn n Reduces loop-control overheadUse different registers per replicationn n Called “register renaming”Avoid loop-carried “anti-dependencies”n n Store followed by a load of the same registerAka “name dependence”n Reuse of a register nameChapter 4 — The Processor — 75

Loop Unrolling ExampleLoop:ALU/branchLoad/storecycleaddi s1, s1,–16lw t0, 0( s1)1noplw t1, 12( s1)2addu t0, t0, s2lw t2, 8( s1)3addu t1, t1, s2lw t3, 4( s1)4addu t2, t2, s2sw t0, 16( s1)5addu t3, t4, s2sw t1, 12( s1)6nopsw t2, 8( s1)7sw t3, 4( s1)8bnen s1, zero, LoopIPC 14/8 1.75n Closer to 2, but at cost of registers and code sizeChapter 4 — The Processor — 76

Dynamic Multiple Issuen n “Superscalar” processorsCPU decides whether to issue 0, 1, 2, each cyclen n Avoiding structural and data hazardsAvoids the need for compiler schedulingn n Though it may still helpCode semantics ensured by the CPUChapter 4 — The Processor — 77

Speculationn Predict branch and continue issuingn n Don’t commit until branch outcome determinedLoad speculationn Avoid load and cache miss delayn n n n n Predict the effective addressPredict loaded valueLoad before completing outstanding storesBypass stored values to load unitDon’t commit load until speculation clearedChapter 4 — The Processor — 78

Why Do Dynamic Scheduling?n n Why not just let the compiler schedule code?Not all stalls are predicablen n Can’t always schedule around branchesn n e.g., cache missesBranch outcome is dynamically determinedDifferent implementations of an ISA havedifferent latencies and hazardsChapter 4 — The Processor — 79

Does Multiple Issue Work?The BIG Picturen n n Yes, but not as much as we’d likePrograms have real dependencies that limit ILPSome dependencies are hard to eliminaten n Some parallelism is hard to exposen n Limited window size during instruction issueMemory delays and limited bandwidthn n e.g., pointer aliasingHard to keep pipelines fullSpeculation can help if done wellChapter 4 — The Processor — 80

Power Efficiencyn n Complexity of dynamic scheduling andspeculations requires powerMultiple simpler cores may be betterMicroprocessorYearClock No110WPentium Pro1997200MHz103Yes129WP4 Willamette20012000MHz223Yes175WP4 s275WUltraSparc III20031950MHz144No190WUltraSparc T120051200MHz61No870WChapter 4 — The Processor — 81

72 physicalregisters§4.11 Real Stuff: The AMD Opteron X4 (Barcelona) PipelineThe Opteron X4 MicroarchitectureChapter 4 — The Processor — 82

The Opteron X4 Pipeline Flown For integer operationsn n n FP is 5 stages longerUp to 106 RISC-ops in progressBottlenecksn n n Complex instructions with long dependenciesBranch mispredictionsMemory access delaysChapter 4 — The Processor — 83

§4.13 Fallacies and PitfallsFallaciesn Pipelining is easy (!)n n The basic idea is easyThe devil is in the detailsn n e.g., detecting data hazardsPipelining is independent of technologyn n n So why haven’t we always done pipelining?More transistors make more advanced techniques feasiblePipeline-related ISA design needs to take account oftechnology trendsn e.g., predicated instructionsChapter 4 — The Processor — 84

MIPS ISA designed for pipelining ! All instructions are 32-bits ! Easier to fetch and decode in one cycle ! c.f. x86: 1- to 17-byte instructions ! Few and regular instruction formats ! Can decode and read registers in one step ! Load/store addressing ! Can cal

Related Documents:

bits, gọi là MIPS-64. MIPS xem xét trong môn học này là MIPS làm việc với các thanh ghi chỉ 32 bit, gọi là MIPS-32. ÞTrong phạm vi môn học này, MIPS dùng chung sẽ hiểu là MIPS-32 Tóm lại, chỉ có 3 loại toán hạng trong một lệnh của MIPS 1. Toán hạng thanh ghi (Register Operands) 2.

Table 1: How 2020 MIPS Final Scores Relate to 2022 MIPS Payment Adjustments Final Score Points MIPS Payment Adjustment 0.00 – 11.25 points Negative (-) MIPS payment adjustment of -9% 11.26 – 44.99 points Negative (-) MIPS payment adjustment, between 0% and -9%, on a linear sliding scale 45.00 points (Performance threshold 45.00 points)

Performance on EEMBC benchmarks aggregate for Consumer, Telecom, Office, Network, based on ARM1136J-S (Freescale i.MX31), ARM1026EJ-S, Tensilica Diamond 570T, T1050 and T1030, MIPS 20K, NECVR5000). MIPS M4K, MIPS 4Ke, MIPS 4Ks, MIPS 24K, ARM 968E-S, ARM 966E-S, ARM926EJ-S, ARM7TDMI-S scaled by ratio of Dhrystone MIPS within architecture family.

ACOs in MIPS receive advantages by being scored under the MIPS APM Scoring Standard, which gives ACOs favorable treatment for their commitment to value-base care. Based on the low bar set for 2019 reporting in MIPS, ACOs should easily avoid penalties under MIPS and will be eligible for MIPS bonuses and exceptional performance bonuses.

Chapter 1: Getting started with mips Remarks This section provides an overview of what mips is, and why a developer might want to use it. It should also mention any large subjects within mips, and link out to the related topics. Since the Documentation for mips is new, you may need to create initial versions of those related topics. Examples

Pipeline device hardware, Pipeline network, the Pipeline host hardware, the Pipeline application software and the Pipeline media disk storage systems. Each of these components is described below. Pipeline device hardware Pipeline device hardware has up to four independent channels with SDI I/O for capture and play out. The SDI

CSE 30321 - Lecture 07 - Introduction to the MIPS ISA 1 Lecture 07 Introduction to the MIPS ISA University of Notre Dame CSE 30321 - Lecture 07 - Introduction to the MIPS ISA Shortcomings of the simple processor – Only 16 bits f

Page 2 of 22 ODNR‐DSWR Pipeline Standard 12‐3‐13 Pipeline ‐ The pipeline and its related appurtenances. Pipeline Company ‐ The entity responsible for installing the pipeline, its successors, and assigns, on its own behalf and as operator of the company. Right‐of‐Way ‐ Includes the permanent and temporary easements that the pipeline company acquires