ELE 475 / COS 475 Computer Architecture Lecture 13

Computer Architecture
ELE 475 / COS 475
Slide Deck 12: Multithreading
David Wentzlaff
Department of Electrical Engineering
Princeton University

Agenda
- Multithreading Motivation
- Coarse Grain Multithreading
- Simultaneous Multithreading

Multithreading
- Difficult to continue to extract instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single sequential thread of control
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards
                  t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
LW r1, 0(r2)      F   D   X   M   W
LW r5, 12(r1)         F   D   D   D   D   X   M   W
ADDI r5, r5, #12          F   F   F   F   D   D   D   D   X   M   W
SW 12(r1), r5                             F   F   F   F   D   D   D   D
- Each instruction may depend on the one before it
- What is usually done to cope with this?
  - interlocks (slow)
  - or bypassing (needs hardware, doesn't help all hazards)

Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
                      t0  t1  t2  t3  t4  t5  t6  t7  t8  t9
T1: LW r1, 0(r2)      F   D   X   M   W
T2: ADD r7, r1, r4        F   D   X   M   W
T3: XORI r5, r4, #12          F   D   X   M   W
T4: SW 0(r7), r5                  F   D   X   M   W
T1: LW r5, 12(r1)                     F   D   X   M   W
- The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file
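The interleaving argument above can be checked with a small model. The sketch below is illustrative only: the function names are hypothetical, one instruction enters the pipe per cycle in strict round-robin order, and it assumes the register file cannot be written and read in the same cycle, so write-back must come strictly before the next decode.

```python
STAGES = ["F", "D", "X", "M", "W"]  # 5-stage pipe, no bypassing


def interleave_schedule(num_threads, num_instructions):
    """Issue one instruction per cycle, rotating over the threads.

    Returns a list of (thread_id, {stage: cycle}) in issue order.
    """
    schedule = []
    for i in range(num_instructions):
        thread = i % num_threads
        timing = {stage: i + k for k, stage in enumerate(STAGES)}
        schedule.append((thread, timing))
    return schedule


def writeback_precedes_decode(schedule):
    """True if each thread's W stage finishes before its next D stage."""
    last_writeback = {}
    for thread, timing in schedule:
        if thread in last_writeback and last_writeback[thread] >= timing["D"]:
            return False
        last_writeback[thread] = timing["W"]
    return True


print(writeback_precedes_decode(interleave_schedule(4, 8)))  # 4 threads
print(writeback_precedes_decode(interleave_schedule(3, 8)))  # 3 threads
```

Under these assumptions four threads are enough for this 5-stage pipe (W at t4, next decode at t5), while three are not, because the next decode would land in the same cycle as the write-back.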

Simple Multithreaded Pipeline
[Figure: pipeline datapath with replicated PCs and GPR files, chosen by a thread-select signal]
- Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs

Multithreading Costs
- Each thread requires its own user state
  - PC
  - GPRs
- Also, needs its own system state
  - virtual memory page table base register
  - exception handling registers
  - other system state
- Other overheads:
  - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)

Thread Scheduling Policies
- Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
  - Can potentially remove bypassing and interlocking logic
- Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme
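A rough way to compare fixed interleave with hardware-controlled scheduling is to replay the same per-cycle readiness under both policies. This is a sketch under stated assumptions: the function names and the `ready` input format are hypothetical, readiness is given up front rather than derived from earlier scheduling decisions, and the hardware priority scheme is modeled as simple round-robin, which the slides do not specify.

```python
def fixed_interleave(ready, num_threads):
    """Cycle c belongs to thread c % N; an unready owner costs a bubble.

    ready[c] is the set of thread ids that could issue in cycle c.
    Returns the thread issued in each cycle, or None for a bubble.
    """
    issued = []
    for cycle, ready_set in enumerate(ready):
        owner = cycle % num_threads
        issued.append(owner if owner in ready_set else None)
    return issued


def hardware_scheduled(ready, num_threads):
    """Each cycle pick the next ready thread after the last one issued."""
    issued = []
    last = num_threads - 1
    for ready_set in ready:
        choice = None
        for offset in range(1, num_threads + 1):
            candidate = (last + offset) % num_threads
            if candidate in ready_set:
                choice = candidate
                break
        if choice is not None:
            last = choice
        issued.append(choice)
    return issued


ready = [{0, 1}, {0}, {1}, {0, 1}]
print(fixed_interleave(ready, 2))    # [0, None, None, 1]
print(hardware_scheduled(ready, 2))  # [0, 0, 1, 0]
```

In this toy trace the fixed interleave wastes two slots on bubbles, while the readiness-aware scheduler fills every cycle, at the cost of hardware that tracks which threads are ready.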

Coarse-Grain Hardware Multithreading
- Some architectures do not have many low-latency bubbles
- Add support for a few threads to hide occasional cache miss latency
- Swap threads in hardware on cache miss
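The benefit of swapping threads on a cache miss can be sketched with a toy cycle count. Everything here is an assumption made for illustration: the function name, the boolean trace format, a one-cycle issue per instruction, a fixed miss latency during which the missing thread cannot run, and a free thread switch.

```python
def run_coarse_grain(traces, miss_latency):
    """Total cycles to run every thread's trace to completion.

    traces[t] is a list of booleans for thread t; True marks an
    instruction that misses in the cache (hypothetical trace format).
    """
    num_threads = len(traces)
    position = [0] * num_threads   # next instruction index per thread
    ready_at = [0] * num_threads   # cycle when each thread may run again
    cycle = 0
    current = 0
    while any(position[t] < len(traces[t]) for t in range(num_threads)):
        runnable = [t for t in range(num_threads)
                    if position[t] < len(traces[t]) and ready_at[t] <= cycle]
        if not runnable:
            cycle += 1             # every unfinished thread waits on a miss
            continue
        if current not in runnable:
            current = runnable[0]  # hardware swaps in another thread
        missed = traces[current][position[current]]
        position[current] += 1
        cycle += 1
        if missed:
            ready_at[current] = cycle + miss_latency
    return cycle


trace = [False, True, False, False]
print(run_coarse_grain([trace], miss_latency=10))         # 14
print(run_coarse_grain([trace, trace], miss_latency=10))  # 16
```

With these numbers the second thread costs only two extra cycles because most of its work overlaps the first thread's miss; running the two traces back to back would take 28 cycles.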

Denelcor HEP (Burton Smith, 1982)
Image: BRL HEP Machine (c-computers/png/hep2.png)
- First commercial machine to use hardware threading in main CPU
  - 120 threads per processor
  - 10 MHz clock rate
  - Up to 8 processors
  - precursor to Tera MTA / Cray XMT (Multithreaded Architecture)

Tera (Cray) MTA (1990)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 kW/processor @ 260 MHz
  - Second version CMOS, MTA-2, 50 W/processor
  - New version, XMT, fits into AMD Opteron socket, runs at 500 MHz
Image credit: Tera Computer Company

MTA Pipeline
[Figure: MTA pipeline; labels: Inst Fetch, Issue Pool, M, A, C, W, Write Pool, Memory Pool, Retry Pool, Interconnection Network, Memory pipeline]
- Every cycle, one VLIW instruction from one active thread is launched into pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur 150 cycles of latency

MIT Alewife (1990)
- Modified SPARC chips
  - register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss
Image credit: MIT

Oracle/Sun Niagara processors
- Target is datacenters running web servers and databases, with many concurrent requests
- Provide multiple simple cores each with multiple hardware threads, reduced energy/operation though much lower single thread performance
- Niagara-1 [2004], 8 cores, 4 threads/core
- Niagara-2 [2007], 8 cores, 8 threads/core
- Niagara-3 [2009], 16 cores, 8 threads/core

Oracle/Sun Niagara-3, “Rainbow Falls” 2009
[Figures: image credit Oracle/Sun, from Hot Chips 2009 presentation by Sanjay Patel]

Simultaneous Multithreading (SMT) for OOO Superscalars
- Techniques presented so far have all been “vertical” multithreading where each pipeline stage works on one thread at a time
- SMT uses fine-grain control already present inside an OOO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. Gives better utilization of machine resources.

Ideal Superscalar Multithreading
[Tullsen, Eggers, Levy, UW, 1995]
[Figure: issue-slot diagram; axes: issue width, time]
- Interleave multiple threads to multiple issue slots with no restrictions

For most apps, most execution units lie idle in an OOO superscalar
[Figure: for an 8-way superscalar. Image from: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.]

Superscalar Machine Efficiency
[Figure: issue-slot diagram; axes: issue width, time (instruction issue)]
- Completely idle cycle (vertical waste)
- Partially filled cycle, i.e., IPC < 4 (horizontal waste)
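The two kinds of waste can be tallied from a per-cycle issue count. A minimal sketch, assuming a hypothetical input format (a list holding the number of instructions issued in each cycle) and a hypothetical function name:

```python
def issue_slot_waste(issued_per_cycle, issue_width):
    """Split issue slots into used, vertical waste and horizontal waste.

    Vertical waste counts every slot of a completely idle cycle;
    horizontal waste counts the empty slots of a partially filled cycle.
    """
    used = sum(issued_per_cycle)
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return {"used": used, "vertical": vertical, "horizontal": horizontal}


# Six cycles on a 4-wide machine: 24 issue slots in total.
print(issue_slot_waste([2, 0, 4, 1, 0, 3], issue_width=4))
# {'used': 10, 'vertical': 8, 'horizontal': 6}
```

The three numbers always add up to cycles times issue width, which makes it easy to see how much of the machine a single thread leaves idle.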

Vertical Multithreading
[Figure: issue-slot diagram with a second thread interleaved cycle-by-cycle; axes: issue width, time (instruction issue); partially filled cycles, i.e., IPC < 4, marked as horizontal waste]
- What is the effect of cycle-by-cycle interleaving?
  - removes vertical waste, but leaves some horizontal waste

Chip Multiprocessing (CMP)
[Figure: issue-slot diagram; axes: issue width, time]
- What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts upper limit on peak throughput of each thread.

OOO Simultaneous Multithreading
[Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
- OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize whole machine

SMT adaptation to parallelism type
- For regions with high thread level parallelism (TLP), entire machine width is shared by all threads
- For regions with low thread level parallelism (TLP), entire machine width is available for instruction level parallelism (ILP)
[Figures: two issue-slot diagrams, one per case; axes: issue width, time]
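The adaptation described above can be illustrated by filling one cycle's issue slots from whatever each thread has ready. This is only a sketch: the function name is hypothetical, and the round-robin fill is an assumption chosen for simplicity, not a description of how a real OOO issue queue selects instructions.

```python
def smt_issue(ready_counts, issue_width):
    """Fill one cycle's issue slots from per-thread ready instructions.

    ready_counts[t] is how many instructions thread t could issue this
    cycle. Returns how many slots each thread receives.
    """
    remaining = list(ready_counts)
    slots = [0] * len(remaining)
    free = issue_width
    while free > 0 and any(remaining):
        for t in range(len(remaining)):
            if free == 0:
                break
            if remaining[t] > 0:
                remaining[t] -= 1
                slots[t] += 1
                free -= 1
    return slots


print(smt_issue([3, 3, 3, 3], issue_width=4))  # high TLP: [1, 1, 1, 1]
print(smt_issue([6, 0, 0, 0], issue_width=4))  # low TLP:  [4, 0, 0, 0]
```

When several threads have work, the width is shared; when only one does, that thread's ILP can use the whole machine.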

Power 4
[Figure. POWER 4 system microarchitecture, Tendler et al, IBM J. Res. & Dev., Jan 2002. Image credit: IBM. Courtesy of International Business Machines.]
Power 5
[Figure, annotated: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets). POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005. Image credit: IBM. Courtesy of International Business Machines.]

Power 5 data flow
[Figure. POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005. Image credits: Carsten Schulz; IBM. Courtesy of International Business Machines.]
- Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Changes in Power 5 to support SMT
- Increased associativity of L1 instruction cache and the instruction address translation buffers
- Added per thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT)
  - Hyperthreading is Intel's name for SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading is about 5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in queues when two threads are active
- Processor running only one active software thread runs at approximately same speed with or without hyperthreading
- Hyperthreading dropped on OOO P6 based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines in 2008
- Intel Atom (in-order x86 core) has two-way vertical multithreading

Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26^2 = 676 runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Floating-point apps had most cache conflicts and least gains

Icount Choosing Policy
- Fetch from thread with the least instructions in flight.
- Why does this enhance throughput?
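One way to build intuition for the question above is a toy model of the fetch choice. The sketch below is an assumption-laden illustration, not the policy as implemented in any real core: the function names, the fixed per-thread drain rate, and the lowest-id tie-break are all hypothetical.

```python
def icount_pick(in_flight):
    """Choose the thread with the fewest instructions in flight.

    in_flight maps thread id to its count; ties go to the lowest id
    (an arbitrary tie-break chosen for this sketch).
    """
    return min(in_flight, key=lambda t: (in_flight[t], t))


def simulate_fetch(drain_rate, fetch_width, cycles):
    """Count instructions fetched per thread under the ICOUNT choice.

    drain_rate[t] is how many of thread t's in-flight instructions
    leave the front end each cycle (0 models a stalled thread).
    """
    in_flight = {t: 0 for t in drain_rate}
    fetched = {t: 0 for t in drain_rate}
    for _ in range(cycles):
        chosen = icount_pick(in_flight)
        in_flight[chosen] += fetch_width
        fetched[chosen] += fetch_width
        for t, rate in drain_rate.items():
            in_flight[t] -= min(in_flight[t], rate)
    return fetched


# Thread 0 is stalled; thread 1 drains two instructions per cycle.
print(simulate_fetch({0: 0, 1: 2}, fetch_width=2, cycles=5))
# {0: 2, 1: 8}
```

In this model the stalled thread is fetched once and then ignored, so it cannot fill the shared queues, while the thread that keeps draining keeps receiving fetch slots. That is one plausible reading of why the policy helps throughput.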

Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycle) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]

Acknowledgements
- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - Christopher Batten (Cornell)
- MIT material derived from course 6.823
- UCB material derived from course CS252 & CS152
- Cornell material derived from course ECE 4750

Copyright 2013 David Wentzlaff
