Lecture 01: Introduction


Lecture 01: Introduction
CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan, yan@oakland.edu, www.secs.oakland.edu/~yan

Copyright and Acknowledgement
Most slides were adapted from lecture notes of the two textbooks, with copyright of the publisher or the original authors, including Elsevier Inc., Morgan Kaufmann, David A. Patterson and John L. Hennessy.
Some slides were adapted from the following courses:
– UC Berkeley course "Computer Science 252: Graduate Computer Architecture" of David E. Culler, Copyright 2005 UCB, http://people.eecs.berkeley.edu/~culler/courses/cs252-s05/
– Great Ideas in Computer Architecture (Machine Structures) by Randy Katz and Bernhard Boser, http://inst.eecs.berkeley.edu/~cs61c/fa16/
I also refer to the following courses and lecture notes when preparing materials for this course:
– Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley, http://www-inst.eecs.berkeley.edu/~cs152/sp16/
– Computer Science 252: Graduate Computer Architecture, Fall 2015 by Prof. Krste Asanović from UC Berkeley, http://www-inst.eecs.berkeley.edu/~cs252/fa15/
– Computer Science 250: VLSI Systems Design, Spring 2016 by Prof. John Wawrzynek from UC Berkeley, http://www-inst.eecs.berkeley.edu/~cs250/sp16/
– Computer System Architecture, Fall 2005 by Dr. Joel Emer and Prof. Arvind from MIT
– Synthesis Lectures on Computer Architecture, http://www.morganclaypool.com/toc/cac/1/1
The uses of the slides of this course are for educational purposes only and should be used only in conjunction with the textbook. Derivatives of the slides must acknowledge the copyright notices of this and the originals. Permission for commercial purposes should be obtained from the original copyright holder and the successive copyright holders, including myself.

Contents
• Computers and computer components
• Computer architectures and great ideas in history and now
• Performance

The Computer Revolution
• Progress in computer technology
  – Underpinned by Moore's Law
• Makes novel applications feasible
  – Computers in automobiles
  – Cell phones
  – Human genome project
  – World Wide Web
  – Search engines
• Computers are pervasive

Classes of Computers
• Personal Mobile Device (PMD)
  – e.g., smartphones, tablet computers
  – Emphasis on energy efficiency and real-time
• Desktop Computing
  – Emphasis on price-performance
• Servers
  – Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
  – Used for "Software as a Service (SaaS)"
  – Emphasis on availability and price-performance
  – Sub-class: supercomputers; emphasis: floating-point performance and fast internal networks
• Embedded Computers
  – Emphasis: price

The PostPC Era

The PostPC Era
• Personal Mobile Device (PMD)
  – Battery operated
  – Connects to the Internet
  – Hundreds of dollars
  – Smart phones, tablets, electronic glasses
• Cloud computing
  – Warehouse Scale Computers (WSC)
  – Software as a Service (SaaS)
  – Portion of software runs on a PMD and a portion runs in the Cloud
  – Amazon and Google

Old School Computer

New School Computer (#1): Personal Mobile Devices

New School "Computer" (#2)

Components of a Computer
The BIG Picture:
• Same components for all kinds of computer
  – Desktop, server, embedded
• Input/output includes
  – User-interface devices: display, keyboard, mouse
  – Storage devices: hard disk, CD/DVD, flash
  – Network adapters: for communicating with other computers

Inside the Processor (CPU)
• Functional units: perform computations
• Datapath: performs operations on data
• Control: sequences datapath, memory, ...
• Cache memory
  – Small, fast SRAM memory for immediate access to data
(Die photo: Apple A5)

A Safe Place for Data
• Volatile main memory
  – Loses instructions and data when powered off
• Non-volatile secondary memory
  – Magnetic disk
  – Flash memory
  – Optical disk (CDROM, DVD)

Contents
• Computers and computer components
• Computer architectures and great ideas in history and now
• Performance

What is "Computer Architecture"?
(Figure: layered view of a computer system — Instruction Set Architecture; Instr. Set Proc., Firmware, I/O system; Datapath & Control; Digital Design; Circuit Design; Layout & Fab; Semiconductor Materials)

The Instruction Set: a Critical Interface
(software ↔ instruction set ↔ hardware)
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels

Elements of an ISA
• Set of machine-recognized data types
  – Bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
  – Add, sub, mul, div, xor, move, ...
• Programmable storage
  – Registers, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
  – Literal, register, absolute, relative, register + offset, ...
• Format (encoding) of the instructions
  – Opcode, operand fields, ...
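To make the last element, instruction encoding, concrete: a minimal sketch in C (my illustration, not from the slides) that unpacks the opcode and operand fields of a 32-bit MIPS R-type instruction. The field layout (6/5/5/5/5/6 bits) is the standard MIPS one; the sample word is the encoding of add $t2, $t0, $t1.

    /* Decode the fields of a 32-bit MIPS R-type instruction:
       opcode | rs | rt | rd | shamt | funct.
       The example word 0x01095020 encodes "add $t2, $t0, $t1". */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t instr  = 0x01095020u;           /* add $t2, $t0, $t1 */
        uint32_t opcode = (instr >> 26) & 0x3F;  /* bits 31..26 */
        uint32_t rs     = (instr >> 21) & 0x1F;  /* bits 25..21 */
        uint32_t rt     = (instr >> 16) & 0x1F;  /* bits 20..16 */
        uint32_t rd     = (instr >> 11) & 0x1F;  /* bits 15..11 */
        uint32_t shamt  = (instr >>  6) & 0x1F;  /* bits 10..6  */
        uint32_t funct  =  instr        & 0x3F;  /* bits  5..0  */
        printf("opcode=%u rs=%u rt=%u rd=%u shamt=%u funct=0x%x\n",
               opcode, rs, rt, rd, shamt, funct);
        /* Expected: opcode=0, rs=8 ($t0), rt=9 ($t1), rd=10 ($t2),
           shamt=0, funct=0x20 (add) */
        return 0;
    }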

Computer Architecture
How things are put together in design and implementation:
• Capabilities & performance characteristics of principal functional units
  – e.g., registers, ALU, shifters, logic units, ...
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of FUs to realize the ISA

Great Ideas in Computer Architectures
1. Design for Moore's Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy

Great Idea: "Moore's Law"
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised; circuit complexity doubles every two years
Image credit: Intel

Microprocessor Transistor Counts 1971-2011 & Moore's Law
(Chart: https://en.wikipedia.org/wiki/Transistor_count)
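As a rough check of the trend in that chart, a sketch that just doubles a transistor count every two years. The 1971 starting point of about 2,300 transistors (Intel 4004) is my assumption, not a number stated on the slide.

    /* Back-of-the-envelope Moore's-Law projection: double the count
       every two years from an assumed 1971 starting point. */
    #include <stdio.h>

    int main(void) {
        double transistors = 2300.0;      /* assumed: Intel 4004, 1971 */
        for (int year = 1971; year <= 2011; year += 2) {
            printf("%d: ~%.0f transistors\n", year, transistors);
            transistors *= 2.0;           /* doubling every two years  */
        }
        return 0;
    }

Twenty doublings from 1971 to 2011 give roughly 2.4 billion transistors, which is in the right ballpark for the largest 2011 chips on the Wikipedia chart.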

Moore's Law trends
• More transistors = opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density → increasing frequency → increasing performance
• Transparent to users
  – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades; however ... (TBD)

Great Idea: Pipeline
The Fundamental Execution Cycle:
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
(Figure: processor (program registers, functional units) connected to memory (program, data); the von Neumann bottleneck)

Pipelined Instruction Execution
(Figure: the classic five-stage pipeline diagram; successive instructions, in program order, each pass through Ifetch, Reg, ALU, DMem, and Reg stages, overlapped across clock cycles 1-7)
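A small back-of-the-envelope sketch of the speedup the diagram illustrates (my illustration, not from the slides): with five stages and ideal overlap, N instructions finish in about N + 4 cycles instead of 5N.

    /* Idealized pipeline timing: a 5-stage pipeline finishes N
       instructions in N + (stages - 1) cycles; without overlap the
       same work takes stages * N cycles.  Hazards and stalls ignored. */
    #include <stdio.h>

    int main(void) {
        const long stages  = 5;
        const long n_instr = 1000000;                 /* example workload */
        long unpipelined = stages * n_instr;          /* no overlap       */
        long pipelined   = n_instr + (stages - 1);    /* ideal overlap    */
        printf("unpipelined: %ld cycles\n", unpipelined);
        printf("pipelined:   %ld cycles (ideal)\n", pipelined);
        printf("speedup:     %.2fx\n", (double)unpipelined / pipelined);
        return 0;
    }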

Great Idea: Abstraction (Levels of Representation/Interpretation)
• High Level Language Program (e.g., C):
      temp = v[k];
      v[k] = v[k+1];
      v[k+1] = temp;
  (Compiler)
• Assembly Language Program (e.g., MIPS):
      lw  $t0, 0($2)
      lw  $t1, 4($2)
      sw  $t1, 0($2)
      sw  $t0, 4($2)
  (Assembler)
• Machine Language Program (MIPS): binary bit patterns; anything can be represented as a number, i.e., data or instructions
  (Machine Interpretation)
• Hardware Architecture Description (e.g., block diagrams)
  (Architecture Implementation)
• Logic Circuit Description (circuit schematic diagrams)

The Memory Abstraction
• Association of <name, value> pairs
  – Typically named by byte addresses
  – Often values aligned on multiples of their size
• A sequence of reads and writes
• A write binds a value to an address
• A read of an address returns the most recently written value bound to that address
(Interface: command (R/W), address (name), data in (W), data out (R), done)
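A toy model of that read/write interface (an illustration of the abstraction only, not of any particular memory system):

    /* Toy memory abstraction: a write binds a value to an address,
       and a read returns the most recently written value bound to it. */
    #include <stdio.h>
    #include <stdint.h>

    #define MEM_BYTES 1024
    static uint8_t mem[MEM_BYTES];               /* named by byte addresses */

    void mem_write(uint32_t addr, uint8_t value) { mem[addr] = value; }
    uint8_t mem_read(uint32_t addr)              { return mem[addr];  }

    int main(void) {
        mem_write(0x40, 7);                      /* bind 7 to address 0x40 */
        mem_write(0x40, 9);                      /* rebind: 9 replaces 7   */
        printf("mem[0x40] = %d\n", mem_read(0x40));   /* prints 9          */
        return 0;
    }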

Processor-DRAM Memory Gap
(Chart, "Joy's Law": processor vs. DRAM performance over time; the processor-memory performance gap grows about 50% per year, with DRAM improving only about 9%/yr, i.e., 2X per 10 years)

The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time
• Two different types of locality:
  – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, HW has relied on locality for speed
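A small illustration of spatial locality (my example, not from the slides): summing a 2-D array row by row touches consecutive addresses, while summing it column by column strides across memory and typically misses in the cache far more often, even though both loops do the same arithmetic.

    /* Same work, two access orders.  Row-major traversal matches C's
       memory layout (good spatial locality); column-major does not. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    double sum_row_major(void) {                 /* good spatial locality */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_col_major(void) {                 /* poor spatial locality */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }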

Great Idea: Memory Hierarchy
Levels of the Memory Hierarchy (capacity, access time, approximate cost; staging transfer unit and who manages it):
• CPU Registers: 100s of bytes, <1 ns; instructions/operands staged by the program/compiler, 1-8 bytes
• Cache: 10s-100s of KBytes, ~1 ns, ~$1s/MByte; blocks staged by the cache controller, 8-128 bytes
• Main Memory: MBytes, 100-300 ns, ~$1/MByte; pages staged by the OS, 512 bytes-4 KB
• Disk: 10s of GBytes, ~10 ms (10,000,000 ns), ~$0.001/MByte; files staged by the user/operator, MBytes
• Tape: infinite capacity, seconds-minutes, ~$0.0014/MByte
Upper levels are smaller, faster, and closer to the CPU; lower levels are larger and slower.

Jim Gray's Storage Latency Analogy: How Far Away is the Data?
• Registers: 1 ns ("my head", 1 min)
• On-chip cache: 2 ns ("this room")
• On-board cache: 10 ns ("this campus", 10 min)
• Main memory: 100 ns ("Lansing", 1.5 hr)
• Disk: 10^6 ns ("Pluto", 2 years)
• Tape / optical robot: 10^9 ns ("Andromeda", 2,000 years)
(Jim Gray, Turing Award; B.S. Cal 1966, Ph.D. Cal 1969)

The Cache Design Space
• Several interacting dimensions
  – Cache size
  – Block size
  – Associativity
  – Replacement policy
  – Write-through vs. write-back
• The optimal choice is a compromise
  – Depends on access characteristics: workload, use (I-cache, D-cache, TLB)
  – Depends on technology / cost
• Simplicity often wins
(Figures: the design space spanned by cache size, associativity, and block size; a good/bad trade-off curve between Factor A (less) and Factor B (more))
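To make "cache size" and "block size" concrete, a sketch of how a direct-mapped cache splits an address into offset, index, and tag. The 32 KB / 64-byte-block parameters are my assumptions for illustration, not values from the slides.

    /* Address decomposition for an assumed direct-mapped cache:
       32 KB capacity, 64-byte blocks -> 512 sets. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_BYTES (32 * 1024)
    #define BLOCK_BYTES 64
    #define NUM_SETS    (CACHE_BYTES / BLOCK_BYTES)   /* 512 sets          */
    #define OFFSET_BITS 6                             /* log2(BLOCK_BYTES) */
    #define INDEX_BITS  9                             /* log2(NUM_SETS)    */

    int main(void) {
        uint32_t addr   = 0x12345678u;                            /* example address   */
        uint32_t offset =  addr & (BLOCK_BYTES - 1);              /* byte within block */
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* which set         */
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);    /* identity check    */
        printf("addr=0x%08x  tag=0x%x  index=%u  offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }

For a fixed capacity and block size, doubling the associativity halves the number of sets, so one index bit becomes a tag bit; this is one way the dimensions above interact.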

Great Idea: Parallelism

Defining Computer Architecture
• "Old" view of computer architecture:
  – Instruction Set Architecture (ISA) design
  – i.e., decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
• "Real" computer architecture:
  – Specific requirements of the target machine
  – Design to maximize performance within constraints: cost, power, and availability
  – Includes ISA, microarchitecture, hardware

Computer Architecture Topics
(Figure: a map of topics. Input/Output and Storage: disks, WORM, tape, RAID, emerging technologies, interleaving, DRAM. Memory Hierarchy: L2/L3 cache, L1 cache, VLSI. Instruction Set Architecture: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation; addressing, protection, exception handling; pipelining and instruction-level parallelism. Buses, communication, other processors.)

Why is Architecture Exciting Today?
(Chart: CPU clock speed over time; CPU speed is now flat, improving roughly 15%/year or less)

Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
  – Delays grow as issue queues and multi-port register files grow
  – Increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – Inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.

ILP impacts

Simulations of 8-issue Superscalar

Power/heat density limits frequency
• Some fundamental physical limits are being reached

We will have this ...

Revolution is happening now
• Chip density is continuing to increase 2x every 2 years
  – Clock speed is not
  – Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Single Processor Performance
(Chart: growth in single-processor performance over time; the RISC era, then the move to multi-processor)

The trends
(Chart: peak floating-point performance over time, from 1 KFlop/s (10^3) through 1 MFlop/s (10^6), 1 GFlop/s (10^9), and 1 TFlop/s (10^12) to 1 PFlop/s (10^15); scalar, superscalar, vector, and parallel eras; 2X transistors/chip every 1.5 years. Machines along the curve include EDSAC, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, Cray T3D, TMC CM-5, ASCI Red, ASCI White Pacific, and IBM BG/L at 131 TFlop/s.)

Recent multicore processors

Recent manycore GPU processors: An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in more depth include:
– ~3K cores
– The new SMX processor architecture
– An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
– Hardware support throughout the design to enable new programming model capabilities
(Figure: Kepler GK110 full-chip block diagram)
Streaming Multiprocessor (SMX) Architecture: Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient. SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).
Kepler Memory Subsystem (L1, L2, ECC): Kepler's memory hierarchy is organized similarly to Fermi's, with a memory request path for loads and stores and an L1 cache per SMX multiprocessor; Kepler also enables compiler-directed use of an additional new cache for read-only data, as described below.
64 KB Configurable Shared Memory and L1 Cache: In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows additional flexibility in the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared-memory bandwidth for 64b and larger load operations is also doubled compared to Fermi, per core clock.
48 KB Read-Only Data Cache: In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only to the texture unit. Expert programmers often found it advantageous to load data through this path by mapping their data as textures, but this approach had many limitations.

Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
  – Single-processor performance improvement ended in 2003
• New models for performance:
  – Data-level parallelism (DLP)
  – Thread-level parallelism (TLP)
  – Heterogeneity
• These require explicit restructuring of the application

Parallelism
• Classes of parallelism in applications:
  – Data-Level Parallelism (DLP)
  – Task-Level Parallelism (TLP)
• Classes of architectural parallelism:
  – Instruction-Level Parallelism (ILP)
  – Vector architectures / Graphic Processor Units (GPUs)
  – Thread-Level Parallelism
  – Heterogeneity
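As a small software-side illustration of data-level parallelism (my example, not from the slides): the iterations of the loop below are independent, so an OpenMP directive can expose them to be split across threads and vector lanes.

    /* Data-level parallelism exposed in software (illustration).
       Build with e.g.:  gcc -fopenmp vecadd.c -o vecadd */
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Each iteration is independent, so the work can be divided
           across threads (TLP) and SIMD/vector lanes (DLP). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }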

Architectural Challenges
(Figure, source: Chuck Moore, "Data Processing in ExaScale-Class Computer Systems", Salishan, April 2011: three eras of processor design. Single-thread performance era: enabled by voltage scaling and microarchitecture; constrained by power and complexity. Throughput performance (multi-core) era, scaling with the number of processors: enabled by the desire for throughput and 20 years of SMP architecture; constrained by power, parallel SW availability, and scalability. Targeted application performance (heterogeneous) era, scaling with data-parallel exploitation: enabled by abundant data parallelism and power-efficient GPUs; currently constrained by programming models and communication overheads. Complex digital ASIC design is also marked on the figure.)
• Massive (ca. 4X) increase in concurrency
  – Multicore (4 - 100) → Manycores (100s - 1,000s)
• Heterogeneity
  – System-level (accelerators) vs. chip-level (embedded)
• Compute power and memory speed challenges (two walls)
  – 500x compute power and 30x memory of 2 PF HW
  – Memory access time lags further behind

Exercise: Inspect ISA for sum
• cp ~yan/sum.c ~ (copy the sum.c file from my home folder to your home folder)
• gcc -save-temps sum.c -o sum
• ./sum 102400
• vi sum.c
• vi sum.s
• Or check from:
  – https://passlab.github.io/CSE564/exercises/sum/
• View them from the H drive
• Other system commands:
  – cat /proc/cpuinfo to show the CPU and #cores
  – top command to show system usage and memory
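The contents of sum.c are not reproduced in these notes. A minimal stand-in with the same command-line usage (./sum 102400) might look like the sketch below; this is my guess at a representative file, not the course's actual sum.c. After gcc -save-temps sum.c -o sum, the saved sum.i (preprocessed C), sum.s (assembly), and sum.o (object code) correspond to the levels of representation on the abstraction slide.

    /* Hypothetical stand-in for sum.c (not the course's actual file).
       Usage: ./sum 102400 */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        long n = (argc > 1) ? atol(argv[1]) : 1024;
        double *x = malloc(n * sizeof(double));
        if (!x) return 1;

        for (long i = 0; i < n; i++)
            x[i] = (double)i;

        double sum = 0.0;
        for (long i = 0; i < n; i++)      /* the loop to look for in sum.s */
            sum += x[i];

        printf("n = %ld, sum = %f\n", n, sum);
        free(x);
        return 0;
    }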

Backup

New-School Machine Structures
(Software mapped onto hardware: harness parallelism and achieve high performance)
• Parallel requests, assigned to a computer, e.g., search "cats" (warehouse-scale computer)
• Parallel threads, assigned to a core, e.g., lookup, ads
• Parallel instructions, >1 instruction at one time, e.g., 5 pipelined instructions (instruction unit(s))
• Parallel data, >1 data item at one time, e.g., add of 4 pairs of words (functional unit(s): A0+B0, A1+B1, A2+B2, A3+B3)
• Hardware descriptions: all gates functioning in parallel at the same time (logic gates)
(Figure: smart phone, warehouse-scale computer, computer, core, cache, memory, input/output, main memory)

Coping with Failures
• 4 disks/server, 50,000 servers
• Failure rate of disks: 2% to 10% per year
  – Assume a 4% annual failure rate
• On average, how often does a disk fail?
  a) 1 / month
  b) 1 / week
  c) 1 / day
  d) 1 / hour

Coping with Failures
• 4 disks/server, 50,000 servers
• Failure rate of disks: 2% to 10% per year
  – Assume a 4% annual failure rate
• On average, how often does a disk fail?
  a) 1 / month
  b) 1 / week
  c) 1 / day
  d) 1 / hour
• 50,000 x 4 = 200,000 disks
• 200,000 x 4% = 8,000 disks fail per year
• 365 days x 24 hours = 8,760 hours per year, so a disk fails roughly once an hour (answer d)
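The same arithmetic as a tiny program, purely as a worked check of the slide's numbers:

    /* Worked check of the disk-failure arithmetic. */
    #include <stdio.h>

    int main(void) {
        long servers = 50000, disks_per_server = 4;
        double annual_failure_rate = 0.04;                        /* assumed 4%/year */

        long disks = servers * disks_per_server;                  /* 200,000 disks   */
        double failures_per_year = disks * annual_failure_rate;   /* 8,000 failures  */
        double hours_per_year = 365.0 * 24.0;                     /* 8,760 hours     */

        printf("disks: %ld\n", disks);
        printf("failures per year: %.0f\n", failures_per_year);
        printf("hours between failures: %.2f\n",
               hours_per_year / failures_per_year);               /* about 1.1 hours */
        return 0;
    }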

Great Idea: Dependability via Redundancy
• Redundancy, so that a failing piece doesn't make the whole system fail
(Figure: three units each compute 1 + 1 = 2; one faulty unit answers 1 and FAILs, but 2 of 3 agree, so the system still produces 2)
• Increasing transistor density reduces the cost of redundancy
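A sketch of the "2 of 3 agree" idea as a majority voter in C (an illustration only; real triple modular redundancy is implemented in hardware):

    /* Triple modular redundancy, sketched in software: run the
       computation on three "units" and vote, so a single faulty
       unit is outvoted by the other two. */
    #include <stdio.h>

    int vote(int a, int b, int c) {
        if (a == b || a == c) return a;   /* a agrees with at least one other     */
        return b;                         /* otherwise b and c must agree
                                             (single-fault assumption)            */
    }

    int main(void) {
        int unit0 = 1 + 1;                /* correct: 2            */
        int unit1 = 1 + 1;                /* correct: 2            */
        int unit2 = 1;                    /* faulty unit's answer  */

        printf("voted result: %d\n", vote(unit0, unit1, unit2));   /* prints 2 */
        return 0;
    }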

Great Idea: Dependability via Redundancy
• Applies to everything from datacenters to storage to memory to instructors
  – Redundant datacenters, so that we can lose one datacenter but the Internet service stays online
  – Redundant disks, so that we can lose one disk but not lose data (Redundant Arrays of Independent Disks / RAID)
  – Redundant memory bits, so that we can lose one bit but no data (Error Correcting Code / ECC memory)
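For the memory example, the idea behind ECC is that a few redundant bits let the hardware correct a single flipped bit. Below is a sketch of a Hamming(7,4) single-error-correcting code; it illustrates the principle only (real ECC DIMMs use wider SECDED codes over 64-bit words).

    /* Hamming(7,4): 4 data bits protected by 3 parity bits.
       A single flipped bit in the 7-bit codeword can be corrected. */
    #include <stdio.h>
    #include <stdint.h>

    static int bit(uint8_t x, int i) { return (x >> (i - 1)) & 1; }  /* bit i, 1-based */

    uint8_t hamming_encode(uint8_t data) {          /* data: 4 bits */
        int d1 = bit(data, 1), d2 = bit(data, 2), d3 = bit(data, 3), d4 = bit(data, 4);
        int p1 = d1 ^ d2 ^ d4;                      /* covers positions 1,3,5,7 */
        int p2 = d1 ^ d3 ^ d4;                      /* covers positions 2,3,6,7 */
        int p3 = d2 ^ d3 ^ d4;                      /* covers positions 4,5,6,7 */
        return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                         (d2 << 4) | (d3 << 5) | (d4 << 6));
    }

    uint8_t hamming_decode(uint8_t code) {          /* returns the 4 data bits */
        int s1 = bit(code,1) ^ bit(code,3) ^ bit(code,5) ^ bit(code,7);
        int s2 = bit(code,2) ^ bit(code,3) ^ bit(code,6) ^ bit(code,7);
        int s3 = bit(code,4) ^ bit(code,5) ^ bit(code,6) ^ bit(code,7);
        int syndrome = s1 | (s2 << 1) | (s3 << 2);  /* position of bad bit, 0 = none */
        if (syndrome) code ^= (uint8_t)(1 << (syndrome - 1));
        return (uint8_t)(bit(code,3) | (bit(code,5) << 1) |
                         (bit(code,6) << 2) | (bit(code,7) << 3));
    }

    int main(void) {
        uint8_t data = 0xB;                         /* data bits 1011        */
        uint8_t code = hamming_encode(data);
        code ^= (uint8_t)(1 << 4);                  /* flip one stored bit   */
        printf("recovered 0x%X (original 0x%X)\n",
               (unsigned)hamming_decode(code), (unsigned)data);
        return 0;
    }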

Understanding Computer Architecture
(Image credit: de.pinterest.com)

End of Moore's Law?
• Cost per transistor is rising as transistor size continues to shrink

Great Ideas in Computer Architectures
1. Design for Moore's Law
2. Use abstraction to simplify design
3. Make the common case fast
4. Performance via parallelism
5. Performance via pipelining
6. Performance via prediction
7. Hierarchy of memories
8. Dependability via redundancy
