Micronetwork-based Processor Microarchitectures


Micronetwork-based Processor Microarchitectures
Steve Keckler
Department of Computer Sciences
The University of Texas at Austin
TRIPS, 10/7/06

Motivations
  - Limitations of monolithic processors and memories
      - Wires, design complexity, port limits
  - Goal: scalable processor and memories
      - Design complexity scalability
      - Ability for more resources to work together
  - Approach
      - Recast as distributed systems
      - Tiles connected via a collection of networks
  - Micronet: microarchitectural network
      - "Just" a network tightly integrated into processor/memory

TRIPS Tiled and Networked Processor
  - SOC-like design style
      - Individually designed tiles, 3-8 mm2 each
      - 170M transistors
  - Networks
      - Memory
      - Operands
      - Control
  - Networks enable
      - Distributed and scalable design
      - Fast design cycle
      - Configurability
  [Figure: chip floorplan, 18mm on a side, showing Processor 0, Processor 1, 16 64KB banks (L2 cache), DMA controllers, DRAM controllers, and the chip-to-chip link]

Outline
  - TRIPS architecture overview
  - Integrated processor network (OPN)
      - Replaces bypass and processor/cache bus
  - Memory network (OCN)
      - NUCA cache
      - System interconnect
      - Extendable to multiple chips (C2C)
  - Reflection on design

TRIPS Prototype Chip
  - 2 TRIPS Processors
      - 16 FPUs each
      - Explicit Data Graph Execution (EDGE)
  - NUCA L2 Cache
      - 1 MB, 16 banks
  - On-Chip Network (OCN)
      - 2D mesh network
      - Replaces on-chip bus
  - Controllers
      - 2 DDR SDRAM controllers
      - 2 DMA controllers
      - External bus controller
      - C2C network controller
  - Fabricated in 130nm ASIC
  [Figure: chip block diagram with PROC 0, PROC 1, NUCA L2 Cache, OCN, DMA, SDC, EBC, C2C, and pin groups GPIO, JTAG, CLK, TEST, PLLS, IRQ, EBI, DDR SDRAM, C2C Links; pin-count labels 16, 108, 8x39]

TRIPS Tile-level Microarchitecture
TRIPS Tiles
  G: Processor control - TLB w/ variable size pages, dispatch, next block predict, commit
  R: Register file - 32 registers x 4 threads, register forwarding
  I: Instruction cache - 16KB storage per tile
  D: Data cache - 8KB per tile, 256-entry load/store queue, TLB
  E: Execution unit - Int/FP ALUs, 64 reservation stations
  M: Memory - 64KB, configurable as L2 cache or scratchpad
  N: OCN network interface - router, translation tables
  DMA: Direct memory access controller
  SDC: DDR SDRAM controller
  EBC: External bus controller - interface to external PowerPC
  C2C: Chip-to-chip network controller - 4 links to XY neighbors

TRIPS Execution Model
  [Figure: three panels, "Program CFG" (Block A, Block B, basic blocks), "Architecture of a Block" (instructions i1-i6 reading registers r2, r3, r5 and writing registers, memory, and the PC), and "Distributed Execution" (the same instructions placed across the execution array)]
  - Coarse-grained program sequencing using blocks
  - Dataflow execution within one block, instructions encode communication
  - Spatial distribution exposed to the compiler

TRIPS Processor Tiles and Networks
  - Operand network
      - Bypass network among ALUs
      - Register file inputs
      - Load/store access
  - Control networks
      - Instruction fetch/dispatch (GDN)
      - Completion/commit/flush network (GCN)
  - Memory network (OCN)
      - I/D cache misses to L2/memory
      - Read/write to remote memory
  [Figure: processor tile array with one G tile, four R tiles, five I tiles, four D tiles, and a 4x4 array of E tiles; legend: Operand Network Links, Fetch Network Links, Memory Network Links, Control Network Links]

TRIPS Operand Network (OPN)
  - Topology
      - 5x5 mesh network
      - 1 cycle per hop
      - 140 bit channels
      - 1 physical channel (no VCs)
  - Routing
      - Y-X dimension order
      - 4 entry input FIFOs
      - Destination from instruction targets
  - Flow control
      - On-off link control
      - Deadlock free as storage at target is pre-allocated
  - Lightweight and tightly coupled to processor core
      - Takes place of bypass bus
  - Bisection BW 80GB/sec at 500MHz
  [Figure: chip floorplan highlighting Processor 0 and Processor 1]
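Speaker-note sketch (not from the original slides): Y-X dimension-order routing on the 5x5 OPN mesh can be illustrated as below. The (x, y) coordinate convention and the function names `yx_route` and `hop_latency` are assumptions made for this example; the only facts taken from the slide are the 5x5 mesh, Y-before-X ordering, and 1 cycle per hop.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) tile position; this convention is an assumption

MESH_SIZE = 5  # the OPN is a 5x5 mesh


def yx_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the tiles visited after src, resolving Y first and then X."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_SIZE and 0 <= y < MESH_SIZE):
            raise ValueError("coordinate outside the 5x5 mesh")
    x, y = src
    path: List[Coord] = []
    while y != dst[1]:  # move along the Y dimension until the row matches
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    while x != dst[0]:  # then move along the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    return path


def hop_latency(src: Coord, dst: Coord) -> int:
    """Unloaded latency in cycles, at 1 cycle per hop."""
    return len(yx_route(src, dst))
```

Because every message takes the same Y-then-X path for a given source and destination, the route needs no table lookup: each router only compares its own coordinate with the destination carried in the header.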

Obligatory Router Diagram
  [Figure: OPN router block diagram with a 5x5 crossbar]

Processor Architecture Influences NW
  - Latency critical to performance (1 cycle per hop)
      - Simple routers, no VCs
  - Deadlock avoidance
      - Easy because destination buffers pre-allocated
  - Y-X routing
      - Avoid bottlenecks from RTs and to DTs
  - 2-flit messages (sort of)
      - Control header leads data payload by 1 cycle
      - 110 bit payload (64-bit datum plus 40-bit address)
      - But - separate control/data wires
  - Speculative header injection
      - Can be canceled by null data flit
  - Network selectively flushed when block flushed

OPN Integration to Processor Core
  - 2 parallel networks
      - Control (30 bits) for routing and wakeup
      - Data (110 bits)
          - Includes 40 bit address and 64 bit operand for store
          - Bypassed directly into ALU at target
  - Speculative injection of control packet
      - For early wakeup at target
      - May require cancel on next cycle
  - Control/data interleaved across operand messages
  - Block flush includes flushing block's state in router
  [Figure: an ADD at ET5 sends a 30-bit control packet and a 110-bit data packet from its router output port to the router input port at ET9, where the control packet wakes up a SUB waiting in a reservation station and the select logic issues it]
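Speaker-note sketch (not from the original slides): the early-wakeup idea can be modeled in a few lines. The control packet arrives one cycle ahead of the data and speculatively wakes the consumer; a null data flit on the next cycle cancels that wakeup. The class `ReservationStation`, the function `deliver`, and the use of `None` to stand for a null data flit are all hypothetical modeling choices, not the TRIPS implementation.

```python
from dataclasses import dataclass
from typing import Optional

CONTROL_BITS = 30   # control network width from the slide
DATA_BITS = 110     # data network width from the slide


@dataclass
class ReservationStation:
    """Toy consumer slot: tracks speculative wakeup and the received operand."""
    woken: bool = False
    operand: Optional[int] = None

    def on_control(self) -> None:
        # Cycle t: the control header arrives and wakes the instruction early.
        self.woken = True

    def on_data(self, value: Optional[int]) -> None:
        # Cycle t+1: real data confirms the wakeup, a null flit (None) cancels it.
        if value is None:
            self.woken = False
        else:
            self.operand = value


def deliver(station: ReservationStation, value: Optional[int]) -> bool:
    """Deliver one operand message and report whether the wakeup survived."""
    station.on_control()
    station.on_data(value)
    return station.woken
```

The two widths add up to the 140-bit OPN channel quoted on the OPN slide, which is why the control and data paths can be treated as one physical channel with separate wires.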

Design Experience
  - Area - remember ASIC standard cell design
      - 1 OPN router: 0.25mm2 in 130nm
      - A little larger than a 64-bit integer multiplier
      - 25 OPN routers: 14% of processor area
  - Area breakdown of OPN router
      FIFOs                      75%
      Crossbar                   20%
      Arbitration/routing logic   5%
  - Static timing estimates - nominal corner
      FIFO read                       386ps
      Arbitration                     253ps
      Crossbar                        187ps
      FIFO muxing                     473ps
      Latch setup, clock uncertainty  367ps
      Total                           1.7ns
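Speaker-note sketch (not from the original slides): the timing and area figures can be cross-checked with simple arithmetic. The stage delays and the 0.25mm2 and 14% figures come from the slide; the implied maximum clock and the implied processor area are derived here for illustration and are not stated in the talk.

```python
# Stage delays of the OPN router critical path, in picoseconds (from the slide).
stage_delays_ps = {
    "FIFO read": 386,
    "Arbitration": 253,
    "Crossbar": 187,
    "FIFO muxing": 473,
    "Latch setup, clock uncertainty": 367,
}

total_ps = sum(stage_delays_ps.values())      # sums to the ~1.7ns total on the slide
max_freq_mhz = 1e6 / total_ps                 # 1 / period, converted from ps to MHz

# Area figures from the slide.
router_area_mm2 = 0.25
router_count = 25
total_router_area_mm2 = router_area_mm2 * router_count

# DERIVED, not stated in the talk: processor area implied by "14% of processor area".
implied_processor_area_mm2 = total_router_area_mm2 / 0.14

print(f"critical path: {total_ps} ps ({total_ps / 1000:.1f} ns)")
print(f"implied max clock at nominal corner: {max_freq_mhz:.0f} MHz")
print(f"25 routers: {total_router_area_mm2:.2f} mm2")
print(f"implied processor area: {implied_processor_area_mm2:.1f} mm2")
```

The stages sum to 1666ps, which rounds to the 1.7ns total on the slide and leaves some margin over the 2ns cycle of the 500MHz clock quoted for the bandwidth numbers.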

Performance Observations
  - Compiler controls instruction placement
      - Average number of hops in network: 2
      - Average network latency: 2
      - But can have high variance
      - Small number of critical messages can degrade performance
  - Load varies across network node and across applications
      - Depends on concurrency profile
  - Standard NW loads are not representative

Highly Non-uniform Injection
  - Uniform in RT/DT due to interleaving of registers and data cache
  - High injection rates in ETs near registers and data
  - Injection rate reflects instruction placement

Network Protocol Overheads
  - Columns show percentage of critical path (methodology adapted from Fields, et al.)
  - mcf and twolf compiled but not hand-optimized
  [Figure: bar chart of critical-path percentages per benchmark; one surviving data label: 74.0%]

Operand Network Enhancements
  - Operand multicast
      - Instructions have limited number of targets (2, 3, or 4)
      - Network injects one copy per cycle
      - Tree of instructions required for high-fanout operands
      - Optimization we are studying
          - Instruction specifies bit-mask of targets
          - Operand network replicates copies
  - Bulk operand movement (i.e. L/S multiple)
      - Current architecture transmits one operand per message
      - Streaming data into arithmetic array is difficult
      - Optimizations we are studying
          - Single load request fetches multiple operands into successive reservation stations
          - Saves headers and streamlines return of data
          - Replicating network to provide more link BW
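Speaker-note sketch (not from the original slides): the cost of a fanout tree and the bit-mask alternative can be compared as follows. The function names, the 8-slot mask width, and the assumption that fanout (move) instructions have the same target limit as the producer are illustrative choices, not the TRIPS encoding.

```python
from math import ceil
from typing import List


def fanout_instructions(consumers: int, targets_per_instr: int) -> int:
    """Extra move instructions needed to reach `consumers` targets with a tree.

    ASSUMPTION: each move instruction occupies one target slot and provides
    `targets_per_instr` new ones, a net gain of targets_per_instr - 1.
    """
    if targets_per_instr < 2:
        raise ValueError("a fanout tree needs at least 2 targets per instruction")
    if consumers <= targets_per_instr:
        return 0
    return ceil((consumers - targets_per_instr) / (targets_per_instr - 1))


def decode_target_mask(mask: int, num_slots: int) -> List[int]:
    """List the target slots selected by a multicast bit-mask (bit i -> slot i)."""
    return [slot for slot in range(num_slots) if (mask >> slot) & 1]


if __name__ == "__main__":
    for limit in (2, 3, 4):
        extra = fanout_instructions(8, limit)
        print(f"8 consumers, {limit} targets/instr: {extra} extra instructions")
    print("mask 0b1011 selects slots", decode_target_mask(0b1011, 8))
```

Under these assumptions an operand with 8 consumers costs 6 extra instructions when each instruction has 2 targets, whereas a bit-mask lets a single instruction name all 8 and leaves replication to the network.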

TRIPS Memory Network (OCN)
  - Topology
      - 4x10 mesh network
      - 1 cycle per hop
      - 128-bit x 2 links
  - Routing
      - Y-X dimension order
      - 2 entry input FIFOs
      - Destination: memory address
  - Flow control
      - 1-5 128-bit flits/msg
      - 4 VCs for 4 priorities
      - Wormhole routed
      - Credit-based flow control
      - Pipelined credit return
  - Replaces memory bus
  - Bisection BW 64GB/sec at 500MHz
  [Figure: chip floorplan showing 16 64KB banks (L2 cache), DMA controllers, DRAM controllers, and the chip-to-chip link; callouts: router embedded in memory tile, network interface (router, route table)]
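Speaker-note sketch (not from the original slides): the 64GB/sec bisection figure is consistent with a simple calculation. The function name and the choice of cut are assumptions: the mesh is taken to be 4 nodes wide, so a cut across the 10-node dimension severs 4 link pairs, and "128-bit x 2 links" is read as one 128-bit link in each direction.

```python
def bisection_bandwidth_gb_per_s(
    links_cut: int, directions: int, link_bits: int, clock_mhz: float
) -> float:
    """Bandwidth across a mesh bisection, in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_cycle = links_cut * directions * link_bits / 8
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


# ASSUMED cut: 4 link pairs severed, 2 directions, 128-bit links, 500 MHz clock.
ocn_bisection = bisection_bandwidth_gb_per_s(
    links_cut=4, directions=2, link_bits=128, clock_mhz=500
)
print(f"OCN bisection bandwidth: {ocn_bisection:.0f} GB/s")
```

Under these assumptions the cut carries 128 bytes per cycle, which at 500MHz gives the 64GB/sec quoted on the slide.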

Non-Uniform L2 Cache (NUCA)
  - Exploit physical locality in cached data
      - 64KB per M-tile
      - Hop count depends on destination
      - Static NUCA
  - Request
      - N-tile resolves address to coordinate
          - M-tile or SDC if on this chip
          - C2C controller if on another chip
      - Injects ld/st request on VC0
      - 1-byte up to full cache line
  - Reply
      - M-tile performs lookup and returns response on VC3
  - Total
      - Unloaded latency 7-22 cycles
  [Figure: OCN mesh with X coordinates 0-3 and Y coordinates 0-9, showing a request from an N-tile next to Processor 0 or Processor 1 and the reply from an M-tile]

Network Based Memory Configuration
  - N-tile mechanisms
      - 16-entry translation table
          - Indexed w/ 4 bits of PA
          - Produces X/Y coordinate of MT
      - Split mode to adjust cache line address interleaving
          1. Interleaved across 16 tiles
          2. Interleaved across 8 tiles (split cache)
  - Convert cache banks to scratchpad
      - Remap address range from one MT to another
      - Create new TLB entry to map new physical region into VA space
  [Figure: OCN mesh with X coordinates 0-3 and Y coordinates 0-9, showing Proc 0, Proc 1, and the M-tiles used in each interleaving mode]
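Speaker-note sketch (not from the original slides): the N-tile lookup can be pictured as a 16-entry table indexed by 4 physical-address bits. Which 4 bits are used, the 64-byte line size, the M-tile coordinates, and the names `build_table` and `lookup` are all assumptions made for this example; only the table size, the 4-bit index, and the X/Y output come from the slide.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) coordinate of an M-tile on the OCN

LINE_OFFSET_BITS = 6  # ASSUMPTION: 64-byte cache lines
INDEX_BITS = 4        # from the slide: table indexed with 4 bits of PA
TABLE_ENTRIES = 1 << INDEX_BITS


def build_table(banks: List[Coord]) -> List[Coord]:
    """Fill all 16 entries by repeating the list of banks in use."""
    if not banks or TABLE_ENTRIES % len(banks) != 0:
        raise ValueError("bank count must evenly divide 16")
    return [banks[i % len(banks)] for i in range(TABLE_ENTRIES)]


def lookup(table: List[Coord], physical_address: int) -> Coord:
    """Resolve a physical address to the coordinate of its M-tile."""
    index = (physical_address >> LINE_OFFSET_BITS) & (TABLE_ENTRIES - 1)
    return table[index]


# HYPOTHETICAL M-tile layout: 16 banks in two columns of eight.
all_banks = [(x, y) for y in range(8) for x in range(2)]
table_16 = build_table(all_banks)      # mode 1: interleaved across 16 tiles
table_8 = build_table(all_banks[:8])   # mode 2: interleaved across 8 tiles (split cache)

# Remapping an address range to another M-tile is a single table write.
remapped = list(table_16)
remapped[3] = (1, 7)
```

Because all requests pass through this table, switching between the two interleaving modes or steering a range away from a bank used as scratchpad only requires rewriting table entries, not changing the routers.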

OCN Design Observations
  - Bandwidth and Latency
      - Peak injection BW: 74GB/sec, but load is much less
      - Unloaded hit latency: 7-22 cycles
  - Area
      - FIFO buffers: 75% of router area
      - OCN routers/wires: 32% of L2 area, 10% of die area
      - Opportunity to economize design
  - Timing
      - Control was the critical path for the router
      - Timing path: 1.5ns (nominal case)
          400ps: VC arbitration
          427ps: crossbar arbitration
          393ps: FIFO control
          247ps: latch setup, skew

Chip-to-Chip Network
  - On-chip 4-port router for C2C mesh network
  - 32-bit x 2 links at 1/2 core clock speed
  - Protocol is direct extension of OCN
  - Global memory addressing identifies target
  [Figure: eight boards (Board 0 to Board 7), each carrying four numbered nodes (0-31 in total) and a node labeled P, joined by 32-bit x 2 links; the boards connect through an Ethernet switch to a host PC]

Summary
  - Fast dynamic networks enable:
      - Distributed processor and memory architectures
      - Configurability
  - Design experience
      - Networks were easy to build and verify
      - Larger than expected, but optimization possible
  - Future challenges
      - Better traffic management w/out increasing latency
      - Drive router power down to beat other network topologies
      - How many different NWs and types of NWs are needed
          - TRIPS has 3 routed data networks
          - Multiple control networks
      - Better workloads for network analysis
      - Does it make sense to design for worst case?
      - Network interface primitives to the programmer

Acknowledgements
  - Co-PIs: Doug Burger and Kathryn McKinley
  - TRIPS Hardware Team
      - Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar
  - TRIPS Software Team
      - Kathryn McKinley, Jim Burrill, Katie Coons, Mark Gebhart, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Sadia Sharif, Aaron Smith, Bill Yoder
  - IBM Microelectronics Austin ASIC Group
  - TRIPS Sponsors
      - DARPA Polymorphous Computing Architectures
      - Air Force Research Laboratories
      - National Science Foundation
      - IBM, Intel, Sun Microsystems
