Micronetwork-based Processor Microarchitectures

Steve Keckler
Department of Computer Sciences
The University of Texas at Austin
TRIPS, 10/7/06

Motivations
- Limitations of monolithic processors and memories
  - Wires, design complexity, port limits
- Goal: scalable processor and memories
  - Design complexity scalability
  - Ability for more resources to work together
- Approach
  - Recast as distributed systems
  - Tiles connected via a collection of networks
- Micronet = microarchitectural network
  - "Just" a network tightly integrated into processor/memory

TRIPS Tiled and Networked Processor
- SOC-like design style
  - Individually designed tiles
  - 3-8 mm2 each
  - 170M transistors
- Networks
  - Memory
  - Operands
  - Control
- Networks enable
  - Distributed and scalable design
  - Fast design cycle
  - Configurability
[Figure: chip floorplan, 18mm on a side, showing Processor 0, Processor 1, 16 64KB banks (L2 cache), two DRAM controllers, two DMA controllers, and the chip-to-chip link]

Outline
- TRIPS architecture overview
- Integrated processor network (OPN)
  - Replaces bypass and processor/cache bus
- Memory network (OCN)
  - NUCA cache
  - System interconnect
  - Extendable to multiple chips (C2C)
- Reflection on design

TRIPS Prototype Chip
- 2 TRIPS processors
  - 16 FPUs each
  - Explicit Data Graph Execution (EDGE)
- NUCA L2 cache
  - 1 MB, 16 banks
- On-Chip Network (OCN)
  - 4x10 2D mesh network
  - Replaces on-chip bus
- Controllers
  - 2 DDR SDRAM controllers
  - 2 DMA controllers
  - External bus controller
  - C2C network controller
- Fabricated in 130nm ASIC
[Figure: chip block diagram showing PROC 0, PROC 1, NUCA L2 cache, OCN, SDC, DMA, EBC, and C2C blocks, with external DDR SDRAM, C2C link, EBI, IRQ, GPIO, JTAG, CLK, TEST, and PLL connections]

TRIPS Tile-level Microarchitecture
TRIPS Tiles
- G: Processor control - TLB w/ variable size pages, dispatch, next block predict, commit
- R: Register file - 32 registers x 4 threads, register forwarding
- I: Instruction cache - 16KB storage per tile
- D: Data cache - 8KB per tile, 256-entry load/store queue, TLB
- E: Execution unit - Int/FP ALUs, 64 reservation stations
- M: Memory - 64KB, configurable as L2 cache or scratchpad
- N: OCN network interface - router, translation tables
- DMA: Direct memory access controller
- SDC: DDR SDRAM controller
- EBC: External bus controller - interface to external PowerPC
- C2C: Chip-to-chip network controller - 4 links to XY neighbors

TRIPS Execution Model
- Coarse-grained program sequencing using blocks
- Dataflow execution within one block, instructions encode communication
- Spatial distribution exposed to the compiler
[Figure: three panels. Program CFG (Block A, Block B, basic blocks); Architecture of a Block (registers r2, r3, r5, instructions i1-i6, memory, PC); Distributed Execution (the block's instructions placed across tiles)]

TRIPS Processor Tiles and Networks
- Control networks
  - Instruction fetch/dispatch (GDN)
  - Completion/commit/flush network (GCN)
- Operand network
  - Bypass network among ALUs
  - Register file inputs
  - Load/store access
- Memory network (OCN)
  - I/D cache misses to L2/memory
  - Read/write to remote memory
[Figure: processor tile grid with operand, fetch, memory, and control network links]
  I G R R R R
  I D E E E E
  I D E E E E
  I D E E E E
  I D E E E E

TRIPS Operand Network (OPN)
- Topology
  - 5x5 mesh network
  - 1 cycle per hop
  - 140 bit channels
  - 1 physical channel (no VCs)
- Routing
  - Y-X dimension order
  - 4 entry input FIFOs
  - Destination from instruction targets
- Flow control
  - On-off link control
  - Deadlock free as storage at target is pre-allocated
- Lightweight and tightly coupled to processor core
  - Takes place of bypass bus
- Bisection BW 80GB/sec at 500MHz
[Figure: chip floorplan highlighting the OPN in Processor 0 and Processor 1]
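The Y-X dimension-order routing named on this slide can be sketched as follows. This is a minimal illustration, not the TRIPS router logic: the function names `yx_route` and `hops`, the `(x, y)` coordinate convention, and the `MESH` constant are assumptions made for the example. It only shows the routing rule (travel fully in Y first, then in X) and the hop count that, at 1 cycle per hop, gives the unloaded latency.

```python
MESH = 5  # the slide gives a 5x5 mesh; (x, y) coordinates are an assumed convention


def yx_route(src, dst):
    """Return the tiles visited from src to dst, moving in Y first, then in X."""
    for x, y in (src, dst):
        if not (0 <= x < MESH and 0 <= y < MESH):
            raise ValueError("coordinate outside the mesh")
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while y != dst_y:          # resolve the Y dimension completely first
        y += 1 if dst_y > y else -1
        path.append((x, y))
    while x != dst_x:          # then resolve the X dimension
        x += 1 if dst_x > x else -1
        path.append((x, y))
    return path


def hops(src, dst):
    """Number of router-to-router hops; equals unloaded cycles at 1 cycle per hop."""
    return len(yx_route(src, dst)) - 1


print(yx_route((0, 0), (2, 1)))
print(hops((0, 0), (4, 4)))
```

Because every message resolves Y before X, no message ever turns from the X dimension back into Y, which is the usual reason dimension-order routing needs no virtual channels to stay deadlock free inside the mesh.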

Obligatory Router Diagram
[Figure: router block diagram with two blocks labeled 5x5]

Processor Architecture Influences NW
- Latency critical to performance (1 cycle per hop)
  - Simple routers, no VCs
- Deadlock avoidance
  - Easy because destination buffers pre-allocated
- 2-flit messages (sort of)
  - Control header leads data payload by 1 cycle
  - 110 bit payload (64-bit datum plus 40-bit address)
  - But - separate control/data wires
- Speculative header injection
  - Can be canceled by null data flit
- Y-X routing
  - Avoid bottlenecks from RTs and to DTs
- Network selectively flushed when block flushed

OPN Integration to Processor Core
- 2 parallel networks
  - Control (30 bits) for routing and wakeup
  - Data (110 bits)
    - Includes 40 bit address and 64 bit operand for store
    - Bypassed directly into ALU at target
- Speculative injection of control packet
  - For early wakeup at target
  - May require cancel on next cycle
- Control/data interleaved across operand messages
- Block flush includes flushing block's state in router
[Figure: an ADD in ET5 sends a result through the router output port over the 30-bit control and 110-bit data wires to the router input port of ET9, where the control packet drives wakeup and select of a SUB in the reservation station]

Design Experience
- Area - remember ASIC standard cell design
  - 1 OPN router: 0.25mm2 in 130nm
  - A little larger than a 64-bit integer multiplier
  - 25 OPN routers: 14% of processor area
- Area breakdown of OPN router
    FIFOs                      75%
    Crossbar                   20%
    Arbitration/routing logic   5%
- Static timing estimates - nominal corner
    FIFO read                        386ps
    Arbitration                      253ps
    Crossbar                         187ps
    FIFO muxing                      473ps
    Latch setup, clock uncertainty   367ps
    Total                            1.7ns
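As a quick check of the timing table, the stage delays can be summed and compared with the clock period. The sketch below uses only the figures on the slides (the per-stage delays here and the 500MHz clock quoted with the bisection bandwidth); the variable names are made up for the example, and treating the stages as one serial path is an assumption.

```python
# Per-stage static timing estimates for the OPN router, in picoseconds (from the slide).
OPN_STAGES_PS = {
    "FIFO read": 386,
    "Arbitration": 253,
    "Crossbar": 187,
    "FIFO muxing": 473,
    "Latch setup, clock uncertainty": 367,
}

CLOCK_HZ = 500_000_000                 # 500MHz, as quoted elsewhere in the deck
cycle_ps = 10**12 // CLOCK_HZ          # clock period in picoseconds

total_ps = sum(OPN_STAGES_PS.values())  # assumes the stages lie on one serial path
slack_ps = cycle_ps - total_ps

print(f"total path: {total_ps} ps (about {total_ps / 1000:.1f} ns)")
print(f"cycle time: {cycle_ps} ps, slack: {slack_ps} ps")
```

The sum rounds to the 1.7ns total on the slide and fits inside a 2ns cycle, consistent with a single-cycle hop at 500MHz.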

Performance Observations
- Compiler controls instruction placement
  - Average number of hops in network: 2
  - Average network latency: 2
  - But can have high variance
  - Small number of critical messages can degrade performance
- Load varies across network node and across applications
  - Depends on concurrency profile
- Standard NW loads are not representative

Highly Non-uniform Injection
- Uniform in RT/DT due to interleaving of registers and data cache
- High injection rates in ETs near registers and data
- Injection rate reflects instruction placement

Network Protocol Overheads
Columns show percentage of critical path (methodology adapted from Fields, et al.)
[Figure: per-benchmark critical path breakdown; only one value, 74.0%, survives in the transcription]
mcf and twolf compiled but not hand-optimized

Operand Network Enhancements
- Operand multicast
  - Instructions have limited number of targets (2, 3, or 4)
  - Network injects one copy per cycle
  - Tree of instructions required for high-fanout operands
  - Optimization we are studying
    - Instruction specifies bit-mask of targets
    - Operand network replicates copies
- Bulk operand movement (i.e. L/S multiple)
  - Current architecture transmits one operand per message
  - Streaming data into arithmetic array is difficult
  - Optimizations we are studying
    - Single load request fetches multiple operands into successive reservation stations
    - Saves headers and streamlines return of data
    - Replicating network to provide more link BW
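To see why a limited target count forces a tree of instructions for high-fanout operands, the number of extra forwarding instructions can be counted. This is an illustrative model only: the function name is hypothetical, and it assumes every forwarding instruction has the same target limit as the producer and itself occupies one target slot of its parent.

```python
def extra_fanout_instructions(consumers, targets_per_instruction):
    """Forwarding instructions needed to reach all consumers under a target limit.

    The producer starts with `targets_per_instruction` slots; each forwarding
    instruction uses one slot and provides `targets_per_instruction` new ones,
    a net gain of `targets_per_instruction - 1` slots.
    """
    if targets_per_instruction < 2:
        raise ValueError("a fanout tree needs at least 2 targets per instruction")
    if consumers <= targets_per_instruction:
        return 0
    missing = consumers - targets_per_instruction
    gain = targets_per_instruction - 1
    return -(-missing // gain)  # ceiling division on integers


for limit in (2, 3, 4):
    print(limit, [extra_fanout_instructions(n, limit) for n in (2, 4, 8, 16)])
```

Under this model an operand with 8 consumers and a 2-target limit needs 6 extra instructions, each adding network hops; the bit-mask multicast under study would let the network replicate copies instead.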

TRIPS Memory Network (OCN)
- Topology
  - 4x10 mesh network
  - 1 cycle per hop
  - 128-bit x 2 links
  - Router embedded in memory tile
  - Network interface (router, route table)
- Routing
  - Y-X dimension order
  - 2 entry input FIFOs
  - Destination: memory address
- Flow control
  - 1-5 128-bit flits/msg
  - 4 VCs for 4 priorities
  - Wormhole routed
  - Credit-based flow control
  - Pipelined credit return
- Replaces memory bus
- Bisection BW 64GB/sec at 500MHz
[Figure: chip floorplan highlighting the OCN, with 16 64KB banks (L2 cache), two DRAM controllers, two DMA controllers, and the chip-to-chip link]
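The 64GB/sec bisection figure can be reproduced from the other numbers on this slide. The sketch below is back-of-the-envelope arithmetic, not a statement from the talk: it assumes the bisection cut runs across the long dimension of the 4x10 mesh and so crosses 4 link pairs, that "128-bit x 2" means one 128-bit link in each direction, and that 1GB is 10^9 bytes.

```python
LINKS_CUT = 4            # assumption: cutting the 4x10 mesh in half crosses 4 link pairs
DIRECTIONS = 2           # "128-bit x 2 links": one link per direction (assumed reading)
LINK_BITS = 128
CLOCK_HZ = 500_000_000   # 500MHz, from the slide

bytes_per_cycle = LINKS_CUT * DIRECTIONS * LINK_BITS // 8
bisection_gb_per_s = bytes_per_cycle * CLOCK_HZ / 1e9

print(f"{bytes_per_cycle} bytes/cycle across the cut")
print(f"{bisection_gb_per_s:.0f} GB/sec bisection bandwidth")
```

Under these assumptions the result matches the slide's 64GB/sec.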

Non-Uniform L2 Cache (NUCA)
- Exploit physical locality in cached data
- N-tile
  - Resolves address to coordinate
    - M-tile or SDC if on this chip
    - C2C controller if on another chip
  - Injects ld/st request on VC0
  - 1-byte up to full cache line
- M-tile performs lookup and returns response on VC3
  - 64KB per M-tile
- Static NUCA
  - Hop count depends on destination
  - Total unloaded latency 7-22 cycles
[Figure: OCN grid with X and Y coordinates, showing request and reply paths between Processor 0, Processor 1, and the M-tiles]

Network Based Memory Configuration
- N-tile mechanisms
  - 16-entry translation table
    - Indexed w/ 4 bits of PA
    - Produces X/Y coordinate of MT
  - Split mode to adjust cache line address interleaving
    1. Interleaved across 16 tiles
    2. Interleaved across 8 tiles (split cache)
- Convert cache banks to scratchpad
  - Remap address range from one MT to another
  - Create new TLB entry to map new physical region into VA space
[Figure: OCN grid with X and Y coordinates, showing the M-tiles assigned to Proc 0 and Proc 1]
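The N-tile translation table described above can be sketched as a 16-entry lookup from 4 physical-address bits to an M-tile coordinate. Everything beyond "16 entries, 4 bits of PA, X/Y output" is an assumption for illustration: which address bits form the index (here, the 4 bits just above an assumed 64-byte line offset), the tile coordinates, and the helper names `make_table`, `lookup`, and `remap` are all hypothetical.

```python
LINE_OFFSET_BITS = 6   # assumption: 64-byte cache lines
TABLE_ENTRIES = 16     # from the slide

# Hypothetical (x, y) coordinates for the 16 M-tiles.
ALL_TILES = [(x, y) for y in range(8) for x in range(2)]


def make_table(tiles):
    """Fill the 16-entry table by cycling through the given M-tile coordinates."""
    return [tiles[i % len(tiles)] for i in range(TABLE_ENTRIES)]


def lookup(table, paddr):
    """Map a physical address to the coordinate of the M-tile that holds it."""
    index = (paddr >> LINE_OFFSET_BITS) & (TABLE_ENTRIES - 1)
    return table[index]


def remap(table, index, new_tile):
    """Return a copy of the table with one entry redirected to another M-tile."""
    updated = list(table)
    updated[index] = new_tile
    return updated


full = make_table(ALL_TILES)        # mode 1: interleaved across 16 tiles
split = make_table(ALL_TILES[:8])   # mode 2: interleaved across 8 tiles (split cache)

print(lookup(full, 0x40), lookup(split, 8 << LINE_OFFSET_BITS))
```

Changing only table contents switches between the two interleavings, and `remap` mirrors the slide's "remap address range from one MT to another" step; no router change is needed in this model.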

OCN Design Observations
- Bandwidth and latency
  - Peak injection BW: 74GB/sec, but load is much less
  - Unloaded hit latency: 7-22 cycles
- Area
  - FIFO buffers: 75% of router area
  - OCN routers/wires: 32% of L2 area, 10% of die area
  - Opportunity to economize design
- Timing
  - Control was the critical path for the router
  - Timing path: 1.5ns (nominal case)
    - 400ps: VC arbitration
    - 427ps: crossbar arbitration
    - 393ps: FIFO control
    - 247ps: latch setup, skew

Chip-to-Chip Network
- On-chip 4-port router for C2C mesh network
- 32-bit x 2 links at 1/2 core clock speed
- Protocol is direct extension of OCN
- Global memory addressing identifies target
[Figure: eight boards (Board 0 to Board 7) carrying nodes numbered 0 to 31, each board with a block labeled P, joined by 32-bit x 2 links; the boards connect through an Ethernet switch to a host PC]

Summary
- Fast dynamic networks enable:
  - Distributed processor and memory architectures
  - Configurability
- Design experience
  - Networks were easy to build and verify
  - Larger than expected, but optimization possible
- Future challenges
  - Better traffic management w/out increasing latency
  - Drive router power down to beat other network topologies
  - How many different NWs and types of NWs are needed
    - TRIPS has 3 routed data networks
    - Multiple control networks
  - Better workloads for network analysis
  - Does it make sense to design for worst case?
  - Network interface primitives to the programmer

Acknowledgements
- Co-PIs: Doug Burger and Kathryn McKinley
- TRIPS Hardware Team
  - Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar
- TRIPS Software Team
  - Kathryn McKinley, Jim Burrill, Katie Coons, Mark Gebhart, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Sadia Sharif, Aaron Smith, Bill Yoder
- IBM Microelectronics Austin ASIC Group
- TRIPS Sponsors
  - DARPA Polymorphous Computing Architectures
  - Air Force Research Laboratories
  - National Science Foundation
  - IBM, Intel, Sun Microsystems
