Design Of Energy-Efficient On-Chip Networks

Design of Energy-Efficient On-Chip Networks
Vladimir Stojanović
Integrated Systems Group, MIT
ISSCC 2010 Tutorial

Manycore System Roadmap
64-tile system (64-256 cores) on a 2 cm x 2 cm die (e.g., Intel 48-core, Xeon):
- 4-way SIMD FMACs @ 2.5-5 GHz
- 5-10 TFlops on one chip
- Need 5-10 TB/s of off-chip I/O
- Even larger bisection bandwidth

The rise of manycore machines
The only way to meet future system feature set, design cost, power, and performance requirements is by programming a processor array:
- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)
Examples:
- IBM Cell: 1 GPP (2 threads), 8 ASPs
- Intel Network Processor (IXP2800): 1 GPP core (Intel XScale), 16 ASPs (128 threads)
- Sun Niagara: 8 GPP cores (32 threads)
- Picochip DSP: 1 GPP core, 248 ASPs
- Cisco CRS-1: 192 Tensilica GPPs
- For reference, Intel 4004 (1971): 4-bit processor, 2312 transistors, 100 KIPS, 10 micron PMOS, 11 mm2 chip
1000s of processor cores per die: "The Processor is the new Transistor" [Rowen]

Interconnect bottlenecks
Manycore system: cores (CPUs), interconnect network, cache.
- Bottlenecks due to energy and bandwidth density limitations
- Need to jointly optimize on-chip and off-chip interconnect network

Scaling to many cores
(Figure: processor connected to many DRAM DIMMs.)
Today's approaches:
- Many meshes: slow, latency varies greatly; easy to implement
- Large crossbars: fast, predictable latency; hard to build and scale

Rainbow Falls 2-stage Crossbar
- Bisection bandwidth: 461 GB/s [Patel09]

On-chip network topology spectrum
Mesh – CMesh – Clos – Crossbar
- Moving left: increasing diameter; easy to design, hard to program
- Moving right: increasing radix; hard to design, easy to program
- In power-constrained systems, need to look at networks in a cross-cut approach: connect physical implementation (channels, routers, power) with network topology, routing and flow-control
- Radix: number of inputs and outputs of each switching node
- Diameter: largest minimal hop count over all node pairs

NOCs Tutorial Roadmap
- Networking Basics
- Building Blocks
- Evaluation

NOCs Tutorial Roadmap
- Networking Basics
  – Topologies
  – Routing
  – Flow-Control
- Building Blocks
- Evaluation

Message definitions [Dally04]
- Packet header carries routing info (RI) and a sequence number (SN)
- Flit: basic unit of bandwidth and storage allocation (flow-control)
- Phit: sent across the channel in a clock cycle
- Basic trade-off in flit size:
  – Minimize overheads (large size)
  – Efficient use of resources (small size)
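To make the hierarchy concrete, here is a small sketch (not from the tutorial; flit and phit sizes are assumed) that segments a message into a head flit carrying RI and SN, body flits, and a tail flit:

```python
# Minimal sketch (assumed field sizes, not from the tutorial): segmenting a
# packet into flits. The head flit carries routing info (RI) and a sequence
# number (SN); flits are the unit of buffer/bandwidth allocation; each flit is
# sent as one or more phits across the channel.
from dataclasses import dataclass
from typing import List, Optional

FLIT_BITS = 128   # assumed flit size
PHIT_BITS = 32    # assumed channel width (phit size)

@dataclass
class Flit:
    kind: str                            # "head", "body" or "tail"
    routing_info: Optional[int] = None   # RI, only in the head flit
    seq_num: Optional[int] = None        # SN, only in the head flit
    payload_bits: int = FLIT_BITS

def packetize(message_bits: int, dest: int, seq: int) -> List[Flit]:
    n_flits = max(1, -(-message_bits // FLIT_BITS))      # ceiling division
    flits = [Flit("head", routing_info=dest, seq_num=seq)]
    flits += [Flit("body") for _ in range(max(0, n_flits - 2))]
    if n_flits > 1:
        flits.append(Flit("tail"))
    return flits

pkt = packetize(message_bits=512, dest=42, seq=7)
print([f.kind for f in pkt], "phits per flit =", FLIT_BITS // PHIT_BITS)
```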

Latency Components
- Zero-load latency: average latency without contention

  T0 = Hmin·tr + Dmin/v + L/b
  (router delays + channel delays + serialization delay)

  e.g., for a two-hop route x→y→z: T0 = 2·tr + (t_xy + t_yz) + L/b
- Hmin: average minimum number of hops
- tr: router delay
- Dmin: average minimum distance
- v: signal velocity
- L: packet length in bits
- b: router-to-router channel bandwidth
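A small sketch of the zero-load latency calculation above, with all example numbers assumed for illustration:

```python
# Zero-load latency T0 = Hmin*tr + Dmin/v + L/b, split into router, channel
# (time-of-flight) and serialization components. Example numbers are assumed.
def zero_load_latency(h_min, t_r, d_min, v, length_bits, bandwidth):
    router_delay = h_min * t_r               # Hmin hops, tr cycles per router
    channel_delay = d_min / v                # distance / signal velocity
    serialization = length_bits / bandwidth  # L / b
    return router_delay + channel_delay + serialization

# e.g. 8 hops, 3-cycle routers, 16 mm path at 2 mm/cycle, 512-bit packet, 128 b/cycle
print(zero_load_latency(h_min=8, t_r=3, d_min=16.0, v=2.0,
                        length_bits=512, bandwidth=128))   # 36.0 cycles
```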

Ideal network throughput
- Θ: maximum traffic that can be sustained by all cores
- Mesh throughput: 50% of data crosses the bisection, assuming uniform random traffic
- Bisection bandwidth: 2·sqrt(N)·b
- Data crossing the bisection: N·bcore/2
- Maximum on-chip throughput: Θideal = N·bcore = 4·sqrt(N)·b
where N is the number of cores, b the router-to-router link bandwidth, and bcore the rate at which each core generates traffic.
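The same bound can be computed directly; the sketch below (link bandwidth chosen arbitrarily) evaluates Θideal = 4·sqrt(N)·b for a 64-core mesh:

```python
# Ideal throughput bound for a k-ary 2-mesh under uniform random traffic:
# half of all injected data crosses the bisection, whose bandwidth is
# 2*sqrt(N)*b, so N*bcore/2 <= 2*sqrt(N)*b and the aggregate ideal throughput
# is Theta_ideal = N*bcore = 4*sqrt(N)*b. Example numbers are assumed.
import math

def mesh_ideal_throughput(n_cores, link_bw):
    bisection_bw = 2 * math.sqrt(n_cores) * link_bw    # 2*sqrt(N)*b
    theta_ideal = 2 * bisection_bw                     # = 4*sqrt(N)*b
    per_core_limit = theta_ideal / n_cores             # bcore <= 4*b/sqrt(N)
    return theta_ideal, per_core_limit

theta, bcore = mesh_ideal_throughput(n_cores=64, link_bw=128)  # b in bits/cycle
print(theta, "b/cycle aggregate,", bcore, "b/cycle per core")   # 4096.0, 64.0
```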

Network performance plots
(Latency vs. offered bandwidth: zero-load latency includes the effects of routing and flow-control; the saturation point is set by the topology's ideal throughput Θideal.)

Tori [Dally04]
- Low-radix, large-diameter networks
- N-ary K-cube (or mesh)
  – N nodes per dimension
  – K dimensions
- Examples: 4-ary 2-cube, 4-ary 2-mesh
- Cubes have 2x larger bisection bandwidth than meshes
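A quick numeric check of the 2x bisection claim, using the standard link-counting argument for N-ary K-cubes (an illustrative sketch, not from the tutorial):

```python
# Bisection link count for N-ary K-cube topologies: a mesh bisection cuts
# N^(K-1) links, and the torus wrap-around doubles this to 2*N^(K-1),
# hence the 2x bisection bandwidth of cubes over meshes.
def bisection_links(n_per_dim, k_dims, torus=False):
    links = n_per_dim ** (k_dims - 1)
    return 2 * links if torus else links

print("4-ary 2-mesh bisection links:", bisection_links(4, 2))              # 4
print("4-ary 2-cube bisection links:", bisection_links(4, 2, torus=True))  # 8
```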

TILE64 [Bell08]
- 64 cores at 750 MHz
- Memory BW: 25 GB/s
- Bisection BW: 240 GB/s

TILE64 Networks [Wentzlaff07]
- Five networks, 32-bit channels on all:
  – STN – Static network (scalar operand network)
  – TDN – Tile Dynamic network
  – UDN – User Dynamic network
  – MDN – Memory Dynamic network
  – IDN – I/O Dynamic network
- Wormhole, dimension-order routed; 5-port routers with credit-based flow-control
- TDN and MDN implement the memory sub-system
- UDN/IDN: directly accessible by the processor ALU (message-based, variable length)

Improving Tori – Express cubes
- Increase bisection bandwidth, reduce latency
  – Add expressways: long "express" channels
- (Figure: one dimension of a 16-ary express cube with 4-hop express channels.)
- Add extra channels to diversify and/or increase bisection

Butterflies [Dally04]
- N-ary K-fly
  – N nodes per switch
  – K stages
- Example: 2-ary 4-fly

Path diversity problem [Dally04]
- Butterflies have no path diversity
- Bad performance for some traffic patterns, e.g. the shuffle permutation
  – Wide spread in bandwidth
  – Inherently blocking
- Fixed in Clos topologies

Clos networks [Clos53]
- 8-ary 2-fly butterfly vs. 8-ary 3-fly Clos
- Redundant paths give more uniform throughput

Logical to Physical Mapping
- 8-ary 3-stage Clos: input/output stages (I-VIII, a-h), middle stage (A-H)
- Same topology – different physical mapping

Topology comparison [Joshi09]
(Figure: physical layouts of Mesh, CMesh, Clos and Crossbar on a tiled die.)

Routing Algorithms
- Deterministic routing algorithms
  – Always the same path between x and y
  – Poor load balancing (ignores inherent path diversity)
  – Quite common in practice: easy to implement and make deadlock-free
- Oblivious algorithms
  – Choose a route without the network's present state, e.g. a random middle node in a Clos
- Adaptive algorithms
  – Use network state information in routing: length of queues, historical channel load, etc.

Deterministic Routing [Dally04]
- Dimension-order routing in tori (e.g., a 6-ary 2-cube): route completely in the first dimension, then in the next
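A minimal sketch of XY dimension-order routing on a 2D mesh (illustrative code, not the tutorial's or TILE64's implementation):

```python
# XY dimension-order routing on a K-ary 2-mesh: route fully in X, then in Y.
# Deterministic and deadlock-free on a mesh.
def dimension_order_route(src, dst):
    """Return the list of (x, y) router coordinates visited from src to dst."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # first dimension (X)
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then second dimension (Y)
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(dimension_order_route((0, 0), (3, 2)))
# [(0,0), (1,0), (2,0), (3,0), (3,1), (3,2)]
```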

Oblivious Routing [Dally04]
Valiant's algorithm (randomized routing):
- 6-ary 2-cube: randomly select a middle node, dimension-order route to it and from it
- Folded Clos (fat tree): randomly select the nearest common ancestor switch
- 8-ary 3-fly Clos: randomly select a middle switch
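For the mesh case, Valiant's algorithm can be sketched as two dimension-order phases through a randomly chosen intermediate node (illustrative code; coordinates and mesh size are assumed):

```python
# Valiant's algorithm on a K-ary 2-mesh: pick a random intermediate node,
# dimension-order route src -> intermediate, then intermediate -> dst.
# This trades extra hops for load balance on adversarial traffic patterns.
import random

def dor(src, dst):                       # XY dimension-order routing
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def valiant_route(src, dst, k):
    mid = (random.randrange(k), random.randrange(k))     # random middle node
    return dor(src, mid) + dor(mid, dst)[1:]             # drop duplicated node

random.seed(0)
print(valiant_route((0, 0), (3, 2), k=6))
```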

Flow Control
- Bufferless flow-control (circuit switching)
- Buffered flow-control (packet switching)
  – Packet-based (store-and-forward, cut-through)
  – Flit-based (wormhole, virtual channels)
- Buffer management
  – Credit-based, on/off, flit-reservation

Circuit switching [Dally04]
- R (request) acquires channel state at each hop; a blocked request is held at the switch
- A (acknowledgment) returns to the source; D (data packets) follow; T (tail flit) deallocates the channels
- Example: two four-flit packets
- Pros: simple to implement (simple routers, small buffers)
- Cons: high latency (R→A round trip) and low throughput

Example – Pipelined Circuit Switching [Anders08]
- 64-core 2D mesh, 125 mW/router
- Network efficiency: 3 pJ/bit

Packet-buffered Flow Control [Dally04]
- Buffer and channel allocated to the whole packet
- Store-and-forward: start the next hop after the whole packet is received (5-flit packet example)
- Cut-through: start the next hop after the head flit is received (example shows contention for channel 2)
- Both are ineffective in their use of buffer storage; contention increases latency in the channels

Flit-buffered Flow Control – Wormhole [Dally04]
- Buffer and channel allocated to flits
- (Figure legend: I – idle, W – waiting, A – allocated; a blocked channel is freed up by the tail flit.)
- More efficient buffer usage than cut-through
- But may block a channel mid-packet

Flit-buffered Flow Control – Wormhole vs. Virtual-Channel [Dally92][Dally04]

Virtual-channels – Bandwidth Allocation [Dally04]
- Inputs compete for bandwidth flit-by-flit (figure tracks the number of flits in each VC buffer, capacity 3)
- Fair arbitration vs. winner-take-all arbitration
- Winner-take-all reduces latency with no throughput penalty

Virtual-channel Router [Dally04]
- 1-VC, 2-VC and 4-VC organizations
- Each channel buffer only needs to be as deep as the round-trip credit latency
- More buffering allows more virtual channels
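The buffer-depth rule can be made concrete with a small calculation; the cycle counts below are assumed examples, not the tutorial's numbers:

```python
# To keep a virtual channel's link fully utilized, its buffer must cover the
# credit round trip: flit traversal, downstream processing, credit return and
# upstream credit processing. At one flit per cycle per VC, the minimum depth
# in flits equals the round-trip latency in cycles.
def min_vc_depth(flit_wire_cycles, credit_wire_cycles,
                 router_pipeline_cycles, credit_pipeline_cycles):
    round_trip = (flit_wire_cycles + router_pipeline_cycles +
                  credit_wire_cycles + credit_pipeline_cycles)
    return round_trip   # flits per VC

print(min_vc_depth(flit_wire_cycles=1, credit_wire_cycles=1,
                   router_pipeline_cycles=3, credit_pipeline_cycles=1))  # 6 flits
```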

Credit-based buffer management [Dally04]
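A minimal sketch of the credit mechanism described in [Dally04]: the upstream node may send only while it holds credits, so the downstream buffer can never overflow (credit-return latency is ignored here for brevity):

```python
# Credit-based flow control: the upstream router keeps a credit count equal to
# the free downstream buffer slots. Sending a flit consumes a credit; when the
# downstream router frees a slot it returns a credit upstream.
from collections import deque

class CreditLink:
    def __init__(self, buffer_depth):
        self.credits = buffer_depth          # free slots downstream
        self.downstream = deque()            # downstream input buffer

    def try_send(self, flit):
        if self.credits == 0:
            return False                     # stall: no credits left
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def downstream_dequeue(self):
        flit = self.downstream.popleft()     # flit leaves the downstream buffer
        self.credits += 1                    # credit returned upstream (latency ignored)
        return flit

link = CreditLink(buffer_depth=4)
sent = [link.try_send(f"flit{i}") for i in range(6)]   # only 4 succeed
print(sent)                                  # [True, True, True, True, False, False]
link.downstream_dequeue()
print(link.try_send("flit4"))                # True again after a credit returns
```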

NOCs Tutorial Roadmap
- Networking Basics
- Building Blocks
  – Channels
  – Routers
- Evaluation

Building block costs
- Router vs. channel energy; router area breakdowns (90 nm technology)
- Simple routers and channels are roughly balanced
- Narrower networks scale better

Channels: Electrical technology
Repeater-inserted pipelined wires
- Design constraints: 22 nm technology, 500 nm pitch, 5 GHz clock
- Design parameters: wire width, repeater size, repeater spacing
- Wire lengths: 1.0, 2.5, 5.0, 7.5, 10.0 mm

Channels: Equalized interconnects [Mensink07, Kim08, Kim09]
- FFE shapes the transmitted pulse
- DFE cancels the first trailing ISI tap
- Lower energy cost due to output voltage swing attenuation

Repeated interconnects vs. equalized interconnects
- Comparable latency
- Data-dependent energy (DDE) is 4-10x lower for equalized interconnects, while fixed energy (FE) is comparable

Channels: Silicon photonic technology [Gunn06, Orcutt08]
- Energy spent in O-E conversion: 25-60 fJ/bit (independent of link length)
- Receiver sensitivity: -20 dBm
- Photodetector loss: 0.1 dB; filter drop loss: 1.5 dB

Silicon photonic link – WDM
- Through-ring loss: 1e-4 to 1e-2 dB/ring
- Dense WDM improves bandwidth density, e.g. 128 λ/waveguide at 10 Gb/s per λ

Silicon photonic link – Energy cost
- E-O-E conversion cost: 50-150 fJ/bit (independent of length)
- Thermal tuning energy: 2-20 µW/K per heater, increases with ring count
- External laser power: dependent on losses in the photonic devices
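A rough way to fold these three components into a single energy-per-bit figure; the component values are taken from the ranges above, while the ring count, tuning range, per-wavelength laser power and data rate are assumptions:

```python
# Rough energy-per-bit estimate for a silicon-photonic link: E-O-E conversion
# plus thermal tuning and laser power amortized over the link's data rate.
def photonic_energy_per_bit(eoe_fj_per_bit, rings, tuning_uW_per_K, delta_T_K,
                            laser_mW, data_rate_gbps):
    tuning_W = rings * tuning_uW_per_K * delta_T_K * 1e-6          # heater power
    tuning_fj = tuning_W / (data_rate_gbps * 1e9) * 1e15           # J/bit -> fJ/bit
    laser_fj = (laser_mW * 1e-3) / (data_rate_gbps * 1e9) * 1e15
    return eoe_fj_per_bit + tuning_fj + laser_fj

# 100 fJ/b E-O-E, 2 rings/link, 10 uW/K/heater over a 20 K range, 1 mW laser, 10 Gb/s
print(photonic_energy_per_bit(100, rings=2, tuning_uW_per_K=10, delta_T_K=20,
                              laser_mW=1, data_rate_gbps=10), "fJ/bit")  # ~240 fJ/bit
```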

Electrical vs. Optical links – Energy cost
- Elec: electrical; Opt-A: optical, aggressive; Opt-C: optical, conservative
- Roughly 2x to 6x lower energy for the optical links in this comparison
- Optical laser power not shown (dependent on the physical topology)

Channel Technologies
On-chip links                       Latency (cyc)   Energy (fJ/b)   Density (Gb/s/µm)
Optimally repeated wire (2.5 mm)    1               100             10
Equalized link (2.5 mm)             2               80              10
Photonic link (2.5 mm)              2               100-200         320
Optimally repeated wire (10 mm)     2               500             10
Equalized link (10 mm)              2               120             10
Photonic link (10 mm)               2               100-200         320
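As a usage sketch, the 10 mm rows of the table can be encoded and queried to compare total link energy for an assumed traffic volume (the photonic entry uses the midpoint of its 100-200 fJ/b range):

```python
# Energy (fJ/bit) and bandwidth density (Gb/s/um) from the channel table above,
# used to compare candidate 10 mm links for an assumed traffic volume.
CHANNELS_10MM = {
    "repeated wire": {"latency_cyc": 2, "energy_fj_per_bit": 500, "density_gbps_per_um": 10},
    "equalized":     {"latency_cyc": 2, "energy_fj_per_bit": 120, "density_gbps_per_um": 10},
    "photonic":      {"latency_cyc": 2, "energy_fj_per_bit": 150, "density_gbps_per_um": 320},
}   # photonic energy taken as the middle of the 100-200 fJ/b range

bits_moved = 1e12   # assumed: 1 Tb of traffic over the link
for name, ch in CHANNELS_10MM.items():
    energy_uj = ch["energy_fj_per_bit"] * bits_moved * 1e-15 * 1e6   # fJ -> uJ
    print(f"{name:14s} {energy_uj:10.1f} uJ per Tb")
```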

Routers
(Figure: router microarchitecture with input VC state, per-packet and per-flit state, and output VC state.)

Router pipeline
- Pipelined routing of a packet:
  – RC – route computation
  – VA – virtual-channel allocation
  – SA – switch allocation
  – ST – switch traversal
- Pipeline stalls (e.g., virtual-channel allocation stall)

Speculation and Lookahead
- Speculative allocation
- Lookahead routing (pass routing for the next hop in the head flit)

Crossbar switches
- No speedup: 68% capacity
- 2x input speedup: 90% capacity
- 2x output speedup: 87% capacity
- 2x input & output speedup: 137% capacity

Router design space exploration – Setup [Shamim09]
- w: flit size (bits); p: number of ports
- 6-bit destination address for a 64-core system

Matrix Crossbar

Mux Crossbar

Example System Design Space
- Routers for a 64-tile system, 1 GHz frequency
- 1 message = 512 bits; 4 messages per input port (2048 bits of buffering)
- Router aspect ratio 1
- p = 5, 8, 12; w = 32, 64, 128 (bits)
- Matrix xbar and mux xbar implementations

5x5 Router Floorplan (128-bit)
(Figure: five input/output ports with 16-word x 128-bit SRAM buffers, VDD/VSS power rings; router area shown relative to chip area.)

8x8 Router Floorplan (128-bit)
(Figure: eight input/output ports with 16-word x 128-bit SRAM buffers.)

12x12 Router Floorplan (128-bit)
(Figure: twelve input/output ports with 16-word x 128-bit SRAM buffers.)

Area vs. Port Width and Radix
- Mux crossbar always better
- 5-12 port routers scale well (sub-p², sub-b²)

Power vs. Port Width and Radix
- Mux crossbar always better
- 5-12 port routers scale well (sub-p², sub-b²)

Router Power Breakdown
(Bar chart: power (mW) split into arbiter, crossbar (Xbar) and buffer for 5- and 8-port matrix-crossbar routers at 32, 64 and 128 bits.)
- Xbar and buffer power roughly even
- Improve the Xbar with circuit/channel design (e.g., equalized, low-swing)
- Use fewer buffers (circuit switching, token flow control) [Anders08, Kumar08]

Router Area per Core vs. Number of Ports

Effects of Concentration [Balfour06]
- Mesh to CMesh: 5-port routers to 8-port routers
- Works well for small flits and small numbers of ports

Orion 1.0 vs. P&R design
- Ratio: power of synthesized designs / dynamic (no leakage) power of the analytical models
- (Bar chart: buffer, Xbar and total ratios for 5-, 8- and 12-port routers at 32, 64 and 128 bits; ratio axis spans 0-6.)

Orion 2.0 vs. P&R design [Kahng09][Shamim09]
- Ratio: power of synthesized designs / dynamic (no leakage) power of the analytical models
- (Bar chart: buffer, Xbar and total ratios for 5-, 8- and 12-port routers at 32, 64 and 128 bits; ratio axis spans 0-1.4.)

NOCs Tutorial Roadmap
- Networking Basics
- Building Blocks
- Evaluation

Landscape of on-chip photonic networks [Joshi'09a][Pan'09][Kirman'06]

Clos with electrical interconnects
- 8-ary 3-stage Clos
- 10-15 mm channels: equalized or pipelined-repeater links

Centralized Multiplexer Crossbar
- Electrical design vs. photonic design (layouts)

Clos network using point-to-point channels
- Electrical design vs. photonic design (layouts)

Photonic Clos for a 64-tile system
- 64 tiles, 56 waveguides (for a tile throughput of 128 b/cycle)
- 128 modulators per cluster, 128 ring filters per cluster
- Total rings: ~28K; thermal tuning power: 0.56 W
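The 0.56 W thermal-tuning figure follows directly from the ring count and the 1 µW/ring/K cost quoted on the comparison slide below; the 20 K tuning range in this sketch is an assumption chosen to match the quoted number:

```python
# Thermal tuning power = rings * (tuning power per ring per K) * temperature range.
# With ~28K rings at 1 uW/ring/K and an assumed 20 K tuning range this gives
# the 0.56 W quoted for the 64-tile photonic Clos.
def thermal_tuning_power_W(rings, uW_per_ring_per_K, delta_T_K):
    return rings * uW_per_ring_per_K * delta_T_K / 1e6   # uW -> W

print(thermal_tuning_power_W(rings=28_000, uW_per_ring_per_K=1, delta_T_K=20))  # 0.56 W
```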

Photonic device requirements in a Clos
- Contour plots: optical laser power (W) and percent area of photonic devices
- Waveguide-loss and through-loss limits for a 2 W optical laser power constraint (30% laser efficiency)

Photonic device requirements in a Clos (continued)
- Contour plots: optical laser power (W) and percent area of photonic devices
- Optical loss tolerance for the Crossbar vs. optical loss tolerance for the Clos

Photonic Crossbar vs. Photonic Clos
- Crossbar: 10 W for thermal tuning circuits (1 µW/ring/K); for 2 W optical laser power it requires waveguide loss of 1 dB/cm and through loss of 0.002 dB/ring
- Clos: 0.56 W for thermal tuning circuits (1 µW/ring/K); for 2 W optical laser power it tolerates waveguide loss of 2 dB/cm and through loss of 0.05 dB/ring
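A sketch of how the laser-power requirement follows from the loss budget and the receiver sensitivity given earlier; the path length, number of rings passed and per-wavelength accounting here are assumptions, not the tutorial's exact model:

```python
# Required electrical laser power for one wavelength: receiver sensitivity
# raised by the optical loss along the path, divided by laser (wall-plug)
# efficiency. Loss numbers follow the slides; path length and rings passed
# are assumed for illustration.
def laser_power_mW(rx_sensitivity_dBm, waveguide_dB_per_cm, length_cm,
                   through_dB_per_ring, rings_passed, drop_loss_dB,
                   photodetector_loss_dB, laser_efficiency):
    total_loss_dB = (waveguide_dB_per_cm * length_cm +
                     through_dB_per_ring * rings_passed +
                     drop_loss_dB + photodetector_loss_dB)
    optical_mW = 10 ** ((rx_sensitivity_dBm + total_loss_dB) / 10)  # dBm -> mW
    return optical_mW / laser_efficiency

# Clos-like numbers: -20 dBm sensitivity, 2 dB/cm over 2 cm, 0.05 dB/ring past
# 128 rings, 1.5 dB drop loss, 0.1 dB photodetector loss, 30% laser efficiency.
print(laser_power_mW(-20, 2.0, 2.0, 0.05, 128, 1.5, 0.1, 0.30), "mW per wavelength")
```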

Simulation setup
- Cycle-accurate microarchitectural simulator
- Traffic patterns based on the partition application model
  – Global traffic: UR, P2D, P8D
  – Local traffic: P8C
- 64-tile system, 512-bit messages
- Events captured during simulations to calculate power
- (Figure: CMesh and Clos layouts)

Partition application model [Joshi'09]
- Tiles divided into logical partitions; communication is within a partition
- Logical partitions mapped to physical tiles
  – Co-located tiles → local traffic
  – Distributed tiles → global traffic
- Traffic patterns: uniform random (UR); 2 tiles per partition distributed across the chip (P2D); 8 tiles per partition distributed across the chip (P8D); 8 tiles per partition co-located (P8C)

Latency vs. BW [Joshi09b]
- Networks: mesh, cmeshX2, flatFlyX2, clos; ideal throughput θT = 8 kb/cycle for UR
- flatFlyX2 vs. mesh/cmeshX2
  – Saturation BW comparable (UR, P8D, P2D)
  – flatFlyX2 has lower latency
- clos vs. mesh/cmeshX2/flatFlyX2
  – Saturation BW uniform for all traffic, comparable to UR of mesh
  – Latency uniform for all traffic, comparable to UR of mesh

Mesh vs. X2
- Repeater-inserted interconnects: cmeshX2 has lower power than mesh at comparable throughput
- Equalized interconnects: cmeshX2 gains a further 1.5x reduction in power
- Channel gains are masked by router power

Power vs. BW plots – repeater-inserted pipelined vs. equalized
- 1.5-2x lower power with equalized channels
- (Plots for mesh, cmeshX2, flatFlyX2 and clos: repeater-inserted vs. equalized.)

Power split
- Channel DDE reduces by 4-10x using equalized links
- Channel fixed power and router power still need to be tackled

Latency vs. BW – no VC vs. 4 VCs
- (Plots for mesh, flatFlyX2 and clos; ideal throughput 8 kb/cycle for UR.)
- Saturation throughput improves using VCs
- Small change in power at comparable throughput

Latency vs. BW – no VC vs. 4 VCs (continued)
(Plots: no-VC designs at an ideal UR throughput of 8 kb/cycle, and 4-VC designs at ideal UR throughputs of 8 kb/cycle and 4 kb/cycle.)

Power vs. BW – no VC vs. 4 VCs, repeater-inserted pipelined
- (Plots for mesh, flatFlyX2 and clos.)
- 25-50% lower power using VCs at comparable throughput

Power vs. BW – no VC, repeater-inserted pipelined vs. 4 VCs, equalized
- 2-3x lower power obtained using equalized interconnects and VCs at comparable throughput
- (Plots for mesh, flatFlyX2 and clos.)

Power split
- VCs are an indirect way to increase the impact of channel power
  – Narrower networks, lower power for the same throughput, keep utilization high

Power-Bandwidth tradeoff
- 2-3x on-chip power savings for global traffic (off-chip laser)
- (Plot compares EClos and PClos.)
- Channel width bc / ideal throughput θT: CMeshX2 128b, 4 kb/cycle; Clos 64b, 4 kb/cycle

Power-Bandwidth tradeoff (continued)
- Comparable on-chip power for local traffic (off-chip laser)
- (Plot compares EClos and PClos.)
- Channel width / ideal throughput: Clos 128b, 8 kb/cycle

Summary
Mesh – CMesh – Clos – Crossbar
- Cross-cut approach for NOC design needed
  – Application mapping
  – Topology, routing, flow-control
- Improving routers and channels equally important
- Opportunities for new technologies
  – New circuit design (low-swing, equalized)
  – System: DVFS, bus encoding

To probe further (tools and sites)
- Orion Router Design Exploration Tool: http://www.princeton.edu/~peh/orion.html
- Router RTLs: Bob Mullins' Netmaker (http://www-dyn.cl.cam.ac.uk/~rdm34/wiki)
- Network simulators
  – Garnet (http://www.princeton.edu/~niketa/garnet.html)
  – Booksim (http://nocs.stanford.edu/booksim.html)
Integrated Systems Group at MIT (vlada@mit.edu)
http://www.rle.mit.edu/isg/

Bibliography
[Agarwal09] N. Agarwal, T. Krishna, L.-S. Peh and N. K. Jha, "GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator," Proc. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA, April 2009.
[Anders08] M. Anders, H. Kaul, M. Hansson, R. Krishnamurthy and S. Borkar, "A 2.9 Tb/s 8 W 64-Core Circuit-Switched Network-on-Chip in 45nm CMOS," European Solid-State Circuits Conference, 2008.
[Balfour06] J. Balfour and W. Dally, "Design tradeoffs for tiled CMP on-chip networks," Int'l Conf. on Supercomputing, June 2006.
[Bell08] S. Bell et al., "TILE64 Processor: A 64-Core SoC with Mesh Interconnect," ISSCC, pp. 88-598, 2008.
[Benini02] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, vol. 35, no. 1, pp. 70-78, 2002.
[Clos53] C. Clos, "A study of non-blocking switching networks," Bell System Technical Journal, 32:406-424, 1953.
[Dally92] W. J. Dally, "Virtual-channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205, 1992.
[Dally01] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-chip Interconnection Networks," DAC, pp. 684-689, 2001.
[Dally04] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004.
[Gunn06] C. Gunn, "CMOS photonics for high-speed interconnects," IEEE Micro, 26(2):58-66, Mar./Apr. 2006.
[Joshi09a] A. Joshi et al., "Silicon-Photonic Clos Networks for Global On-Chip Communication," 3rd ACM/IEEE International Symposium on Networks-on-Chip (NOCS), San Diego, CA, pp. 124-133, May 2009.
[Joshi09b] A. Joshi, B. Kim and V. Stojanović, "Designing Energy-efficient Low-diameter On-chip Networks with Equalized Interconnects," IEEE Symposium on High-Performance Interconnects, New York, NY, August 2009.
[Kahng09] A. Kahng, B. Li, L.-S. Peh and K. Samadi, "ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration," Proc. Design Automation and Test in Europe (DATE), Nice, France, April 2009.

[Kim07] J. Kim, J. Balfour and W. J. Dally, "Flattened butterfly topology for on-chip networks," Proc. 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 172-182, Dec. 2007.
[Kim08] B. Kim and V. Stojanović, "Characterization of equalized and repeated interconnects for NoC applications," IEEE Design and Test of Computers, 25(5):430-439, 2008.
[Kim09] B. Kim and V. Stojanović, "A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmitter filter and transimpedance receiver in 90nm CMOS technology," ISSCC Dig. Tech. Papers, pp. 66-67, Feb. 2009.
[Kirman06] N. Kirman et al., "Leveraging optical technology in future bus-based chip multiprocessors," Int'l Symp. on Microarchitecture, Dec. 2006.
[Krishna08] T. Krishna, A. Kumar, P. Chiang, M. Erez and L.-S. Peh, "NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication," Proc. Hot Interconnects (HOTI), Stanford, CA, August 2008.
[Kumar08] A. Kumar, L.-S. Peh and N. Jha, "Token Flow Control," Proc. 41st International Symposium on Microarchitecture (MICRO), Lake Como, Italy, November 2008.
[Mensink07] E. Mensink et al., "A 0.28pJ/b 2Gb/s/ch transceiver in 90nm CMOS for 10 mm on-chip interconnects," ISSCC Dig. Tech. Papers, pp. 414-612, Feb. 2007.
[Nawathe08] U. Nawathe et al., "Implementation of an 8-core, 64-thread, power-efficient SPARC server on a chip," IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 6-20, Jan. 2008.
[Orcutt08] J. Orcutt et al., "Demonstration of an electronic photonic integrated circuit in a commercial scaled bulk CMOS process," Conf. on Lasers and Electro-Optics, May 2008.

[Pan09] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang and A. Choudhary, "Firefly: illuminating future network-on-chip with nanophotonics," SIGARCH Comput. Archit. News, 37, pp. 429-440, Jun. 2009.
[Patel09] S. Patel, "Rainbow Falls: Sun's Next Generation CMT Processor," Hot Chips, 2009.
[Petracca08] M. Petracca, B. G. Lee, K. Bergman and L. P. Carloni, "Design Exploration of Optical Interconnection Networks for Chip Multiprocessors," 16th Annual IEEE Symposium on High-Performance Interconnects (HotI), 2008.
[Psota07] J. Psota et al., "ATAC: On-chip optical networks for multicore processors," Boston Area Architecture Workshop, Jan. 2007.
[Shacham07] A. Shacham et al., "Photonic NoC for DMA communications in chip multiprocessors," Symp. on High-Performance Interconnects, Aug. 2007.
[Shamim09] I. Shamim, Energy Efficient Links and Routers for Multi-Processor Computer Systems, M.S. Thesis, MIT.
[Vangal07] S. Vangal et al., "80-tile 1.28 TFlops network-on-chip in 65 nm CMOS," Int'l Solid-State Circuits Conf., Feb. 2007.
[Vantrease08] D. Vantrease et al., "Corona: System implications of emerging nanophotonic technology," Int'l Conf. on Computer Architecture, Jun. 2008.
[Wang03] H. Wang, L.-S. Peh and S. Malik, "Power-driven design of router microarchitectures in on-chip networks," MICRO-36, pp. 105-116, 2003.
[Wentzlaff07] D. Wentzlaff et al., "On-chip Interconnection Architecture of the Tile Processor," IEEE Micro, vol. 27, no. 5, pp. 15-31, Sept.-Oct. 2007.
