The Impact Of Moore’s Law And Loss Of Dennard Scaling


The Impact of Moore's Law and Loss of Dennard Scaling
Lennart Johnsson, University of Houston
2015-02-06

Outline
- The quest for energy efficiency in computing
- Technology changes and the impacts thereof
- A retrospective on benefits and drawbacks of past architectural changes
- Expected architectural changes and their impact
- Our research to understand opportunities and challenges for HPC driven by the expected technology changes

What Got Me Interested in Energy-Efficient Computing
Energy cost estimate for a 1300-node cluster purchased in 2008 for PDC at the Royal Institute of Technology: the four-year energy and cooling cost was 1.5 times the cluster cost, including software, maintenance, and operations!
Business as usual is not appealing from a scientist/user point of view.

"You Can Hide the Latency, But You Cannot Hide the ENERGY!!"
Peter M. Kogge
Rule of thumb: 1 MW costs about $1M/yr.
Average US household: 11 MWh/yr (1 MW-year is roughly 800 households).
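The household arithmetic above can be checked directly (8760 hours per year; the 11 MWh/yr figure is from the slide):

```python
# Sanity-check of the rule of thumb: 1 MW of continuous draw for one year,
# expressed in average US households at 11 MWh/yr (figure from the slide).
HOURS_PER_YEAR = 24 * 365            # 8760
mw_year_mwh = 1 * HOURS_PER_YEAR     # 1 MW for a year = 8760 MWh
households = mw_year_mwh / 11        # ~796, i.e. the slide's "~800 households"
print(round(households))
```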

Incredible Improvement in Integrated Circuit Energy Efficiency
Source: Lorie Wigle, Intel, "Eco-Technology: Delivering Efficiency and Innovation", 2008

Internet Data Centers: Facebook
- The 120 MW Lulea Data Center will consist of three server buildings with an area of 28,000 m² (300,000 ft²).
- The first building is to be operational within a year; the entire facility is scheduled for completion by 2014.
- The Lulea river has an installed hydroelectric capacity of 4.4 GW and produces on average 13.8 TWh/yr.
[Climate table for Luleå: yearly average high 5.0 °C (41.0 °F), yearly average low -2.5 °C (27.5 °F); winter monthly averages well below freezing.]

Actions of Large Consumers/Providers
- Custom server designs for reduced energy consumption: stripping out un-needed components and, in some cases, reducing redundancy by implementing resiliency at the system level, not at the server level
- In some cases making designs public, e.g. Facebook's Open Compute initiative
- Focusing on clean, renewable energy
- Locating data centers close to energy sources (low transmission losses) and where cooling costs are low

Internet Company Effort
Using FPGAs to speed up Bing search (source URL truncated)

Power Consumption vs. Load for a Typical Server (2008/2009)
- CPU power consumption at low load is about 40% of consumption at full load.
- Power consumption of all other system components is approximately independent of load.
- Result: power consumption at low load is about 65% of consumption at full load.
Source: Luiz André Barroso, Urs Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines"
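A toy model reproduces the 65% figure; the CPU's share of full-load power (~58%) is an assumption chosen to make the numbers consistent, not a value from the slide:

```python
# Toy model of server power vs. load. ASSUMPTION (not from the slide): the CPU
# draws ~58% of full-load power; all other components are load-independent.
cpu_share = 0.58
cpu_low_load = 0.40                  # CPU at low load: 40% of its full-load power
low_load_power = (1 - cpu_share) + cpu_share * cpu_low_load
print(f"{low_load_power:.0%}")       # ~65% of full-load power
```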

Google
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.
Source: "The Case for Energy-Proportional Computing", Luiz André Barroso, Urs Hölzle, IEEE Computer, vol. 40

Internet vs. HPC Workloads
[Charts: Google (Internet) vs. KTH/PDC (HPC) workload distributions.]

The 1st Foray into Energy-Efficient Computing "on a Budget": the SNIC/KTH/PRACE Prototype I
Objective: energy efficiency on par with the most energy-efficient design at the time, but using only commodity technology (for cost) and no acceleration, for preservation of the programming paradigm.
- New 4-socket blade with 4 DIMMs per socket, supporting PCI-Express Gen 2 x16
- Four 6-core 2.1 GHz 55 W ADP AMD Istanbul CPUs, 32 GB/node
- 10 blades in a 7U chassis with a 36-port QDR IB switch and new, efficient power supplies; CMM (Chassis Management Module) not in prototype nodes
- 2 TF/chassis, 12 TF/rack, 30 kW (6 x 4.8)
- 180 nodes, 4320 cores, full-bisection QDR IB interconnect
- Network: QDR InfiniBand 2-level fat tree; leaf-level 36-port switches built into the chassis plus five external 36-port switches

The Prototype HPL Efficiencies in Perspective

Power efficiency:
- Dual-socket Intel Nehalem 2.53 GHz: 240 MF/W
- Above + GPU: 270 MF/W
- PRACE/SNIC/KTH prototype: 344 MF/W (unoptimized)
- IBM BG/P: 357-372 MF/W

Fraction of peak:
- Dual-socket Intel Nehalem 2.53 GHz: 91%
- Above + GPU: 53%
- PRACE/SNIC/KTH prototype: 79%
- IBM BG/P: 83%

Comparison with Best Proprietary System (HPL energy efficiency)
- Xeon HPTN, dual socket: 240 MF/W
- Xeon HPTN, dual socket + ClearSpeed: 326 MF/W
- Xeon HPTN, dual socket + GPU: 270 MF/W
- SNIC/KTH prototype: 344 MF/W
- BG/P: 357 MF/W

Exascale Computing Technology Challenges, John Shalf, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, ScicomP / SP-XXL 16, San Francisco, May 12, 2010

The Good News
- Moore's Law still works and is expected to work throughout the decade.
- The number of transistors per die (processor) on average has increased slightly faster than predicted by Moore's law, for several reasons: slightly increased die sizes (on average), a changing balance of transistors used for logic and memory, and evolving device technology.

Doubling about every 21 months on average! (not every 24-27 months, as Moore's law is often quoted)
Source: Wikipedia, "Transistor Count and Moore's Law - 2011.svg"
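The claimed doubling period can be recomputed from any two transistor counts; the example rows (Six-Core Core i7, 2010, and the 15-core Ivy Bridge, 2013) are taken from the table on the next slide:

```python
import math

def doubling_months(n0, n1, months_elapsed):
    """Doubling period implied by growth from n0 to n1 transistors."""
    return months_elapsed / math.log2(n1 / n0)

# Six-Core Core i7 (1.17B, 2010) to 15-core Ivy Bridge (4.3B, 2013):
print(round(doubling_months(1_170_000_000, 4_300_000_000, 36)))  # ~19 months
```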

Moore's Law at Work

Chip | Transistors | Year | Vendor | Process | Die size
Dual-Core Itanium 2 | | | Intel | 90 nm | 596 mm²
POWER6 | | | IBM | 65 nm | 341 mm²
Six-Core Opteron | | | AMD | 45 nm | 346 mm²
RV870 (GPU) | 2,154,000,000 | 2009 | AMD | 40 nm | 334 mm²
16-Core SPARC T3 | 1,000,000,000 | 2010 | Sun/Oracle | 40 nm | 377 mm²
Six-Core Core i7 | 1,170,000,000 | 2010 | Intel | 32 nm | 240 mm²
8-Core POWER7 | 1,200,000,000 | 2010 | IBM | 45 nm | 567 mm²
4-Core Itanium Tukwila | 2,000,000,000 | 2010 | Intel | 65 nm | 699 mm²
8-Core Xeon Nehalem-EX | 2,300,000,000 | 2010 | Intel | 45 nm | 684 mm²
Cayman (GPU) | | | AMD | 40 nm | 389 mm²
GF100 (GPU) | | | nVidia | 40 nm | 529 mm²
AMD Interlagos (16C) | | | AMD | 32 nm | 315 mm²
AMD GCN Tahiti (GPU) | | | AMD | 28 nm | 365 mm²
10-Core Xeon Westmere-EX | | | Intel | 32 nm | 512 mm²
8-Core Itanium Poulson | | | Intel | 32 nm | 544 mm²
Sandy Bridge, 8C | | | Intel | 32 nm | 435 mm²
Ivy Bridge, 4C + GPU | | | Intel | 22 nm | 160 mm²
Ivy Bridge | | | Intel | 22 nm | 257 mm²
Ivy Bridge, 15C | 4,300,000,000 | 2013 | Intel | 22 nm | 541 mm²
Virtex-7 (FPGA) | 3,500,000,000 | 2013 | Xilinx | 28 nm | 550 mm² (?)
Nvidia Kepler GK110 (GPU) | 7,100,000,000 | 2013 | nVidia | 28 nm | 551 mm²
Xeon Phi, 62C | 5,000,000,000 | 2013 | Intel | 22 nm |

Transistor count doubles about every 21 months on average (not every 24-27 months!)
Source: http://en.wikipedia.org/wiki/Transistor_count

The Good News (recap): Moore's Law still works, with transistors per die growing slightly faster than predicted.
Next: The Bad News

Bad News 1
Source: Mudge, slides (URL truncated)

Bad News 2: Dennard Scaling Works No More!
Dennard (constant-field) scaling: reducing the critical dimensions while keeping the electrical field constant yields higher speed and reduced power consumption of a digital MOS circuit.
Dennard, R.H., et al., "Design of ion-implanted MOSFETs with very small physical dimensions", IEEE J. Solid-State Circuits, vol. 9, p. 256 (1974)
Using the typical scaling of 0.7x in device dimensions per generation thus results in a doubling of transistors in a given area, with the power consumption remaining constant and the speed increasing by 1.4x!!
Dennard scaling ended in 2005!
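The 0.7x numbers follow directly from constant-field scaling; a quick sketch of the bookkeeping:

```python
# Bookkeeping for constant-field (Dennard) scaling with a 0.7x shrink:
s = 0.7
density_gain = 1 / s**2                    # ~2x transistors in the same area
speed_gain = 1 / s                         # ~1.4x frequency (delay scales with s)
# Dynamic power per transistor ~ C*V^2*f; C scales by s, V^2 by s^2, f by 1/s:
power_per_transistor = s * s**2 * (1 / s)  # = s^2 ~ 0.49x
power_per_area = density_gain * power_per_transistor   # ~1.0: power stays flat
```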

The End of Dennard Scaling
- Dennard scaling did not take subthreshold leakage into account.
- By 2005 subthreshold leakage had increased more than 10,000-fold, and further reduction in the threshold voltage Vt was not feasible, limiting operating-voltage reduction.
- Further, gate-oxide thickness scaling had reached a point of 5 atomic layers; further reduction was not possible, and direct tunneling current was becoming a noticeable part of total chip power.
Graph from E.J. Nowak, IBM Journal of Research and Development, Mar/May 2002, vol. 46, no. 2/3
Mark Bohr, "A 30-year perspective on Dennard's MOSFET Scaling Paper", http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4785534

The End of Dennard Scaling
Single-thread performance improvement is slow (SPECint). "Intel has done a little better over this period, increasing at 21% per year."
Source: Greg Astfalk
Source: Andrew Chien

Pollack's Rule (Fred Pollack, Intel)
Performance increase at the same frequency is about sqrt(area), not proportional to the number of transistors.
Source: S. Borkar, A. Chien, "The Future of Microprocessors"
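Pollack's rule as a one-liner; doubling the transistor budget of a single core buys only about a 41% speedup:

```python
import math

def pollack_speedup(transistor_ratio):
    # Pollack's rule: single-core performance grows ~sqrt(area ~ transistors)
    return math.sqrt(transistor_ratio)

print(round(pollack_speedup(2), 2))      # 1.41: 2x transistors, ~41% speedup
print(round(pollack_speedup(2) / 2, 2))  # 0.71: performance per transistor drops
```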

Post Dennard Scaling
Source: Bill Dally, HiPEAC Keynote 2015

Post Dennard Scaling (cont.)
Source: Bill Dally, HiPEAC Keynote 2015

The Future
"The conventional path of scaling planar CMOS will face significant challenges set by performance and power consumption requirements. Driven by the 2x increase in transistor count per generation, power management is now the primary issue across most application segments. Power management challenges need to be addressed across multiple levels. The implementation challenges of these approaches expand upwards into system design requirements."
2013 Roadmap

The Future
"Future systems are energy limited: efficiency is performance."
"Process matters less: 1.2x per generation."
Bill Dally, NVIDIA

Retrospective: Microarchitecture Benefits
Change normalized from one generation to the next: 1000 nm, 700 nm, 350 nm, 180-90 nm, 45 nm.
Source: S. Borkar, A. Chien, "The Future of Microprocessors"
Source: A. Chien, Salishan 2010

Data source: Dosanjh presentation (URL truncated)

Processor Energy History
A retrospective: for several chip generations designers used the increased transistor capability to introduce micro-architectural features enhancing single-thread performance, often at the expense of energy efficiency.

Product | Normalized Performance | Normalized Power | EPI on 65 nm at 1.33 V (nJ)
i486 | 1.0 | 1.0 | 10
Pentium | 2.0 | 2.7 | 14
Pentium Pro | 3.6 | 9 | 24
Pentium 4 (Willamette) | 6.0 | 23 | 38
Pentium 4 (Cedarmill) | 7.9 | 38 | 48
Pentium M (Dothan) | 5.4 | 7 | 15
Core Duo (Yonah) | 7.7 | 8 | 11

Source: Ed Grochowski, Murali Annavaram, "Energy per Instruction Trends in Intel Microprocessors"
Source: Shekhar Borkar, Intel: "We are on the wrong side of a square law."
Pollack, F. (1999). "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies." Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, Haifa, Israel.

The Square Law
For CMOS the relationship between power (P), voltage (V), and frequency (f) is

    P = c1·V²·f + c2·V + c3 + O(V⁴)

where the terms are, respectively, dynamic power, leakage, and board/fans. Furthermore, f ∝ C·(V − V0).

Fitted models (measured):
    Linpack: P = 15·f·(V − 0.2)² + 45·V + 19
    STREAM:  P = 5·f·(V − 0.2)² + 50·V + 19
Source: Supermicro
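The fitted models can be evaluated directly; units are assumed to be GHz, volts, and watts (the slide does not state them), and the operating point below is illustrative only:

```python
# The slide's fitted power models; units assumed (not stated): f in GHz, V in volts.
def linpack_power(f, v):
    # dynamic (c1*V^2*f-like term) + leakage (c2*V) + board/fans (c3)
    return 15 * f * (v - 0.2)**2 + 45 * v + 19

def stream_power(f, v):
    return 5 * f * (v - 0.2)**2 + 50 * v + 19

print(linpack_power(2.0, 1.0))   # 83.2 W at an illustrative 2 GHz, 1.0 V point
```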

Energy Efficiency of Different Design Points
Source: Adrian Caulfield, reconfigurable-fabric presentation, 2014 (URL truncated)

Our 2nd Prototype
Very approximate estimates (source: Andrew Chien):

Device | Cores | W | GF/W
ARM Cortex-A9 | 4 | 2 | 0.5
Atom | 2 | 2 | 0.5
AMD 12-core | 12 | 115 | 0.9
Intel 6-core | 6 | 130 | 0.6
ATI 9370 | 1600 | 225 | 2.3
nVidia Fermi | 512 | 225 | 2.2
TI TMS320C6678 | 8 | 10 | 4/6
IBM BQC | 16 | 55 | 3.7
ClearSpeed CX700 | 192 | 10 | 10

KTH/SNIC/PRACE Prototype II

Nominal Compute Density and Energy Efficiency
[Table comparing, for "current" and "next generation" feature sizes, GF/J and GF/mm² of: AMD Interlagos (16C, 2.7 GHz), AMD FirePro 7900, AMD S9000, AMD Brazos, AMD Llano, AMD Trinity, IBM Blue Gene/Q, IBM Power7, Intel Sandy Bridge (8C, 3.1 GHz), Intel Ivy Bridge (4C, 3.5 GHz), Nvidia Fermi, Nvidia Kepler 20x, Nvidia Tegra 2, Nvidia Tegra 3, TI TMS320C6678, TI 66AK2Hx, Xilinx Virtex-6, and Xilinx Virtex-7. Numeric values garbled in transcription.]
Source: L. Johnsson, G. Netzer, Report on Prototype Evaluation, D9.3.3, PRACE, http://www.prace-ri.eu/IMG/pdf/d9.3.3.pdf

TI KeyStone Device Architecture
[Block diagram: 1 to 8 C66x CorePac DSP cores @ up to 1.25 GHz, each with L1P and L1D cache/RAM and L2 cache/RAM; memory subsystem with MSM SRAM, MSMC, and 64-bit DDR3 EMIF; Multicore Navigator; Network Coprocessor (packet accelerator, security accelerator); TeraNet switch fabric; HyperLink bus; EDMA; external interfaces (SRIO, SGMII Ethernet switch, PCIe, SPI, UART, I2C, GPIO, application-specific I/O); debug & trace, boot ROM, semaphore, power management, PLLs.]


Instrumentation for Power Measurements
The 6678 has 16 power domains. Each core has its own power domain; the MSM, HyperLink, SRIO, PCIe, peripheral logic, packet engine, and trace logic also have their own power domains. There is no voltage control.

Power rails:
- Variable (0.9 - 1.1 V): all SoC logic
- Fixed (1.0 V): on-chip memories; HyperLink, PCIe, SGMII, and SRIO SerDes; other common fabric
- Memory (1.5 V): DDR3 memory and I/O
- Other (12 V): DC input to the EVM (board)

Measurement error budget: shunt resistor (0.1% + 20 ppm/K over ΔT = 20 K), front-end offset, gain, passband-flatness, and stopband-rejection errors, and ADC offset and gain errors, giving a total voltage error of about 1400 ppm, a total current error of about 2300 ppm, and a total power error of about 9500 ppm; marker jitter contributes 4 ppm over a 1000 s run, for a total energy error of about 9500 ppm.

Measurement Setup

Results
- STREAM: 96% of peak, 1.47 GB/J @ 1 GHz
- HPL: 77% efficiency, 2.6 GF/W @ 1.25 GHz; 95% DGEMM efficiency
- FFT design:
  - 5.2 GF/s @ 1 GHz (of 6 GF/s peak) and 5.6 GB/s for a single core, 512-point L1 DP FFT (bandwidth limited)
  - 20 GF/s @ 1 GHz on 8 cores for a 256k-point L2/MSM DP FFT; bandwidth limited (max on-chip)
  - 10.3 GF/s @ 1 GHz for a 128M-point FFT (DDR3)

The STREAM Benchmark
- Invented 1990 by John D. McCalpin for assessing (memory) performance on (long) vectors
- Correlates with SPEC-FP performance [McCalpin SC02]
- Operations: Copy, Scale, Add, Triad
Source: Peter M. Kogge, IPDPS 2014
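For reference, the four STREAM kernels in plain (deliberately naive) Python; the real benchmark runs them over arrays much larger than any cache and reports sustained bandwidth:

```python
# The four STREAM kernels, deliberately naive; q is the scalar.
def copy(a, b):         # a[i] = b[i]
    for i in range(len(a)): a[i] = b[i]

def scale(a, b, q):     # a[i] = q * b[i]
    for i in range(len(a)): a[i] = q * b[i]

def add(a, b, c):       # a[i] = b[i] + c[i]
    for i in range(len(a)): a[i] = b[i] + c[i]

def triad(a, b, c, q):  # a[i] = b[i] + q * c[i]
    for i in range(len(a)): a[i] = b[i] + q * c[i]
```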

6678 STREAM Results in Perspective

Platform | Peak BW (GB/s) | Meas. (GB/s) | Eff. % | Energy Eff. (GB/J) | Lith. (nm) | Comment
IBM Power7 | 409.6 | 122.8 | 30.0 | 0.10 | 45 | STREAM power est. at 50% of TDP
Intel Xeon Phi | 352.0 | 174.8 | 49.7 | 1.20 | 22 | STREAM power est. at 50% of TDP
Intel E5-2697v2 | 119.4 | 101.5 | 85.0 | 0.30 | 22 | STREAM power est. at 50% of TDP
NVIDIA Tegra3 | 6.0 | 1.6 | 26.7 | 0.05 | 40 | Power and BW measured (BSC)
6678 Cache | 10.7 | 3.0 | 28.1 | 0.40 | 40 | Power and BW measured at 1 GHz
6678 EDMA | 10.7 | 10.2 | 95.7 | 1.26 | 40 | Power and BW measured at 1 GHz

IBM Power7 data from "IBM Power 750 and 760 Technical Overview and Introduction", IBM, May 2013, and W.J. Starke, "POWER7: IBM's next generation, balanced POWER server chip", HotChips 21, 2009.
Intel Xeon Phi data from R. Krishnaiyer, E. Kultursay, P. Chawla, S. Preis, A. Zvezdin, and H. Saito, "Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for the Intel Xeon Phi Coprocessor", 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, New York, NY, IEEE, May 2013, pp. 1575-1586.
Intel E5-2697v2 data from "Performance and Power Efficiency of Dell PowerEdge Servers with E5-2600v2", Dell, October 2013.
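How a "power estimated at 50% of TDP" row converts to GB/J, sketched for the Xeon Phi line; the 300 W TDP is an assumption (a common rating for that part), not a value from the table:

```python
# Converting a "power estimated at 50% of TDP" row into GB/J.
# ASSUMPTION: 300 W TDP for the Xeon Phi (a common rating, not from the table).
measured_gbs = 174.8
est_power_w = 0.5 * 300
print(round(measured_gbs / est_power_w, 2))   # ~1.17 GB/J (table lists 1.20)
```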

STREAM Challenges and Solutions
- STREAM operations have no data reuse:
  - worst case for caches, which only merge accesses into cache blocks;
  - write-allocate policies increase memory traffic;
  - evicted data shares address LSBs with new data, leading to DRAM memory-bank conflicts.
- It is possible to implement a fast STREAM using:
  - prefetching to hide long latencies;
  - bypassing the cache on writes to prevent extra traffic.
- Making the data transport explicit in the program:
  - can use block-copy engines;
  - can schedule accesses;
  - can utilize regularity in access patterns.

Understanding and Optimizing STREAM on the 6678


The 6678 Core Memory
The 6678 has a 16 B, 4-entry write buffer for bypassing the L1 on cache misses to avoid stalls, if the write buffer is not full. The buffer drain rate is:
- 16 B at CPU/2 rate to L2 SRAM
- 16 B at CPU/6 rate to L2 cache
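The drain rates translate into bandwidth at the 1 GHz core clock used in the measurements:

```python
# Write-buffer drain bandwidth at a 1 GHz core clock:
clock_hz = 1e9
to_l2_sram = 16 * clock_hz / 2 / 1e9    # 16 B every 2 cycles -> 8.0 GB/s
to_l2_cache = 16 * clock_hz / 6 / 1e9   # 16 B every 6 cycles -> ~2.67 GB/s
print(to_l2_sram, round(to_l2_cache, 2))
```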

Avoiding Stalls and Hiding Latency Using the Enhanced DMA (EDMA3)
- Three-dimensional transfers
- Three channel controllers (CCx): CC0 has 2 transfer controllers (TCs); CC1 and CC2 each have 4 TCs
- Support for chaining (completion of a transfer triggers a subsequent transfer) and linking (automatic loading of transfer specifications)
- Uses TeraNet 2 (CPU clock/2) and TeraNet 3 (CPU clock/3)
- CC0 used in our EDMA3 STREAM implementation
Bandwidth-limiting path: 16/3 B per CPU cycle

The DDR Challenge: Avoid Page Faults
Measured memory latency at a 1 GHz core clock, for strides from 64 B to 16 KiB:
- L1 SRAM: 5 ns
- L2 SRAM: 12 ns
- MSM: 15-28 ns
L2 cache line: 128 B. Prefetch distance: 256 B. DDR3 page size: 8 KiB; DDR3 and DDR4 devices have 8 banks (4 banks in some organizations), so large strides cause repeated page openings and closings.
This is a serious problem for multi-core CPUs!
See e.g. "Collective Memory Transfers for Multi-Core Chips", G. Michelogiannakis, A. Williams, and J. Shalf, Lawrence Berkeley National Lab, November 2013.

EDMA3 Strategy
- Prefetch to L1 directly from DDR3: eliminates stalls and hides latency
- Store from L1 directly to DDR3: eliminates stalls, hides latency, and avoids write-allocate reads
- Use at least two buffers per core to enable overlap of core execution-unit operations with data transfers (due to various overheads, more than two buffers may be beneficial). If the core execution time exceeds the transfer time, the core-EDMA3 interface will not be fully utilized even with multiple buffers.
- Use the EDMA3 to make core operations entirely local (no inter-core communication required for scheduling, synchronization, etc.)
- Use the EDMA3 to coordinate core accesses to DDR3 to maximize bandwidth utilization by minimizing DDR3 page openings and closings
- Order DDR3 reads and writes to minimize read/write and write/read transitions
- Use two transfer controllers to maximize memory-channel utilization (overlap TC overheads)
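A toy timing model (not TI code; the overlap formula is a deliberate simplification) shows why at least two buffers per core pay off:

```python
# Toy timing model for multi-buffering: with two buffers, the EDMA3 transfer
# of chunk i+1 overlaps the core's compute on chunk i.
def single_buffered(n, t_compute, t_transfer):
    return n * (t_compute + t_transfer)

def double_buffered(n, t_compute, t_transfer):
    # first fill cannot be hidden; thereafter the slower phase dominates (approx.)
    return t_transfer + n * max(t_compute, t_transfer)

print(single_buffered(10, 2, 3))   # 50 time units
print(double_buffered(10, 2, 3))   # 33 time units
```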

Scheduling EDMA3 Transfers
- Clustering of loads and stores across cores reduces read/write and write/read transitions
- Independent transfer controllers are kept in lock-step using cross-triggering

STREAM Power and Energy Efficiency, EDMA3, 2 TCs, 3 or 2 Buffers/Core
[Charts: performance (GB/s) and energy efficiency (GB/J) of the SoC at 1 GHz with 4 x 2Gb DDR3-1333, and power (W), each plotted against the number of threads (1-8); idle cores powered off.]

DDR3 Memory Power and Energy
[Charts: DDR3 power (W, 0-3) and energy per byte (pJ/B, 0-400) for 4 x 2Gb DDR3-1333 at 1.5 V, plotted against achieved bandwidth from 0 to 10.2 GB/s.]

Lessons Learned from Optimizing STREAM
- The 40 nm 6678 DSP can offer energy efficiency comparable to 22 nm x86 CPUs with complex hardware prefetch and hardware support for non-temporal stores (bypassing write-allocate mechanisms).
- Explicit data management (using the EDMA3) is necessary to implement:
  - effective prefetching
  - storing to DDR3
  - coordinated core accesses to DDR3
  - multi-buffering to overlap core and transfer operations
  - overlapping transfer-controller overheads by using multiple (two) TCs
- The EDMA3 linking and chaining features are needed to achieve close to 100% memory-channel utilization.
- Optimization at the level carried out for our STREAM implementation requires detailed and deep knowledge of the architecture, its features, and its limitations, but for highly structured applications encapsulating this complexity into higher-level functions or libraries seems feasible (though it was not attempted within the time available).

The Future: Key Limits for Chip Design
Andrew Chien, VP of Research, Intel, Salishan 2010

Dark Silicon
- Chip power will remain capped at current levels (because of cooling costs and known cooling technologies).
- Moore's law enables more transistors per unit area, but post-Dennard-scaling power per unit area increases.
- Thus dark silicon, i.e., not all areas can be simultaneously used, becomes a necessity.
- On-die dynamic power management (voltage, frequency, clock, power) will be a necessity. For maximum performance and energy efficiency this may propagate up to the application level.
Remark 1: New technology has made it possible for voltage control to move onto the die, enabling increasingly refined control of subdomains of a die.
Remark 2: Recent processor chips have a dedicated processor for power control, including control of voltage, frequency, power, and clock distribution to subdomains of the chip.
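A toy post-Dennard model makes the dark-silicon arithmetic concrete: density still doubles each generation, but with voltage nearly flat, per-transistor switching power falls only with capacitance (~0.7x), so power per unit area grows ~1.4x per generation and a fixed power cap lights an ever-smaller fraction of the die. The scaling factors here are illustrative assumptions, not measured values:

```python
# Toy dark-silicon model: density x2/generation, per-transistor power only x0.7
# (capacitance shrink, voltage ~flat), so power/area grows x1.4 per generation.
def lit_fraction(generations, s=0.7):
    # fraction of the die a fixed power budget can keep switching
    return 1 / (2 * s) ** generations

print(round(1 - lit_fraction(4), 2))   # ~0.74: ~74% "dark" after 4 generations
```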

Power Challenges
"Power Challenges May End the Multicore Era", Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, Doug Burger
http://dl.acm.org/citation.cfm?id=2408797
https://www.youtube.com/watch?v=Df8SQ8ojEAQ

"The good news is that the old designs are really inefficient, leaving lots of room for innovation."
Bill Dally, Nvidia/Stanford, NYT, July 31, 2011

An LBL View: Simpler (Smaller) Cores
Exascale Computing Technology Challenges, John Shalf, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, ScicomP / SP-XXL 16, San Francisco, May 12, 2010

"We used to focus graduate students on generating ideas of how to improve performance by adding features; now we need to focus them on how to improve performance by removing features."
Bill Dally, A Computer Architecture Workshop: Visions for the Future, September 19, 2014

Customization Power/Performance Examples
Note: lowest power consumption on top.
Power and performance of accelerators varying in degree of customization (scaled to 45 nm). Customization yields up to 1,000-10,000x vs. mainstream processors such as Sandy Bridge.
Source: Apala Guha et al., "Systematic Evaluation of Workload Clustering for Designing 10x10 Architectures", ACM SIGARCH Computer Architecture News, vol. 41, no. 2, pp. 22-29, 2013, http://dl.acm.org/citation.cfm?id=2490307

Heterogeneous Architectures: Imagine the Impact
"TI's KeyStone II-based SoCs, which integrate fixed- and floating-point DSP cores with multiple ARM Cortex-A15 MPCore processors, packet and security processing, and high-speed interconnect."
HP Project Moonshot is dedicated to designing extreme low-energy server technologies.
We are pursuing HPC cartridges with HP and TI.
Source: S. Borkar, A. Chien, "The Future of Microprocessors"

Heterogeneous Data Centers (source URL truncated)

An Aside: Big vs. Small
January 06, 2011. On Wednesday, the GPU maker, and soon-to-be CPU maker, revealed its plans to build heterogeneous processors which will encompass high-performance ARM CPU cores alongside GPU cores. The strategy parallels AMD's Fusion architectural approach that marries x86 CPUs with ATI GPUs on-chip.

Source: A. Chien, Salishan 2010

10x10 Assessment: Architecture
7 micro-engines (not 10 in the current assessment):
- One RISC core, MIPS-like instruction set, 5-stage pipeline
- Six specialized micro-engines
Source: A. Chien et al., "10x10: A Case Study in Federated Heterogeneous Architecture", Dept. of Computer Science, U. Chicago, January 2015

Relative Core Energy: Floating-Point vs. Integer (32 nm and 7 nm)
FFT-LdSt shows the impact of the FFT micro-engine; FFT-DLT shows the impact of the FFT and DLT micro-engines.
Conclusion: at 7 nm, energy consumption for computation is practically independent of integer vs. floating-point format.
Source: T. Thanh-Hoang et al., "Does Arithmetic Logic Dominate Data Movement? A Systematic Comparison of Energy-Efficiency for FFT Accelerators", TR-2015-01, U. Chicago, March 2015

Relative System Energy: 16-bit vs. 32-bit (32 nm and 7 nm)
Source: T. Thanh-Hoang et al., "Does Arithmetic Logic Dominate Data Movement? A Systematic Comparison of Energy-Efficiency for FFT Accelerators", TR-2015-01, U. Chicago, March 2015

The Hybrid Memory Cube
Source: J. Thomas Pawlowski, Micron, HotChips 23, August 17-19, 2011
Source: Todd Farrell (URL truncated)

Processor/Platform Power Management

Source: David Perlmutter, Intel (presentation, URL truncated)

Intel Sandy Bridge CPU
Source: "Key Nehalem Choices", Glenn Hinton (slides, URL truncated)
Source: Efi Rotem, Alon Naveh, Doron Rajwan, Avinash Ananthakrishnan, Eli Weissmann, HotChips 23, August 17-19, 2011

Software
Source: S. Borkar, A. Chien, "The Future of Microprocessors"

Energy and Software
Matrix multiplication: programmer productivity vs. (energy) efficiency
Source: Jim Larus, HiPEAC 2015

Software
- Explicit memory management: no cache coherence
- Explicit power management (thread/task allocation, voltage and frequency control)
- Complex programming environment: from VHDL (FPGA) to system level, libraries, etc.

Programming Heterogeneous Computer Systems Is Very Complex
- ACPI, IPMI, BMC, UEFI, ...
- Explicit power management (thread/task allocation, voltage and frequency control)
- DMA, assembly, ...
- Explicit memory management: no cache coherence
- Linux, Embedded Linux, ...
- Core: C, Embedded C, Fortran, SIMD, ...
- Multicore: OpenMP, Pthreads, TBB, Cilk, ...
- Heterogeneous: OpenCL, OpenACC, HSA, BOLT, ...
- FPGA: Verilog, VHDL
- Clusters: MPI, UPC, Titanium, X10, Co-Array Fortran, Global Arrays, ...
Source: lecture, U. Chicago, January 22, 2015 (URL truncated)

Thank You!

