CPU / GPU TECHNOLOGIES NOW AND FUTURE


André Heidekrüger
Sr. Technical Consultant, Presales EMEA
HPC Advisory Council Lugano

A CASE FOR SERVER FUSION - EXASCALE
- Current trajectory puts traditional x86 computing at just over 20 PFLOPS by 2018
- A data center that could achieve an exaflop in 2018 using only x86 processors would consume over 3TW
- To achieve exascale capability by 2018, x86 performance would need to increase by 2x each year, starting in 2010
[Figure: performance projection on a logarithmic axis from 1.00E+15 to 1.00E+19 FLOPS, with levels marked at 20PF, 50PF, 100PF, 500PF and EXAFLOP. Annotations: "Homogeneous x86 compute hits a wall", "@30% uplift", "Exascale Target", "Heterogeneous compute required to bridge the gap".]
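The growth figures on this slide can be checked with a little compound-growth arithmetic. The sketch below is only an illustration: the name `growth_factor` is hypothetical, and reading "@30% uplift" as a 30% improvement per year is an assumption, not something the slide states.

```python
def growth_factor(rate_per_year: float, years: int) -> float:
    """Total multiplier after compounding `rate_per_year` for `years` years."""
    return rate_per_year ** years


YEARS = 2018 - 2010                      # yearly steps between the two dates on the slide
doubling = growth_factor(2.0, YEARS)     # "increase by 2x each year"
uplift_30 = growth_factor(1.3, YEARS)    # assumed reading of "@30% uplift"

EXAFLOP = 1.0e18                         # FLOPS
X86_2018 = 20.0e15                       # "just over 20 PFLOPS by 2018", rounded down
gap = EXAFLOP / X86_2018                 # shortfall of the x86-only trajectory

print(f"2x per year over {YEARS} years: {doubling:.0f}x")
print(f"30% per year over {YEARS} years: {uplift_30:.2f}x")
print(f"Gap between 20 PFLOPS and 1 EFLOPS: {gap:.0f}x")
```

Eight doublings give a 256x improvement, while eight years at 30% give only about 8x, which is the gap the slide attributes to heterogeneous compute.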

HIGH EFFICIENCY LINPACK IMPLEMENTATION ON AMD MAGNY COURS + AMD 5870 GPU
- GPU DPFP Peak: 544 GFLOPS
- GPU DGEMM kernel: 87% of Peak
- 2.5 GFLOPS/W
- Node DPFP Peak: 745.6 GFLOPS
- Linpack efficiency: 75.5% of Peak
- Linpack scaling across 4 nodes: 70% of Peak
[Figure: bar chart of System GFLOPS (axis 0 to 2500) for DGEMM/node, Linpack/node and Linpack/4 nodes.]
HPL code: http://code.compeng.uni-frankfurt.de/
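The percentages on this slide translate into sustained GFLOPS as sketched below. This is a back-of-the-envelope check rather than measured data: it assumes that "70% of Peak" for four nodes refers to four times the single-node peak, and that the node peak is the GPU peak plus the CPU contribution.

```python
GPU_DP_PEAK = 544.0    # GFLOPS, GPU double-precision peak (from the slide)
NODE_DP_PEAK = 745.6   # GFLOPS, node double-precision peak (from the slide)
NODES = 4

dgemm_gflops = 0.87 * GPU_DP_PEAK               # DGEMM kernel at 87% of GPU peak
linpack_node = 0.755 * NODE_DP_PEAK             # Linpack at 75.5% of node peak
linpack_4_nodes = 0.70 * NODES * NODE_DP_PEAK   # assumed: 70% of 4x node peak
cpu_share = NODE_DP_PEAK - GPU_DP_PEAK          # assumed: remainder comes from the CPUs

print(f"DGEMM per GPU:      {dgemm_gflops:.1f} GFLOPS")
print(f"Linpack per node:   {linpack_node:.1f} GFLOPS")
print(f"Linpack on 4 nodes: {linpack_4_nodes:.1f} GFLOPS")
print(f"CPU share of peak:  {cpu_share:.1f} GFLOPS")
```

The resulting values (roughly 473, 563 and 2088 GFLOPS) fall within the 0 to 2500 range of the chart axis.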

AMD PROCESSOR POWER AND PERFORMANCE OVER TIME
INCREASING PERFORMANCE-PER-WATT EFFICIENCIES
- Advanced Platform Management Link
- AMD CoolSpeed Technology
- C1E power-state
- DDR3 LV Memory
- AMD PowerCap Manager
- AMD SmartFetch Technology
- AMD CoolCore Technology on L3
- AMD CoolCore Technology
- Dual Dynamic Power Management
- Independent Dynamic Core Engagement
- AMD PowerNow! Technology
- Integrated Memory Controller
SPEC, SPECint, and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. The comparison presented above is based on the best performing two-socket servers using the specified processor model. For the latest SPECint_rate2006 and SPECfp_rate2006 results, visit [URL missing]. [...] performance based on internal AMD estimates.

TECHNOLOGY COMPARISONS
Compared to how the current architecture handles two threads, “Bulldozer” can deliver more performance in a smaller die space.

SHARING RESOURCES
HELPING TO MAXIMIZE POWER EFFICIENCY AND COST
The “Bulldozer” module has shared and dedicated components.
“Bulldozer” dynamically switches between shared and dedicated components to maximize performance per watt.
The shared components:
- Help reduce power consumption
- Help reduce die space (cost)
The dedicated components:
- Help increase performance and scalability
[Figure: “Bulldozer” module block diagram. Legend: Dedicated Components, Shared at the module level, Shared at the chip level. Blocks: Fetch, Decode, Int Scheduler and four Pipelines for Core 1, Int Scheduler and four Pipelines for Core 2, FP Scheduler with two 128-bit FMAC units, two L1 DCaches, Shared L2 Cache, Shared L3 Cache and NB.]

CORE MICROARCHITECTURE – SHARED FPU
- Co-processor organization
- Reports completion back to parent core
- Dual 128-bit FMAC pipes
- Dual 128-bit packed integer pipes
- PRF-based register renaming
- Unified scheduler (for both threads)
[Figure: module block diagram. Front end: ICache, L1 BTB, L2 BTB, Prediction Queue, Fetch Queue, Ucode ROM, 4 x86 Decoders. Per core: Int Scheduler, AGen, AGen, EX/DIV, EX/MUL, Instr Retire, Ld/ST Unit, L1 DTLB, L1 DCache. Shared FPU: FP Scheduler, two 128-bit FMAC pipes, two MMX pipes, FP Ld Buffer. Also shown: Data Prefetcher, Shared L2 Cache.]

THREAD CONTROL AND SELECTION MECHANISMS
Each core is a logical processor from the viewpoint of software.
[Figure: thread domains across the pipeline. Mode labels: Vertical MT, Single Thread, SMT / thread agnostic. Block labels: Req Q, Pred Q, BP, Core, FP frontend, SC Q, FP Execution, Ret Qs, FP backend, L2/CU.]

POWER EFFICIENCY AND APM
- Start with inherently power-efficient micro-architecture and implementation:
  - Dynamic sharing of shared resources
  - Minimize data movement
  - Extensive clock and power gating
- Add active management support:
  - Digitally measure activity to estimate power
  - Hardware uses higher frequency when power limit allows
- Support for chip-level core power gating
[Figure: "Power consumption varies greatly by workload". Vertical axis: Percent of TDP, 0% to 90%. Annotation: "Power headroom that can be utilized per core."*]
* Based on internal AMD modeling using benchmark simulations

BUILDING A “BULLDOZER” PROCESSOR
Server:
- “Interlagos”: 16 cores (2 dies)
- “Valencia”: 8 cores (1 die)
Client:
- “Zambezi”: 8 cores (1 die)
Each processor die is composed of multiple “Bulldozer” modules.
Module divisions are transparent to shared hardware, operating system or application.
The modular architecture speeds chip development and increases product flexibility.
[Figure: die layout with “Bulldozer” modules, Shared L3 Cache, Memory Controller and NB/HT Links.]

NEW “BULLDOZER” INSTRUCTIONS
Legend: Require a recompile / Require no recompile

Instruction: Description
- SSSE3: Supplemental Streaming SIMD Extensions 3 (SSSE3) is a SIMD instruction set. It contains 16 discrete instructions; because each can act on 64-bit MMX or 128-bit XMM registers, it represents a total of 32 instructions.
- SSE 4.1: A set of 47 instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.
- SSE 4.2: An additional 7 instructions that are incremental to SSE 4.1, including 4 very powerful and generic string compare operations.
- AES and PCLMULQDQ: The Advanced Encryption Standard (AES) Instruction Set is an extension to the x86 instruction set architecture. It helps improve the speed of applications performing encryption and decryption using the Advanced Encryption Standard (AES).
- AVX: The size of the SIMD vector registers is increased from 128-bit XMM registers to 256-bit registers called YMM0 - YMM15. Existing 128-bit instructions use the lower half of the YMM registers. The AVX instruction set allows all two-operand XMM instructions to be modified into non-destructive three-operand forms where the destination register is different from both source registers.
- FMA4: The FMA instruction set is an extension to the 128-bit and 256-bit SIMD instructions in the x86 microprocessor instruction set to perform fused multiply-add operations.
- XOP: XOP makes the binary coding of new instructions more compatible with Intel's AVX instruction extensions, while the functionality of the instructions is unchanged.
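What "fused" means in a fused multiply-add can be illustrated in software: the product and the sum are formed exactly and rounded only once, whereas a separate multiply and add round twice. The sketch below emulates that behaviour with exact rational arithmetic. It does not execute FMA4 instructions, and the function names `fused_multiply_add` and `multiply_then_add` are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: a*b + c computed exactly, rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)   # single rounding to the nearest double


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = multiply_then_add(a, b, c)
print(fused, unfused)
```

With these inputs the unfused form loses the result entirely, while the single-rounding form keeps the small residual.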

“BULLDOZER”: FLEX FP OPERATING MODES

Mode                        Legacy   AVX (Core 1)   AVX (Core 2)
Core 1 single precision     4        8              0
Core 1 double precision     2        4              0
Core 2 single precision     4        0              8
Core 2 double precision     2        0              4
FLOPs/Cycle (16 cores)      64       64             64
Recompiled app?             No       Yes            Yes
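The "FLOPs/Cycle (16 cores)" row can be reproduced from the per-core rows. The sketch below assumes the table is read as three modes (legacy on both cores, AVX on core 1 only, AVX on core 2 only), that the totals are single-precision figures, and that a 16-core part consists of eight two-core modules; the names used are illustrative only.

```python
# (core 1, core 2) single-precision operations per cycle within one module
MODES = {
    "Legacy": (4, 4),
    "AVX on core 1": (8, 0),
    "AVX on core 2": (0, 8),
}

CORES = 16
MODULES = CORES // 2   # assumed: two cores share one "Bulldozer" module


def flops_per_cycle(per_module: tuple) -> int:
    """Chip-wide single-precision operations per cycle for one mode."""
    return sum(per_module) * MODULES


totals = {mode: flops_per_cycle(ops) for mode, ops in MODES.items()}
for mode, total in totals.items():
    print(f"{mode}: {total} FLOPs/cycle")
```

Every mode comes to the same chip-wide figure because the shared FPU offers a fixed width per module; the modes differ only in how that width is split between the two cores.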

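GENERATIONAL COMPARISONS
Cores
  AMD Opteron 4100/6100 Series Processors: 4100: 4 or 6 core; 6100: 8 or 12 core
  “Valencia” / “Interlagos”: 4200: 6 or 8 core; 6200: 8, 12 or 16 core
Cache (L2 per core / L3 per die)
  AMD Opteron 4100/6100 Series Processors: 512KB / 6MB
  “Valencia” / “Interlagos”: 2MB (shared between 2 cores) / 8MB
Memory Channels and speed
  AMD Opteron 4100/6100 Series Processors: 4100: two; 6100: four; up to 1333MHz
  “Valencia” / “Interlagos”: 4200: two; 6200: four; up to 1600MHz
Floating point capability
  AMD Opteron 4100/6100 Series Processors: 128-bit FPU per core (FADD/FMUL)
  “Valencia” / “Interlagos”: 128-bit dedicated FMAC per core or 256-bit AVX shared between 2 cores
Integer Issues Per Cycle
  AMD Opteron 4100/6100 Series Processors: 3
  “Valencia” / “Interlagos”: 4
Turbo CORE Technology
  AMD Opteron 4100/6100 Series Processors: No
  “Valencia” / “Interlagos”: Yes (+500MHz with all cores active)
Power (ACP)
  AMD Opteron 4100/6100 Series Processors: 65W, 80W, 105W
  “Valencia” / “Interlagos”: TBD (planned 65W, 80W, 105W)
New Instruction Sets
  AMD Opteron 4100/6100 Series Processors: (none listed)
  “Valencia” / “Interlagos”: SSSE3, SSE 4.1/4.2, AVX, AES, FMA4, XOP, PCLMULQDQ
Power Gating
  AMD Opteron 4100/6100 Series Processors: AMD CoolCore, C1E
  “Valencia” / “Interlagos”: AMD CoolCore, C1E, C6
Process / Die Size
  AMD Opteron 4100/6100 Series Processors: 45nm SOI
  “Valencia” / “Interlagos”: 32nm SOI (smaller overall die size)
Performance
  “Valencia” / “Interlagos”: Expected up to 50% higher throughput
The above reflects current expectations regarding features and performance and is subject to change.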

AMD AND GPU COMPUTING: PIONEERING INNOVATION
[Figure: timeline from 2006 to 2010. Milestones shown:]
- First Development Platform: Stream Computing Development Platform, CTM SDK
- First with Double Precision: industry's first double precision GPU, AMD FireStream 9170
- First to 1 TFLOPS, first single slot: AMD FireStream 9250
- FireStream 9270
- Industry Standard API: OpenCL announced
- First OpenCL CPU: ATI Stream SDK
- First on Top 500's Top Ten: Tianhe-1, Top500 #5
- Second Generation Single Slot: FireStream 9350
- FireStream 9370

AMD FIRESTREAM GPU ACCELERATORS
SOLUTIONS OPTIMIZED FOR PERFORMANCE, POWER AND DENSITY
Maximum Performance
- Fastest memory technology
- Large memory
Deployable Performance
- Highest performance per watt and density
- Low power
- Low profile
- Optimal price/performance
AMD FireStream 9370
- 2.64 TFLOPS
- 528 GFLOPS DPFP
- 4GB GDDR5
- 225W
- Passive heat sink
AMD FireStream 9350
- 2.0 TFLOPS
- 400 GFLOPS DPFP
- 2GB GDDR5
- Single slot, 150W
- Passive heat sink
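The two boards can be compared on double-precision performance per watt using only the figures listed above. The sketch below is plain arithmetic on those numbers; it assumes the wattage figures are comparable board power ratings, and the helper names are hypothetical.

```python
BOARDS = {
    "AMD FireStream 9370": {"sp_tflops": 2.64, "dp_gflops": 528.0, "watts": 225.0},
    "AMD FireStream 9350": {"sp_tflops": 2.0, "dp_gflops": 400.0, "watts": 150.0},
}


def dp_gflops_per_watt(board: dict) -> float:
    """Double-precision GFLOPS delivered per watt of rated board power."""
    return board["dp_gflops"] / board["watts"]


def dp_to_sp_ratio(board: dict) -> float:
    """Double-precision peak as a fraction of single-precision peak."""
    return board["dp_gflops"] / (board["sp_tflops"] * 1000.0)


for name, board in BOARDS.items():
    print(f"{name}: {dp_gflops_per_watt(board):.2f} DP GFLOPS/W, "
          f"DP/SP ratio {dp_to_sp_ratio(board):.2f}")
```

On these figures the 9350 delivers more double-precision throughput per watt, which matches its "Deployable Performance" positioning, and both boards quote double precision at one fifth of the single-precision peak.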

ATI RADEON HD 5870 (“CYPRESS”)
- 1600 Stream Processors
- 20 SIMD engines
- 2.72 TFLOPs SP
- 544 GFLOPs DP

OPENCL DEVICE EXAMPLE
ATI Radeon HD 5870 GPU
- 1 Compute Unit contains 16 Stream Cores
- 1 Stream Core = 5 Processing Elements
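The device hierarchy on this slide ties back to the Radeon HD 5870 headline figures quoted earlier (1600 stream processors, 20 SIMD engines, 2.72 TFLOPs SP, 544 GFLOPs DP). The sketch below makes three assumptions that are not on the slides: each SIMD engine is treated as one OpenCL compute unit, each processing element performs one multiply-add (counted as 2 FLOPs) per clock, and the engine clock is 850 MHz.

```python
SIMD_ENGINES = 20                  # treated here as OpenCL compute units (assumption)
STREAM_CORES_PER_COMPUTE_UNIT = 16
PES_PER_STREAM_CORE = 5

CLOCK_GHZ = 0.850                  # assumption: engine clock is not given on the slides
FLOPS_PER_PE_PER_CLOCK = 2         # assumption: one multiply-add counted as 2 FLOPs

processing_elements = (
    SIMD_ENGINES * STREAM_CORES_PER_COMPUTE_UNIT * PES_PER_STREAM_CORE
)
sp_peak_gflops = processing_elements * FLOPS_PER_PE_PER_CLOCK * CLOCK_GHZ

DP_PEAK_GFLOPS = 544.0             # from the Radeon HD 5870 slide
dp_to_sp = DP_PEAK_GFLOPS / sp_peak_gflops

print(f"Processing elements: {processing_elements}")
print(f"Single-precision peak: {sp_peak_gflops:.0f} GFLOPS")
print(f"DP/SP ratio: {dp_to_sp:.2f}")
```

Under those assumptions the hierarchy accounts for all 1600 stream processors, and the quoted double-precision peak is one fifth of the single-precision peak.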

AMD RADEON HD 6900 SERIES
- Dual graphics engines
- New VLIW4 core architecture
- More SIMD engines and texture units
- Over 5 Gbps
- 1536 Stream Processors
- 2.7 TFLOPs SP
- 683 GFLOPs DP

NEW CORE DESIGN
- VLIW4 thread processors
  - 4-way co-issue
  - All stream processing units now have equal capabilities (no more “T-unit”)
- Special functions (transcendentals) occupy 3 of 4 issue slots
- Allow better utilization than previous VLIW5 design
  - Similar performance with 10% area reduction
  - Simplified scheduling and register management
  - Extensive logic re-use

Stream Processing Units
FP ops per clock:
- 4 32-bit MAD
- 2 64-bit MUL or ADD
- 1 64-bit MAD or FMA
- 1 Special Function
Integer ops per clock:
- 4 24-bit MUL, ADD or MAD
- 2 32-bit ADD
- 1 32-bit MUL

GPU COMPUTE ENHANCEMENTS
- Asynchronous dispatch
  - Execute multiple compute kernels simultaneously
  - Each kernel has its own command queue and protected virtual address domain
- Dual bidirectional DMA engines for faster system memory reads & writes
- Coalescing of shader read ops
- Fetch direct to LDS
- Improved flow control
- Faster double precision ops (1/4 SP rate)
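The "1/4 SP rate" figure can be cross-checked against the Radeon HD 6900 numbers quoted earlier (1536 stream processors, 2.7 TFLOPs SP, 683 GFLOPs DP). The sketch below is arithmetic only; it assumes each stream processor performs one multiply-add (2 FLOPs) per clock, and the clock it derives is an implied value, not a figure from the slides.

```python
STREAM_PROCESSORS = 1536
SP_TFLOPS_QUOTED = 2.7        # rounded single-precision peak from the slide
DP_GFLOPS_QUOTED = 683.0      # double-precision peak from the slide
DP_RATE = 1 / 4               # "1/4 SP rate"
FLOPS_PER_SP_PER_CLOCK = 2    # assumption: one multiply-add counted as 2 FLOPs

# Double precision derived from the rounded single-precision figure.
dp_from_rounded_sp = SP_TFLOPS_QUOTED * 1000.0 * DP_RATE

# Working backwards from the quoted double-precision figure instead.
sp_implied_gflops = DP_GFLOPS_QUOTED / DP_RATE
implied_clock_ghz = sp_implied_gflops / (STREAM_PROCESSORS * FLOPS_PER_SP_PER_CLOCK)

print(f"DP from rounded SP peak: {dp_from_rounded_sp:.0f} GFLOPS")
print(f"SP peak implied by DP:   {sp_implied_gflops:.0f} GFLOPS")
print(f"Implied engine clock:    {implied_clock_ghz:.3f} GHz")
```

The small difference between 675 and 683 GFLOPS is consistent with the single-precision figure having been rounded to 2.7 TFLOPs.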

UPCOMING POWER CONTAINMENT FEATURE DRIVES GPU PERFORMANCE EFFICIENCY
- Clamps GPU TDP to a pre-determined level
- Integrated power control processor monitors power draw every clock cycle
  - Dynamically adjusts clocks for various blocks to enforce TDP
- No longer need to constrain clock speeds to allow for outlier applications
- User controllable via AMD OverDrive Utility
Note: See slide 34 for additional information
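The clamping behaviour described above can be pictured with a toy model. The sketch below is not AMD's algorithm: it assumes estimated power scales linearly with an activity factor and with clock frequency, and every name and number in it (`contained_clock`, the 1000 MHz clock, the 200 W cap, the 0.2 W/MHz coefficient) is hypothetical.

```python
def estimated_power(activity: float, clock_mhz: float, watts_per_mhz: float) -> float:
    """Toy power estimate: linear in activity and clock (an assumption)."""
    return activity * clock_mhz * watts_per_mhz


def contained_clock(activity: float, max_clock_mhz: float,
                    tdp_watts: float, watts_per_mhz: float) -> float:
    """Return the clock that keeps the estimated power at or below the TDP cap."""
    power = estimated_power(activity, max_clock_mhz, watts_per_mhz)
    if power <= tdp_watts:
        return max_clock_mhz                      # headroom left: run at full clock
    return max_clock_mhz * tdp_watts / power      # scale the clock down to the cap


MAX_CLOCK_MHZ = 1000.0   # hypothetical
TDP_WATTS = 200.0        # hypothetical
WATTS_PER_MHZ = 0.2      # hypothetical

typical = contained_clock(0.8, MAX_CLOCK_MHZ, TDP_WATTS, WATTS_PER_MHZ)
outlier = contained_clock(1.25, MAX_CLOCK_MHZ, TDP_WATTS, WATTS_PER_MHZ)
print(f"Typical workload clock: {typical:.0f} MHz")
print(f"Outlier workload clock: {outlier:.0f} MHz")
```

In this model a typical workload keeps the full clock, and only the outlier workload is slowed, which is the point of not constraining clocks for every application in advance.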

POWER CONTAINMENT
[Figure: power versus time, theoretical example only. Unconstrained power: high peak power, fast completion. Scaled power (power containment): lower peak power, slower completion.]

NOW THE AMD FUSION ERA OF COMPUTING BEGINS
- APU: Fusion of CPU & GPU compute power within one processor
- High-bandwidth I/O

FUSION APU BASED PC

ATI STREAM SDK V2.3: OPENCL FOR MULTICORE X86 CPUS AND GPUS
The Power of AMD Fusion: Developers leverage heterogeneous architecture to enable superior user experience
- Complete OpenCL development platform
- Certified OpenCL 1.1 compliant by The Khronos Group [1]
- Write code that can scale well on multi-core CPUs and GPUs
AMD delivers on the promise of support for OpenCL, with both high-performance CPU and GPU technologies.
Available for download now: includes documentation, samples, profilers and developer support.
Product Page: [URL missing]
[Figure: workload types: Serial and Task Parallel Workloads, Graphics Workloads, Data Parallel Workloads.]
[1] Conformance logs submitted for the ATI Radeon HD 5800 Series GPUs, ATI Radeon HD 5700 Series GPUs, ATI Radeon HD 5600 Series GPUs, ATI Radeon HD 5400 Series GPUs, ATI Radeon HD 5500 Series GPUs, ATI Radeon HD 4800 Series GPUs, ATI Radeon 4600 Series GPUs, ATI FirePro V8800 Series GPUs, ATI FirePro V8700 Series GPUs, ATI FirePro V7800 Series GPUs, ATI FirePro V7700 Series GPUs, ATI FirePro V5800 Series GPUs, ATI FirePro V5700 Series GPUs, ATI FirePro V4800 Series GPUs, ATI FirePro V3800 Series GPUs, ATI FirePro V3700 Series GPUs, AMD FireStream 9200 Series GPU Compute Accelerators, ATI Mobility Radeon HD 5800 Series GPUs (Windows), ATI Mobility Radeon HD 5700 Series GPUs (Windows), ATI Mobility Radeon HD 5600 Series GPUs (Windows), ATI Mobility Radeon HD 5400 Series GPUs (Windows), ATI Mobility Radeon HD 4800 Series GPUs (Windows), ATI Mobility Radeon HD 4600 Series GPUs (Windows), ATI Radeon E4690 Discrete GPU (Windows), and x86 CPUs with SSE3.

GPU COMPUTE OFFLOAD – 3 PHASES
[Figure: three eras plotted against "Architecture Maturity & Programmer Accessibility", from Poor to Excellent.]
Proprietary Drivers Era (2002 - 2008): Graphics & proprietary driver-based APIs
- “Hacker” programmers
- Exploit early programmable “shader cores” in the GPU
- Make your program look like “graphics” to the GPU
- CUDA, Brook+, etc.
Industry Standard Drivers Era (2009 - 2011): OpenCL / DirectCompute driver-based APIs
- Expert programmers
- Good APIs for compute
- “C and C++ like”
- Multiple address spaces & explicit data movement
Architected Era (2012 - 2020): Fusion Architecture
- Mainstream programmers
- GPU is a first class member of the platform architecture
- Full C++ support
- Single unified & coherent address space

http://developer.amd.com/afds/pages/default.aspx

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Trademark Attribution
AMD, AMD Opteron, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
© 2010 Advanced Micro Devices, Inc. All rights reserved.

