Vector and SIMD Processors


Vector and SIMD Processors
Eric Welch & James Evans
Multiple Processor Systems
Spring 2013

Outline
• Introduction
• Traditional Vector Processors
  o History & Description
  o Advantages
  o Architectures
  o Components
  o Performance Optimizations
• Modern SIMD Processors
  o Introduction
  o Architectures
  o Use in signal and image processing

History of Vector Processors
• Early Work
  o Development started in the early 1960s at Westinghouse
  o Goal of the Solomon project was to substantially increase arithmetic performance by using many simple co-processors under the control of a single master CPU
  o Allowed a single algorithm to be applied to a large data set
• Supercomputers
  o Dominated supercomputer design through the 1970s into the 1990s
  o Cray platforms were the most notable vector supercomputers
    - Cray-1: introduced in 1976
    - Cray-2, Cray X-MP, Cray Y-MP
• Demise
  o In the late 1990s, conventional microprocessor designs improved drastically in price-to-performance, ending the dominance of vector supercomputers

Description of Vector Processors
• CPU that implements an instruction set operating on 1-D arrays, called vectors
• Vectors contain multiple data elements
• The number of data elements per vector is typically referred to as the vector length
• Both instructions and data are pipelined to reduce decoding time

[Figure: a scalar add (add r3, r1, r2) performs 1 operation on registers r1 and r2, while a vector add (add.vv v3, v1, v2) performs N operations across the vector length of v1 and v2]
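To make the scalar/vector contrast concrete, here is a minimal C sketch (the function and variable names are ours, not from the slides). The scalar loop issues one add, with its own fetch and decode, per element; a vector machine collapses the same work over one vector length into a single add.vv instruction.

    #include <stddef.h>

    /* Scalar version: one add instruction executed per element.
       A vector processor replaces the whole loop body, up to the
       vector length, with a single add.vv v3, v1, v2. */
    void vec_add(const double *a, const double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }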

Advantages of Vector Processors
• Require Lower Instruction Bandwidth
  o Reduced by fewer fetches and decodes
• Easier Addressing of Main Memory
  o Load/store units access memory with known patterns
• Elimination of Memory Wastage
  o Unlike cache access, every data element requested by the processor is actually used – no cache misses
  o Latency only occurs once per vector during pipelined loading
• Simplification of Control Hazards
  o Loop-related control hazards are eliminated
• Scalable Platform
  o Increase performance by using more hardware resources
• Reduced Code Size
  o A short, single instruction can describe N operations

Vector Processor Architectures
• Memory-to-Memory Architecture (Traditional)
  o For all vector operations, operands are fetched directly from main memory, then routed to the functional unit
  o Results are written back to main memory
  o Includes early vector machines through the mid-1980s: Advanced Scientific Computer (TI), Cyber 200 & ETA-10
  o Major reason for demise was the large startup time

Vector Processor Architectures (cont.)
• Register-to-Register Architecture (Modern)
  o All vector operations occur between vector registers
  o If necessary, operands are fetched from main memory into a set of vector registers (by the load-store unit)
  o Includes all vector machines since the late 1980s: Convex, Cray, Fujitsu, Hitachi, NEC
  o SIMD processors are based on this architecture

Components of Vector Processors
• Vector Registers
  o Typically 8-32 vector registers with 64-128 64-bit elements
  o Each contains a vector of double-precision numbers
  o Register size determines the maximum vector length
  o Each includes at least 2 read ports and 1 write port
• Vector Functional Units (FUs)
  o Fully pipelined, starting a new operation every cycle
  o Perform arithmetic and logic operations
  o Typically 4-8 different units
• Vector Load-Store Units (LSUs)
  o Move vectors between memory and registers
• Scalar Registers
  o Hold single elements for interconnecting FUs, LSUs, and registers

Performance Optimizations
• Increase Memory Bandwidth
  o Memory banks are used to reduce load/store latency
  o Allow multiple simultaneous outstanding memory requests
• Strip Mining
  o Generates code to handle vector operands whose size is less than or greater than the size of the vector registers (see the sketch below)
• Vector Chaining
  o Equivalent to data forwarding in vector processors
  o Results of one pipeline are fed into the operand registers of another pipeline
• Scatter and Gather
  o Retrieves data elements scattered throughout memory and packs them into sequential vectors in vector registers
  o Promotes data locality and reduces data pollution
• Multiple Parallel Lanes, or Pipes
  o Allows a vector operation to be performed in parallel on multiple elements of the vector
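As a concrete illustration of strip mining, here is a minimal C sketch, assuming a machine maximum vector length of 64 elements (MVL and the function name are our inventions, not from the slides). A vector compiler generates this structure automatically; the inner loop stands in for one vector instruction whose length is set through the vector length register.

    #include <stddef.h>

    #define MVL 64  /* assumed maximum vector length of the machine */

    /* Strip mining: process an arbitrary-length vector in chunks of
       at most MVL elements, the most one vector register can hold. */
    void add_strip_mined(const double *a, const double *b,
                         double *c, size_t n) {
        for (size_t i = 0; i < n; i += MVL) {
            size_t len = (n - i < MVL) ? (n - i) : MVL; /* set VLR */
            for (size_t j = 0; j < len; j++)            /* one vector op */
                c[i + j] = a[i + j] + b[i + j];
        }
    }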

Vector Chaining Example

Organization of Cray Supercomputer

Performance of Cray Supercomputers

Modern SIMD Introduction
• Single Instruction Multiple Data is part of Flynn's taxonomy (not MIMD as discussed in class)
• Performs the same instruction on multiple data points concurrently
• Takes advantage of data-level parallelism within an algorithm
• Commonly used in image and signal processing applications
  o Large numbers of samples or pixels calculated with the same instruction
• Disadvantages:
  o Larger registers and functional units use more chip area and power
  o Difficult to parallelize some algorithms (Amdahl's Law; see the formula below)
  o Parallelization requires explicit instructions from the programmer
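The Amdahl's Law limit mentioned above can be made precise (the formula is the standard one, not taken from the slides): if a fraction p of the execution time is vectorizable and that part runs s times faster, the overall speedup is

    \text{Speedup} = \frac{1}{(1 - p) + p/s}

so even with a 4-wide SIMD unit (s = 4), a program that is 75% vectorizable gains only 1 / (0.25 + 0.75/4) ≈ 2.29 overall.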

SIMD Processor Performance Trends

[Figure: processor performance trends, with series for SIMD alone and for SIMD and MIMD combined]

Modern SIMD Processors
• Most modern CPUs have SIMD architectures
  o Intel SSE and MMX, ARM NEON, MIPS MDMX
• These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed
• Some architectures include separate SIMD coprocessors for handling these instructions
• ARM NEON
  o Included in Cortex-A8 and Cortex-A9 processors
• Intel SSE
  o Introduced in 1999 in the Pentium III processor
  o SSE4 currently used in the Core series

SIMD Processor Introduction
• A NEON example with 128-bit registers (four 32-bit elements each):

    VLD.32   Q1, 0x0     ; Q1 = [M[3], M[2], M[1], M[0]]
    VLD.32   Q2, 0x8     ; Q2 = [M[11], M[10], M[9], M[8]]
    VADD.U32 Q3, Q2, Q1  ; Q3 = [M[3]+M[11], M[2]+M[10], M[1]+M[9], M[0]+M[8]]
    VST.32   Q3, 0x10    ; 0x10 = M[0]+M[8], 0x11 = M[1]+M[9],
                         ; 0x12 = M[2]+M[10], 0x13 = M[3]+M[11]

• non-SIMD: 8 load instructions + 4 add instructions + 4 store instructions = 16 total instructions
• SIMD: 2 load instructions + 1 add instruction + 1 store instruction = 4 total instructions
• Possible speedup: 4
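The same four-element add can be written in C with NEON intrinsics rather than assembly; a minimal sketch (the function name add4 and the array m are ours, standing in for the slide's memory M[], with indices 0, 8, and 16 mirroring the addresses 0x0, 0x8, and 0x10):

    #include <arm_neon.h>
    #include <stdint.h>

    void add4(uint32_t *m) {
        uint32x4_t q1 = vld1q_u32(&m[0]);   /* VLD.32  Q1 <- M[0]..M[3]  */
        uint32x4_t q2 = vld1q_u32(&m[8]);   /* VLD.32  Q2 <- M[8]..M[11] */
        uint32x4_t q3 = vaddq_u32(q2, q1);  /* VADD.U32: 4 adds at once  */
        vst1q_u32(&m[16], q3);              /* VST.32: results to M[16]..M[19] */
    }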

ARM NEON SIMD Architecture
• 16 128-bit SIMD registers
• Separate sequential and SIMD processors
• Both have access to the same L2 cache but separate L1 caches
• Instructions are fetched in the ARM processor and sent to the NEON coprocessor

[Figure: ARM Cortex-A8 processor and NEON SIMD coprocessor]

Intel SSE SIMD Architecture
• Streaming SIMD Extensions
• 16 128-bit registers
• SIMD instructions executed along with sequential instructions
• Adds floating-point operations to Intel's MMX SIMD

[Figure: Intel Core architecture with combined SSE functional units]

Software Programming
• Intel and ARM both have vectorizing compilers which will compile code using SIMD instructions (see the example below)
• Many audio/video SIMD libraries are available
• To achieve the best performance, custom coding at the assembly level should be used
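As an example of code a vectorizing compiler handles well, here is a sketch (the loop and the flags are common illustrations, not from the slides): unit-stride accesses, no loop-carried dependences, and restrict-qualified pointers let a compiler such as gcc (with -O3 -ftree-vectorize, plus -mfpu=neon on a Cortex-A8) emit SIMD instructions for the loop body.

    /* A vectorization-friendly loop: restrict lets the compiler
       prove the arrays do not overlap, so it can process several
       floats per iteration with SIMD adds and multiplies. */
    void scale_add(float *restrict dst, const float *restrict src,
                   float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i] + dst[i];
    }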

Specialized Instructions
• NEON SIMD
  o VZIP – interleaves two vectors
  o VMLA – multiply and accumulate
  o VRECPE – reciprocal estimate
  o VRSQRTE – reciprocal square root estimate
• Intel SSE4
  o PAVG – vector average
  o DPPS, DPPD – dot product
  o PREFETCHT0 – prefetch data into all cache levels
  o MONITOR, MWAIT – used to synchronize across threads
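As a small illustration of the multiply-accumulate instruction listed above, the NEON intrinsic vmlaq_f32 maps to VMLA and computes a + b * c in each of four lanes (the wrapper name mac4 is ours):

    #include <arm_neon.h>

    /* One step of a four-lane multiply-accumulate, e.g. an FIR
       filter tap: acc += coeff * sample in each 32-bit lane. */
    float32x4_t mac4(float32x4_t acc, float32x4_t coeff,
                     float32x4_t sample) {
        return vmlaq_f32(acc, coeff, sample);  /* VMLA */
    }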

Performance Impact of SIMD
• NEON on Cortex-A8 with the gcc compiler
• Two vectorization methods:
  o Intrinsics – using intrinsic functions to vectorize
  o Manual – vectorizing using instructions at the assembly level
• Applied to an image warping algorithm
  o Maps a pixel from a source to a destination image by an offset
  o Calculated on four pixels in parallel (max speedup 4)

Speedup of the different SIMD programming methods (normalized to the original code):
  Original: 1.000 | Intrinsics: 2.195 | Manual: 3.090

Performance Impact of SIMD
• SSE on Intel i7 and AltiVec on IBM Power 7 processors
• SIMD applied to Media Bench II, which contains multimedia applications for encoding/decoding media files (JPEG, H263, MPEG2)
• Tested three compilers with three methods:
  o Auto Vectorization – no changes to code
  o Transformations – code changes to help the compiler vectorize
  o Intrinsics – functions which compile to SIMD

Average speedup (and percentage of parallelizable loops) for Media Bench II:

  Method             | XLC           | ICC           | GCC
  Auto Vectorization | 1.66 (52.94%) | 1.84 (71.77%) | 1.58 (44.71%)
  Transformations    | 2.97          | 2.38          | -
  Intrinsics         | 3.15          | 2.45          | -

Conclusions
• Vector processors provided the early foundation for processing large amounts of data in parallel
• Vector processing techniques can still be found in video game consoles and graphics accelerators
• SIMD extensions are a descendant of vector processors and are included in most modern processors
• Challenging programming and Amdahl's Law are the main factors limiting the performance of SIMD

Questions?
