Introduction To Vector Processing


Introduction to Vector Processing: Outline

- Motivation: Why Vector Processing?
  - Limits to Conventional Exploitation of ILP
- Flynn's 1972 Classification of Computer Architecture
- Data Parallelism and Architectures
- Vector Processing (Paper: VEC-1)
- Vectorizable Applications
- Loop Level Parallelism (LLP) Review (From 551)
- Vector vs. Single-Issue and Superscalar Processors
- Properties of Vector Processors/ISAs
- Vector MIPS (VMIPS) ISA
- Vector Memory Operations: Basic Addressing Modes
- Vectorizing Example: DAXPY
- Vector Execution Time Evaluation
- Vector Load/Store Units (LSUs) and Multi-Banked Memory
- Vector Loops (n > MVL): Strip Mining
- More on Vector Memory Addressing Modes: Vector Stride Memory Access
- Vector Operations Chaining (vector element data forwarding)
- Vector Conditional Execution & Gather-Scatter Operations
- Vector Example with Dependency: Vectorizing Matrix Multiplication
- Common Vector Performance Metrics & Examples
- Limited Vector Processing: SIMD/Vector or Multimedia Extensions to Scalar ISAs
- Summary: Vector Processing Advantages & Pitfalls
- Vector Processing & VLSI: Vector Intelligent RAM (VIRAM) (Papers: VEC-2, VEC-3)
  - Potential Advantages
  - VIRAM Architecture
  - Overview of VIRAM Prototype Generations: VIRAM-1, VIRAM-2

Motivation for Vector Processing: Problems with the Superscalar Approach

Limits to conventional exploitation of ILP:

1) Pipelined clock rate: increasing the clock rate requires deeper pipelines with longer pipeline latency, which increases the CPI (longer branch penalty, other hazards).
2) Instruction issue rate: limited instruction-level parallelism (ILP) reduces the actual instruction issue/completion rate (vertical and horizontal waste). SMT fixes this one?
3) Cache hit rate: data-intensive scientific programs have very large data working sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality (poor memory latency hiding).
4) Data parallelism: poor exploitation of the data parallelism present in many scientific and multimedia applications, where similar independent computations are performed on large arrays of data (limited ISA and hardware support).

As a result, actual achieved performance is much less than peak potential performance, and computational energy efficiency (computations/watt) is low.

X86 CPU Cache/Memory Performance Example: AMD Athlon T-Bird vs. Intel PIII vs. P4

- AMD Athlon T-Bird, 1 GHz: L1: 64K instruction, 64K data (3-cycle latency), both 2-way. L2: 256K, 16-way, 64-bit, latency 7 cycles. L1 and L2 on-chip.
- Intel P4, 1.5 GHz: L1: 8K instruction, 8K data (2-cycle latency), both 4-way, plus a 96KB execution trace cache. L2: 256K, 8-way, 256-bit, latency 7 cycles. L1 and L2 on-chip.
- Intel PIII, 1 GHz: L1: 16K instruction, 16K data (3-cycle latency), both 4-way. L2: 256K, 8-way, 256-bit, latency 7 cycles. L1 and L2 on-chip.

The example shows the impact of long memory latency when the data working set is larger than L2.

Source: http://www1.anandtech.com/showdoc.html?i=1360&p=15 (From 551)

Flynn's 1972 Classification of Computer Architecture

- Single Instruction stream over a Single Data stream (SISD): conventional sequential machines (superscalar, VLIW), i.e. single-threaded uniprocessors.
- Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements (exploit data parallelism). AKA data-parallel or data-streaming architectures.
- Multiple Instruction streams and a Single Data stream (MISD): systolic arrays for pipelined execution.
- Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers. Parallel processor systems exploit thread-level parallelism (TLP):
  - Shared-memory multiprocessors (e.g. SMP, CMP, NUMA, SMT).
  - Multicomputers: unshared, distributed memory; message passing used instead (clusters).

Instruction stream = hardware context or thread. (From 756 Lecture 1)

Data Parallel Systems (SIMD in Flynn's Taxonomy)

Programming model: data parallel.
- Operations are performed in parallel on each element of a data structure (i.e. elements of arrays or vectors).
- Logically, a single thread of control performs sequential or parallel steps.
- Conceptually, a processing element (PE) or processor is associated with each data element.

Architectural model (SIMD/array machine):
- An array of many simple, cheap processors (PEs), each with little memory, attached to a control processor that issues instructions; the PEs do not sequence through instructions themselves, the control processor does that.
- Specialized and general communication, cheap global synchronization.

Example machines:
- Thinking Machines CM-1, CM-2 (and CM-5)
- Maspar MP-1 and MP-2
- Current variations: IBM's Cell architecture, Graphics Processing Units (GPUs). Difference: the PEs are full processors optimized for data-parallel computations.

(From 756 Lecture 1)

Alternative Model: Vector Processing

Vector processing exploits data parallelism by performing the same computation on linear arrays of numbers ("vectors") using one instruction. The maximum number of elements in a vector supported by a vector ISA is referred to as the Maximum Vector Length (MVL). Typical MVL = 64-128; range 64-4096. Typical MVL = 64 (Cray).

A scalar ISA (RISC or CISC) performs one operation per scalar instruction:

    ADD.D F3, F1, F2        ; scalar add: one result

A vector ISA performs up to MVL operations per vector instruction (vector-vector instruction shown), operating on vector registers of up to MVL elements each:

    ADDV.D V3, V1, V2       ; vector add: up to MVL results (v = vector)

(Paper: VEC-1)

Vector (Vectorizable) Applications

Applications with a high degree of data parallelism (loop-level parallelism) are suitable for vector processing. Not limited to scientific computing:

- Astrophysics
- Atmospheric and ocean modeling
- Bioinformatics
- Biomolecular simulation: protein folding
- Computational chemistry
- Computational fluid dynamics
- Computational physics
- Computer vision and image understanding
- Data mining and data-intensive computing
- Engineering analysis (CAD/CAM)
- Global climate modeling and forecasting
- Material sciences
- Military applications
- Quantum chemistry
- VLSI design
- Multimedia processing (compression, graphics, audio synthesis, image processing):
  - Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
  - Lossy compression (JPEG, MPEG video and audio)
  - Lossless compression (zero removal, RLE, differencing, LZW)
  - Cryptography (RSA, DES/IDEA, SHA/MD5)
  - Speech and handwriting recognition
  - Operating systems/networking (memcpy, memset, parity, checksum)
  - Databases (hash/join, data mining, image/video serving)
  - Language run-time support (stdlib, garbage collection)

Data Parallelism & Loop Level Parallelism (LLP)

Data parallelism: similar independent/parallel computations on different elements of arrays, which usually result in independent (or parallel) loop iterations when such computations are implemented as sequential (scalar) programs. Usually: Data Parallelism = LLP.

A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop (e.g. exploit loop-level parallelism, LLP). One method covered earlier is to unroll the loop, either statically by the compiler or dynamically by hardware, which increases the size of the basic block. The resulting larger basic block provides more instructions that can be scheduled or re-ordered by the compiler/hardware to eliminate more stall cycles.

The following loop has parallel loop iterations, since the computations in each iteration are data-parallel and are performed on different elements of the arrays:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

In supercomputing applications, data parallelism/LLP has traditionally been exploited by vector ISAs/processors using vector instructions, which operate on a number of data items (vectors), producing a vector of elements rather than just a single result value. The loop above might require just four such vector instructions (assuming MVL >= 1000 is supported; otherwise a vector loop, i.e. strip mining, is needed, more on this later):

    LV    V1, Rx        ; load vector X
    LV    V2, Ry        ; load vector Y
    ADDV  V3, V1, V2    ; add vectors X + Y
    SV    Rx, V3        ; store vector X

(From 551)
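To make this concrete, here is a minimal C sketch (an added illustration, not from the original slides) of what a vectorizing compiler conceptually does with this loop when n exceeds MVL: it processes the arrays in chunks of at most MVL elements, each chunk corresponding to one LV/LV/ADDV/SV instruction group. The function and macro names are hypothetical.

    #include <stddef.h>

    #define MVL 64  /* maximum vector length assumed by this sketch */

    /* Conceptual model of the vectorized loop x[i] = x[i] + y[i]:
       each pass of the outer loop stands for one group of vector
       instructions (LV, LV, ADDV, SV) operating on up to MVL elements. */
    void vector_add(double *x, const double *y, size_t n)
    {
        for (size_t i = 0; i < n; i += MVL) {          /* one "vector loop" pass */
            size_t vl = (n - i < MVL) ? (n - i) : MVL; /* VLR: actual vector length */
            for (size_t j = 0; j < vl; j++)            /* done by one ADDV in hardware */
                x[i + j] = x[i + j] + y[i + j];
        }
    }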

Loop-Level Parallelism (LLP) Analysis

LLP analysis focuses on whether data accesses in later iterations of a loop are data-dependent on data values produced in earlier iterations, which determines whether loop iterations are independent (parallel). For example, in

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + s;

the computation in each iteration is independent of the previous iterations, and the loop is thus parallel. The use of x[i] twice is within a single iteration, so the loop iterations are parallel (independent from each other). (Dependency graph: the loop body S1 repeats in iterations 1, 2, 3, ..., 1000 with no dependence arcs between iterations. Usually: Data Parallelism = LLP in scalar code.)

Classification of data dependencies in loops:
- Loop-carried data dependence: a data dependence between different loop iterations (data produced in an earlier iteration is used in a later one).
- Not loop-carried data dependence: a data dependence within the same loop iteration.

LLP analysis is important in software optimizations such as loop unrolling, since unrolling usually requires loop iterations to be independent (and likewise in vector processing). LLP analysis is normally done at the source-code level, or close to it, since assembly language and target machine code generation introduce loop-carried name dependences in the registers used in the loop. Instruction-level parallelism (ILP) analysis, on the other hand, is usually done on the instructions generated by the compiler. (From 551)

LLP Analysis Example 1

In the loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

(where A, B, and C are distinct, non-overlapping arrays):

- S2 uses the value A[i+1] computed by S1 in the same iteration. This data dependence is within the same iteration (not a loop-carried dependence), so it does not prevent loop-iteration parallelism: S1 -> S2 on A[i+1] is not loop-carried.
- S1 uses a value computed by S1 in the earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1 (a loop-carried dependence, which prevents parallelism). The same applies to S2 for B[i] and B[i+1]:
  - S1 -> S1 on A[i]: loop-carried dependence
  - S2 -> S2 on B[i]: loop-carried dependence

(Dependency graph: within each iteration, S1 -> S2 on A[i+1]; from iteration i to i+1, S1 -> S1 on A[i+1] and S2 -> S2 on B[i+1].)

These two data dependencies are loop-carried, spanning more than one iteration (two adjacent iterations), preventing loop parallelism. In this example the loop-carried dependencies form two dependency chains starting from the very first iteration and ending at the last iteration. (From 551)

LLP Analysis Example 2

In the loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

- S1 uses the value B[i] computed by S2 in the previous iteration: S2 -> S1 on B[i] is a loop-carried dependence. (Dependency graph: S2 in iteration i -> S1 in iteration i+1 on B[i+1].)
- This dependence is not circular: S1 depends on S2, but S2 does not depend on S1.
- The loop can therefore be made parallel by replacing the code with the following:

    A[1] = A[1] + B[1];                 /* scalar code: loop start-up code */
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];           /* vectorizable code: parallel loop */
        A[i+1] = A[i+1] + B[i+1];       /* iterations */
    }
    B[101] = C[100] + D[100];           /* scalar code: loop completion code */

The transformed loop iterations are parallel: the data parallelism in the computation is exposed in the loop code. (From 551)
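As a sanity check, the following self-contained C program (an added illustration, not part of the original slides) runs both versions of the loop on the same data and confirms they produce identical results:

    #include <stdio.h>
    #include <string.h>

    #define N 100

    /* Check that the loop transformation in Example 2 preserves results. */
    int main(void)
    {
        double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];

        /* Arbitrary initial data (arrays indexed 1..N as in the slides). */
        for (int i = 0; i <= N + 1; i++) {
            A1[i] = A2[i] = i * 0.5;
            B1[i] = B2[i] = i * 0.25;
            C[i] = i * 2.0;
            D[i] = 3.0 - i;
        }

        /* Original loop (loop-carried dependence S2 -> S1 on B[i]). */
        for (int i = 1; i <= 100; i++) {
            A1[i] = A1[i] + B1[i];       /* S1 */
            B1[i + 1] = C[i] + D[i];     /* S2 */
        }

        /* Transformed loop: iterations are now independent (parallel). */
        A2[1] = A2[1] + B2[1];           /* start-up code */
        for (int i = 1; i <= 99; i++) {
            B2[i + 1] = C[i] + D[i];
            A2[i + 1] = A2[i + 1] + B2[i + 1];
        }
        B2[101] = C[100] + D[100];       /* completion code */

        printf("results %s\n",
               memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0
                   ? "match" : "differ");
        return 0;
    }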

LLP Analysis Example 2 (Iteration View)

Original loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

    Iteration 1:    A[1] = A[1] + B[1];        B[2]   = C[1] + D[1];
    Iteration 2:    A[2] = A[2] + B[2];        B[3]   = C[2] + D[2];
    ...
    Iteration 99:   A[99] = A[99] + B[99];     B[100] = C[99] + D[99];
    Iteration 100:  A[100] = A[100] + B[100];  B[101] = C[100] + D[100];

The loop-carried dependence runs from each iteration's S2 (producing B[i+1]) to the next iteration's S1 (consuming B[i+1]).

Modified parallel loop (one less iteration):

    A[1] = A[1] + B[1];                 /* loop start-up code */
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];           /* loop completion code */

    Iteration 1:   B[2] = C[1] + D[1];      A[2] = A[2] + B[2];
    ...
    Iteration 99:  B[100] = C[99] + D[99];  A[100] = A[100] + B[100];

After the transformation, the dependence on B[i+1] is within the same iteration (not loop-carried). (From 551)

Properties of Vector Processors/ISAs

- Each result (element) in a vector operation is independent of previous results (data parallelism/LLP exploited). Multiple pipelined functional units (lanes) are usually used; the vector compiler ensures there are no dependencies between computations on elements of a single vector instruction. This also allows a higher clock rate (less complexity).
- Vector instructions access memory with known patterns:
  - Highly interleaved memory with multiple banks is used to provide the high bandwidth needed and to hide memory latency.
  - Memory latency is amortized over many vector elements, i.e. the effective impact of memory latency is lower.
  - Usually no data caches are used (instruction caches are still used). Performance is thus more predictable.
- A single vector instruction implies a large number of computations, up to MVL (replacing loops or reducing the number of iterations needed):
  - Fewer instructions are fetched/executed, and fewer TLB look-ups are done, by a factor of up to MVL.
  - Branches and branch problems (control hazards) in pipelines are reduced. As if the loop were unrolled MVL times by default?

Vector vs. Single-Issue Scalar Processor

Single-issue scalar:
- One instruction fetch, decode, and dispatch per operation.
- Arbitrary register accesses add area and power.
- Loop unrolling and software pipelining for high performance increase the instruction-cache footprint.
- All data passes through the cache, wasting power if there is no temporal locality.
- One TLB lookup per load or store.
- Off-chip access in whole cache lines.

Vector:
- One instruction fetch, decode, and dispatch per vector (up to MVL elements).
- Structured register accesses.
- Smaller code for high performance; less power spent on instruction-cache misses.
- Bypasses the cache (for data).
- One TLB lookup per group of loads or stores.
- Moves only the necessary data across the chip boundary.

Vector vs. Superscalar Processors

Superscalar:
- Control logic grows quadratically with issue width.
- Control logic consumes energy regardless of available parallelism, giving low computational power efficiency (computations/watt).
- The dynamic nature makes real-time performance less predictable.
- Speculation to increase visible parallelism wastes energy and adds complexity.

Vector:
- Control logic grows linearly with issue width.
- The vector unit switches off when not in use, giving higher energy efficiency.
- More predictable real-time performance.
- Vector instructions expose data parallelism without speculation.
- Software control of speculation when desired, e.g. whether to use a vector mask or compress/expand for conditionals.

These differences are in addition to the "Vector vs. Single-Issue Scalar Processor" differences (from the last slide).

Changes to a Scalar Processor to Run Vector Instructions

A vector processor typically consists of: (1) a pipelined scalar unit plus (2) a vector unit.

The scalar unit is basically no different from advanced pipelined CPUs; commercial vector machines have included both out-of-order scalar units (NEC SX/5) and VLIW scalar units (Fujitsu VPP5000). Computations that don't run in vector mode don't have high ILP, so the scalar CPU can be kept simple.

The vector unit supports a vector ISA, including decoding of vector instructions, and comprises:
1. Vector functional units: multiple pipelined functional units (FUs), or lanes.
2. An ISA vector register bank: each register holds MVL elements.
3. Vector control registers, e.g. the Vector Length Register (VLR) and the Vector Mask (VM) register of MVL bits (e.g. 1 1 1 0 0 ... 0 1).
4. Vector memory load-store units (LSUs).
5. Multi-banked main memory, to provide the very high data bandwidth needed.
6. System component interconnects.

Scalar registers are sent to the vector unit (for vector-scalar operations), and results are synchronized back from the vector registers, including exceptions.

Basic Types of Vector Architectures/ISAs

Types of architecture/ISA for vector processors:
- Memory-memory vector ISAs/processors: all vector operations are memory to memory (no vector ISA registers).
- Vector-register ISAs/processors: all vector operations are between vector registers (except vector load and store); the vector equivalent of load-store scalar GPR architectures (ISAs). This includes all vector machines since the late 1980s (Cray, Convex, Fujitsu, Hitachi, NEC).

We assume vector-register architectures for the rest of the lecture.

Basic Structure of a Vector-Register Architecture

(Figure: pipelined vector functional units and vector load-store units (LSUs) connected through system interconnects to a bank of vector registers of MVL elements (64 bits each), vector control registers (VLR = Vector Length Register; VM = Vector Mask Register, MVL bits), and multi-banked memory for high bandwidth and latency hiding.)

Typical MVL = 64 (Cray); MVL range 64-4096 (4K). (Paper: VEC-1)

Components of a Vector Processor

1. Vector functional units (FUs): fully pipelined, starting a new operation every clock. Typically 4 to 8 FUs (or lanes): FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; there may be multiples of the same unit (multiple lanes of the same type). More on lanes later.
2. ISA vector register bank: a fixed-length bank holding the vector ISA registers, with at least 2 read ports and 1 write port. Typically 8-32 vector registers, each holding MVL = 64-128 elements (typical; up to 4K possible) of 64 bits each. ISA scalar registers: a single element for an FP scalar or an address.
3. Vector control registers: Vector Length Register (VLR), Vector Mask Register (VM).
4. Vector load-store units (LSUs): fully pipelined units to load or store a vector; there may be multiple LSUs.
5. Multi-banked memory: for high throughput (bandwidth) and long-latency hiding.
6. System interconnects: a crossbar to connect FUs, LSUs, registers, and memory.

(Paper: VEC-1)

Vector ISA Issues: How To Pick the Maximum Vector Length (MVL)?

Vector instruction processing time (or latency) = startup time (pipelined vector functional unit startup latency, i.e. fill cycles) + vector element computation time (one cycle per result vector element, up to MVL cycles).

A longer MVL is good because it:
1) Hides vector startup time.
2) Lowers instruction bandwidth: fewer instructions are fetched for a given computation.
3) Allows tiled access to memory, reducing scalar processor memory bandwidth needs.
4) Avoids strip-mining (vector loop) overhead when the known maximum vector length of the application is <= MVL.
5) Gives better spatial locality for memory access.

A longer MVL is not much help because:
1) There are diminishing returns on overhead savings as the number of elements keeps doubling.
2) The natural application vector length must match the physical vector register length (MVL), or the extra length is no help.

(Paper: VEC-1)

Media Processing: Vectorizable? Vector Lengths?

    Computational kernel                       Natural application vector length (MVL?)
    Matrix transpose/multiply                  # vertices at once
    DCT (video, communication)                 image width
    FFT (audio)                                256-1024
    Motion estimation (video)                  image width, iw/16
    Gamma correction (video)                   image width
    Haar transform (media mining)              image width
    Median filter (image processing)           image width
    Separable convolution (image processing)   image width

(from Pradeep Dubey - utor.html)

Vector Implementation

Vector register file:
- Each register is an array of MVL elements.
- The size of each register is determined by the maximum vector length (MVL) supported by the implemented vector ISA.
- Vector control registers: the Vector Length Register (VLR) determines the actual vector length used for a particular vector operation or instruction; the Vector Mask Register (VM, MVL bits) determines which elements of a vector will be computed.

Multiple parallel execution units, called "lanes" (sometimes called "pipelines" or "pipes"), of the same type:
- Multiple pipelined functional units (lanes) are each assigned a number of the computations of a single vector instruction.
- Processing time for a vector instruction in cycles:

      Vector Instruction Latency = Vector Startup Time + ceil(MVL / N)

  where N is the number of lanes supported by the vector processor.
- Thus, supporting multiple lanes in a vector processor reduces vector instruction latency by producing multiple elements of the result vector per cycle (after fill cycles). Having multiple lanes, however, does not reduce vector startup time (vector unit fill cycles).
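The latency formula is easy to evaluate; below is a small C helper (an added sketch, not from the original slides; the 6-cycle startup used in the example is the CRAY-1 vector-add penalty quoted later in this lecture):

    #include <stdio.h>

    /* Latency model from the slide:
       latency = startup + ceil(vl / lanes)
       where vl is the vector length (up to MVL) and lanes is the
       number of parallel pipelines of the needed FU type. */
    unsigned vector_latency(unsigned startup, unsigned vl, unsigned lanes)
    {
        return startup + (vl + lanes - 1) / lanes;  /* integer ceiling */
    }

    int main(void)
    {
        /* e.g. a vector add with 6 fill cycles over 64 elements: */
        printf("1 lane:  %u cycles\n", vector_latency(6, 64, 1));  /* 70 */
        printf("4 lanes: %u cycles\n", vector_latency(6, 64, 4));  /* 22 */
        return 0;
    }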

Structure of a Vector Unit Containing Four Lanes

The number of lanes in a vector unit (processor) is the number of vector functional units of the same type (e.g. ADDV, MULV) that are each assigned a number of the computations of the same vector instruction. (Figure: a vector unit with four lanes, each lane containing its own FU pipelines and a slice of the vector register file.) What about MVL lanes? (Paper: VEC-1)

Using Multiple Lanes (Vector Functional Units) to Improve the Performance of a Single Vector Add Instruction

- One lane: the machine (a) has a single add pipeline and can complete one addition per cycle. For vectors with nine elements (as shown), the time needed is 9 cycles + startup.
- Four lanes: the machine (b) has four add pipelines and can complete four additions per cycle. For vectors with nine elements, the time needed is 3 cycles + startup (ceil(9/4) = 3).

What about MVL lanes? That would be a data-parallel system, i.e. a SIMD array.

Example Vector-Register Architectures

(Table: characteristics of several example vector-register architectures, including clock rate, number of vector registers, MVL, vector FUs, and vector LSUs. Annotations on the original slide note a peak of 133 MFLOPS for one machine, the NEC SX vector processor family used in the Earth Simulator, and VMIPS = Vector MIPS.) (Paper: VEC-1)

The VMIPS Vector FP Instructions

VMIPS has 8 vector registers, V0-V7, with MVL = 64 (similar to Cray). Its instructions cover:
- Vector FP arithmetic.
- Vector memory access, with three addressing modes: (1) unit-stride access, (2) constant-stride access, and (3) variable-stride (indexed) access via a vector index.
- Vector mask (VM) and vector length (VLR) manipulation.

Vector control registers: VM = Vector Mask, VLR = Vector Length Register. (Paper: VEC-1)

Basic Vector Memory Access Addressing Modes

Load/store operations move groups of data between vector registers and memory. Stride = the distance in elements between consecutive vector elements loaded or stored by a vector memory access instruction. Three types of addressing:

1. Unit stride (sequential element access, i.e. stride = 1): the fastest memory access. LV (Load Vector), SV (Store Vector):
   - LV V1, R1: load vector register V1 from memory starting at address R1.
   - SV R1, V1: store vector register V1 into memory starting at address R1.
2. Non-unit (constant) stride: LVWS (Load Vector With Stride), SVWS (Store Vector With Stride):
   - LVWS V1, (R1, R2): load V1 from the address in R1 with the stride in R2, i.e. element i comes from R1 + i x R2 (i = element index, scaled by the element size).
   - SVWS (R1, R2), V1: store V1 to the address in R1 with the stride in R2, i.e. element i goes to R1 + i x R2.
3. Indexed (gather-scatter), or variable stride: the vector equivalent of register indirect. Good for sparse arrays of data; increases the number of programs that vectorize. LVI (Load Vector Indexed, or gather), SVI (Store Vector Indexed, or scatter):
   - LVI V1, (R1 + V2): load V1 with the vector whose elements are at R1 + V2(i), i.e. V2 is an index vector.
   - SVI (R1 + V2), V1: store V1 to the vector whose elements are at R1 + V2(i), i.e. V2 is an index vector.

(Paper: VEC-1)
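To make the indexed (gather-scatter) mode concrete, here is a minimal C sketch (an added illustration; the function and variable names are hypothetical) of the memory pattern that LVI and SVI implement for a sparse update a[k[i]] += b[i], which vectorizes as gather, vector add, scatter:

    #include <stddef.h>

    /* Gather (LVI):  v[i] = a[k[i]] for all i < vl.
       Scatter (SVI): a[k[i]] = v[i] for all i < vl.
       Assumes vl <= 64 (one vector register's worth, MVL = 64). */
    void sparse_update(double *a, const double *b,
                       const size_t *k, size_t vl)
    {
        double v[64];                       /* stands in for a vector register */

        for (size_t i = 0; i < vl; i++)     /* LVI: gather a[k[i]] */
            v[i] = a[k[i]];
        for (size_t i = 0; i < vl; i++)     /* ADDV: vector add */
            v[i] = v[i] + b[i];
        for (size_t i = 0; i < vl; i++)     /* SVI: scatter back */
            a[k[i]] = v[i];
    }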

Scalar vs. Vector Code Example: DAXPY (Y = a * X + Y)

Assuming vectors X and Y are of length 64 = MVL, so VLR = 64 and VM = (1,1,1,...,1).

Scalar MIPS code:

            L.D     F0, a          ; load scalar a
            DADDIU  R4, Rx, #512   ; last address to load
    loop:   L.D     F2, 0(Rx)      ; load X(i)
            MUL.D   F2, F0, F2     ; a * X(i)
            L.D     F4, 0(Ry)      ; load Y(i)
            ADD.D   F4, F2, F4     ; a*X(i) + Y(i)
            S.D     F4, 0(Ry)      ; store into Y(i)
            DADDIU  Rx, Rx, #8     ; increment index to X
            DADDIU  Ry, Ry, #8     ; increment index to Y
            DSUBU   R20, R4, Rx    ; compute bound
            BNEZ    R20, loop      ; check if done

VMIPS vector code:

            L.D      F0, a         ; load scalar a
            LV       V1, Rx        ; load vector X
            MULVS.D  V2, V1, F0    ; vector-scalar multiply
            LV       V3, Ry        ; load vector Y
            ADDV.D   V4, V2, V3    ; add
            SV       Ry, V4        ; store the result

It is as if the scalar loop code were unrolled MVL = 64 times: every vector instruction replaces 64 scalar instructions. (What does loop unrolling accomplish?)

Scalar vs. vector code:
- 578 (2 + 9*64) vs. 321 (1 + 5*64) operations (1.8x).
- 578 (2 + 9*64) vs. 6 instructions (96x).
- 64-operation vectors + no loop overhead; also 64x fewer pipeline hazards.
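For reference, the same DAXPY kernel written as a plain C loop (an added illustration); this is the loop that the scalar MIPS code implements and that a vectorizing compiler would map onto the VMIPS sequence above:

    /* DAXPY: Y = a*X + Y over n doubles
       (here n = 64 = MVL, matching the example above). */
    void daxpy(double a, const double *x, double *y, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }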

Vector Execution Time/Performance

Vector execution time (in seconds or cycles) is a function of vector length, data dependencies, and structural hazards.

- Initiation rate: the rate at which a vector functional unit consumes vector elements (= the number of lanes; usually 1 or 2 on the Cray T-90).
- Convoy: a set of vector instructions that can begin execution in approximately the same clock cycle (no structural or data hazards).
- Chime: the approximate time in cycles to produce one vector element result (usually = the number of convoys in the vector code).

Assuming one lane is used and ignoring startup time: m convoys take Tchime = m cycles per element (1 chime); if the vector length is n, they take approximately m x n clock cycles (ignores overhead; one lane; a good approximation for long vectors).

DAXPY example, assuming vector length n = MVL = 64 and ignoring vector startup time:

    Convoy 1: LV    V1, Rx        ; load vector X
    Convoy 2: MULV  V2, F0, V1    ; vector-scalar multiply
              LV    V3, Ry        ; load vector Y
    Convoy 3: ADDV  V4, V2, V3    ; add
    Convoy 4: SV    Ry, V4        ; store the result

4 convoys, 1 lane, VL = n = 64, so the time is 4 x 64 = 256 cycles (or m = 4 cycles per result vector element). (Paper: VEC-1)

DAXPY (Y = a * X + Y) Timing (One LSU, One Lane, No Vector Chaining, Ignoring Startup), n <= MVL

From the last slide: time in cycles = number of convoys x vector length = Tchime x n = m x n.

    Convoy 1 (cycles 1      to n):   LV    V1, Rx
    Convoy 2 (cycles n + 1  to 2n):  MULV  V2, F0, V1
                                     LV    V3, Ry
    Convoy 3 (cycles 2n + 1 to 3n):  ADDV  V4, V2, V3
    Convoy 4 (cycles 3n + 1 to 4n):  SV    Ry, V4

m = 4 convoys, or Tchime = 4 cycles per element; n elements take m x n = 4n cycles. For n = VL = MVL = 64 it takes 4 x 64 = 256 cycles. (n = vector length VL = number of elements in the vector; m, or Tchime, = number of convoys.) What if multiple lanes are used? (Paper: VEC-1)

Accounting For Vector FU Start-up Time

Start-up time: the pipeline latency (depth of the FU pipeline), i.e. the pipelined vector functional unit fill cycles; another source of overhead. It is the time to get the first result element.

Operation start-up penalties (from the CRAY-1):
- Vector load/store: 12 cycles
- Vector multiply: 7 cycles
- Vector add: 6 cycles

Assume convoys don't overlap (no vector chaining); vector length n <= MVL. Accounting for startup time (for one lane):

    Time in cycles = Total Startup + Number of Convoys x Vector Length = Total Startup + m x n

DAXPY convoy timing:

    Convoy        Start      First result      Last result
    1. LV         0          12                11 + n   (= 12 + n - 1)
    2. MULV, LV   12 + n     (12 + n) + 12     23 + 2n  (load start-up)
    3. ADDV       24 + 2n    (24 + 2n) + 6     29 + 3n  (waits for convoy 2)
    4. SV         30 + 3n    (30 + 3n) + 12    41 + 4n  (waits for convoy 3)

(4 convoys; Paper: VEC-1)

DAXPY (Y = a * X + Y) Timing (One LSU, One Lane, No Vector Chaining, Including Startup)

Time in cycles = Total Startup + Number of Convoys x Vector Length = Total Startup + m x n.

The start-up penalty is the time to get the first result element. Operation start-up penalties (from the CRAY-1): vector load/store 12, vector multiply 7, vector add 6.

    Convoy 1: LV   V1, Rx        ; 12 fill cycles, last result at cycle 11 + n
    Convoy 2: MULV V2, F0, V1    ; 7 fill cycles
              LV   V3, Ry        ; 12 fill cycles, last result at cycle 23 + 2n
    Convoy 3: ADDV V4, V2, V3    ; 6 fill cycles, last result at cycle 29 + 3n
    Convoy 4: SV   Ry, V4        ; 12 fill cycles, last result at cycle 41 + 4n

m = 4 convoys, or Tchime = 4 cycles per element; here the total startup time = 41 cycles, so n elements take 41 + 4n cycles. For n = VL = MVL = 64 it takes 41 + 4 x 64 = 297 cycles, vs. 256 cycles when ignoring startup time (as computed earlier). (n = vector length VL = number of elements in the vector; m, or Tchime, = number of convoys.) What if multiple lanes are used? (Paper: VEC-1)
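Both timing models reduce to one line of arithmetic; the following C sketch (an added illustration, assuming the slide's model of non-overlapping convoys, one lane, and one LSU) reproduces the two DAXPY numbers:

    #include <stdio.h>

    /* Timing model from the slides (one lane, one LSU, no chaining):
       ignoring startup:  cycles = m * n            (m convoys, n elements)
       including startup: cycles = total_startup + m * n */
    unsigned convoy_cycles(unsigned m, unsigned n, unsigned total_startup)
    {
        return total_startup + m * n;
    }

    int main(void)
    {
        unsigned n = 64;  /* vector length = MVL */

        /* DAXPY: m = 4 convoys; the slide derives a total startup of 41
           cycles from the CRAY-1 per-operation penalties of 12 (load/store),
           7 (multiply), and 6 (add) under its cycle-counting convention. */
        printf("ignoring startup:  %u cycles\n", convoy_cycles(4, n, 0));   /* 256 */
        printf("including startup: %u cycles\n", convoy_cycles(4, n, 41));  /* 297 */
        return 0;
    }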

Vector Load/Store Units (LSUs) & Memories

- Start-up overheads are usually longer for LSUs.
- The memory system must sustain (number of lanes x word) per clock cycle.
- Many vector processors use memory banks (vs. simple interleaving) to:
  1) support multiple loads/stores per cycle: multiple banks, with each bank addressed independently;
  2) support non-sequential accesses (non-unit stride).
- Note: the number of memory banks must be >= the memory latency (in clock cycles) to avoid stalls, i.e. to hide memory latency:
  - m banks deliver m words per memory latency of l clocks;
  - if m < l, there is a gap in the memory pipeline:

        clock:  0 ... l   l+1   l+2   ...   l+m-1   l+m ... 2l
        word:   --    0   1     2     ...   m-1     --      m

  - so a large number of memory banks may be needed; there may be 1024 banks in SRAM-based memory systems.

(Paper: VEC-1)

Vector Memory Requirements Example

The Cray T90 has a CPU clock cycle of 2.167 ns (460 MHz), and in its largest configuration (Cray T932) has 32 processors, each capable of generating four loads and two stores per CPU clock cycle (i.e. each processor has 6 LSUs). The CPU clock cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system is 15 ns. Calculate the minimum number of memory banks required to allow all CPUs to run at full memory bandwidth.

Answer: The maximum number of memory references each cycle is 192 (32 CPUs times 6 references per CPU). Each SRAM bank is busy for 15/2.167 = 6.92 clock cycles, which we round up to 7 CPU clock cycles. Therefore we require a minimum of 192 x 7 = 1344 memory banks.
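The bank calculation generalizes directly; here is a minimal C sketch (an added illustration with hypothetical names):

    #include <math.h>
    #include <stdio.h>

    /* Minimum number of banks so that all CPUs run at full memory
       bandwidth: each bank is busy ceil(bank_cycle / cpu_cycle) CPU
       cycles, and refs_per_cycle new references arrive every cycle. */
    unsigned min_banks(unsigned refs_per_cycle,
                       double bank_cycle_ns, double cpu_cycle_ns)
    {
        unsigned busy = (unsigned)ceil(bank_cycle_ns / cpu_cycle_ns);
        return refs_per_cycle * busy;
    }

    int main(void)
    {
        /* Cray T932: 32 CPUs x 6 refs/cycle = 192; 15 ns SRAM, 2.167 ns clock. */
        printf("minimum banks: %u\n", min_banks(32 * 6, 15.0, 2.167));  /* 1344 */
        return 0;
    }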
