An Introduction to Vectorization with the Intel Fortran Compiler


EXPLOIT CAPABILITIES WITHIN INTEL XEON PROCESSORS

An Introduction to Vectorization with the Intel Fortran Compiler

WHITE PAPER

Q: How do I take advantage of SSE and AVX instructions to speed up my code?

Introduction

This paper defines vectorization and introduces how developers using Fortran can take advantage of it. The reason to use vectorization is typically related to an interest in increasing application performance and creating more efficient application processing.

The paper introduces vectorization techniques that can be used by just about any application developer and uses the Intel Fortran Compiler to exemplify these uses. The first forms of vectorization presented in this paper are those that are the easiest to use. They require no changes to code. Next are libraries, followed by compiler options that offer advice to the programmer on steps to take to deliver vectorization. Additional topics are introduced that require more programmer intervention in source code and which offer the most programmer control, and frequently, a higher return in performance or efficiency.

Here are the vectorization topics mentioned in this paper:

- Auto-vectorization capabilities of the Intel Fortran Compiler
- Use of threaded and thread-safe libraries, such as Intel Math Kernel Library (Intel MKL)
- Use of special compiler build-log reports to guide source code changes and use of pragmas
- Guided Auto-Parallelism in the Intel Fortran Compiler
- SIMD compiler directive

Topics introduced in this paper apply to vectorizing code for IA-32, Intel 64 and the upcoming Intel MIC architectures. Thus, the vectorization you implement using the Intel Fortran Compiler will scale over systems using current and future Intel processors.

Reading materials are mentioned throughout the paper and are presented in a list at the end of the paper.

What is Vectorization?

In computer science, vectorization is the process of converting an algorithm from a scalar implementation, which does an operation one pair of operands at a time, to a vector process, where a single instruction can refer to a vector (series of adjacent values) [1]. In effect, it adds a form of parallelism to software in which one instruction or operation is applied to multiple pieces of data. When done on computing systems that support such actions, the benefit is more efficient processing and improved application performance. Many general-purpose microprocessors today feature multimedia extensions that support SIMD (single-instruction-multiple-data) parallelism. And when the hardware is coupled with Fortran compilers that support it, developers of scientific and engineering applications have an easier time delivering more efficient, better performing software [2].

Performance or efficiency benefits from vectorization depend on the code structure. But, in general, the automatic and near automatic techniques introduced below are most productive in delivering improved performance or efficiency. The techniques offering the most control require greater application knowledge and skill in knowing where they should be applied. But these more intrusive techniques, such as those that may involve compiler directives or other source code changes, can yield potentially greater performance and efficiency benefit when properly used.

[1] A Guide to Vectorization with Intel C++ Compilers, page 1, Mark Sabahi, et al., Intel Corporation.
[2] Vectorization with the Intel Compilers, Intel Developer Services, page 1, Aart J.C. Bik, Intel Corporation.
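The scalar-versus-vector distinction described under "What is Vectorization?" can be sketched in a few lines of Fortran. The program below is an illustration only: the program name, array names and sizes are assumptions, not taken from the paper. It computes the same element-wise sum once as an explicit loop and once with whole-array syntax. Whether either form is actually compiled to SIMD instructions depends on the compiler, the optimization level and the target processor.

```fortran
! Illustrative sketch (all names and sizes are hypothetical).
program scalar_vs_vector
  implicit none
  integer, parameter :: n = 8
  real(4) :: a(n), b(n), c_loop(n), c_array(n)
  integer :: i

  ! Fill the inputs with simple, exactly representable values.
  do i = 1, n
    a(i) = real(i)
    b(i) = 2.0 * real(i)
  end do

  ! Scalar formulation: one pair of operands per iteration.
  ! A vectorizing compiler may process several adjacent
  ! iterations with a single packed SIMD instruction.
  do i = 1, n
    c_loop(i) = a(i) + b(i)
  end do

  ! Array formulation: the operation is stated for the whole
  ! series of adjacent values at once.
  c_array = a + b

  ! Both formulations must produce identical results.
  if (any(c_loop /= c_array)) error stop "results differ"
  print *, "loop and array forms agree"
end program scalar_vs_vector
```

Both forms express the same computation. The loop form is what auto-vectorization typically targets in existing code, so no rewrite to array syntax is implied here.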

A Good Way to Start: Intel Compilers and the Auto-Vectorization Feature

Intel C++ and Intel Fortran compilers support SIMD by supporting the Intel Streaming SIMD Extensions (Intel SSE) and Intel Advanced Vector Extensions (Intel AVX) on both IA-32 and Intel 64 architecture processors. Both compilers do auto-vectorization, generating Intel SIMD code to automatically vectorize parts of application software when certain conditions are met. Because no source code changes are required to use auto-vectorization, there is no impact on the portability of your application.

To take advantage of auto-vectorization, applications must be built at default optimization settings (/O2 or -O2) or higher. Add the /Qvec-report1 (-vec-report1) option to have the compiler tell you when it vectorized a loop. With these settings, the compiler will look for opportunities to execute multiple adjacent loop iterations in parallel using packed SIMD instructions [3]. If one or more loops have been vectorized, the compiler emits a remark to the build log that identifies the loop and says that the "LOOP WAS VECTORIZED."

When you use Intel compilers on systems that use Intel processors, you get 'free' performance improvements that will automatically take advantage of processing power as the Intel architecture gets more parallel. This is an example of what we mean by 'scaling forward.'

You can try the Intel compilers yourself by downloading an evaluation copy of an Intel compiler and testing it with the sample code included with the compiler [4] or with your own 'loopy' code. The Intel Fortran Compiler features easy-to-use "Getting Started" guides that take you step-by-step through the use of the sample code and many compiler features, such as auto-vectorization.

    subroutine quad(len,a,b,c,x1,x2)
      real(4) a(len),b(len), c(len), x1(len), x2(len), s

      do i=1,len
        s = b(i)**2 - 4.*a(i)*c(i)
        if (s.ge.0.) then
          x1(i) = sqrt(s)
          x2(i) = (-x1(i) - b(i)) *0.5 / a(i)
          x1(i) = ( x1(i) - b(i)) *0.5 / a(i)
        else
          x2(i) = 0.
          x1(i) = 0.
        endif
      enddo
    end

    ifort -c -vec-report2 quad.f90
    quad.f90(4): (col. 3) remark: LOOP WAS VECTORIZED.

Figure 1. Sample source code followed by a command line to start the Fortran compiler, and a sample report from the compiler indicating the loop was vectorized.

Intel MKL

Another easy way to take advantage of vectorization is to make calls in your applications to the vectorized forms of functions in the Intel Math Kernel Library (Intel MKL). Intel MKL offers linear algebra functions, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. A set of vectorized transcendental functions called the Vector Math Library (VML) is also included. These offer greater performance than the libm (scalar) functions, while maintaining the same high accuracy. The Vector Statistical Library (VSL) offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.

Vectorization Reports

Intel compiler build-log reports contain two important kinds of information about vectorization. First, as noted above, they report which loops were vectorized. Second, and perhaps more useful, an optional report (/Qvec-report2 or -vec-report2) provides information about why some loops were not vectorized. This can be very helpful in providing guidance to restructure code so it will auto-vectorize.

[3] Op. cit., Sabahi, et al., Intel Corporation.
[4] The compiler includes a "Getting Started" tutorial and sample code. If you do the default installation (in this case, on Windows), samples are located in C:\Program Files (x86)\Intel\Composer XE 2011 SP1\Samples\en_US\Fortran\vec_samples.zip.
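To experiment with the quad routine of Figure 1, a small driver along the following lines could be used. This is a sketch under stated assumptions: the program name, the test coefficients and the tolerance are hypothetical, and the routine is reproduced as an internal procedure with explicit declarations so that the file compiles on its own. It solves four quadratics whose roots are known and stops with an error if any result is off.

```fortran
! Hypothetical driver for the quad routine shown in Figure 1.
! Test values and tolerance are illustrative assumptions.
program test_quad
  implicit none
  integer, parameter :: n = 4
  real(4) :: a(n), b(n), c(n), x1(n), x2(n)

  ! x**2 - 3x + 2, x**2 + 2x + 1, x**2 + 1, 2x**2 - 8x + 6
  a = [ 1.0, 1.0, 1.0,  2.0]
  b = [-3.0, 2.0, 0.0, -8.0]
  c = [ 2.0, 1.0, 1.0,  6.0]

  call quad(n, a, b, c, x1, x2)

  ! Expected roots; the third equation has no real roots,
  ! so the routine stores zeros for it.
  if (any(abs(x1 - [2.0, -1.0, 0.0, 3.0]) > 1.0e-6)) error stop "x1 wrong"
  if (any(abs(x2 - [1.0, -1.0, 0.0, 1.0]) > 1.0e-6)) error stop "x2 wrong"
  print *, "quad results as expected"

contains

  ! Same computation as Figure 1, with explicit declarations.
  subroutine quad(len, a, b, c, x1, x2)
    integer, intent(in)  :: len
    real(4), intent(in)  :: a(len), b(len), c(len)
    real(4), intent(out) :: x1(len), x2(len)
    real(4) :: s
    integer :: i

    do i = 1, len
      s = b(i)**2 - 4.*a(i)*c(i)        ! discriminant
      if (s .ge. 0.) then
        x1(i) = sqrt(s)
        x2(i) = (-x1(i) - b(i)) * 0.5 / a(i)
        x1(i) = ( x1(i) - b(i)) * 0.5 / a(i)
      else
        x2(i) = 0.
        x1(i) = 0.
      endif
    enddo
  end subroutine quad
end program test_quad
```

With the Intel compiler, building this file with the report option described above would be one way to check whether the loop is reported as vectorized. The exact remark text and line number depend on the compiler version.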

    subroutine no_vec(a, b, c)
      real(4), dimension(*) :: a, b, c
      integer :: i

      do i=1,100
        a(i) = b(i) * c(i)
        if (a(i) < 0.0 ) exit
      enddo
    end

    ifort -c -vec-report2 two_exits.f90
    two_exits.f90(5): (col. 3) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Figure 2. Similar to Figure 1 but, in this case, it's an example of unvectorizable code with a sample report.

Directives

The reports are also useful to help guide use and placement of the many directives included in the Intel Fortran compiler, not including OpenMP* directives, that can override assumptions made by the compiler. For developers familiar with their applications, directives make it easy to declare to the compiler that it is safe to ignore issues such as potential data dependencies. Other directives deal with loop counts, allow developers to declare that a loop is safe to vectorize regardless of what the compiler thinks about the performance cost or benefit, and assert that data within the loop are aligned. There is also a statement to tell the compiler to not vectorize a loop and a compiler option to not do any vectorization. These can be useful for 'before' and 'after' performance and results testing.

Descriptions and examples of pragmas supported by the Intel Fortran Compiler are provided in the Intel Fortran Compiler XE 12.1 User and Reference Guides (search for "Compiler Directives").

The IVDEP directive is applied to a DO loop in which the user knows that dependences are in lexical order. For example, if two memory references in the loop touch the same memory location and one of them modifies the memory location, then the first reference to touch the location has to be the one that appears earlier lexically in the program source code. This assumes that the right-hand side of an assignment statement is "earlier" than the left-hand side.

The IVDEP directive informs the compiler that the program would behave correctly if the statements were executed in certain orders other than the sequential execution order, such as executing the first statement or block to completion for all iterations, then the next statement or block for all iterations, and so forth. The optimizer can use this information, along with whatever else it can prove about the dependences, to choose other execution orders.

Guided Auto-Parallelism (GAP)

The Intel Fortran Compiler also includes an easy-to-use tool to help you vectorize code. It's called Guided Auto-Parallelism (GAP), which is invoked with the "/Qguide" option on Windows and "-guide" on Linux. This causes the compiler to generate diagnostic reports (but no object code or executables) that suggest ways to improve auto-vectorization as well as auto-parallelization and data layout. The advice may include suggestions for source code changes, applying specific pragmas, or applying specific compiler options. In all cases, applying specific advice requires the user to verify that it is safe to apply that particular suggestion [5]. This is a powerful tool to help you extend the auto-vectorization and auto-parallelism capabilities of the compiler for developers who are familiar with the code on which they are working.

SIMD Directive

Yet another tool is user-mandated vectorization using the SIMD directive. This is a feature that enables you to tell the compiler to enforce vectorization of loops. Programs written with SIMD vectorization are very similar to those written using auto-vectorization hints. You can use SIMD vectorization to minimize code changes that you may have to go through in order to obtain vectorized code.

SIMD vectorization uses the !DIR$ SIMD directive to effect loop vectorization. The options -Qsimd- [on Windows*] or -no-simd [on Linux* or Mac* OS] may be used to disable any SIMD directives, for testing and comparisons.

[5] Op. cit., Sabahi, et al., pg. 25.
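The report in Figure 2 points at the early exit as the obstacle. One possible restructuring, offered as a sketch and not as the paper's recommendation, is to separate the search for the exit point from the arithmetic. In the program below, the program name, the test data and the variable last are assumptions made for illustration. The search loop still contains an exit and would likely remain scalar, but the multiplication then sits in a simple counted loop, which is the kind of loop auto-vectorization targets. Whether this pays off depends on how early the exit typically occurs, since the products before the exit point are computed twice.

```fortran
! Sketch only: splits the Figure 2 loop into a search step and a
! counted compute step. Names and test data are hypothetical.
program early_exit_split
  implicit none
  integer, parameter :: n = 100
  real(4) :: a_ref(n), a_new(n), b(n), c(n)
  integer :: i, last

  ! Test data: products are positive except at index 40.
  do i = 1, n
    b(i) = real(i)
    c(i) = 1.0
  end do
  c(40) = -1.0
  a_ref = 0.0
  a_new = 0.0

  ! Original form (as in Figure 2): compute and test in one loop.
  do i = 1, n
    a_ref(i) = b(i) * c(i)
    if (a_ref(i) < 0.0) exit
  end do

  ! Step 1: find the last iteration the original loop would execute.
  last = n
  do i = 1, n
    if (b(i) * c(i) < 0.0) then
      last = i
      exit
    end if
  end do

  ! Step 2: counted loop with a single exit, a vectorization candidate.
  do i = 1, last
    a_new(i) = b(i) * c(i)
  end do

  ! The restructured code must reproduce the original results.
  if (any(a_ref /= a_new)) error stop "restructured loop differs"
  print *, "last index written:", last
end program early_exit_split
```

As with any advice from the reports or from GAP, it is up to the developer to confirm that a rewrite like this preserves the program's behavior for all inputs.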

Figures 3 and 4 show code that does not automatically vectorize because of the unknown data dependence distance "X". You can use the data dependence assertion via the auto-vectorization hint, !DIR$ IVDEP, to let the compiler decide whether to vectorize the loop, or you can enforce vectorization of the loop using !DIR$ SIMD.

    [D:/simd] cat example1.f
          subroutine add(A, N, X)

          integer N, X
          real    A(N)

            DO I = X+1, N
              A(I) = A(I) + A(I-X)
            ENDDO
          end

    Command line entry: [D:/simd] ifort example1.f -nologo -Qvec-report2
    Output: D:\simd\example1.f(6): (col. 9) remark: loop was not vectorized: existence of vector dependence.

Figure 3. Example: without !DIR$ SIMD produces the output at the bottom of the figure.

    [D:/simd] cat example1.f
          subroutine add(A, N, X)

          integer N, X
          real    A(N)

    !DIR$ SIMD
            DO I = X+1, N
              A(I) = A(I) + A(I-X)
            ENDDO
          end

    Command line entry: [D:/simd] ifort example1.f -nologo -Qvec-report2
    Output: D:\simd\example1.f(7): (col. 9) remark: LOOP WAS VECTORIZED.

Figure 4. Example with !DIR$ SIMD produces "LOOP WAS VECTORIZED" report.

The SIMD directive has optional clauses to guide the compiler on how vectorization must proceed. An expert user might employ these clauses to further guide how the compiler goes about vectorization. In most simple situations, they are not needed. For more information, consult the Intel Fortran Compiler XE 12.1 User and Reference Guides (search "Directive SIMD").

Summary

The performance benefits from vectorization and parallelism can be significant. Intel Software Development Products offer flexible capabilities that enable tapping into this performance, some of which are automatic, others that are easy to use and still more that offer extensive programmer control. This paper offers a quick survey of these capabilities. Take the time to download the tools, evaluate them, and see for yourself how you can take advantage of vectorization in contemporary computing systems.

Other development products from Intel can also help with vectorization and other forms of parallelism. Intel VTune Amplifier XE can help analyze code to find performance bottlenecks and Intel Inspector XE can help debug parallel code to verify threading correctness.
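Why the compiler is cautious about the loop in Figures 3 and 4, and why enforcing vectorization puts the burden of proof on the developer, can be illustrated without any compiler directive. The sketch below is an assumption-laden model, not Intel's implementation: it compares the sequential loop with a Fortran array assignment, which reads the entire right-hand side before storing anything and therefore behaves like a vector wide enough to hold every iteration. Real SIMD registers are much narrower, so the distance at which vectorization becomes safe on actual hardware depends on the vector length. The program and function names and the array size are hypothetical.

```fortran
! Illustrative model of the dependence distance X from Figures 3 and 4.
! The array assignment stands in for an "infinitely wide" vector.
program dependence_distance
  implicit none
  integer, parameter :: n = 8

  ! Distance 1: each iteration needs the previous result, so the
  ! all-at-once evaluation is expected to give a different answer.
  if (matches(1)) error stop "expected a difference for X = 1"

  ! Distance 4: locations read (1..4) and written (5..8) do not
  ! overlap, so both evaluation orders agree.
  if (.not. matches(4)) error stop "expected agreement for X = 4"

  print *, "X = 1: results differ; X = 4: results agree"

contains

  ! Returns .true. when sequential and all-at-once evaluation agree.
  logical function matches(x)
    integer, intent(in) :: x
    real :: serial(n), wide(n)
    integer :: i

    serial = 1.0
    wide   = 1.0

    ! Sequential order, as written in the figures.
    do i = x + 1, n
      serial(i) = serial(i) + serial(i - x)
    end do

    ! All reads happen before any write.
    wide(x+1:n) = wide(x+1:n) + wide(1:n-x)

    matches = all(serial == wide)
  end function matches
end program dependence_distance
```

The compiler cannot know X at compile time, so it cannot rule out the first case by itself. A developer who adds !DIR$ SIMD is in effect asserting that values of X small enough to cause the first case never occur.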

Additional Reading and Community

- Vectorization with the Intel Compilers (Part 1), A.J.C. Bik, Intel. Go to the Intel Software Network Knowledge Base and search for the title in the keyword search. This article offers good bibliographical references.
- The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance, A.J.C. Bik, Intel Press, June 2004, for a detailed discussion of how to vectorize code using the Intel compiler.
- Elemental functions: Writing data parallel code in C/C++ using Intel Cilk Plus, Robert Geva, Intel Corporation.
- Intel Software Network. Search for topics such as "Parallel Programming" in the "Communities" menu, or "Software Forums" or Knowledge Base in the "Forums and Support" menu.
- Requirements for Vectorizable Loops, Martyn Corden, Intel Corporation.
- The Software Optimization Cookbook, Second Edition: High-Performance Recipes for IA-32 Platforms, Richard Gerber, Aart J.C. Bik, Kevin B. Smith and Xinmin Tian, Intel Press.

Evaluate a tool

Download a free evaluation copy of our tools. If you're still uncertain where to begin, we suggest:

For bundled suites that include the compiler and libraries along with analysis tools, try Intel Parallel Studio XE or Intel Cluster Studio XE (if you use MPI clusters). If you are not interested in analysis tools, Intel Composer XE combines the Intel compilers with libraries. Try Intel Parallel Advisor for Windows* to help identify where your code can benefit from parallelism.

Learning Tools

- Intel Visual Fortran Composer XE 2011 Getting Started Tutorials: For Windows, For Linux, For Mac OS X
- Intel Learning Lab, collection of tutorials, white papers and more.

Purchase Options: Language Specific Suites

Several suites are available combining the tools to build, verify and tune your application. Single or multi-user licenses and volume, academic, and student discounts are available.

Suites:

- Intel Parallel Studio XE
- Intel C++ Studio XE
- Intel Cluster Studio XE
- Intel Composer XE
- Intel C++ Composer XE
- Intel Fortran Studio XE
- Intel Fortran Composer XE

Components:

- Intel C / C++ Compiler
- Intel Fortran Compiler
- Intel Integrated Performance Primitives (3)
- Intel Math Kernel Library (3)
- Intel Cilk Plus
- Intel Threading Building Blocks
- Intel Inspector XE
- Intel VTune Amplifier XE
- Static Security Analysis
- Intel MPI Library
- Intel Trace Analyzer & Collector
- Rogue Wave IMSL* Library (2)

Operating System (1): W, L; W, L; W, L; W, L; W, L; W, L, M; W, L, M

Note:
(1) Operating System: W = Windows, L = Linux, M = Mac OS* X.
(2) Available in Intel Visual Fortran Composer XE for Windows with IMSL*.
(3) Not available individually on Mac OS X; it is included in Intel C++ & Fortran Composer XE suites for Mac OS X.

About the Author

Chuck Piper is an Intel Product Marketing Engineer specializing in compilers.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Optimization Notice

Notice revision #20110804

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

© 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Intel An-Introduction-to-Vectorization-with-the- Intel Fortran Compiler WP /Rev-021712
