FPGA Implementation of Gram-Schmidt QR Decomposition Using High Level Synthesis


Parth Desai(1), Semih Aslan(2) and Jafar Saniie(1)
(1) Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, 60616, U.S.A.
(2) Department of Electrical and Computer Engineering, Texas State University, San Marcos, TX, 78666, U.S.A.

Abstract—In this paper, a high-precision QR decomposition implementation on FPGA is discussed. The Gram-Schmidt algorithm is used and synthesized from a high-level language to Register Transfer Level (RTL) Verilog HDL using High Level Synthesis (HLS). The RTL code is synthesized and implemented on the Xilinx ZedBoard FPGA. The generated IP core is designed for 10x10 floating-point and fixed-point implementations and can be scaled up to larger matrix sizes; results for a scaled-up 100x100 matrix implementation are also discussed. For the 10x10 matrix size, the fixed-point data type yields an implementation that is twice as fast and consumes nearly 30% less area than the floating-point implementation, with no more than 3% error in precision. The error percentage varies with the matrix size.

Keywords—FPGA, Gram-Schmidt, HLS, QR Decomposition, ZedBoard.

I. INTRODUCTION

In this paper, an IP core designed using High Level Synthesis (HLS) is discussed. The implemented core performs QR decomposition of a 10x10 input matrix A and generates two 10x10 matrices, Q and R, as outputs. The data type used here is the floating-point data type, which provides higher precision and dynamic range [1]. An alternative is also introduced that consumes fewer resources by using a 16-bit fixed-point data type, but it has a precision tradeoff [1]. The design is further scaled up to a 100x100 matrix size for both the floating-point and fixed-point implementations, and the loop unrolling technique [2][3] used to increase performance is also discussed. Based on the data type used, two different designs are created for applications where either precision or low resource utilization is the priority.

This paper is structured as follows: Section II provides a brief discussion of the QR decomposition algorithm using Gram-Schmidt. Section III discusses the HLS design flow and implementation, and Section IV discusses the proposed design implementation results.

II. QR DECOMPOSITION USING GRAM-SCHMIDT

A. QR Decomposition Techniques

There are various applications of QR decomposition in the fields of linear algebra and digital signal processing [4]. It is used in signal processing applications such as echo cancellation, Multiple Input Multiple Output (MIMO), and beamforming systems [5]. Design and verification of a QR decomposition core are complex and require a long time-to-market (TTM). To reduce TTM and create a more efficient design, an HLS design and verification method is proposed.

The aim of QR decomposition is to factorize a matrix A into two matrices, Q and R, where Q is a unitary matrix and R is an upper triangular matrix, such that A = QR. There are several QR factorization algorithms used for hardware implementations, such as Givens rotation, Householder transform, and Gram-Schmidt [5]. These algorithms differ in computational cost, numerical stability, and latency [6]. We discuss the Gram-Schmidt algorithm, its hardware implementation, and its verification in this paper.

B. Gram-Schmidt Algorithm

In certain applications, the computation is simplified if we have orthogonal vectors. We can produce an orthogonal set of vectors T' from T with the same span as T; this is Gram-Schmidt orthogonalization [7].
So, for a set of vectors T = {p_1, p_2, ..., p_n}, we must find a set of vectors T' = {q_1, q_2, ..., q_{n'}} with n' <= n so that span{q_1, q_2, ..., q_{n'}} = span{p_1, p_2, ..., p_n} and \langle q_i, q_j \rangle = \delta_{i,j}.

The algorithm works as follows [5][7]. The first step is to normalize p_1 by applying (1) to calculate q_1:

    q_1 = \frac{p_1}{\| p_1 \|}                                               (1)

Next, compute the difference between p_2 and its projection onto q_1 using (2), as shown in Fig. 1:

    e_2 = p_2 - \frac{\langle p_2, q_1 \rangle}{\langle q_1, q_1 \rangle} q_1   (2)

Fig. 1. Orthogonal Projection of Vectors [5]

If e_2 = 0 then p_2 is discarded. If e_2 != 0 then we normalize it as in (3):

    q_2 = \frac{e_2}{\| e_2 \|}                                               (3)
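As a brief numerical illustration of steps (1)-(3) (the vectors here are chosen only for illustration and are not taken from the paper): for p_1 = (3, 4)^T and p_2 = (1, 0)^T, (1) gives q_1 = (0.6, 0.8)^T; the projection coefficient is \langle p_2, q_1 \rangle = 0.6, so e_2 = p_2 - 0.6\, q_1 = (0.64, -0.48)^T, \| e_2 \| = 0.8, and (3) gives q_2 = (0.8, -0.6)^T, which is orthogonal to q_1.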

Similarly, we can obtain e_3 and q_3 as shown in (4) and (5):

    e_3 = p_3 - \frac{\langle p_3, q_1 \rangle}{\langle q_1, q_1 \rangle} q_1 - \frac{\langle p_3, q_2 \rangle}{\langle q_2, q_2 \rangle} q_2   (4)

    q_3 = \frac{e_3}{\| e_3 \|}                                               (5)

So, in general, we have the following generalized form of e_k and q_k as in (6) and (7), respectively:

    e_k = p_k - \sum_{j=1}^{k-1} \frac{\langle p_k, q_j \rangle}{\langle q_j, q_j \rangle} q_j   (6)

    q_k = \frac{e_k}{\| e_k \|}                                               (7)

So now, for matrix A = {p_1, p_2, ..., p_n}, Q can be obtained by stacking the vectors q_k such that Q = {q_1, q_2, ..., q_{n'}}, and R can be obtained as the upper triangular matrix of inner products shown below:

    R = \begin{pmatrix}
          \langle p_1, q_1 \rangle & \langle p_2, q_1 \rangle & \cdots & \langle p_n, q_1 \rangle \\
          0                        & \langle p_2, q_2 \rangle & \cdots & \langle p_n, q_2 \rangle \\
          \vdots                   &                          & \ddots & \vdots                   \\
          0                        & 0                        & \cdots & \langle p_n, q_n \rangle
        \end{pmatrix}

This gives the decomposition of A into the two matrices Q and R.

III. HIGH LEVEL SYNTHESIS DESIGN FLOW AND IMPLEMENTATION

For the implementation [8], the algorithm is first coded in the C language. This code is then synthesized to RTL-level Verilog HDL code using HLS. HLS creates an IP block that has the same behavior as described in the high-level language. This IP block is then combined with the Zynq IP cores to synthesize and implement [8] the complete design on the FPGA.

A. HLS Design Flow

The HLS design flow [9][10] converts C/C++ code into HDL after some modifications. The algorithm in C is modified to create a separate test bench and top function. For the top function, the HLS tool supports certain mathematical libraries but does not support the standard I/O library, so a separate top module must be designed to perform the desired function (a top function of this kind is sketched below). The designed top function can be tested using a test bench written with standard C libraries, and it is this top function that is converted into the IP block performing the desired function.

After this, we run the C simulation to test the results using the GCC/Clang compiler embedded in the tool. Once it passes, we run C synthesis to generate the IP block with RTL-level Verilog HDL code. The generated IP block is then tested against the C test bench, which is called C/RTL co-simulation. The complete high-level synthesis design flow is shown in Fig. 2.

Fig. 2. HLS Flow [11]

On passing all the tests, the final IP block is generated. This IP block is then exported to the FPGA. This whole process is discussed in the next section.
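The following is a minimal sketch of what such a classical Gram-Schmidt top function could look like in the synthesizable C/C++ subset accepted by HLS tools. It is illustrative only: the function name, the data_t typedef, and the coding style are assumptions, not the authors' code; only the algorithm steps (1)-(7) and the 10x10 size come from the paper.

// Illustrative sketch (not the authors' code): classical Gram-Schmidt QR
// decomposition of an N x N matrix, written in an HLS-friendly C/C++ style.
#include <math.h>

#define N 10
typedef float data_t;                 // 32-bit floating-point variant

void qrd_gs(const data_t A[N][N], data_t Q[N][N], data_t R[N][N]) {
  for (int k = 0; k < N; k++) {
    data_t v[N];
    // Start from the k-th column of A (the vector p_k).
    for (int i = 0; i < N; i++) v[i] = A[i][k];

    // Subtract projections onto the previously computed columns of Q, eq. (6).
    for (int j = 0; j < k; j++) {
      data_t dot = 0;
      for (int i = 0; i < N; i++) dot += Q[i][j] * A[i][k];
      R[j][k] = dot;                  // R(j,k) = <p_k, q_j>
      for (int i = 0; i < N; i++) v[i] -= dot * Q[i][j];
    }

    // Normalize the residual e_k to obtain q_k, eq. (7); R(k,k) = ||e_k||.
    // (No check for e_k = 0 here; the paper notes such columns are discarded.)
    data_t nrm = 0;
    for (int i = 0; i < N; i++) nrm += v[i] * v[i];
    nrm = sqrtf(nrm);
    R[k][k] = nrm;
    for (int i = 0; i < N; i++) Q[i][k] = v[i] / nrm;

    // Zero the strictly lower part of column k of R.
    for (int j = k + 1; j < N; j++) R[j][k] = 0;
  }
}

A C test bench can then fill A with random values, call the top function, and compare the resulting Q and R against a software or MATLAB reference, as described for the verification flow above.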

B. Overall Synthesis and Implementation Flow

After generating the RTL block using HLS, the IP block needs to be connected to the Zynq Processing System and the Block RAMs in the Zynq-7020 chip to complete the design. The input and output matrices are stored in Block RAM, which can be accessed by the Zynq processor through BRAM controllers. The whole design is synthesized and physically implemented to obtain timing and power estimates based on the final physical implementation. After implementation, a bitstream file is generated and imported to the ZedBoard. The implemented design is tested using the test bench, and the results can be compared between hardware and software in terms of timing and precision/percentage error. The complete synthesis and implementation flow is shown in Fig. 3.

Fig. 3. Design Implementation Flow [11]

In the next section, the implementation results are discussed for the floating-point and fixed-point matrix inputs and outputs.

IV. IMPLEMENTATION RESULTS

Two different data types are used in these implementations:
- 32-bit floating-point data type
- 16-bit fixed-point data type

A test bench is created in C which takes a randomly generated 10x10 matrix as input, and the implementation results are compared with MATLAB. The algorithm was implemented in C, synthesized to RTL using High Level Synthesis, and an IP block was generated. The generated IP block was then implemented on the ZedBoard, which carries a Zynq-7000 device with a dual-core Cortex-A9 processor.

The floating-point and fixed-point implementations discussed here each have their advantages and can be chosen based on the application requirements. The floating-point implementation provides nearly 100% precision, with the post-synthesis resource utilization shown in Table I.

TABLE I. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FLOATING POINT DATA TYPE

Name        Utilization   Total Available   %
DSP48E      6             220               2.73
BRAM_18K    2             280               0.71

The post-RTL-synthesis resource utilization for the fixed-point data type is shown in Table II.

TABLE II. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FIXED POINT DATA TYPE

Name        Utilization   Total Available   %
DSP48E      6             220               2.73
BRAM_18K    6             280               2.14

For the fixed-point implementation, there is nearly a 30% reduction in resource utilization compared to the floating-point implementation. The comparison of resource utilization between the floating-point and fixed-point data types is shown in Fig. 4; these results are for the post-implementation stage.

Fig. 4. Post-Implementation Resource Utilization Comparison for 10x10 Matrix

The post-synthesis results show that the fixed-point implementation is twice as fast as the floating-point implementation. The maximum latency of the floating-point design is 21,073 clock cycles and that of the fixed-point design is 10,591 clock cycles, at a 100 MHz clock frequency. These latencies are post-RTL-synthesis results.
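At the stated 100 MHz clock, those cycle counts correspond to roughly 211 us for the floating-point design and 106 us for the fixed-point design. As an illustration of how the two data types might be selected in the HLS source, a single typedef can switch the kernel between the two variants; this is a sketch of an assumed coding style, and the integer/fraction split of the fixed-point type is an assumption, since the paper states only that a 16-bit fixed-point type is used.

// Illustrative sketch only: choosing the data type for the HLS kernel.
// The <16,8> split and the USE_FIXED_POINT macro name are assumptions.
#include "ap_fixed.h"               // Vivado HLS arbitrary-precision types (C++)

#ifdef USE_FIXED_POINT
typedef ap_fixed<16, 8> data_t;     // 16 bits total, 8 integer bits (assumed split)
#else
typedef float data_t;               // 32-bit single-precision floating point
#endif

With the fixed-point type, math.h calls such as sqrtf in the kernel would typically be replaced by the tool's fixed-point math support or a dedicated square-root unit; the paper does not detail this step.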

The 10x10 design was scaled up to a 100x100 matrix size. The resource utilization increased by a small factor, along with an increase in latency. Tables III and IV show the post-RTL-synthesis resource utilization for the floating-point and fixed-point data types, respectively.

TABLE III. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FLOATING POINT DATA TYPE, 100X100 MATRIX

Name        Utilization   Total Available   %
DSP48E      6             220               2.73
BRAM_18K    2             280               0.71

TABLE IV. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FIXED POINT DATA TYPE, 100X100 MATRIX

Name        Utilization   Total Available   %
DSP48E      6             220               2.73
BRAM_18K    50            280               17.86

Fig. 5 shows the post-implementation comparison between the floating-point and fixed-point 100x100 implementations. The fixed-point implementation shows nearly a 40 percent reduction in resource utilization compared to the floating-point implementation and is about 3 times faster.

Fig. 5. Post-Implementation Resource Utilization Comparison for 100x100 Matrix

To increase the performance of the 100x100 matrix implementation, the loop unrolling technique is used; a sketch of how the unroll directive is applied is given after the factor-2 results below. Two implementations are presented, one with unrolling factor 2 and another with unrolling factor 4. This technique reduces latency at the cost of area utilization.

Tables V and VI show the post-RTL-synthesis results for the 100x100 matrix with loop unrolling factor 2. This implementation shows a 32% reduction in latency for floating point and a 25% reduction in latency for fixed point with respect to the original 100x100 implementation.

TABLE V. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FLOATING POINT DATA TYPE, 100X100 MATRIX WITH UNROLL FACTOR 2

Name        Utilization   Total Available   %
DSP48E      19            220               8.64
BRAM_18K    4             280               1.43

TABLE VI. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FIXED POINT DATA TYPE, 100X100 MATRIX WITH UNROLL FACTOR 2

Name        Utilization   Total Available   %
DSP48E      16            220               7.27
BRAM_18K    50            280               17.86

Fig. 6 shows the post-implementation comparison between floating point and fixed point for the 100x100 matrix with unroll factor 2. The fixed-point implementation shows nearly a 50 percent reduction in resource utilization compared to the floating-point implementation and is nearly 3 times faster. The latency has decreased at the cost of area utilization.

Fig. 6. Post-Implementation Resource Utilization Comparison for 100x100 Matrix with Unroll Factor 2
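The following sketch shows how partial unrolling might be expressed on the dot-product loop of the kernel using the Vivado HLS unroll directive. It is illustrative only: which loops the authors actually unrolled, the helper function, and the loop label are assumptions; only the unroll factors 2 and 4 come from the paper.

// Illustrative sketch only: partial loop unrolling with a Vivado HLS pragma.
#define N 100
typedef float data_t;

data_t dot_col(const data_t Q[N][N], const data_t A[N][N], int j, int k) {
  data_t acc = 0;
dot_loop:
  for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2        // factor=4 for the second reported variant
    acc += Q[i][j] * A[i][k];
  }
  return acc;
}

The same directive can also be applied from the tool's directives file instead of the source, so the two unrolled variants can be generated without modifying the C/C++ code.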

Similar to unrolling factor 2, Tables VII and VIII show the post-RTL-synthesis resource utilization results for unrolling factor 4. This implementation shows nearly a 52% reduction in latency for floating point and a 45% reduction in latency for fixed point compared to the original 100x100 matrix implementation. As a tradeoff for the latency improvement, resource utilization has increased by about 15% over the original resource utilization for both the floating-point and fixed-point implementations.

TABLE VII. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FLOATING POINT DATA TYPE, 100X100 MATRIX WITH UNROLL FACTOR 4

Name        Utilization   Total Available   %
DSP48E      27            220               12.27
BRAM_18K    4             280               1.43

TABLE VIII. POST RTL SYNTHESIS RESOURCE UTILIZATION FOR FIXED POINT DATA TYPE, 100X100 MATRIX WITH UNROLL FACTOR 4

Name        Utilization   Total Available   %
DSP48E      28            220               12.73
BRAM_18K    50            280               17.86

Fig. 7 shows the post-implementation comparison between floating point and fixed point for the 100x100 matrix with unroll factor 4. The fixed-point implementation shows nearly a 46 percent reduction in resource utilization compared to the floating-point implementation and is nearly 2.8 times faster. The latency has decreased at the cost of resource utilization.

Fig. 7. Post-Implementation Resource Utilization Comparison for 100x100 Matrix with Unroll Factor 4

TABLE IX. LATENCY TABLE FOR 100X100 MATRIX SIZE

Method Implemented           Floating-point latency (ms)   Fixed-point latency (ms)
Without loop unrolling       209                           64
Loop unrolling by factor 2   141                           48
Loop unrolling by factor 4   99                            35

Table IX shows the significant improvement in latency for the 100x100 matrix size across the different implementations, for both the floating-point and fixed-point data types, with and without the loop unrolling technique. These latencies are post-synthesis results at a 100 MHz clock frequency.

V. CONCLUSION AND FUTURE WORK

In this work, an approach to implementing the Gram-Schmidt algorithm using HLS is discussed for a 10x10 matrix input, along with a scaled-up 100x100 matrix implementation [12]. Two different data types were used, which suit different applications. The floating-point implementation has very high precision [12] and can be used when precision is the priority. The fixed-point data type with 16-bit inputs provides a better option when resources are constrained, with a 30 percent reduction in resource utilization and twice the speed of the floating-point implementation, at an error of no more than 3 percent.

The scaled-up 100x100 matrix implementation shows around twice as much post-implementation resource utilization along with an increase in latency. A latency reduction technique, loop unrolling [2][3], is discussed, which shows significant latency improvement with a resource utilization tradeoff. This demonstrates an approach to QR decomposition using high level synthesis, and any of these implementation variants can be selected based on the application.

The future scope of this project is to improve the implementation by further optimizing the algorithm and to achieve higher throughput by trying different hardware/software implementation techniques.

ACKNOWLEDGMENT

We would like to thank Xilinx, Inc. for their valuable support with the software and documentation.

REFERENCES

[1] S. Aslan, E. Oruklu and J. Saniie, "A high-level synthesis and verification tool for fixed to floating point conversion," 2012 IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS), Boise, ID, 2012, pp. 908-911.
[2] J. L. Ayala, D. Atienza, M. Lopez-Vallejo, J. M. Mendias, R. Hermida and C. A. Lopez-Barrio, "Optimal loop-unrolling mechanisms and architectural extensions for an energy-efficient design of shared register files in MPSoCs," Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05), 2005.
[3] G. Velkoski, M. Gusev and S. Ristov, "The performance impact analysis of loop unrolling," 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, 2014, pp. 307-312.
[4] S. Aslan, S. Niu and J. Saniie, "FPGA implementation of fast QR decomposition based on Givens rotation," 2012 IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS), Boise, ID, 2012, pp. 470-473.
[5] S. Aslan, E. Oruklu and J. Saniie, "Realization of area efficient QR factorization using unified division, square root, and inverse square root hardware," 2009 IEEE International Conference on Electro/Information Technology, Windsor, ON, 2009, pp. 245-250.
[6] R. T. Kobayashi and T. Abrão, "Stability analysis in Gram-Schmidt QR decomposition," IET Signal Processing, vol. 10, no. 8, pp. 912-917, Oct. 2016.
[7] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing, Prentice Hall, Upper Saddle River, NJ, 07458.
[8] Zynq-7000 All Programmable SoC Embedded Design Tutorial: A Hands-On Guide to Effective Embedded System Design, 2015. Available: https://www.xilinx.com/support/documentation/sw torial.pdf
[9] Xilinx Vivado Design Suite, HLS v2016.4, 2016. Available: https://www.xilinx.com/support/documentation/sw sis-tutorial.pdf
[10] Vivado Design Suite Tutorial: Design Flow Overview, 2016. Available: https://www.xilinx.com/support/documentation/sw view-tutorial.pdf
[11] C. Desmouliers, E. Oruklu, S. Aslan, J. Saniie and F. M. Vallina, "Image and video processing platform for field programmable gate arrays using a high-level synthesis," IET Computers & Digital Techniques, vol. 6, no. 6, pp. 414-425, November 2012.
[12] S. Niu, S. Wang, S. Aslan and J. Saniie, "Hardware and software design for QR Decomposition Recursive Least Square algorithm," 2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), Columbus, OH, 2013, pp. 117-120.

