An FPGA Implementation of the Hestenes-Jacobi Algorithm for Singular Value Decomposition


2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops

An FPGA Implementation of the Hestenes-Jacobi Algorithm for Singular Value Decomposition

Xinying Wang and Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University, Ames, Iowa, USA
Email: {xinying, zambreno}@iastate.edu

Abstract—As a useful tool for dimensionality reduction, Singular Value Decomposition (SVD) plays an increasingly significant role in many scientific and engineering applications. The high computational complexity of SVD poses challenges for efficient signal processing and data analysis systems, especially for time-sensitive applications with large data sets. While the emergence of FPGAs provides a flexible and low-cost opportunity to pursue high-performance SVD designs, the classical two-sided Jacobi rotation-based SVD architectures are restricted in terms of scalability and input matrix representation. The Hestenes-Jacobi algorithm offers a more parallelizable solution for analyzing arbitrary rectangular matrices; however, to date both FPGA-based and GPU-based implementations have not lived up to the algorithm's potential. In this paper, we introduce a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices. Our implementation on an FPGA-based hybrid acceleration system demonstrates improved efficiency compared to an optimized software-based SVD solution for matrices with small to medium column dimensions, even with comparably large row dimensions. The achievable dimension-dependent speedups range from 3.8× to 43.6× for matrices with column dimensions from 128 to 256 and row sizes from 128 to 2048. Additionally, we evaluate the accuracy of our SVD process through convergence analysis.

Keywords: Architecture, FPGA, Singular Value Decomposition, Hestenes-Jacobi Algorithm.

I. INTRODUCTION

In many real-world applications, data dimensionality is rapidly growing. Principal Component Analysis (PCA) is widely employed to reduce the dimensions of data in high-dimensional spaces. Among the classical solutions for PCA, Singular Value Decomposition (SVD) is the most popular technique for approximating high-dimensional data through orthogonal transformations. SVD-based PCA has been used in many signal processing applications such as image processing, computer vision, pattern recognition and remote sensing [1]–[3]. However, SVD is a computationally expensive procedure, which makes it difficult to meet the requirements of many time-sensitive designs, especially when it is invoked iteratively in those applications. For instance, in a video surveillance application [4], it takes 185.2 seconds to recover a square matrix of dimension 3000 by running partial SVD 15 times, which makes it difficult to satisfy stringent real-time performance requirements. As data dimensionality continues to increase, the runtime of SVD is likely to grow substantially further.

The SVD operation diagonalizes an arbitrary m × n matrix through a series of orthogonal transformations [5]. Optimized software implementations (e.g., MATLAB, LAPACK) employ the Householder transformation [6] to perform the SVD computation; however, their performance is restricted by their inherent computational complexity and high data dependency. Highly parallel accelerators such as Graphics Processing Units (GPUs) and multi-core platforms have been employed to explore parallel implementations, although these previous works only achieved speedups when the input matrices have dimensions greater than 1000 [7], [8].

In the reconfigurable architecture community, systolic arrays have been implemented on modern FPGAs to compute the classic two-sided Jacobi rotations [9], [10]. Although improved performance has been demonstrated, the scalability of those implementations is limited, and the designs are restricted to handling only square input matrices.

Hestenes [10] discovered the equivalence between zeroing out an off-diagonal element a_ij and orthogonalizing the ith and jth vectors through a plane rotation. Instead of annihilating every non-zero off-diagonal element by rotating 2 × 2 matrices, the Hestenes-Jacobi method is capable of decomposing an arbitrary m × n non-square matrix through vector computations. GPUs and FPGAs have been employed to evaluate parallel Hestenes-Jacobi designs; however, the performance has suffered from iterative thread synchronizations (in the case of GPUs [11]) and repeated calculations (in the case of FPGA implementations [12]).

In this paper, we present an FPGA-based hardware design of the Hestenes-Jacobi algorithm for SVD with floating-point arithmetic, which can analyze an arbitrary m × n matrix. Compared to a previous FPGA-based Hestenes-Jacobi implementation [12], our architecture optimizes the calculations by improving data reuse, and employs IEEE-754 double-precision floating-point operators to provide a wider dynamic range. Also, off-chip memory is employed to remove the restriction on analyzable matrix dimensions. Our experimental results demonstrate the efficiency of our design for matrices with small to medium column dimensions, even when they have comparably large row dimensions. The dimension-dependent speedups that can be achieved range from 3.8× to 43.6× for matrices with column sizes from 128 to 256 and row dimensions from 128 to 2048. Compared to other GPU-based and FPGA-based implementations of Hestenes-Jacobi SVD, our architecture is currently the fastest in terms of overall performance. We have also evaluated the accuracy of our approach through analysis of the convergence properties.

The remainder of this paper is organized as follows. Section II introduces the theoretical background of SVD, while related work is discussed in Section III. Section IV presents the modified Hestenes-Jacobi algorithm, and Section V introduces our novel hardware architecture for the Hestenes-Jacobi algorithm, including detailed descriptions of the individual components. The evaluation of the performance and accuracy of our architecture is presented in Section VI. Finally, the paper is concluded in Section VII with a preview of planned future work.

II. THEORETICAL BACKGROUND

A. Singular Value Decomposition (SVD)

The Singular Value Decomposition transforms an m × n matrix into the product of an m × m orthogonal matrix, an m × n diagonal matrix holding the singular values, and the transpose of an n × n orthogonal matrix [5], in the form of eq. (1).

    A_{m×n} = U_{m×m} · Σ_{m×n} · V^T_{n×n}    (1)

B. Classic Two-sided Jacobi Rotations

Jacobi rotations are widely used in diagonalizing matrices. To perform a Jacobi rotation, the rotation matrices J^l and J^r are applied to the matrix from both sides, as shown in eq. (2).

    A_{i+1} = J_i^l · A_i · J_i^r    (2)

By applying the Jacobi rotation matrices to a 2 × 2 matrix, the off-diagonal elements are annihilated as in eq. (3).

    J^l · [ A_pp  A_pq ; A_qp  A_qq ] · J^r = [ A''_pp  0 ; 0  A''_qq ]    (3)

The Jacobi rotation matrices are generated through eq. (4) and eq. (5), where θ represents the plane rotation angle α or β [9].

    J_pp = cos(θ);  J_pq = sin(θ)  (p < q);  J_qp = -sin(θ)  (p < q);
    J_qq = cos(θ);  J_ii = 1  (i ≠ p, q);  J_ij = 0 otherwise.    (4)

    β - α = arctan((A_qp + A_pq) / (A_qq - A_pp));
    β + α = arctan((A_qp - A_pq) / (A_qq + A_pp))    (5)

To process the SVD, Jacobi rotations are calculated on every 2 × 2 submatrix to zero out all the non-zero off-diagonal elements. The calculation of an independent 2 × 2 Jacobi rotation only affects two rows and two columns of the matrix, which provides an opportunity for parallel designs that simultaneously perform independent 2 × 2 Jacobi rotations. Due to the nature of the 2 × 2 Jacobi rotations, the input matrix is restricted to square dimensions.

C. Hestenes-Jacobi Method

In [10], Hestenes observed that the annihilation of a matrix element is equivalent to orthogonalizing two column vectors. Instead of directly annihilating non-zero off-diagonal elements, the Hestenes-Jacobi algorithm (also known as the one-sided Jacobi method) performs the matrix decomposition through iterative orthogonal transformations between every pair of column vectors. In the Hestenes-Jacobi method, the matrix is orthogonalized by columns through post-multiplication by an orthogonal matrix, which is generated as a product of plane rotations, as in eq. (6).

    A · V = B,  where b_i^T · b_j = 0 for i ≠ j    (6)

Next, matrix B is factored as B = (B · Σ^{-1}) · Σ, in which Σ is a diagonal matrix whose diagonal elements are the column norms of B. Then, by setting U = B · Σ^{-1}, eq. (6) can be rewritten as eq. (7), which is the result form of the SVD.

    A = U · Σ · V^T    (7)

Compared to the classic two-sided Jacobi rotation approach, the Hestenes-Jacobi method is capable of analyzing an arbitrary rectangular matrix.
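As a concrete illustration of eqs. (2)-(5), the following NumPy sketch diagonalizes a single 2 × 2 matrix. It is an illustrative model using the sign conventions as reconstructed above; the resulting diagonal entries come out unordered and unnormalized in sign.

```python
import numpy as np

def two_sided_jacobi_2x2(A):
    """Annihilate the off-diagonals of a 2x2 matrix via eqs. (2)-(5).

    The angles alpha (left) and beta (right) are recovered from the
    angle-sum/difference relations of eq. (5); atan2 avoids division
    by zero when A_qq = A_pp or A_qq = -A_pp.
    """
    app, apq, aqp, aqq = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    diff = np.arctan2(aqp + apq, aqq - app)      # beta - alpha
    summ = np.arctan2(aqp - apq, aqq + app)      # beta + alpha
    alpha, beta = (summ - diff) / 2.0, (summ + diff) / 2.0
    J = lambda t: np.array([[np.cos(t), np.sin(t)],
                            [-np.sin(t), np.cos(t)]])    # eq. (4)
    return J(alpha) @ A @ J(beta)  # eq. (2); diagonal as in eq. (3)

# off-diagonals vanish up to rounding; prints diag(1, 3) here
print(two_sided_jacobi_2x2(np.array([[2.0, 1.0], [1.0, 2.0]])))
```

For the symmetric input shown, the result is diag(1, 3), matching the eigenvalues, as eq. (3) requires.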
III. RELATED WORK

In recent years, the significant surge in data dimensionality has made applications of SVD nearly ubiquitous [13]. For computing the SVD, the Householder transformation-based method and the Jacobi rotation-based method have demonstrated satisfactory stability and accuracy [14], [15]. The Householder transformation [6], [16] is capable of efficiently bi-diagonalizing matrices through vector computations, and is then followed by iterative implicit QR factorization [17] or divide-and-conquer iterations [18] to generate the singular values. In the Householder transformation-based method, the SVD process is dominated by the calculation of the Householder vectors and their respective updates, whose performance improvement is limited by the inherent data dependency. To parallelize the Householder transformation, implementations have been demonstrated on GPUs [7], [11] and multi-core platforms [8]; the GPU-based designs achieve acceleration only for matrices with significantly large dimensions due to the iterative thread synchronization, while the performance of the multi-core implementation is dominated by task splitting and communication overhead.

The emergence of reconfigurable fabrics such as FPGAs introduces low-cost solutions for parallelizing the algorithm at operand-level granularity. To perform the SVD, 1-dimensional or 2-dimensional systolic arrays have been employed to parallelize the classic two-sided Jacobi rotation algorithm [9], [19]–[21]. Exploiting the independence of the 2 × 2 rotations, a highly parallel 2-dimensional systolic array can map the classic two-sided Jacobi rotation algorithm onto FPGA devices with a computational complexity of O(n log n) for an n-by-n square matrix. However, to fit the architecture on a single chip, the scalability is limited, as n² processing elements (PEs) are needed by the systolic array implementation.

Compared to the classical Jacobi rotation approach, the Hestenes-Jacobi algorithm provides a more flexible solution for analyzing rectangular matrices. To explore high-performance SVD designs, FPGAs and GPUs have been employed to demonstrate parallel implementations of the Hestenes-Jacobi SVD algorithm [10]; however, the performance has suffered from iterative thread synchronizations in the case of the GPU implementation [11], and from an iterative design with duplicated computations in the case of the FPGA implementation [12].

IV. MODIFIED HESTENES-JACOBI ALGORITHM

As previously mentioned, the Hestenes-Jacobi algorithm computes the SVD by orthogonalizing every pair of column vectors. Instead of directly performing element-wise operations to annihilate an off-diagonal element, the Hestenes-Jacobi method applies an orthogonal transformation between the two vectors whose indexes equal the row and column indexes of that off-diagonal element. To orthogonalize a pair of vectors, a Jacobi rotation is computed from the squared 2-norms of the vectors and the covariance between them.

In the Hestenes-Jacobi process (detailed in Algo. 1), the orthogonalization of two column vectors starts with the calculation of their squared 2-norms and their covariance. A Jacobi rotation is then performed with the calculated squared 2-norms and covariance, after which the elements of the two column vectors are updated by applying the generated rotation angle parameters. At runtime, pairwise orthogonalizations are performed iteratively until satisfactory convergence is reached. The singular values are obtained as the square roots of the diagonal elements of the resulting matrix.

To optimize the algorithm by reducing the amount of computation, the squared 2-norms of the rotated vectors and their associated covariances are updated directly after each rotation. Thus, the repeated regeneration of squared 2-norms and covariances becomes unnecessary. In Algo. 1, matrix D is the covariance matrix, whose diagonal and off-diagonal elements are the squared 2-norms of the column vectors and the covariances between them, respectively.

Algorithm 1: Singular Value Decomposition via the Modified Hestenes-Jacobi Algorithm

    Input: matrix A
    Output: singular value vector Sig
    R ← A
    /* Generate the squared 2-norms of the column vectors and their associated covariances */
    for i = 1 to NumofColumn do
        for j = i to NumofColumn do
            D_{i,j} ← R_i^T · R_j
    repeat
        for i = 1 to NumofColumn - 1 do
            for j = i + 1 to NumofColumn do
                /* Generate the Jacobi rotation angle parameters from the squared 2-norms of the column vectors and their respective covariance */
                norm1 ← D_{j,j};  norm2 ← D_{i,i};  cov ← D_{i,j}
                ρ ← (norm2 - norm1) / (2 · cov)
                t ← sign(ρ) / (|ρ| + sqrt(1 + ρ²))
                cos ← 1 / sqrt(1 + t²)
                sin ← cos · t
                /* Update the squared 2-norms affected by the rotation */
                D_{j,j} ← D_{j,j} - t · cov
                D_{i,i} ← D_{i,i} + t · cov
                cov ← 0
                /* Update the covariances affected by the rotation (right-hand sides use the pre-rotation values) */
                for k = 1 to i - 1 do
                    D_{k,i} ← D_{k,i} · cos + D_{k,j} · sin
                    D_{k,j} ← -D_{k,i} · sin + D_{k,j} · cos
                for k = i + 1 to j - 1 do
                    D_{i,k} ← D_{i,k} · cos + D_{k,j} · sin
                    D_{k,j} ← -D_{i,k} · sin + D_{k,j} · cos
                for k = j + 1 to NumofColumn do
                    D_{i,k} ← D_{i,k} · cos + D_{j,k} · sin
                    D_{j,k} ← -D_{i,k} · sin + D_{j,k} · cos
    until convergence reached
    for i = 1 to min(NumofColumn, NumofRow) do
        Sig_i ← sqrt(D_{i,i})
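A software rendering of Algo. 1 may help clarify the data reuse: the covariance matrix D is formed once, and every subsequent rotation updates it in place. The NumPy sketch below is a functional model under stated assumptions (a fixed number of sweeps standing in for the convergence test, matching the 6 iterations used in the evaluation of Section VI, and the sign conventions as reconstructed above); it is not a description of the hardware mapping.

```python
import numpy as np

def hestenes_jacobi_singular_values(A, sweeps=6):
    """Functional model of Algo. 1: the covariance matrix D = R^T R is
    built once, then every pairwise rotation updates D in place, so the
    squared 2-norms and covariances are never recomputed from columns."""
    R = np.array(A, dtype=np.float64)
    n = R.shape[1]
    D = R.T @ R            # diagonal: squared 2-norms; off-diagonal: covariances
    for _ in range(sweeps):            # "repeat ... until convergence reached"
        for i in range(n - 1):
            for j in range(i + 1, n):
                cov = D[i, j]
                if cov == 0.0:
                    continue           # the pair is already orthogonal
                rho = (D[i, i] - D[j, j]) / (2.0 * cov)
                t = (1.0 if rho >= 0.0 else -1.0) / (
                    abs(rho) + np.sqrt(1.0 + rho * rho))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                D[i, i] += t * cov     # update the squared 2-norms
                D[j, j] -= t * cov
                D[i, j] = D[j, i] = 0.0  # the covariance is zeroed out
                for k in range(n):     # update the affected covariances
                    if k != i and k != j:
                        dki, dkj = D[k, i], D[k, j]
                        D[k, i] = D[i, k] = dki * c + dkj * s
                        D[k, j] = D[j, k] = -dki * s + dkj * c
    return np.sort(np.sqrt(np.maximum(np.diag(D), 0.0)))[::-1]

A = np.random.default_rng(0).standard_normal((64, 16))
print(np.allclose(hestenes_jacobi_singular_values(A),
                  np.linalg.svd(A, compute_uv=False)))  # expect True here
```

The column-update step of Algo. 1 is omitted in this sketch because D alone determines the singular values; a hardware-faithful model would also rotate the columns of R in the first sweep, as the Update operator does.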
V. OUR HESTENES-JACOBI ARCHITECTURE

The Hestenes-Jacobi SVD process primarily consists of three computations: (1) calculating the squared 2-norms of the vectors and the covariances between vector pairs; (2) performing Jacobi rotations with paired squared 2-norms and their respective covariance; and (3) updating the rotated vector elements and the affected covariances.

To implement the Hestenes-Jacobi SVD, we created three components: the Hestenes preprocessor, the Jacobi rotation component and the Update operator (shown in Fig. 1), all of which are pipelined. The Hestenes preprocessor is responsible for computing the squared column 2-norms and the associated covariances. The Jacobi rotation component is used to zero out a covariance by applying a plane rotation to its associated vectors. The Update operator is employed to update the affected columns and covariances.

The Hestenes-Jacobi SVD is an iterative diagonalization process, which performs the orthogonal transformations between every pair of columns over numerous iterations to achieve satisfactory convergence. To reduce the amount of computation, instead of repeatedly regenerating all the squared 2-norms and covariances in each iteration, our Hestenes-Jacobi SVD architecture calculates all the squared 2-norms and covariances only in the first iteration; those squared 2-norms and covariances are then directly updated and reused in the subsequent iterations. To reduce hardware resource usage, the Hestenes preprocessor is reconfigured to function as an additional Update operator after the first iteration. The square-root operator in the Jacobi rotation component is employed to finalize the SVD process, from which the singular values are produced. In addition, as shown in Fig. 1, FIFOs are employed to synchronize the computations and transmit data between the Hestenes preprocessor and the Update operator. Local BRAMs are used to hold the generated rotation angle parameters cos and sin, and the covariances whose computations have not been completed with the current vector pairing.

Fig. 1: Block diagram of the Hestenes-Jacobi SVD architecture.

A. Hestenes Preprocessor

The Hestenes preprocessor is responsible for calculating the squared column 2-norms and the covariances between column vectors; that is, it computes A_i^T · A_j. Considering the overall system performance, we have to balance the amount of parallel computation against the I/O requests. In the Hestenes preprocessor (shown in Fig. 2), multiple layers of pipelined multiplier-arrays are devised, in which operands are reused successively by all the multipliers in a multiplier-array to calculate the partial results of the various squared column 2-norms and their related covariances. The resulting product of a multiplier is summed with the results of its corresponding multiplications across the layers, whose operands are the matrix elements from the same columns. For example, in Fig. 2, the matrix element A_{i,j+3} is multiplied with A_{i,j} at the first layer; this product is then added to the product of A_{i+1,j+3} and A_{i+1,j} at the second layer, the product of A_{i+2,j+3} and A_{i+2,j} at the third layer, and the product of A_{i+3,j+3} and A_{i+3,j} at the fourth layer. The resulting sum is the partial result of the covariance between the jth and (j+3)th columns. Meanwhile, A_{i,j+3} moves leftwards to the adjacent multiplier for the multiplication of A_{i,j+3} and A_{i,j+1}, whose product is used for computing the covariance between the (j+1)th and (j+3)th columns.

Fig. 2: Example architecture of the Hestenes preprocessor.

The example input for a single multiplier-array with four multipliers is described in Fig. 3, in which the new operand requests for the multiplier-array are underlined. The dashed arrows highlight the data movement for reuse, and the dashed circles indicate the entered operands that are reused in later computations. In this case, a single layer needs four double-precision floating-point numbers as input in the starting cycle, and at most one double-precision floating-point number in every subsequent cycle, to perform the computations on a sub-vector. The computations on the different layers are initialized successively. Thus, 16 cycles of input are used to obtain the covariance matrix of an 8 × 8 matrix when 8 layers of multiplier-arrays are equipped. Additional adders are employed to accumulate the partial results of the covariances and squared 2-norms for vectors with lengths over 8.

Fig. 3: Example input to a single layer of multiplier-array.
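The layered multiplier-arrays admit a compact behavioral model: each pass over a block of rows contributes the partial products of one sub-vector, and accumulating across blocks yields the full covariance matrix D = A^T · A. The NumPy sketch below captures only this accumulation behavior; the operand-reuse dataflow of Figs. 2 and 3 is not modeled.

```python
import numpy as np

def covariance_by_row_blocks(A, rows_per_pass=8):
    """Behavioral model of the Hestenes preprocessor: the squared column
    2-norms (diagonal) and covariances (off-diagonal) of D = A^T A are
    accumulated from per-block partial results."""
    m, n = A.shape
    D = np.zeros((n, n))
    for r in range(0, m, rows_per_pass):
        block = A[r:r + rows_per_pass, :]   # rows fed through the layers
        D += block.T @ block                # partial 2-norms and covariances
    return D
```

This accumulation is the role of the additional adders mentioned above: they fold the per-block partial results together for column lengths over 8.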
B. Jacobi Rotation Component

The Jacobi rotation component performs the orthogonal transformation between two column vectors through a series of operations on their squared column 2-norms and the covariance between them. To calculate Jacobi rotations, the CORDIC (COordinate Rotation DIgital Computer) algorithm [22] is a popular choice in the research literature, due to its ability to perform complicated trigonometric functions efficiently through simple shift-and-add operations. Although CORDIC has been demonstrated to be a hardware-efficient algorithm for fixed-point operations, an efficient floating-point implementation is hindered by its inherent bit-width shift-and-add structure. As floating-point arithmetic has become increasingly popular in signal processing applications for its support of a much wider range of values than decimal fixed-point formats, our architecture is designed to perform floating-point calculations using pipelined IEEE-754 double-precision floating-point operators.

As described in Algo. 1, the Jacobi rotation of two column vectors is computed from their squared 2-norms and covariance through a series of additions, subtractions, multiplications, divisions and square roots. The Jacobi rotation equations can be represented as eq. (8), eq. (9) and eq. (10), where n1 and n2 represent the squared 2-norms of the column vectors and c_{1,2} represents the covariance between them.

    t = sign(n2 - n1) · 2c_{1,2} / (|n2 - n1| + sqrt((n2 - n1)² + 4c²_{1,2}))    (8)

    cos = 1 / sqrt(1 + t²)    (9)

    sin = cos · t    (10)

The calculated parameter t is then applied to update the squared 2-norms of the rotated vectors and to zero out their covariance. Fig. 4 demonstrates the computations of the Jacobi rotation, in which independent calculations can be processed simultaneously. To minimize hardware resource usage, the expensive floating-point computational cores are reused across those calculations. Once all the orthogonal transformations are completed, the square-root operator in the Jacobi rotation component is used to generate the singular values by applying it to the diagonal elements of the processed matrix.

Fig. 4: Dataflow of the Jacobi rotation procedure.

C. Update Operator

The Update operator is responsible for updating the column elements and covariances that are affected by the processed rotations. The generated rotation angle parameters cos and sin are employed to update the covariances before they are used by later rotations. The update process for a pair of elements consists of simple multiplications, additions and subtractions, as shown in eq. (11) and eq. (12).

    A'_i = A_i · cos + A_j · sin    (11)

    A'_j = -A_i · sin + A_j · cos    (12)

The architecture of a single update kernel is shown in Fig. 5, in which pipelined multipliers, an adder and a subtractor are employed. Multiple update kernels are included in the Update operator; the number of update kernels that can be allocated to a single chip is determined by the resource capacity of the chip. This number determines the efficiency of the system, especially for large-scale matrices, where performance is dominated by the amount of updating after each rotation. The convergence of the SVD requires the orthogonal transformation of the matrix to be performed over numerous iterations. Both the individual column elements and the covariances have to be updated in the first iteration; in the subsequent iterations, only the covariances are operated on. To optimize the use of hardware resources, the Hestenes preprocessor can be reconfigured to function as multiple update kernels.

Fig. 5: The architecture of a single update kernel.
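The following NumPy sketch traces one rotation through eqs. (8)-(12), with the sign conventions as reconstructed above (the equal-norm corner case n1 = n2, which calls for a 45-degree rotation, is noted but not handled): it computes the angle parameters from a pair of squared 2-norms and their covariance, applies one update kernel, and verifies that the rotated columns are orthogonal.

```python
import numpy as np

def rotation_params(n1, n2, c12):
    """eqs. (8)-(10): t, cos and sin from the squared 2-norms n1 = ||a_j||^2
    and n2 = ||a_i||^2 and the covariance c12 = a_i . a_j.
    (For n1 == n2 a 45-degree rotation, t = +/-1, would be needed instead.)"""
    t = np.sign(n2 - n1) * 2.0 * c12 / (
        abs(n2 - n1) + np.sqrt((n2 - n1) ** 2 + 4.0 * c12 ** 2))
    cos = 1.0 / np.sqrt(1.0 + t * t)
    return t, cos, cos * t

def update_kernel(ai, aj, cos, sin):
    """One update kernel (Fig. 5): eqs. (11)-(12), vectorized over a column pair."""
    return ai * cos + aj * sin, -ai * sin + aj * cos

ai = np.array([3.0, 1.0, 2.0])
aj = np.array([1.0, -1.0, 4.0])
t, c, s = rotation_params(aj @ aj, ai @ ai, ai @ aj)
bi, bj = update_kernel(ai, aj, c, s)
print(bi @ bj)  # ~0: the covariance of the rotated pair is zeroed out
```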

D. The Cyclic Order for Vector Pairing

The order of vector pairing determines the speed, and even the feasibility, of convergence. In our design, we employ cyclic ordering, which has been demonstrated to achieve convergence efficiently [9]. Fig. 6 demonstrates the cyclic ordering with 32 vectors, in which the numbers represent the column indexes and the arrows indicate the direction in which the indexes move to form the new vector pairs. In the cyclic ordering, each column has to be paired with every other column. The paired vectors are highlighted by solid boxes in Fig. 6. Due to the limited hardware resources on a single chip, only a limited number of vector pairs can be operated on simultaneously. In Fig. 6, a dashed box highlights a group of vector pairs whose computations can be performed in parallel. All the vector-pair groups enter our Hestenes-Jacobi architecture successively.

Fig. 6: Demonstration of the employed cyclic order for vector pairing.
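One standard construction of such a cyclic ordering is round-robin index rotation, sketched below in Python. The exact index movement of Fig. 6 may differ in detail, but the property is the same: within each group the pairs are independent, and across one sweep every column meets every other column exactly once.

```python
def cyclic_pair_groups(n):
    """Yield n-1 groups of n/2 independent column pairs (n even); across a
    full sweep, every pair (i, j) is produced exactly once."""
    cols = list(range(n))
    for _ in range(n - 1):
        yield [(cols[k], cols[n - 1 - k]) for k in range(n // 2)]
        # rotate all indexes but the first, as in a round-robin schedule
        cols = [cols[0], cols[-1]] + cols[1:-1]

for group in cyclic_pair_groups(8):
    print(group)   # the pairs within each group can be processed in parallel
```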
VI. EXPERIMENTS AND EVALUATIONS

A. Implementation and experimental setup

To evaluate the performance of our Hestenes-Jacobi design, a single Xilinx Virtex-5 XC5VLX330 FPGA on our Convey HC-2 system [23] is used to implement the architecture. In our implementation, the double-precision floating-point computational cores are generated with the Xilinx Coregen generator [24]. In the Hestenes preprocessor, four layers of multiplier-arrays are implemented, using 16 multipliers and 16 adders. To improve the computational intensity on the limited hardware resources, the Hestenes preprocessor calculates all the squared 2-norms and covariances in the first iteration of the orthogonalization, and it is then reconfigured as four update kernels with 16 multipliers and 8 adders in the remaining iterations. To perform the Jacobi rotations, 1 multiplier, 2 adders, 1 divider and 1 square-root calculator are used, which can start 8 independent Jacobi rotations every 64 clock cycles. Additionally, an array of eight update kernels is implemented in the Update operator, containing 32 multipliers and 16 adders or subtractors. The IP-core-generated computational cores are configured with default latencies of 9, 14, 57 and 57 clock cycles for the multiplier, the adder or subtractor, the divider and the square-root calculator, respectively. Two groups of eight 64-bit-wide FIFOs are programmed to synchronize the input and output, while a group of eight 127-bit-wide FIFOs is used for the data transmission between the Hestenes preprocessor and the Update operator. Simple dual-port RAMs are employed to temporarily cache the rotation angle parameters and some covariances. The whole covariance matrix can be stored in local memory for matrices with a column dimension no greater than 256. The system is tested at 150 MHz for 6 iterations, which is believed sufficient for achieving convergence under reasonable thresholds. A software model is also implemented in Matlab to conduct the convergence evaluation.

B. Performance analysis

We experimented with both square and rectangular matrices of various dimensions; the measured performance is summarized in Table I. The experimental results demonstrate that the execution time grows significantly as the number of matrix columns increases, since the column count determines the number of covariances, whose computation dominates the overall performance. Comparably, the number of rows, which only affects the execution time of the Hestenes preprocessing, has a smaller impact on performance. When the matrix column size grows beyond 256, the performance is increasingly affected by the I/O bandwidth, due to the increased covariance traffic between our Hestenes-Jacobi architecture and the off-chip memory.

Comparisons of execution times have been made between our implementation, experimental results from the published literature [7], and a Matlab SVD routine.

In Fig. 7 and Fig. 8, the performance of our design is compared against the Matlab 7.10.0 SVD routine running on a 2.2 GHz dual-core Intel Xeon processor, an SVD solution using Intel MKL 10.0.4, and an NVIDIA 8800 GPU with 128 stream processors [7]. The data points in Fig. 7 show that our architecture is more efficient than the other software solutions for matrices with dimensions under 512; our execution slows down when the dimensions grow beyond 512 due to the limits of our chosen platform's I/O throughput. In Fig. 8, the comparison is made between matrices with identical column dimensions but various row sizes, which indicates that growth in the number of rows causes a relatively slow increase in execution time, since the quantity of covariances is determined by the column size.

Fig. 7: SVD computation time (in seconds) for square matrices by our Hestenes-Jacobi architecture, Matlab, Intel MKL and GPU.

Fig. 8: SVD computation time (in seconds) for rectangular matrices by our Hestenes-Jacobi architecture, Matlab, Intel MKL and GPU.

In Fig. 9, the dimension-dependent speedups of our FPGA-based Hestenes-Jacobi SVD over the Matlab SVD solution running on the Intel platform are presented. Our Hestenes-Jacobi architecture shows better efficiency than the standard software solution in analyzing matrices with small to medium column dimensions, even when they have comparably large row dimensions. The dimension-dependent speedups that can be achieved range from 3.8× to 43.6× for matrices with column sizes from 128 to 256 and row dimensions from 128 to 2048. Table II shows the resource utilization of our Hestenes-Jacobi architecture.

Fig. 9: Speedup of our Hestenes-Jacobi SVD compared to Matlab SVD.

TABLE I: Execution time in seconds.

    m\n      128          256          512          1024
    128      4.39×10⁻³    2.52×10⁻²    1.70×10⁻¹    1.23
    256      6.30×10⁻³    3.30×10⁻²    2.01×10⁻¹    1.35
    512      1.01×10⁻²    4.84×10⁻²    2.63×10⁻¹    1.61
    1024     1.79×10⁻²    7.94×10⁻²    3.87×10⁻¹    2.01

TABLE II: Resource consumption of our Hestenes-Jacobi architecture.

    Resource            Slice LUTs    BRAM    DSPs
    Our architecture    89%           91%     53%

Compared to the experimental results of the latest GPU-based and FPGA-based Hestenes-Jacobi implementations, our architecture shows the best overall performance [11], [12]. The GPU-based implementation [11], which took 106.90 ms and 1022.92 ms to decompose a 128 × 128 and a 256 × 256 matrix respectively, failed to achieve any speedup over a conventional software solution. The FPGA-based design [12] was devised to perform fixed-point operations and can only analyze matrices with sizes up to 32 × 128 due to the limitation of on-chip memory. Although it demonstrated better performance than its accompanying software execution with the Matlab SVD for matrices with dimensions ranging from 2 × 2 to 32 × 127, our Matlab SVD routine runs 100 times faster than theirs and is comparable in speed to their FPGA-based design. In a further comparison to [12], which needs 24.3143 ms to decompose its largest analyzed matrix of dimensions 32 × 127, the execution time for a 128 × 128 matrix on our architecture represents more than a 5× speedup.

C. Convergence Analysis

SVD is a process of diagonalizing a matrix through iterative rotations; to evaluate the correctness and accuracy of the SVD computation, the convergence of the orthogonalization process is examined for matrices of different dimensions (Fig. 10).

Fig. 10: Convergence process of different dimensional matrices.
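As one concrete convergence measure (an illustrative assumption here; the criterion plotted in Fig. 10 is not reproduced in this text), the sweeps can be monitored through the norm of the off-diagonal part of the covariance matrix, which the rotations drive toward zero:

```python
import numpy as np

def off_diagonal_norm(D):
    """Frobenius norm of the off-diagonal part of the covariance matrix D;
    sweeping until this falls below a threshold is one common stopping
    criterion for one-sided Jacobi (an assumption, not the paper's exact test)."""
    return np.sqrt(np.sum(D * D) - np.sum(np.diag(D) ** 2))
```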

REFERENCES

[2] Y. Mu, J. Dong, X. Yuan, and S. Yan, "Accelerated low-rank visual recovery by random projection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2011, pp. 2609–2616.
[3] R. Liao, Y. Fernandess, K. Tavernier, G. Taylor, and M. Irving, "Recognition of partial discharge patterns," in Proceedings of the IEEE Power and Energy Society General Meeting, July 2012, pp. 1–8.
[4] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" The Computing Research Repository, vol. 0912.3599, 2009.
[5] L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM: Society for Industrial and Applied Mathematics, 1997.
[6] G. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix," Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965.
[7] S.

