An FPGA Implementation of the Hestenes-Jacobi Algorithm for Singular Value Decomposition


2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops

An FPGA Implementation of the Hestenes-Jacobi Algorithm for Singular Value Decomposition

Xinying Wang and Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University, Ames, Iowa, USA
Email: {xinying, zambreno}@iastate.edu

Abstract—As a useful tool for dimensionality reduction, Singular Value Decomposition (SVD) plays an increasingly significant role in many scientific and engineering applications. The high computational complexity of SVD poses challenges for efficient signal processing and data analysis systems, especially for time-sensitive applications with large data sets. While the emergence of FPGAs provides a flexible and low-cost opportunity to pursue high-performance SVD designs, the classical two-sided Jacobi rotation-based SVD architectures are restricted in terms of scalability and input matrix representation. The Hestenes-Jacobi algorithm offers a more parallelizable solution for analyzing arbitrary rectangular matrices; however, to date both FPGA-based and GPU-based implementations have not lived up to the algorithm's potential. In this paper, we introduce a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices. Our implementation on an FPGA-based hybrid acceleration system demonstrates improved efficiency compared to an optimized software-based SVD solution for matrices with small to medium column dimensions, even with comparably large row dimensions. The achievable dimension-dependent speedups range from 3.8× to 43.6× for matrices with column dimensions from 128 to 256 and row sizes from 128 to 2048. Additionally, we evaluate the accuracy of our SVD process through convergence analysis.

Keywords: Architecture, FPGA, Singular Value Decomposition, Hestenes-Jacobi Algorithm.

I. INTRODUCTION

In many real-world applications, data dimensionality is rapidly growing. Principal Component Analysis (PCA) is widely employed to reduce the dimensions of data in high-dimensional spaces. Among the classical solutions for PCA, Singular Value Decomposition (SVD) is the most popular technique for approximating high-dimensional data through orthogonal transformations. SVD-based PCA has been used in many signal processing applications such as image processing, computer vision, pattern recognition and remote sensing [1]–[3]. However, SVD is a computationally expensive procedure, which makes it difficult to meet the requirements of many time-sensitive designs, especially when it is invoked iteratively in those applications. For instance, in a video surveillance application [4], it takes 185.2 seconds to recover a square matrix of dimension 3000 by running partial SVD 15 times, which makes it difficult to satisfy stringent real-time performance requirements. As data dimensionality continues to increase, the runtime of SVD is likely to grow substantially further.

The SVD operation diagonalizes an arbitrary m × n matrix through a series of orthogonal transformations [5]. Optimized software implementations (e.g., MATLAB, LAPACK) employ the Householder transformation [6] to perform the SVD computation; however, their performance is restricted by their inherent computational complexity and high data dependency. Highly parallel accelerators such as Graphics Processing Units (GPUs) and multi-core platforms have been employed to explore parallel implementations, although these previous works only achieved speedups when the input matrices have dimensions greater than 1000 [7], [8].

In the reconfigurable architecture community, systolic arrays have been implemented on modern FPGAs to compute the classic two-sided Jacobi rotations [9], [10]. Although improved performance has been demonstrated, the scalability of those implementations is limited, and the designs are restricted to handling only square input matrices.

Hestenes [10] discovered the equivalence between zeroing out an off-diagonal element a_ij and orthogonalizing the ith and jth vectors through a plane rotation. Instead of annihilating every non-zero off-diagonal element by rotating 2 × 2 matrices, the Hestenes-Jacobi method is capable of decomposing an arbitrary m × n non-square matrix through vector computations. GPUs and FPGAs have been employed to evaluate parallel Hestenes-Jacobi designs; however, the performance has suffered from iterative thread synchronizations (in the case of GPUs [11]) and repeated calculations (in the case of FPGA implementations [12]).

In this paper, we present an FPGA-based hardware design of the Hestenes-Jacobi algorithm for SVD with floating-point arithmetic, which can analyze an arbitrary m × n matrix. Compared to a previous FPGA-based Hestenes-Jacobi implementation [12], our architecture optimizes the calculations by improving data reuse, and employs IEEE-754 double-precision floating-point operators to provide a wider dynamic range. Also, off-chip memory is employed to remove the restriction on analyzable matrix dimensions. Our experimental results demonstrate the efficiency of our design for matrices with small to medium column dimensions, even when they have comparably large row dimensions. The dimension-dependent speedups that can be achieved range from 3.8× to 43.6× for matrices with column sizes from 128 to 256 and row dimensions from 128 to 2048. Compared to other GPU-based and FPGA-based implementations of Hestenes-Jacobi SVD, our architecture is currently the fastest in terms of overall performance. We have also evaluated the accuracy of our approach through analysis of the convergence properties.

The remainder of this paper is organized as follows. Section II introduces the theoretical background of SVD, while related work is discussed in Section III. Section IV presents the modified Hestenes-Jacobi algorithm, and Section V introduces our novel hardware architecture for the Hestenes-Jacobi algorithm, including detailed descriptions of the individual components. The evaluation of the performance and accuracy of our architecture is presented in Section VI. Finally, the paper is concluded in Section VII with a preview of planned future work.

II. THEORETICAL BACKGROUND

A. Singular Value Decomposition (SVD)

The Singular Value Decomposition transforms an m × n matrix into the product of an m × m orthogonal matrix, an m × n diagonal matrix holding the singular values, and the transpose of an n × n orthogonal matrix [5], in the form of eq. (1).

    A_{m×n} = U_{m×m} · Σ_{m×n} · V^T_{n×n}    (1)

B. Classic Two-sided Jacobi Rotations

Jacobi rotations are widely used in diagonalizing matrices. To perform a Jacobi rotation, the rotation matrices J^l and J^r are applied to the matrix from both sides, as shown in eq. (2).

    A_{i+1} = J_i^l · A_i · J_i^r    (2)

By applying the Jacobi rotation matrices to a 2 × 2 matrix, the off-diagonal elements are annihilated as in eq. (3).

    J^l · [ A_pp  A_pq ; A_qp  A_qq ] · J^r = [ A''_pp  0 ; 0  A''_qq ]    (3)

The Jacobi rotation matrices are generated through eq. (4) and eq. (5), where θ represents the plane rotation angle α or β [9].

    J_pp = cos(θ);  J_pq = sin(θ)  (p < q);  J_qp = -sin(θ)  (p < q);
    J_qq = cos(θ);  J_ii = 1  (i ≠ p, q);  J_ij = 0 otherwise.    (4)

    β - α = arctan((A_qp + A_pq) / (A_qq - A_pp));
    β + α = arctan((A_qp - A_pq) / (A_qq + A_pp))    (5)

To process the SVD, Jacobi rotations are calculated on every 2 × 2 submatrix to zero out all the non-zero off-diagonal elements. The calculation of an independent 2 × 2 Jacobi rotation only affects two rows and two columns of the matrix, which provides an opportunity for parallel designs that simultaneously perform independent 2 × 2 Jacobi rotations. Due to the nature of the 2 × 2 Jacobi rotations, the input matrix is restricted to square dimensions.

C. Hestenes-Jacobi Method

In [10], Hestenes observed that the annihilation of a matrix element is equivalent to orthogonalizing two column vectors. Instead of directly annihilating non-zero off-diagonal elements, the Hestenes-Jacobi algorithm (also known as the one-sided Jacobi method) performs the matrix decomposition through iterative orthogonal transformations between every pair of column vectors. In the Hestenes-Jacobi method, the matrix is orthogonalized by columns through post-multiplication by an orthogonal matrix, which is generated as a product of plane rotations, as in eq. (6).

    A · V = B,  where b_i^T · b_j = 0 for i ≠ j    (6)

Next, matrix B is factored as B = (B · Σ^{-1}) · Σ, in which Σ is a diagonal matrix whose diagonal elements are the column norms of B. Then, by setting U = B · Σ^{-1}, eq. (6) can be rewritten as eq. (7), which is the result form of the SVD.

    A = U · Σ · V^T    (7)

Compared to the classic two-sided Jacobi rotation approach, the Hestenes-Jacobi method is capable of analyzing an arbitrary rectangular matrix.
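As a concrete illustration of eqs. (2)-(5), the following NumPy sketch diagonalizes a single 2 × 2 matrix. It is an illustrative model using the sign conventions as reconstructed above; the resulting diagonal entries come out unordered and unnormalized in sign.

```python
import numpy as np

def two_sided_jacobi_2x2(A):
    """Annihilate the off-diagonals of a 2x2 matrix via eqs. (2)-(5).

    The angles alpha (left) and beta (right) are recovered from the
    angle-sum/difference relations of eq. (5); atan2 avoids division
    by zero when A_qq = A_pp or A_qq = -A_pp.
    """
    app, apq, aqp, aqq = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    diff = np.arctan2(aqp + apq, aqq - app)      # beta - alpha
    summ = np.arctan2(aqp - apq, aqq + app)      # beta + alpha
    alpha, beta = (summ - diff) / 2.0, (summ + diff) / 2.0
    J = lambda t: np.array([[np.cos(t), np.sin(t)],
                            [-np.sin(t), np.cos(t)]])    # eq. (4)
    return J(alpha) @ A @ J(beta)  # eq. (2); diagonal as in eq. (3)

# off-diagonals vanish up to rounding; prints diag(1, 3) here
print(two_sided_jacobi_2x2(np.array([[2.0, 1.0], [1.0, 2.0]])))
```

For the symmetric input shown, the result is diag(1, 3), matching the eigenvalues, as eq. (3) requires.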
III. RELATED WORK

In recent years, the significant surge in data dimensionality has made applications of SVD nearly ubiquitous [13]. For computing the SVD, the Householder transformation-based method and the Jacobi rotation-based method have demonstrated satisfactory stability and accuracy [14], [15]. The Householder transformation [6], [16] is capable of efficiently bi-diagonalizing matrices through vector computations, and is then followed by iterative implicit QR factorization [17] or divide-and-conquer iterations [18] to generate the singular values. In the Householder transformation-based method, the SVD process is dominated by the calculation of the Householder vectors and their respective updates, whose performance improvement is limited by the inherent data dependency. To parallelize the Householder transformation, implementations have been demonstrated on GPUs [7], [11] and multi-core platforms [8]; the GPU-based designs achieve acceleration only for matrices with significantly large dimensions due to the iterative thread synchronization, while the performance of the multi-core implementation is dominated by task splitting and communication overhead.

The emergence of reconfigurable fabrics such as FPGAs introduces low-cost solutions for parallelizing the algorithm at operand-level granularity. To perform the SVD, 1-dimensional or 2-dimensional systolic arrays have been employed to parallelize the classic two-sided Jacobi rotation algorithm [9], [19]–[21]. Exploiting the independence of the 2 × 2 rotations, a highly parallel 2-dimensional systolic array can map the classic two-sided Jacobi rotation algorithm onto FPGA devices with a computational complexity of O(n log n) for an n-by-n square matrix. However, to fit the architecture on a single chip, the scalability is limited, as n² processing elements (PEs) are needed by the systolic array implementation.

Compared to the classical Jacobi rotation approach, the Hestenes-Jacobi algorithm provides a more flexible solution for analyzing rectangular matrices. To explore high-performance SVD designs, FPGAs and GPUs have been employed to demonstrate parallel implementations of the Hestenes-Jacobi SVD algorithm [10]; however, the performance has suffered from iterative thread synchronizations in the case of the GPU implementation [11], and from an iterative design with duplicated computations in the case of the FPGA implementation [12].

IV. MODIFIED HESTENES-JACOBI ALGORITHM

As previously mentioned, the Hestenes-Jacobi algorithm computes the SVD by orthogonalizing every pair of column vectors. Instead of directly performing element-wise operations to annihilate an off-diagonal element, the Hestenes-Jacobi method applies an orthogonal transformation between the two vectors whose indexes equal the row and column indexes of that off-diagonal element. To orthogonalize a pair of vectors, a Jacobi rotation is computed from the squared 2-norms of the vectors and the covariance between them.

In the Hestenes-Jacobi process (detailed in Algo. 1), the orthogonalization of two column vectors starts with the calculation of their squared 2-norms and their covariance. A Jacobi rotation is then performed with the calculated squared 2-norms and covariance, after which the elements of the two column vectors are updated by applying the generated rotation angle parameters. At runtime, pairwise orthogonalizations are performed iteratively until satisfactory convergence is reached. The singular values are obtained as the square roots of the diagonal elements of the resulting matrix.

To optimize the algorithm by reducing the amount of computation, the squared 2-norms of the rotated vectors and their associated covariances are updated directly after each rotation. Thus, the repeated regeneration of squared 2-norms and covariances becomes unnecessary. In Algo. 1, matrix D is the covariance matrix, whose diagonal and off-diagonal elements are the squared 2-norms of the column vectors and the covariances between them, respectively.

Algorithm 1: Singular Value Decomposition via the Modified Hestenes-Jacobi Algorithm

    Input: matrix A
    Output: singular value vector Sig
    R ← A
    /* Generate the squared 2-norms of the column vectors and their associated covariances */
    for i = 1 to NumofColumn do
        for j = i to NumofColumn do
            D_{i,j} ← R_i^T · R_j
    repeat
        for i = 1 to NumofColumn - 1 do
            for j = i + 1 to NumofColumn do
                /* Generate the Jacobi rotation angle parameters from the squared 2-norms of the column vectors and their respective covariance */
                norm1 ← D_{j,j};  norm2 ← D_{i,i};  cov ← D_{i,j}
                ρ ← (norm2 - norm1) / (2 · cov)
                t ← sign(ρ) / (|ρ| + sqrt(1 + ρ²))
                cos ← 1 / sqrt(1 + t²)
                sin ← cos · t
                /* Update the squared 2-norms affected by the rotation */
                D_{j,j} ← D_{j,j} - t · cov
                D_{i,i} ← D_{i,i} + t · cov
                cov ← 0
                /* Update the covariances affected by the rotation (right-hand sides use the pre-rotation values) */
                for k = 1 to i - 1 do
                    D_{k,i} ← D_{k,i} · cos + D_{k,j} · sin
                    D_{k,j} ← -D_{k,i} · sin + D_{k,j} · cos
                for k = i + 1 to j - 1 do
                    D_{i,k} ← D_{i,k} · cos + D_{k,j} · sin
                    D_{k,j} ← -D_{i,k} · sin + D_{k,j} · cos
                for k = j + 1 to NumofColumn do
                    D_{i,k} ← D_{i,k} · cos + D_{j,k} · sin
                    D_{j,k} ← -D_{i,k} · sin + D_{j,k} · cos
    until convergence reached
    for i = 1 to min(NumofColumn, NumofRow) do
        Sig_i ← sqrt(D_{i,i})
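A software rendering of Algo. 1 may help clarify the data reuse: the covariance matrix D is formed once, and every subsequent rotation updates it in place. The NumPy sketch below is a functional model under stated assumptions (a fixed number of sweeps standing in for the convergence test, matching the 6 iterations used in the evaluation of Section VI, and the sign conventions as reconstructed above); it is not a description of the hardware mapping.

```python
import numpy as np

def hestenes_jacobi_singular_values(A, sweeps=6):
    """Functional model of Algo. 1: the covariance matrix D = R^T R is
    built once, then every pairwise rotation updates D in place, so the
    squared 2-norms and covariances are never recomputed from columns."""
    R = np.array(A, dtype=np.float64)
    n = R.shape[1]
    D = R.T @ R            # diagonal: squared 2-norms; off-diagonal: covariances
    for _ in range(sweeps):            # "repeat ... until convergence reached"
        for i in range(n - 1):
            for j in range(i + 1, n):
                cov = D[i, j]
                if cov == 0.0:
                    continue           # the pair is already orthogonal
                rho = (D[i, i] - D[j, j]) / (2.0 * cov)
                t = (1.0 if rho >= 0.0 else -1.0) / (
                    abs(rho) + np.sqrt(1.0 + rho * rho))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                D[i, i] += t * cov     # update the squared 2-norms
                D[j, j] -= t * cov
                D[i, j] = D[j, i] = 0.0  # the covariance is zeroed out
                for k in range(n):     # update the affected covariances
                    if k != i and k != j:
                        dki, dkj = D[k, i], D[k, j]
                        D[k, i] = D[i, k] = dki * c + dkj * s
                        D[k, j] = D[j, k] = -dki * s + dkj * c
    return np.sort(np.sqrt(np.maximum(np.diag(D), 0.0)))[::-1]

A = np.random.default_rng(0).standard_normal((64, 16))
print(np.allclose(hestenes_jacobi_singular_values(A),
                  np.linalg.svd(A, compute_uv=False)))  # expect True here
```

The column-update step of Algo. 1 is omitted in this sketch because D alone determines the singular values; a hardware-faithful model would also rotate the columns of R in the first sweep, as the Update operator does.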
V. OUR HESTENES-JACOBI ARCHITECTURE

The Hestenes-Jacobi SVD process primarily consists of three computations: (1) calculating the squared 2-norms of the vectors and the covariances between vector pairs; (2) performing Jacobi rotations with paired squared 2-norms and their respective covariance; and (3) updating the rotated vector elements and the affected covariances.

To implement the Hestenes-Jacobi SVD, we created three components: the Hestenes preprocessor, the Jacobi rotation component and the Update operator (shown in Fig. 1), all of which are pipelined. The Hestenes preprocessor is responsible for computing the squared column 2-norms and the associated covariances. The Jacobi rotation component is used to zero out a covariance by applying a plane rotation to its associated vectors. The Update operator is employed to update the affected columns and covariances.

The Hestenes-Jacobi SVD is an iterative diagonalization process, which performs the orthogonal transformations between every pair of columns over numerous iterations to achieve satisfactory convergence. To reduce the amount of computation, instead of repeatedly regenerating all the squared 2-norms and covariances in each iteration, our Hestenes-Jacobi SVD architecture calculates all the squared 2-norms and covariances only in the first iteration; those squared 2-norms and covariances are then directly updated and reused in the subsequent iterations. To reduce hardware resource usage, the Hestenes preprocessor is reconfigured to function as an additional Update operator after the first iteration. The square-root operator in the Jacobi rotation component is employed to finalize the SVD process, from which the singular values are produced. In addition, as shown in Fig. 1, FIFOs are employed to synchronize the computations and transmit data between the Hestenes preprocessor and the Update operator. Local BRAMs are used to hold the generated rotation angle parameters cos and sin, and the covariances whose computations have not been completed with the current vector pairing.

Fig. 1: Block diagram of the Hestenes-Jacobi SVD architecture.

A. Hestenes Preprocessor

The Hestenes preprocessor is responsible for calculating the squared column 2-norms and the covariances between column vectors; that is, it computes A_i^T · A_j. Considering the overall system performance, we have to balance the amount of parallel computation against the I/O requests. In the Hestenes preprocessor (shown in Fig. 2), multiple layers of pipelined multiplier-arrays are devised, in which operands are reused successively by all the multipliers in a multiplier-array to calculate the partial results of the various squared column 2-norms and their related covariances. The resulting product of a multiplier is summed with the results of its corresponding multiplications across the layers, whose operands are the matrix elements from the same columns. For example, in Fig. 2, the matrix element A_{i,j+3} is multiplied with A_{i,j} at the first layer; this product is then added to the product of A_{i+1,j+3} and A_{i+1,j} at the second layer, the product of A_{i+2,j+3} and A_{i+2,j} at the third layer, and the product of A_{i+3,j+3} and A_{i+3,j} at the fourth layer. The resulting sum is the partial result of the covariance between the jth and (j+3)th columns. Meanwhile, A_{i,j+3} moves leftwards to the adjacent multiplier for the multiplication of A_{i,j+3} and A_{i,j+1}, whose product is used for computing the covariance between the (j+1)th and (j+3)th columns.

Fig. 2: Example architecture of the Hestenes preprocessor.

The example input for a single multiplier-array with four multipliers is described in Fig. 3, in which the new operand requests for the multiplier-array are underlined. The dashed arrows highlight the data movement for reuse, and the dashed circles indicate the entered operands that are reused in later computations. In this case, a single layer needs four double-precision floating-point numbers as input in the starting cycle, and at most one double-precision floating-point number in every subsequent cycle, to perform the computations on a sub-vector. The computations on the different layers are initialized successively. Thus, 16 cycles of input are used to obtain the covariance matrix of an 8 × 8 matrix when 8 layers of multiplier-arrays are equipped. Additional adders are employed to accumulate the partial results of the covariances and squared 2-norms for vectors with lengths over 8.

Fig. 3: Example input to a single layer of multiplier-array.
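The layered multiplier-arrays admit a compact behavioral model: each pass over a block of rows contributes the partial products of one sub-vector, and accumulating across blocks yields the full covariance matrix D = A^T · A. The NumPy sketch below captures only this accumulation behavior; the operand-reuse dataflow of Figs. 2 and 3 is not modeled.

```python
import numpy as np

def covariance_by_row_blocks(A, rows_per_pass=8):
    """Behavioral model of the Hestenes preprocessor: the squared column
    2-norms (diagonal) and covariances (off-diagonal) of D = A^T A are
    accumulated from per-block partial results."""
    m, n = A.shape
    D = np.zeros((n, n))
    for r in range(0, m, rows_per_pass):
        block = A[r:r + rows_per_pass, :]   # rows fed through the layers
        D += block.T @ block                # partial 2-norms and covariances
    return D
```

This accumulation is the role of the additional adders mentioned above: they fold the per-block partial results together for column lengths over 8.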
B. Jacobi Rotation Component

The Jacobi rotation component performs the orthogonal transformation between two column vectors through a series of operations on their squared column 2-norms and the covariance between them. To calculate Jacobi rotations, the CORDIC (COordinate Rotation DIgital Computer) algorithm [22] is a popular choice in the research literature, due to its ability to perform complicated trigonometric functions efficiently through simple shift-and-add operations. Although CORDIC has been demonstrated to be a hardware-efficient algorithm for fixed-point operations, an efficient floating-point implementation is hindered by its inherent bit-width shift-and-add structure. As floating-point arithmetic has become increasingly popular in signal processing applications for its support of a much wider range of values than decimal fixed-point formats, our architecture is designed to perform floating-point calculations using pipelined IEEE-754 double-precision floating-point operators.

As described in Algo. 1, the Jacobi rotation of two column vectors is computed from their squared 2-norms and covariance through a series of additions, subtractions, multiplications, divisions and square roots. The Jacobi rotation equations can be represented as eq. (8), eq. (9) and eq. (10), where n1 and n2 represent the squared 2-norms of the column vectors and c_{1,2} represents the covariance between them.

    t = sign(n2 - n1) · 2c_{1,2} / (|n2 - n1| + sqrt((n2 - n1)² + 4c²_{1,2}))    (8)

    cos = 1 / sqrt(1 + t²)    (9)

    sin = cos · t    (10)

The calculated parameter t is then applied to update the squared 2-norms of the rotated vectors and to zero out their covariance. Fig. 4 demonstrates the computations of the Jacobi rotation, in which independent calculations can be processed simultaneously. To minimize hardware resource usage, the expensive floating-point computational cores are reused across those calculations. Once all the orthogonal transformations are completed, the square-root operator in the Jacobi rotation component is used to generate the singular values by applying it to the diagonal elements of the processed matrix.

Fig. 4: Dataflow of the Jacobi rotation procedure.

C. Update Operator

The Update operator is responsible for updating the column elements and covariances that are affected by the processed rotations. The generated rotation angle parameters cos and sin are employed to update the covariances before they are used by later rotations. The update process for a pair of elements consists of simple multiplications, additions and subtractions, as shown in eq. (11) and eq. (12).

    A'_i = A_i · cos + A_j · sin    (11)

    A'_j = -A_i · sin + A_j · cos    (12)

The architecture of a single update kernel is shown in Fig. 5, in which pipelined multipliers, an adder and a subtractor are employed. Multiple update kernels are included in the Update operator; the number of update kernels that can be allocated to a single chip is determined by the resource capacity of the chip. This number determines the efficiency of the system, especially for large-scale matrices, where performance is dominated by the amount of updating after each rotation. The convergence of the SVD requires the orthogonal transformation of the matrix to be performed over numerous iterations. Both the individual column elements and the covariances have to be updated in the first iteration; in the subsequent iterations, only the covariances are operated on. To optimize the use of hardware resources, the Hestenes preprocessor can be reconfigured to function as multiple update kernels.

Fig. 5: The architecture of a single update kernel.
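The following NumPy sketch traces one rotation through eqs. (8)-(12), with the sign conventions as reconstructed above (the equal-norm corner case n1 = n2, which calls for a 45-degree rotation, is noted but not handled): it computes the angle parameters from a pair of squared 2-norms and their covariance, applies one update kernel, and verifies that the rotated columns are orthogonal.

```python
import numpy as np

def rotation_params(n1, n2, c12):
    """eqs. (8)-(10): t, cos and sin from the squared 2-norms n1 = ||a_j||^2
    and n2 = ||a_i||^2 and the covariance c12 = a_i . a_j.
    (For n1 == n2 a 45-degree rotation, t = +/-1, would be needed instead.)"""
    t = np.sign(n2 - n1) * 2.0 * c12 / (
        abs(n2 - n1) + np.sqrt((n2 - n1) ** 2 + 4.0 * c12 ** 2))
    cos = 1.0 / np.sqrt(1.0 + t * t)
    return t, cos, cos * t

def update_kernel(ai, aj, cos, sin):
    """One update kernel (Fig. 5): eqs. (11)-(12), vectorized over a column pair."""
    return ai * cos + aj * sin, -ai * sin + aj * cos

ai = np.array([3.0, 1.0, 2.0])
aj = np.array([1.0, -1.0, 4.0])
t, c, s = rotation_params(aj @ aj, ai @ ai, ai @ aj)
bi, bj = update_kernel(ai, aj, c, s)
print(bi @ bj)  # ~0: the covariance of the rotated pair is zeroed out
```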

D. The Cyclic Order for Vector Pairing

The order of vector pairing determines the speed, and even the feasibility, of convergence. In our design, we employ cyclic ordering, which has been demonstrated to achieve convergence efficiently [9]. Fig. 6 demonstrates the cyclic ordering with 32 vectors, in which the numbers represent the column indexes and the arrows indicate the direction in which the indexes move to form the new vector pairs. In the cyclic ordering, each column has to be paired with every other column. The paired vectors are highlighted by solid boxes in Fig. 6. Due to the limited hardware resources on a single chip, only a limited number of vector pairs can be operated on simultaneously. In Fig. 6, a dashed box highlights a group of vector pairs whose computations can be performed in parallel. All the vector-pair groups enter our Hestenes-Jacobi architecture successively.

Fig. 6: Demonstration of the employed cyclic order for vector pairing.
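One standard construction of such a cyclic ordering is round-robin index rotation, sketched below in Python. The exact index movement of Fig. 6 may differ in detail, but the property is the same: within each group the pairs are independent, and across one sweep every column meets every other column exactly once.

```python
def cyclic_pair_groups(n):
    """Yield n-1 groups of n/2 independent column pairs (n even); across a
    full sweep, every pair (i, j) is produced exactly once."""
    cols = list(range(n))
    for _ in range(n - 1):
        yield [(cols[k], cols[n - 1 - k]) for k in range(n // 2)]
        # rotate all indexes but the first, as in a round-robin schedule
        cols = [cols[0], cols[-1]] + cols[1:-1]

for group in cyclic_pair_groups(8):
    print(group)   # the pairs within each group can be processed in parallel
```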
VI. EXPERIMENTS AND EVALUATIONS

A. Implementation and experimental setup

To evaluate the performance of our Hestenes-Jacobi design, a single Xilinx Virtex-5 XC5VLX330 FPGA on our Convey HC-2 system [23] is used to implement the architecture. In our implementation, the double-precision floating-point computational cores are generated with the Xilinx Coregen generator [24]. In the Hestenes preprocessor, four layers of multiplier-arrays are implemented, using 16 multipliers and 16 adders. To improve the computational intensity on the limited hardware resources, the Hestenes preprocessor calculates all the squared 2-norms and covariances in the first iteration of the orthogonalization, and it is then reconfigured as four update kernels with 16 multipliers and 8 adders in the remaining iterations. To perform the Jacobi rotations, 1 multiplier, 2 adders, 1 divider and 1 square-root calculator are used, which can start 8 independent Jacobi rotations every 64 clock cycles. Additionally, an array of eight update kernels is implemented in the Update operator, containing 32 multipliers and 16 adders or subtractors. The IP-core-generated computational cores are configured with default latencies of 9, 14, 57 and 57 clock cycles for the multiplier, the adder or subtractor, the divider and the square-root calculator, respectively. Two groups of eight 64-bit-wide FIFOs are programmed to synchronize the input and output, while a group of eight 127-bit-wide FIFOs is used for the data transmission between the Hestenes preprocessor and the Update operator. Simple dual-port RAMs are employed to temporarily cache the rotation angle parameters and some covariances. The whole covariance matrix can be stored in local memory for matrices with a column dimension no greater than 256. The system is tested at 150 MHz for 6 iterations, which is believed sufficient for achieving convergence under reasonable thresholds. A software model is also implemented in Matlab to conduct the convergence evaluation.

B. Performance analysis

We experimented with both square and rectangular matrices of various dimensions; the measured performance is summarized in Table I. The experimental results demonstrate that the execution time grows significantly as the number of matrix columns increases, since the column count determines the number of covariances, whose computation dominates the overall performance. Comparably, the number of rows, which only affects the execution time of the Hestenes preprocessing, has a smaller impact on performance. When the matrix column size grows beyond 256, the performance is increasingly affected by the I/O bandwidth, due to the increased covariance traffic between our Hestenes-Jacobi architecture and the off-chip memory.

Comparisons of execution times have been made between our implementation, experimental results from the published literature [7], and a Matlab SVD routine.

In Fig. 7 and Fig. 8, the performance of our design is compared against the Matlab 7.10.0 SVD routine running on a 2.2 GHz dual-core Intel Xeon processor, an SVD solution using Intel MKL 10.0.4, and an NVIDIA 8800 GPU with 128 stream processors [7]. The data points in Fig. 7 show that our architecture is more efficient than the other software solutions for matrices with dimensions under 512; our execution slows down when the dimensions grow beyond 512 due to the limits of our chosen platform's I/O throughput. In Fig. 8, the comparison is made between matrices with identical column dimensions but various row sizes, which indicates that growth in the number of rows causes a relatively slow increase in execution time, since the quantity of covariances is determined by the column size.

Fig. 7: SVD computation time (in seconds) for square matrices by our Hestenes-Jacobi architecture, Matlab, Intel MKL and GPU.

Fig. 8: SVD computation time (in seconds) for rectangular matrices by our Hestenes-Jacobi architecture, Matlab, Intel MKL and GPU.

In Fig. 9, the dimension-dependent speedups of our FPGA-based Hestenes-Jacobi SVD over the Matlab SVD solution running on the Intel platform are presented. Our Hestenes-Jacobi architecture shows better efficiency than the standard software solution in analyzing matrices with small to medium column dimensions, even when they have comparably large row dimensions. The dimension-dependent speedups that can be achieved range from 3.8× to 43.6× for matrices with column sizes from 128 to 256 and row dimensions from 128 to 2048. Table II shows the resource utilization of our Hestenes-Jacobi architecture.

Fig. 9: Speedup of our Hestenes-Jacobi SVD compared to Matlab SVD.

TABLE I: Execution time in seconds.

    m\n      128          256          512          1024
    128      4.39×10⁻³    2.52×10⁻²    1.70×10⁻¹    1.23
    256      6.30×10⁻³    3.30×10⁻²    2.01×10⁻¹    1.35
    512      1.01×10⁻²    4.84×10⁻²    2.63×10⁻¹    1.61
    1024     1.79×10⁻²    7.94×10⁻²    3.87×10⁻¹    2.01

TABLE II: Resource consumption of our Hestenes-Jacobi architecture.

    Resource            Slice LUTs    BRAM    DSPs
    Our architecture    89%           91%     53%

Compared to the experimental results of the latest GPU-based and FPGA-based Hestenes-Jacobi implementations, our architecture shows the best overall performance [11], [12]. The GPU-based implementation [11], which took 106.90 ms and 1022.92 ms to decompose a 128 × 128 and a 256 × 256 matrix respectively, failed to achieve any speedup over a conventional software solution. The FPGA-based design [12] was devised to perform fixed-point operations and can only analyze matrices with sizes up to 32 × 128 due to the limitation of on-chip memory. Although it demonstrated better performance than its accompanying software execution with the Matlab SVD for matrices with dimensions ranging from 2 × 2 to 32 × 127, our Matlab SVD routine runs 100 times faster than theirs and is comparable in speed to their FPGA-based design. In a further comparison to [12], which needs 24.3143 ms to decompose its largest analyzed matrix of dimensions 32 × 127, the execution time for a 128 × 128 matrix on our architecture represents more than a 5× speedup.

C. Convergence Analysis

SVD is a process of diagonalizing a matrix through iterative rotations; to evaluate the correctness and accuracy of the SVD computation, the convergence of the orthogonalization process is examined for matrices of different dimensions (Fig. 10).

Fig. 10: Convergence process of different dimensional matrices.
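As one concrete convergence measure (an illustrative assumption here; the criterion plotted in Fig. 10 is not reproduced in this text), the sweeps can be monitored through the norm of the off-diagonal part of the covariance matrix, which the rotations drive toward zero:

```python
import numpy as np

def off_diagonal_norm(D):
    """Frobenius norm of the off-diagonal part of the covariance matrix D;
    sweeping until this falls below a threshold is one common stopping
    criterion for one-sided Jacobi (an assumption, not the paper's exact test)."""
    return np.sqrt(np.sum(D * D) - np.sum(np.diag(D) ** 2))
```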

REFERENCES

[2] Y. Mu, J. Dong, X. Yuan, and S. Yan, "Accelerated low-rank visual recovery by random projection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2011, pp. 2609–2616.
[3] R. Liao, Y. Fernandess, K. Tavernier, G. Taylor, and M. Irving, "Recognition of partial discharge patterns," in Proceedings of the IEEE Power and Energy Society General Meeting, July 2012, pp. 1–8.
[4] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" The Computing Research Repository, vol. 0912.3599, 2009.
[5] L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM: Society for Industrial and Applied Mathematics, 1997.
[6] G. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix," Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965.
[7] S.

