Low-Complexity FPGA Implementation of Compressive Sensing Reconstruction


2013 International Conference on Computing, Networking and Communications, Multimedia Computing and Communications Symposium

Low-Complexity FPGA Implementation of Compressive Sensing Reconstruction

Jerome L.V.M. Stanislaus and Tinoosh Mohsenin
Dept. of Computer Science & Electrical Engineering, University of Maryland, Baltimore County

Abstract—Compressive sensing (CS) is a novel technology which allows sampling of sparse signals below the Nyquist rate and reconstructing the signal using computationally intensive algorithms. Reconstruction algorithms are complex, and software implementations of them are extremely slow and power-consuming. In this paper, a low-complexity architecture for the reconstruction of compressively sampled signals is proposed. The algorithm used here is Orthogonal Matching Pursuit (OMP), which can be divided into two major processes: an optimization problem and a least square problem. The most complex part of OMP is solving the least square problem, and a scalable QR decomposition (QRD) core is implemented to perform this. A novel thresholding method is used to reduce the processing time of the optimization problem by at least 25%. The proposed architecture reconstructs a 256-length signal with a maximum sparsity of 8 using 64 measurements. The implementation on a Xilinx Virtex-5 FPGA runs at two clock rates (85 MHz and 69 MHz) and occupies 15% of the slices and 80% of the DSP cores. The total reconstruction for a 128-length signal takes 7.13 μs, which is 3.4 times faster than the state-of-the-art implementation.

I. Introduction

In many signal processing systems, the useful information is far less than the sampled data, and the redundant data are eliminated through compression. One such method, compressive sensing (CS), reduces the amount of data collected during acquisition [1], [2]. With compressive sampling, sparse signals [2] and images can be recovered from far fewer samples (a sub-Nyquist rate) than traditional Shannon sampling requires. Compressive sensing has received a lot of attention in communication, radar, and biomedical imaging applications.

In magnetic resonance imaging (MRI), where scan duration is directly proportional to the number of acquired samples, CS has the potential to dramatically decrease scan time [3], [4], [5]. Because information such as the boundaries of organs is very sparse in most MR images, compressed sensing makes it possible to reconstruct the same MR image from a very limited set of measurements, which significantly reduces the MRI scan duration.

CS has also received a lot of attention in radar imaging. Synthetic Aperture Radar (SAR) is an advanced radar imaging technique that provides long-distance, high-resolution images through finer spatial resolution than the normal beam scanning method. SAR images can be sparse or compressible in some representation, such as the wavelet transform. There have been several studies on CS theory and its applications in radar imaging [6], [7], [8], [9].

CS has two main stages: sampling and reconstruction. While sampling is performed at the transmitter, reconstruction is done at the receiver, as shown in Fig. 1. Compact signal representation using CS allows a lower-rate ADC implementation, which reduces power dissipation and memory requirements as well. However, CS reconstruction requires highly computationally intensive algorithms.

Consider an m-sparse signal x of dimension N × 1 and a measurement matrix Φ of dimension d × N, where d is the number of measurements to be taken.
Multiplying Φ by x yields y of dimension d × 1:

   y = Φx    (1)

Now y is processed to reconstruct the m values which form a close estimate of x, denoted x̂. Since exact reconstruction is NP-hard, efforts are made to find a close estimate of x. The signal can be sparse in any domain, not necessarily the sampling domain. Reconstruction requires heavy computation, and the complexity increases with the dimension of the signal; it also increases with the required accuracy and the total number of measurements. Several reconstruction algorithms have been proposed in recent years, and most of them are computationally intensive. Software implementations of these algorithms are time-consuming since they often require matrix multiplications, at which general-purpose processors are poor performers. These drawbacks create interest in hardware implementations for real-time reconstruction of CS signals. The two most widely used reconstruction algorithms are ℓ1-minimization and Orthogonal Matching Pursuit (OMP). The ℓ1-minimization algorithm is better in terms of accuracy, but its implementation is very complex and time-consuming. OMP is less complex [10]; it is a greedy algorithm that finds the most closely correlated values in each iteration. The complexity of the design increases with data length.

This paper uses a high-speed architecture for the OMP algorithm for a 256-length input vector, and the QR decomposition (QRD) process [11] is used for solving the least square problem. The architecture is based on the high-speed hardware architecture proposed by J. Stanislaus and T. Mohsenin [12]. A thresholding method is implemented in this paper that eliminates certain columns of the Φ matrix from the dot product calculation in further iterations. The design has been implemented on a Virtex-5 FPGA and the results are compared with the previous work.
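Before moving on, the sampling model of (1) can be made concrete with a short software sketch. This is illustrative only; the Gaussian random Φ and the specific random seed are assumptions, not the paper's exact setup, though the dimensions follow the paper (N = 256, d = 64, m = 8):

import numpy as np

# Minimal sketch of the CS sampling model in (1), assuming a Gaussian
# random measurement matrix.
rng = np.random.default_rng(0)
N, d, m = 256, 64, 8

# m-sparse signal x: m nonzero entries at random positions
x = np.zeros(N)
x[rng.choice(N, size=m, replace=False)] = rng.standard_normal(m)

Phi = rng.standard_normal((d, N)) / np.sqrt(d)  # d x N measurement matrix
y = Phi @ x                                     # d x 1 measurement vector, eq. (1)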

Fig. 1. Basic block diagram for compressive sensing.

Fig. 2. Basic diagram for OMP reconstruction: (a) find the m indices of Φ (64×256); (b) solve the least square problem for y (64×1) to produce the m values.

II. Background

Only a few studies are available on hardware implementations of the OMP algorithm [13], [14]. Both are FPGA implementations, where [13] uses a 128-length vector with a sparsity of 5. The same architecture would take considerably longer to compute a 256-length vector with a sparsity of 8, due to the path delay in finding the dot product and the division operations performed in the matrix inverse. GPUs have a bottleneck on memory bandwidth [15]. In the architecture proposed in [12], more than 90% of the computation time is taken by the dot product computation. The high-speed architecture explained in [12] is used as the base for this paper, and the new architecture design optimizations are discussed.

III. OMP Algorithm

Consider an m-sparse signal x sampled using a random matrix Φ, with y the sampled data. The ultimate goal is to find the m columns of Φ which contributed to y. To begin with, the residual R is initialized to y. In each iteration, the column of Φ is chosen which has the best correlation with R. The residual R is then updated by subtracting the correlation from R for the next iteration. This is repeated m times to find the m columns of Φ, and the estimated signal x̂ is obtained by solving an over-determined least square problem. The procedure is given below:

a. Initialize the residual R_0 = y, the index set Λ_0 = ∅, the column set Φ_0 = [ ], and the iteration counter t = 1
b. Find the index λ_t of the column of Φ most correlated with the residual by solving the optimization problem

   λ_t = arg max_{j=1…N} |⟨R_{t−1}, φ_j⟩|    (2)

c. Update the index set Λ_t and column set Φ_t:

   Λ_t = Λ_{t−1} ∪ {λ_t}    (3)
   Φ_t = [Φ_{t−1}  φ_{λ_t}]    (4)

d. Calculate the new residual according to

   R_t = R_{t−1} − (Φ_t · Φ_t^†) R_{t−1}    (5)

e. Increment t and return to step b if t is less than m
f. Solve the least square problem to find x̂ for the indices in Λ:

   x̂ = arg min_x ‖y − Φ_t x‖    (6)
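For reference, steps a–f map directly onto a few lines of software. The following Python sketch is our own illustration, not the paper's hardware: numpy's generic least-squares solver stands in for the QRD core described later.

import numpy as np

def omp(Phi, y, m):
    """Sketch of the OMP procedure (steps a-f); np.linalg.lstsq stands in
    for the hardware QRD least-square solver."""
    d, N = Phi.shape
    R = y.copy()                     # a. residual R_0 = y
    indices = []                     #    index set Lambda
    for _ in range(m):
        dots = np.abs(Phi.T @ R)     # b. correlations <R, phi_j>, eq. (2)
        lam = int(np.argmax(dots))   #    most correlated column
        indices.append(lam)          # c. update index/column sets, eqs. (3)-(4)
        sub = Phi[:, indices]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        R = y - sub @ coef           # d. new residual, eq. (5)
    x_hat = np.zeros(N)              # f. least square estimate, eq. (6)
    x_hat[indices] = coef
    return x_hat

With the Φ and y from the earlier sampling sketch, omp(Phi, y, 8) should recover the 8 nonzero entries of x.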
IV. Proposed Architecture

Figure 2 shows a high-level block diagram of OMP reconstruction. OMP reconstruction consists of two main stages, (a) and (b), and takes two inputs, Φ and y. Φ is the product of the measurement matrix (A) and the linear transformation matrix Ψ. This is done when the signal is not sparse in the sampling domain; for example, an image captured in the spatial domain is converted to the wavelet domain by the linear wavelet transformation function Ψ, and the image signal becomes sparse in the wavelet domain. The architecture works for a 256-length signal with a sparsity of 8. Detailed analysis of the work performed in [13] and [12] shows that the increased computation time is due to the dot product calculation in Fig. 2(a). The architecture proposed here implements a new thresholding method that cuts the dot product calculation by up to 25%, thus improving the reconstruction time of the overall architecture.

A. Improved OMP algorithm

Although the architecture explained in [12] is faster than the state-of-the-art implementation, 90% of its reconstruction time is consumed by the optimization problem of finding the m columns of Φ. Studies have been made to reduce the processing time of the optimization problem [16], but such methods are very complex to implement in hardware. The algorithm proposed here is scalable and less complex to implement in hardware.

B. Algorithm

As seen before, x is an m-sparse signal which is sampled by Φ, and y is the sampled data. To find the m columns of Φ, the dot product of two vectors of length d must be calculated N × m times. The new thresholding method proposed in this paper eliminates certain columns of Φ from the dot product calculation in further iterations. The threshold is chosen as a ratio α of the average of the absolute values of the dot products. The columns of Φ whose absolute dot product value is below the threshold are excluded from further iterations. This threshold is updated at the end of each iteration. It should be noted that the mean square error (MSE) of the estimated signal increases with α and attains its minimum when α is 0. The procedure is given below (Φ̂ denotes the set of columns of Φ still active):

a. Initialize the residual R_0 = y, the active column set Φ̂ = Φ, the threshold thresh_0 = 0, and the iteration counter t = 1
b. Find the index λ_t which is most correlated to Φ̂ by solving the optimization problem

   λ_t = arg max_{j=1…k} |⟨R_{t−1}, Φ̂_j⟩|,  where k = number of columns of Φ̂    (7)

c. Update the index set Λ_t and column set Φ_t:

   Λ_t = Λ_{t−1} ∪ {λ_t}    (8)
   Φ_t = [Φ_{t−1}  Φ̂_{λ_t}]    (9)

d. Calculate the new residual according to

   R_t = R_{t−1} − (Φ_t · Φ_t^†) R_{t−1}    (10)

e. Calculate the average of the dot products

   C_avg_t = (1/k) Σ_{j=1…k} |⟨R_{t−1}, Φ̂_j⟩|    (11)

f. Remove the columns whose dot product is less than the threshold:

   thresh_t = α · C_avg_t    (12)
   Φ̂ = Φ̂ \ {Φ̂_j : |⟨R_{t−1}, Φ̂_j⟩| < thresh_t}    (13)

g. Increment t and return to step b if t is less than m
h. Solve the least square problem to find x̂ for the indices in Λ:

   x̂ = arg min_x ‖y − Φ_t x‖    (14)
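The column-elimination idea in steps e and f can be grafted onto the earlier software sketch. Again, this is an illustrative sketch rather than the hardware datapath: the boolean active array plays the role of the 256-bit flag described later, and hardware tie-breaking details are not modeled.

import numpy as np

def omp_thresholded(Phi, y, m, alpha=0.25):
    """Sketch of the improved OMP (steps a-h): columns whose correlation
    falls below alpha times the average are dropped from later iterations."""
    d, N = Phi.shape
    R = y.copy()
    active = np.ones(N, dtype=bool)           # a. all columns active initially
    indices = []
    for _ in range(m):
        dots = np.zeros(N)
        dots[active] = np.abs(Phi[:, active].T @ R)  # b. only active columns, eq. (7)
        lam = int(np.argmax(dots))
        indices.append(lam)                          # c. eqs. (8)-(9)
        sub = Phi[:, indices]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        R = y - sub @ coef                           # d. eq. (10)
        c_avg = dots[active].mean()                  # e. eq. (11)
        active &= dots >= alpha * c_avg              # f. eqs. (12)-(13)
    x_hat = np.zeros(N)                              # h. eq. (14)
    x_hat[indices] = coef
    return x_hat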

Fig. 3. Detailed architecture diagram of the improved OMP reconstruction algorithm. (a) Find m indices of Φ: this block iterates 8 times to solve the optimization problem, eliminating columns of Φ that are poorly correlated to y; it operates at 85 MHz. (b) Solving the least square problem: this block finds the inverse of an 8×8 matrix after the 8 columns of Φ are found; it operates at 69 MHz.

Fig. 4. 16-bit leading zero calculator.

C. Solving Optimization Problem

The hardware is implemented for reconstructing a signal of length N = 256 with k = 64 measurements and sparsity m = 8, similar to the work in [12]. The data are 24-bit, in 10.14Q (10 integer bits and 14 fractional bits) fixed-point format. A series of 64 24-bit multipliers operate in parallel to perform the dot product. This multiplier block is resource-shared for solving the least square problem as well. The maximum index (λ_t) is updated on each clock cycle as given by (2). At the end of each iteration, the residual is updated by subtracting the correlation of the chosen column of Φ, as shown in (5). Figure 3(a) shows the block diagram for solving the optimization problem, which is repeated to find the m indices of Φ as given by (3) and (4).

The architecture for the new algorithm is designed for the same signal specification, that is, N = 256, k = 64 measurements, and a sparsity of m = 8. Each data word uses the 24-bit 10.14Q fixed-point format. The main change in the improved algorithm is the calculation of the average and the elimination of the columns of Φ whose dot product is below the threshold. To calculate the average of the dot products in each iteration, the absolute value of the dot product of Φ̂_j and R is added to an accumulator. Once the final sum is found, the average is calculated from it. Since fixed-point division in hardware has a large latency, the average is approximated. The approximate average for iteration t is calculated as follows:

   C_sum_t = Σ_{j=1…k} |Φ̂_j · R_{t−1}|    (15)
   C_sum_t = C_sum_t + (N − k) · |Φ̂_{λ_t} · R_{t−1}|    (16)
   C_avg_t = C_sum_t / N    (17)

The advantage of this approximation is that it avoids the fixed-point division by k required in (11), since division by N = 256 is carried out simply by shifting C_sum_t to the right by 8 bits. It is also apparent that the approximate average will be greater than the actual average. This means that the MSE of this architecture will be smaller for a fixed threshold (α).
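The shift-based average of (15)-(17) is easy to mimic in software. A minimal sketch, assuming (16) pads the sum with copies of the maximum dot product (as reconstructed above) and that the dot products are already available as plain integers in 10.14Q fixed-point format:

# Sketch of the approximate average in (15)-(17); division by N = 256
# is a right shift by 8 bits.
def approx_avg(dots_active, N=256):
    """dots_active: fixed-point |dot product| values of the active columns."""
    c_sum = sum(dots_active)                              # eq. (15)
    c_sum += (N - len(dots_active)) * max(dots_active)    # eq. (16), pad with max
    return c_sum >> 8                                     # eq. (17), divide by 256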

Fig. 5. Finding the next non-zero index of the 256-bit flag over 4 clock cycles; the first 41 bits are shown and the remaining bits are assumed to be 0s.

Fig. 6. Plots showing the variation of different parameters with the threshold (α) for the improved architecture: (a) reduction in cycle counts with increasing threshold; (b) reconstruction time as a function of α; (c) normalized MSE as a function of the threshold.

The improved architecture is built on top of the original architecture explained in Section III. A 256-bit flag is used to keep track of the active columns of Φ in any iteration; the ith bit of flag corresponds to the ith column of Φ. In the original hardware, the dot product is calculated every clock cycle for the ith column of Φ, and i is then incremented by 1. In the improved architecture, i is incremented to the next non-zero index of flag, skipping all eliminated columns of Φ in between. The next non-zero index of flag is found by sending the first 32 bits of flag to a 32-bit leading zeros calculator, whose 6-bit output (Z) gives the number of leading zeros. Figure 4 shows the logic for finding the number of leading zeros for a 16-bit input (I).

The flag is then shifted left by Z to point to the next non-zero index. A leading zero calculator for a larger bit size (> 32) increases the latency and thus decreases the operating frequency of the hardware. Figure 5 illustrates an example over 4 clock cycles. All bits of flag are set to 1 for the first iteration and are updated before each subsequent iteration. The remaining hardware blocks are kept the same as shown in Fig. 3. The detailed pseudo-code for the algorithm described above is given below:

   flag ← AllOnes
   for Iter = 1 to m do
       Index ← 1; Csum ← 0
       while Index ≤ N do
           DP[Index] ← R · Φ[Index]
           Csum ← Csum + |DP[Index]|
           LeadZeros ← LeadingZeros32(flag)
           Index ← Index + LeadZeros
           flag ← flag << LeadZeros    { left shift }
       end while
       Cthresh ← α · Csum / N
       for i = 1 to N do
           flag[i] ← (DP[i] < Cthresh) ? 0 : 1
       end for
   end for

where LeadingZeros32(X) gives the number of leading zeros of the first 32 bits of X.

D. Solving Least Square Problem

After finding the m columns of Φ, the next stage is to solve the least square problem given by (6). This often requires finding the inverse of an 8×8 matrix C, where C = Φ^T Φ. The main purpose of this is to solve for x̂ from:

   x̂ = (Φ^T Φ)^{−1} Φ^T y    (18)

The rest of the hardware is derived from [12] for its speed and efficiency at larger vector lengths. It includes the QRD method and also the fast inverse square root algorithm.
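For comparison, (18) solved via QR decomposition looks as follows in software. This is a sketch only: numpy's qr stands in for the scalable QRD core and fast inverse square root of [12], and the function name is our own.

import numpy as np

def solve_ls_qr(Phi_t, y):
    """Sketch of (18) via QR decomposition: Phi_t = QR reduces the least
    square problem to a triangular solve, avoiding an explicit inverse."""
    Q, R = np.linalg.qr(Phi_t)          # Phi_t: d x m (tall); R is m x m upper triangular
    return np.linalg.solve(R, Q.T @ y)  # x_hat = R^{-1} Q^T y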
V. FPGA Implementation and Analysis

The proposed compressive sensing reconstruction architecture is synthesized, placed, and routed on a Xilinx Virtex-5 XC5VLX110T FPGA device. The implementation of the improved OMP runs at two different clocks, 85 MHz and 69 MHz, for blocks (a) and (b), respectively, as shown in Fig. 2.

A. Hardware Implementation of Improved OMP

For a 256-length vector with sparsity m = 8 and threshold α = 0, it requires 2100 clock cycles to find the 8 columns of Φ and 160 cycles for the QRD block, giving a total reconstruction time of 27.14 μs. This confirms that the real bottleneck of the algorithm is finding the dot product of Φ and the residual R, which takes 256 × 8 = 2048 cycles.

For the improved OMP, the threshold (α) is varied from 0.05 to 0.5, and the performance of the hardware is shown in Fig. 6. For each α value, the experiment is repeated for 1000 different sets of 256-length input signals (x) of sparsity 8 and the average is taken. As expected, the cycle count for the improved OMP decreases with increasing α, as shown in Fig. 6(a).
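These cycle counts line up with the reported times. A quick sanity check, assuming block (a) runs at 85 MHz and block (b) at 69 MHz as stated above:

# Back-of-the-envelope timing check; cycle counts are from the text.
def recon_time_us(cycles_a, cycles_b, f_a=85e6, f_b=69e6):
    return (cycles_a / f_a + cycles_b / f_b) * 1e6

print(recon_time_us(2100, 160))   # basic OMP, alpha = 0    -> ~27 us
print(recon_time_us(1280, 160))   # improved OMP, alpha = 0.25 -> ~17.4 us

The first case gives roughly 27 μs, matching the 27.14 μs above; the second gives roughly 17.4 μs, consistent with the 17-18 μs reported below.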

It is also evident from Fig. 6(b) that the reconstruction time decreases with α, since block (a) of Fig. 2 contributes 90% of the reconstruction time. Figure 6(c) shows the trade-off in normalized MSE for the improved reconstruction time. The improved OMP hardware for a 128-length vector of sparsity 5 on the Xilinx Virtex-5 FPGA takes approximately 444 cycles to find the 5 columns of Φ with threshold α = 0.25. This gives a reconstruction time of 7.13 μs, which is 3.4 times faster than [13]. The approximate reconstruction time for a 256-length vector of sparsity 8 with α = 0.25 is 18 μs, because it takes approximately 1280 cycles to find the 8 columns of Φ, as opposed to 2048 cycles for the hardware implementing basic OMP [12]. Table I summarizes the hardware implementation results for the proposed method compared to the state-of-the-art results [12], [13].

TABLE I
Implementation summary for the proposed OMP reconstruction hardware on Virtex-5 FPGA. *Reconstruction time for a 128-length vector of sparsity 5. †Results for a threshold α = 0.25.

                              Septimus [13]   This work, Basic OMP   This work, Improved OMP†
   Operating Freq (MHz)       39              85, 69                 85, 69
   Slice LUTs                 NA              11%                    15%
   DSP Cores                  NA              80%                    80%
   Reconstruction Time* (μs)  24              10                     7.13
   Normalized MSE             0.0015          0.0015                 0.0071
   Speed Up                   -               2.4                    3.4

A reconstructed 256 × 256 image using the proposed method with α = 0.25 is shown in Fig. 7. The image is divided into 256 sub-images of dimension 16 × 16, and the reconstruction is performed sequentially as described in [17]. The reconstructed image is enhanced using a 6 × 6 median filter.

VI. Conclusion

This paper presents an architectural design and FPGA implementation of low-complexity compressive sensing reconstruction hardware. The proposed architecture supports vectors of length 256. A thresholding method is applied in the improved version to eliminate certain columns of the Φ matrix, reducing the dot product computation latency. The reconstruction time on the Xilinx Virtex-5 FPGA for a 128-length signal of sparsity 5 is 7.13 μs for the proposed improved OMP with threshold α = 0.25. This is 3.4 times faster than the state-of-the-art implementation. The same architecture with threshold α = 0.25 takes on average 17 μs to reconstruct a 256-length signal of sparsity 8, compared to the 27.12 μs reconstruction time of the unimproved architecture, as shown in Fig. 6(b).

Fig. 7. (a) Original image. (b) Reconstructed image (enhanced using a 6 × 6 median filter) using the proposed method, PSNR = 30.25 dB.

