FFT-Based Deep Learning Deployment in Embedded Systems


arXiv:1712.04910v1 [cs.LG] 13 Dec 2017

Sheng Lin1*, Ning Liu1*, Mahdi Nazemi2, Hongjia Li1, Caiwen Ding1, Yanzhi Wang1, and Massoud Pedram2
1 Dept. of Electrical Engineering & Computer Science, Syracuse University, Syracuse, NY, USA
2 Dept. of Electrical Engineering, University of Southern California, Los Angeles, CA, USA
1 {shlin,nliu03,hli42,cading,ywang393}@syr.edu, 2 {mnazemi,pedram}@usc.edu
*S. Lin and N. Liu contributed equally to this work.

Abstract—Deep learning has demonstrated its power in many application domains, especially in image and speech recognition. As the backbone of deep learning, deep neural networks (DNNs) consist of multiple layers of various types with hundreds to thousands of neurons. Embedded platforms are becoming essential for deep learning deployment due to their portability, versatility, and energy efficiency. The large model size of DNNs, while providing excellent accuracy, also burdens embedded platforms with intensive computation and storage. Researchers have investigated reducing DNN model size with negligible accuracy loss. This work proposes a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms, with reduced asymptotic complexity of both computation and storage, which distinguishes our approach from existing ones. We develop training and inference algorithms with FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms, achieving extraordinary processing speed.

I. INTRODUCTION

Recently, deep learning has stood out from traditional machine learning techniques in many application areas, especially image and speech recognition [1], [2]. The excellence of deep learning has also led to explorations of several emerging real-world applications, such as self-driving systems [3], automatic machine translation [4], and drug discovery and toxicology [5]. Deep learning is based on the structure of deep neural networks (DNNs), which consist of multiple layers of various types with hundreds to thousands of neurons in each layer. Recent evidence has revealed that network depth is of crucial importance to the success of deep learning, and many deep learning models for the challenging ImageNet dataset are sixteen to thirty layers deep [1]. Deep learning achieves a significant improvement in overall accuracy by extracting complex and high-level features, at the cost of considerable up-scaling of the model size.

In the big data era, and driven by the development of semiconductor technology, embedded systems are becoming an essential computing platform with ever-increasing functionality. At the same time, researchers from both academia and industry have devoted significant effort and resources to investigating, improving, and promoting applications of deep learning in embedded systems [6]. Despite the advantages in recognition accuracy, the deep layered structure and large model size of DNNs also increase computational complexity and memory requirements. Researchers face the following challenges when deploying deep learning models on embedded systems: (i) confined by the communication bandwidth of embedded systems, which are usually mobile terminals, it is still challenging to download large DNN models, even when they can be trained offline in data centers; and (ii) the large model size of deep learning also imposes stringent requirements on the computing resources and memory size of embedded systems.
Motivated by these challenges, it is intuitive to implement a reduced-size deep learning model with negligible accuracy loss. In fact, state-of-the-art DNNs are often over-parameterized, hence the removal of redundant parameters in deep learning models, if performed properly, produces similar overall accuracy as the original models [1]. Encouraged by this discovery, various deep learning model compression approaches have been investigated [6]-[10], including weight precision reduction, network pruning, and weight matrix factorization. In this work, we propose a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded systems due to its reduced asymptotic complexity of both computation and storage. Our approach has an obvious advantage over existing work on deep learning model compression, e.g., [6], [8], [9]: those approaches result in an irregular network architecture that increases training and inference computation time, whereas our approach facilitates computation. Please also note that our proposed framework is distinct from the prior work of LeCun et al. on using FFT for convolutional layer acceleration [11], because that work achieves only convolutional layer acceleration rather than simultaneous acceleration and compression. We develop training and inference algorithms with FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms. Experimental results demonstrate that our model achieves significant speedups in both the Java and C++ implementations.

II. RELATED WORK

Over the past decade, a substantial number of techniques and strategies have been proposed to compress neural network size. Weight pruning [6] is a well-known effective approach, in which weights close to zero are pruned to achieve a high compression ratio. Other techniques, such as threshold setting [6] and biased weight decay [9], can be integrated into the weight pruning procedure. Another simple and popular approach to DNN model compression is low-rank approximation of the weight matrix [12]. To overcome the potentially high accuracy loss after low-rank approximation, [13] proposed fine-tuning the factorized low-rank weight matrices to retain accuracy. Lowering the representation precision of weights is also a straightforward technique for reducing both the model size and the computation cost of DNNs. A fixed-point implementation was explored to replace the original floating-point models [14]. Furthermore, designs with ultra-low-precision weights, such as binary (-1/+1) or ternary (-1/0/+1) representations, have been proposed [15], [16]. By exploiting the local and global characteristics of the weight matrix, weight clustering was proposed to reduce the number of weights linearly [17]. In addition, with the aid of gradient clustering in the training phase, the accuracy loss incurred by weight clustering can be made negligible [6].
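As a concrete illustration of two of the surveyed ideas, the toy NumPy sketch below (ours, not taken from any of the cited works) applies magnitude-based weight pruning and a truncated-SVD low-rank approximation to a random weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # a toy FC weight matrix

# Weight pruning [6]: zero out weights whose magnitude falls below a threshold.
threshold = 0.5
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)
sparsity = float(np.mean(W_pruned == 0.0))   # fraction of pruned weights

# Low-rank approximation [12]: keep the top-r singular triplets, so storing
# the two factors costs 2 * 64 * r values instead of 64 * 64.
r = 8
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_lowrank = (U[:, :r] * s[:r]) @ Vt[:r, :]
rel_err = np.linalg.norm(W - W_lowrank) / np.linalg.norm(W)
```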

Some recent works adopt structured weight matrices in order to reduce the model size. In [18], weight matrices of fully-connected (FC) layers were constructed in the Toeplitz-like format to remove redundancy from the DNN model. In [19], the circulant matrix was introduced to enable a further reduction in model size; an n-by-n circulant matrix has a smaller number of parameters, i.e., n, than a same-size Toeplitz matrix, i.e., 2n.

In this work, we generalize the structured weight matrix method in that (1) we utilize block-circulant matrices for weight matrix representation, which achieves a trade-off between compression ratio and accuracy loss; (2) we extend the structured matrix method to convolutional (CONV) layers besides FC layers; (3) we propose an FFT-based DNN training and inference model and algorithm, which is highly suitable for deployment in embedded systems; and (4) we implement and test the FFT-based DNN inference on various embedded platforms.

III. BACKGROUND

In this section, we introduce basic concepts of deep neural networks (DNNs), the Fast Fourier Transform (FFT), and structured matrices, as the background of our proposed FFT-based training and inference algorithms. Specifically, we explain the various DNN layer types, the Cooley-Tukey algorithm for FFT, and block-circulant matrices as the adopted structured matrices.

A. Deep Neural Networks

Deep neural networks (DNNs) are distinguished from other types of neural networks by their depth and have dramatically improved the state of the art in speech recognition, object detection, etc. Commonly adopted DNN models include deep convolutional neural networks, deep belief networks, and recurrent neural networks. Despite the various network topologies targeting different applications, these DNN models comprise multiple functional layers with some commonly used structures. The following are the most commonly used layer structures in state-of-the-art DNN models.

The fully-connected (FC) layer is the most storage-intensive layer in DNN models [20], since each of its neurons is fully connected with all the neurons in the previous layer. The computation procedure of an FC layer consists of matrix-vector arithmetic (multiplication and addition) followed by a transformation through the activation function, described as follows:

    y = \psi(W^T x + \theta)    (1)

where y and x are the outputs of this layer and the previous layer, respectively; W ∈ ℝ^{m×n} is the weight matrix of the synapses between this FC layer (with n neurons) and its previous layer (with m neurons); θ ∈ ℝ^n is the bias vector; and ψ(·) is the activation function. The Rectified Linear Unit (ReLU), ψ(x) = max(0, x), is the most widely used activation function in DNNs.
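As a minimal illustration of Eqn. (1), the NumPy sketch below (ours) computes one FC-layer forward pass with ReLU; the layer sizes are arbitrary toy values.

```python
import numpy as np

def fc_forward(W, x, theta):
    """Eqn. (1): y = ReLU(W^T x + theta)."""
    return np.maximum(0.0, W.T @ x + theta)

rng = np.random.default_rng(0)
m, n = 4, 3                          # previous layer: m neurons; this layer: n
W = rng.standard_normal((m, n))      # W in R^{m x n}
x = rng.standard_normal(m)           # output of the previous layer
theta = rng.standard_normal(n)       # bias vector
y = fc_forward(W, x, theta)          # y in R^n
```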
The convolutional (CONV) layer, as the name implies, performs two-dimensional convolution of its input to extract features that will be fed into subsequent layers for higher-level feature extraction. A CONV layer is associated with a set of learnable filters [21], which are activated when specific types of features are found at certain spatial positions in the input. Filter-sized moving windows are applied to the input to obtain a set of feature maps, calculating the convolution of the filter and the input within each moving window. Each convolutional neuron, representing one pixel in a feature map, takes a set of inputs and the corresponding filter weights and calculates their inner product. Given an input feature map X and an r × r-sized filter (i.e., the convolutional kernel) F, the output feature map Y is calculated as

    y_{a,b} = \sum_{i=1}^{r} \sum_{j=1}^{r} x_{a+i-1,\, b+j-1} \cdot f_{i,j}    (2)

where y_{a,b}, x_{a+i-1,b+j-1}, and f_{i,j} are elements of Y, X, and F, respectively. Multiple convolutional kernels can be adopted to extract different features from the same input feature map, and multiple input feature maps can be convolved with the same filter, with the results summed up to derive a single feature map.

B. Fast Fourier Transforms

The Fast Fourier Transform (FFT) is an efficient procedure for computing the discrete Fourier transform (DFT) of a time series. It takes advantage of the fact that the calculation of the DFT coefficients can be carried out iteratively, which results in considerable savings of computation time. The FFT not only reduces the computational complexity but also substantially reduces the round-off errors associated with these computations. In fact, both the computation time and the round-off error are essentially reduced by a factor of n / log2(n), where n is the number of data samples in the time series [22]. Fig. 1 shows the simplest and most common form of FFT, which is based on the Cooley-Tukey algorithm [23]. It uses a divide-and-conquer approach to recursively break down a DFT of arbitrary composite size N = N1 · N2 into many smaller DFTs of sizes N1 and N2, reducing the computation time to O(n log n) for highly composite N [23].

Fig. 1. Illustration of the Cooley-Tukey FFT algorithm: the even-indexed inputs x(0), x(2), ..., x(N-2) and the odd-indexed inputs x(1), x(3), ..., x(N-1) each feed a size-N/2 FFT, whose outputs are combined with twiddle factors W^0, W^1, ..., W^{N/2-1} to produce X(0), X(1), ..., X(N-1).

C. Structured Matrices

An n-by-m matrix A is called a structured matrix when it has a low displacement rank υ [18]. One of the most important characteristics of structured matrices is their small number of independent variables: the number of independent parameters is O(n) for an n-by-n structured matrix instead of O(n²), which indicates that the storage complexity can potentially be reduced to O(n). As a representative example, a circulant matrix W ∈ ℝ^{n×n} is defined by a vector w = (w_1, w_2, ..., w_n) as follows:

    W = \begin{bmatrix}
        w_1     & w_n     & \cdots & w_3    & w_2    \\
        w_2     & w_1     & w_n    & \cdots & w_3    \\
        \vdots  & w_2     & w_1    & \ddots & \vdots \\
        w_{n-1} & \vdots  & \ddots & \ddots & w_n    \\
        w_n     & w_{n-1} & \cdots & w_2    & w_1
    \end{bmatrix}

The definition and analysis of structured matrices have been generalized to m-by-n matrices with m ≠ n, e.g., block-circulant matrices. Moreover, the computational complexity of many matrix operations, such as matrix-vector multiplication and matrix inversion, can be significantly reduced when operating on structured matrices.
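The following sketch (ours) makes the storage and complexity argument concrete: a circulant matrix is fully determined by its first column, and multiplying by it is a circular convolution, which NumPy's Cooley-Tukey-based np.fft routines evaluate in O(n log n).

```python
import numpy as np

def circulant(w):
    """n x n circulant matrix whose first column is w."""
    return np.stack([np.roll(w, j) for j in range(len(w))], axis=1)

rng = np.random.default_rng(1)
w = rng.standard_normal(8)       # n parameters define the whole n x n matrix
x = rng.standard_normal(8)
C = circulant(w)

direct = C @ x                                              # O(n^2)
via_fft = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real   # O(n log n)
assert np.allclose(direct, via_fft)                         # circular convolution theorem
```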

IV. FAST FOURIER TRANSFORM-BASED DNN MODEL

In this section, we propose an efficient inference algorithm and explain the training algorithm for deep neural networks using block-circulant matrices. We achieve a simultaneous and significant reduction in the computational complexity of inference and training, as well as in weight storage. Besides, we have performed theoretical analysis to prove the effectiveness of substituting matrix multiplication with the Fast Fourier Transform method and utilizing block-circulant matrices, thereby guaranteeing applicability of the proposed framework to a wide variety of applications and emerging deep learning models.

A. Block-Circulant Matrix-Based Inference and Training Algorithms for FC Layers

Cheng et al. proposed circulant matrix-based DNN training and inference algorithms for FC layers [19]. However, in many practical applications such schemes cannot be used directly because (1) it is very common for the weight matrices of DNNs to be non-square due to the specific needs of different applications, and (2) even when the weight matrices are square, the compression is in many cases too aggressive and hence causes non-negligible accuracy loss. To address these challenges, we present block-circulant matrix-based inference and training algorithms.

Recall that forward propagation during the inference phase of an FC layer is performed as y = ψ(W^T x + θ), where ψ is the activation function, W is the weight matrix, x is the input vector, and θ is the bias vector. The computation bottleneck is the calculation of W^T x. When a block-circulant matrix is used to represent W, a fast multiplication algorithm for W^T x exists, which results in a significant reduction in computational complexity. Assume that the weight matrix is an m-by-n block-circulant matrix W = [C_1 C_2 ... C_k]^T, the input vector is x = (x_1 x_2 ... x_k), and the bias vector is θ = (θ_1 θ_2 ... θ_k). Each circulant matrix C_i ∈ ℝ^{n×n} is defined by a length-n vector w_i = (w_{i,1}, w_{i,2}, ..., w_{i,n})^T, i ∈ {1, ..., k}, where m = kn and x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n})^T. (For general values of m and n, zero padding can be applied so that the definition of block-circulant matrices still holds.) Hence W^T x, the key computation bottleneck in the inference phase, can be simplified as

    W^T x = \sum_{i=1}^{k} C_i x_i = \sum_{i=1}^{k} \text{IFFT}\big(\text{FFT}(w_i) \circ \text{FFT}(x_i)\big)    (3)

where FFT, IFFT, and ∘ denote the Fast Fourier Transform, the inverse FFT, and element-wise multiplication, respectively. This "FFT → component-wise multiplication → IFFT" procedure for implementing W^T x, shown in Fig. 2, is derived from the circular convolution theorem [24], [25]. The overall computational complexity of this FC layer becomes O(n log n), a significant reduction from the O(n²) cost of calculating W^T x directly. To store the weights for the inference phase, we can simply keep the FFT result FFT(w_i) (a vector) instead of the whole matrix W, thereby reducing the storage complexity to O(n) for an FC layer. Algorithm 1 summarizes the FFT-based inference algorithm.

Fig. 2. The "FFT → component-wise multiplication → IFFT" procedure: FFT(w) is pre-calculated and stored; at inference time the input x is transformed by FFT, multiplied component-wise with FFT(w), and transformed back by IFFT.

Algorithm 1: Block-circulant matrix-based inference
    Input: w, x, m, n
    Output: a
    sa ← max(m, n); si ← min(m, n); k ← ⌈sa/si⌉;
    partition w into k vectors, w_1, ..., w_k;
    if m < n then
        for i ← 0 until k do
            a_i ← ifft(fft(w_i) ∘ fft(x));
        end
    else
        partition x into k vectors, x_1, ..., x_k;
        for i ← 0 until k do
            a ← a + ifft(fft(w_i) ∘ fft(x_i));
        end
    end
    return a;
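As a numerical sanity check of Eqn. (3) and of the accumulating branch of Algorithm 1, the sketch below (ours, not the authors' code) compares the FFT-based result with a directly materialized block of circulant matrices; here the matrix written as W^T in Eqn. (3) is [C_1 C_2], so that W^T x = C_1 x_1 + C_2 x_2.

```python
import numpy as np

def circ_matvec(w, x):
    """C(w) x via FFT -> component-wise multiplication -> IFFT."""
    return np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

def circulant(w):
    return np.stack([np.roll(w, j) for j in range(len(w))], axis=1)

# m = 8, n = 4, hence k = 2 circulant blocks of size 4 x 4.
rng = np.random.default_rng(2)
w_blocks = [rng.standard_normal(4) for _ in range(2)]
x = rng.standard_normal(8)

fft_result = sum(circ_matvec(wi, xi)                     # Eqn. (3)
                 for wi, xi in zip(w_blocks, np.split(x, 2)))

WT = np.hstack([circulant(wi) for wi in w_blocks])       # materialized 4 x 8 matrix
assert np.allclose(fft_result, WT @ x)
```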
Besides the inference procedure, the reformulated training (weight updating) algorithm in the scenario of using block-circulant matrices also results in significant acceleration. We denote a = W^T x + θ = (a_1 a_2 ... a_k)^T and a_i = (a_{i,1}, a_{i,2}, ..., a_{i,n})^T; then the weight updating rule for the block-circulant FC layer is given by

    w_i ← w_i - \epsilon \cdot \text{IFFT}\big(\text{FFT}(\partial L / \partial a_i) \circ \text{FFT}(x'_i)\big) \cdot \mathbf{1}    (4)

where L, 1, ε, and x'_i represent the loss function, an all-one column vector, the learning rate, and the base vector that defines the (formally derived) circulant matrix ∂a_i/∂w_i, respectively. Notice that since ∂a_i/∂w_i is a circulant matrix, similarly to inference, we can utilize the "FFT → component-wise multiplication → IFFT" procedure to accelerate the matrix-vector multiplication. The computational complexity will be O(n log n) per updating step in this layer, a significant reduction from O(n²) in the traditional backpropagation procedure. Algorithm 2 summarizes the FFT-based training algorithm.

Algorithm 2: Block-circulant matrix-based training
    Input: ∂L/∂a, w, x, m, n
    Output: ∂L/∂w, ∂L/∂x
    sa ← max(m, n); si ← min(m, n); k ← ⌈sa/si⌉;
    partition w into k vectors, w_1, ..., w_k;
    partition ∂L/∂w into k vectors, ∂L/∂w_1, ..., ∂L/∂w_k;
    if m < n then
        partition ∂L/∂a into k vectors, ∂L/∂a_1, ..., ∂L/∂a_k;
        for i ← 0 until k do
            ∂L/∂w_i ← ifft(fft(∂L/∂a_i) ∘ fft(x')) · 1;
            ∂L/∂x ← ∂L/∂x + ifft(fft(∂L/∂a_i) ∘ fft(w'_i));
        end
    else
        partition x into k vectors, x_1, ..., x_k;
        partition ∂L/∂x into k vectors, ∂L/∂x_1, ..., ∂L/∂x_k;
        for i ← 0 until k do
            ∂L/∂w_i ← ifft(fft(∂L/∂a) ∘ fft(x'_i)) · 1;
            ∂L/∂x_i ← ifft(fft(∂L/∂a) ∘ fft(w'_i));
        end
    end
    return ∂L/∂w, ∂L/∂x;
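The gradient in Eqn. (4) is again a circular operation: backpropagating through a = C(w)x turns the circular convolution into a circular correlation, i.e., an element-wise product with a conjugated spectrum. The sketch below (our derivation check, not the authors' code) verifies one coordinate of ∂L/∂w by finite differences; conj(FFT(x)) plays the role of FFT(x'_i) in Eqn. (4).

```python
import numpy as np

def circ_matvec(w, x):
    return np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

rng = np.random.default_rng(3)
n = 6
w, x = rng.standard_normal(n), rng.standard_normal(n)
g = rng.standard_normal(n)             # dL/da for a linear toy loss L(a) = g . a

# Backprop through a = C(w) x: dL/dw = IFFT(FFT(dL/da) * conj(FFT(x))),
# an O(n log n) circular correlation (and symmetrically for dL/dx).
dw = np.fft.ifft(np.fft.fft(g) * np.conj(np.fft.fft(x))).real

eps = 1e-6                             # finite-difference check of dw[0]
w_pert = w.copy()
w_pert[0] += eps
fd = (g @ circ_matvec(w_pert, x) - g @ circ_matvec(w, x)) / eps
assert np.isclose(dw[0], fd, atol=1e-4)
```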

B. Block-Circulant Matrix-Based Inference and Training Algorithms for CONV Layers

The use of block-circulant matrices can also enable a significant reduction in the computational and storage complexity of convolutional (CONV) layers. CONV layers are often associated with multiple input and output feature maps in DNNs, so the computation of a CONV layer is described in tensor format as follows:

    \mathcal{Y}(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} \mathcal{F}(i, j, c, p) \, \mathcal{X}(x+i-1, y+j-1, c)    (5)

where 𝒳 ∈ ℝ^{W×H×C}, 𝒴 ∈ ℝ^{(W-r+1)×(H-r+1)×P}, and ℱ ∈ ℝ^{r×r×C×P} denote the input, output, and weight "tensors" of the CONV layer, respectively. C is the number of input maps, W and H are the spatial dimensions of the input maps, P is the total number of output maps, and r is the size of the convolutional kernel.

To enhance implementation efficiency, software tools provide an efficient approach for equivalently transforming the tensor-based operations of the CONV layer into matrix-based operations [26], [27]. Fig. 3 demonstrates the application of this method to reformulate Eqn. (5) as the matrix multiplication Y = XF, where X ∈ ℝ^{(W-r+1)(H-r+1)×Cr²}, Y ∈ ℝ^{(W-r+1)(H-r+1)×P}, and F ∈ ℝ^{Cr²×P}.

Fig. 3. Reformulation of Eqn. (5) to matrix multiplication: each r × r × C window of the input tensor becomes one row of X, and the r × r × C × P weight tensor is flattened into the Cr² × P matrix F.

We generalize the "block-circulant structure" to the rank-4 tensor ℱ in the CONV layer, i.e., each slice ℱ(·, ·, i, j) is a circulant matrix. Based on the reshaping principle between ℱ and F, we have

    f_{a + C(i-1) + Cr(j-1),\, b} = f_{C(i-1) + Cr(j-1),\, b-a}, \quad \forall a, b    (6)

so that F is a block-circulant matrix. Therefore, the "FFT → component-wise multiplication → IFFT" procedure can be applied to accelerate Y = XF, leading to the acceleration of (5). With the proposed approach, the computational complexity of (5) is reduced from O(WHr²CP) to O(WHQ log Q), where Q = max(r²C, P).
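The reformulation in Fig. 3 is what many frameworks implement as im2col. The sketch below (our illustration, not the authors' code) builds X and F for a toy tensor and checks the product against a direct evaluation of Eqn. (5); when F is additionally block-circulant, the FFT procedure of Section IV-A applies to the product.

```python
import numpy as np

def im2col_conv(Xt, Ft):
    """Eqn. (5) as the matrix product Y = X F (Fig. 3).
    Xt: input tensor (W, H, C); Ft: weight tensor (r, r, C, P)."""
    Wd, Hd, C = Xt.shape
    r, _, _, P = Ft.shape
    Wo, Ho = Wd - r + 1, Hd - r + 1
    # Each row of X gathers one r x r x C window of the input.
    X = np.stack([Xt[i:i + r, j:j + r, :].ravel()
                  for i in range(Wo) for j in range(Ho)])   # (Wo*Ho, C*r^2)
    F = Ft.reshape(r * r * C, P)                            # (C*r^2, P)
    return (X @ F).reshape(Wo, Ho, P)

# Toy check against a direct evaluation of Eqn. (5) at output position (1, 1).
rng = np.random.default_rng(4)
Xt = rng.standard_normal((5, 5, 3))
Ft = rng.standard_normal((3, 3, 3, 2))
Y = im2col_conv(Xt, Ft)
direct = sum(Ft[i, j, c, 0] * Xt[i, j, c]
             for i in range(3) for j in range(3) for c in range(3))
assert np.isclose(Y[0, 0, 0], direct)
```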
V. SOFTWARE IMPLEMENTATION

In this section, we provide a detailed explanation of our software implementation, experimental setup, and evaluation of the proposed inference framework on various Android-based platforms with embedded processors and various datasets. The purpose of this software implementation is to reveal the potential of embedded systems in running real-time applications that involve deep neural networks.

The software implementation of the proposed inference framework for Android-based platforms comprises four high-level modules. The first module is responsible for constructing the network architecture. The second module reads a file that contains trained weights and biases. The third module loads test data consisting of input features and predefined classification labels, and the fourth module performs inference to predict labels. Fig. 4 depicts these high-level building blocks of the software implementation, along with their interactions. It should be noted that the test data may be loaded from a file, camera, etc.

Fig. 4. Building blocks of the software implementation: an architecture parser, a parameters parser, and an inputs parser feed an inference engine built on the OpenCV library behind a Java/C++ interface, which produces the prediction.

We utilize OpenCV [28] as the core computing library in our project. OpenCV is an open-source cross-platform library of programming functions that is mainly targeted at computer vision applications and includes efficient implementations of the aforementioned operations. OpenCV is written in C++, and it provides APIs (Application Program Interfaces) for both C++ and Java. We implement two versions of the inference software: one that uses OpenCV's Java API, which is more convenient for Android development, and another that is developed in C++ using the Android NDK (Native Development Kit), uses OpenCV's C++ API, and is expected to have better performance.

The OpenCV Manager is installed on all target platforms in order to link OpenCV libraries dynamically and reduce memory usage. Additionally, hardware-specific optimizations are applied by the OpenCV Manager for an application's supported platforms.
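In skeletal form, the four modules described above correspond to the stages below. This is our Python schematic of the flow in Fig. 4, not the released code: all names are illustrative, and a plain dense forward pass stands in for the FFT-based layers that the actual Java/C++ implementations run through OpenCV.

```python
import numpy as np

def build_architecture():
    """Module 1: construct the network architecture (here, just layer widths)."""
    return [256, 128, 128, 10]

def load_parameters(arch, rng):
    """Module 2: read trained weights and biases (random placeholders here)."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(n))
            for m, n in zip(arch, arch[1:])]

def load_test_data(arch, rng):
    """Module 3: load input features (from a file, camera, etc.)."""
    return rng.standard_normal(arch[0])

def predict(params, x):
    """Module 4: forward pass with ReLU on hidden layers, argmax at the output."""
    for i, (W, theta) in enumerate(params):
        x = W.T @ x + theta
        if i < len(params) - 1:
            x = np.maximum(0.0, x)
    return int(np.argmax(x))

rng = np.random.default_rng(5)
arch = build_architecture()
label = predict(load_parameters(arch, rng), load_test_data(arch, rng))
```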

A. Experimental Setup

We run the inference application on platforms of different generations in order to evaluate the applicability of the inference framework on embedded systems. Table I summarizes the specifications of the test platforms.

TABLE I. PLATFORMS UNDER TEST AND THEIR SPECIFICATIONS.

| Platform        | Android         | Primary CPU            | Companion CPU          | CPU Architecture | GPU        | RAM (GB) |
| LG Nexus 5      | 6 (Marshmallow) | 4 x 2.3 GHz Krait 400  | -                      | ARMv7-A          | Adreno 330 | 2        |
| Odroid XU3      | 7 (Nougat)      | 4 x 2.1 GHz Cortex-A15 | 4 x 1.5 GHz Cortex-A7  | ARMv7-A          | Mali-T628  | 2        |
| Huawei Honor 6X | 7 (Nougat)      | 4 x 2.1 GHz Cortex-A53 | 4 x 1.7 GHz Cortex-A53 | ARMv8-A          | Mali-T830  | 3        |

In order to standardize the evaluation process on all platforms, airplane mode is switched on to eliminate telecommunication overhead, all other running applications are closed to ensure they do not affect the runtime, and the device is plugged in to avoid performance throttling by the platform's governor. Though this is the standard setup, we also study the performance of the inference process when the device is running on its battery.

B. MNIST

The MNIST dataset [29] is a handwritten-digit dataset of 28 × 28 greyscale images, with 60,000 images for training and 10,000 for testing. The original images are resized using a bilinear transformation, and the same transformation is used for both training and testing. Various neural network architectures were explored for each dataset, and a few of them are presented in this paper.

For the MNIST dataset, two different neural network architectures are evaluated. In the first architecture (Arch. 1), the input layer consists of 256 neurons that represent the resized MNIST images. The next two layers comprise 128 neurons each and are based on block-circulant FC layers. Finally, the last layer is a softmax layer of 10 neurons representing the ten possible digit predictions. The second architecture (Arch. 2) has 121 neurons in the input layer, 64 neurons in each of the two hidden layers, and, as in Arch. 1, a softmax output layer. Table II summarizes the runtime of each round of the inference process for these architectures on the various mobile platforms.

TABLE II. CORE RUNTIME OF EACH ROUND OF INFERENCE FOR RESIZED MNIST IMAGES (µs PER IMAGE).

| Architecture | Implementation | Accuracy (%) | Nexus 5 | XU3   | Honor 6X |
| Arch. 1      | Java           | 95.47        | 359.6   | 294.1 | 256.7    |
| Arch. 1      | C++            | 95.47        | 140.0   | 122.0 | 101.0    |
| Arch. 2      | Java           | 93.59        | 350.9   | 278.2 | 221.7    |
| Arch. 2      | C++            | 93.59        | 128.5   | 119.1 | 98.5     |

Based on the results summarized in Table II, the C++ implementation is about 60-65% faster than the Java implementation. One reason for this superior performance relates to memory limitations and the memory management policy of Android: while applications written in C++ have an unlimited heap size, Java applications are restricted to platform-specific heap sizes, which constrains the amount of data that an application written in Java can handle at any instant. Another potential reason for the considerable performance difference between the two implementations is the overhead of switching between Java and C++: because the OpenCV library is written in C++, data needs to be converted between C++ and Java data types whenever the Java API is used. We believe these conversions do not affect the runtime significantly, but they can account for part of the difference in performance across the two implementations.

Comparing the architectures in Table II, one can observe that going from the smaller network to the bigger one increases accuracy by about 2%, while increasing the memory required for storing parameters by a factor of about two and increasing the runtime of the Java and C++ implementations by about 2% and 9%, respectively. It should be noted that when the device is running on its battery, the runtime increases by about 14% for the Java implementation but remains unchanged for the C++ implementation.
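To make Arch. 1 concrete, the sketch below (ours, with random placeholder weights rather than trained parameters) runs the full 256-128-128-10 block-circulant forward pass using the FFT-based FC layer of Section IV; zero padding handles layer pairs whose sizes do not divide evenly, per the paper's footnote.

```python
import numpy as np

def circ_matvec(w, x):
    return np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

def block_circ_fc(x, w_blocks, n):
    """FFT-based FC layer y = W^T x for block-circulant W (cf. Algorithm 1)."""
    m, k = len(x), len(w_blocks)
    if m > n:
        xp = np.pad(x, (0, k * n - m))               # zero-pad to k whole blocks
        return sum(circ_matvec(wi, xi)
                   for wi, xi in zip(w_blocks, np.split(xp, k)))
    return np.concatenate([circ_matvec(wi, x) for wi in w_blocks])[:n]

rng = np.random.default_rng(6)
arch = [256, 128, 128, 10]                           # Arch. 1

def make_layer(m, n):
    si = min(m, n)
    k = -(-max(m, n) // si)                          # ceil(max/min) blocks
    return [rng.standard_normal(si) for _ in range(k)], rng.standard_normal(n)

params = [make_layer(m, n) for m, n in zip(arch, arch[1:])]

x = rng.standard_normal(arch[0])                     # a resized 16 x 16 image
for i, ((w_blocks, theta), n) in enumerate(zip(params, arch[1:])):
    x = block_circ_fc(x, w_blocks, n) + theta
    if i < len(params) - 1:
        x = np.maximum(0.0, x)                       # ReLU on the hidden layers
digit = int(np.argmax(x))                            # predicted class
```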
C. CIFAR-10

The CIFAR-10 dataset [30] contains 32 × 32 color images from 10 classes, with 50,000 training images and 10,000 testing images. The structure of the deep neural network can be denoted as 1024F-1024F-10F (Arch. 3). In this notation, 128x3x32x32 denotes that (i) the batch size is 128, (ii) the number of input channels is 3, and (iii) the spatial size of the input data is 32x32; 128Conv3 indicates that 128 3x3 convolutional filters are used in a convolutional layer; and 512F or 10F means that the number of neurons in an FC layer is 512 or 10, respectively. Both the original and compressed models are trained with a learning rate of 0.001 and a momentum of 0.9. In this network architecture, the first two convolutional layers are traditional convolutional layers (without block-circulant structure), which is treated as preprocessing, similar to the IBM TrueNorth paper [31]. Based on the results summarized in Table III, the C++ implementation is about 130% faster than the Java implementation.

TABLE III. CORE RUNTIME OF EACH ROUND OF INFERENCE FOR CIFAR-10 IMAGES (µs PER IMAGE).

| Architecture | Implementation | Accuracy (%) | XU3   | Honor 6X |
| Arch. 3      | Java           | 80.2         | 21032 | 19785    |
| Arch. 3      | C++            | 80.2         | 8912  | 8244     |

D. Comparison Results on Performance and Accuracy

In this section, we provide comprehensive comparison results against IBM TrueNorth [31], [32] on MNIST and CIFAR-10. Our test platforms each include one or two quad-core ARM CPUs, while IBM TrueNorth includes 4,096 ASIC cores, around 500-1000 times more than our test platforms. As shown in Fig. 5, compared with the IBM TrueNorth results on MNIST [32], our model performs 10x faster than IBM TrueNorth with a small accuracy reduction on the best device result; the reported accuracy for IBM TrueNorth is 95%, with a runtime of 1000 µs per image. Compared with the IBM TrueNorth results on CIFAR-10 [31], with 500-1000 times fewer cores, our model performs 10x slower than IBM TrueNorth; the reported accuracy for IBM TrueNorth is 83.41%, with a runtime of 800 µs per image. We can see that the later work on CIFAR-10 [31], from 2016, is optimized more efficiently than the earlier work [32] from 2015. Although our mobile-phone-based framework achieves lower performance than IBM TrueNorth on CIFAR-10, this is still a reasonably good result considering the large gap in the number of processing cores.

Fig. 5. Accuracy (%) comparison of our method and IBM TrueNorth (IBM-TN) on MNIST and CIFAR-10.
