CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices

1y ago
9 Views
2 Downloads
2.05 MB
14 Pages
Last View : 1d ago
Last Download : 3m ago
Upload by : Jenson Heredia
Transcription

Caiwen Ding*1, Siyu Liao*2, Yanzhi Wang1, Zhe Li1, Ning Liu1, Youwei Zhuo3, Chao Wang3, Xuehai Qian3, Yu Bai4, Geng Yuan1, Xiaolong Ma1, Yipeng Zhang1, Jian Tang1, Qinru Qiu1, Xue Lin5, Bo Yuan2
1 Syracuse University, 2 City University of New York, City College, 3 University of Southern California, 4 California State University Fullerton, 5 Northeastern University. *These authors contributed equally. Contact: byuan@ccny.cuny.edu

ABSTRACT

Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve their energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of a rigorous guarantee of compression ratio and inference accuracy.

To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: the DNNs based on CirCNN can converge to the same "effectiveness" as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). In the CirCNN architecture: 1) Due to the recursive property, FFT can be used as the key computing kernel, which ensures universal and small-footprint implementations. 2) The compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CirCNN on FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.

CCS CONCEPTS
Computer systems organization → Embedded hardware;

KEYWORDS
Deep learning, block-circulant matrix, compression, acceleration, FPGA

ACM Reference format:
Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, Bo Yuan. 2017. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14-18, 2017, 14 pages. https://doi.org/10.1145/3123939.3124552

1 INTRODUCTION

From the end of the first decade of the 21st century, neural networks have been experiencing a phenomenal resurgence thanks to big data and significant advances in processing speeds. Large-scale deep neural networks (DNNs) have been able to deliver impressive results in many challenging problems.
For instance, DNNs have led to breakthroughs in object recognition accuracy on the ImageNet dataset [1], even achieving human-level performance for face recognition [2]. Such promising results triggered the revolution of several traditional and emerging real-world applications, such as self-driving systems [3], automatic machine translation [4], and drug discovery and toxicology [5]. As a result, both academia and industry show rising interest, with significant resources devoted to the investigation, improvement, and promotion of deep learning methods and systems.

One of the key enablers of the unprecedented success of deep learning is the availability of very large models. Modern DNNs typically consist of multiple cascaded layers, and at least millions to hundreds of millions of parameters (i.e., weights) for the entire model [6-9].

The larger-scale neural networks tend to enable the extraction of more complex high-level features, and therefore lead to a significant improvement of the overall accuracy [10-12]. On the other side, the layered deep structure and large model sizes also demand increasing computational capability and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have both attracted enormous interest.

The first trend is the hardware acceleration of DNNs, which has been extensively investigated in both industry and academia. As a representative technique, FPGA-based accelerators offer good programmability, a high degree of parallelism and short development cycles. FPGAs have been used to accelerate the original DNNs [13-17], binary neural networks [18, 19], and more recently, DNNs with model compression [20]. Alternatively, ASIC-based implementations have been recently explored to overcome the limitations of general-purpose computing approaches. A number of major high-tech companies have announced ASIC chip designs for DNN inference, such as Intel, Google, etc. [21, 22]. In academia, three representative works at the architectural level are Eyeriss [23], EIE [24], and the DianNao family [25-27], which focus specifically on the convolutional layers, the fully-connected layers, and the memory design/organization, respectively. There are a number of recent tapeouts of hardware deep learning systems [23, 28-33].

These prior works mainly focus on the inference phase of DNNs, and usually suffer from frequent accesses to off-chip DRAM systems (e.g., when large-scale DNNs are used for the ImageNet dataset). This is because the limited on-chip SRAM memory can hardly accommodate large model sizes. Unfortunately, off-chip DRAM accesses consume significant energy. Recent studies [34, 35] show that the per-bit access energy of off-chip DRAM memory is 200X that of on-chip SRAM. Therefore, it can easily dominate the whole-system power consumption.

The energy efficiency challenge of large models motivates the second trend: model compression. Several algorithm-level techniques have been proposed to compress models and accelerate DNNs, including weight quantization [36, 37], connection pruning [34, 35], and low-rank approximation [38, 39]. These approaches can offer a reasonable parameter reduction (e.g., by 9X to 13X in [34, 35]) with minor accuracy degradation. However, they suffer from three drawbacks: 1) the sparsity regularization and pruning typically result in an irregular network structure, thereby undermining the compression ratio and limiting performance and throughput [40]; 2) the training complexity is increased due to the additional pruning process [34, 35] or low-rank approximation step [38, 39]; and 3) the compression ratios, depending on the network, are heuristic and cannot be precisely controlled.

We believe that an ideal model compression technique should: i) maintain a regular network structure; ii) reduce the complexity of both inference and training; and, most importantly, iii) retain a rigorous mathematical foundation on compression ratio and accuracy.

As an effort to achieve these three goals, we propose CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices [41]. The concept of the block-circulant matrix compared to the ordinary unstructured matrix is shown in Fig. 1.
In a square circulant matrix, each row (or column) vector is a circulant reformat of the other row (column) vectors. A non-square matrix can be represented by a set of square circulant submatrices (blocks).

Figure 1: Block-circulant matrices for weight representation (left: unstructured weight matrix, 18 parameters; right: block-circulant weight matrix, 6 parameters).

Therefore, by representing a matrix with a vector, the first benefit of CirCNN is storage size reduction. In Fig. 1, the unstructured 6×3 weight matrix (on the left) holds 18 parameters. Suppose we can represent the weights using two 3×3 circulant matrices (on the right); then we only need to store 6 parameters, easily leading to a 3x model size reduction. Intuitively, the reduction ratio is determined by the block size of the circulant submatrices: a larger block size leads to a higher compression ratio. In general, the storage complexity is reduced from O(n^2) to O(n).

The second benefit of CirCNN is computational complexity reduction. We explain the insights using a fully-connected layer of a DNN, which can be represented as y = σ(Wx), where vectors x and y represent the outputs of all neurons in the previous layer and the current layer, respectively; W is the m-by-n weight matrix; and σ(·) is the activation function. When W is a block-circulant matrix, the Fast Fourier Transform (FFT)-based fast multiplication method can be utilized, and the computational complexity is reduced from O(n^2) to O(n log n).

It is important to understand that CirCNN incurs no conversion between unstructured weight matrices and block-circulant matrices. Instead, we assume that the layers can be represented by block-circulant matrices, and the training generates a vector for each circulant submatrix. The fundamental difference is this: current approaches apply various compression techniques (e.g., pruning) to the unstructured weight matrices and then retrain the network, while CirCNN directly trains the network assuming the block-circulant structure. This leads to two advantages. First, prior work can only reduce the model size by a heuristic factor, depending on the network, while CirCNN provides an adjustable but fixed reduction ratio. Second, with the same FFT-based fast multiplication, the computational complexity of training is also reduced from O(n^2) to O(n log n). Unfortunately, prior work does not reduce (or even increases) training complexity.

Due to the storage and computational complexity reduction, CirCNN is clearly attractive. The only question is: can a network really be represented by block-circulant matrices with no (or negligible) accuracy loss? This question is natural, because with far fewer weights in the vectors, the network may not be able to approximate the function of a network with unstructured weight matrices. Fortunately, the answer to the question is YES. CirCNN is mathematically rigorous: we have developed a theoretical foundation and formal proof showing that the DNNs represented by block-circulant matrices can converge to the same "effectiveness" as DNNs without compression, fundamentally distinguishing our method from prior arts. The outline of the proof is discussed in Section 3.3 and the details are provided in technical reports [42, 43].
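As a concrete illustration of the FFT-based multiplication (a minimal NumPy/SciPy sketch added here for clarity; the helper names and the use of scipy.linalg.circulant are ours, not from the paper), the identity Wx = IFFT(FFT(w) ∘ FFT(x)) lets a k×k circulant block be applied in O(k log k) time while storing only its length-k defining vector:

```python
import numpy as np
from scipy.linalg import circulant  # builds a k x k circulant matrix from its first column

def circulant_matvec_fft(w, x):
    """Compute W @ x for the circulant matrix W defined by vector w,
    using the circulant convolution theorem: O(k log k) instead of O(k^2)."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))

k = 4
w = np.array([0.36, -1.39, 0.06, 1.56])   # only k parameters stored
x = np.random.randn(k)

W = circulant(w)                          # explicit k x k matrix: k^2 parameters
assert np.allclose(W @ x, circulant_matvec_fft(w, x))
```

Note that scipy.linalg.circulant defines the block by its first column, whereas the paper defines each block by its first row; the identity is the same up to the ordering of the defining vector.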

Based on the block-circulant matrix-based algorithms, we propose the CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). Applying CirCNN to neural network accelerators enables notable architectural innovations. 1) Due to its recursive property and its intrinsic role in CirCNN, FFT is implemented as the basic computing block, which ensures universal and small-footprint implementations. 2) Pipelining and parallelism optimizations: taking advantage of the compressed but regular network structure, we aggressively apply inter-level and intra-level pipelining in the basic computing block; moreover, we can conduct joint optimizations considering parallelization degree, performance and power consumption. 3) Platform-specific optimizations focusing on weight storage and memory management.

To demonstrate the performance and energy efficiency, we test the CirCNN architecture on three platforms: FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.

2 BACKGROUND AND MOTIVATION

2.1 Deep Neural Networks

Deep learning systems can be constructed using different types of architectures, including deep convolutional neural networks (DCNNs), deep belief networks (DBNs), and recurrent neural networks (RNNs). Despite the differences in network structures and target applications, they share the same construction principle: multiple functional layers are cascaded together to extract features at multiple levels of abstraction [44-46]. Fig. 2 illustrates the multi-layer structure of an example DCNN, which consists of a stack of fully-connected layers, convolutional layers, and pooling layers. These three types of layers are fundamental in deep learning systems.

Figure 2: Multi-layer structure of an example DCNN (convolutional layers, pooling layers, and fully connected layers).

The fully-connected (FC) layer is the most storage-intensive layer in DNN architectures [14, 15] since its neurons are fully connected with the neurons in the previous layer. The computation procedure of an FC layer consists of matrix-vector arithmetic (multiplications and additions) and transformation by the activation function, described as follows:

  y = σ(Wx + θ),   (1)

where W ∈ R^{m×n} is the weight matrix of the synapses between this FC layer (with m neurons) and its previous layer (with n neurons); θ ∈ R^m is the bias vector; and σ(·) is the activation function. The Rectified Linear Unit (ReLU), σ(x) = max(0, x), is the most widely utilized in DNNs.

The convolutional (CONV) layer, as the name implies, performs a two-dimensional convolution to extract features from its inputs, which will be fed into subsequent layers for extracting higher-level features. A CONV layer is associated with a set of learnable filters (or kernels) [47], which are activated when specific types of features are found at some spatial positions in the inputs. A filter-sized moving window is applied to the inputs to obtain a set of feature maps, calculating the convolution of the filter and the inputs in the moving window. Each convolutional neuron, representing one pixel in a feature map, takes a set of inputs and the corresponding filter weights to calculate the inner product. Given input feature map X and the r×r-sized filter (i.e., the convolutional kernel) F, the output feature map Y is calculated as

  y_{a,b} = \sum_{i=1}^{r} \sum_{j=1}^{r} x_{a+i-1, b+j-1} f_{i,j},   (2)

where y_{a,b}, x_{a+i-1,b+j-1}, and f_{i,j} are elements in Y, X, and F, respectively. Multiple convolutional kernels can be adopted to extract different features in the same input feature map. Multiple input feature maps can be convolved with the same filter and the results summed up to derive a single feature map.

The pooling (POOL) layer performs a subsampling operation on the extracted features to reduce the data dimensions and mitigate overfitting issues. Here, the subsampling operation on the inputs of the pooling layer can be realized by various non-linear operations, such as max, average or L2-norm calculation. Among them, max pooling is the dominant type of pooling strategy in state-of-the-art DCNNs due to its higher overall accuracy and convergence speed [20, 23].

Among these three types of layers, the majority of the computation occurs in the CONV and FC layers, while the POOL layer has a relatively lower computational complexity of O(n). The storage requirement of DNNs is due to the weight matrices W in the FC layers and the convolutional kernels F in the CONV layers. As a result, the FC and CONV layers are the major research focus for energy-efficient implementation and weight reduction of DNNs.
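To make equations (1) and (2) concrete, here is a minimal NumPy sketch (our own illustrative code with arbitrary shapes, not taken from the paper) that evaluates one FC layer with ReLU and one single-channel convolution window sweep exactly as defined above:

```python
import numpy as np

def fc_layer(W, x, theta):
    """Eq. (1): y = sigma(W x + theta), with ReLU as the activation."""
    return np.maximum(0.0, W @ x + theta)

def conv2d_single(X, F):
    """Eq. (2): slide an r x r filter F over input map X (no padding, stride 1)."""
    r = F.shape[0]
    H, Wd = X.shape
    Y = np.zeros((H - r + 1, Wd - r + 1))
    for a in range(H - r + 1):
        for b in range(Wd - r + 1):
            Y[a, b] = np.sum(X[a:a + r, b:b + r] * F)   # inner product of window and filter
    return Y

y = fc_layer(np.random.randn(8, 16), np.random.randn(16), np.zeros(8))
Ymap = conv2d_single(np.random.randn(6, 6), np.random.randn(3, 3))
```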

2.2 DNN Weight Storage Reduction and Acceleration

Mathematical investigations have demonstrated significant sparsity and margin for weight reduction in DNNs, and a number of prior works leverage this property to reduce weight storage. The techniques can be classified into two categories. 1) Systematic methods [48-50] such as Singular Value Decomposition (SVD). Despite being systematic, these methods typically exhibit a relatively high degradation in overall accuracy (by 5%-10% at 10X compression). 2) Heuristic pruning methods [34, 35, 51], which use heuristic weight pruning together with weight quantization. These methods can achieve better parameter reductions, i.e., 9X-13X [34, 35], with a very small accuracy degradation. However, the network structure and weight storage after pruning become highly irregular (cf. Fig. 3) and therefore indexing is always needed, which undermines the compression ratio and, more importantly, the performance improvement.

Figure 3: Illustration of the heuristic weight pruning methods (pruning synapses and pruning neurons).

Besides the pros and cons of the two approaches, the prior works share the following common limitations: 1) they mainly focus on weight reduction rather than computational complexity reduction; 2) they only reduce the model size by a heuristic factor instead of reducing the Big-O complexity; and 3) they perform weight pruning or apply matrix transformations based on a trained DNN model, thereby adding complexity to the training process. The third item is crucial because it may limit the scalability of future larger-scale deep learning systems.

2.3 FFT-Based Methods

LeCun et al. proposed using FFTs to accelerate the computations in the CONV layers, which applies only to a single filter in the CONV layer [52]. It uses the FFT to calculate the traditional inner products of filters and input feature maps, and can achieve speedup for large filter sizes (which are less common in state-of-the-art DCNNs [53]). The underlying neural network structure and parameters remain unchanged. The speedup is due to filter reuse, and it cannot achieve either asymptotic speedup in big-O notation or weight compression (in fact, additional storage space is needed).

The work most closely related to CirCNN is [54]. It proposed to use a circulant matrix in the inference and training algorithms. However, it has a number of limitations. First, it only applies to FC layers, not CONV layers, which limits the potential gain in weight reduction and performance. Second, it uses a single circulant matrix to represent the weights of the whole FC layer. Since the numbers of input and output neurons are usually not the same, this method leads to storage waste due to the padded zeros (needed to make the circulant matrix square).

2.4 Novelty of CirCNN

Compared with LeCun et al. [52], CirCNN is fundamentally different as it achieves asymptotic speedup in big-O notation and weight compression simultaneously. Compared with [54], CirCNN generalizes in three significant and novel aspects.

Supporting both FC and CONV layers. Unlike FC layers, the matrices in CONV layers are small filters (e.g., 3×3). Instead of representing each filter as a circulant matrix, CirCNN exploits the inter-filter sparsity among different filters. In other words, CirCNN represents a matrix of filters, where input and output channels are the two dimensions, by a vector of filters. The support for CONV layers allows CirCNN to be applied to the whole network.

Block-circulant matrices. To mitigate the inefficiency due to the single large circulant matrix used in [54], CirCNN uses block-circulant matrices for weight representation. The benefits are two-fold. First, it avoids the wasted storage/computation due to zero padding when the numbers of inputs and outputs are not equal. Second, it allows us to derive a fine-grained tradeoff between accuracy and compression/acceleration. Specifically, to achieve a better compression ratio, a larger block size should be used; however, it may lead to more accuracy degradation. Smaller block sizes provide better accuracy but less compression. There is no compression if the block size is 1.

Mathematical rigorousness. Importantly, we perform a theoretical analysis to prove that the "effectiveness" of block-circulant matrix-based DNNs will (asymptotically) approach that of the original networks without compression. The theoretical proof also distinguishes the proposed method from prior work. The outline of the proof is discussed in Section 3.3 and the details are provided in reports [42, 43].

Fig. 4 illustrates the difference between the baseline [54] and CirCNN. The baseline method (a) formulates a large, square circulant matrix by zero padding for FC layer weight representation when the numbers of inputs and outputs are not equal. In contrast, CirCNN (b) uses the block-circulant matrix to avoid storage waste and achieve a fine-grained tradeoff of accuracy and compression/acceleration.

Figure 4: Baseline [54] (a) and CirCNN (b). The baseline formulates a large, square circulant matrix for FC layer weight representation when the numbers of inputs and outputs are not equal, whereas the proposed method uses the block-circulant matrix to achieve a fine-grained tradeoff of accuracy and compression/acceleration.

Overall, with the novel techniques of CirCNN, at the algorithm level it is possible to achieve a simultaneous and significant reduction of both computational and storage complexity, for both inference and training.
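The block-size tradeoff above can be quantified with a small helper (our own illustration, not code from the paper): an m×n layer stores m·n weights unstructured, but only one length-k vector per k×k circulant block, i.e. m·n/k weights, so the compression ratio equals the block size k.

```python
def blockcirc_params(m, n, k):
    """Parameter count of an m x n weight matrix stored as (m/k) x (n/k)
    circulant blocks of size k x k (assumes k divides both m and n)."""
    assert m % k == 0 and n % k == 0
    return (m // k) * (n // k) * k     # one length-k defining vector per block

# Fig. 1 example: a 6 x 3 matrix with k = 3 stores 6 instead of 18 parameters;
# k = 1 gives no compression.
for k in (1, 3):
    print(k, 6 * 3, blockcirc_params(6, 3, k))
```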

3 CIRCNN: ALGORITHMS AND FOUNDATION

3.1 FC Layer Algorithm

The key idea of block-circulant matrix-based FC layers is to partition the original arbitrary-size weight matrix W ∈ R^{m×n} into 2D blocks of square sub-matrices, each of which is a circulant matrix. The insights are shown in Fig. 5. Let k denote the block size (size of each sub-matrix) and assume there are p × q blocks after partitioning W, where p = m/k and q = n/k. Then W = [W_ij], i ∈ {1, ..., p}, j ∈ {1, ..., q}. Correspondingly, the input x is also partitioned as x = [x_1^T, x_2^T, ..., x_q^T]^T. Then the forward propagation process in the inference phase is given by (with bias and ReLU omitted):

  a = Wx = [ \sum_{j=1}^{q} W_{1j} x_j ; \sum_{j=1}^{q} W_{2j} x_j ; ... ; \sum_{j=1}^{q} W_{pj} x_j ] = [ a_1 ; a_2 ; ... ; a_p ],   (3)

where a_i ∈ R^k is a column vector. Assume each circulant matrix W_ij is defined by a vector w_ij, i.e., w_ij is the first row vector of W_ij. Then, according to the circulant convolution theorem [41, 55], the calculation of W_ij x_j can be performed as IFFT(FFT(w_ij) ∘ FFT(x_j)), where ∘ denotes element-wise multiplication. The operation procedure is shown on the right of Fig. 5. For the inference phase, the computational complexity of this FC layer will be O(pqk log k), which is equivalent to O(n log n) for small p, q values. Similarly, the storage complexity will be O(pqk) because we only need to store w_ij or FFT(w_ij) for each sub-matrix, which is equivalent to O(n) for small p, q values. Therefore, simultaneous acceleration and model compression compared with the original DNN can be achieved for the inference process. Algorithm 1 illustrates the calculation of Wx in the inference process in the FC layer of CirCNN.

Figure 5: Illustration of the calculation of Wx in the inference process of the FC layer.

Algorithm 1: Forward propagation process in the FC layer of CirCNN
  Input: w_ij's, x, p, q, k
  Output: a
  Initialize a with zeros.
  for i = 1 until p do
    for j = 1 until q do
      a_i = a_i + IFFT(FFT(w_ij) ∘ FFT(x_j))
    end
  end
  return a

Next, we consider the backward propagation process in the training phase. Let a_il be the l-th output element in a_i, and let L denote the loss function. Then, by using the chain rule we can derive the backward propagation process as follows:

  ∂L/∂w_ij = \sum_{l=1}^{k} (∂L/∂a_il)(∂a_il/∂w_ij) = (∂L/∂a_i)(∂a_i/∂w_ij),   (4)

  ∂L/∂x_j = \sum_{i=1}^{p} \sum_{l=1}^{k} (∂L/∂a_il)(∂a_il/∂x_j) = \sum_{i=1}^{p} (∂L/∂a_i)(∂a_i/∂x_j).   (5)

We have proved that ∂a_i/∂w_ij and ∂a_i/∂x_j are block-circulant matrices. Therefore, ∂L/∂w_ij and (∂L/∂a_i)(∂a_i/∂x_j) can be calculated with the same "FFT, element-wise multiplication, IFFT" procedure, which is equivalent to O(n log n) computational complexity per layer. Algorithm 2 illustrates the backward propagation process in the FC layer of CirCNN.

Algorithm 2: Backward propagation process in the FC layer of CirCNN
  Input: ∂L/∂a, w_ij's, x, p, q, k
  Output: ∂L/∂w_ij's, ∂L/∂x
  Initialize ∂L/∂w_ij's and ∂L/∂x with zeros.
  for i = 1 until p do
    for j = 1 until q do
      ∂L/∂w_ij = IFFT(FFT(∂L/∂a_i) ∘ FFT(x'_j))
      ∂L/∂x_j = ∂L/∂x_j + IFFT(FFT(∂L/∂a_i) ∘ FFT(w'_ij))
    end
  end
  return ∂L/∂w_ij's, ∂L/∂x

In CirCNN, inference and training constitute an integrated framework where the reduction of computational complexity can be gained for both. We directly train the vectors w_ij, corresponding to the circulant sub-matrices W_ij, in each layer using Algorithm 2. Clearly, the network after such a training procedure naturally follows the block-circulant matrix structure. This is a key advantage of CirCNN compared with prior works, which require additional steps on a trained neural network.
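As a sanity check of Algorithm 1, the following NumPy sketch (our own rendering; the array shapes, names, and the block convention of the earlier FFT identity are ours) computes a = Wx block by block in O(pqk log k) while storing only p·q·k values:

```python
import numpy as np

def blockcirc_fc_forward(w, x, p, q, k):
    """Algorithm 1 (sketch): forward pass of a block-circulant FC layer.
    w has shape (p, q, k): the defining vector of each circulant block W_ij.
    x has length n = q*k. Returns a = W x of length m = p*k."""
    xb = x.reshape(q, k)                  # partition x into blocks x_1 ... x_q
    a = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            # W_ij x_j = IFFT( FFT(w_ij) o FFT(x_j) )
            a[i] += np.real(np.fft.ifft(np.fft.fft(w[i, j]) * np.fft.fft(xb[j])))
    return a.reshape(p * k)

p, q, k = 2, 3, 4
w = np.random.randn(p, q, k)              # p*q*k stored values instead of (p*k)*(q*k)
x = np.random.randn(q * k)
a = blockcirc_fc_forward(w, x, p, q, k)
```

In practice the FFT of each x_j would be computed once and reused across all i, and FFT(w_ij) can be precomputed and stored directly, as noted in the text.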

3.2 CONV Layer Algorithm

In practical DNN models, the CONV layers are often associated with multiple input and multiple output feature maps. As a result, the computation in the CONV layer can be expressed in the form of tensor computations:

  Y(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} F(i, j, c, p) X(x+i-1, y+j-1, c),   (6)

where X ∈ R^{W×H×C}, Y ∈ R^{(W-r+1)×(H-r+1)×P}, and F ∈ R^{r×r×C×P} represent the input, output, and weight "tensors" of the CONV layer, respectively. Here, W and H are the spatial dimensions of the input maps, C is the number of input maps, r is the size of the convolutional kernel, and P is the number of output maps.

We generalize the concept of "block-circulant structure" to the rank-4 tensor F in the CONV layer, i.e., all slices of the form F(·, ·, i, j) are circulant matrices. Next, we reformulate the inference and training algorithms of the CONV layer as matrix operations. We use the inference process as an example; the training process can be formulated in a similar way. Software tools such as Caffe provide an efficient methodology for transforming tensor-based operations in the CONV layer into matrix-based operations [56, 57], in order to enhance the implementation efficiency.

3.3 Outline of the Theoretical Proof

The universal approximation property states that a neural network should be able to approximate any continuous or measurable function with arbitrary accuracy, provided that a large enough number of parameters is available. This property provides the theoretical guarantee for using neural networks to solve machine learning problems, since machine learning tasks can be formulated as finding a proper approximation of an unknown, high-dimensional function. Therefore, the goal is to prove the universal approximation property of block-circulant matrix-based neural networks, and more generally, of arbitrary structured matrices satisfying low displacement rank. The detailed proofs for the block-circulant matrix-based networks and the general structured matrix-based ones are provided in the technical reports [42, 43].

The proof of the universal approximation property for block-circulant matrix-based neural networks is briefly outlined as follows. Our objective is to prove that any continuous or measurable function can be approximated with arbitrary accuracy using a block-circulant matrix-based network. Equivalently, we aim to prove that the function space achieved by block-circulant matrix-based neural networks is dense in the space of continuous or measurable functions with the same inputs. An important property of the activation function, i.e., the component-wise discriminatory property, is proved. Based on this property, the above objective is proved using proof by contradiction and the Hahn-Banach Theorem [58].

We have further derived an approximation error bound of O(1/n) when the number of neurons in the layer n is limited, with details shown in [43]. It implies that the approximation error will reduce with an increasing n, i.e., an increasing number of neurons/inputs in the network. As a result, we can guarantee the universal "effectiveness" of the proposed framework on different DNN types and sizes, application domains, and hardware/software platforms.
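Stated schematically in our own notation (a paraphrase of the two claims above, not the formal theorem from [42, 43]), the proof establishes:

```latex
% Density (universal approximation): block-circulant networks are dense in C(K),
% i.e. any continuous target on a compact input set K can be approximated.
\[
  \forall f \in C(K),\ \forall \varepsilon > 0,\ \exists\, G_{\mathrm{bc}} :\quad
  \sup_{x \in K} \lVert f(x) - G_{\mathrm{bc}}(x) \rVert < \varepsilon .
\]
% Error bound: with n neurons/inputs in a layer, the achievable error shrinks as 1/n.
\[
  \lVert f - G_{\mathrm{bc}} \rVert = O(1/n).
\]
```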
