Performance Analysis of GPU-based Convolutional Neural Networks

2016 45th International Conference on Parallel Processing
2332-5690/16 $31.00 © 2016 IEEE. DOI 10.1109/ICPP.2016.15

Xiaqing Li†‡§, Guangyan Zhang†‡§, H. Howie Huang¶, Zhufan Wang†‡, Weimin Zheng†‡
† Department of Computer Science and Technology, Tsinghua University
‡ Tsinghua National Laboratory for Information Science and Technology
§ State Key Lab of Mathematical Engineering and Advanced Computing, Wuxi, China
¶ Department of Electrical and Computer Engineering, George Washington University
Email: li-xq14@mails.tsinghua.edu.cn, gyzh@tsinghua.edu.cn, howie@gwu.edu, wang.zhufan1993@gmail.com, zwm-dcs@tsinghua.edu.cn

Abstract: As one of the most important deep learning models, convolutional neural networks (CNNs) have achieved great success in a number of applications such as image classification, speech recognition and natural language understanding. Training CNNs on large data sets is computationally expensive, leading to a flurry of research and development of open-source parallel implementations on GPUs. However, few studies have been performed to evaluate the performance characteristics of those implementations. In this paper, we conduct a comprehensive comparison of these implementations over a wide range of parameter configurations, investigate potential performance bottlenecks and point out a number of opportunities for further optimization.

Index Terms: Convolutional neural network, deep learning, GPU, performance evaluation, parallel computing.

I. INTRODUCTION

Convolutional neural networks (CNNs) are important deep learning models that have achieved great success in large-scale image classification [2], [9], [22], speech recognition [3], [4] and natural language understanding [5], [6], [7]. This can be attributed to advanced CNN architectures (such as AlexNet, VGGNet, GoogLeNet and OverFeat) [2], [12], [15], [22], large labeled training sets [16] and powerful computing devices such as GPUs.

The training cost of CNNs is very high for two reasons. First, CNNs are becoming more complicated as their depth and parameter counts increase. For example, AlexNet, the winner of ILSVRC-2012, has 8 layers (5 convolutional layers and 3 fully-connected layers) and more than 60 million parameters. VGGNet has 19 layers (16 convolutional layers and 3 fully-connected layers) and over 144 million parameters. Another recent model, GoogLeNet, comprises 22 layers with about 6.8 million parameters [15]. Training these large-scale CNNs requires thousands of iterations of forward and backward propagation, and is therefore very time-consuming. Second, the training sets are getting much larger. One of the early CNNs, LeNet-5, was trained to recognize handwritten digits on the MNIST data set, which contains only 60,000 images in the training set and 10,000 images in the testing set [8]. The CIFAR-10 [11] dataset consists of 60,000 32 × 32 color images, including 50,000 training images and 10,000 testing images. In contrast, a larger dataset called ImageNet was provided in 2009, including more than 1.2 million high-resolution images. Driven by industry groups like Google, YouTube, Twitter and Facebook, CNNs need to be trained on some very large datasets (e.g., text, audio and video). Again, training on those large-scale datasets requires significant runtime, and several weeks or months is not uncommon.

To address this challenge, using GPUs to accelerate the training process of CNNs is popular. During CNN training, the computation is inherently parallel and involves a massive amount of floating-point operations, e.g., matrix and vector operations. This computing pattern is well suited to the GPU computing model. Many emerging deep learning frameworks are highly optimized for GPUs with the CUDA programming interface, including cuda-convnet [2], cuda-convnet2 [18], Theano [19], Torch [20], Decaf [21] and Caffe [23]. Most of these frameworks are open source and support one or multiple GPUs. Moreover, some GPU-optimized libraries have been explored to accelerate CNNs, such as cuDNN [24] and fbfft [25].

However, few studies have been performed to enable a comprehensive evaluation of the performance characteristics of those implementations over a wide range of configurations. As our experiments and evaluations will show, each implementation has pros and cons, and there is no single implementation that performs well in all scenarios. The best performance is heavily dependent on the configuration.

The goal of this work is to assist practitioners in identifying the implementations that best serve their CNN computation needs in different scenarios, to provide insights and suggestions to practitioners, and to pinpoint aspects for researchers who are interested in convolution optimization on GPUs. In this paper, we conduct a head-to-head comparison of runtime to help identify the fastest implementation for a wide range of scenarios. Furthermore, we examine memory usage and shape limitations during GPU kernel execution. In addition, developing optimization schemes and implementations requires an understanding of how efficiently the computing power of GPUs has been exploited and where the potential performance bottlenecks of those implementations are. We thus conduct performance profiling to study the intrinsic characteristics of those implementations on GPU over different typical configurations.

The rest of this paper is organized as follows. In Section 2, we present an overview of the architecture of CNNs and three convolution strategies. In Section 3, we describe the experimental environment and evaluation methodology. In Section 4, we identify hotspot layers in CNNs and compare the running time of different implementations over a wide range of configurations. In Section 5, we analyze hotspot functions in the hotspot layers and evaluate the performance of each implementation on GPU. Finally, we conclude this paper in Section 6.

II. BACKGROUND

Understanding the architecture of CNNs is key to evaluating and optimizing convolution implementations. In this section, we present an overview of the architecture of CNNs and discuss the convolution strategies adopted by typical CNN implementations.

A. Convolutional Neural Networks

A CNN is a typical feed-forward neural network trained with the back-propagation (BP) algorithm, which adjusts learnable kernels so as to minimize the cost function. A convolutional neural network automatically provides some degree of shift and distortion invariance through three key ideas: local receptive fields, shared weights, and pooling [26].

The convolutional layer is the central part of a CNN. In a convolutional layer, each neuron of the same feature map applies the same weights over the input data at all possible positions to extract the corresponding features. The convolved results are organized into a set of two-dimensional feature maps. All neurons in a feature map share the same weights, which are called shared weights. Each neuron of the current layer is connected to a local region of the previous layer. This connectivity with a local region is called a local receptive field [26]. Pooling layers are optionally used after convolutional layers; they aim to reduce the spatial size of the feature maps and to control over-fitting to some extent.

We take LeNet-5 as a typical example to illustrate the architecture of CNNs. As shown in Figure 1, LeNet-5 is a stack of convolutional layers, pooling layers and two fully-connected layers. The input image is first fed to the input layer, and is then passed through a stack of convolutional and pooling layers. Convolutions using local receptive fields, shared weights and pooling are repeated until the last convolutional layer holds a set of relatively high-level features. Finally, those high-level features are mapped to a probability vector over ten different classes in the last two fully-connected layers.

Fig. 1: A simple CNN architecture (LeNet-5). Stages shown: Input, Convolution, Pooling, Convolution, Pooling, Full Connection, Output; the intermediate results are labeled Feature Maps.

B. Convolution Strategies

Recently, many deep learning frameworks and libraries have been developed to implement CNNs on GPUs, e.g., cuda-convnet [2], cuda-convnet2 [18], Theano [19], Torch [20], Decaf [21], OverFeat [22], Caffe [23], cuDNN [24] and fbfft [25]. Since convolutional layers are the central part of CNNs, researchers devote most of their effort to the design and optimization of convolutional layers, and have explored different kinds of convolution strategies. Mainstream CNN implementations follow three of them: direct convolution, unrolling-based convolution [32], [24], and FFT (Fast Fourier Transform)-based convolution. These strategies are described as follows.

Direct Convolution. This is the traditional way to compute convolution. During direct convolution, a small window slides within an input feature map and a dot product between the filter bank and the local patch of the input feature map is computed. The result of the dot product is then passed into a non-linear activation function, e.g., Sigmoid or Tanh. The results of this activation function are organized into a new feature map as output. Repeating the above process for each filter bank, we get a set of two-dimensional feature maps as the output of the convolutional layer. Representative implementations of direct convolution include cuda-convnet2 [18] and Theano-legacy [31].

Unrolling-Based Convolution. Unrolling-based convolution is a very efficient method on GPUs according to [32], [24]. The key idea behind unrolling convolution is to reshape the input and the filter bank into two large matrices. The local regions of the input image are unrolled into columns and the filter banks are unrolled into rows using im2col. The convolution can then be computed as a single matrix-matrix product by using highly-optimized libraries such as cuBLAS on GPUs [32]. Finally, the results are remapped back to the proper dimensions using col2im. Many new frameworks and libraries are based on this strategy, such as Caffe [23], Torch-cunn [20], Theano-CorrMM [19], and cuDNN [24].

FFT-Based Convolution. This strategy is based on the convolution theorem: a discrete convolution in the spatial domain is equivalent to a pointwise product in the Fourier domain. The performance of FFT-based convolution can be significantly better thanks to its lower computational complexity. In general, FFT-based convolution is implemented in three main steps. First, inputs and filter banks are transformed from the spatial domain to the Fourier domain with the Fast Fourier Transform (FFT). Second, those transformed matrices are multiplied in the Fourier domain. Finally, the products are transformed back from the Fourier domain to the spatial domain with an inverse FFT. This strategy is followed by fbfft [25] and Theano-fft [19].
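The three strategies described above can be sketched side by side in NumPy. This is only an illustrative CPU sketch, not code from any of the frameworks discussed: the function names are made up, stride is fixed to 1, there is no padding, and the activation function and the backward pass are left out. As in most CNN libraries, the "convolution" computed here is a cross-correlation (a dot product between the filter and each local patch).

```python
import numpy as np

def conv_direct(x, w):
    """Sliding-window dot products. x: (C, H, W), w: (F, C, K, K); stride 1, no padding."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    out = np.zeros((F, H - K + 1, W - K + 1))
    for f in range(F):
        for r in range(H - K + 1):
            for c in range(W - K + 1):
                out[f, r, c] = np.sum(x[:, r:r + K, c:c + K] * w[f])
    return out

def im2col(x, K):
    """Unroll every K x K patch (across all channels) into one column."""
    C, H, W = x.shape
    oh, ow = H - K + 1, W - K + 1
    cols = np.empty((C * K * K, oh * ow))
    for r in range(oh):
        for c in range(ow):
            cols[:, r * ow + c] = x[:, r:r + K, c:c + K].ravel()
    return cols

def conv_unrolled(x, w):
    """im2col followed by one matrix-matrix product (the GEMM step)."""
    F, C, K, _ = w.shape
    _, H, W = x.shape
    cols = im2col(x, K)               # shape (C*K*K, oh*ow)
    rows = w.reshape(F, C * K * K)    # each filter becomes one row
    return (rows @ cols).reshape(F, H - K + 1, W - K + 1)

def conv_fft(x, w):
    """Forward FFT, pointwise product in the Fourier domain, inverse FFT."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    X = np.fft.rfft2(x, s=(H, W))
    Wf = np.fft.rfft2(w, s=(H, W))    # filters are zero-padded to the input size
    # Conjugating the filter spectrum yields cross-correlation rather than
    # flipped-kernel convolution, matching the two functions above.
    Y = np.einsum("chw,fchw->fhw", X, np.conj(Wf))
    y = np.fft.irfft2(Y, s=(H, W))
    return y[:, :H - K + 1, :W - K + 1]   # keep only the part without wrap-around

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 12, 12))      # hypothetical 3-channel 12 x 12 input
w = rng.standard_normal((4, 3, 5, 5))     # hypothetical bank of four 5 x 5 filters
ref = conv_direct(x, w)
print(np.allclose(ref, conv_unrolled(x, w)), np.allclose(ref, conv_fft(x, w)))
```

All three functions should agree to within floating-point rounding. The sketch also makes the trade-offs visible: the unrolled variant materializes a matrix with C·K·K rows, and the FFT variant pads every filter to the full input size before transforming it.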

III. EXPERIMENTAL METHODOLOGY

A. Experimental Environment

We evaluate CNN implementations on a CPU-GPU hybrid system. Ubuntu 14.04.1 is installed on a machine with Intel Xeon E5-2620 2.10 GHz ×24 processors, 64GB main memory and a 1TB hard disk. A single K40c GPU card is used in our experiments. We use OpenCV 2.4.8 and CUDA Toolkit 7.5.

The K40c GPU card has excellent computing power due to its many-core architecture, large device memory, high memory bandwidth and floating-point throughput. The K40c card consists of 15 Streaming Multiprocessors (SMs), each with 192 processing units (a.k.a. CUDA cores). Each CUDA core can perform 2 floating-point operations per clock cycle and works at a maximum core clock rate of 745 MHz. Therefore, the 2880 (15 × 192) CUDA cores together provide a peak single-precision floating-point performance of 4.29 TFLOPS (2880 × 2 × 745 MHz). Each SM has 256KB of register files and 48KB of on-chip memory. The card is also equipped with 12GB of device memory and has 288 GB/s peak memory bandwidth. More details about CUDA and the GPU can be found in [1].

B. Evaluation Methodology

We select Caffe [23], Torch-cunn [20], Theano-CorrMM [19], Theano-fft [19], cuDNN [24], cuda-convnet2 [18], and fbfft [25] as representative implementations in our evaluation. It should be noted that we evaluate cuDNN-v3 in Caffe, fbfft in Torch and cuda-convnet2 with a Torch wrapper provided by convnet-benchmarks [28]. Our evaluation methodology falls into two groups: high-level workload profiling and detailed performance profiling.

For high-level workload profiling, we analyze the workload from two aspects.
- We conduct a hotspot layer analysis for those CNN implementations by profiling four typical CNN models (i.e., AlexNet, GoogLeNet, VGG, and OverFeat).
- For hotspot layers, we conduct a head-to-head speed comparison across those seven implementations, with varying batch sizes, input sizes, filter numbers, kernel sizes and strides, and analyze the strengths and weaknesses of those implementations with respect to shape limitations.

For detailed performance profiling, we conduct four sets of experiments as follows. The goal is to explore the reasons behind the performance differences between those implementations.
- For the aforementioned hotspot layers, we identify the top kernels that dominate the total runtime.
- We compare peak GPU memory usage for those implementations over a wide range of configurations.
- With the nvprof tool [14] provided by NVIDIA, we profile and analyze those top kernels in five important metrics and two events.
- We evaluate the overheads of data transfers between CPU and GPU over five typical configurations.

IV. HIGH-LEVEL WORKLOAD PROFILING

In this section, we perform high-level workload profiling. First, we break down four popular CNN models to investigate where the hotspot layers are during their training iterations. Second, we compare the hotspot layers of those CNN implementations in terms of runtime over a large parameter space.

A. Hotspot Layer Analysis

The hotspot layer analysis helps in understanding the flow of CNN applications and in identifying the hotspot layers that dominate the total runtime of CNN models. We break down four popular real-life CNN models, i.e., AlexNet, GoogLeNet, OverFeat and VGG, to collect the runtime of each layer and identify the hotspot layers for each model. The runtime we collect is the average runtime of each layer over 10 training iterations. Each training iteration includes one forward propagation and one backward propagation.

Fig. 2: Runtime breakdown of typical real-life CNN models: GoogLeNet, VGG, OverFeat and AlexNet.

Results. As shown in Figure 2, those real-life models mainly comprise convolutional layers (Conv Layer), pooling layers, ReLU layers, fully connected layers (FC Layer) and a Concat layer (in GoogLeNet). Convolutional layers consume the bulk of the total runtime (86%, 89%, 90% and 94% respectively in the four CNN models).

Analysis. A convolutional layer involves a large amount of computation-intensive operations and requires a substantial amount of computing resources. Especially for modern advanced CNN models, the computing cost of convolutional layers is getting much higher due to the increasing number of filters and layers, smaller strides and their combinations [17]. Therefore, we primarily focus on evaluating the performance of the convolutional layer in this paper.

B. Runtime Comparison

We run five groups of experiments, with runtime averaged over 10 iterations on the GPU, to compare the total runtime of a single convolutional layer across the seven implementations (Caffe, cuDNN, cuda-convnet2, Theano-CorrMM, Theano-fft, Torch-cunn and fbfft) with respect to different mini-batch sizes, input image sizes, filter numbers, kernel sizes and strides. For a better performance comparison, the total runtime we measure does not include the time of network initialization and data preparation. We organize those 5 parameters into a 5-tuple (b, i, f, k, s) similar to [35], where b is the mini-batch size, i the input size, f the filter number, k the kernel size and s the stride.
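The 5-tuple notation can be written down as a small data structure. The sketch below is only an illustration: the class name `ConvConfig` and its methods are hypothetical, and the output-size formula assumes square inputs and kernels with no padding, which the paper does not state explicitly. The example values are the paper's base configuration (64, 128, 64, 11, 1).

```python
from typing import NamedTuple, Tuple

class ConvConfig(NamedTuple):
    """Hypothetical container for the (b, i, f, k, s) 5-tuple."""
    b: int  # mini-batch size
    i: int  # input height and width (square input)
    f: int  # number of filters
    k: int  # kernel height and width (square kernel)
    s: int  # stride

    def output_size(self) -> int:
        # Assumes no zero-padding around the input.
        return (self.i - self.k) // self.s + 1

    def output_shape(self) -> Tuple[int, int, int, int]:
        o = self.output_size()
        return (self.b, self.f, o, o)

base = ConvConfig(b=64, i=128, f=64, k=11, s=1)
print(base.output_shape())  # (64, 64, 118, 118)
```

Varying one field at a time, for example `base._replace(k=7)`, gives a one-parameter sweep while the other four values stay fixed.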

In order to investigate how each parameter impacts the overall performance of the convolutional layer, our evaluation is divided into five groups. Each group varies only one of the parameters, and the other four parameters are fixed. All input images and kernels are square and we use the basic configuration 5-tuple (64, 128, 64, 11, 1). According to the five different parameters, we have five groups of 5-tuples: (b, 128, 64, 11, 1), (64, i, 64, 11, 1), (64, 128, f, 11, 1), (64, 128, 64, k, 1) and (64, 128, 64, 11, s). Taking the first tuple as an example, we vary the mini-batch size while fixing the other four parameters. In addition, we also observe the shape limitations of each implementation during the runtime comparison.

Fig. 3: Runtime comparison for seven convolutional implementations on GPU with varying configurations.

Results. Figure 3(a) and (b) show the speed of the seven implementations for different mini-batch sizes and input sizes, which range from 32 to 512 in multiples of 32 and from 32 to 256 in multiples of 16, respectively. The runtime clearly shows the advantage of fbfft over the other implementations (from 1.4× to 9.7×) for all given mini-batch and input sizes, while Theano-fft is the slowest. Among the unrolling-based convolutions, cuDNN has consistently superior performance for all given mini-batch and input sizes. The performance of cuda-convnet2 is not stable across mini-batch sizes. It performs well only when the mini-batch size is a multiple of 128.

In Figure 3(c), the filter number ranges from 32 to 512 in multiples of 16. In this configuration space, fbfft is consistently faster than the other implementations (from 1.19× to 5.1×), while Theano-fft still has the worst performance. Cuda-convnet2 cannot support all given filter numbers in our experiment and thus its runtime on GPU is reported with dots in Figure 3(c). Among the unrolling-based convolutions, Theano-CorrMM slightly outperforms its counterparts for large filter numbers (greater than 160 in our experiment).

In Figure 3(d), for small kernel sizes (smaller than 7 in our experiment), cuDNN and Theano-CorrMM perform better than the others. For example, the speed advantage of cuDNN over fbfft ranges from 1.21× to 2.62×. But as the kernel size increases (greater than 7), the runtime of fbfft tends to a constant value and its performance advantage becomes increasingly obvious. For example, fbfft becomes increasingly faster than cuDNN (from 1.15× to 19×). In addition, the performances of cuda-convnet2 and cuDNN are very close for all given kernel sizes.

In Figure 3(e), fbfft outperforms the other implementations when the stride is 1. Because fbfft and Theano conv2d_fft only support a stride of 1, their runtime is denoted as a single point in the figure. For larger strides (greater than 1), cuDNN gives the best performance.

Analysis. The speed of each implementation varies with the configuration and there is no single implementation that is the fastest for all given scenarios in our experiments. We summarize the main observations from the runtime comparison as follows:
- fbfft is the overall fastest convolutional implementation and cuDNN performs second best in most scenarios.
- For small kernels (smaller than 7 in our experiment), cuDNN outperforms fbfft. Otherwise, fbfft is faster than cuDNN.
- For unrolling-based convolution, cuDNN is the overall fastest implementation. But for large filter numbers (greater than 160 in our experiment), Theano-CorrMM slightly outperforms cuDNN.
- cuda-convnet2 performs well only for certain cases, such as mini-batch sizes that are multiples of 128.

In most scenarios, fbfft is much faster due to its low arithmetic complexity compared with unrolling-based convolution and direct convolution. cuDNN is much slower than fbfft when computing convolution with a large kernel size (larger than 7 in our experiment). But for a small kernel size (smaller than 7 in our experiment), fbfft is a bit slower than cuDNN. In essence, this arises from the differences between their convolution strategies. fbfft benefits significantly from the dramatic reduction of arithmetic complexity when running with a large kernel size. But for a small kernel, the computational cost of fbfft is higher than that of its counterparts, which leads to a lower speed. It is important to note that fbfft and Theano-fft share a similar convolution strategy, but they present a clear difference in performance.
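The kernel-size trend discussed above can be illustrated with a rough operation count. The model below is not taken from the paper or from any of the benchmarked libraries. It assumes no padding, a hypothetical input-channel count `c` (the 5-tuple does not specify one), filters zero-padded to the input size for the FFT path, and it ignores every constant factor, memory traffic and kernel-launch overhead; the function names are made up for illustration.

```python
import math

def direct_macs(b, i, f, k, s, c=3):
    """Multiply-accumulates of a direct (or unrolled GEMM) convolution."""
    o = (i - k) // s + 1              # output height/width, no padding
    return b * f * c * o * o * k * k

def fft_ops(b, i, f, k, c=3):
    """Very rough operation count of an FFT-based convolution (stride 1 only)."""
    n = i                             # filters are zero-padded to the input size
    fft_cost = n * n * math.log2(n)   # one 2-D FFT, constant factors ignored
    # Forward transforms of inputs and filters, inverse transforms of outputs.
    transforms = (b * c + f * c + b * f) * fft_cost
    pointwise = b * f * c * n * n     # products in the Fourier domain
    return transforms + pointwise

# Sweep the kernel size around the paper's base configuration (64, 128, 64, k, 1).
for k in (3, 5, 7, 9, 11, 13):
    d = direct_macs(64, 128, 64, k, 1)
    t = fft_ops(64, 128, 64, k)
    print(f"k={k:2d}  direct={d:.3e}  fft={t:.3e}  ratio={d / t:.2f}")
```

Under these assumptions the FFT count does not depend on k at all, while the direct count grows roughly with k². That is consistent with the nearly constant fbfft runtime reported for Figure 3(d), although the measured crossover point depends on constants that this model leaves out.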

Because of different implementation techniques, fbfft is much faster than Theano-fft. Cuda-convnet2 was optimized for mini-batch sizes that are a multiple of 128, and thus performs well only in those cases.

Summary. From the perspective of speed, fbfft is the fastest implementation for training a CNN model with large kernels. For small kernels, cuDNN would be a good choice. Moreover, for a model with small kernels and a large filter number, Theano-CorrMM slightly outperforms the other implementations.

From the perspective of shape restrictions, unrolling-based implementations are the most flexible in configuration selection as they support any possible shape. Cuda-convnet2 only supports square input images and square kernels; its mini-batch size must be a multiple of 32 and its filter number must be a multiple of 16. FFT-based convolutions (i.e., fbfft and Theano-fft) are applicable to any configuration shape except that their stride must be 1.

V. DETAILED PERFORMANCE PROFILING

In this section, we primarily focus on performance profiling of the convolutional layer in each implementation. First, we conduct a detailed hotspot kernel analysis to look more closely inside each convolutional implementation. Second, we evaluate the memory usage of each convolution implementation. Third, we report a comprehensive profiling and analysis of the GPU performance of those convolution implementations. Finally, we evaluate the data transfer overhead between CPU and GPU.

A. Hotspot Kernels in Convolutional Layer

A convolutional layer in each implementation consists of multiple kernels, and it is worthwhile to figure out which kernels determine the overall performance of convolutional layers. The analysis of hotspot kernels helps in understanding and identifying which kernels dominate the total runtime of the convolutional layer.

For different configurations, the convolutional layer in the same implementation shows similar hotspot kernel results. We thus choose one configuration, (64, 128, 64, 11, 1), which denotes a mini-batch size of 64, a square input of size 128, 64 filters, a square kernel of size 11 and a stride of 1, as the representative for analyzing hotspot kernels. Based on the profiling results, we group kernels that have the same functionality into one. Take GEMM (General Matrix to Matrix Multiplication) as an example: all kernels that are responsible for matrix-matrix or matrix-vector multiplications are classified as GEMM.

Fig. 4: Runtime breakdowns of convolutional layers in different implementations.

Results. Figure 4 shows the hotspot kernels of the convolutional layer of each implementation in terms of percentages of runtime. As we can see, different convolution strategies result in totally different hotspot kernels. Even for the same convolution strategy, the kernels can be clearly different due to different implementation methods. According to Figure 4(a, b, c), for unrolling-based convolution, Caffe, Torch-cunn and Theano-CorrMM have similar hotspot kernel results, in which GEMM operations take up 87%, 83% and 80% of their total runtime respectively. But the hotspot kernels of cuDNN are totally different from those of its counterparts (Caffe, Torch-cunn and Theano-CorrMM) due to its different kernel implementations. As shown in Figure 4(d), wgrad_alg0_engine and cuDNN_gemm dominate the runtime of cuDNN. cuda-convnet2 computes convolutional layers directly, which is mainly achieved by three kernels: filterActs_YxX_color, img_acts_color and conv_weight_acts_c_preload.

Analysis. We summarize some observations as follows:
- GEMM operations are the essence of convolutional layers. Especially in unrolling-based convolution, GEMMs dominate the total runtime, followed by unrolling operations.
- For FFT-based convolution, GEMM, FFT transform, FFT inverse and data transposition account for most of the runtime in fbfft. On the contrary, most of the runtime is spent on data preparation and data transfer between CPU and GPU in Theano-fft.

For unrolling-based convolution, in Caffe, Torch-cunn and Theano-CorrMM, im2col_gpu_kernel and col2im_gpu_kernel mainly take up the rest of the runtime. im2col_gpu_kernel is used to unroll the input data and filters into two large matrices, so that the traditional convolution can be converted into a single matrix-matrix multiplication that uses highly-optimized GEMM libraries. col2im_gpu_kernel is used to convert the multiplication result back to the right format, the same as the format before unrolling. In cuDNN, the unrolling operations and matrix-matrix multiplications are optimized by using shared memory and tiled matrix multiplication [24], which is mainly achieved by the wgrad_alg0_engine and cuDNN_gemm kernels.

For FFT-based convolution, the computation of convolutional layers is mainly achieved in three steps in fbfft. First, the kernel decimateInFrequency uses the DIF algorithm to transform input and weight data from the spatial domain to the frequency domain. Second, the Transpose kernel converts the BDHW layout into HWBD and then Cgemm matrix multiplications are conducted. Third, the Transpose kernel converts the Cgemm results back to the BDHW layout and an inverse FFT is performed by decimateInFrequencyInverse [25].

Summary. GEMM is the essence of convolutional layers in unrolling-based implementations, which indicates that the kernels responsible for GEMM computing are the first-order modules to be optimized. So are FFT and Cgemm in fbfft.

B. Memory Usage

For most applications at present, memory is not the primary limitation, and the fastest algorithm is considered the best algorithm. As a result, a common way to rank algorithms is by their computing speed. However, a GPU cannot accommodate a memory-hungry application because of its limited device memory. Thus memory usage should also be considered a significant factor on GPUs.

Fig. 5: Memory usage comparison for seven convolutional implementations on GPU with varying configurations.

Results. We use nvidia-smi to monitor memory usage on the GPU for each implementation. Figure 5 shows the peak memory consumption of the seven convolutional implementations when varying the same parameters as in the runtime comparison. In all given scenarios of our experiments, cuda-convnet2 has the lowest consumption of GPU memory (from 125 MB to 2076 MB), followed closely by Torch-cunn (from 170 MB to 2093 MB). The other three unrolling-based implementations, cuDNN, Caffe and Theano-CorrMM, have a relatively higher consumption (from 155 MB to 3810 MB, from 136 MB to 3809 MB and from 130 MB to 3709 MB respectively). On the contrary, FFT-based convolution has the highest consumption of GPU memory. Taking fbfft as an example, it consumes a large amount of GPU memory, from 1632 MB to 10866 MB in our experiments. There are also several abnormal memory consumptions in the FFT-based implementations. Figure 5(b) shows that there are dramatic fluctuations in the memory usage of fbfft over certain input sizes. The same fluctuation can also be observed for fbfft and Theano-fft in Figure 5(d). Such abnormal memory usage can lead to program crashes, which we will investigate as part of future work.

Analysis. We summarize the main observations from the above results as follows:
- cuda-convnet2 is the most memory-efficient implementation in all scenarios given in our experiment.
- Torch-cunn is the overall most memory-efficient unrolling-based implementation, while as the kernel size increases, cuDNN becomes the most memory-efficient one.
- fbfft requires the most memory, followed by Theano-fft.

Cuda-convnet2 computes the convolution directly and thus does not need temporary memory to keep intermediate data. Compared with cuda-convnet2, Caffe, Theano-CorrMM and Torch-cunn require extra memory to store the unrolled matrices, but there are still slight differences in memory usage among them due to different data layouts and programming techniques. Although cuDNN does not need extra memory for unrolling, it consumes more memory than the other unrolling-based implementations to achieve better performance.

On the contrary, the low computational complexity of FFT-based implementations and highly optimized CUDA code give fbfft excellent speed, however, at the expense of unreasonable memory consumption. The main reason is that FFT-based implementations require substantial amounts of temporary memory to keep intermediate data such as the input and filter data in the Fourier domain, and they also need extra memory for zero-padding to extend the filter bank to the same size as the input. Therefore, when choosing a CNN implementation, a trade-off between speed and memory consumption needs to be considered.

Summary. Cuda-convnet2 is well suited to cases where memory is limited. Otherwise, fbfft is a great choice for computing convolutional layers. If a good balance between memory, speed and flexibility is needed, cuDNN is most likely the best choice.

TABLE I: Convolution configurations for … 2,128,9,1), (128,16,128,7,1), (128,13,384,3,1)

C. GPU Performance Evaluation

In this subsection, we conduct a detailed runtime profiling study based on the nvprof CUDA tool. Metrics and events are collected with nvprof to analyze kernel performance during kernel execution. An event collects hardware counter values during kernel execution, and a metric is computed from one or more event values to identify characteristics of a CUDA application [14]. To investigate the performance differences among the seven convolution implementations, we use the following metrics to profile GPU performance [14]:
- achieved_occupancy is the ratio of the average active warps per active cycle to the maximum number of warps supported on an SM.
- ipc is the number of instructions executed per cycle.
- warp_execution_efficiency is the ratio of the average active thread

We evaluate CNN implementations on a CPU-GPU hybrid system. Ubuntu 14.04.1 is installed on a machine with Intel Xeon E5-2620 2.10 GHz 24 processor, 64GB main memory and 1TB hard disk. A single K40c GPU card is used in our experiments. We use openCV 2.4.8 and CUDA Toolkit 7.5. The K40c GPU card has an excellent computing power
