Compact Deep Convolutional Neural Networks for Image Classification


Compact Deep Convolutional Neural Networks for Image Classification

Zejia Zheng, Zhu Li, Abhishek Nagar^1 and Woosung Kang^2

Abstract: Convolutional Neural Networks are efficient at learning hierarchical features from large datasets, but their model complexity and large memory footprint prevent them from being deployed to devices without server backend support. Modern CNNs are typically trained on GPUs or even GPU clusters with high-speed computation power because of the immense size of the network. Methods for regulating the size of the network, on the other hand, are rarely studied. In this paper we present a novel compact architecture that minimizes the number of lower level kernels in a CNN by separating the color information from the original image. A 9-patch histogram extractor is built to exploit the separated color information. A higher level classifier learns the combined features from the compact CNN, which is trained only on grayscale images with a limited number of kernels, and from the histogram extractor. We apply our compact architecture to CIFAR-10 and the Samsung Mobile Image Dataset. The proposed architecture has a recognition accuracy on par with that of state of the art CNNs, with 40% fewer parameters.

I. INTRODUCTION

The Convolutional Neural Network (CNN) [7] is one of the leading image classification architectures for hierarchical feature extraction. CNNs have been reported to have state of the art performance on many image recognition and classification tasks, including handwritten digit recognition [6], house number recognition [10], traffic sign classification [2], and 1000-class ImageNet dataset classification and localization [5], [11].

Despite this encouraging progress, there is still limited research on compact convolutional neural networks that can be easily deployed on a mobile device. The large number of parameters inside current state of the art CNNs makes it hard for mobile devices to label an arbitrary RGB image in a short time.
In this paper we propose a CNN based architecture that uses a minimal number of lower level kernels while maintaining the high performance of a CNN with more parameters in its lower level layers. The network is trained only on grayscale images, so both the size and the number of the kernels in the first layer can be reduced. This leads to a 40% drop in the final size of the network. The loss in network capability introduced by limiting the lower level feature extractor is compensated by a carefully crafted color histogram feature vector extracted from patches of the original image. Several different configurations of the combination are tested.

1 Zejia Zheng, Zhu Li, Abhishek Nagar are with Samsung Research America's Multimedia Core Standards Research Lab in Richardson, TX. Zejia Zheng is also a Ph.D. student studying at Michigan State University, EI Lab.
2 Woosung Kang is with Samsung Electronics, Seoul, Korea.

We report experimental results on CIFAR-10 and the Samsung Mobile Image Dataset. The CIFAR-10 dataset has been heavily tested by previous works [4] [5] [8]. Our results show that the compact architecture achieves similar performance with a minimal number of parameters (40% fewer). The Samsung Mobile Image Dataset is a hierarchical dataset with more than 80000 images and 31 class labels. The images have higher resolution than those in the CIFAR-10 dataset, which makes the histogram feature vector more useful. As a result, combining the hand crafted histogram feature vector with the CNN final feature vector improves the accuracy of the CNN classifier (on grayscale images) by 4%, matching the performance of a single CNN trained on RGB images.
The final architecture is much more compact than the original version, while its performance is similar to, and sometimes better than, that of a single CNN trained on RGB images.

Our contributions in this paper can be summarized as follows:

- We propose a compact architecture based on the combination of a CNN and a hand-crafted color histogram feature extractor. The proposed architecture minimizes network size by separating color information from the original image, thus limiting the number of kernels required to extract features from the grayscale input. The compact network has 40% fewer parameters to tune, but it maintains the performance of the original CNN trained on RGB images.
- We apply our compact network to a hierarchical dataset (i.e. the Samsung Mobile Image Dataset) with clean basic categories and confusing subcategories. The experimental results reveal that the hand crafted feature (i.e. the 9-patch color histogram) helps the network clarify the boundaries among classes in the same basic category. Global and local histogram vectors are more useful when the image contains more information (i.e. high resolution).

II. RELATED WORK

A. Convolutional Neural Network

In recent years commercial and academic datasets for image classification have been growing at an unprecedented pace. The SUN database for scenery classification contains 899 categories and 130,519 images [14]. The ImageNet dataset contains 1000 categories and 1.2 million images [5]. In response to this immensely increased complexity, a great many researchers have focused on increasing the depth of classifiers to capture invariance and useful features.

Among the many available deep architectures, the Convolutional Neural Network (CNN) is reported to have the leading performance on many image classification tasks.

Overfeat, a CNN-based image feature extractor and classifier, scored a 29.8% error rate on the classification and localization task of the ImageNet 2013 dataset. Clarifai, a hierarchical architecture of a CNN and a deconvolutional neural network, achieved an 11.19% error rate on the ImageNet 2013 classification task [15].

It has also been reported that the performance of a CNN is highly correlated with the number of layers. Winners of the competitions mentioned above have millions of parameters to tune, which requires a large number of training samples. The ILSVRC 2012 challenge winning CNN by Krizhevsky has around 60 million parameters [5]. Overfeat, the ILSVRC 2013 challenge winning CNN, has more than 140 million parameters [11]. These networks are typically trained on a GPU machine or GPU clusters for better performance.

As introduced in the previous section, our goal in this paper is to find a compact architecture that balances the size of the network against its performance on device based image classification tasks. Thus we do not attempt to outperform the existing works in [5], [11] on CIFAR-10.

B. Histogram-based Classification

Color histograms are widely used to compare images despite the simplicity of the method. They have been shown to perform well on image indexing with relatively small datasets [12]. Color histograms are trivial to compute and tend to be robust against small changes in camera viewpoint, which makes them a good compact image descriptor. It was also reported in [1] that the performance of a histogram based classifier improved when the higher level classifier was a support vector machine.

However, when applied to large datasets, histogram based classifiers tend to perform poorly because of high variance within the same category. It is also observed that images with different labels may share similar histograms [9].

In this work, we propose a novel architecture that combines the histogram-based classification method with a CNN.
The histogram representation of color information helps the CNN exploit color information in the original image. This means that we can minimize the size of the basic feature detectors (i.e. layer 1 of the CNN). The proposed architecture is introduced in the following section.

III. COMPACT CNN WITH COLOR DESCRIPTOR

A. Deep Convolutional Neural Networks

We use the architecture of Krizhevsky et al. [5] to train the 'original' CNN in the experiments. We then modify layer 1 by changing the kernel size (from 5×5×3 to 5×5×1) and the number of kernels (from 64 to 32) in later experiments. The details of the experiments are introduced in the next section.

We trained two CNNs with different numbers of kernels in the first layer: an original version and a compact version. The 'original' network is an exact replica of the CNN reported in [4], which gave a final error rate of 13% using multi-view testing. In this work, however, we only use single view testing when reporting the final result for both the original CNN and the compact CNN.

Both the original and the compact CNNs have four convolution layers. Table I shows the details of the two networks when trained on cropped images from the CIFAR-10 dataset. Our compact CNN is marked in bold font to show the difference. There are only 32 kernels in the first layer of the compact CNN, while the number is 64 in the original CNN. This cuts the number of parameters in layer 3 (i.e. the 2nd convolution layer) by 50%, so the final compact CNN has 40% fewer parameters to tune than the original version.

The convolution operation is expressed as:

$$y^{j(r)} = \mathrm{ReLU}\Big(b^{j(r)} + \sum_i k^{ij(r)} * x^{i(r)}\Big) \tag{1}$$

where $x^i$ is the ith input map and $y^j$ is the jth output map. $k^{ij}$ is the convolution kernel corresponding to the ith input map and the jth output map. $r$ indicates a local region on the input map where the weights are shared.

The ReLU non-linearity (i.e. $\mathrm{ReLU}(x) = \max(0, x)$) is used in the network.
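The parameter reductions quoted above can be checked by counting kernel weights layer by layer. The following is a minimal sketch, not the authors' code. It assumes that only weights are counted (no biases) and that the fully connected layer takes the 6x6x32 output of layer 6 as its input; the function names are hypothetical. Under these assumptions the totals agree with the values listed in Table II.

```python
# Weights-only parameter count for the layers described in Table I.
# Assumptions (not stated in the paper): biases are ignored and the
# fully connected layer maps a 6*6*32 vector to 10 class scores.

def conv_weights(kernel_h, kernel_w, in_channels, num_filters):
    """Number of weights in a convolution layer."""
    return kernel_h * kernel_w * in_channels * num_filters


def fc_weights(num_inputs, num_outputs):
    """Number of weights in a fully connected layer."""
    return num_inputs * num_outputs


def total_weights(input_channels, layer1_filters):
    """Total weights for a network with the given input depth and layer-1 width."""
    layer1 = conv_weights(5, 5, input_channels, layer1_filters)
    layer3 = conv_weights(5, 5, layer1_filters, 64)  # input depth follows layer 1
    layer5 = conv_weights(3, 3, 64, 32)
    layer6 = conv_weights(3, 3, 32, 32)
    layer7 = fc_weights(6 * 6 * 32, 10)
    return layer1 + layer3 + layer5 + layer6 + layer7


original_gray = total_weights(input_channels=1, layer1_filters=64)
original_rgb = total_weights(input_channels=3, layer1_filters=64)
compact = total_weights(input_channels=1, layer1_filters=32)

# Relative saving of the compact network against the RGB original.
reduction_vs_rgb = 1 - compact / original_rgb

print(original_gray, original_rgb, compact)  # 143168 146368 91168
print(round(reduction_vs_rgb, 3))            # 0.377
```

Halving the number of layer 1 kernels halves the input depth of layer 3, which is where most of the saving comes from: 5x5x64x64 = 102400 weights drop to 5x5x32x64 = 51200.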
It is observed that ReLU yields better performance and faster convergence when the network is trained by error back propagation [5].

Max pooling is done in a 3×3 sliding window with a stride of 2×2 in layer 2 and layer 4. This helps the network extract the most prominent low level features and reduces the size of the feature vector.

More details about the experimental setup can be found in Fig. 1 and Table I.

TABLE II
SIZE OF DIFFERENT CONFIGURATIONS

CNN configuration            Total number of parameters
original grayscale (k = 1)   143168
original RGB (k = 3)         146368
Compact architecture         91168

B. Color Information

A color is represented by a three dimensional vector corresponding to a point in the color space. We choose red-green-blue (RGB) as our color space, which is in bijection with hue-saturation-value (HSV).

HSV may seem attractive in theory for a classifier based purely on histograms. The HSV color space separates the color component from the luminance component, making the histogram less sensitive to illumination changes. However, this does not seem to be important in practice: [1] reports minimal improvement when switching from the RGB color space to the HSV color space.

The reason for choosing RGB is that its three channels share the same range (i.e. 0 to 255), which makes normalization easier.

We experiment with three different configurations of the color histogram:

1) Global histogram, 48 bins. In this setup we examine whether global color information helps with the classification.
2) 9-patch histogram, 192 bins. The 9 patches are generated as shown in Fig. 1. Because the CIFAR-10 dataset contains only 32 by 32 images, which makes it harder to extract useful histograms, the numbers of bins in this setup are 48, 2×24, 2×24, and 4×12.
3) 9-patch histogram, 384 bins. The numbers of bins are doubled compared to the previous setup.

These experiments on histogram configuration are carried out solely on the CIFAR image dataset. This series of experiments serves as a guideline for our experiment on the Samsung Mobile Image Dataset.

TABLE I
ORIGINAL AND COMPACT CNN ARCHITECTURE (CIFAR-10)

layer    operation        orig. input  comp. input  orig. filter  comp. filter  orig. num  comp. num  pool  stride  output
layer 1  conv             24x24xk      24x24x1      5x5xk         5x5x1         64         32         -     1x1     24x24x64
layer 2  max              24x24x64     24x24x32     -             -             -          -          3x3   2x2     12x12x64
layer 3  conv             12x12x64     12x12x32     5x5x64        5x5x32        64         64         -     1x1     12x12x64
layer 4  max              12x12x64     12x12x64     -             -             -          -          3x3   2x2     6x6x64
layer 5  conv             6x6x64       6x6x64       3x3x64        3x3x64        32         32         -     -       -
layer 6  conv             6x6x32       6x6x32       3x3x32        3x3x32        32         32         -     -       -
layer 7  fully connected  6x6x32       6x6x32       -             -             -          -          -     -       10x1
layer 8  softmax          10x1         10x1         -             -             -          -          -     -       10x1

C. Combined Architecture

Once the CNN is trained for the classification task with the grayscale version of the training set, we replace the fully connected layer and the softmax layer (i.e. layers 7 and 8 as shown in Table I) with a new fully connected layer and a new softmax layer trained on the combined feature vector, using the feature vector from the same training set.

The combined feature vector is generated by Algorithm 1.

Algorithm 1: EXTRACT NEW FEATURE VECTOR
    Input: image I, total patch number k
    Output: combined feature vector vec_combined
    segment I into {I_i, i = 1, 2, ..., k};
    extract histogram vector hist_vec from {I_i};
    resize I to CNN input size, feed I into CNN;
    extract layer 6 output cnn_layer_6_vec from CNN;
    reshape cnn_layer_6_vec to a one dimensional vector cnn_vec;
    vec_combined = concatenate(cnn_vec, hist_vec);
    return vec_combined

With the new feature vector extracted from both the training set and the testing set, we train a new layer 7 (fully connected layer) and layer 8 (softmax layer) on the combined feature vectors extracted from the training set.

The purpose of this work is to find a compact architecture by combining a handcrafted feature representation with the final feature vector from the CNN. To make a clear comparison, we evaluate the performance of the combined classifier with several different setups:

1) Cropped images and uncropped images. Training on cropped images means that we feed patches of the image into the network instead of the original image. This allows the network to train with relatively more samples, but would jeopardize recognition for certain classes in the Samsung Mobile Image Dataset (e.g. upper body and whole body).
2) Colored images and grayscale images. We use recognition on uncropped color images as the baseline for performance evaluation.
The proposed compact architecture, however, separates color information from the original image and feeds only the grayscale image to the pretrained CNN.
3) CIFAR-10 dataset and Samsung Mobile Image Dataset. We use the CIFAR-10 dataset to test different configurations of histograms and several data augmentation methods. The results on CIFAR-10 serve as a guideline for constructing a compact classifier for the Samsung Mobile Image Dataset, a hierarchical dataset collected at Samsung Research America.

Details about these experiments are reported in the following section. In short, we found that the proposed compact architecture trained on cropped grayscale images maintains the high accuracy of the original CNN trained on cropped RGB images.

IV. EXPERIMENT

A. CIFAR dataset

The CIFAR-10 dataset consists of 60000 32 by 32 color images from 10 basic categories. The class labels are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 6000 images per class, with 50000 training images and 10000 test images. Each image in the dataset is assumed to be easy for a human to name without ambiguity. The dataset was collected by Krizhevsky and Hinton and is reported in [4].

CIFAR-10 has been heavily tested with many classification methods. Krizhevsky et al. [5] achieved a 13% test error rate when using their ILSVRC 2012 winning CNN architecture (without normalization). By generalizing Hinton's dropout [3] into suppression of weight values instead of activation values, Wan et al. [13] reported a test error rate of 9.32% using their modified Convolutional Neural Network, DropConnect. Lin et al. [8] replaced the ReLU convolutional layer in Krizhevsky's architecture [5] with a convolutional multi-layer perceptron. They reported a test error rate of 8.8%, currently ranking top on the leader board of classification on the CIFAR-10 dataset.

Fig. 1. Compact CNN with histogram based color descriptor. We separate color information from the original image by feeding the CNN only the grayscale image. The color histogram is combined with the final feature vector. This figure shows how an image from the Samsung Mobile Image Dataset is classified, as described in section IV-B. Image size and the number of bins in a histogram are reduced accordingly when testing on CIFAR-10. There are only 32 filters in layer 1, compared to 64 filters in layer 1 of the original network. The performance of the compact architecture, however, is similar to that of the original architecture, with the network size 40% smaller when testing on CIFAR-10 and 20% smaller when testing on the Samsung Mobile Image Dataset.

Despite the improvements and variations described above, our experiment in this work is still based on Krizhevsky's architecture as described in [5]. The goal of this paper is to study the contribution of color information to CNN based image classification, and to seek possible combinations of hand crafted feature vectors and CNN extracted feature vectors that further exploit the low level features with a limited number of parameters. For these reasons we apply our modifications to a standard CNN architecture as provided by Krizhevsky in [5]. We believe that the combined architecture can also be applied to other CNN variants with few modifications.

1) Getting Histogram: Because CIFAR consists of images with only 1024 pixels, extracting a large histogram vector would be meaningless. Therefore we only extract a global histogram of 48 bins from the original image in our first experiment.
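The histogram extraction and the concatenation step of Algorithm 1 can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the paper gives only total bin counts, so splitting each patch's bins evenly over the three RGB channels (16, 8 or 4 bins per channel), normalizing each patch histogram to sum to 1, and the order of the patches are all assumptions. The function names are hypothetical, and a zero array stands in for the CNN's layer 6 output. The 9-patch layout (whole image, four halves, four corners) follows the description in this section.

```python
import numpy as np


def channel_histogram(patch, bins_per_channel):
    """Concatenated per-channel histogram of an H x W x 3 uint8 patch."""
    hists = [
        np.histogram(patch[..., c], bins=bins_per_channel, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(np.float32)
    # Assumed normalization: each patch histogram sums to 1.
    return hist / max(float(hist.sum()), 1.0)


def global_histogram(image, total_bins=48):
    """Global histogram; total_bins is split evenly over the 3 channels."""
    return channel_histogram(image, total_bins // 3)


def nine_patch_histogram(image):
    """192-bin descriptor: 48 (whole) + 4 * 24 (halves) + 4 * 12 (corners)."""
    h, w = image.shape[:2]
    top, bottom = slice(0, h // 2), slice(h // 2, h)
    left, right = slice(0, w // 2), slice(w // 2, w)
    halves = [image[:, left], image[:, right], image[top], image[bottom]]
    corners = [image[top, left], image[top, right],
               image[bottom, left], image[bottom, right]]
    parts = [channel_histogram(image, 16)]
    parts += [channel_histogram(p, 8) for p in halves]
    parts += [channel_histogram(p, 4) for p in corners]
    return np.concatenate(parts)


def combine_features(cnn_layer_6_out, hist_vec):
    """Flatten the CNN feature map and append the histogram vector."""
    cnn_vec = np.asarray(cnn_layer_6_out, dtype=np.float32).reshape(-1)
    return np.concatenate([cnn_vec, hist_vec])


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
cnn_out = np.zeros((6, 6, 32), dtype=np.float32)  # stand-in for layer 6 output

hist_global = global_histogram(image)
hist_patches = nine_patch_histogram(image)
vec_combined = combine_features(cnn_out, hist_patches)
print(hist_global.shape, hist_patches.shape, vec_combined.shape)
# (48,) (192,) (1344,)
```

With a 6x6x32 feature map, the histogram adds only 192 values to a 1152-dimensional CNN vector, so the new fully connected layer grows by a small fraction.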
The histogram and the final feature vector from the CNN pass are concatenated as described in Algorithm 1.

In later trials, we move on to more elaborate histogram feature vector configurations instead of using only the global histogram. We extract histogram feature vectors of different lengths from 9 patches of the input image. Suppose we are to extract a histogram feature vector of length 192. The number of bins for each patch would then be: 48 bins from the entire image, 24 × 4 bins from the left half, the right half, the top half and the bottom half, and 12 × 4 bins from the upper left corner, the upper right corner, the lower left corner and the lower right corner. The intention is to reflect the global color information as well as the local color distribution at a certain precision, in order to exploit the color details.

2) Training Methods: Although our CNN architecture is similar to Krizhevsky's network, we modify some parts of the training procedure in [5] to suit our needs.

When trained on the CIFAR-10 dataset, the first few CNNs are not trained on cropped images as described in [5]. By training the network on five image patches (top left, top right, lower left, lower right, and center) and their horizontal flips, Krizhevsky was able to enlarge the training dataset and generate more robust representations inside the network.

We do not use cropped CIFAR images in the initial trials for the following reasons: (1) Our intention in this work is to evaluate the effectiveness of directly feeding the classification layer with a hand-crafted histogram instead of relying on the CNN to exploit color information. The comparison is more straightforward when we use the entire image instead of image patches. (2) Cropping data may not be the best idea in certain applications. In the Samsung Mobile Image Dataset, for example, the classifier needs to distinguish the human upper body from the entire human body. Training on certain patches may introduce confusion. (3) Performance.
Training on the entire image instead of image patches reduces training time. We report results on cropped images in later experiments, as shown in Table III.

Another difference between our network and Krizhevsky's CNN with the reported 13% error rate is that we do not report results based on multiview tests. By adopting a multiview test instead of a single shot test, the 13% error CNN takes patches of images (and their horizontal flips) as input and aggregates the final output probabilities. As our intention is to improve the network architecture, we feel that the comparison should be done with single shot tests. Our work, however, can be easily generalized to multiview testing scenarios, and performance is expected to improve accordingly.

We use mini-batches of 128 examples, a momentum of 0.9 and a weight decay of 0.004. All networks are initialized with a learning rate of 0.001. The learning rates are manually adjusted (lowered by a factor
