Transfer Learning for Latin and Chinese Characters with Deep Neural Networks

Dan C. Cireşan, Ueli Meier, Jürgen Schmidhuber
IDSIA, USI-SUPSI, Manno, Switzerland, 6928
Email: dan@idsia.ch, ueli@idsia.ch, juergen@idsia.ch

Abstract—We analyze transfer learning with Deep Neural Networks (DNN) on various character recognition tasks. DNN trained on digits are perfectly capable of recognizing uppercase letters with minimal retraining. They are on par with DNN fully trained on uppercase letters, but train much faster. DNN trained on Chinese characters easily recognize uppercase Latin letters. Learning Chinese characters is accelerated by first pretraining a DNN on a small subset of all classes and then continuing to train on all classes. Furthermore, pretrained nets consistently outperform randomly initialized nets on new tasks with few labeled data.

I. INTRODUCTION

Knowing how to drive a car helps to learn more quickly how to drive a truck. Learning French is easier if you already know a Latin language. Learning the second language is easier than the first. Mathematics prepares students to study physics. Learning to get along with one's siblings may prepare one for getting along with others. Chess playing experience might make one a better strategic political or business thinker.

Such musings motivate our investigation of transfer learning, where new tasks and concepts are learned more quickly and accurately by exploiting past experience. In its most general form, transfer learning occurs when learning in one context enhances (positive transfer) or undermines (negative transfer) a related performance in another context. Transfer is a key ingredient of human learning: humans are often able to generalize correctly from a single training example. Unlike most machine learners, however, humans are trained on many different learning problems over their lifetime.

Although there is no generally accepted definition of transfer learning, many have found that neural nets (NN) pretrained on one task can learn new, related tasks more quickly. For example, [1] investigates if learning the n-th thing is any easier than learning the first, and concludes from experiments that methods leveraging knowledge from similar learning tasks outperform models trained on a single task. Similarly, multitask learning [2] can improve generalization capabilities by learning several tasks in parallel while using a shared representation. From an NN point of view this is most easily implemented by a classification network with multiple outputs, one per class in the classification task. A similar paradigm is self-taught learning, or transfer learning from unlabeled data [3]. This approach is especially promising if labeled data sets for the various learning tasks are not available.

Here we focus on NN-based classifiers and experimentally investigate the effect of transfer learning on well-defined problems where enough labeled training data are available. In particular, we train deep NN (DNN) on uppercase letters and digits from NIST SD 19 [4], as well as on Chinese characters provided by the Institute of Automation of the Chinese Academy of Sciences (CASIA [5]). Training a classifier on the Latin alphabet (up to 52 classes) is a rather straightforward problem, whereas training a classifier on the GB1 subset of Chinese characters (3755 classes) already poses a major challenge for any NN-based classifier that takes raw pixel intensities as its input.
DNN consist of many layers. All but the last layer can be interpreted as a general purpose feature extractor that maps an input into a fixed dimensional feature vector. Usually all the weights are randomly initialized and trained by backpropagation [6], [7], sometimes after unsupervised layer-wise pretraining [8], [9], [10]. Here we investigate if weights of all but the last layer, trained on a given task, can be reused on a different task. Instead of randomly initializing a net, we start from an already trained feature extractor, transferring knowledge from an already learned task to a new task. For the new task only the last classification layer needs to be retrained, but, if desired, any layer of the feature extractor might also be fine-tuned. As we will show, this strategy is especially promising for classification problems with many output classes, where good weight initialization is of crucial importance.

In what follows we shortly describe the deep neural network architecture and give a detailed description of the various experiments we performed.

II. DEEP NEURAL NETWORK ARCHITECTURE

Our DNN [7] consists of a succession of convolutional and max-pooling layers. It is a hierarchical feature extractor that maps raw pixel intensities of the input image into a feature vector which is classified by a few (we generally use 2 or 3) fully connected layers. All adjustable parameters are jointly optimized, minimizing the misclassification error over the training set.
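The paper relies on the authors' own GPU implementation [7] and gives no reference code. Purely as an illustration of the structure just described (a convolutional/max-pooling feature extractor followed by a small fully connected classifier), here is a minimal PyTorch-style sketch; the framework, the tanh nonlinearity and all names are our assumptions, not the authors'. The concrete layer sizes used in the experiments are given in Tables I and IV below.

```python
import torch.nn as nn

class DNN(nn.Module):
    """Illustrative skeleton (names are ours, not the paper's):
    a conv + max-pooling feature extractor followed by a small
    fully connected classifier with one output unit per class."""
    def __init__(self, feature_layers, n_features, n_hidden, n_classes):
        super().__init__()
        self.features = nn.Sequential(*feature_layers)   # convolutional and max-pooling stages
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_features, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_classes))               # output layer, one unit per class

    def forward(self, x):
        # the softmax is left to the loss function (e.g. nn.CrossEntropyLoss)
        return self.classifier(self.features(x))
```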

A. Convolutional layer

Each convolutional layer performs a 2D convolution of its $M^{n-1}$ input maps with a filter of size $K_x \times K_y$. The resulting activations of the $M^n$ output maps are given by the sum of the $M^{n-1}$ convolutional responses which are passed through a nonlinear activation function:

$$Y_j^n = f\Big(\sum_i Y_i^{n-1} \ast W_{ij}\Big), \qquad (1)$$

where $n$ indicates the layer, $Y$ is a map of size $M_x \times M_y$, $W_{ij}$ is a filter of size $K_x \times K_y$ connecting input map $i$ with output map $j$, and $\ast$ is the valid 2D convolution. That is, for an input map $Y^{n-1}$ of size $M_x^{n-1} \times M_y^{n-1}$ and a filter $W$ of size $K_x \times K_y$, the output map $Y^n$ is of size $M_x^n = M_x^{n-1} - K_x + 1$, $M_y^n = M_y^{n-1} - K_y + 1$ (e.g., a 29x29 input map convolved with a 2x2 filter yields a 28x28 output map, cf. Table I).

B. Max-pooling layer

The biggest architectural difference between our DNN and the CNN of [11] is the use of max-pooling layers [12], [13], [14] instead of sub-sampling layers. The output of a max-pooling layer is given by the maximum activation over non-overlapping rectangular regions of size $K_x \times K_y$. Max-pooling creates slight position invariance over larger local regions and down-samples the input image by a factor of $K_x$ and $K_y$ along each direction. In the implementation of [15] such layers are missing, and instead of performing a pooling or averaging operation, nearby pixels are simply skipped prior to convolution.

C. Classification layer

Kernel sizes of convolutional filters and max-pooling rectangles are chosen such that either the output maps of the last convolutional layer are down-sampled to 1 pixel per map, or a fully connected layer combines the outputs of the last convolutional layer into a 1D feature vector. The last layer is always a fully connected layer with one output unit per class in the recognition task. We use a softmax activation function for the last layer such that each neuron's output activation can be interpreted as the probability of a particular input image belonging to that class.

D. Training procedure

During training, a given dataset is continually deformed prior to each epoch of an online learning algorithm. Deformations are stochastic and applied to each image during training, using random but bounded values for translation, rotation and scaling. These values are drawn from a uniform distribution in a specified range, i.e. ±10% of the image size for translation, 0.9–1.1 for scaling and ±5° for rotation. The final image is obtained using bilinear interpolation of the distorted input image (see the sketch below). These distortions allow us to train DNN with many free parameters without overfitting and greatly improve generalization performance. All DNN are trained using on-line gradient descent with an annealed learning rate. Training stops when either the validation error becomes 0, the learning rate reaches its predefined minimum, or there is no improvement on the validation set for 50 consecutive epochs. The undistorted, original training set is used as the validation set.
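The paper does not provide code for these deformations. As a rough illustration of one such random affine distortion (translation up to ±10% of the image size, scaling in 0.9–1.1, rotation up to ±5°, bilinear interpolation), here is a NumPy/SciPy sketch; the function name, the use of scipy.ndimage and the exact parameterization are our assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform

def random_distortion(img, rng, max_trans=0.10, scale_range=(0.9, 1.1), max_rot_deg=5.0):
    """Apply one random translation/rotation/scaling to a 2D image,
    with bilinear interpolation (order=1), in the spirit of Sec. II-D."""
    h, w = img.shape
    tx = rng.uniform(-max_trans, max_trans) * w            # translation, fraction of image size
    ty = rng.uniform(-max_trans, max_trans) * h
    s = rng.uniform(*scale_range)                          # scaling factor
    a = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg)) # rotation angle
    # affine_transform applies the output-to-input mapping x_in = M @ x_out + offset;
    # since all parameter ranges are symmetric, drawing M directly still produces
    # random distortions within the stated bounds.
    M = s * np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a),  np.cos(a)]])
    centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
    offset = centre - M @ centre - np.array([ty, tx])
    return affine_transform(img, M, offset=offset, order=1, mode='constant')

# A fresh deformation is drawn for every image before each training epoch,
# while the undistorted originals serve as the validation set.
rng = np.random.default_rng(0)
```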
III. TRANSFER LEARNING

We start by fully training a net on a given task. After training stops, we keep the net with the smallest error on the validation dataset and change the number of neurons in the output layer to match the number of classes in the new classification task (e.g. if we train on digits before transferring to letters, the output layer grows from 10 to 26 neurons). The output layer weights are reinitialized randomly; the weights of the remaining layers are not modified.

The net pretrained on the source task is then retrained on the destination task. We check performance by fixing all but the last n layers and retraining only the last n layers. To see how training additional layers influences performance, we begin by training only the last layer, then the last two, etc., until all layers are trained. Since max-pooling layers do not have weights, they are neither trained nor retrained (their fixed processing power resides in the maximum operator). A sketch of this retraining scheme is given below.

One way of assessing the performance of transfer learning is to compare error rates of nets with pretrained weights to those of randomly initialized nets. For all experiments we list recognition error rates of pretrained and randomly initialized nets whose n top layers were trained.
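As an illustration of this procedure (not the authors' code), the following PyTorch-style sketch replaces the output layer of a pretrained net with a freshly initialized one sized for the destination task and freezes every parameterized layer below the last n trainable ones. It assumes a model built like the skeleton in Section II; the helper name and layer-counting convention are ours.

```python
import torch.nn as nn

def prepare_transfer(model, n_new_classes, n_retrained):
    """Reuse a pretrained feature extractor on a destination task:
    resize and reinitialize the output layer, then keep only the last
    `n_retrained` parameterized layers trainable (max-pooling layers
    have no weights, so they are never counted)."""
    # 1) replace the output layer (e.g. 10 digit outputs -> 26 letter outputs)
    old_out = model.classifier[-1]
    model.classifier[-1] = nn.Linear(old_out.in_features, n_new_classes)

    # 2) freeze all parameterized layers except the last n_retrained ones
    param_layers = [m for m in model.modules()
                    if isinstance(m, (nn.Conv2d, nn.Linear))]
    for layer in param_layers[:-n_retrained]:
        for p in layer.parameters():
            p.requires_grad = False
    return model
```

Retraining then proceeds exactly as ordinary training, except that the optimizer only updates the parameters left unfrozen.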

IV. EXPERIMENTS

All experiments are performed on a computer with an i7-950 (3.33 GHz), 16 GB RAM and 4 x GTX 580. We use the GPU implementation of a DNN from [7]. This allows us to perform all experiments with big and deep DNN on huge datasets within several days.

We experiment with Latin characters [4] and Chinese characters [5]. We test transfer learning within each dataset, but also from the Latin alphabet to Chinese characters. With so many different tasks, i.e. digits, lowercase letters, uppercase letters, letters (case insensitive), letters (case sensitive) and Chinese characters, one can try transfer learning in many ways. Instead of trying them all, we select the hardest and most interesting ones. Here we consider a transfer learning problem hard if the number of classes or the complexity of the symbols increases. We perform transfer learning from digits to uppercase letters, from Chinese characters to uppercase Latin letters, and from uppercase Latin letters to Chinese characters.

In total there are 344307 digits for training and 58646 for testing; 69522 uppercase letters for training and 11941 for testing. For the Chinese character classification task there are 3755 different classes, which poses a major challenge for any supervised classifier, mainly because the dataset is huge (more than one million samples, equaling 2.5 GB of data). We therefore evaluate the DNN on 1000 instead of all 3755 classes, resulting in 239121 characters for training and 59660 for testing. This is sufficient for a proof of concept that transfer learning between different datasets works. We also investigate the effect of pretraining a DNN on subsets of the classes, i.e. we use subsets of 10 and 100 out of the 1000 classes to pretrain a DNN.

A. Latin characters: from digits to uppercase letters

For Latin characters the simplest symbols are the digits. Letters are generally more complex; there are also more classes, rendering the letter task even more difficult. We choose uppercase letters as the destination task because the data has high quality (few mislabeled images) and less confusion between similar classes than lowercase letters. Since classification accuracy is not degraded by labeling errors and confused classes, results are more easily compared.

All characters from NIST SD 19 are scaled to fit a 20x20 pixel bounding box which is then placed in the middle of a 29x29 pixel image. The empty border around the actual character allows moderate distortions without falling outside the 29x29 box. As in [16], we use rotation of max. ±15°, scaling of max. ±15%, and translation of max. ±15%. For elastic distortions we use a Gaussian kernel with σ = 6 and an amplitude of 36 (see [15] for an explanation of these parameters).

In our previous work [16] on NIST SD 19 data we used relatively small and shallow nets. That was fine to train many nets and build a committee. Here the focus is on obtaining good results as quickly as possible, hence we only train one net, although we use a big and deep DNN this time. Its architecture is detailed in Table I. Filter sizes for convolutional and max-pooling layers are chosen as small as possible (2 or 3) to get a deep net. The three stages of convolution and max-pooling layers (the feature extractor) are followed by a classifier formed by one fully connected layer with 200 neurons and the output layer. The learning rate starts at 0.001 and is annealed by a factor of 0.993 after each epoch; a sketch of this setup follows Table I.

TABLE I
8-LAYER DNN ARCHITECTURE USED FOR NIST SD 19.

Layer  Type             # maps & neurons             kernel
0      input            1 map of 29x29 neurons
1      convolutional    50 maps of 28x28 neurons     2x2
2      max pooling      50 maps of 14x14 neurons     2x2
3      convolutional    100 maps of 12x12 neurons    3x3
4      max pooling      100 maps of 6x6 neurons      2x2
5      convolutional    150 maps of 4x4 neurons      3x3
6      max pooling      150 maps of 2x2 neurons      2x2
7      fully connected  200 neurons                  1x1
8      fully connected  10 or 26 neurons             1x1
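For concreteness, here is the Table I net written out in PyTorch, together with the learning-rate schedule mentioned above (0.001, annealed by a factor of 0.993 per epoch). The framework, the tanh nonlinearity and all variable names are our choices; the paper uses its own GPU implementation [7]. The per-layer comments show how the map sizes follow from the valid-convolution and pooling rules of Section II.

```python
import torch.nn as nn
import torch.optim as optim

def table1_net(n_classes=26):                          # 10 for digits, 26 for uppercase letters
    return nn.Sequential(                              # input: 1 map of 29x29
        nn.Conv2d(1, 50, kernel_size=2), nn.Tanh(),    # layer 1: 50 maps of 28x28  (29-2+1)
        nn.MaxPool2d(2),                               # layer 2: 50 maps of 14x14
        nn.Conv2d(50, 100, kernel_size=3), nn.Tanh(),  # layer 3: 100 maps of 12x12 (14-3+1)
        nn.MaxPool2d(2),                               # layer 4: 100 maps of 6x6
        nn.Conv2d(100, 150, kernel_size=3), nn.Tanh(), # layer 5: 150 maps of 4x4   (6-3+1)
        nn.MaxPool2d(2),                               # layer 6: 150 maps of 2x2
        nn.Flatten(),
        nn.Linear(150 * 2 * 2, 200), nn.Tanh(),        # layer 7: 200 neurons
        nn.Linear(200, n_classes),                     # layer 8: output layer (softmax via the loss)
    )

model = table1_net(26)
optimizer = optim.SGD(model.parameters(), lr=0.001)    # on-line gradient descent
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.993)
# after each training epoch: scheduler.step()  ->  lr <- 0.993 * lr
```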
A randomly initialized net fully trained on uppercase letters reaches a low error rate of 2.07% (Table II), and 0.32% if we consider the first two predictions. This indicates that 84.5% of the errors ((2.07 − 0.32)/2.07) are due to confusions between similar classes. The error slowly increases if the first two convolutional layers (1 and 3) are not trained. It is worth noting that with random convolutional filters in layers 1 and 3, very competitive results are obtained, an intriguing finding already noted and investigated elsewhere [17], [18], [19]. However, when no convolutional layer is trained the error spikes to almost twelve percent. Transferring the weights learned on the digit task to the uppercase letter task (second row in Table II) yields good results even if only the last two fully connected layers are retrained.

TABLE II
TEST ERRORS [%] FOR NETS PRETRAINED ON DIGITS AND TRANSFERRED TO UPPERCASE LETTERS.

initialization   First trained layer
                 1       3       5       7       8
random           2.07    2.47    2.71    11.74   38.44
DIGITS           2.09    2.11    2.23    2.36    4.13

In addition, learning from pretrained nets is very fast compared to learning from randomly initialized nets. In Figure 1, test error rates [%] on uppercase letters are shown as a function of training time [s], for both randomly initialized (solid, blue) and pretrained (dotted, red) nets trained from the fifth (left) and seventh (right) layer onwards. The pretrained nets start from a much lower error rate, and if only the last two fully connected layers are (re-)trained (right), the randomly initialized net never manages to match the pretrained net even after 10000 s of training. If the last convolutional layer is also (re-)trained (left), the pretrained net is much better after 1000 s of training, but as training proceeds the difference between the two nets becomes smaller.

Fig. 1. Test error rates [%] on uppercase letters as a function of training time [s] for randomly initialized nets (solid, blue) and nets pretrained on digits (dotted, red). Both nets are (re-)trained on uppercase letters starting from the fifth (left) and seventh (right) layer, respectively.

We also check the performance of a fully trained small net that only has the classification layers of the DNN (i.e. the input layer followed directly by the fully connected layers 7 and 8). When both fully connected layers are present (the net has one hidden layer with 200 neurons) the error is 4.03%, much higher than the corresponding 2.36% of the pretrained net. If only the output layer is used (i.e. a net with the input layer followed by the output layer), the error goes up to 42.96%, which is much higher than 4.13%, and even higher than the random net's error. The take-home message is that convolutional layers, even when not retrained, are essential for a low error rate on the destination task.
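The two small fully connected baselines of the comparison above amount to one- and zero-hidden-layer MLPs on raw pixels. A minimal sketch of our reconstruction of them (framework and names are our assumptions):

```python
import torch.nn as nn

# raw 29x29 pixels -> 200 hidden units -> 26 outputs (the 4.03% baseline)
mlp_baseline = nn.Sequential(nn.Flatten(),
                             nn.Linear(29 * 29, 200), nn.Tanh(),
                             nn.Linear(200, 26))

# raw 29x29 pixels -> 26 outputs directly (the 42.96% baseline)
output_only_baseline = nn.Sequential(nn.Flatten(),
                                     nn.Linear(29 * 29, 26))
```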

B. Learning uppercase letters from few samples per class

For classification tasks with a few thousand samples per class, the benefit of (unsupervised/supervised) pretraining is not easy to demonstrate. After sufficient training, a randomly initialized net will eventually become as good as the pretrained net. The benefits of unsupervised pretraining are most evident when the training set has only few labeled data samples. We therefore investigate how a randomly initialized net compares to a net pretrained on digits, (re-)training both nets on 10, 50, 100, 500 and 1000 samples per class, respectively (see the sketch after Table III). In Table III the error rates on the original uppercase test set (~2500 samples per class) are listed for all experiments. As expected, the difference between the pretrained and randomly initialized net becomes bigger as fewer samples are used.

Using only 10 samples per class, the random net reaches an impressive error rate of 17.85% with and 34.60% without distortions, indicating that the distortions capture the writing style variations of uppercase letters very well. Nevertheless, retraining the pretrained net on only 10 samples per class results in a much lower error rate of 8.98% and 12.51% with and without distortions, still better than the error of the random net trained on distorted samples. The effects of distortions for the pretrained net are not as severe as for random nets. This indicates that the information extracted from the digit task, as encoded in the network weights, has been successfully transferred to the uppercase letter task by means of a much better network initialization.

TABLE III
TEST ERRORS [%] FOR NETS PRETRAINED ON DIGITS AND TRANSFERRED TO UPPERCASE LETTERS. THE EXPERIMENTS ARE PERFORMED ON DISTORTED (DIST.) AND UNDISTORTED TRAINING SAMPLES. (Dashes mark values not recoverable from the transcription.)

initialization   Number of training samples per class
                 10      50      100     500     1000
random dist.     17.85   -       -       -       2.38
DIGITS dist.     8.98    -       -       -       2.52
random           34.60   15.17   10.49   5.15    4.42
DIGITS           12.51   6.54    5.03    3.30    3.12
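As an illustration of how such reduced training sets can be built (10, 50, 100, 500 or 1000 samples per class), a small NumPy helper; the function name and array layout are assumptions, not from the paper.

```python
import numpy as np

def subsample_per_class(images, labels, n_per_class, seed=0):
    """Select n_per_class random training samples for every class,
    as in the experiments of Table III (helper is ours)."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)                    # all samples of class c
        keep.extend(rng.choice(idx, size=n_per_class, replace=False))
    keep = np.asarray(keep)
    return images[keep], labels[keep]

# e.g. a 10-samples-per-class uppercase training set:
# x_small, y_small = subsample_per_class(x_train, y_train, n_per_class=10)
```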
C. Chinese characters to uppercase Latin letters

We continue the experiments with a more difficult problem, namely, transfer learning between two completely different datasets. From previous experiments [20] we know that 48x48 pixels are needed to represent the intricate details of the Chinese characters. We use a similar but smaller net than the one used for the competition, because we need to run many more experiments. As for the net used on Latin characters, we choose small filters to obtain a deep net (Table IV). Because the input size of a DNN is fixed, uppercase letters are scaled from 29x29 to 48x48 pixels.

TABLE IV
10-LAYER DNN ARCHITECTURE USED FOR CHINESE CHARACTERS.

Layer  Type             # maps & neurons                    kernel
0      input            1 map of 48x48 neurons
1      convolutional    100 maps of 46x46 neurons           3x3
2      max pooling      100 maps of 23x23 neurons           2x2
3      convolutional    150 maps of 22x22 neurons           2x2
4      max pooling      150 maps of 11x11 neurons           2x2
5      convolutional    200 maps of 10x10 neurons           2x2
6      max pooling      200 maps of 5x5 neurons             2x2
7      convolutional    250 maps of 4x4 neurons             2x2
8      max pooling      250 maps of 2x2 neurons             2x2
9      fully connected  500 neurons                         1x1
10     fully connected  10 or 26 or 100 or 1000 neurons     1x1

The net fully trained on uppercase letters has a slightly lower error rate (1.89%, Table V) than the one used in Subsection IV-A. This can be attributed to the deeper and bigger net. When the first layers are not trained the error increases slightly, and it doubles when only one convolutional layer (the seventh) is trained. If only the fully connected layers are trained, the error increases dramatically. We continue the experiments with nets pretrained on Chinese characters. Transfer learning works very well for this problem; the errors are lower than those of training random nets, even if the first three convolutional layers (1, 3 and 5) are kept fixed. It seems that filters trained on Chinese characters can be fully reused on Latin characters. This was expected, because although Chinese characters are more complex than Latin ones, they are written in the same way: as a sequence of strokes. Even if the first nine layers are not trained and only the output layer is trained, a surprisingly low 3.35% test error rate is obtained.

TABLE V
TEST ERRORS [%] FOR NETS PRETRAINED ON CHINESE CHARACTERS AND TRANSFERRED TO UPPERCASE LETTERS.

initialization   First trained layer
                 1       3       5       7       9       10
random           1.89    2.04    2.28    3.53    10.45   44.59
CHINESE 1000     1.79    1.89    1.91    1.88    2.22    3.35

D. Chinese characters: speeding up training

Training big and deep DNN i
