Exploring Deep Neural Networks For Regression Analysis


PESARO 2018: The Eighth International Conference on Performance, Safety and Robustness in Complex Systems and Applications

Florian Kästner, Benedikt Janßen, Frederik Kautz, Michael Hübner
Chair for Embedded Systems of Information Technology
Ruhr-University Bochum, Bochum, Germany
Email: {Florian.Kaestner, Benedikt.Janssen, Frederik.Kautz, Michael.Huebner}@rub.de

Abstract—Designing artificial neural networks is a challenging task due to the vast design space. In this paper, we present our exploration of different types and shapes of deep neural networks for a regression analysis task. The network types range from simple multi-layer perceptron networks to more complex convolutional and residual neural networks. Within the exploration, we analyzed the behavior of the different network shapes when processing measurement data characteristic of mass spectrometers. Mass spectrometers are used to determine single substances within gaseous mixtures. By applying deep neural networks for the measurement data processing, the behavior of the measurement system can be approximated indirectly through the learning process. In addition, we evaluate the usage of reinforcement learning to design the neural network's architecture.

Keywords–ANN; MLP; CNN; Reinforcement Learning.

I. INTRODUCTION

In traditional machine learning approaches, manually designed features have to be provided for the input data. Extracting reasonable features is crucial for a successful usage of those algorithms. Moreover, the process of feature extraction is time consuming and requires expert knowledge of the specific application. In contrast, Artificial Neural Networks (ANNs) are capable of automatically extracting features from given input data, superseding manually designed features. Furthermore, the increasing depth of Deep Neural Networks (DNNs) allows extracting very complex and abstract features out of several different representations with respect to prior levels. All these features are then used to form a proper output.

However, the design of ANNs is a challenging task due to the degrees of freedom of their architecture, such as the depth of the ANN, the width and type of the layers, as well as the data flow paths. In addition, the training method has an impact on the ANN's performance and has several degrees of freedom itself, for instance the initial parameter values, the batch size, the learning rate, and the optimization algorithm.

In this paper, we present the results of our exploration of end-to-end trained ANNs for the processing of mass spectra measured with a miniaturized mass spectrometer [1]. Mass spectra allow the analysis of gaseous mixtures to determine their constituents. Within the exploration, we used a gaseous mixture with the constituents listed in Table I. In order to exclude any unknown effects of real measurement systems, we created a Python module to generate noisy mass spectra of the given mixture, based on the constituents' characteristic mass-to-charge ratio peaks. The noise is evenly distributed and is explained in Figure 1 a); Figure 1 b) shows the resulting spectra. The generated mass spectra are normalized so that the sum of the constituents adds up to one. The noise within the generated mass spectra as well as the spectra's minimal, maximal, and mean values are depicted in Figure 1. In summary, our goal is to extract the constituents' concentrations from the generated noisy mass spectra without assuming any pre-conditional knowledge. This type of problem definition for machine learning is called multi-output regression.

Figure 1. Mass spectra: the area of each constituent peak varies by 1 %, the position by 0.5 m/z, the variance by 5 %, and each measurement value by 10 %.

TABLE I. DATA SET CONCENTRATION RANGE OF CONSTITUENTS

Constituent    Concentration range
H2O            0.0 % - 3.0 %
CO2            0.2 % - 5.6 %
N2             65.0 % - 80.0 %
O2             15.0 % - 21.0 %
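The paper does not reproduce the generator itself; the following is a minimal sketch of how such a module could look, assuming Gaussian-shaped peaks, placeholder peak positions, a nominal peak width, and the jitter levels named in the caption of Figure 1. None of these specifics are taken from the paper beyond the Table I ranges.

```python
import numpy as np

# Concentration ranges from Table I, as fractions of one.
RANGES = {"H2O": (0.000, 0.030), "CO2": (0.002, 0.056),
          "N2": (0.650, 0.800), "O2": (0.150, 0.210)}
# Characteristic mass-to-charge peaks; illustrative positions, not from the paper.
PEAKS = {"H2O": 18.0, "CO2": 44.0, "N2": 28.0, "O2": 32.0}

MZ_AXIS = np.linspace(0.0, 100.0, 990)  # 990 values covering 0 to 100 m/z


def generate_spectrum(rng=np.random.default_rng()):
    """Return one noisy synthetic mass spectrum and its concentration label."""
    # Draw concentrations within the Table I ranges and normalize them to sum to one.
    conc = {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}
    total = sum(conc.values())
    conc = {k: v / total for k, v in conc.items()}

    spectrum = np.zeros_like(MZ_AXIS)
    for name, c in conc.items():
        area = c * (1.0 + rng.uniform(-0.01, 0.01))      # peak area varies by 1 %
        pos = PEAKS[name] + rng.uniform(-0.5, 0.5)       # position varies by 0.5 m/z
        sigma = 0.3 * (1.0 + rng.uniform(-0.05, 0.05))   # variance varies by 5 %
        spectrum += area * np.exp(-0.5 * ((MZ_AXIS - pos) / sigma) ** 2)

    # Each measurement value additionally varies by 10 % (evenly distributed noise).
    spectrum *= 1.0 + rng.uniform(-0.10, 0.10, size=spectrum.shape)

    label = np.array([conc["H2O"], conc["CO2"], conc["N2"], conc["O2"]])
    return spectrum.astype(np.float32), label.astype(np.float32)
```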
Within the scope of the exploration, we analyzed the structure and hyperparameters of different kinds of ANNs suitable for this purpose, implemented with TensorFlow [2] and without manually extracted features. Due to the time-invariant application, we focus on feedforward ANNs, starting with traditional Multi-Layer Perceptrons (MLPs). MLPs consist of at least three fully-connected layers, in which every neuron is connected to every neuron of the previous layer. In order to use the spatial information in the signal and to lower the number of parameters, we included Convolutional Neural Networks (CNNs). CNNs apply filter kernels to the input data that compute the dot product and thus combine spatial information [3]. In addition, CNNs apply pooling layers that reduce the data dimension by down-sampling.

For every layer we apply the Rectified Linear Unit (ReLU) activation function, due to its property of smoothing the issue of vanishing or exploding gradients, as shown by Glorot et al. [4]. Another important property of this activation function is sparse activation, meaning that a certain number of neurons within a network will never fire. Although this feature is desirable when designing ANNs, our exploration results regarding the network size could be influenced by the varying number of dead neurons. However, in order not to increase the exploration complexity further, we assume that this property does not influence the exploration if the weight initialization is uniformly distributed.

To speed up training, smooth the issue of exploding gradients, and avoid over-fitting, we use batch normalization with trainable scale and shift parameters. Within the scope of this work, we relinquish further methods for avoiding over-fitting, such as dropout.
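The paper gives no code for this layer pattern; the following is a minimal sketch, assuming the Keras layer API of TensorFlow, of a fully-connected layer followed by batch normalization with trainable scale and shift parameters and a ReLU activation, as applied throughout the explored networks.

```python
import tensorflow as tf

def dense_bn_relu(units):
    """Fully-connected layer, then batch normalization with trainable
    scale/shift (gamma/beta), then the ReLU activation."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units, use_bias=False,
                              kernel_initializer="glorot_uniform"),  # Xavier init [14]
        tf.keras.layers.BatchNormalization(center=True, scale=True),
        tf.keras.layers.ReLU(),
    ])
```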

Although applying batch normalization and the ReLU activation function relieves the issue of vanishing or exploding gradients, the problem still exists and becomes more crucial the deeper the network is. Therefore, we extend our exploration with Residual neural Networks (ResNets) and highway networks. The distinctive property of ResNets are shortcuts. Shortcuts implement the possibility to route the data flow around layers and afterwards recombine the processed and bypassed data [5]. In particular, the bypassing is done by an identity shortcut connection. Thus, the original residual block simply adds the outcome of the convolution layers to the original input. If the dimensions do not match after applying the convolution, due to the stride, padding, and feature map parametrization, the input can be compressed, preferably with non-trainable methods, or extended through padding. Shortcut methodologies are subject to current research. Due to the shortcut connections, the influence of vanishing gradients is far lower than with traditional ANNs. Moreover, the backpropagation can be applied much more efficiently and simply. Highway networks follow a similar approach; however, they employ a gating function, optimized within the training phase, to filter the data passed through the shortcut [6].

The related work within this field of research is presented in Section II. The method of our exploration is twofold: an empirical exploration, described in Section III, and an automated exploration of the MLP networks using Reinforcement Learning (RL), whose results can be found in Section IV. A discussion and summary of the current results, as well as an outlook on future work, can be found in Section V.

II. RELATED WORK

An early work in the direction of our approach was published by Massicotte et al. [7], who investigated the calibration of high-pressure measurement systems with MLP networks. The authors compared an ANN-based method to a spline-based method and state that the ANN-based method achieves better results with lower-quality reference data, and thus enables a reduction of the time necessary for calibration data acquisition. Moreover, the spline-based method requires the parallel measurement of the temperature to consider temperature effects, which is not the case for the ANN-based method.

Several different classification methods for analyzing mass spectra data were analyzed by the authors of [8]. Those classification methods include linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classifiers, classification trees, Support Vector Machines (SVMs), and random forests. Wu et al. found that random forests outperform the other classification methods in overall misclassification as well as in the stable assessment of classification errors.

The authors of [9] used an unsupervised method for extracting features from mass spectra data, followed by classification realized with SVMs. Their methodology results in more than 95 % correctly classified samples.

The work described in [10] uses an RL-based algorithm (Q-learning) capable of generating high-performance standard CNNs. The created CNNs outperform existing networks with similar layer types and are competitive with state-of-the-art networks making use of more complex layer types.

Referring to the screening methodology in the medical and genetic fields, the authors of [11] developed an approach which randomly generates a high number of networks with different parameter initializations and architectures. Configurations that show good results are used for further training. Regarding the steady growth of computational power, Pinto et al. [11] claim that this approach can speed up development success and the understanding of biological vision.

With the use of Cartesian Genetic Programming (CGP), Suganuma et al. [12] automatically construct a CNN for an image classification task on the CIFAR-10 data set. Within the process, the CNN structure is represented by the CGP encoding method and is further optimized to reach the best possible results. With this approach, the authors claim to automatically find network architectures which are comparable with common state-of-the-art CNNs.

III. EMPIRICAL EXPLORATION

Within this Section, we demonstrate our approach to exploring the shape of four different types of ANNs. The goal is to investigate which structure suits the given qualitative analysis best. This task represents a complex empirical optimization due to the high number of adjustable hyperparameters, including, among others, the batch size, the optimizer, and the learning rate. Thus, the empirical approach can be seen as a starting point for further explorations, preferably using optimization methods such as the RL approach described in Section IV.

The input vector consists of 990 floating point values representing the mass spectrum from 0 to 100 m/z. The last layer of every ANN within this work is built as a fully-connected layer with four neurons representing the constituents' concentrations, where no activation function is applied. Due to the regression task, we define the cost function as the Mean Squared Error (MSE) between the labeled constituent concentrations and the output of the last layer. The corresponding loss function is defined in Equation (1), with the true value y_i and the prediction ŷ_i for each of the four constituents.

MSE = \frac{1}{4} \sum_{i=1}^{4} (y_i - \hat{y}_i)^2    (1)

We apply batch normalization with trainable scale and shift parameters for all networks at certain points in the hierarchical structure. For training, we use a fixed batch size of 25 randomly picked samples and Adaptive Moment Estimation (Adam) as the optimization method. The Adam optimizer was chosen due to its adaptive learning rate for every parameter and its good results when dealing with sparse gradients; for further information, we refer to [13]. Those sparse gradients can result from the properties of the ReLU activation function. As weight initialization, we choose the Xavier method [14]. To avoid overfitting, we apply early stopping. The break condition is an increasing deviation of the predictions from the labels of the last three compared to the previous three evaluations on the verification dataset, after the network has been trained with all samples in the training dataset consisting of 100000 mass spectra. After the training, we verify the accuracy of the networks based on the deviation between the output of the last layer and the corresponding labels, using a new dataset of 10000 mass spectra.
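As a concrete illustration of this setup (a sketch using the Keras API of TensorFlow, not the authors' code), the snippet below wires the Equation (1) loss, Xavier initialization, the Adam optimizer, the batch size of 25, and an early-stopping callback together; the hidden-layer widths in `build_network` are placeholders for any of the architectures explored in this section.

```python
import tensorflow as tf

def build_network(hidden_units=(500, 250, 99)):
    """Hidden layers (Dense + batch norm + ReLU) plus the linear four-neuron output."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(990,))])   # 990 spectrum values
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, kernel_initializer="glorot_uniform"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.ReLU())
    model.add(tf.keras.layers.Dense(4))                            # no activation (regression)
    return model

model = build_network()
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")   # Equation (1)

# Early stopping on the verification split, roughly in the spirit of the break
# condition above (stop once the validation loss no longer improves).
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, batch_size=25, epochs=50,
#           validation_data=(x_val, y_val), callbacks=[early_stop])
```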

A. Multi-Layer Perceptron Network

For the exploration of MLP networks, we created networks with different depths, starting from three up to 13 layers. The choice of the layer sizes is based on NumPy's logspace() function with base 10.0, generating a list of layer sizes defined by the input layer size and the output layer size. The input layer size has been set within the range of 1.1 to 0.05 times the length of the input vector. We assume that the resulting funnel shape of the networks is a suitable approximation to follow the feature-extraction policy, in order to raise the depth of the network while keeping a sufficient number of neurons within the layers. This implies that we assume the data to be compressed, and that the function of the network depends more on the depth than on the width of the layers. We further save a significant number of parameters when downsizing the width of the fully-connected layers.
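The paper only names logspace() as the sizing rule; one plausible reading (an assumption, not the authors' exact code) is sketched below: the layer widths are spaced logarithmically, base 10, between a first-layer width derived from the 990-value input and the four output neurons.

```python
import numpy as np

def funnel_layer_sizes(depth, input_len=990, first_scale=1.1, out_size=4):
    """Logarithmically spaced (base 10) layer widths, from a wide first layer
    down to the four output neurons, yielding the funnel shape described above."""
    first = first_scale * input_len   # first_scale was explored between 1.1 and 0.05
    sizes = np.logspace(np.log10(first), np.log10(out_size), num=depth, base=10.0)
    return [int(round(s)) for s in sizes]

print(funnel_layer_sizes(depth=4))   # e.g. [1089, 168, 26, 4] (hypothetical sizing)
```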
The observed deviations on the test dataset are depicted in Figure 2. With an increasing depth of the network, the deviation tends to become larger. The best overall result of this exploration is achieved with a depth of four fully-connected layers, with a mean deviation of 1.5 %. The result of the exploration matches our expectations: the vanishing gradient problem prevents the network from performing better with increasing network depth. This represents a well-known problem in the deep learning domain. In Section III-C, we tackle this problem with the introduction of residual blocks, allowing us to build deeper networks.

Figure 2. Deviations for different MLP configurations.

B. Convolutional Neural Network

CNNs are a prominent type of ANN in the computer vision domain, especially in object detection, segmentation, and tracking applications [15]. They are mainly responsible for today's popularity of ANNs due to their success in this field, starting with ImageNet in 2011. One reason for their success is the property of the corresponding convolutional layer to combine spatial information by applying 2-dimensional filter kernels to the input, followed by pooling to reduce the dimensions along the depth of the network. We also want to take advantage of these features, assuming the existence of spatial relationships in the spectra.

Instead of reshaping the input and performing a 2-dimensional convolution, we apply a 1-dimensional convolution with different kernel sizes and filter depths. The output of this convolution is a 2-dimensional feature map with a corresponding height and depth. We could further apply a 2-dimensional convolution; however, we still assume that the spatial relationships only exist along the first dimension. Therefore, we decide to use a 2-dimensional 1x1 convolution to reduce the shape back to a 1-dimensional outcome. To also reduce the shape along the first dimension, we can apply the 1-dimensional convolution with a specific stride. The feature map shape of the first dimension is then given by the quotient of the first dimension of the input shape and the stride. These two convolutions represent the basic block of our CNN. The basic principle is visualized in Figure 3.

Figure 3. Visualization of the basic CNN principle.
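The paper gives no implementation of this basic block; the following is a minimal sketch, assuming the Keras layer API, with the 1-dimensional convolution followed by a kernel-size-1 convolution that collapses the feature maps back to a single channel (the paper's 1x1 convolution). The placement of batch normalization and ReLU inside the block is an assumption.

```python
import tensorflow as tf

def basic_block(kernel_size, num_kernels, stride=1):
    """Basic CNN block: 1-D convolution along the m/z axis, then a 1x1
    convolution that reduces the feature maps back to a single channel."""
    return tf.keras.Sequential([
        # Input: (batch, length, 1) -> (batch, length / stride, num_kernels)
        tf.keras.layers.Conv1D(num_kernels, kernel_size, strides=stride,
                               padding="same", use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        # 1x1 convolution: (batch, length / stride, num_kernels) -> (batch, length / stride, 1)
        tf.keras.layers.Conv1D(1, 1, padding="same"),
        tf.keras.layers.ReLU(),
    ])
```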

We design the CNN by simply stacking those basic blocks and adjusting their hyperparameters, which are the stride, the number of feature maps, and the kernel size. The output of the last basic block is fed to a fully-connected layer with a fixed number of 99 neurons, followed by the output layer. The exploration is done with five CNNs ranging from two to six stacked basic blocks. We reduce the shape of the first dimension by using a stride of five in the first basic block. We further use a stride of two in the last basic block, except for the first CNN, where just two basic blocks are involved. The stride parameter for all other basic blocks is set to one. The kernel size and the number of kernels within a layer vary, starting with a wide kernel and a low number of kernels. Table II lists the chosen CNN configurations. While going deeper, we downscale the kernel size and increase the kernel depth. The result of this evaluation can be seen in Figure 4. Contrary to the MLP exploration, the accuracy does not drop rapidly as the network depth increases; instead, the deviation on the test dataset stays approximately the same. Therefore, we conclude that the influence of vanishing gradients is intensified in MLP networks. CNNs can be deeper, as the convolution requires fewer parameters and, due to the spatial connections, owns an aggravated forward- and backpropagation path compared with the MLP network. The overall accuracy is slightly lower compared to that of the MLP exploration. To design deeper networks, we add block-wise shortcuts to the MLP and the CNN, as introduced in the next Section.

TABLE II. CNN CONFIGURATIONS EXPLORED

Number of stacked basic blocks    List of layer configurations [(kernel size, number of kernels)] in hierarchical order
2                                 ...
3                                 ...
4                                 ...
5                                 ...
6                                 ... 80), (1,300), (3,500)]

Figure 4. Deviations for different CNN configurations.

C. Residual and Highway Network

Residual Networks (ResNets) are the state-of-the-art networks for various applications and are especially popular in image recognition tasks. He et al. [16] reformulated the layers to learn residual functions with respect to the input of the layer. This basic principle eases the learning due to the stepwise replacement of the product with the sum of the outputs of the layers in the backward path, and thereby rapidly reduces the problem of vanishing gradients. Thus, He et al. were able to design a ResNet with 152 stacked layers, which outperformed all previous plain networks. We adopt this principle to our needs by extending the basic convolution pair described in Section III-B with an identity mapping. The principle of the resulting residual block is shown in Figure 6. For simplicity, we forego reducing the shape of the first dimension. Thus, the dimensions of the input and the output of the block are the same. If this were not the case, the input would have to be downscaled or upscaled, preferably with no or only a low number of trainable parameters. To downscale the first dimension along the depth of the network at certain points, we apply the basic convolution pair without shortcuts and with a specific stride after a significant number of residual blocks. This could also be replaced with a pooling layer in the future.
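The paper does not provide code for this block; the sketch below, again assuming the Keras API, wraps the basic convolution pair with an identity shortcut, keeping the input and output dimensions equal so that the bypassed and processed data can simply be added.

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    """Basic convolution pair with an identity shortcut (dimensions preserved)."""

    def __init__(self, kernel_size, num_kernels, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv1D(num_kernels, kernel_size,
                                           padding="same", use_bias=False)
        self.bn = tf.keras.layers.BatchNormalization()
        self.reduce = tf.keras.layers.Conv1D(1, 1, padding="same")  # back to one channel

    def call(self, x, training=False):
        y = self.conv(x)                      # 1-D convolution (stride 1, shape preserved)
        y = self.bn(y, training=training)
        y = tf.nn.relu(y)
        y = self.reduce(y)                    # 1x1 convolution
        return tf.nn.relu(y + x)              # identity shortcut: add the bypassed input
```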
A similar principle is used with Highway Networks. The main difference is the use of a gating function for the identity mapping. However, this difference is only valid when considering the original ResNet and the original Highway Network; recent developments regarding different types of ResNets have blurred this difference [17]. In the original highway network, developed by Srivastava et al. [18], the output y of a basic highway block is defined as follows:

y = F(x, W_F) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))    (2)

where x represents the input of the layer and F represents the nonlinear activation function with the weight parameters W_F. T is defined as the transform gate, while the term 1 - T is denoted as the carry gate. These gating units can be seen as a learnable dataflow control unit through the network. We use this principle to continue our exploration. Therefore, we extend fully-connected layers with this gating unit. Similar to the ResNet approach, we keep the dimensions equal inside

Figure 5. Deviations for different ResNet and Highway configurations.
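As an illustration of Equation (2), not the authors' implementation, a fully-connected highway block can be sketched as follows; the sigmoid activation for the transform gate and the negative gate bias follow the original highway network formulation [18].

```python
import tensorflow as tf

class HighwayDense(tf.keras.layers.Layer):
    """Fully-connected highway block implementing Equation (2):
    y = F(x, W_F) * T(x, W_T) + x * (1 - T(x, W_T))."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        # F(x, W_F): the nonlinear transformation of the input.
        self.transform = tf.keras.layers.Dense(units, activation="relu")
        # T(x, W_T): the transform gate; biased negative so the block initially carries.
        self.gate = tf.keras.layers.Dense(
            units, activation="sigmoid",
            bias_initializer=tf.keras.initializers.Constant(-1.0))

    def call(self, x):
        # The input width must equal `units` so x can be carried through unchanged.
        t = self.gate(x)
        return self.transform(x) * t + x * (1.0 - t)   # carry gate = 1 - T
```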
