Exploring Transferability In Deep Neural Networks With Functional Data .


Exploring Transferability in Deep Neural Networks with Functional Data Analysis and Spatial Statistics

Richard McAllister
Gianforte School of Computing
Montana State University
Bozeman, MT, USA
richard.mcallister@msu.montana.edu

John Sheppard
Gianforte School of Computing
Montana State University
Bozeman, MT, USA
john.sheppard@montana.edu

Abstract: Recent advances in machine learning have brought with them considerable attention in applying such methods to complex prediction problems. However, in extremely large dataspaces, a single neural network covering that space may not be effective, and generating large numbers of deep neural networks is not feasible. In this paper, we analyze deep networks trained from stacked autoencoders in a spatio-temporal application area to determine the extent to which knowledge can be transferred to similar regions. Our analysis applies methods from functional data analysis and spatial statistics to identify such correlation. We apply this work in the context of numerical weather prediction in analyzing large-scale data from Hurricane Sandy. Results of our analysis indicate high likelihood that spatial correlation can be exploited if it can be identified prior to training.

I. INTRODUCTION

It has long been known that the black-box nature of neural networks introduces challenges to widespread adoption of this technology, especially in safety-critical domains. The DARPA Broad Agency Announcement (BAA) for Explainable Artificial Intelligence [1] solicited research proposals for techniques for creating artificial intelligence models such as artificial neural networks (ANN) that, upon training, would enable users to understand why such models make the decisions that they make. In particular, a major factor that hinders the adoption of ANNs in many domains is that it is very difficult to understand why they produce the answers that they produce [2].
What has been learned in a trained, functioning, reliable ANN remains opaque; therefore, the model is limited in the ways it can enhance the understanding of the governing processes of the system under examination. Thus one of the key focus areas in modern neural network research is in developing approaches to improve insight into what has been learned by these models, essentially opening the black box so that adopters can see inside.

The promise offered by the effectiveness of modern deep learning methods also suggests potential wide applicability in solving many of these critical problems. Unfortunately, the computational complexity of training such models, combined with the large data requirements, further limits adoption of this technology. This motivates research in transfer learning whereby trained models can be re-used as starting points in other problem areas, thus significantly reducing the computational burden in their training. Intuitively, efforts to apply transfer learning assume that it is possible for a model to have learned something fundamental to a more abstract universe that encompasses both the domain within which the model was trained and the domain within which it will be applied.

These two problem areas motivate the current work. For our approach, we focus on a single type of deep network and apply techniques from functional data analysis (FDA) and spatial statistics to develop insight into what the network has learned. We then apply that insight to select portions of the model to be transferred and test the effectiveness of the transferred knowledge in a new setting. More specifically, we defined a highly controlled environment in which we started by creating a single stacked autoencoder and initializing the weights randomly. Then for each area of interest (AOI) in the dataspace we cloned this single network and pre-trained the clones on the data from their respective AOIs in exactly the same fashion, effectively removing any stochasticity in the training process.
This allowed us to remove uncontrollable sources of variation and uncertainty across the entire dataspace and concentrate on how different areas of this dataspace affect training. We utilized this structure as the foundation for our experiments.

For this analysis, we chose a problem from meteorology as our test case. Weather modeling and prediction have been in the domain of deterministic methods for many years [3]. We assert that an opportunity exists for deep learning to supplement the state of the art in weather modeling and prediction, particularly by informing traditional computational models. To do this in a way that instills confidence in users of traditional models, we need to gain an understanding of what ANNs learn when they are trained on this data.

The main contributions coming from this work are as follows. First, we provide a highly structured, highly controlled approach to evaluate learnability and transferability in deep ANNs. To support this, we draw on the disciplines of functional data analysis and spatial statistics in a novel way. We then develop an approach to apply the results of spatial analysis to determine what components of a deep network are transferable. Finally, we test the transferability of deep networks that have been trained based on this approach.

This paper is organized as follows. Section II discusses relevant literature to provide both background information and

results of similar research. In Section III, we describe the data collection process for our domain of study. We then discuss how we apply techniques from FDA to prepare the data for analysis in Section IV. In Section V, we provide a detailed explanation of our experimental and analysis approach. The results of our analysis are given in Section VI, and we provide conclusions and areas of future work in Section VII.

II. RELATED WORK

The idea behind stacked autoencoders is to use layers of autoencoders to represent lower-dimensional encodings of a dataset under study [4]. The layers are stacked together in a fashion that facilitates constructing an abstraction hierarchy of features by having learned features derived directly on the response of the lower-level feature detectors. These feature detectors are developed through an iterative process of unsupervised pre-training, where resulting autoencoders migrate towards important basins of attraction expressed in data [5]. In this work, we control the training process to gain a better understanding of these basins of attraction in the context of a problem exhibiting high spatio-temporal correlation. This permits us to apply tools to support analysis of the learned features as a function of both space and time.

Transfer learning in ANNs is an area of very active research [6], [7]. It exploits knowledge gained from auxiliary domains in order to facilitate predictive modeling in new domains [8]. There have been studies applying transfer learning in meteorology, which is the domain of interest here.
In one example, Hu, Zhang, and Zhou applied transfer learning with stacked autoencoders to improve wind speed prediction in areas lacking sufficient data to appropriately train models from scratch [9].

While praising the success that many AI models have had in recent years, Samek, Wiegand, and Müller [2] state that “it is not clear what information in the input data makes them arrive at their decisions.” In this paper we take steps to determine this information, although we do not make the claim that this information will be human-understandable.

While there does appear to be considerable literature in recent years on transfer learning and explainable AI individually, there does not appear to be any significant work combining the two. The novelty of the work performed in this paper is using analysis for explainability approaches to facilitate the transfer process.

Table I: Features for the Hurricane Sandy Dataset

Reading Source            Reading Name
Radiometry Measurement    Temperature
                          Pressure
                          Cloud Density
                          Rain Density
                          Ice Density
                          Snow Density
                          Graupel Density
Wind Speed Indicator      Wind u (East/West)
                          Wind v (North/South)
                          Wind w (Up/Down)

The process of analyzing data that is generated as a result of some underlying process is referred to as “functional data analysis,” where the data can be modeled and represented as a function, often in time or space. Silverman and Ramsay provide one of the first texts on the subject of FDA where they describe several analysis methods directly applicable to the type of data analyzed here (namely, weather data) [10]. In recognizing that weather occurs as a function of both space and time, we use methods from FDA to assemble features expressed in meteorological data as a hurricane moves over an area of the Earth. FDA is not without precedent in Earth science. King [11] explored functional analytic methods in analyzing climate change.
In that work the author fit spline functions to temperature time series data in order to track temperature changes in US cities over the last few decades. She did not, however, apply FDA in a machine learning context.

III. DATA

The type of data we consider here is highly multidimensional, spatial, temporal, and functional. It is multidimensional in that for each AOI, we have temperature, pressure, precipitation, humidity, and wind data. It is spatial because we are examining these same dimensions across a three-dimensional physical space, and these spatial relationships are a significant factor in the behavior of the system. It is temporal because data at each location are represented as a time series as it tracks a storm over a 24-hour period. It is functional because changes in one area propagate through the space according to atmospheric forces and dynamics as a function of (among other things) space and time.

The data that we used for this investigation were generated by Zhang and Gasiewski in [12]. The data come from a Weather Research and Forecasting (WRF) model of one day in the life of Hurricane Sandy from 2012 with data collected every 15 minutes. It is an aggregation of two datasets: one consisting of radiometric readings from space and one consisting of spatially and temporally located wind vectors. Since radiometers cannot measure wind speed, we considered it useful to use deep learning to predict wind vector components from such inputs.

Table I shows the data that were available for each point in our study. The radiometric and wind datasets are from different sources, but all of their respective measurements have been aligned with one another with respect to space and time.

IV. THE ROLE OF FUNCTIONAL DATA ANALYSIS

In this paper we examine weather data as being functional in nature. We assume that the behavior of the data from each of the points of interest is influenced by common factors that influence all of these points together.
Because of this, we hypothesize that the encodings that result from training networks on the data from each point of interest contain transferable knowledge. We want to use the understanding gained through this examination to broaden the predictive capability of similarly trained neural networks, and we further hypothesize that spatio-temporally correlated feature detectors

in a trained network can be extracted and used to train networks for other parts of the dataspace more efficiently and more accurately.

To illustrate the role that FDA plays in our analysis, we use the following toy example. Suppose three people are standing along the side of a road as a firetruck passes by with its siren blaring. When the three pedestrians are in the same relative position to the street as the firetruck passes, each of their experiences will be exactly the same in terms of the volume of the siren they perceive. However, when they are at different distances from the edge of the street and at inconsistent intervals along the length of the street, the power of the functional treatment of this data is much more readily apparent. Figure 1 shows three pedestrians positioned in this way, and Figure 2 shows a notional plot of the corresponding volume perceived by each pedestrian. Notice how added distance causes the overall volume to be lower and the volume change to be flatter, in contrast with the experience of the pedestrian closest to the street. Also, the time of experiencing the change in volume of the siren varies based on the lateral position of the pedestrian.

Figure 1: Pedestrians Along Road with Passing Firetruck (pedestrians labeled a, b, c)

Figure 2: Volume Levels by Time for Each Pedestrian (x-axis: Time; y-axis: Volume; curves a, b, c)

In analyzing this situation, functional data registration would cause the peaks of these curves to be aligned (shift registration) and the amplitudes to be modified (amplitude registration) as much as possible to bring the overall shapes of the phenomena into alignment, while maintaining the individual differences of the functions [10]. Our example dataset characterizes the behavior of a hurricane, which, like the firetruck in the above example, is a spatio-temporal phenomenon that is moving through our area of interest. As the phenomenon makes its way through every location on the map, its “locations” in the phenomenon affect “locations” in the space in a related way, like the siren on the firetruck.

V. APPROACH

A. Overview

To extract information that we can use to generalize across the dataspace, we perform a two-stage training of a set of stacked autoencoders. The layers of these stacked autoencoders are trained using unsupervised pre-training. Except for the primary random initialization of the prototype network, we remove all sources of randomness in the pre-training procedure to facilitate a controlled analysis of the learned features. The initial weights of each of the autoencoders are cloned from a single random initialization and replicated across the spatial area in our dataspace so that they are exactly the same. The data used to do the pre-training are fed into each autoencoder in the same order, so there is time-correspondence across the dataspace. Maintaining this numerical consistency allowed us to trace the effects of the data using a consistent model, ensuring our focus on the dataspace rather than the model itself. The resulting weights are then analyzed with respect to how they vary across the dataspace. We use the results of applying semi-variograms from spatial statistics to determine information that may be shared across networks that correspond to each area of interest.

Figure 3: Analysis Locations for Training the Neural Networks

B. Data Preparation

1) Area of Interest Instance Data: Figure 3 shows the distribution of areas of interest where our data was collected. The left side of the figure shows the Eastern seaboard of the United States from Long Island on the upper left to Florida and Grand Bahama in the lower left.
This square region is the section over which all of the data was collected, and the dots show the locations of each geographical area we analyzed. Each dot represents the center of a grid cell in a geodesic Discrete Global Gridding System (DGGS) [13] superimposed over the entire planet.

Figure 4 depicts one area of interest, corresponding to one of the points from Figure 3. Each numbered cell is a 15 km resolution DGGS cell. We train networks to predict the wind vector conditions in cell 0 at each location for each succeeding time step in the dataset. We assume, as was assumed in [14], [15], that the wind vector values in cell 0 for the current time

slice (excluding those forces not represented in the data) can be determined by the radiometric readings of that same cell for the current time slice and the radiometric and wind readings of all cells in the previous time slice.

Figure 4: One DGGS Area of Interest (cells numbered 0 through 6)

2) Time Shift Instance Data: The task before us is to take data from radiometric readings at a particular point in time $t$, which we denote $r_t^0, \ldots, r_t^n$. We use this data to predict the wind vectors at time $t+1$, which we denote $u_{t+1}$ (zonal, or east-west), $v_{t+1}$ (meridional, or north-south), and $w_{t+1}$ (vertical, or up-down) respectively. Thus we set this up as a one-step time series prediction problem.

3) Scale Data: Each input and output data variable varied greatly in magnitude. The data variables were also captured using differing units, for example, kilometers per hour for wind and degrees Celsius for temperature. To minimize the impact of such variability, we scaled the values for each of the variables to a range between 0 and 1. Each of the data variables was scaled individually so that the functional character of the data represented was preserved.

4) Separation of Input and Output Data: The radiometry measurements were the main features used for prediction. In the future we would like to use the wind vector predictions from the prior time step as inputs for the next time step, creating a more comprehensive system with enhanced predictive capabilities. But for now we wanted to provide the networks with as little information as possible about the state of the wind vector, save for the ground truth outputs in training.

5) Random Data Padding: To overcome an underflow issue with the functional data registration procedure, we padded each input dimension with a random value between 0.0001 and 0.001. This small value was adequate to prevent underflow while not being large enough to affect the overall patterns within the data.
This was not necessary for the output data dimensions.

6) Functional Data Registration: To bring the shapes of each of the input dimensions into greater relief we performed functional data registration over the dataset. This included shifting the functional form of the data to bring them into alignment with regard to time and intensity. This is referred to as shift and amplitude registration: shift registration adjusts the function's time dimension to align features along the abscissa, and amplitude registration increases or reduces intensity to align features along the ordinate.

To perform registration on the data, it must be converted into a functional form. This means that we represented the data using a set of basis functions and a corresponding set of coefficients. The two choices for the basis functions explained in [10] are the Fourier basis and the spline basis. Since the Fourier basis is primarily used for periodic data, and our data only spanned a single day, we chose the spline basis as being better suited for our non-periodic dataset.

7) Train, Validate, and Test Separation: Since this is a study exploring the dataspace rather than an exhaustive validation of a training methodology, it was important that we kept the treatment of the data consistent among training, validation, and testing sets. For this reason, the time indices for the training, validation, and testing segments of the data were pre-selected before any of these processes proceeded. After completing the aforementioned procedures, there were 93 data instances for each area of interest. For training we reserved 73 of these, selected at random, and 10 each for validation and testing.

8) Sequential Data Training: The data from all of the points of interest on the map are bound together by time. This means that the first instance for each location happens at the same time as the first instance in all other locations, and so on.
Normally, during the training of neural networks, the examples are fed to the procedure randomly. This would have the effect of scrambling the temporal sequence across the dataspace, rendering the instances incomparable. Since we wanted total consistency in the training of these networks, with no randomization during the training process, exactly one ordering of data was used. This was done to ensure that the same time indices were used for all areas of interest and all of the training epochs corresponded with each other. This ordering was pre-selected and was applied system-wide.

C. Layer Pre-Training

1) Prototype Network: In the interest of removing all sources of variation in the pre-training procedure, the networks trained on the data from each area of interest were cloned from a single, randomly initialized autoencoder. Therefore, all pre-training had the same random starting point. The prototype autoencoder was a single layer of 150 nodes and its layer weights were each initialized to random values between -1 and 1.

2) Node Profiles Pairwise Dot Products: After pre-training, each network was the result of the original, cloned autoencoder having been pre-trained on data from its respective area of interest. To compare what was learned by each node across the dataspace, we collected each node's incoming weight vector. Specifically, for each autoencoder and for each area of interest we collected the weight vector of each corresponding node and arranged them into a similarity matrix. Because of the way the networks were trained, we assert that the feature learned by a particular node at one location corresponds to the same feature learned by that node in another location.

More formally, suppose we have two autoencoders $A_1$ and $A_2$. Suppose we order the hidden nodes of each autoencoder as $h_1, \ldots, h_n$. Because of the controlled pre-training procedure and the fact that each autoencoder started from the same state, we assert that nodes $h_i^{A_1}$ and $h_i^{A_2}$ are examining the same

feature of the underlying data and are, therefore, comparable. We computed the pairwise dot products of the node weight vectors for each node as the measure of similarity between each of the node profiles for each location.

3) Semivariograms: Having the matrix of pairwise dot products based on location allowed us to analyze the variance as a function of distance between each node. For this we used semi-variograms [16], [17], which depict the differences in the dot products for all of the nodes for each location. This geostatistical tool enabled us to examine the extent to which the results of unsupervised pre-training were spatially dependent [18]. A geostatistical analysis is appropriate here because we have endeavored to remove all other randomness from the model, and are instead analyzing the systems that remain: those of the pre-trained neural model and the mixture of random and functional dynamics that are endemic to the storm system [18]. The mathematical definition of a semivariogram is as follows [17].

$$\gamma(\vec{h}) = \frac{1}{2}\,\mathrm{Var}\left[ Z(\vec{x} + \vec{h}) - Z(\vec{x}) \right]$$

where $\gamma$ is derived from spatially distributed random variables $Z(\vec{x})$ and $Z(\vec{x} + \vec{h})$, and $\vec{x}$ and $\vec{x} + \vec{h}$ are the spatial positions separated by $\vec{h}$ [17].

Figure 5 shows three examples of semi-variograms that were produced during this procedure. Each of these charts uses a different scale on the ordinate, since the scale of each of the pairwise differences differed significantly. To remove the influence of the scale in the expression of the patterns in the data we individually scaled each of the semi-variograms to a range between 0 and 1. This allowed us to pairwise compare the patterns that were in the semivariograms, rather than the data that generated them.

4) Hierarchical Agglomerative Clustering: Since all of the semi-variograms were now on the same scale, we used hierarchical agglomerative clustering (HAC) to determine interrelatedness among semi-variograms. Since the number of clusters is an input parameter to HAC, we assessed each clustering from 2 clusters through 9 clusters. To determine what we regarded as an optimal clustering we used the Calinski-Harabasz score (also known as the pseudo-F score) [19] as follows.

$$F_{NC} = \frac{\sum_{i=1}^{NC} n_i\, d^2(c_i, c) \,/\, (NC - 1)}{\sum_{i=1}^{NC} \sum_{x \in C_i} d^2(x, c_i) \,/\, (N - NC)}$$

where $NC$ is the number of clusters, $N$ is the number of objects in the dataset, $n_i$ is the number of examples in the $i$th cluster, $d(x, y)$ is the Euclidean distance between $x$ and $y$, $c_i$ is the center of the $i$th cluster, and $c$ is the center of the dataset. The Calinski-Harabasz measure is an evaluation of cluster validity based on intra-cluster distance and inter-cluster distance. An example Calinski-Harabasz score profile is shown in Figure 6. The chart shows that, for this configuration, the optimal HAC clustering to use is three clusters.

We identify the clusters inside the clusterings by their silhouette scores, a quality measure based on pairwise differences of between-cluster and within-cluster distances. After using the Calinski-Harabasz score to select the number of clusters as an input parameter, the clusters that were produced using HAC had different silhouette scores, indicating varying cluster quality. Since we know that a maximal silhouette score is to be preferred, we wanted to see if cluster quality in this respect had an impact on the resulting convergence and prediction performance. The results depicted in Figure 9 are divided by average silhouette score for each cluster. The silhouette score is defined as:

$$Sil_{NC} = \frac{1}{NC} \sum_{i=1}^{NC} \left[ \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max\{b(x), a(x)\}} \right]$$

where

$$a(x) = \frac{1}{n_i - 1} \sum_{y \in C_i,\, y \neq x} d(x, y)$$

and

$$b(x) = \min_{j,\, j \neq i} \left[ \frac{1}{n_j} \sum_{y \in C_j} d(x, y) \right].$$

Again, $NC$ is the number of clusters, $n_i$ is the number of objects in the $i$th cluster, $d(x, y)$ is the Euclidean distance between $x$ and $y$, and $c_i$ and $c$ are the centers of an individual cluster and the entire dataset respectively.

The three charts in Figure 5 are semi-variograms of nodes that were examples of three clusters, each with different average silhouette coefficients (ASC). When we examined the semi-variograms across the space of all nodes, we observed this variety of patterns. It is this data that allows us to separate nodes that we fix in the next step from nodes that we allow to vary.

5) Fixed Pre-Training: What we obtained from each cluster is the fixed-set, which is a list of nodes to transfer to other locations in the dataspace. This forms the basis for the transfer learning experiment. In this procedure, the original autoencoder layers were once again copied from the single prototype autoencoder. The node weights for the fixed-set of nodes were then transferred into the copy of this prototype. Holding the weights of the fixed-set constant throughout the pre-training procedure, the resulting autoencoder layers were pre-trained in this configuration. Figure 7 shows this situation for one layer of the network. In this figure, the shaded (red) nodes are copies from another network identified from a cluster of spatially correlated feature detector nodes. In this paper we only used a one-layer stacked autoencoder; however, we intend to extend this to multiple layers in future work.

For the “Surrounding POI Experiments” described in Section VI-A we refer back to Figure 3. The location in the upper left indicated by the star shows an example area of interest whose node weights were copied. The surrounding dots represent the areas of interest to which these node weights were copied.

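The semivariogram analysis of Section V-C can also be sketched in a few lines. The paper gives the theoretical definition of $\gamma$ but does not state which estimator was used, so this sketch assumes the classical binned (Matheron) estimator, $\hat{\gamma}(h) = \frac{1}{2N(h)} \sum (z_i - z_j)^2$ over the $N(h)$ location pairs whose separation falls in lag bin $h$. It then applies the 0-to-1 scaling described in the text. The function names and the toy data are hypothetical.

```python
import math


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Binned semivariogram: half the mean squared difference per distance bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            k = int(math.hypot(dx, dy) // lag_width)  # index of the lag bin
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    # Bins with no pairs are reported as None rather than zero.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


def min_max_scale(gamma):
    """Scale the non-empty bins to [0, 1] so that only the shape is compared."""
    present = [g for g in gamma if g is not None]
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if g is None else ((g - lo) / span if span else 0.0)
            for g in gamma]


# Toy data: four locations on a line with a linearly increasing value.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [0.0, 1.0, 2.0, 3.0]
gamma = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=4)
scaled = min_max_scale(gamma)
print(gamma)   # [None, 0.5, 2.0, 4.5]
print(scaled)  # [None, 0.0, 0.375, 1.0]
```

In this toy case the semivariance grows with distance, which is the signature of spatial dependence that the analysis looks for in each node's profile.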
Figure 5: Example Semivariograms from Each Cluster in Selected Clustering (panels: Node 1, ASC 0.18; Node 19, ASC 0.11; Node 18, ASC 0.43). ASC = Average Silhouette Coefficient

For the “Linear Cross Transfer Experiment,” whose results are described in Section VI-B, we refer to Figure 11, where the weights were copied from each corner location (locations 12, 19, 82, and 89) into the networks corresponding to the line of AOIs leading to the respective diagonally opposite corners. For example, for location 12 in the figure, the fixed-set was copied to the networks to be trained on data from locations 23, 34, 45, 56, 67, 78, and 89.

Figure 6: Calinski-Harabasz Score for Clusterings: 2–9 Clusters (x-axis: Number of Clusters; y-axis: Score)

D. Fine Tuning and Testing

After transfer and during fine-tuning, the training data are fed into the network in the same order as they were fed in for the unsupervised pre-training procedure. This, again, is to remove as many sources of variation as we could during the entire procedure. The number of iterations and associated mean squared error of the networks were tracked and compared to the original training to determine if the transfer learning process was more efficient and effective as hypothesized.

VI. RESULTS

A. Surrounding POI Experiments

Figure 7: Autoencoder Fixed Nodes

Figure 8 shows the difference in convergence time that is typical for each of the locations surrounding the location from which we transferred the fixed-set. As can be seen, fixing the nodes from the pre-training of the center cell had the effect of substantially reducing convergence times. It also shows consistently lower overall mean squared error with respect to autoencoder reconstruction during the pre-training procedure. Figure 9 shows the test performance when predicting the wind vectors for each of four configurations that we used.
We tested configurations that were both regularized (L1 regularization) and unregularized, and we used both the ReLU and hyperbolic tangent (tanh) activation functions in the networks that were assembled from the autoencoder labels. Of particular interest is the configuration that used regularization and the ReLU activation function. In this configuration, using fixed

nodes from the cluster corresponding to the higher average silhouette coefficient produced better predictive results, in general, than either those of the fine-tuned pre-trained networks or the fixed pre-trained configuration using

Figure 8: Typical Convergence Comparison Plot (x-axis: Epoch; legend includes Fixed Pre-Trained)

The convergence behavior resembled that from the previous experiment, whose results are shown in Figure 8, with minor variation. The convergence time was substantially lower and the MSE converged to a similarly lower value.

Figure 12 shows the prediction performance of the resulting networks for the configuration where no regularization was used and ReLU was used as the activation function. For the first two rows the locations moving to the right across the x-axis indicate a movement in the dataspace farther away from the network from which the trained nodes were copied. In the last two rows, movement to the left along the x-axis indicates this

VII. CONCLUSIONS AND FUTURE WORK
