Deep Auto-Encoder Neural Networks in Reinforcement Learning

Sascha Lange and Martin Riedmiller

Abstract— This paper discusses the effectiveness of deep auto-encoder neural networks in visual reinforcement learning (RL) tasks. We propose a framework for combining deep auto-encoder neural networks (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning policies). An emphasis is put on the data-efficiency of this combination and on studying the properties of the feature spaces automatically constructed by the deep auto-encoders. These feature spaces are empirically shown to adequately resemble existing similarities between observations and to allow learning useful policies. We propose several methods for improving the topology of the feature spaces, making use of task-dependent information in order to further facilitate policy-learning. Finally, we present first results on successfully learning good control policies using synthesized and real images.

I. INTRODUCTION

Recently, several impressive successes of applying reinforcement learning to real-world systems have been reported [1], [2], [3]. But present algorithms are still limited to solving tasks with state spaces of rather low dimensionality¹. Learning policies directly on visual input—e.g. raw images as captured by a camera—is still far from being possible. Usually, when dealing with visual sensory input, the original learning task is split into two separate processing stages (see fig. 1). The first is for extracting and condensing the relevant information into a low-dimensional representation using methods from image processing. The second stage is for learning a policy on this particular encoding.

Fig. 1. Classic decomposition of the visual reinforcement learning task into sensing, a low-dimensional feature space, and visuomotoric learning of a policy. Classical solution: image processing. Here: unsupervised training of deep auto-encoders.

In order to increase the autonomy of a learning system, letting it adapt to the environment and find suitable representations by itself, it will be necessary to eliminate the need for manual engineering in the first stage. This is exactly the setting where we see a big opportunity for integrating recently proposed deep auto-encoders, replacing hand-crafted preprocessing and more classical learning in the first stage.

Sascha Lange and Martin Riedmiller are with the Department of Computer Science, Technical Faculty, Albert-Ludwigs University of Freiburg, D-79194 Freiburg, Germany (phone: 49 761 203 8006).

¹Over-generalizing: less than 10 intrinsic dimensions for value-function based methods and less than 100 for policy gradient methods.

New methods for unsupervised training of very deep architectures with up to millions of weights have opened up completely new application areas for neural networks [4], [5], [6]. We now propose another application area, reporting on first results of applying Deep Learning (DL) to visual navigation tasks in RL. Whereas most experiments conducted so far have concentrated on distinguishing different objects, more or less perfectly centred in small image patches, in the task studied here the position of one object of interest wandering around an image has to be extracted and encoded in a very low-dimensional feature vector that is suitable for the later application of reinforcement learning.

We will mainly concentrate on two open topics. The first is how to integrate the unsupervised training of deep auto-encoders into RL in a data-efficient way, without introducing much additional overhead.
In this respect, a new framework for integrating the deep learning approach into recently proposed memory-based batch-RL methods [7] will be discussed in section III. We will show that the auto-encoders in this framework produce good reconstructions of the input images in a simple navigation task after passing the high-dimensional data through a bottleneck of only two neurons in their innermost hidden layer.

Nevertheless, the question remains whether the encoding implemented by these two innermost neurons is useful only in the original task, that is, reconstructing the input, or whether it can also be used for learning a policy. Whereas the properties of deep neural networks have been thoroughly studied in object classification tasks, their applicability to the unsupervised learning of a useful preprocessing layer in visual reinforcement learning tasks remains rather unclear. The answer to this question—the second main topic—mainly depends on whether the feature space allows for abstracting from particular images and for generalizing what has been learned so far to newly seen, similar observations. In section V, we will name four evaluation criteria, do a thorough examination of the feature space in this respect, and finally give a positive answer to this question. Moreover, we will present some ideas on how to further optimize the topology of the feature space using task-specific information. Finally, in section VI we present first successes of learning control policies directly on synthesized images and—for the very first time—using a real, noisy image formation process.

II. RELATED WORK

[8] was the first attempt at applying model-based batch-RL directly to (synthesized) images. Ernst did a similar experiment using model-free batch-RL algorithms [9]. The interesting work of [10] can be seen as being on the verge of fully integrating the learning of the image processing into RL.
Nevertheless, the extraction of the descriptive local features was still implemented by hand, learning just the task-dependent selection of the most discriminative features. All three [8], [9], [10] lacked realistic images, ignored noise and just learned to memorize a finite set of observations, not testing for generalization at all.

Instead of using Restricted Boltzmann Machines during the layer-wise pretraining of the deep auto-encoders [4], our own implementation relies completely on regular multi-layer perceptrons, as proposed in chapter 9 of [11]. Previous publications have concentrated on applying deep learning to classical image recognition tasks like face and letter recognition [4], [5], [11]. The RL tasks studied here also add the complexity of tracking moving objects and encoding their positions adequately in very low-dimensional feature vectors.

III. DEEP FITTED Q-ITERATION

In this section, we will present the new Deep Fitted Q-Iteration algorithm (DFQ) that integrates the unsupervised training of deep auto-encoders into memory-based batch-RL.

A. General Framework

In the general reinforcement learning setting [12], an agent interacts with an environment in discrete time steps $t$, observing some state $s \in S$ and some reward signal $r$, to then respond with an action $a \in A$. We are interested in tasks that can be modelled as Markov decision processes [12] with continuous state spaces and finite action sets. The task is to learn a policy $\pi : S \to A$ maximizing the expectation of the discounted sum $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ of future rewards $r_t$ with discount factor $\gamma \in [0, 1]$. In the visual RL tasks considered here, the present state of the system is not directly observable by the agent. Instead, the agent receives a high-dimensional, continuous observation $o \in O$ (image) in each time step.

Fig. 2. Extended agent-environment loop in visual RL tasks. The system dynamics produce a state whose semantics are unknown to the agent; the environment renders an observation; the deep encoder computes a feature vector from it, on which Q-values and action selection (semantics known) operate. In the deep RL framework proposed here, a deep-encoder network is used to transfer the high-dimensional observations into a lower-dimensional feature vector which can be handled by available RL-algorithms.

We will insert the training of deep auto-encoders on the agent's side of the processing loop (fig. 2) in order to learn an encoding of the images in a low-dimensional feature space. We advocate a combination with recently proposed model-free batch-RL algorithms, such as Fitted Q-Iteration (FQI) [13], LSPI [14] and NFQ [15], because these methods have been successful on real-world continuous problems and, as these sample-based batch algorithms [7] already store and reuse state transitions $(s_t, a_t, r_{t+1}, s_{t+1})$, the training of the auto-encoders integrates very well (see fig. 3) into the batch-RL framework with episodic exploration as presented in [3].

Fig. 3. Graphical sketch of the proposed framework for deep batch-RL with episodic exploration. Outer loop: sampling experience by interacting with the environment, unsupervised training of the deep auto-encoder, and preparing the pattern set by transferring the transition data from observation space into feature space. Inner loop: batch-mode supervised learning of an approximated value function, resulting in a changed policy.

In the outer loop, the agent uses the present approximation of the Q-function [12] to derive a policy—e.g. by ε-greedy exploration—for collecting further experience. In the inner loop, the agent uses the present encoder to translate all collected observations to the feature space and then applies some batch-RL algorithm to improve an approximation of the Q-function.
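To make this extended loop concrete, the following Python sketch shows where the encoder sits between observation and action selection. The names env_reset, env_step, encode and policy are stand-in placeholders for this illustration, not part of the paper's implementation.

    import numpy as np

    # Placeholder environment, encoder and policy stubs; only the
    # structure of the loop in fig. 2 matters here.
    rng = np.random.default_rng(0)
    env_reset = lambda: rng.random(900)            # initial observation
    env_step = lambda a: (rng.random(900), -1.0)   # (next observation, reward)
    encode = lambda o: o[:2]                       # stand-in deep encoder
    policy = lambda z: int(rng.integers(4))        # stand-in exploration policy

    transitions = []
    o = env_reset()
    for t in range(100):
        z = encode(o)            # deep encoder: observation -> feature vector
        a = policy(z)            # action selection on Q-values in feature space
        o_next, r = env_step(a)  # system dynamics produce the next observation
        transitions.append((o, a, r, o_next))      # stored in observation space
        o = o_next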
From time to time, the agent may retrain a new auto-encoder. The details of the processing steps will be discussed in the following subsections.

B. Training Deep Auto-Encoders with RProp

Training the weights of deep auto-encoder neural networks to encode image data has been thoroughly treated in the literature [4], [11]. In our implementation, we use several shallow auto-encoders for the layer-wise pre-training of the deep network, starting with the first hidden layer and always training on reconstructing the output of the previous layer. After this pre-training, the whole network is unfolded and fine-tuned for several epochs by training on reconstructing the inputs. Differing from other implementations, we make no use of RBMs but use multi-layer perceptrons (MLP) and standard gradient descent on units with sigmoidal activations in both phases, as proposed in chapter 9 of [11]. Weights are updated using the RProp learning rule [16]. As RProp only considers the direction of the gradient and not its length, this update rule is not as vulnerable to vanishing gradients as standard back-propagation. Furthermore, the results do not depend on extensive tuning of parameters [16].
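As an illustration, the layer-wise scheme described above might look as follows in PyTorch, whose built-in torch.optim.Rprop implements the sign-based update rule. The layer sizes, epoch count and random stand-in data are purely illustrative, not the values used in the paper.

    import torch
    import torch.nn as nn

    def pretrain_layer(inputs, n_hidden, epochs=50):
        """Train a shallow sigmoid auto-encoder on `inputs` with RProp and
        return the hidden-layer codes plus the trained encoder part."""
        n_in = inputs.shape[1]
        enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
        opt = torch.optim.Rprop(list(enc.parameters()) + list(dec.parameters()),
                                etas=(0.5, 1.2))  # sign-based step adaptation [16]
        for _ in range(epochs):                   # RProp is a full-batch method
            opt.zero_grad()
            nn.functional.mse_loss(dec(enc(inputs)), inputs).backward()
            opt.step()
        return enc(inputs).detach(), enc

    # Layer-wise pretraining: each shallow auto-encoder reconstructs the
    # output of the previous layer; the sizes here are illustrative only.
    x = torch.rand(256, 900)                      # stand-in image batch
    codes, layers = x, []
    for n_hidden in (225, 121, 2):
        codes, enc = pretrain_layer(codes, n_hidden)
        layers.append(enc)
    # Afterwards, the layers are unfolded into a deep auto-encoder and the
    # whole net is fine-tuned on reconstructing the original inputs.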
C. DFQ: Integrating deep auto-encoders into batch-RL

We combined the deep auto-encoders with Ernst's Fitted Q-Iteration for learning a policy. The new DFQ algorithm consists of the following steps, realizing the inner and outer loop of figure 3:

A. Initialization: Set the episode counter $k \leftarrow 0$ and the sample counter $p \leftarrow 0$. Create an initial (random) exploration strategy $\pi^0 : z \mapsto a$ and an initial encoder $\mathrm{ENC}(\cdot\,; W^0) : o \mapsto z$ with (random) weight vector $W^0$. Start with an empty set $F^O$ of transitions $(o_t, a_t, r_{t+1}, o_{t+1})$.

B. Episodic Exploration: In each time step $t$, calculate the feature vector $z_t$ from the observed image $o_t$ using the present encoder, $z_t = \mathrm{ENC}(o_t; W^k)$. Select an action $a_t = \pi^k(z_t)$ and store the completed transition, $F^O \leftarrow F^O \cup \{(o_p, a_p, r_{p+1}, o_{p+1})\}$, incrementing $p$ with each observed transition.

C. Encoder Training: Train an auto-encoder (see [4]) on the $p$ observations in $F^O$ using RProp during layer-wise pretraining and finetuning. Derive the encoder $\mathrm{ENC}(\cdot\,; W^{k+1})$ (first half of the auto-encoder). Set $k \leftarrow k + 1$.

D. Encoding: Apply the encoder $\mathrm{ENC}(\cdot\,; W^k)$ to all transitions $(o_t, a_t, r_{t+1}, o_{t+1}) \in F^O$, transferring them into the feature space $Z$ and constructing a set $F^Z = \{(z_t, a_t, r_{t+1}, z_{t+1}) \mid t = 1, \ldots, p\}$ with $z_t = \mathrm{ENC}(o_t; W^k)$.

E. Inner Loop (FQI): Call FQI with $F^Z$. Starting with an initial approximation $\hat{Q}^0(z, a) = 0$ for all $(z, a) \in Z \times A$, FQI (details in [13]) iterates over a dynamic programming (DP) step creating a training set $P^{i+1} = \{(z_t, a_t; \bar{q}_t^{i+1}) \mid t = 1, \ldots, p\}$ with $\bar{q}_t^{i+1} = r_{t+1} + \gamma \max_{a' \in A} \hat{Q}^i(z_{t+1}, a')$, and a supervised learning step training a function approximator on $P^{i+1}$, obtaining the approximated Q-function $\hat{Q}^{i+1}$. After convergence, the algorithm returns the unique fixed point $\bar{Q}^k$.

F. Outer Loop: If satisfied, return the approximation $\bar{Q}^k$, the greedy policy $\pi$ and the encoder $\mathrm{ENC}(\cdot\,; W^k)$. Otherwise, derive an exploration strategy $\pi^k$ from $\bar{Q}^k$ and continue with step B.

Each time a new encoder $\mathrm{ENC}(\cdot\,; W^{k+1})$ is learned in step C, and thus the feature space and its semantics are changed, the present approximation of the Q-function becomes invalid. Whereas online-RL would have to start over completely from scratch, in the batch approach the stored transitions can be used to immediately calculate a new Q-function in the new feature space, without a single interaction with the system. When using an averager [8] or kernel-based approximator [7] for approximating the Q-function, the series of approximations $\{\hat{Q}^i\}$ produced by the FQI algorithm—under some assumptions—is guaranteed to converge to a unique fixed point $\bar{Q}$ that is within a specific bound of the optimal Q-function $Q^*$ [13], [7]. Since the non-linear encoding $\mathrm{ENC}(\cdot\,; W^k) : O \to Z$ does not change during the inner loop, these results also cover applying the FQI algorithm to the feature vectors. The weights of the averager can easily be adapted to include the non-linear mapping as well, as the only restrictions on the weights are non-negativity and summing up to 1 [8], [7].
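For concreteness, the following numpy sketch illustrates steps D and E with a simple table-based grid approximator (one constant Q-value per hyper-rectangle, cf. section IV-C). The stand-in encoder and random transition data are placeholders; this is a sketch of the scheme under these assumptions, not the paper's implementation.

    import numpy as np
    from collections import defaultdict

    def fqi_on_features(F_z, actions, gamma=0.9, n_iters=50, n_cells=500):
        """FQI (step E) with a grid approximator assigning one constant
        Q-value to all feature-action pairs in the same hyper-rectangle."""
        cell = lambda z: tuple(np.minimum((np.asarray(z) * n_cells).astype(int),
                                          n_cells - 1))   # features in [0, 1]
        Q = defaultdict(float)                             # Q^0(z, a) = 0
        for _ in range(n_iters):
            # DP step: targets q = r + gamma * max_a' Q(z', a')
            P = [((cell(z), a),
                  r + gamma * max(Q[(cell(z1), a1)] for a1 in actions))
                 for (z, a, r, z1) in F_z]
            # "Supervised" step: for a grid approximator, fitting amounts
            # to averaging the targets that fall into the same cell.
            s, c = defaultdict(float), defaultdict(int)
            for key, q in P:
                s[key] += q
                c[key] += 1
            Q = defaultdict(float, {k: s[k] / c[k] for k in s})
        return Q

    # Step D (encoding) with a stand-in encoder and random transitions:
    rng = np.random.default_rng(0)
    encode = lambda o: o[:2]                               # placeholder ENC
    F_o = [(rng.random(900), int(rng.integers(4)), -1.0, rng.random(900))
           for _ in range(100)]
    F_z = [(encode(o), a, r, encode(o1)) for (o, a, r, o1) in F_o]
    Q = fqi_on_features(F_z, actions=range(4))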
D. Variations and Optimizations

The following variations have been found useful, improving the data efficiency and speeding up the learning process.

Sparse networks: Using receptive fields in the $n$ outermost layers instead of a full connection structure can help to greatly reduce the total number of weights in the encoder, thus significantly decreasing the number of samples and the time needed for training. Furthermore, receptive fields are an effective method of exploiting the neighbourhood structure between individual image dimensions. This idea has some motivation in biology and its usage in artificial neural networks dates back to at least the neocognitron.

Transferring information: If, after learning a new encoder $\mathrm{ENC}(\cdot\,; W^k)$ in step C of DFQ, recalculating the approximation $\hat{Q}^k$ from scratch by iterating over all the available transitions is too expensive, the Q-values can be transferred from the old approximation $\bar{Q}^{k-1} : (z^{k-1}, a) \mapsto \mathbb{R}$ in the old feature space $Z^{k-1}$ to the new space $Z^k$. The trick is to do one DP update by looking up the Q-values of $z_{t+1}$ in the old approximation $\bar{Q}^{k-1}$, but then storing the resulting target value $\bar{q}$ within the new feature space. We simply prepare a training set $P_{Z^k} = \{(z_t^k, a_t; \bar{q}_t) \mid t = 1, \ldots, p\}$ with one sample $(z_t^k, a_t; \bar{q}_t)$ for every transition in $F^O = \{(o_t, a_t, r_{t+1}, o_{t+1}) \mid t = 1, \ldots, p\}$, where $z_t^k = \mathrm{ENC}(o_t; W^k)$. In this case, $\bar{q}_t$ is calculated using the old feature vector $z_{t+1}^{k-1} = \mathrm{ENC}(o_{t+1}; W^{k-1})$ and the old approximation $\bar{Q}^{k-1}$ as $\bar{q}_t = r_{t+1} + \gamma \max_{a' \in A} \bar{Q}^{k-1}(z_{t+1}^{k-1}, a')$. These patterns are then used to train a new, initial approximation $\hat{Q}^k$ in the new feature space. In practice, the number of iterations needed until convergence of the FQI algorithm in step E is often reduced when starting from this initial approximation.

Re-training: For an optimal learning curve, the auto-encoder would have to be re-trained whenever new data is available. But since the expected improvement of the feature space given just a few more observations is rather limited, the re-training of the auto-encoder is triggered only with every doubling of the number of collected observations, as a fair compromise between optimal feature spaces and computing time.

IV. PROOF OF CONCEPT

In a very first proof-of-concept experiment and throughout the evaluation of the feature spaces, we will use a continuous grid-world-like [12] problem with synthesized images, thus having complete control over the image formation process (see figure 4). After very careful consideration, we have chosen such a task with a rather simple optimal policy for our analysis, as the focus of this paper is on handling the complex, high-dimensional observations and not on learning very difficult policies. At the same time, the task allows for a thorough analysis of the proposed methods, as the decision boundaries and optimal costs can be derived precisely.

A. Continuous grid-world with noisy image formation

In this problem, the agent observes a rendered image $o \in [0, 1]^n$ with $n = 30 \times 30 = 900$ pixels and added Gaussian noise $\mathcal{N}(0, 0.1)$ instead of the two-dimensional system state $s = (x, y) \in [0, 6)^2$. Due to the discrete nature of the pixels, the number of possible agent positions in the images is limited to 900 minus 125 positions blocked by walls. Each action moves the agent exactly 1m in one of four directions. The task of reaching the 1m × 1m goal area has been modeled as a shortest-path problem [12], with a reward of $-1$ for any transition not ending in the absorbing goal area.
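A minimal numpy sketch of this image formation might look as follows; the wall mask and the block-shaped agent are illustrative assumptions, with 5 pixels corresponding to 1m as in fig. 4.

    import numpy as np

    def render(x, y, walls, rng, noise_std=0.1):
        """Render the 30x30 observation o in [0,1]^900 with additive
        Gaussian noise N(0, 0.1); 5 pixels correspond to 1 metre."""
        img = np.zeros((30, 30))
        img[walls] = 1.0                       # static wall pixels
        px, py = int(x * 5), int(y * 5)        # agent position in pixels
        img[py:py + 5, px:px + 5] = 1.0        # draw the agent as a 5x5 block
        img += rng.normal(0.0, noise_std, img.shape)
        return np.clip(img, 0.0, 1.0).ravel()  # flatten to a 900-vector

    rng = np.random.default_rng(0)
    walls = np.zeros((30, 30), dtype=bool)     # placeholder wall mask
    o = render(2.0, 3.0, walls, rng)           # noisy observation for s = (2,3)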
Fig. 4. Continuous grid-world with noisy image formation. Left: 6m × 6m world, walls (W), 1m² goal area (G) and agent (A); the system state is $(x, y) \in [0, 6)^2$, and 5 px correspond to 1m. Middle: rendered image. Right: an auto-encoder calculates feature vectors $z$ from noisy observations ($900 \to 2$).

B. Simplified proof-of-concept experiment

In order to make a point about neglecting noise, as done in previous publications [8], [10], the described system has been further simplified for this very first experiment. The synthesized images were used without adding any noise, and the agent always started in one of only 30 different starting positions corresponding to the centers of the 1m × 1m cells of a regular partition of the grid-world's state space. In this version of the task there are only 31 different observations, thus making it comparable to rather simple problems with a small, finite set of states.

C. Results

In analogy to the task's internal state $(x, y)$, we have chosen to learn a two-dimensional encoding of the observations. After several preliminary tests, the auto-encoder's topology was fixed to 21 layers with …-121-225-484-900 neurons and 9 × 9 receptive fields between the 5 outermost layers. This auto-encoder had more than 350 000 connections in total. The average reconstruction error (total sum of squares / number of images) of the second-generation auto-encoder was 10.9 after 190 epochs of fine-tuning. Activations of output neurons representing immobile parts (walls, goal state) have been observed to 'attach' to the bias weights within the very first episodes of the training. Representing the learned Q-function with a large 2D grid approximator (500 × 500 cells) that assigns a constant Q-value to all state-action pairs residing in the same hyper-rectangle, the DFQ algorithm found the optimal policy within 65 episodes.

D. Discussion

Although the agent learned the optimal policy, the relevance of this result, as well as of the earlier results of Jodogne and Ernst [9], [10], is rather limited when it comes to answering the central question of whether or not the automatically constructed feature spaces will be useful for learning policies on real images. The simplified image-formation process in this experiment being completely deterministic, without any noise, each of the non-observable states is represented by exactly one observation. Even a randomly-initialized encoder would produce different outputs for all of them, thereby identifying the 31 system states uniquely. Q-values for that few observations can easily be learned by heart, storing them in a hash-table, without the need for ever generalizing to previously unseen observations. Hence, we will examine the feature spaces under more realistic conditions in the next section.

V. EVALUATING THE FEATURE SPACE

When targeting a realistic image-formation process that involves noise and many different observations, learning Q-values by heart is not possible (see fig. 6, column c). Abstraction from the pure pixel values and generalization among similar observations is in this case a necessity for learning good policies while at the same time being data-efficient, needing as few interactions as possible. Robustness to noise and not confusing the underlying, non-observable system states is another challenge. In this respect, there are four different criteria for evaluating the feature spaces constructed by the deep auto-encoders.

a) Identification: There should be a clear correspondence between feature vectors and underlying system states, allowing a reliable identification and introducing as little performance-limiting state-aliasing as possible.
b) Robustness: Encodings of several noisy images of the agent at the exact same position should be close to each other, forming narrow clusters in the feature space.
c) Generalization: Two images of the agent at only slightly differing positions should not result in vastly differing feature vectors but should have a tendency of being close to each other.
d) Topology: Somehow preserving the general relation among the underlying system states (e.g. x-y-coordinates) would provide additional benefit for approximators being able to generalize globally.

Whereas the second and third properties would allow relating new observations to what has been learned before, and thus—like the first criterion—are a prerequisite for any successful learning, the fourth property is not an absolute requirement but could facilitate the learning process.

Fig. 6. Effects of noisy image formation in the continuous grid-world task. a) Three sample observations of the agent from the testing set. b) Reconstructions of these images after training on noisy training images (see sec. V-A). c) Feeding several hundred noisy samples of four neighboring agent positions (2,3), (2,4), (3,3), (3,4) to an encoder that was trained only on the noise-free observations produces feature vectors that are spread out over about a quarter of the feature space, forming fraying, overlapping clusters.

A. Training of the auto-encoder

Before letting the agent explore the world on its own, we examined the resulting feature spaces in several experiments under controlled, perfect sampling conditions. We sampled a total of 6200 noisy images, the agent's position evenly distributed throughout the maze (3100 images for training, 3100 for testing). An auto-encoder of the same topology as in section IV-C achieved an average reconstruction error (RE) of 7.3 per testing image (down from 17 after pre-training). Some exemplary observations and their reconstructions can be seen in figure 6, columns a and b.

More interesting than the reconstruction error is the quality of the learned feature space. Within DL, there seem to be two established methods for assessing feature spaces: first, a visual analysis, and second, training classifiers on the feature vectors [4], [11]. We used both methods, keeping the four criteria mentioned above in mind.

B. Visual analysis of the feature space

The encoder part of the auto-encoders has been used to map all testing images from the observation space to the two-dimensional feature space spanned by the encoder network's output neurons (fig. 5). We have used the same color and symbol to mark feature vectors residing in the same tile, arbitrarily superimposing a 6 × 6 tiling on the grid-world. Of course, these labels are not available during training but are added later for visualization purposes only. As can be seen in the first plot, the pre-training realized a good spreading of the data in the feature space but did not achieve a good separation. With the progress of the finetuning, structure becomes more and more visible. After 400 epochs, images that show the agent in the same cell of the maze form clearly visible clusters in the feature space (criteria b, c).

Fig. 5. Evolution of the feature space during finetuning of the deep auto-encoder (panels: after pretraining and after 20, 40, 400 and 1100 epochs). Feature vectors of testing images depicting the agent in the same tile of an arbitrary 6 × 6 tiling of the grid world are marked using the same color and symbol (used for visualization purposes only). During the unsupervised training, the feature space is 'unfolded', gradually improving the ordering of neighboring observations towards a near-perfect result after 400 epochs.

C. Classification of feature vectors

In order to further test the usefulness of the derived feature space, we did a supervised-learning experiment. A second neural net ("task net") with two hidden layers was trained on class labels corresponding to the previously introduced 6 × 6 tiling, using the feature vectors produced by the deep encoder as inputs. The optimal network size was determined in a series of preliminary experiments. A reasonable number of hidden-layer neurons was needed to achieve good results. Using 2-25-25-2 neurons in the task net, and training the output layer on the coordinates of the center of the corresponding tile, the trained net classified 80.10% of the testing patterns correctly, where a classification is counted as correct when the output of the task net is closer to the center of the correct tile than to the center of any other tile.
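The nearest-tile-centre criterion can be written compactly as follows (a numpy sketch; the stand-in task-net outputs are random and purely illustrative):

    import numpy as np

    # Centres of the 6x6 tiling of the state space [0,6)^2.
    centers = np.array([(i + 0.5, j + 0.5) for i in range(6) for j in range(6)])

    def classification_rate(pred_xy, true_tile):
        """Fraction of predictions closer to the centre of the correct tile
        than to the centre of any other tile."""
        d = np.linalg.norm(pred_xy[:, None, :] - centers[None, :, :], axis=2)
        return float(np.mean(d.argmin(axis=1) == true_tile))

    # Example with stand-in task-net outputs:
    rng = np.random.default_rng(0)
    true_tile = rng.integers(36, size=500)
    pred_xy = centers[true_tile] + rng.normal(0, 0.3, size=(500, 2))
    print(classification_rate(pred_xy, true_tile))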
As can be seen in table I (column CR Fixed), the quality of the feature space depends heavily on the number and distribution of samples. Experiments with 1550 and 3100 samples used 'perfect sampling', evenly distributing the observations among the possible agent positions; experiments with fewer samples selected the samples at random.

TABLE I
AVERAGE SQUARED RECONSTRUCTION ERROR (REC) AND CLASSIFICATION RATES (CR) ON TRAINING SETS OF DIFFERENT SIZE.

Space       REC   CR Fixed   CR Adaptive
DL (…)       …     27.31%     40.86%
DL (775)     …     59.61%     80.87%
DL (1550)    …     61.42%     91.59%
DL (3100)    …     80.10%     99.46%
PCA 02       …     39.58%     –
PCA 04       …     70.80%     –
PCA 10       …     88.29%     –

D. Comparison with Principal Component Analysis (PCA)

For comparison, we did the same experiments using a PCA on the exact same set of images. The first n principal components (PC) and a scatter plot of the first two components are depicted in fig. 7. Basic PCA clearly fails to construct a useful, compact representation, as the first two principal components only explain 6% of the overall variance.
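Such a PCA baseline can be sketched with scikit-learn as follows, assuming a matrix of flattened observations; the random stand-in data is illustrative only.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).random((3100, 900))   # stand-in image matrix

    pca = PCA(n_components=10).fit(X)
    print(pca.explained_variance_ratio_[:2].sum())     # variance in first 2 PCs
    Z2 = pca.transform(X)[:, :2]                       # "PCA 2" feature vectors
    X_rec = pca.inverse_transform(pca.transform(X))    # 10-PC reconstructions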
Training a classifier on the supervised learning task using the first 2 (PCA 2) and 4 (PCA 4) principal components led to unsatisfying results (see tab. I). Even more PCs are needed to allow for a better classification than with the two dimensions found by DL (PCA 10).

Fig. 7. Eigenimages of some of the principal components found by a PCA (top row) and reconstruction of the original (orig) image when using the n first PCs. DL needs only 2 dimensions for producing a reconstruction that is clearly more accurate than those produced by the first 10 PCs.

E. Backpropagating error terms

In order to test whether the feature space constructed by the auto-encoders could be further improved by letting the encoding adapt to a specific task, we attached a small 'task network' with only three hidden-layer neurons directly to the output neurons of the original encoder, as can be seen in the top row of figure 9. The key was to use only three hidden-layer neurons, giving the task net some expressive power, but by far not enough to learn a good mapping from the initial feature space. During supervised training, the error was backpropagated through the whole net, including the encoder, and was used to also slowly adapt the weights of the encoder, thus adapting the 'preprocessing' itself with respect to the needs of the classification task at hand. This combined net achieved an impressive classification rate of 99.46% on the testing images (left column of fig. 9).

Fig. 8. Improvement of an encoder that has been derived from an imperfectly trained auto-encoder, after training on the supervised x-y-coordinates task. Distribution in the feature space at the beginning of the training (left) and after training (middle). Topology after training (right).

Even more astonishing is the quality of the improvement of the topology in the encoder part of the combined net. Whereas with fixed weights we got mixed results regarding the preservation of the topology—there is some preservation of the neighborhood relations between the cells of the maze, but there are several foldings and deformations in the feature space—letting the encoder's weights adapt to the task at hand helped to significantly improve the overall organization of the feature space. The right column of figure 9 depicts the gradual improvements of the topology in the third layer counted from the back of the combined network, that is, the encoder's output layer. The final result is a near-perfect preservation of the original topology with only few distortions and no crossings. Moreover, this technique could be used to improve a really weak initial encoding enough to be useful in the RL task. For example, an encoder that was only shortly trained on 775 non-evenly distributed samples (RE: 12.0) could be improved from a classification rate (CR) of 59.61% with fixed encoder weights to a CR of 80.87% (fig. 8); a minimal sketch of this joint fine-tuning follows after fig. 9.

Fig. 9. Improvement of the topology during supervised learning. Top: Arbitrary grid structure and its initial mapping to the feature space. Middle: Topology of the feature space (right column) and absolute differences between targets and outputs on the testing patterns (left column). The gray rectangular region marks the error that would not lead to a misclassification. Bottom: Errors on the training images (left) and the final distribution of the testing images in the feature space (right).
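A minimal PyTorch sketch of this joint fine-tuning, with a stand-in single-layer encoder in place of the pretrained deep encoder; the sizes, data and epoch count are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Stand-in pretrained encoder (900 -> 2); in the experiment this is the
    # encoder half of the deep auto-encoder.
    encoder = nn.Sequential(nn.Linear(900, 2), nn.Sigmoid())
    # Small task net with only three hidden neurons, as described above.
    task_net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 2))
    combined = nn.Sequential(encoder, task_net)

    obs = torch.rand(256, 900)          # stand-in observations
    targets = torch.rand(256, 2) * 6    # stand-in tile-centre coordinates

    opt = torch.optim.Rprop(combined.parameters())  # adapts the encoder too
    for _ in range(200):                            # epoch count illustrative
        opt.zero_grad()
        loss = nn.functional.mse_loss(combined(obs), targets)
        loss.backward()                 # the error flows back into the encoder
        opt.step()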
F. Training on the optimal value function

In analogy to how the encoder will be used within DFQ, we trained a c