Target Propagation - UH



Under review as a conference paper at ICLR 2015

TARGET PROPAGATION

Dong-Hyun Lee (1), Saizheng Zhang (1), Antoine Biard (1,2) & Yoshua Bengio (1,3)
(1) Université de Montréal, (2) Ecole polytechnique, (3) CIFAR Senior Fellow

arXiv:1412.7525v1 [cs.LG] 23 Dec 2014

ABSTRACT

Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of non-linearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role as backprop. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders is very effective to make target propagation actually work, along with adaptive learning rates.

1 INTRODUCTION

Recently, deep neural networks have achieved great success in hard AI tasks (Bengio, 2009; Hinton et al., 2012; Krizhevsky et al., 2012; Sutskever et al., 2014), mostly relying on back-propagation as the main way of performing credit assignment over the different sets of parameters associated with each layer of a deep net.
Back-propagation exploits the chain rule of derivatives in order to convert a loss gradient on the activations over layer $l$ (or time $t$, for recurrent nets) into a loss gradient on the activations over layer $l-1$ (respectively, time $t-1$). However, as we consider deeper networks (e.g., the recent best ImageNet competition entrants (Szegedy et al., 2014) with 19 or 22 layers), longer-term dependencies, or stronger non-linearities, the composition of many non-linear operations becomes more strongly non-linear. To make this concrete, consider the composition of many hyperbolic tangent units. In general, this means that derivatives obtained by backprop become either very small (most of the time) or very large (in a few places). In the extreme (very deep computations), one would get discrete functions, whose derivatives are 0 almost everywhere, and infinite where the function changes discretely. Clearly, back-propagation would fail in that regime. In addition, from the point of view of low-energy hardware implementation, the ability to train deep networks whose units only communicate via bits would also be interesting.

This limitation of backprop to working with precise derivatives and smooth networks is the main machine learning motivation for this paper's exploration into an alternative principle for credit assignment in deep networks.
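The remark about composing many hyperbolic tangent units can be checked numerically. The following is only an illustrative sketch, not part of the paper's experiments: the helper name `composed_tanh_grad`, the shared scalar weight `w`, and the chosen depths are assumptions made for this example. It multiplies the chain-rule factors of nested scalar tanh units and shows a derivative that is tiny at most inputs and very large at the single steepest point.

```python
import numpy as np

def composed_tanh_grad(x, depth, w=1.0):
    """Derivative w.r.t. x of `depth` nested scalar units h <- tanh(w * h)."""
    h, grad = float(x), 1.0
    for _ in range(depth):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)  # chain-rule factor contributed by one unit
    return float(grad)

# Unit weights: every factor is below 1, so depth shrinks the derivative.
shallow = composed_tanh_grad(0.5, depth=2)
deep = composed_tanh_grad(0.5, depth=50)

# Larger weights (w = 3, an arbitrary choice): the derivative explodes at
# x = 0 and vanishes once the units saturate away from it.
at_zero = composed_tanh_grad(0.0, depth=10, w=3.0)
off_zero = composed_tanh_grad(0.5, depth=10, w=3.0)

print(shallow, deep, at_zero, off_zero)
```

With larger weights the composed function approaches a step, which is the discrete regime described above.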
Another motivation arises from the lack of biological plausibility of back-propagation, for the following reasons: (1) the back-propagation computation is purely linear, whereas biological neurons interleave linear and non-linear operations, (2) if the feedback paths were used to propagate credit assignment by backprop, they would need precise knowledge of the derivatives of the non-linearities at the operating point used in the corresponding feedforward computation, (3) similarly, these feedback paths would have to use exact symmetric weights (with the same connectivity, transposed) of the feedforward connections, (4) real neurons communicate by (possibly stochastic) binary values (spikes), (5) the computation would have to be precisely clocked to alternate between feedforward and back-propagation phases, and (6) it is not clear where the output targets would come from.

The main idea of target propagation is to associate with each feedforward unit's activation value a target value rather than a loss gradient. The target value is meant to be close to the activation value while being likely to have provided a smaller loss (if that value had been obtained in the feedforward phase). In the limit where the target is very close to the feedforward value, target propagation should behave like back-propagation. This link was nicely made in (LeCun, 1986; 1987), which introduced the idea of target propagation and connected it to back-propagation via a Lagrange multipliers formulation (where the constraints require the output of one layer to equal the input of the next layer). A similar idea was recently proposed where the constraints are relaxed into penalties, yielding a different (iterative) way to optimize deep networks (Carreira-Perpinan and Wang, 2014). Once a good target is computed, a layer-local training criterion can be defined to update each layer separately, e.g., via the delta-rule (gradient descent update with respect to the cross-entropy loss).

By its nature, target propagation can in principle handle stronger (and even discrete) non-linearities, and it deals with biological plausibility issues 1, 2, 3 and 4 described above. Extensions of the precise scheme proposed here could handle 5 and 6, but this is left for future work.

In this paper, we provide several experimental results on rather deep neural networks as well as discrete and stochastic networks.
The results show that the proposed form of target propagation is comparable to back-propagation with RMSprop (Tieleman and Hinton, 2012), a very popular setting for training deep networks nowadays.

2 PROPOSED TARGET PROPAGATION IMPLEMENTATION

Although many variants of the general principle of target propagation can be devised, this paper focuses on a specific approach, described below, which fixes a problem in the formulation introduced in an earlier technical report (Bengio, 2014).

2.1 FORMULATING TARGETS

Let us consider an ordinary deep network learning process. The unknown data distribution is $p(x, y)$, from which the training data is sampled. The network structure is defined as

$$h_i = f_i(h_{i-1}) = s_i(W_i h_{i-1}), \quad i = 1, \dots, M \tag{1}$$

where $h_i$ is the $i$-th hidden layer, $h_M$ is the output of the network, $h_0$ is the input $x$, $s_i$ is the non-linearity (e.g. tanh or sigmoid), $W_i$ holds the weights of layer $i$, and $f_i$ is the $i$-th layer feed-forward mapping. For simplicity (and with some abuse of notation), the bias term of each layer is included in $W_i$. We define $\theta_W^{i,j}$ as the subset of network parameters $\theta_W^{i,j} = \{W_k,\ k = i+1, \dots, j\}$. With this notation, each $h_j$ is a function of $h_i$, where $h_j = h_j(h_i; \theta_W^{i,j})$ for $0 \le i < j \le M$. We define the global loss function for one sample $(x, y)$ as $L(x, y; \theta_W^{0,M})$, where

$$L(x, y; \theta_W^{0,M}) = \mathrm{loss}(h_M(x; \theta_W^{0,M}), y) = \mathrm{loss}(h_M(h_i(x; \theta_W^{0,i}); \theta_W^{i,M}), y), \quad i = 1, \dots, M-1 \tag{2}$$

Here $\mathrm{loss}(\cdot)$ can be any kind of loss measure function (e.g. MSE, binomial cross-entropy). Then the expected loss function over the whole data distribution $p(x, y)$ is written

$$L = E_p\{L(x, y; \theta_W^{0,M})\}. \tag{3}$$

Training a network with back-propagation corresponds to propagating error signals through the network, signals which indicate how the unit activations or parameters of the network could be updated to decrease the expected loss $L$. In very deep networks with strong non-linearities, error propagation could become useless in lower layers due to the difficulties associated with strong non-linearities, e.g. exploding or vanishing gradients, as explained above. Given a data sample $(x, y)$ and the corresponding activations of the hidden layers $h_i(x; \theta_W^{0,i})$, a possible solution to avoid these issues could be to assign a nearby value $\hat{h}_i$ for each $h_i(x; \theta_W^{0,i})$ that could lead to a lower global loss. For a sample $(x, y)$, we name such a value $\hat{h}_i$ a target, with the objective that

$$\mathrm{loss}(h_M(\hat{h}_i; \theta_W^{i,M}), y) < \mathrm{loss}(h_M(h_i(x; \theta_W^{0,i}); \theta_W^{i,M}), y) \tag{4}$$
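To make the notation of Eqs. 1, 2 and 4 concrete, here is a minimal NumPy sketch under stated assumptions: the layer sizes, the tanh non-linearity at every layer, the MSE loss, and the helper names `forward_from`, `loss_from` and `is_valid_target` are all choices made for this illustration, not the paper's code. It evaluates the global loss as a function of an intermediate layer value and tests whether a candidate value satisfies the target condition of Eq. 4.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]                       # h_0 (input), h_1, h_2, h_3 = h_M
Ws = [rng.normal(scale=0.5, size=(n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]   # Ws[k] plays W_{k+1}

def forward_from(i, h_i):
    """Apply layers i+1..M to h_i, i.e. h_M(h_i; theta_W^{i,M}) in Eq. 2."""
    h = h_i
    for W in Ws[i:]:
        h = np.tanh(W @ h)                 # Eq. 1: h_k = s_k(W_k h_{k-1})
    return h

def loss_from(i, h_i, y):
    """MSE loss obtained when layer i takes the value h_i."""
    return float(np.sum((forward_from(i, h_i) - y) ** 2))

x = rng.normal(size=sizes[0])
y = np.array([0.5, -0.5, 0.0])

# Feedforward activations h_0..h_M.
hs = [x]
for W in Ws:
    hs.append(np.tanh(W @ hs[-1]))

# Eq. 2: the global loss is the same whichever layer we restart from.
global_loss = loss_from(0, x, y)
per_layer = [loss_from(i, hs[i], y) for i in range(len(hs))]

def is_valid_target(i, h_hat, y):
    """Eq. 4: a target must give a strictly lower loss than the activation."""
    return loss_from(i, h_hat, y) < loss_from(i, hs[i], y)

# A value halfway between the output and the label is a valid output target.
better_output = hs[-1] + 0.5 * (y - hs[-1])
print(global_loss, is_valid_target(len(Ws), better_output, y))
```

Restarting the forward pass from any layer reproduces the same global loss, which is what allows a target to be judged at layer $i$ alone.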

At layer $i$, we hope to train the network to make $h_i$ move towards $\hat{h}_i$. As $h_i$ approaches $\hat{h}_i$, if the path leading from $h_i$ to $\hat{h}_i$ is smooth enough, we expect that the global loss $L(x, y; \theta_W^{0,M})$ would then decrease. To update the $W_i$, instead of using the error signals propagated from the global loss $L(x, y; \theta_W^{0,M})$ with back-propagation, we define a layer-local target loss $L_i$. For example, using an MSE loss gives:

$$L_i(\hat{h}_i, h_i) = \|\hat{h}_i - h_i(x; \theta_W^{0,i})\|_2^2. \tag{5}$$

In such a case, $W_i$ is updated locally within its layer via stochastic gradient descent, where $\hat{h}_i$ is considered as a constant with respect to $W_i$:

$$W_i^{(t+1)} = W_i^{(t)} - \eta_{f_i} \frac{\partial L_i(\hat{h}_i, h_i)}{\partial W_i} = W_i^{(t)} - \eta_{f_i} \frac{\partial L_i(\hat{h}_i, h_i)}{\partial h_i} \frac{\partial h_i(x; \theta_W^{0,i})}{\partial W_i}. \tag{6}$$

In this context, derivatives can be used within a local layer because they typically correspond to computation performed inside each neuron. The severe non-linearity that may originate from the chain rule arises mostly when it is applied through many layers. This motivates target propagation methods to serve as alternative credit assignment in the context of a composition of many non-linearities. What a target propagation method requires is a way to compute the target $\hat{h}_{i-1}$ from the higher-level target $\hat{h}_i$ and from $h_i$, such that it is likely to respect the constraint defined by Eq. 4 and at least satisfies weaker assumptions, for example:

$$L_i(\hat{h}_i, f_i(\hat{h}_{i-1})) < L_i(\hat{h}_i, f_i(h_{i-1})) \tag{7}$$

2.2 HOW TO ASSIGN A PROPER TARGET TO EACH LAYER

The problem of credit assignment is the following: how should each unit change its output so as to increase the likelihood of reducing the global loss?

With the back-propagation algorithm, we compute the gradient of the loss with respect to the output of each layer, and we can interpret that gradient as an error signal. That error signal is propagated recursively from the top layer to the bottom layer using the chain rule:

$$\delta h_{i-1} = \frac{\partial L}{\partial h_{i-1}} = \frac{\partial L}{\partial h_i} \frac{\partial h_i}{\partial h_{i-1}} = \delta h_i \frac{\partial h_i}{\partial h_{i-1}} \tag{8}$$

In the target-prop setting, the signal that gives the direction for the update is the difference $\hat{h} - h$. So we can rewrite the first and the last terms of the previous equation and we get:

$$\hat{h}_{i-1} - h_{i-1} = (\hat{h}_i - h_i) \frac{\partial h_i}{\partial h_{i-1}} = -\frac{1}{2} \frac{\partial \|\hat{h}_i - h_i\|_2^2}{\partial h_{i-1}} \tag{9}$$

Still in the target-prop framework, the parameter update at a specific layer is obtained by a stochastic gradient descent (SGD) step to minimize the layer-wise cost and can be written:

$$W_i^{(t+1)} = W_i^{(t)} - \eta \frac{\partial \|\hat{h}_i - h_i\|_2^2}{\partial W_i} \tag{10}$$

With back-propagation to compute the gradients at each layer, we can consider that the target of a lower layer is computed from the target of an upper layer as if gradient descent had been applied (non-parametrically) to the layer's activations, such that $L_i(\hat{h}_i, f_i(\hat{h}_{i-1})) < L_i(\hat{h}_i, f_i(h_{i-1}))$. This could be called "target propagation through optimization" and is reminiscent of (Carreira-Perpinan and Wang, 2014).

However, in order to avoid the chain of derivatives through many layers, another option, introduced in (Bengio, 2014), is to take advantage of an "approximate inverse". For example, suppose that we have a function $g_i$ such that

$$f_i(g_i(\hat{h}_i)) \approx \hat{h}_i, \tag{11}$$

then choosing $\hat{h}_{i-1} = g_i(\hat{h}_i)$ would have the consequence that the level $i$ loss $L_i$ (to make the output match the target at level $i$) would be minimized. This is the vanilla target propagation introduced in (Bengio, 2014):

$$\hat{h}_{i-1} = g_i(\hat{h}_i) \tag{12}$$
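The layer-local update of Eqs. 5 and 6 and the vanilla target of Eq. 12 can be sketched for a single tanh layer as follows. This is a minimal illustration under assumptions that are not in the paper: a square weight matrix (so that an exact inverse exists and can stand in for $g_i$), a hand-made target `h_hat`, a fixed learning rate, and the hypothetical helper names `local_loss`, `local_step` and `g_exact`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))       # square, so an exact inverse exists
h_prev = rng.normal(size=4)                   # h_{i-1}
h = np.tanh(W @ h_prev)                       # h_i = f_i(h_{i-1})
# Hypothetical target near h_i, kept inside (-1, 1) so arctanh is defined.
h_hat = np.clip(h + 0.1 * rng.normal(size=4), -0.99, 0.99)

def local_loss(W, h_prev, h_hat):
    """Eq. 5: squared distance between the target and the layer output."""
    return float(np.sum((h_hat - np.tanh(W @ h_prev)) ** 2))

def local_step(W, h_prev, h_hat, lr=0.01):
    """Eq. 6: one gradient step on the local loss, with h_hat held constant."""
    h = np.tanh(W @ h_prev)
    d_pre = -2.0 * (h_hat - h) * (1.0 - h ** 2)   # dL_i / d(W h_prev)
    return W - lr * np.outer(d_pre, h_prev)

W_new = local_step(W, h_prev, h_hat)
print(local_loss(W, h_prev, h_hat), local_loss(W_new, h_prev, h_hat))

def g_exact(h_hat, W):
    """Exact inverse of f(h) = tanh(W h); a stand-in for g_i in Eq. 11."""
    return np.linalg.solve(W, np.arctanh(h_hat))

h_prev_hat = g_exact(h_hat, W)                # Eq. 12: vanilla target for h_{i-1}
print(np.tanh(W @ h_prev_hat) - h_hat)        # approximately zero
```

Only derivatives inside one layer are used in `local_step`; no gradient is chained across layers.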

Note that $g_i$ does not need to invert $f_i$ everywhere, only in the vicinity of the targets. If the feedback mappings were the perfect inverses of the feed-forward mappings ($g_i = f_i^{-1}$), we would get directly

$$L_i(\hat{h}_i, f_i(\hat{h}_{i-1})) = L_i(\hat{h}_i, f_i(g_i(\hat{h}_i))) = L_i(\hat{h}_i, \hat{h}_i) = 0. \tag{13}$$

This would be ideal for target propagation. In fact, we have the following proposition for the case of a perfect inverse:

Proposition 1. Assume that $g_i$ is a perfect inverse of $f_i$, where $g_i = f_i^{-1}$, $i = 1, \dots, M-1$, and $f_i$ satisfies one of the following: 1. $f_i$ is a linear mapping, or 2. $h_i = f_i(h_{i-1}) = W_i s_i(h_{i-1})$, which is another way to obtain a non-linear deep network structure (here $s_i$ can be any differentiable monotonically increasing element-wise function). Consider one update for both target propagation and back-propagation, with the target propagation update (with perfect inverse) in the $i$-th layer being $\delta W_i^{tp}$, and the back-propagation update being $\delta W_i^{bp}$. Then the angle $\alpha_i$ between $\delta W_i^{tp}$ and $\delta W_i^{bp}$ is bounded by

$$0 < \alpha_i < \cos^{-1}\left(\frac{\lambda_{min}}{\lambda_{max}}\right) \tag{14}$$

Here $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest singular values of $(J_{f_{M-1}} \cdots J_{f_{i+1}})^T$, where $J_{f_k}$ is the Jacobian matrix of $f_k$.

See the proof in Appendix A [1]. Proposition 1 says that if $f_i$ has the assumed structure, the descent direction of target propagation with a perfect inverse at least partly matches the gradient descent direction, which makes the global loss always decrease. But a perfect inverse may be impractical for computational reasons and unstable (there is no guarantee that $f_i^{-1}$ applied to a target would yield a value that is in the domain of $f_{i-1}$). So here we prefer to learn an approximate inverse $g_i$, making the $f_i$ / $g_i$ pair look like an auto-encoder. This suggests parametrizing $g_i$ as follows:

$$\hat{h}_{i-1} = g_i(\hat{h}_i) = s_i(V_i \hat{h}_i), \quad i = 0, \dots, M \tag{15}$$

where $s_i$ is a non-linearity associated with the decoder and $V_i$ the matrix of feedback weights for layer $i$.
With such a parametrization, it is unlikely that the auto-encoder will achieve zero reconstruction error. The decoder could be trained via an additional auto-encoder-like loss at each layer:

$$L_i^{inv} = \|f_i(g_i(\hat{h}_i)) - \hat{h}_i\|_2^2 \tag{16}$$

This makes $f_i(\hat{h}_{i-1})$ closer to $\hat{h}_i$, thus making $L_i(\hat{h}_i, f_i(\hat{h}_{i-1}))$ closer to zero. However, the inverse mapping should hold in a neighborhood of the targets, which helps when computing targets that have never been seen before. For this, we can modify the inverse loss using noise injection:

$$L_i^{inv} = \|f_i(g_i(\hat{h}_i + \epsilon)) - (\hat{h}_i + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma) \tag{17}$$

However, the imperfection of the inverse yields severe optimization problems, which has brought us to propose the following linearly corrected formula for target propagation:

$$\hat{h}_{i-1} = h_{i-1} + g_i(\hat{h}_i) - g_i(h_i) \tag{18}$$

We call this variant "difference target propagation" and we found in the experiments described below that it can significantly reduce the optimization problems associated with Eq. 12. Note that if $g_i$ was an inverse of $f_i$, then difference target propagation would be equivalent to the vanilla target propagation of Eq. 12. For difference target propagation, we have the following proposition:

Proposition 2. During the $(t+1)$-th update in difference target propagation, we use $L_i^{inv}(\hat{h}_i^{(t)} + \epsilon; V_i, W_i^{(t)})$ to update $V_i^{(t+1)}$, and we define $\bar{L}_i^{inv}(V_i, W_i^{(t)})$ as the expected loss function over all possible $\hat{h}_i^{(t)} + \epsilon$ with $W_i^{(t)}$ fixed,

$$\bar{L}_i^{inv}(V_i, W_i^{(t)}) = E_{\hat{h}_i^{(t)}, \epsilon}\{L_i^{inv}(\hat{h}_i^{(t)} + \epsilon; V_i, W_i^{(t)})\} \tag{19}$$

If 1. $\bar{L}_i^{inv}(V_i, W_i^{(t)})$ has only one minimum with optimal $V_i^*(W_i^{(t)})$; 2. proper learning rates for $V_i$ and $W_i$ are given; 3. all the Jacobian and Hessian like matrices are bounded during learning; 4. $-\nabla_{V_i} \bar{L}_i^{inv}(V_i, W_i^{(t)})$ always points towards optimal $V_i^*(W_i^{(t)})$; 5. $E\{V_i^*(W_i^{(t+1)}) - V_i^*(W_i^{(t)}) \mid W_i^{(t)}\} = 0$. Then $V_i^{(t)} - V_i^*(W_i^{(t)})$ will almost surely converge to 0 at the $t$-th update when $t$ goes to infinity. Conditions 1, 2 and 4 follow the settings of stochastic gradient descent convergence similar to (Bottou, 1998).

See the proof in Appendix [2]. Proposition 2 says that in difference target propagation, $g_i$ can learn a good approximation of $f_i$'s inverse, which will quickly minimize the auto-encoder-like error of each layer.

[1] In the arXiv version of this paper.

The top layer does not have a layer above it and it has its own loss function, which is also the global loss function. In our experiments we chose to set the first target of the target-prop chain such that $L(\hat{h}_{M-1}) < L(h_{M-1})$. This can be achieved for classification loss as follows:

$$\hat{h}_{M-1} = h_{M-1} - \eta_0 \frac{\partial L}{\partial h_{M-1}} \tag{20}$$

where $\eta_0$ is a "target-prop" learning rate for making the first target, i.e. one of the hyper-parameters. Making the first target at layer $M-1$, using the specific output and loss function, instead of at the output layer reduces the algorithm's dependence on the specific type of output and loss function. We can then apply a consistent formulation to compute targets in lower layers. Once we have a method to assign proper targets to each layer, we only have to optimize layer-local target losses to decrease the global loss function.

2.3 THE ADVANTAGE OF DIFFERENCE TARGET PROPAGATION

In order to make optimization stable in target propagation, $h_{i-1}$ should approach $\hat{h}_{i-1}$ as $h_i$ approaches $\hat{h}_i$. If not, even though optimization is finished in upper layers, the weights in lower layers would continue to be updated. As a result, the target losses in upper layers as well as the global loss can increase even after we reach the optimum situation.
So we found the following condition to greatly improve the stability of the optimization:

$$h_i = \hat{h}_i \Rightarrow h_{i-1} = \hat{h}_{i-1} \tag{21}$$

If we have the perfect inverse $g_i = f_i^{-1}$, it holds with vanilla target propagation because

$$h_{i-1} = f_i^{-1}(h_i) = g_i(\hat{h}_i) = \hat{h}_{i-1}. \tag{22}$$

Although it is not guaranteed with an imperfect inverse mapping $g_i \neq f_i^{-1}$ in vanilla target propagation, with difference target propagation it naturally holds by construction:

$$\hat{h}_{i-1} = h_{i-1} + g_i(\hat{h}_i) - g_i(h_i) \tag{23}$$

More precisely, we can show that when the input of a layer becomes the target of the lower layer computed by difference target propagation, the output of the layer moves toward the side of its target:

$$f_i(\hat{h}_{i-1}) = f_i(h_{i-1} + g_i(\hat{h}_i) - g_i(h_i)) \approx h_i + f_i'(h_{i-1}) g_i'(h_i)(\hat{h}_i - h_i) \tag{24}$$

$$(\hat{h}_i - h_i)^T (f_i(\hat{h}_{i-1}) - h_i) \approx (\hat{h}_i - h_i)^T f_i'(h_{i-1}) g_i'(h_i)(\hat{h}_i - h_i) > 0 \tag{25}$$

if $\hat{h}_i \approx h_i$ and $f_i'(h_{i-1}) g_i'(h_i) \approx (f_i(g_i(h_i)))'$ is positive definite. This is a far more flexible condition than perfect inverseness. Even when $g_i$ is a random mapping, this condition can be satisfied. Actually, if $f_i$ and $g_i$ are linear mappings and $g_i$ has a random matrix, difference target propagation is equivalent to feedback alignment (Lillicrap et al., 2014), which works well on many datasets. As a target framework, we can also show that the output of the layer gets closer to its target,

$$\|\hat{h}_i - f_i(\hat{h}_{i-1})\|_2^2 < \|\hat{h}_i - h_i\|_2^2 \tag{26}$$

if $\hat{h}_i \approx h_i$ and the maximum eigenvalue of $(I - f_i'(h_{i-1}) g_i'(h_i))^T (I - f_i'(h_{i-1}) g_i'(h_i))$ is less than 1, because $\hat{h}_i - f_i(\hat{h}_{i-1}) \approx [I - f_i'(h_{i-1}) g_i'(h_i)](\hat{h}_i - h_i)$. Moreover, as $g_i$ approaches $f_i^{-1}$, this approaches the vanilla target propagation formula in (Bengio, 2014):

$$g_i(h_i) \approx h_{i-1} \Rightarrow \hat{h}_{i-1} = h_{i-1} - g_i(h_i) + g_i(\hat{h}_i) \approx g_i(\hat{h}_i) \tag{27}$$

[2] In the arXiv version of this paper.
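The stability condition of Eq. 21 and the alignment of Eq. 25 can be illustrated numerically. The sketch below is a toy check under assumptions that are not part of the paper: random small weight matrices, a random (untrained) feedback matrix `V`, and, for the alignment check, a purely linear layer whose feedback matrix is taken to be the transpose of the forward matrix so that $f_i' g_i'$ is positive definite. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))   # feedforward weights of layer i
V = rng.normal(scale=0.5, size=(5, 3))   # random, imperfect feedback weights

def f(h_prev):
    return np.tanh(W @ h_prev)

def g(h):
    return np.tanh(V @ h)

def vanilla_target(h_hat):
    return g(h_hat)                       # Eq. 12

def difference_target(h_prev, h, h_hat):
    return h_prev + g(h_hat) - g(h)       # Eq. 18 / Eq. 23

h_prev = rng.normal(size=5)
h = f(h_prev)

# Eq. 21: when the upper layer already sits on its target (h_hat = h),
# the difference rule leaves the lower layer where it is; the vanilla rule
# with an imperfect inverse does not.
same_diff = difference_target(h_prev, h, h)
same_vanilla = vanilla_target(h)
print(np.abs(same_diff - h_prev).max(), np.abs(same_vanilla - h_prev).max())

# Eq. 25 in an exactly linear case: f(u) = A u, g(v) = A^T v (assumed here),
# so f'g' = A A^T and the layer output moves toward its target.
A = rng.normal(size=(3, 5))
h_lin = A @ h_prev
h_hat_lin = h_lin + 0.1 * rng.normal(size=3)
h_prev_hat = h_prev + A.T @ h_hat_lin - A.T @ h_lin
alignment = float((h_hat_lin - h_lin) @ (A @ h_prev_hat - h_lin))
print(alignment)
```

In the linear case the approximation in Eq. 24 is exact, so `alignment` equals the squared norm of `A.T @ (h_hat_lin - h_lin)`.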

3 EXPERIMENTS

3.1 VERY DEEP NETWORKS

As a primary objective, we investigated whether one can train ordinary deep networks on the MNIST dataset. The network has 7 hidden layers and the number of hidden units is 240. The activation function is the hyperbolic tangent (tanh). We use RMSprop as an adaptive learning rate algorithm because we do not have a global loss to optimize. Instead, we have the local layer-wise target losses that might need their learning rates to be on different scales (this is actually what we find when we do hyper-parameter optimization over the separate learning rates for each layer). To get this result, we chose the optimal hyper-parameters for the best training cost using random search. The weights are initialized with orthogonal random matrices.

To improve optimization results, layers are updated one at a time from the bottom layer to the top layer, thus avoiding issues with the current input of each layer being invalid if we update all layers at once.

As a baseline, back-propagation with RMSprop is used. The same weight initialization, adaptive learning rate and hyper-parameter search method are used as with target-prop. We report our results in Figure 1. We got a test error of 1.92% with target propagation and 1.88% with back-propagation, and a negative log-likelihood of $3.38 \times 10^{-6}$ with target propagation and $1.81 \times 10^{-5}$ with back-propagation. These results are averaged over 5 trials using the chosen hyper-parameters.

Figure 1: Training cost (left) and train/test classification error (right) with target-prop and backprop. Target propagation can converge to lower values of cost with similar generalization performance to backprop.

3.2 NETWORKS WITH DISCRETIZED TRANSMISSION BETWEEN UNITS

As an example of extremely non-linear networks, we investigated whether one can train even discrete networks on the MNIST dataset. The network architecture is 784-500-500-10 and only the 1st hidden layer is discretized.
Instead of just using the step activation function, we have normal neural layers with tanh, and signals are discretized when transported between layer 1 and layer 2, based on biological considerations and the objective of reducing the communication cost between neurons.

$$h_1 = f_1(x) = \tanh(W_1 x) \tag{28}$$

$$h_2 = f_2(h_1) = \tanh(W_2\, \mathrm{sign}(h_1)) \tag{29}$$

$$p(y \mid x) = f_3(h_2) = \mathrm{softmax}(W_3 h_2) \tag{30}$$

where $\mathrm{sign}(x) = 1$ if $x > 0$, and $0$ if $x \le 0$. We also use a feedback mapping with an inverse loss. But in this case, we cannot optimize the full auto-encoding loss because it is not differentiable. Instead, we can use only the reconstruction loss given the input and the output of the feed-forward mapping.

$$g_2(h_2) = \tanh(V_2\, \mathrm{sign}(h_2)) \tag{31}$$

$$L_2^{inv} = \|g_2(f_2(h_1 + \epsilon)) - (h_1 + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma) \tag{32}$$

If only the feed-forward mapping is discrete, we can train the network using back-propagation with a biased gradient estimator, as if we were training continuous networks with tanh. However, if training signals should also be discrete, it is very hard to train using back-propagation. So we compare our result to two backprop baselines. One baseline is to train the discrete networks directly, in which case we cannot train $W_1$ using backprop. It can still drive the training error to zero, but it cannot learn any meaningful representation on $h_1$, so the test error is poor, as seen in Figure 3 (left). The other baseline is to train continuous-activation networks with tanh and to test with the discrete networks (that is, indirect training). Although the estimated gradient is biased, so the training error does not converge to zero, generalization performance is fairly good, as seen in Figure 2 (right) and Figure 3 (left).

Figure 2: Training cost (left) and train error (right) while training discrete networks. (backprop disc) Because training signals cannot go across a discretization step, layer 1 cannot be trained by backprop. Though training cost is very low, it overfits, and test error is high. (backprop conti) An option is to use a biased gradient estimator when we train the network as if it were continuous, and test on the discretized version of the network. It is an indirect training, not overcoming the discreteness during training. Training error cannot approach zero due to the biased estimator. (diff target-prop) Target propagation can train discrete networks directly, so training error actually approaches zero. Moreover, test error is comparable to (backprop conti). It clearly suggests that using target-prop, training signals can go across a discretization step successfully.

Figure 3: Test error (left) and diagram of the discrete networks (right).
The output of $h_1$ is discretized because signals must be communicated from $h_1$ to $h_2$ through a long cable, so binary representations are preferred in order to conserve energy. Training signals are also discretized through this cable (since feedback paths are computed by bona-fide neurons), so it is very difficult to train the network directly. The test error of diff target-prop is comparable to (backprop conti) even though both feedforward signals and training signals are discretized.
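The discretized forward pass of Eqs. 28 to 30, and the reason back-propagation receives no signal for $W_1$, can be sketched as follows. This is a toy illustration under assumptions: the layer sizes are much smaller than the 784-500-500-10 network of the paper, the weights are random rather than trained, and the names `sign01`, `softmax` and `forward` are chosen for this example. A perturbation of $W_1$ that is too small to flip any sign leaves the output exactly unchanged, so the derivative of the output with respect to $W_1$ is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 6, 3              # toy sizes, not the paper's
W1 = rng.normal(size=(n_hid, n_in))
W2 = rng.normal(size=(n_hid, n_hid))
W3 = rng.normal(size=(n_out, n_hid))

def sign01(a):
    """sign(x) = 1 if x > 0, else 0, as used between layer 1 and layer 2."""
    return (a > 0).astype(float)

def softmax(a):
    e = np.exp(a - np.max(a))             # shift for numerical stability
    return e / e.sum()

def forward(x, W1, W2, W3):
    h1 = np.tanh(W1 @ x)                  # Eq. 28
    h2 = np.tanh(W2 @ sign01(h1))         # Eq. 29: only bits cross the cable
    return softmax(W3 @ h2)               # Eq. 30

x = rng.normal(size=n_in)
p = forward(x, W1, W2, W3)

# Perturb W1 by less than half the smallest pre-activation margin, so that
# no unit of layer 1 changes sign.
margin = np.min(np.abs(W1 @ x))
dW1 = rng.normal(size=W1.shape)
dW1 *= 0.5 * margin / (np.max(np.linalg.norm(dW1, axis=1)) * np.linalg.norm(x))
p_perturbed = forward(x, W1 + dW1, W2, W3)

print(p, np.array_equal(p, p_perturbed))
```

Target propagation sidesteps this by giving layer 1 a target and a layer-local loss instead of a derivative through the sign function.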

However, with target propagation, because we can learn an inverse mapping with a discrete layer and we do not use derivatives through layers, we can successfully train discrete networks directly. Though training convergence is slower, training error approaches zero, unlike the biased gradient estimator with backprop and continuous networks. The remarkable thing is that test error is comparable to the biased gradient estimator with backprop and continuous networks. We can train $W_1$ properly, that is, training signals can go across the discrete region successfully. Of course, as shown in the figure, the generalization performance is much better than the vanilla backprop baseline.

3.3 STOCHASTIC NETWORKS

Another interesting learning problem which backprop cannot deal with well is stochastic networks with discrete units. Recently such networks have attracted attention (Bengio, 2013; Tang and Salakhutdinov, 2013; Bengio et al., 2013) because a stochastic network can learn a multi-modal conditional distribution $P(Y \mid X)$, which is important for structured output predictions. Training networks of stochastic binary units is also motivated from biology, i.e., they resemble networks of spiking neurons. Here, we investigate whether one can train networks of stochastic binary units on MNIST for classification using target propagation. Following Raiko et al. (2014), the network architecture is 784-200-200-10 and the hidden units are stochastic binary units with the probability of turning on given by a sigmoid activation.

$$h_i^p = \sigma(W_i h_{i-1}), \quad h_i = \mathrm{sample}(h_i^p) \tag{33}$$

where $\mathrm{sample}(p)$ is a binary random variable which is 1 with probability $p$.

As a baseline, we consider a biased gradient estimator in which we do back-propagation as if it were just continuous sigmoid networks. This baseline showed the best performance in Raiko et al. (2014).

$$\delta h_{i-1}^p = \delta h_i^p \frac{\partial h_i^p}{\partial h_{i-1}^p} \approx \sigma'(W_i h_{i-1}) W_i^T \delta h_i^p \tag{34}$$

In target propagation, we can train this network directly.
$$\hat{h}_2^p = h_2^p - \eta \frac{\partial L}{\partial h_2}, \quad \hat{h}_1^p = h_1^p + g_2(\hat{h}_2^p) - g_2(h_2^p) \tag{35}$$

$$g_i(h_i^p) = \tanh(V_i h_i^p), \quad L_i^{inv} = \|g_i(f_i(h_{i-1} + \epsilon)) - (h_{i-1} + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma) \tag{36}$$

Using layer-local target losses $L_i = \|\hat{h}_i^p - h_i^p\|_2^2$, we can update all the weights.

We obtained a test error of 1.51% using target propagation and 1.71% using the baseline method. In the evaluation, we averaged the output probabilities of an example over 100 noise samples, and then classified the example accordingly, following Raiko et al. (2014). This suggests that target propagation can directly deal with networks of binary stochastic units.

Method                                                      Test Error (%)
Difference Target-Propagation, M = 1                        1.51%
Biased gradient estimator like backprop
  (followed by Raiko and Berglund, 2014, M = 1)             1.71%
Tang and Salakhutdinov, 2013, M = 20                        3.99%
Raiko and Berglund, 2014, M = 20                            1.63%

Table 1: Test error on MNIST with stochastic networks. The first row shows the results in our experiments. These are averaged results over 5 trials using the same hyper-parameter combination, which is chosen for the best validation error. The second row shows the results from (Raiko et al., 2014). In our experiment, we used RMSprop and the maximum number of epochs is 1000, which differs from (Raiko et al., 2014). M is the number of samples used when computing the output probability. We use M = 100 at test time.

3.4 BACKPROP-FREE AUTO-ENCODER

Auto-encoders are interesting building blocks for learning representations, especially deep ones (Erhan et al., 2010). In addition, as we have seen, training an auto-encoder is also part of what is required for target propagation according to the approach presented here, in order to train the feedback paths that propagate the targets. We show here how a regularized auto-encoder can be trained using difference target propagation, without backprop.

Like in the work on denoising auto-encoders (Vincent et al., 2010) and Generative Stochastic Networks (Bengio et al., 2014), we consider the denoising auto-encoder as a stochastic network with noise injected in input and hidden units, trained to minimize a reconstruction loss.

$$h = f(x) = \mathrm{sigm}(Wx + b) \tag{37}$$

$$z = g(h) = \mathrm{sigm}(W^T(h + \epsilon) + c), \quad \epsilon \sim N(0, \sigma) \tag{38}$$

$$L = \|z - x\|_2^2 + \|f(x + \epsilon) - h\|_2^2, \quad \epsilon \sim N(0, \sigma) \tag{39}$$

where we also use regularization to obtain contractive mappings. In order to train this network without backprop (that is, without the chain rule), we can use difference target propagation. At first, the target of $z$ is just $x$, so we can train the reconstruction mapping $g$ with $L_g = \|g(h) - x\|_2^2$, in which $h$ is considered as a constant. Then we compute the target $\hat{h}$ of the hidden units following difference target propagation:

$$\hat{h} = h + f(\hat{z}) - f(z) = 2h - f(z) \tag{40}$$

where $f$ is used as an inverse mapping of $g$ without additional functions, and $f(\hat{z}) = f(x) = h$. As a target loss for the hidden layer, we can use $L_f = \|f(x + \epsilon) - \hat{h}\|_2^2$, in which regularization for contractive mapping is also incorporated and $\hat{h}$ is considered as a constant. Using layer-local target losses $L_f$ and $L_g$, we train on MNIST a denoising auto-encoder whose architecture is 784-1000-784. Stroke-like filters can be obtained (see Figure 4) and after supervised fine-tuning (using backprop), we get 1.35% test error. That is, our auto-encoder can train an initial representation as good as the one obtained by regularized auto-encoders trained by backprop on the reconstruction error.

Figure 4: Diagram of the evaluated backprop-free auto-encoder (left) and its trained filters, i.e., layer 1 weights (right).
Even though we train the networks using only layer-local target losses instead of a global loss (reconstruction error), we obtain stroke filters, similar to those usually obtained by regularized auto-encoders. Moreover, we can pre-train good hidden representations for initializing a classifier, which achieved a test error of 1.35% (after fine-tuning the whole net

a target value rather than a loss gradient. The target value is meant to be close to the activation value while being likely to have provided a smaller loss (if that value had been obtained in the feedforward phase). In the limit where the target is very close to the feedforward value, target

Related Documents:

The Generalized Weapon Target Assignment Problem

Therefore, target 1 has three target drops, i.e., target 1-A-1, target 1-B-1 and target 1-C-2. In this manner we can enumerate all possible target drops from target information. From source and target information we can set all possible assignments, and each of them is composed of a source and sequence of target drops, called a target drop set .

377 Views

3y ago


