Tutorial: Learning Deep Architectures


Yoshua Bengio, U. Montreal
Yann LeCun, NYU
ICML Workshop on Learning Feature Hierarchies, June 18th, 2009, Montreal

Deep Motivations
- Brains have a deep architecture
- Humans organize their ideas hierarchically, through composition of simpler ideas
- Insufficiently deep architectures can be exponentially inefficient
- Distributed (possibly sparse) representations are necessary to achieve non-local generalization
- Intermediate representations allow sharing statistical strength

Deep Architecture in the Brain
- Area V4: higher level visual abstractions
- Area V2: primitive shape detectors
- Area V1: edge detectors
- Retina: pixels

Deep Architecture in our Mind
- Humans organize their ideas and concepts hierarchically
- Humans first learn simpler concepts and then compose them to represent more abstract ones
- Engineers break up solutions into multiple levels of abstraction and processing

Architecture Depth
[Figure: two example architectures, one of depth 4 and one of depth 3]

Good News, Bad News
Theoretical arguments: deep architectures can be exponentially more compact than shallow ones
- 2 layers of logic gates, formal neurons, or RBF units = universal approximator
- Theorems for all 3 (Hastad et al 86 & 91, Bengio et al 2007): functions representable compactly with k layers may require exponential size with k-1 layers
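One way to make the depth/size trade-off concrete is n-bit parity with logic gates. This example is an illustration chosen here, not one taken from the slide: a chain of two-input XOR gates computes parity with n-1 gates, while the 2-layer OR-of-ANDs form needs one AND term per odd-weight input, which is 2^(n-1) terms. The function names are hypothetical.

```python
from itertools import product

def parity_deep(bits):
    """Parity via a chain of n-1 two-input XOR gates (depth grows with n)."""
    acc = bits[0]
    for b in bits[1:]:
        acc ^= b
    return acc

def parity_dnf_terms(n):
    """AND terms of the 2-layer (OR of ANDs) form: one per odd-weight input."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]

def parity_shallow(bits, terms):
    """Evaluate the 2-layer form: output 1 iff the input matches some AND term."""
    return int(any(tuple(bits) == term for term in terms))

for n in (2, 4, 8):
    print(n, "XOR gates:", n - 1, "AND terms:", len(parity_dnf_terms(n)))
```

Both forms compute the same function; only their size differs, and the shallow form doubles with every extra input bit.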

The Deep Breakthrough
- Before 2006, training deep architectures was unsuccessful, except for convolutional neural nets
- Hinton, Osindero & Teh « A Fast Learning Algorithm for Deep Belief Nets », Neural Computation, 2006
- Bengio, Lamblin, Popovici, Larochelle « Greedy Layer-Wise Training of Deep Networks », NIPS’2006
- Ranzato, Poultney, Chopra, LeCun « Efficient Learning of Sparse Representations with an Energy-Based Model », NIPS’2006

Greedy Layer-Wise Pre-Training
Stacking Restricted Boltzmann Machines (RBM) → Deep Belief Network (DBN)

Stacking Auto-Encoders

Greedy Layerwise Supervised Training
Generally worse than unsupervised pre-training but better than ordinary training of a deep neural network (Bengio et al. 2007).

Supervised Fine-Tuning is Important
- Greedy layer-wise unsupervised pre-training phase with RBMs or auto-encoders on MNIST
- Supervised phase with or without unsupervised updates, with or without fine-tuning of hidden layers

Denoising Auto-Encoder
- Corrupt the input
- Reconstruct the uncorrupted input
[Figure: raw input → corrupted input → hidden code (representation) → reconstruction, trained with KL(reconstruction | raw input)]
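The corrupt-then-reconstruct criterion can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions not stated on the slide: masking noise (each input zeroed with probability `corruption`), sigmoid units, tied encoder/decoder weights, and a cross-entropy reconstruction loss for binary inputs. The names `dae_loss`, `b_hid`, and `b_vis` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_loss(x, W, b_hid, b_vis, corruption=0.3, rng=None):
    """Mean cross-entropy between the reconstruction and the RAW input."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Corrupt: keep each input component with probability 1 - corruption.
    x_tilde = x * (rng.random(x.shape) >= corruption)
    # Encode the corrupted input, then decode with the transposed weights.
    h = sigmoid(x_tilde @ W + b_hid)
    r = sigmoid(h @ W.T + b_vis)
    # The loss compares against the uncorrupted x, not x_tilde.
    eps = 1e-12
    ce = -np.sum(x * np.log(r + eps) + (1 - x) * np.log(1 - r + eps), axis=1)
    return ce.mean(), h, r

rng = np.random.default_rng(1)
x = (rng.random((4, 5)) < 0.5).astype(float)   # toy binary batch
W = rng.normal(0.0, 0.1, size=(5, 3))
loss, h, r = dae_loss(x, W, np.zeros(3), np.zeros(5), rng=rng)
print(loss, h.shape, r.shape)
```

The key design point is that the target of the reconstruction is the clean input, so the model cannot succeed by learning the identity map.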

Denoising Auto-Encoder
- Learns a vector field towards higher probability regions
- Minimizes variational lower bound on a generative model
- Similar to pseudo-likelihood
[Figure: corrupted inputs]

Stacked Denoising Auto-Encoders
- No partition function, can measure training criterion
- Encoder & decoder: any parametrization
- Performs as well or better than stacking RBMs for unsupervised pre-training
[Figure: results on Infinite MNIST]
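Stacking can be sketched as a loop that trains one denoising auto-encoder per layer and feeds its clean hidden codes to the next layer. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses full-batch gradient descent, masking noise, tied weights, and sigmoid units, and the names `train_dae` and `greedy_pretrain` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(X, n_hidden, corruption=0.3, lr=0.1, epochs=50, seed=0):
    """Train one tied-weight denoising auto-encoder; return encoder parameters."""
    rng = np.random.default_rng(seed)
    n_vis = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_vis, n_hidden))
    b_hid, b_vis = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        X_tilde = X * (rng.random(X.shape) >= corruption)  # masking noise
        h = sigmoid(X_tilde @ W + b_hid)
        r = sigmoid(h @ W.T + b_vis)
        # Gradients of the mean cross-entropy against the clean input X.
        d_r = (r - X) / len(X)              # w.r.t. decoder pre-activation
        d_h = (d_r @ W) * h * (1.0 - h)     # w.r.t. encoder pre-activation
        W -= lr * (X_tilde.T @ d_h + d_r.T @ h)  # tied weights: both paths
        b_hid -= lr * d_h.sum(axis=0)
        b_vis -= lr * d_r.sum(axis=0)
    return W, b_hid

def greedy_pretrain(X, layer_sizes, **kwargs):
    """Train layers one at a time; each layer sees the previous layer's codes."""
    params, inputs = [], X
    for n_hidden in layer_sizes:
        W, b_hid = train_dae(inputs, n_hidden, **kwargs)
        params.append((W, b_hid))
        inputs = sigmoid(inputs @ W + b_hid)  # clean (uncorrupted) codes
    return params, inputs

X = (np.random.default_rng(2).random((20, 8)) < 0.5).astype(float)
params, top = greedy_pretrain(X, [6, 4], epochs=5)
print([W.shape for W, _ in params], top.shape)
```

The returned `params` would then initialize the hidden layers of a network before supervised fine-tuning.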

Deep Architectures and Sharing Statistical Strength, Multi-Task Learning
- Generalizing better to new tasks is crucial to approach AI
- Deep architectures learn good intermediate representations that can be shared across tasks
- A good representation is one that makes sense for many tasks
[Figure: raw input x → shared intermediate representation h → task 1 output y1, task 2 output y2, task 3 output y3]

Why is Unsupervised Pre-Training Working So Well?
- Regularization hypothesis:
  - Unsupervised component forces model close to P(x)
  - Representations good for P(x) are good for P(y|x)
- Optimization hypothesis:
  - Unsupervised initialization near better local minimum of P(y|x)
  - Can reach lower local minimum otherwise not achievable by random initialization
  - Easier to train each layer using a layer-local criterion

Learning Trajectories in Function Space
- Each point = a model in function space
- Color = epoch
- Top: trajectories w/o pre-training
- Each trajectory converges in different local min.
- No overlap of regions with and w/o pre-training

Unsupervised learning as regularizer
- Adding extra regularization (reducing # hidden units) hurts more the pre-trained models
- Pre-trained models have less variance wrt training sample
- Regularizer = infinite penalty outside of region compatible with unsupervised pre-training

Better optimization of online error
- Both training and online error are smaller with unsupervised pre-training
- As # samples → ∞, training err. = online err. = generalization err.
- Without unsup. pre-training: can’t exploit capacity to capture complexity in target function from training data

Learning Dynamics of Deep Nets
- As weights become larger, get trapped in basin of attraction (“quadrant” does not change)
- Initial updates have a crucial influence (“critical period”), explain more of the variance
- Unsupervised pre-training initializes in basin of attraction with good generalization properties
[Figure panels: before fine-tuning, after fine-tuning]

Restricted Boltzmann Machines
- The most popular building block for deep architectures
- Main advantage over auto-encoders: can sample from the model
- Bipartite undirected graphical model. x = observed, h = hidden
- P(h|x) and P(x|h) factorize: convenient Gibbs sampling x → h → x → h ...
- In practice, Gibbs sampling does not always mix well
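Because the conditionals factorize, each half of a Gibbs step is a single vectorized draw. The sketch below assumes a binary RBM with energy E(x, h) = -b·x - c·h - xᵀWh, which gives P(h_j = 1 | x) = sigmoid(c_j + (xW)_j) and P(x_i = 1 | h) = sigmoid(b_i + (Wh)_i). This parametrization and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_x(x, W, c, rng):
    """All hidden units are conditionally independent given x."""
    p = sigmoid(x @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, b, rng):
    """All visible units are conditionally independent given h."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_chain(x0, W, b, c, k, rng):
    """Run k alternating steps x -> h -> x and return the final visible sample."""
    x = x0
    for _ in range(k):
        h, _ = sample_h_given_x(x, W, c, rng)
        x, _ = sample_x_given_h(h, W, b, rng)
    return x

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(6, 3))      # 6 visible units, 3 hidden units
b, c = np.zeros(6), np.zeros(3)
x0 = (rng.random((4, 6)) < 0.5).astype(float)
x_k = gibbs_chain(x0, W, b, c, k=10, rng=rng)
print(x_k.shape)
```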

Boltzmann Machine Gradient
- Gradient has two components: ‘positive phase’ and ‘negative phase’
- In RBMs, easy to sample or sum over h | x
- Difficult part: sampling from P(x), typically with a Markov chain
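The two phases can be written out explicitly. The notation below is an assumption (the standard energy-based form with energy E and partition function Z, and the usual binary RBM energy with visible bias b, hidden bias c, and weights W), not a transcription of the slide.

```latex
\begin{align*}
P(x, h) &= \frac{e^{-E(x, h)}}{Z}, \qquad Z = \sum_{x, h} e^{-E(x, h)} \\
\frac{\partial \log P(x)}{\partial \theta}
  &= \underbrace{-\,\mathbb{E}_{h \sim P(h \mid x)}\!\left[\frac{\partial E(x, h)}{\partial \theta}\right]}_{\text{positive phase}}
   + \underbrace{\mathbb{E}_{(\tilde{x}, \tilde{h}) \sim P(x, h)}\!\left[\frac{\partial E(\tilde{x}, \tilde{h})}{\partial \theta}\right]}_{\text{negative phase}} \\
\text{RBM, } E(x, h) = -b^\top x - c^\top h - x^\top W h:\quad
\frac{\partial \log P(x)}{\partial W_{ij}}
  &= x_i \, P(h_j = 1 \mid x)
   - \mathbb{E}_{\tilde{x} \sim P(x)}\!\left[\tilde{x}_i \, P(h_j = 1 \mid \tilde{x})\right]
\end{align*}
```

The positive phase only needs the factorized conditional P(h | x); the negative phase needs samples from the model's own distribution, which is where the Markov chain comes in.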

Training RBMs
- Contrastive Divergence (CD-k): start negative Gibbs chain at observed x, run k Gibbs steps.
- Persistent CD (PCD): run negative Gibbs chain in background while weights slowly change
- Fast PCD: two sets of weights, one with a large learning rate only used for negative phase, quickly exploring modes
- Herding (see Max Welling’s ICML, UAI and workshop talks)
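The difference between CD-k and PCD is only where the negative chain starts. The following is a minimal sketch for a binary RBM, assuming the energy E(x, h) = -b·x - c·h - xᵀWh and a plain gradient-ascent step. The function `cd_k_update` and its `chain` argument are hypothetical names: passing `chain=None` starts the negative chain at the data (CD-k), while passing the previously returned chain state gives a PCD-style update.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update(x, W, b, c, k=1, lr=0.1, chain=None, rng=None):
    """One CD-k (chain=None) or PCD-style (chain given) parameter update."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Positive phase: exact hidden probabilities given the observed batch.
    ph_pos = sigmoid(x @ W + c)
    # Negative phase: k Gibbs steps from the data or from the persistent chain.
    x_neg = x if chain is None else chain
    for _ in range(k):
        p_h = sigmoid(x_neg @ W + c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_x = sigmoid(h @ W.T + b)
        x_neg = (rng.random(p_x.shape) < p_x).astype(float)
    ph_neg = sigmoid(x_neg @ W + c)
    # Approximate log-likelihood gradient: positive minus negative statistics.
    W_new = W + lr * (x.T @ ph_pos / len(x) - x_neg.T @ ph_neg / len(x_neg))
    b_new = b + lr * (x.mean(axis=0) - x_neg.mean(axis=0))
    c_new = c + lr * (ph_pos.mean(axis=0) - ph_neg.mean(axis=0))
    return W_new, b_new, c_new, x_neg

rng = np.random.default_rng(1)
x = (rng.random((5, 6)) < 0.5).astype(float)
W, b, c = np.zeros((6, 3)), np.zeros(6), np.zeros(3)
W, b, c, chain = cd_k_update(x, W, b, c, k=1, rng=rng)               # CD-1
W, b, c, chain = cd_k_update(x, W, b, c, k=1, chain=chain, rng=rng)  # PCD-style
print(W.shape, chain.shape)
```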

Deep Belief Networks
- Sampling:
  - Sample from top RBM
  - Sample from level k given k+1
- Estimating log-likelihood (not easy) (Salakhutdinov & Murray, ICML’2008, NIPS’2008)
- Training:
  - Variational bound justifies greedy layerwise training of RBMs
  - How to train all levels together?
[Figure: observed x, then hidden layers h1, h2, h3, with the top-level RBM between h2 and h3]

Deep Boltzmann Machines
(Salakhutdinov et al, AISTATS 2009, Lee et al, ICML 2009)
- Positive phase: variational approximation (mean-field)
- Negative phase: persistent chain
- Guarantees (Younes 89, 2000; Yuille 2004): if learning rate decreases in 1/t, chain mixes before parameters change too much, chain stays converged when parameters change.
- Can (must) initialize from stacked RBMs
- Salakhutdinov et al improved performance on MNIST from 1.2% to .95% error
- Can apply AIS with 2 hidden layers
[Figure: observed x, then hidden layers h1, h2, h3]

Level-local learning is important
- Initializing each layer of an unsupervised deep Boltzmann machine helps a lot
- Initializing each layer of a supervised neural network as an RBM helps a lot
- Helps most the layers further away from the target
- Not just an effect of unsupervised prior
- Jointly training all the levels of a deep architecture is difficult
- Initializing using a level-local learning algorithm (RBM, auto-encoders, etc.) is a useful trick

Estimating Log-Likelihood
- RBMs: requires estimating partition function
  - Reconstruction error provides a cheap proxy
  - log Z tractable analytically for < 25 binary inputs or hidden units
  - Lower-bounded with Annealed Importance Sampling (AIS)
- Deep Belief Networks: extensions of AIS (Salakhutdinov et al 2008)
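The reason log Z is tractable for a small layer is that one side of the RBM can be summed out in closed form, leaving an enumeration over only the other side. Here is a minimal sketch assuming a binary RBM with energy E(x, h) = -b·x - c·h - xᵀWh. It enumerates the 2^n_hid hidden configurations and sums the visible units analytically, then checks the result against brute force on a tiny model. The function names are hypothetical.

```python
import numpy as np
from itertools import product

def log_partition_exact(W, b, c):
    """log Z by enumerating hidden states; visible units are summed analytically."""
    n_hid = W.shape[1]
    H = np.array(list(product((0.0, 1.0), repeat=n_hid)))  # (2**n_hid, n_hid)
    # For fixed h: log sum_x exp(-E(x, h)) = c.h + sum_i log(1 + exp(b_i + (W h)_i))
    log_terms = H @ c + np.logaddexp(0.0, b + H @ W.T).sum(axis=1)
    return np.logaddexp.reduce(log_terms)

def log_partition_brute(W, b, c):
    """Reference value: enumerate every joint (x, h) configuration."""
    n_vis, n_hid = W.shape
    neg_energies = [
        b @ np.array(x) + c @ np.array(h) + np.array(x) @ W @ np.array(h)
        for x in product((0.0, 1.0), repeat=n_vis)
        for h in product((0.0, 1.0), repeat=n_hid)
    ]
    return np.logaddexp.reduce(np.array(neg_energies))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b, c = rng.normal(size=4), rng.normal(size=3)
print(log_partition_exact(W, b, c), log_partition_brute(W, b, c))
```

The cost grows as 2^n_hid (or 2^n_vis if the roles are swapped), which is why the exact computation is limited to a couple of dozen units on the smaller side.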

Open Problems
- Why is it difficult to train deep architectures?
- What is important in the learning dynamics?
- How to improve joint training of all layers?
- How to sample better from RBMs and deep generative models?
- Monitoring unsupervised learning quality in deep nets?
- Other ways to guide training of intermediate representations?
- Getting rid of learning rates?

THANK YOU! Questions? Comments?
