Lecture 5: Multilayer Perceptrons


Roger Grosse

1 Introduction

So far, we've only talked about linear models: linear regression and linear binary classifiers. We noted that there are functions that can't be represented by linear models; for instance, linear regression can't represent quadratic functions, and linear classifiers can't represent XOR. We also saw one particular way around this issue: by defining features, or basis functions. E.g., linear regression can represent a cubic polynomial if we use the feature map ψ(x) = (1, x, x², x³). We also observed that this isn't a very satisfying solution, for two reasons:

1. The features need to be specified in advance, and this can require a lot of engineering work.

2. It might require a very large number of features to represent a certain set of functions; e.g. the feature representation for cubic polynomials is cubic in the number of input features. (See the short counting sketch at the end of this introduction.)

In this lecture, and for the rest of the course, we'll take a different approach. We'll represent complex nonlinear functions by connecting together lots of simple processing units into a neural network, each of which computes a linear function, possibly followed by a nonlinearity. In aggregate, these units can compute some surprisingly complex functions. By historical accident, these networks are called multilayer perceptrons. (Some people would claim that the methods covered in this course are really "just" adaptive basis function representations. I've never found this a very useful way of looking at things.)

1.1 Learning Goals

- Know the basic terminology for neural nets
- Given the weights and biases for a neural net, be able to compute its output from its input
- Be able to hand-design the weights of a neural net to represent functions like XOR
- Understand how a hard threshold can be approximated with a soft threshold
- Understand why shallow neural nets are universal, and why this isn't necessarily very interesting
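As a quick sanity check of the second point, here is a minimal counting sketch (the exact count depends on whether cross-terms and the constant are included; this sketch includes both): the cubic feature map generalizing ψ(x) = (1, x, x², x³) to D input features contains every monomial of degree at most 3, and the number of such monomials grows roughly like D³/6.

```python
from math import comb

def num_cubic_features(D):
    # Monomials of degree <= 3 in D variables (constant term included):
    # choose k of the D variables with repetition, for k = 0, 1, 2, 3.
    return sum(comb(D + k - 1, k) for k in range(4))

for D in (1, 10, 100, 1000):
    print(D, num_cubic_features(D))  # grows roughly like D**3 / 6
```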

Figure 1: A multilayer perceptron with two hidden layers. Left: with the units written out explicitly. Right: representing layers as boxes.

2 Multilayer Perceptrons

In the first lecture, we introduced our general neuron-like processing unit:

    a = \phi\Big(\sum_j w_j x_j + b\Big),

where the x_j are the inputs to the unit, the w_j are the weights, b is the bias, φ is the nonlinear activation function, and a is the unit's activation. We've seen a bunch of examples of such units:

- Linear regression uses a linear model, so φ(z) = z.
- In binary linear classifiers, φ is a hard threshold at zero.
- In logistic regression, φ is the logistic function σ(z) = 1/(1 + e^{-z}).

(A short code sketch below evaluates one such unit under each of these choices.)

A neural network is just a combination of lots of these units. Each one performs a very simple and stereotyped function, but in aggregate they can do some very useful computations. For now, we'll concern ourselves with feed-forward neural networks, where the units are arranged into a graph without any cycles, so that all the computation can be done sequentially. This is in contrast with recurrent neural networks, where the graph can have cycles, so the processing can feed into itself. These are much more complicated, and we'll cover them later in the course.

The simplest kind of feed-forward network is a multilayer perceptron (MLP), as shown in Figure 1. (MLP is an unfortunate name: the perceptron was a particular algorithm for binary classification, invented in the 1950s, and most multilayer perceptrons have very little to do with the original perceptron algorithm.) Here, the units are arranged into a set of layers, and each layer contains some number of identical units. Every unit in one layer is connected to every unit in the next layer; we say that the network is fully connected. The first layer is the input layer, and its units take the values of the input features. The last layer is the output layer, and it has one unit for each value the network outputs (i.e. a single unit in the case of regression or binary classification, or K units in the case of K-class classification). All the layers in between these are known as hidden layers, because we don't know ahead of time what these units should compute, and this needs to be discovered during learning. The units in these layers are known as input units, output units, and hidden units, respectively. The number of layers is known as the depth, and the number of units in a layer is known as the width. (Terminology for the depth is very inconsistent: a network with one hidden layer could be called a one-layer, two-layer, or three-layer network, depending on whether you count the input and output layers.) As you might guess, "deep learning" refers to training neural nets with many layers.
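Here is the promised code rendition of a single neuron-like unit, evaluated under each of the three activation functions listed above (a tiny sketch; the particular numbers are arbitrary):

```python
import numpy as np

def unit(x, w, b, phi):
    # One neuron-like processing unit: a = phi(w . x + b)
    return phi(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])   # inputs
w = np.array([0.3, 0.1, -0.4])   # weights
b = 0.2                          # bias

identity       = lambda z: z                          # linear regression
hard_threshold = lambda z: float(z > 0)               # binary linear classifier
logistic       = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic regression

for phi in (identity, hard_threshold, logistic):
    print(unit(x, w, b, phi))
```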

Figure 2: An MLP that computes the XOR function. All activation functions are binary thresholds at 0.

As an example to illustrate the power of MLPs, let's design one that computes the XOR function. Remember, we showed that linear models cannot do this. We can verbally describe XOR as "one of the inputs is 1, but not both of them." So let's have hidden unit h_1 detect if at least one of the inputs is 1, and have h_2 detect if they are both 1. We can easily do this if we use a hard threshold activation function. You know how to design such units — it's an exercise in designing a binary linear classifier. Then the output unit will activate only if h_1 = 1 and h_2 = 0. A network which does this is shown in Figure 2; one possible choice of weights appears in the code sketch below.

Let's write out the MLP computations mathematically. Conceptually, there's nothing new here; we just have to pick a notation to refer to various parts of the network. As with the linear case, we'll refer to the activations of the input units as x_j and the activation of the output unit as y. The units in the ℓth hidden layer will be denoted h_i^{(ℓ)}. Our network is fully connected, so each unit receives connections from all the units in the previous layer. This means each unit has its own bias, and there's a weight for every pair of units in two consecutive layers. Therefore, the network's computations can be written out as:

    h_i^{(1)} = \phi^{(1)}\Big(\sum_j w_{ij}^{(1)} x_j + b_i^{(1)}\Big)
    h_i^{(2)} = \phi^{(2)}\Big(\sum_j w_{ij}^{(2)} h_j^{(1)} + b_i^{(2)}\Big)
    y_i = \phi^{(3)}\Big(\sum_j w_{ij}^{(3)} h_j^{(2)} + b_i^{(3)}\Big)

Note that we distinguish φ^(1) and φ^(2) because different layers may have different activation functions.
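The following sketch ties the two ideas together: a hand-designed XOR network written out as the layer-by-layer computations above, using hard thresholds at 0. These weights are one valid choice, not necessarily the exact values shown in Figure 2.

```python
import numpy as np

def step(z):
    # Hard threshold at 0, applied elementwise.
    return (np.asarray(z) > 0).astype(float)

def xor_mlp(x):
    """Hand-designed XOR network with hard-threshold units.

    h1 fires if at least one input is 1, h2 fires if both are 1,
    and the output fires only if h1 = 1 and h2 = 0.
    """
    W1 = np.array([[1., 1.],    # h1: x1 OR x2
                   [1., 1.]])   # h2: x1 AND x2
    b1 = np.array([-0.5, -1.5])
    W2 = np.array([[1., -1.]])  # y: h1 AND (NOT h2)
    b2 = np.array([-0.5])
    h = step(W1 @ x + b1)
    y = step(W2 @ h + b2)
    return y[0]

for x in ([0., 0.], [0., 1.], [1., 0.], [1., 1.]):
    print(x, xor_mlp(np.array(x)))   # prints 0, 1, 1, 0
```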

Since all these summations and indices can be cumbersome, we usually write the computations in vectorized form. Since each layer contains multiple units, we represent the activations of all its units with an activation vector h^(ℓ). Since there is a weight for every pair of units in two consecutive layers, we represent each layer's weights with a weight matrix W^(ℓ). Each layer also has a bias vector b^(ℓ). The above computations are therefore written in vectorized form as:

    h^{(1)} = \phi^{(1)}\big(W^{(1)} x + b^{(1)}\big)
    h^{(2)} = \phi^{(2)}\big(W^{(2)} h^{(1)} + b^{(2)}\big)
    y = \phi^{(3)}\big(W^{(3)} h^{(2)} + b^{(3)}\big)

When we write the activation function applied to a vector, this means it's applied independently to all the entries.

Recall how in linear regression, we combined all the training examples into a single matrix X, so that we could compute all the predictions using a single matrix multiplication. We can do the same thing here. We can store all of each layer's hidden units for all the training examples as a matrix H^(ℓ). Each row contains the hidden units for one example. The computations are written as follows (note the transposes):

    H^{(1)} = \phi^{(1)}\big(X W^{(1)\top} + \mathbf{1} b^{(1)\top}\big)
    H^{(2)} = \phi^{(2)}\big(H^{(1)} W^{(2)\top} + \mathbf{1} b^{(2)\top}\big)
    Y = \phi^{(3)}\big(H^{(2)} W^{(3)\top} + \mathbf{1} b^{(3)\top}\big)

(If it's hard to remember when a matrix or vector is transposed, fear not: you can usually figure it out by making sure the dimensions match up.) These equations can be translated directly into NumPy code which efficiently computes the predictions over the whole dataset, as in the sketch below.
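A minimal NumPy sketch of the batch computation follows. The layer sizes, parameter format, and the choice of a single logistic activation for every layer are illustrative assumptions, not part of the notes.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, params, phi=logistic):
    """Compute predictions for a whole dataset at once.

    X has one training example per row. params is a list of (W, b)
    pairs, one per layer, with W of shape (units_out, units_in) and
    b of shape (units_out,). H @ W.T + b matches H W^T + 1 b^T above,
    with broadcasting adding b to every row.
    """
    H = X
    for W, b in params:
        H = phi(H @ W.T + b)
    return H

# Example with made-up shapes: 784 inputs, 100 hidden units, 10 outputs.
rng = np.random.default_rng(0)
params = [
    (0.01 * rng.standard_normal((100, 784)), np.zeros(100)),
    (0.01 * rng.standard_normal((10, 100)), np.zeros(10)),
]
X = rng.random((32, 784))       # a batch of 32 examples
Y = mlp_forward(X, params)      # shape (32, 10)
print(Y.shape)
```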

3 Feature Learning

We already saw that linear regression could be made more powerful using a feature mapping. For instance, the feature mapping ψ(x) = (1, x, x², x³) can represent third-degree polynomials. But static feature mappings were limited because it can be hard to design all the relevant features, and because the mappings might be impractically large. Neural nets can be thought of as a way of learning nonlinear feature mappings. E.g., in Figure 1, the last hidden layer can be thought of as a feature map ψ(x), and the output layer weights can be thought of as a linear model using those features. But the whole thing can be trained end-to-end with backpropagation, which we'll cover in the next lecture. The hope is that we can learn a feature representation where the data become linearly separable.

Figure 3: Left: Some training examples from the MNIST handwritten digit dataset. Each input is a 28 × 28 grayscale image, which we treat as a 784-dimensional vector. Right: A subset of the learned first-layer features. Observe that many of them pick up oriented edges.

Consider training an MLP to recognize handwritten digits. (This will be a running example for much of the course.) The input is a 28 × 28 grayscale image, and all the pixels take values between 0 and 1. We'll ignore the spatial structure, and treat each input as a 784-dimensional vector. (Later on, we'll talk about convolutional networks, which use the spatial structure of the image.) This is a multiway classification task with 10 categories, one for each digit class. Suppose we train an MLP with two hidden layers. We can try to understand what the first layer of hidden units is computing by visualizing the weights. Each hidden unit receives inputs from each of the pixels, which means the weights feeding into each hidden unit can be represented as a 784-dimensional vector, the same as the input size. In Figure 3, we display these vectors as images (a code sketch for producing this kind of visualization appears at the end of this section).

In this visualization, positive values are lighter, and negative values are darker. Each hidden unit computes the dot product of its weight vector with the input image, and then passes the result through the activation function. So if the light regions of the filter overlap the light regions of the image, and the dark regions of the filter overlap the dark regions of the image, then the unit will activate. E.g., look at the third filter in the second row. This corresponds to an oriented edge: it detects vertical edges in the upper right part of the image. This is a useful sort of feature, since it gives information about the locations and orientations of strokes. Many of the features are similar to this; in fact, oriented edges are very commonly learned by the first layers of neural nets for visual processing tasks.

It's harder to visualize what the second layer is doing. We'll see some tricks for visualizing this in a few weeks. We'll see that higher layers of a neural net can learn increasingly high-level and complex features.
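A visualization like the right panel of Figure 3 can be produced by reshaping each hidden unit's incoming weights back into a 28 × 28 image. The sketch below assumes a first-layer weight matrix W1 of shape (num_hidden, 784), as in the earlier forward-pass sketch; with untrained random weights the filters just look like noise, but after training many of them resemble oriented edge detectors.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_first_layer_weights(W1, rows=4, cols=8):
    # Each row of W1 is the 784-dimensional weight vector of one hidden
    # unit; reshape it to 28 x 28 and display it as a grayscale image
    # (lighter = more positive, darker = more negative).
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax, w in zip(axes.ravel(), W1):
        ax.imshow(w.reshape(28, 28), cmap="gray")
        ax.axis("off")
    plt.show()

W1 = np.random.default_rng(0).standard_normal((32, 784))  # placeholder weights
show_first_layer_weights(W1)
```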

4 Expressive Power

Linear models are fundamentally limited in their expressive power: they can't represent functions like XOR. Are there similar limitations for MLPs? It depends on the activation function.

Figure 4: Designing a binary threshold network to compute a particular function.

4.1 Linear networks

Deep linear networks are no more powerful than shallow ones. The reason is simple: if we use the linear activation function φ(x) = x (and forget the biases for simplicity), the network's function can be expanded out as y = W^(L) W^(L−1) · · · W^(1) x. But this could be viewed as a single linear layer with weights given by W = W^(L) W^(L−1) · · · W^(1). Therefore, a deep linear network is no more powerful than a single linear layer, i.e. a linear model.

4.2 Universality

As it turns out, nonlinear activation functions give us much more power: under certain technical conditions, even a shallow MLP (i.e. one with a single hidden layer) can represent arbitrary functions. Therefore, we say it is universal.

Let's demonstrate universality in the case of binary inputs. We do this using the following game: suppose we're given a function mapping input vectors to outputs; we will need to produce a neural network (i.e. specify the weights and biases) which matches that function. The function can be given to us as a table which lists the output corresponding to every possible input vector. If there are D inputs, this table will have 2^D rows. An example is shown in Figure 4. For convenience, let's suppose these inputs are ±1, rather than 0 or 1. All of our hidden units will use a hard threshold at 0 (but we'll see shortly that these can easily be converted to soft thresholds), and the output unit will be linear.

Our strategy will be as follows: we will have 2^D hidden units, each of which recognizes one possible input vector. We can then specify the function by specifying the weights connecting each of these hidden units to the outputs. For instance, suppose we want a hidden unit to recognize the input (−1, 1, −1). This can be done using the weights (−1, 1, −1) and bias −2.5, and this unit will be connected to the output unit with weight 1. (Can you come up with the general rule?) Using these weights, any input pattern will produce a set of hidden activations where exactly one of the units is active. The weights connecting the hidden units to the output can then be set based on the input-output table. Part of the network is shown in Figure 4. (This argument can easily be made into a rigorous proof, but this course won't be concerned with mathematical rigor.)
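One way to fill in the general rule: each hidden unit's weights equal the ±1 pattern it should recognize, its bias is −(D − 0.5), and its outgoing weight is the table entry for that pattern. A sketch of the construction follows, exercised on a made-up XOR-style table over ±1 inputs.

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

def build_lookup_net(table):
    """Build the 2^D-hidden-unit network described above.

    table maps each input tuple over {-1, +1} to the desired output.
    Hidden units use a hard threshold at 0; the output unit is linear.
    """
    patterns = list(table.keys())
    D = len(patterns[0])
    W1 = np.array(patterns, dtype=float)       # each hidden unit's weights = one pattern
    b1 = -(D - 0.5) * np.ones(len(patterns))   # fires only on an exact match
    w2 = np.array([table[p] for p in patterns], dtype=float)
    return W1, b1, w2

def predict(x, net):
    W1, b1, w2 = net
    h = step(W1 @ x + b1)   # exactly one hidden unit is active
    return float(w2 @ h)    # the linear output reads off the table entry

table = {(-1, -1): 0, (-1, 1): 1, (1, -1): 1, (1, 1): 0}
net = build_lookup_net(table)
for x, t in table.items():
    assert predict(np.array(x, dtype=float), net) == t
```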

Universality is a neat property, but it has a major catch: the network required to represent a given function might have to be extremely large (in particular, exponential). In other words, not all functions can be represented compactly. We desire compact representations for two reasons:

1. We want to be able to compute predictions in a reasonable amount of time.

2. We want to be able to train a network to generalize from a limited number of training examples; from this perspective, universality simply implies that a large enough network can memorize the training set, which isn't very interesting.

4.3 Soft thresholds

In the previous section, our activation function was a step function, which gives a hard threshold at 0. This was convenient for designing the weights of a network by hand. But recall from last lecture that it's very hard to directly learn a linear classifier with a hard threshold, because the loss derivatives are 0 almost everywhere. The same holds true for multilayer perceptrons. If the activation function for any unit is a hard threshold, we won't be able to learn that unit's weights using gradient descent. The solution is the same as it was in last lecture: we replace the hard threshold with a soft one.

Does this cost us anything in terms of the network's expressive power? No it doesn't, because we can approximate a hard threshold using a soft threshold. In particular, if we use the logistic nonlinearity, we can approximate a hard threshold by scaling up the weights and biases: as the scaling factor grows, σ(cz) looks more and more like a hard threshold at 0.

4.4 The power of depth

If shallow networks are universal, why do we need deep ones? One important reason is that deep nets can represent some functions more compactly than shallow ones. For instance, consider the parity function (on binary-valued inputs):

    f_{\mathrm{par}}(x_1, \ldots, x_D) = \begin{cases} 1 & \text{if } \sum_j x_j \text{ is odd} \\ 0 & \text{if it is even.} \end{cases}

We won't prove this, but it requires an exponentially large shallow network to represent the parity function. On the other hand, it can be computed by a deep network whose size is linear in the number of inputs. Designing such a network is a good exercise.
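It's worth trying that exercise before reading on. For reference, one possible construction (a sketch, and certainly not the only answer) reuses the hand-designed XOR gadget from Section 2 and chains it, so the depth grows linearly with the number of inputs while the width stays constant:

```python
import numpy as np

def step(z):
    return (np.asarray(z) > 0).astype(float)

def xor_gate(a, b):
    # The hand-designed XOR MLP from Section 2: h1 = OR, h2 = AND,
    # output = h1 AND (NOT h2), all with hard thresholds at 0.
    h1 = step(a + b - 0.5)
    h2 = step(a + b - 1.5)
    return step(h1 - h2 - 0.5)

def parity_net(x):
    # Maintain a running parity: p_j = XOR(p_{j-1}, x_j). Each XOR adds
    # a constant number of layers, so the total depth is linear in D.
    p = x[0]
    for xj in x[1:]:
        p = xor_gate(p, xj)
    return p

x = np.array([1., 0., 1., 1.])
print(parity_net(x))   # 1.0, since an odd number of the inputs are 1
```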
