Introduction To Neural Networks - MT Class

Introduction to Neural Networks
Philipp Koehn
22 September 2022

Linear Models

We have previously used a weighted linear combination of feature values h_j and weights λ_j:

    score(λ, d_i) = Σ_j λ_j h_j(d_i)

Such models can be illustrated as a "network".
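A minimal sketch of such a linear scoring function, assuming feature values and weights are given as plain Python lists (all names are illustrative, not from the slides):

    def linear_score(weights, features):
        """Weighted linear combination: score = sum_j lambda_j * h_j."""
        return sum(w * h for w, h in zip(weights, features))

    # Example with three features:
    print(linear_score([0.5, -1.0, 2.0], [1.0, 0.3, 0.7]))  # 0.5 - 0.3 + 1.4 = 1.6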

Limits of Linearity

We can give each feature a weight, but we cannot model more complex value relationships, e.g.:
– any value in the range [0;5] is equally good
– values over 8 are bad
– higher than 10 is not worse

XOR

Linear models cannot model XOR.

[Figure: the four XOR input combinations, labeled good / bad / bad / good, cannot be separated by a line.]

Multiple Layers

Add an intermediate ("hidden") layer of processing (each arrow is a weight).

[Figure: network with input layer x, hidden layer h, output layer y.]

Have we gained anything so far?

Non-Linearity

Instead of computing a linear combination

    score(λ, d_i) = Σ_j λ_j h_j(d_i)

add a non-linear function:

    score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )

Popular choices:

    tanh(x)
    sigmoid(x) = 1 / (1 + e^-x)
    relu(x) = max(0, x)

(sigmoid is also called the "logistic function")
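A minimal sketch of these activation functions using only the Python standard library (function names are illustrative):

    import math

    def sigmoid(x):
        """Logistic function: 1 / (1 + e^-x)."""
        return 1.0 / (1.0 + math.exp(-x))

    def tanh(x):
        """Hyperbolic tangent."""
        return math.tanh(x)

    def relu(x):
        """Rectified linear unit: max(0, x)."""
        return max(0.0, x)

    # A non-linear score: apply f to the weighted linear combination
    def nonlinear_score(weights, features, f=sigmoid):
        return f(sum(w * h for w, h in zip(weights, features)))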

Deep Learning

More layers = deep learning

What Depth Holds

Each layer is a processing step. Having multiple processing steps allows complex functions.

Metaphor: NN and computing circuits
– computer = sequence of Boolean gates
– neural computer = sequence of layers

Deep neural networks can implement complex functions, e.g., sorting of input values.

example

Simple Neural Network

[Figure: a network with two input nodes, two hidden nodes, and one output node; hidden weights 3.7, 3.7 with bias -1.5 and 2.9, 2.9 with bias -4.5; output weights 4.5, -5.2 with bias -2.0.]

One innovation: bias units (no inputs, always value 1)

Sample Input

[Figure: the same network with input values x0 = 1.0 and x1 = 0.0.]

Try out two input values.

Hidden unit computation:

    sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × (-1.5)) = sigmoid(2.2) = 1 / (1 + e^-2.2) = 0.90

    sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × (-4.5)) = sigmoid(-1.6) = 1 / (1 + e^1.6) = 0.17

Computed Hidden

[Figure: the same network with hidden values 0.90 and 0.17 filled in.]

Hidden unit computation:

    sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × (-1.5)) = sigmoid(2.2) = 0.90

    sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × (-4.5)) = sigmoid(-1.6) = 0.17

Compute Output

[Figure: the same network, now computing the output node from the hidden values.]

Output unit computation:

    sigmoid(0.90 × 4.5 + 0.17 × (-5.2) + 1 × (-2.0)) = sigmoid(1.17) = 1 / (1 + e^-1.17) = 0.76

Computed Output

[Figure: the same network with output value 0.76 filled in.]

Output unit computation:

    sigmoid(0.90 × 4.5 + 0.17 × (-5.2) + 1 × (-2.0)) = sigmoid(1.17) = 0.76

Output for all Binary Inputs

    Input x0   Input x1   Hidden h0   Hidden h1   Output y0
    0          0          0.12        0.02        0.18 → 0
    0          1          0.88        0.27        0.74 → 1
    1          0          0.73        0.12        0.74 → 1
    1          1          0.99        0.73        0.33 → 0

Network implements XOR
– hidden node h0 is OR
– hidden node h1 is AND
– final layer operation is h0 AND NOT h1 (see the sketch below)

Power of deep neural networks: chaining of processing steps, just as more Boolean circuits make more complex computations possible.
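A minimal sketch that re-runs the worked forward pass in Python. The weights are read off the preceding slides: 3.7, 3.7 with bias -1.5 and 2.9, 2.9 with bias -4.5 for the hidden layer, and 4.5, -5.2 with bias -2.0 for the output; the bias values are taken from the hidden-unit computation above, so treat them as an assumption.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Weights read off the slides: [weight for x0, weight for x1, bias]
    W_HIDDEN = [[3.7, 3.7, -1.5],   # hidden node h0
                [2.9, 2.9, -4.5]]   # hidden node h1
    W_OUTPUT = [4.5, -5.2, -2.0]    # output node y0: [weight for h0, weight for h1, bias]

    def forward(x0, x1):
        h = [sigmoid(w[0] * x0 + w[1] * x1 + w[2]) for w in W_HIDDEN]
        y = sigmoid(W_OUTPUT[0] * h[0] + W_OUTPUT[1] * h[1] + W_OUTPUT[2])
        return h, y

    # Reproduces the worked example: h = (0.90, 0.17), y = 0.76
    print(forward(1.0, 0.0))

    # Truth table over all binary inputs: the output follows the XOR pattern
    for x0 in (0.0, 1.0):
        for x1 in (0.0, 1.0):
            h, y = forward(x0, x1)
            print(x0, x1, round(h[0], 2), round(h[1], 2), round(y, 2))

The loop reproduces the XOR pattern of the table above; the exact numbers may differ slightly, since the figure weights are only partly recoverable from the transcription.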

why "neural" networks?

Neuron in the Brain

The human brain is made up of about 100 billion neurons.

[Figure: a neuron, with dendrites, soma, nucleus, axon, and axon terminal labeled.]

Neurons receive electric signals at the dendrites and send them to the axon.

Neural Communication

The axon of the neuron is connected to the dendrites of many other neurons.

[Figure: a synapse, with axon terminal, synaptic vesicles, neurotransmitters and their transporters, voltage-gated Ca channel, synaptic cleft, receptors, postsynaptic density, and dendrite labeled.]

The Brain vs. Artificial Neural Networks

Similarities
– neurons, connections between neurons
– learning = change of connections, not change of neurons
– massive parallel processing

But artificial neural networks are much simpler
– computation within a neuron is vastly simplified
– discrete time steps
– typically some form of supervised learning with a massive number of stimuli

back-propagation training

Error

[Figure: the example network with its computed values, output 0.76.]

Computed output: y = 0.76
Correct output:  t = 1.0

How do we adjust the weights?

Key Concepts

Gradient descent
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error

Back-propagation
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights

Gradient Descent

[Figure: error(λ) plotted as a curve over λ; the gradient at the current λ points towards the optimal λ.]

Gradient Descent

[Figure: a two-dimensional error surface over weights w1 and w2; the gradients for w1 and w2 combine into a single direction from the current point towards the optimum.]

Derivative of Sigmoid

Sigmoid:

    sigmoid(x) = 1 / (1 + e^-x)

Reminder: quotient rule

    ( f(x) / g(x) )' = ( g(x) f'(x) - f(x) g'(x) ) / g(x)^2

Derivative:

    d/dx sigmoid(x) = d/dx ( 1 / (1 + e^-x) )
                    = ( 0 × (1 + e^-x) - (-e^-x) × 1 ) / (1 + e^-x)^2
                    = e^-x / (1 + e^-x)^2
                    = 1/(1 + e^-x) × e^-x/(1 + e^-x)
                    = 1/(1 + e^-x) × ( 1 - 1/(1 + e^-x) )
                    = sigmoid(x) (1 - sigmoid(x))
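A small, purely illustrative sketch that checks this identity numerically against a finite-difference approximation:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_prime(x):
        # Analytic derivative: sigmoid(x) * (1 - sigmoid(x))
        s = sigmoid(x)
        return s * (1.0 - s)

    # Compare with a central finite difference at a few points
    eps = 1e-6
    for x in (-2.0, 0.0, 1.17, 3.0):
        numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
        print(x, sigmoid_prime(x), numeric)   # the two columns should agree closely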

Final Layer Update

Linear combination of weights:  s = Σ_k w_k h_k

Activation function:  y = sigmoid(s)

Error (L2 norm):  E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

    dE/dw_k = dE/dy × dy/ds × ds/dw_k

Final Layer Update (1)

Linear combination of weights:  s = Σ_k w_k h_k

Activation function:  y = sigmoid(s)

Error (L2 norm):  E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

    dE/dw_k = dE/dy × dy/ds × ds/dw_k

Error E is defined with respect to y:

    dE/dy = d/dy ( 1/2 (t - y)^2 ) = -(t - y)

Final Layer Update (2)

Linear combination of weights:  s = Σ_k w_k h_k

Activation function:  y = sigmoid(s)

Error (L2 norm):  E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

    dE/dw_k = dE/dy × dy/ds × ds/dw_k

y with respect to s is sigmoid(s):

    dy/ds = d/ds sigmoid(s) = sigmoid(s)(1 - sigmoid(s)) = y(1 - y)

Final Layer Update (3)

Linear combination of weights:  s = Σ_k w_k h_k

Activation function:  y = sigmoid(s)

Error (L2 norm):  E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

    dE/dw_k = dE/dy × dy/ds × ds/dw_k

s is the weighted linear combination of hidden node values h_k:

    ds/dw_k = d/dw_k ( Σ_k w_k h_k ) = h_k

Putting it All Together

Derivative of error with regard to one weight w_k:

    dE/dw_k = dE/dy × dy/ds × ds/dw_k
            = -(t - y) × y(1 - y) × h_k

– error: (t - y)
– derivative of sigmoid: y' = y(1 - y)

Weight adjustment will be scaled by a fixed learning rate µ:

    Δw_k = µ (t - y) y' h_k
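A minimal sketch of this update rule for the weights into a single output node (illustrative names; sigmoid activation as above):

    import math

    def final_layer_update(h, w, t, mu):
        """One gradient-descent step for the weights into one output node.

        h:  values of the nodes feeding the output (last entry can be the bias unit, fixed at 1)
        w:  current weights, one per incoming node
        t:  target output
        mu: learning rate
        """
        s = sum(wk * hk for wk, hk in zip(w, h))   # linear combination s
        y = 1.0 / (1.0 + math.exp(-s))             # y = sigmoid(s)
        delta = (t - y) * y * (1.0 - y)            # (t - y) * y'
        return [wk + mu * delta * hk for wk, hk in zip(w, h)]

    # With the example network's hidden values and output weights:
    print(final_layer_update([0.90, 0.17, 1.0], [4.5, -5.2, -2.0], t=1.0, mu=10))

Up to rounding, this reproduces the updated output weights shown in the worked example below (about 4.89, -5.13, -1.57).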

Multiple Output Nodes

Our example only had one output node. Typically neural networks have multiple output nodes.

Error is computed over all j output nodes:

    E = Σ_j 1/2 (t_j - y_j)^2

Weights k → j are adjusted according to the node they point to:

    Δw_{j←k} = µ (t_j - y_j) y'_j h_k

Hidden Layer Update

In a hidden layer, we do not have a target output value. But we can compute how much each node contributed to the downstream error.

Definition of the error term of each node:

    δ_j = (t_j - y_j) y'_j

Back-propagate the error term (why this way? there is math to back it up):

    δ_i = ( Σ_j w_{j←i} δ_j ) y'_i

Universal update formula:

    Δw_{j←k} = µ δ_j h_k

Our Example

[Figure: the example network with nodes labeled A, B (inputs), C (input bias), D, E (hidden), F (hidden bias), G (output); values A = 1.0, B = 0.0, D = 0.90, E = 0.17, G = 0.76.]

Computed output: y = 0.76
Correct output:  t = 1.0

Final layer weight updates (learning rate µ = 10):
– δ_G = (t - y) y' = (1 - 0.76) × 0.181 = 0.0434
– Δw_{G←D} = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391
– Δw_{G←E} = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
– Δw_{G←F} = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Our Example

[Figure: the same network with the final-layer weights updated: 4.5 → 4.891, -5.2 → -5.126, -2.0 → -1.566.]

Computed output: y = 0.76
Correct output:  t = 1.0

Final layer weight updates (learning rate µ = 10):
– δ_G = (t - y) y' = (1 - 0.76) × 0.181 = 0.0434
– Δw_{G←D} = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391
– Δw_{G←E} = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
– Δw_{G←F} = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Hidden Layer Updates

[Figure: the same network with the updated final-layer weights.]

Hidden node D
– δ_D = ( Σ_j w_{j←i} δ_j ) y'_D = w_{G←D} δ_G y'_D = 4.5 × 0.0434 × 0.0898 = 0.0175
– Δw_{D←A} = µ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
– Δw_{D←B} = µ δ_D h_B = 10 × 0.0175 × 0.0 = 0
– Δw_{D←C} = µ δ_D h_C = 10 × 0.0175 × 1 = 0.175

Hidden node E
– δ_E = ( Σ_j w_{j←i} δ_j ) y'_E = w_{G←E} δ_G y'_E = -5.2 × 0.0434 × 0.2055 = -0.0464
– Δw_{E←A} = µ δ_E h_A = 10 × (-0.0464) × 1.0 = -0.464
– etc.
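A small sketch that re-runs this worked backward pass in Python. The learning rate of 10 and all node values are taken from the slides; small differences in the last digit come from rounding (the slides round y' to 0.181). Hidden node E follows the same pattern as node D.

    mu = 10.0                                    # learning rate, as on the slide
    h = {"D": 0.90, "E": 0.17, "F": 1.0}         # hidden values (F is the bias unit)
    w_out = {"D": 4.5, "E": -5.2, "F": -2.0}     # weights into output node G
    y, t = 0.76, 1.0

    # Final layer: error term and weight updates
    delta_G = (t - y) * y * (1.0 - y)                    # ~ 0.043
    updates_G = {k: mu * delta_G * h[k] for k in w_out}  # ~ D: 0.39, E: 0.07, F: 0.43

    # Hidden node D: back-propagated error term and weight updates
    x = {"A": 1.0, "B": 0.0, "C": 1.0}                   # input values (C is the bias unit)
    delta_D = w_out["D"] * delta_G * h["D"] * (1.0 - h["D"])   # ~ 0.018
    updates_D = {k: mu * delta_D * x[k] for k in x}      # ~ A: 0.18, B: 0.0, C: 0.18

    print(delta_G, updates_G)
    print(delta_D, updates_D)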

some additional aspects

Initialization of Weights

Weights are initialized randomly, e.g., uniformly from the interval [-0.01, 0.01]

Glorot and Bengio (2010) suggest
– for shallow neural networks

    [ -1/√n , 1/√n ]

  where n is the size of the previous layer
– for deep neural networks

    [ -√6 / √(n_j + n_{j+1}) , √6 / √(n_j + n_{j+1}) ]

  where n_j is the size of the previous layer and n_{j+1} the size of the next layer
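A minimal sketch of these two initialization ranges, assuming numpy (function names are illustrative):

    import numpy as np

    def init_shallow(n_prev, n_next):
        """Uniform in [-1/sqrt(n), 1/sqrt(n)], n = size of the previous layer."""
        limit = 1.0 / np.sqrt(n_prev)
        return np.random.uniform(-limit, limit, size=(n_next, n_prev))

    def init_glorot(n_prev, n_next):
        """Uniform in [-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]."""
        limit = np.sqrt(6.0) / np.sqrt(n_prev + n_next)
        return np.random.uniform(-limit, limit, size=(n_next, n_prev))

    W = init_glorot(200, 200)   # e.g., a 200 x 200 weight matrix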

Neural Networks for Classification

Predict class: one output node per class

Training data output: "one-hot vector", e.g., y = (0, 0, 1)^T

Prediction
– predicted class is the output node y_i with the highest value
– obtain a posterior probability distribution by softmax:

    softmax(y_i) = e^{y_i} / Σ_j e^{y_j}
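A minimal softmax sketch, assuming numpy; the max-subtraction is a standard numerical-stability trick not mentioned on the slide:

    import numpy as np

    def softmax(y):
        """softmax(y_i) = exp(y_i) / sum_j exp(y_j)."""
        z = np.exp(y - np.max(y))   # subtract the max for numerical stability
        return z / z.sum()

    scores = np.array([1.2, 0.3, 2.5])
    probs = softmax(scores)
    print(probs, probs.argmax())    # probabilities sum to 1; argmax is the predicted class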

Problems with Gradient Descent Training

[Figure: error(λ) plotted over λ.]

Too high learning rate

Problems with Gradient Descent Training

[Figure: error(λ) plotted over λ.]

Bad initialization

Problems with Gradient Descent Training

[Figure: error(λ) plotted over λ, with a local optimum and the global optimum marked.]

Local optimum

Speedup: Momentum Term

Updates may move a weight slowly in one direction.

To speed this up, we can keep a memory of prior updates

    Δw_{j←k}(n-1) ...

and add these to any new updates (with decay factor ρ):

    Δw_{j←k}(n) = µ δ_j h_k + ρ Δw_{j←k}(n-1)
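A minimal sketch of a momentum update for a single weight (variable names and the learning-rate and decay values are illustrative):

    def momentum_update(delta_j, h_k, prev_update, mu=0.01, rho=0.9):
        """New update = mu * delta_j * h_k + rho * previous update."""
        return mu * delta_j * h_k + rho * prev_update

    # Usage: keep the previous update per weight and roll it forward
    update = 0.0
    for delta_j, h_k in [(0.04, 0.9), (0.05, 0.8), (0.03, 0.95)]:
        update = momentum_update(delta_j, h_k, update)
        # w_jk += update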

Adagrad

Typically reduce the learning rate µ over time
– at the beginning, things have to change a lot
– later, just fine-tuning

Adapting the learning rate per parameter:

Adagrad update, based on the gradient g_t of the error E with respect to the weight w at time t:

    g_t = dE/dw

    Δw_t = µ / √( Σ_{τ=1}^{t} g_τ^2 ) × g_t
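A minimal sketch of the Adagrad update for one parameter, accumulating the squared gradients over time (class and parameter names are illustrative):

    import math

    class AdagradWeight:
        """One weight with an Adagrad-scaled step size: mu / sqrt(sum of squared gradients)."""
        def __init__(self, value, mu=0.1):
            self.value = value
            self.mu = mu
            self.sum_sq = 0.0

        def update(self, gradient):
            """Move the weight against the gradient of the error."""
            self.sum_sq += gradient ** 2
            self.value -= self.mu / math.sqrt(self.sum_sq) * gradient
            return self.value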

Dropout

A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)

Solution: regularization, e.g., keeping weights from having extreme values

Dropout: randomly remove some hidden units during training
– mask: set of hidden units dropped
– randomly generate, say, 10–20 masks
– alternate between the masks during training

Why does that work? → bagging, ensembles, ...
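A minimal sketch of applying a random dropout mask to a vector of hidden unit values, assuming numpy; the drop rate is an illustrative choice:

    import numpy as np

    def dropout(h, drop_rate=0.2, rng=None):
        """Randomly zero out hidden units during training."""
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) >= drop_rate    # True = keep this unit
        return h * mask

    h = np.array([0.90, 0.17, 0.55, 0.30])
    print(dropout(h))    # some entries randomly set to 0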

Mini Batches

Each training example yields a set of weight updates Δw_i.

Batch up several training examples
– sum up their updates
– apply the sum to the model

Mostly done for speed reasons
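A minimal sketch of summing per-example updates over a mini-batch before applying them once; `compute_updates` is a hypothetical stand-in for whatever per-example gradient computation is used:

    import numpy as np

    def train_minibatch(W, batch, compute_updates, mu=0.1):
        """Sum the updates of all examples in the batch, then apply the sum once."""
        total = np.zeros_like(W)
        for x, t in batch:
            total += compute_updates(W, x, t)   # hypothetical per-example update
        return W + mu * total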

computational aspects

Vector and Matrix Multiplications

Forward computation: s = W h

Activation function: y = sigmoid(s)

Error term: δ = (t - y) · sigmoid'(s)

Propagation of error term: δ_i = W δ_{i+1} · sigmoid'(s)

Weight updates: ΔW = µ δ h^T
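A minimal numpy sketch of these operations for one sigmoid layer. The transpose in the error propagation routes each error term back along its weights; this is an assumption about the notation, since the slide writes the propagation compactly.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def layer_forward(W, h):
        s = W @ h                                 # forward computation: s = W h
        return sigmoid(s), s                      # activation y = sigmoid(s)

    def output_delta(t, y, s):
        return (t - y) * sigmoid(s) * (1 - sigmoid(s))        # δ = (t - y) · sigmoid'(s)

    def backprop_delta(W, delta_next, s):
        return (W.T @ delta_next) * sigmoid(s) * (1 - sigmoid(s))

    def weight_update(delta, h, mu=0.1):
        return mu * np.outer(delta, h)            # ΔW = µ δ h^T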

GPU

Neural network layers may have, say, 200 nodes.

Computations such as W h require 200 × 200 = 40,000 multiplications.

Graphics Processing Units (GPUs) are designed for such computations
– image rendering requires such vector and matrix operations
– massively multi-core, but lean processing units
– example: NVIDIA Tesla K20c GPU provides 2496 thread processors

Extensions to C to support programming of GPUs, such as CUDA

Toolkits

Theano
Tensorflow (Google)
PyTorch (Facebook)
MXNet (Amazon)
DyNet
