Neural Networks Language Models - MT Class


Neural Networks Language Models
Philipp Koehn
1 October 2020

N-Gram Backoff Language Model

- Previously, we approximated
  p(W) = p(w_1, w_2, ..., w_n)
- ... by applying the chain rule
  p(W) = ∏_i p(w_i | w_1, ..., w_{i-1})
- ... and limiting the history (Markov order)
  p(w_i | w_1, ..., w_{i-1}) ≈ p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
- Each p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
  → we back off to p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), p(w_i | w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
  - exact details of backing off get complicated: "interpolated Kneser-Ney"
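
A minimal sketch of this kind of estimation in Python, simplified to trigram order and to interpolation with fixed weights rather than Kneser-Ney discounting; the toy corpus and the lambda weights are illustrative assumptions, not part of the slides.

```python
# Interpolation with fixed weights (not Kneser-Ney): maximum-likelihood
# n-gram estimates combined at trigram, bigram, and unigram order.
from collections import Counter

corpus = "the cat ate the fish and the dog ate the bone".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(unigrams.values())

def p_interp(w, hist, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated trigram/bigram/unigram probability of w given its history."""
    h1, h2 = hist[-2], hist[-1]
    p3 = trigrams[(h1, h2, w)] / bigrams[(h1, h2)] if bigrams[(h1, h2)] else 0.0
    p2 = bigrams[(h2, w)] / unigrams[h2] if unigrams[h2] else 0.0
    p1 = unigrams[w] / total
    l3, l2, l1 = lambdas
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("fish", ["ate", "the"]))   # p(fish | ate the)
```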

Refinements

- A whole family of back-off schemes
- Skip-n-gram models that may back off to, e.g., p(w_i | w_{i-2})
- Class-based models: p(C(w_i) | C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))
- We are wrestling here with
  - using as much relevant evidence as possible
  - pooling evidence between words

First Sketch

[Figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed a hidden layer h (FF), and a softmax layer predicts the output word w_i.]

Representing Words

- Words are represented with a one-hot vector, e.g.,
  - dog = (0,0,0,0,1,0,0,0,0,...)
  - cat = (0,0,0,0,0,0,0,1,0,...)
  - eat = (0,1,0,0,0,0,0,0,0,...)
- That's a large vector!
- Remedies
  - limit to, say, 20,000 most frequent words, rest are OTHER
  - place words in classes, so each word is represented by one class label and one word-in-class label
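
A minimal sketch of one-hot coding with a capped vocabulary; the 20,000-word limit comes from the slide, while the toy corpus and the handling of the OTHER token are assumptions.

```python
# One-hot vectors over a vocabulary limited to the most frequent words;
# everything else maps to OTHER.
from collections import Counter
import numpy as np

def build_vocab(tokens, max_size=20000):
    """Keep the most frequent words; reserve one slot for OTHER."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_size - 1)]
    vocab = {w: i for i, w in enumerate(most_common)}
    vocab["OTHER"] = len(vocab)
    return vocab

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.get(word, vocab["OTHER"])] = 1.0
    return vec

tokens = "the cat ate the fish".split()
vocab = build_vocab(tokens)
print(one_hot("cat", vocab))      # one-hot vector for a known word
print(one_hot("zebra", vocab))    # unknown word falls back to OTHER
```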

Word Classes for Two-Hot Representations

- WordNet classes
- Brown clusters
- Frequency binning
  - sort words by frequency
  - place them in order into classes
  - each class has the same token count
  - very frequent words have their own class
  - rare words share a class with many other words
- Anything goes: assign words randomly to classes
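
A minimal sketch of frequency binning: words are sorted by frequency and classes are filled until each holds roughly the same share of the token count. The toy corpus and the number of classes are assumptions.

```python
# Frequency binning: frequent words end up alone in early classes,
# rare words share the later classes.
from collections import Counter

def frequency_bins(tokens, num_classes=4):
    counts = Counter(tokens).most_common()           # sorted by frequency
    budget = sum(c for _, c in counts) / num_classes
    classes, current, filled = {}, 0, 0.0
    for word, count in counts:
        classes[word] = current
        filled += count
        if filled >= budget and current < num_classes - 1:
            current, filled = current + 1, 0.0
    return classes

tokens = ("the " * 50 + "cat " * 20 + "dog " * 20 + "ate fish bone ran sat").split()
print(frequency_bins(tokens))    # very frequent "the" gets its own class
```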

word embeddings

Add a Hidden Layer

[Figure: as in the first sketch, but each history word is first mapped to an embedding E w_j before the FF hidden layer h and the softmax output word w_i.]

- Map each word first into a lower-dimensional real-valued space
- Shared weight matrix E
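
A minimal sketch of the resulting feed-forward language model: a shared embedding matrix E, a tanh hidden layer, and a softmax over the vocabulary. All layer sizes and the random initialization are assumptions.

```python
# Feed-forward language model forward pass over a 4-word history.
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid, context = 1000, 30, 50, 4       # vocab, embedding, hidden, history size

E  = rng.normal(0, 0.1, (V, d_emb))              # shared embedding matrix
W1 = rng.normal(0, 0.1, (context * d_emb, d_hid))
W2 = rng.normal(0, 0.1, (d_hid, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(history_ids):
    """history_ids: indices of w_{i-4}, ..., w_{i-1}."""
    x = E[history_ids].reshape(-1)               # look up and concatenate embeddings
    h = np.tanh(x @ W1)                          # hidden layer
    return softmax(h @ W2)                       # distribution over the next word

p = forward([3, 17, 42, 7])
print(p.shape, p.sum())                          # (1000,) and probabilities sum to 1
```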

Details (Bengio et al., 2003)

- Add direct connections from the embedding layer to the output layer
- Activation functions
  - input → embedding: none
  - embedding → hidden: tanh
  - hidden → output: softmax
- Training
  - loop through the entire corpus
  - update weights based on the error between the predicted probabilities and the one-hot vector for the output word
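
A minimal sketch of that update signal: the cross-entropy between the predicted distribution and the one-hot vector of the observed output word, whose gradient with respect to the softmax inputs is simply the difference of the two vectors. The tiny toy distribution is an assumption.

```python
# Cross-entropy between a predicted distribution and a one-hot target.
import numpy as np

p_pred = np.array([0.1, 0.7, 0.2])         # predicted probabilities over a 3-word vocabulary
target = 1                                 # index of the observed output word
one_hot = np.eye(3)[target]

loss = -np.sum(one_hot * np.log(p_pred))   # here: -log p_pred[target]
grad_logits = p_pred - one_hot             # gradient w.r.t. the softmax inputs
print(loss, grad_logits)
```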

Word Embeddings

[Figure: a word is mapped by the embedding matrix C to its embedding vector.]

- By-product: embedding of a word into continuous space
- Similar contexts → similar embedding
- Recall: distributional semantics

Word Embeddings

[Figure: visualization of learned word embeddings.]

Word Embeddings

[Figure: another visualization of learned word embeddings.]

Are Word Embeddings Magic?

- Morphosyntactic regularities (Mikolov et al., 2013)
  - adjectives: base form vs. comparative, e.g., good, better
  - nouns: singular vs. plural, e.g., year, years
  - verbs: present tense vs. past tense, e.g., see, saw
- Semantic regularities
  - clothing is to shirt as dish is to bowl
  - evaluated on human judgment data of semantic similarities
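
A minimal sketch of the vector arithmetic behind such regularities: emb(better) - emb(good) + emb(year) should land closest to emb(years). The tiny hand-made vectors are assumptions, not trained embeddings.

```python
# Analogy by vector offset and cosine similarity over toy embeddings.
import numpy as np

emb = {
    "good":   np.array([0.9, 0.1, 0.0]),
    "better": np.array([0.9, 0.8, 0.0]),
    "year":   np.array([0.1, 0.1, 0.9]),
    "years":  np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["better"] - emb["good"] + emb["year"]
best = max((w for w in emb if w not in {"better", "good", "year"}),
           key=lambda w: cosine(query, emb[w]))
print(best)   # 'years' in this toy setup
```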

recurrent neural networks

Recurrent Neural Networks

[Figure: the first word w_1 is embedded and fed, together with an extra layer, into a tanh hidden layer; a softmax predicts the output word.]

- Start: predict second word from first
- Mystery layer with nodes all with value 1

Recurrent Neural Networks

[Figure: the hidden layer value is copied back and reused as additional input at the next time step.]

Recurrent Neural Networks

[Figure: the network unrolled over the history w_1, w_2, w_3: each word is embedded, the tanh hidden layer is copied forward from step to step, and a softmax predicts each output word.]
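
A minimal sketch of this recurrent forward pass: at each step, the embedding of the current word and a copy of the previous hidden state feed a tanh layer, and a softmax predicts the next word. Sizes and random weights are assumptions.

```python
# Recurrent language model forward pass, unrolled over a word sequence.
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 1000, 30, 50

E   = rng.normal(0, 0.1, (V, d_emb))      # embedding matrix
W_x = rng.normal(0, 0.1, (d_emb, d_hid))  # input weights
W_h = rng.normal(0, 0.1, (d_hid, d_hid))  # recurrent weights
W_o = rng.normal(0, 0.1, (d_hid, V))      # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(word_ids):
    h = np.zeros(d_hid)                   # initial hidden state
    outputs = []
    for w in word_ids:
        h = np.tanh(E[w] @ W_x + h @ W_h)     # copy of h re-enters here
        outputs.append(softmax(h @ W_o))      # distribution over the next word
    return outputs

probs = rnn_forward([3, 17, 42])
print(len(probs), probs[0].shape)         # 3 (1000,)
```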

Training

[Figure: the first word w_1 is fed through the embedding E w_t, the RNN hidden layer h_t, and a softmax; a cost compares the prediction y_t with the correct output word.]

- Process first training example
- Update weights with back-propagation

Training

[Figure: the same network applied to the second training example w_2, with the hidden state carried over from the previous step.]

- Process second training example
- Update weights with back-propagation
- And so on.
- But: no feedback to previous history

Back-Propagation Through Time

[Figure: the recurrent network unfolded over several time steps, with a cost computed at each output word.]

- After processing a few training examples, update through the unfolded recurrent neural network

Back-Propagation Through Time

- Carry out back-propagation through time (BPTT) after each training example
  - 5 time steps seems to be sufficient
  - network learns to store information for more than 5 time steps
- Or: update in mini-batches
  - process 10-20 training examples
  - update backwards through all examples
  - removes need for multiple steps for each training example
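
A minimal sketch of back-propagation through time on a scalar recurrent network with a squared-error loss at each step, unrolled over a 5-step window; the toy data and the choice of loss are assumptions.

```python
# BPTT on a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}).
import numpy as np

xs = np.array([0.5, -0.1, 0.3, 0.8, -0.4])   # inputs for 5 time steps
ys = np.array([0.2,  0.1, 0.4, 0.5, -0.2])   # targets
w_x, w_h = 0.7, 0.9

# forward pass, remembering hidden states for the backward pass
hs, h = [0.0], 0.0
for x in xs:
    h = np.tanh(w_x * x + w_h * h)
    hs.append(h)

# backward pass through the unrolled network (all 5 steps of the window)
grad_wx = grad_wh = 0.0
dh_next = 0.0                                # gradient flowing in from the future
for t in reversed(range(len(xs))):
    dh = 2 * (hs[t + 1] - ys[t]) + dh_next   # loss at step t plus future influence
    dz = dh * (1 - hs[t + 1] ** 2)           # through the tanh
    grad_wx += dz * xs[t]
    grad_wh += dz * hs[t]                    # uses the previous hidden state
    dh_next = dz * w_h                       # propagate to step t-1

print(grad_wx, grad_wh)
```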

long short term memory

Vanishing Gradients

- Error is propagated to previous steps
- Updates consider
  - prediction at that time step
  - impact on future time steps
- Vanishing gradient: propagated error disappears

Recent vs. Early History

- Hidden layer plays double duty
  - memory of the network
  - continuous space representation used to predict output words
- Sometimes only recent context important
  After much economic progress over the years, the country has
- Sometimes much earlier context important
  The country which has made much economic progress over the years still has

Long Short Term Memory (LSTM)

- Design quite elaborate, although not very complicated to use
- Basic building block: LSTM cell
  - similar to a node in a hidden layer
  - but: has an explicit memory state
- Output and memory state change depend on gates
  - input gate: how much new input changes the memory state
  - forget gate: how much of the prior memory state is retained
  - output gate: how strongly the memory state is passed on to the next layer
- Gates can be not just open (1) and closed (0), but slightly ajar (e.g., 0.2)

LSTM Cell

[Figure: an LSTM cell between the preceding layer X and the next layer Y: the input gate i, the forget gate, and the output gate o control the memory state m and the hidden output h, with connections from the LSTM layer at time t-1 to the LSTM layer at time t.]

LSTM Cell (Math)

- Memory and output values at time step t
  memory^t = gate_input × input^t + gate_forget × memory^{t-1}
  output^t = gate_output × memory^t
- Hidden node value h^t passed on to the next layer applies an activation function f
  h^t = f(output^t)
- Input computed as in a recurrent neural network node
  - given node values for the prior layer x^t = (x_1^t, ..., x_X^t)
  - given values for the hidden layer from the previous time step h^{t-1} = (h_1^{t-1}, ..., h_H^{t-1})
  - input value is a combination of matrix multiplication with weights w^x and w^h and an activation function g
    input^t = g( Σ_{i=1}^{X} w_i^x x_i^t + Σ_{i=1}^{H} w_i^h h_i^{t-1} )

Values for Gates

- Gates are very important
- How do we compute their value? With a neural network layer!
- For each gate a ∈ {input, forget, output}
  - weight matrix W^{xa} to consider node values in the previous layer x^t
  - weight matrix W^{ha} to consider the hidden layer h^{t-1} at the previous time step
  - weight matrix W^{ma} to consider the memory memory^{t-1} at the previous time step
  - activation function h
    gate_a = h( Σ_{i=1}^{X} w_i^{xa} x_i^t + Σ_{i=1}^{H} w_i^{ha} h_i^{t-1} + Σ_{i=1}^{H} w_i^{ma} memory_i^{t-1} )
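
A minimal sketch of one LSTM cell step that follows the slide equations: each gate is computed from x^t, h^{t-1}, and memory^{t-1}, and the memory and output updates then apply those gates. Dimensions, random weights, and the sigmoid/tanh choices for the activation functions h, g, and f are assumptions.

```python
# One LSTM cell step: gates from input, previous hidden value, and previous
# memory; then memory^t = gate_input * input^t + gate_forget * memory^{t-1},
# output^t = gate_output * memory^t, and h^t = f(output^t).
import numpy as np

rng = np.random.default_rng(0)
X, H = 4, 3                                   # input size, hidden/memory size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight-matrix triple per gate, plus the plain input transformation
W = {a: (rng.normal(0, 0.1, (X, H)),          # W^{xa}
         rng.normal(0, 0.1, (H, H)),          # W^{ha}
         rng.normal(0, 0.1, (H, H)))          # W^{ma}
     for a in ("input", "forget", "output")}
W_x, W_h = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H))

def lstm_step(x_t, h_prev, mem_prev):
    gate = {a: sigmoid(x_t @ Wxa + h_prev @ Wha + mem_prev @ Wma)
            for a, (Wxa, Wha, Wma) in W.items()}
    inp = np.tanh(x_t @ W_x + h_prev @ W_h)                  # input^t, as in a plain RNN node
    mem = gate["input"] * inp + gate["forget"] * mem_prev    # memory^t
    out = gate["output"] * mem                               # output^t
    return np.tanh(out), mem                                 # h^t, memory^t

h, mem = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H))
print(h, mem)
```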

Training

- LSTMs are trained the same way as recurrent neural networks
- Back-propagation through time
- This looks all very complex, but:
  - all the operations are still based on matrix multiplications and differentiable activation functions
  - we can compute gradients for the objective function with respect to all parameters
  - we can compute update functions

What is the Point?

(from Tran, Bisazza, Monz, 2016)

- Each node has a memory memory_i independent from the current output h_i
- Memory may be carried through unchanged (gate_i^input = 0, gate_i^forget = 1)
- → can remember important features over a long time span (capture long-distance dependencies)

Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"

Visualizing Individual Cells

[Figure: visualization of individual cell activations.]

Gated Recurrent Unit (GRU)

[Figure: a GRU cell between the preceding layer X and the next layer Y: an update gate and a reset gate control how the hidden state h is carried from the GRU layer at time t-1 to the GRU layer at time t.]

Gated Recurrent Unit (Math)

- Two gates
  update_t = g(W_update · input_t + U_update · state_{t-1} + bias_update)
  reset_t = g(W_reset · input_t + U_reset · state_{t-1} + bias_reset)
- Combination of input and previous state (similar to a traditional recurrent neural network)
  combination_t = f(W · input_t + U · (reset_t ∘ state_{t-1}))
- Interpolation with previous state
  state_t = (1 - update_t) ∘ state_{t-1} + update_t ∘ combination_t
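
A minimal sketch of one GRU step following these equations; sizes, random weights, and the sigmoid/tanh choices for g and f are assumptions.

```python
# One GRU step: update and reset gates, a candidate state, and the
# interpolation state_t = (1 - update) * state_{t-1} + update * combination.
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3                                    # input size, state size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_u, U_u, b_u = rng.normal(0, 0.1, (D, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_r, U_r, b_r = rng.normal(0, 0.1, (D, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W,   U        = rng.normal(0, 0.1, (D, H)), rng.normal(0, 0.1, (H, H))

def gru_step(x_t, state_prev):
    update = sigmoid(x_t @ W_u + state_prev @ U_u + b_u)
    reset  = sigmoid(x_t @ W_r + state_prev @ U_r + b_r)
    combination = np.tanh(x_t @ W + (reset * state_prev) @ U)
    return (1 - update) * state_prev + update * combination

state = gru_step(rng.normal(size=D), np.zeros(H))
print(state)
```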

deeper models

Deep Models

[Figure: "Shallow": a single hidden layer between the input word embedding E x_t and the softmax output.]

- Not much deep learning so far
- Between prediction from input to output: only 1 hidden layer
- How about more hidden layers?

Deep Models

[Figure: two deeper architectures, "Deep Stacked" and "Deep Transitional", each with several RNN hidden layers (h_{t,1}, h_{t,2}, h_{t,3}) between the input word embedding E x_i and the softmax output.]
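
A minimal sketch of the deep stacked variant: several recurrent layers, each taking the hidden state of the layer below at the same time step as its input. Sizes, random weights, and the three-layer depth are assumptions.

```python
# Deep stacked RNN forward pass: loop over time, then over the layer stack.
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 8, 16, 3                              # input size, hidden size, number of layers

layers = [{"W_x": rng.normal(0, 0.1, (D if l == 0 else H, H)),
           "W_h": rng.normal(0, 0.1, (H, H))}
          for l in range(L)]

def deep_rnn_forward(xs):
    hs = [np.zeros(H) for _ in range(L)]        # one hidden state per layer
    for x in xs:                                # loop over time steps
        inp = x
        for l, p in enumerate(layers):          # loop over the stack, bottom to top
            hs[l] = np.tanh(inp @ p["W_x"] + hs[l] @ p["W_h"])
            inp = hs[l]                         # feeds the next layer up
    return hs[-1]                               # top-layer state of the last step

top = deep_rnn_forward([rng.normal(size=D) for _ in range(5)])
print(top.shape)                                # (16,)
```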

questions?

