Natural Language Processing - University of California, Berkeley


Natural Language Processing
Info 159/259
Midterm review (Mar 9, 2021)
In-class questions: http://bit.ly/nlpqs
David Bamman, UC Berkeley

Midterm
- Thursday 3/11, 2:10-3:30pm PST on bCourses.
- This midterm must be carried out entirely independently! You are free to use any of your notes, readings, or lecture material; the midterm will cover all material through the 3/4 lecture on parsing.
- Questions will resemble what you have already experienced with the quizzes (e.g., multiple choice) along with some numerical answer/fill-in-the-blank questions as well. Have your calculator ready!
- You can expect more attention to the topics we cover in lectures and in the homeworks, but everything in the lectures and readings is fair game.

Big ideas
- Classification: Naive Bayes, logistic regression, feedforward neural networks, CNN, BERT
- Language modeling: Markov assumption, featurized, neural
- Probability/statistics in NLP: chain rule of probability, independence, Bayes' rule
- Where does NLP data come from? Annotation process, interannotator agreement

Big ideas
- Lexical semantics and word representations: distributional hypothesis, distributed representations, subword embedding models, contextualized word representations (ELMo/BERT)
- Evaluation metrics (accuracy, precision, recall, F score, perplexity, parseval)
- Sequence labeling: POS, NER; methods: HMM, MEMM, CRF, RNN, BiRNN, BERT
- Phrase-structure parsing: CFG, PCFG, trees, CKY for recognition and parsing

Big ideas
- What defines the models we've seen so far? What formally distinguishes an HMM from an MEMM? How do we train those models?
- For all of the problems we've seen (sentiment analysis, POS tagging, phrase-structure parsing), how do we evaluate the performance of different models?
- If faced with a new NLP problem, how would you decide between the alternatives you know about? How would you adapt an MEMM, for example, to a new problem?

Can we get a high-level overview of what distinguishes each neural network (ex: CNN, RNN, bidirectional, etc.) and an example/case of when we would use each?
- Feedforward NN (e.g., multi-layer perceptron). Inputs: fixed-dimensional feature vector (e.g., bag-of-words representation; not the original word sequence). Output: single prediction for that sequence.
- CNN. Inputs: original word sequence. Performs the same action (convolution) over each window in the original sequence. Learns to identify important ngrams in the original sequence. Can be used for document classification (by adding a softmax layer on top), or to generate a representation for the entire sequence.

Can we get a high-level overview of what distinguishes each neural network (ex: CNN, RNN, bidirectional, etc.) and an example/case of when we would use each?
- RNN. Inputs: original word sequence. Generates a representation for each token in a sequence that is aware of words in the left context (forward RNN), right context (backward RNN), or both contexts (bidirectional RNN). Information from word i needs to pass through the hidden states for words [i+1, ..., j-1] to influence word j. Can be used for document classification (by adding a softmax layer on top of the final word representation), sequence labeling (by adding a softmax layer on top of each token representation), or generating representations for each token in the sequence.
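To make the "information has to flow through the intermediate hidden states" point concrete, here is a minimal numpy sketch of a forward (left-to-right) RNN; the function name, dimensions, and random weights are all invented for illustration, not taken from the course materials.

```python
import numpy as np

def forward_rnn(embeddings, W_xh, W_hh, b):
    """Minimal Elman-style forward RNN: each hidden state depends on the
    current word and the previous hidden state, so information about word i
    can only reach word j by passing through the states in between."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in embeddings:                      # left-to-right over the sequence
        h = np.tanh(W_xh @ x + W_hh @ h + b)  # new state mixes current input and old state
        states.append(h)
    return states                             # one representation per token

# toy example: 5 tokens, 4-dim embeddings, 3-dim hidden states (all values random)
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
states = forward_rnn(emb, rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
print(len(states), states[-1].shape)          # 5 tokens -> 5 hidden states of size 3
```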

Can we get a high-level overview of what distinguishes each neural network (ex: CNN, RNN, bidirectional, etc.) and an example/case of when we would use each?
- Transformer. Inputs: original word sequence. Generates a representation for each token in a sequence, for each layer in the transformer, by using attention over all words in the previous layer. The representation for word i at layer k has simultaneous access to all word representations at layer k-1 (representations are not diluted as they are for an RNN). Can be used for document classification (by adding a softmax layer on top of the final word representation), sequence labeling (by adding a softmax layer on top of each token representation), or generating representations for each token in the sequence.

Can we get a high-level overview of what distinguishes each neural network (ex: CNN, RNN, bidirectional, etc.) and an example/case of when we would use each?

Summary:
- Feedforward NN (e.g., MLP): input is a fixed-dimensional feature vector; representation for each token: no; representation for entire input: yes (e.g., the hidden layer); common tasks: document classification.
- CNN: input is the original sequence; representation for each token: yes; representation for entire input: yes (e.g., concatenation of all filters after max pooling); common tasks: document classification, representation learning.
- RNN: input is the original sequence; representation for each token: yes; representation for entire input: yes (e.g., the hidden state for the last token in the sequence); common tasks: document classification, sequence labeling, representation learning.
- BERT: input is the original sequence; representation for each token: yes; representation for entire input: yes (e.g., the [CLS] token); common tasks: document classification, sequence labeling, representation learning.

BiLSTM for each word: concatenate the final state of the forward LSTM, the final state of the backward LSTM, and the word embedding as the representation for a word.

[Figure: character BiLSTM over the characters of "bigly", concatenated with its word embedding, used to predict a label (RB)]

Lample et al. (2016), "Neural Architectures for Named Entity Recognition"

Character CNN for each word: concatenate the character CNN output and the word embedding as the representation for a word.

[Figure: character CNN with max pooling over the characters of "bigly", concatenated with its word embedding, used to predict a label (RB)]

Chiu and Nichols (2016), "Named Entity Recognition with Bidirectional LSTM-CNNs"

Can you summarize when to use each model (classification, language modeling, etc.), and which are discriminative vs. generative?
- Classification: predict a label for each document.
- Language modeling: applications that need information about fluency (autocorrect, translation, speech recognition, OCR); also a general framework for learning representations of words.
- Word representations: learn representations of words (symbols generally) that are sensitive to their context of use, both aggregate context (e.g., word2vec) and local sentence context (e.g., BERT).
- Sequence labeling: predict a label for each token.
- Parsing: predict syntactic structure (generally, any tree structure).

Generative vs. discriminative models
- Generative models specify a joint distribution over the labels and the data; with this you could generate new data: P(X, Y) = P(Y) P(X | Y).
- Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes: P(Y | X).

Naive Bayes:
P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y' P(Y = y') P(X = x | Y = y')

Binary logistic regression:
P(y = 1 | x, β) = exp(Σ_{i=1}^n x_i β_i) / (1 + exp(Σ_{i=1}^n x_i β_i))

HMM:
P(x_1, ..., x_n, y_1, ..., y_n) = Π_{i=1}^n P(y_i | y_{i-1}) Π_{i=1}^n P(x_i | y_i)

MEMM:
P(y | x, β) = Π_{i=1}^n P(y_i | y_{i-1}, x, β)

CRF:
P(y | x, β) = exp(Φ(x, y)·β) / Σ_{y' ∈ Y} exp(Φ(x, y')·β)
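As a concrete illustration of the generative/discriminative contrast behind these formulas, here is a small Python sketch (my own illustration under simple assumptions, not course code): the Naive Bayes posterior is assembled from a class prior and per-class word counts, while logistic regression directly parameterizes P(y | x, β).

```python
import numpy as np
from collections import Counter

def naive_bayes_posterior(doc_tokens, class_docs, alpha=1.0):
    """Generative: P(y | x) is proportional to P(y) * product_i P(x_i | y),
    with add-alpha smoothing. class_docs maps each label to a list of
    training documents (token lists)."""
    vocab = {w for docs in class_docs.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in class_docs.values())
    log_joint = {}
    for y, docs in class_docs.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        lp = np.log(len(docs) / n_docs)                       # log P(y)
        for w in doc_tokens:                                  # sum_i log P(x_i | y)
            lp += np.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        log_joint[y] = lp
    z = np.logaddexp.reduce(list(log_joint.values()))         # normalize over labels
    return {y: float(np.exp(lp - z)) for y, lp in log_joint.items()}

def logreg_prob(x, beta):
    """Discriminative: P(y = 1 | x, beta) = exp(x.beta) / (1 + exp(x.beta))."""
    return 1.0 / (1.0 + np.exp(-x @ beta))

train = {"pos": ["loved this movie".split(), "great movie".split()],
         "neg": ["worst movie ever".split()]}
print(naive_bayes_posterior("loved this".split(), train))
```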

Can you go over BERT vs. ELMo, and how attention works?

ELMo vs. BERT
- ELMo: a stacked BiRNN trained to predict the next word in language modeling (Peters et al. 2018).
- BERT: a Transformer-based model trained to predict a masked word using bidirectional context, plus next sentence prediction (Devlin et al. 2019).

ELMo
- Peters et al. (2018), "Deep Contextualized Word Representations" (NAACL)
- Big idea: transform the representation of a word (e.g., from a static word embedding) to be sensitive to its local context in a sentence and optimized for a specific NLP task.
- Output word representations can be plugged into just about any architecture where a word embedding can be used.

ELMo
- Train a bidirectional RNN language model with L layers on a bunch of text.
- Learn parameters to combine the RNN output across all layers for each word in a sentence for a specific task (NER, semantic role labeling, question answering, etc.).
- Large improvements over SOTA for lots of NLP problems.
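The "learn parameters to combine the RNN output across all layers" step can be sketched roughly like this (a toy numpy illustration of the layer-weighting idea in Peters et al. 2018; the variable names and dimensions are invented):

```python
import numpy as np

def elmo_style_mix(layer_states, s_logits, gamma):
    """Combine per-layer biLM outputs for one token into a single task-specific
    vector: softmax-normalized layer weights s, scaled by a learned scalar gamma."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                               # softmax over the layers
    return gamma * sum(w * h for w, h in zip(s, layer_states))

# e.g., 3 layers (token embedding + 2 biLM layers), each a 6-dimensional vector
layers = [np.random.randn(6) for _ in range(3)]
print(elmo_style_mix(layers, s_logits=np.zeros(3), gamma=1.0))  # equal weights here
```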

BERT
- Transformer-based model (Vaswani et al. 2017) trained to predict a masked word using bidirectional context, plus next sentence prediction.
- Generates multiple layers of representations for each token, sensitive to its context of use.

Each token in the input starts out represented by token and position embeddings.

[Figure: input embeddings e_{1,1}, e_{1,2}, e_{1,3} for "The dog barked"]

The value for time step j at layer i is the result of attention over all time steps in the previous layer.


At the end of this process, we have one representation for each layer for each token in "The dog barked".

BERT
Learn the parameters of this model with two objectives:
- Masked language modeling
- Next sentence prediction

Masked LM
- Mask one word from the input and try to predict that word as the output.
- More powerful than an RNN LM (or even a BiRNN LM) since it can reason about context on both sides of the word being predicted.
- A BiRNN models context on both sides, but each RNN only has access to information from one direction.
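A simplified sketch of how masked-LM training data could be produced (this only does the plain [MASK] replacement; BERT's actual recipe also sometimes keeps the word or substitutes a random one, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Pick roughly mask_prob of the positions, replace each chosen token with
    [MASK], and remember the original word as the prediction target."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            corrupted[i] = mask_token
    return corrupted, targets

print(mask_tokens("the dog barked at the mail carrier".split(), seed=1))
```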

[Figure: masked language modeling; the final-layer representations at the masked positions are used to predict the original tokens "dog" and "bark"]

BERT
- Deep layers (12 for BERT base, 24 for BERT large)
- Large representation sizes (768 per layer)
- Pretrained on English Wikipedia (2.5B words) and BooksCorpus (800M words)

Attention
Let's incorporate structure (and parameters) into a network that captures which elements in the input we should be attending to (and which we can ignore).

v ∈ ℝ^H

Define v to be a vector to be learned; think of it as an "important word" vector. The dot product here measures how similar each input vector is to that "important word" vector.

[Figure: input vectors x_1, ..., x_5 for the tokens "I loved the movie !"]

Compute one score per input by dotting with v: r_1 = v·x_1, r_2 = v·x_2, r_3 = v·x_3, r_4 = v·x_4, r_5 = v·x_5 (e.g., r = [-3.4, 2.4, -0.8, -1.2, 1.7] for "I loved the movie !").

Convert r into a vector of normalized weights that sum to 1: a = softmax(r) (e.g., a ≈ [0, 0.64, 0.02, 0.02, 0.32] for the scores above).

The attended representation is the weighted sum of the inputs: y = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5 (e.g., y = [1.9, -0.2, -1.1, -0.2, -0.7] for the running example).
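Putting the three steps above together (scores r from dot products with v, weights a = softmax(r), output = weighted sum), a minimal numpy sketch might look like this; the values are random stand-ins for the word vectors in the figure:

```python
import numpy as np

def simple_attention(X, v):
    """Attention with a single learned 'important word' vector v:
    score each input by its dot product with v, softmax the scores,
    and return the attention-weighted sum of the inputs."""
    r = X @ v                        # r_i = v . x_i, one score per token
    a = np.exp(r - r.max())
    a = a / a.sum()                  # a = softmax(r): non-negative, sums to 1
    return a, X.T @ a                # y = sum_i a_i x_i

# 5 token vectors of dimension H = 5, standing in for "I loved the movie !"
X, v = np.random.randn(5, 5), np.random.randn(5)
weights, pooled = simple_attention(X, v)
print(weights.round(2), pooled.shape)   # weights sum to 1; pooled is one H-dim vector
```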

Attention
Lots of variations on attention:
- A linear transformation of x before dotting with v.
- Non-linearities after each operation.
- "Multi-head attention": multiple v vectors to capture different phenomena that can be attended to in the input.
- Hierarchical attention (a sentence representation formed with attention over words; a document representation formed with attention over sentences).

For CNNs, what is the purpose of max pooling and global pooling?
- Max pooling down-samples a layer by selecting a single point from some set.
- Max pooling over time (global max pooling) selects the largest value over an entire sequence. This provides a mechanism to get a fixed-dimensional representation for variable-sized inputs.
- Global max pooling is very common for NLP problems.

For CNNs, what is the purpose of max pooling and global pooling?

Here's the view for one filter with a window size of 3 over the 9-token input "This movie was the worst movie ever ! !": the convolution produces one value per window, c_1 = convolution(x_1, x_2, x_3) = 3, c_2 = 4, c_3 = 1, c_4 = 0, c_5 = 5, c_6 = 9, c_7 = 11. Let's assume we make a prediction (e.g., for sentiment) from this one filter alone: g = global max pooling(c_1, ..., c_7) = 11, and P(y | x) = sigmoid(gW), with W ∈ ℝ^{1×1}.

For CNNs, what is the purpose of max pooling and global pooling?

Let's say we didn't use global max pooling: with the same seven convolution outputs c_1, ..., c_7 as input to the classifier, P(y | x) = sigmoid(gW) now needs W ∈ ℝ^{7×1}, so the parameter size depends on the length of the input.

For CNNs, what is the purpose of max pooling and global pooling?

Let's say we didn't use global max pooling: the 3-token input "Loved this movie" yields a single convolution output (c_1 = 3), while the 6-token input "Totally the worst movie ever !" yields four (c_1 = 3, c_2 = 4, c_3 = 1, c_4 = 0), so W ∈ ℝ^{?×1}; no single W fits every input length. (A minimal sketch of the pooled version follows.)
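Here is what global max pooling buys you, as a rough numpy sketch (one hypothetical filter with window size 3 and random weights): sentences of different lengths produce different numbers of convolution outputs, but after max pooling over time every sentence yields the same fixed-size output.

```python
import numpy as np

def conv_global_max_pool(X, w, b, window=3):
    """One convolutional filter over word embeddings, then global max pooling:
    one value per window, reduced to a single number for the whole sentence."""
    feats = [np.maximum(w @ X[i:i + window].reshape(-1) + b, 0.0)   # conv + ReLU
             for i in range(len(X) - window + 1)]
    return max(feats)                                               # max over time

emb_dim, window = 4, 3
w, b = np.random.randn(window * emb_dim), 0.0
short = np.random.randn(5, emb_dim)    # 5-token sentence -> 3 windows
long = np.random.randn(12, emb_dim)    # 12-token sentence -> 10 windows
print(conv_global_max_pool(short, w, b), conv_global_max_pool(long, w, b))  # one scalar each
```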

Could you go over these concepts in terms of their similarities and differences (skip-gram, masked LM, gappy n-gram)? I occasionally get those confused.
- Skip-gram: word2vec model for learning representations of word types.
- Masked language model: "masking" out a word in an input sequence and predicting it in order to learn word representations (BERT).
- Gappy ngrams (aka "skip ngrams"): a feature based on a set of words that may have gaps between them.

"It was a dark and stormy night"

Bigram features: It was, was a, a dark, dark and, and stormy, stormy night

Skip bigram features: the bigrams above plus It a, It dark, It and, It stormy, It night, was dark, was and, was stormy, was night, dark stormy, dark night, and night
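A small sketch of how gappy (skip) bigram features could be extracted; the exact definition varies (maximum gap size, whether adjacent pairs are included), so treat the max_gap parameter here as an illustrative choice rather than the course's definition:

```python
def bigrams(tokens):
    """Adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

def skip_bigrams(tokens, max_gap=5):
    """Ordered word pairs with up to max_gap words between them
    (gap 0 gives back the ordinary bigrams)."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + max_gap, len(tokens)))]

sent = "It was a dark and stormy night".split()
print(bigrams(sent))
print(skip_bigrams(sent))
```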

Could you go over, generally, what transformers are and what makes them so powerful?
- Transformers provide simultaneous access to all other tokens in a sequence when generating a representation (unlike RNNs).
- Attention as a mechanism allows each word to learn what in its context is important to pay attention to.
- When used for language modeling, they provide access to context on both sides of the word being predicted (allowing for masked language modeling as an objective).

Could you go over dropout in regularization?
- Let's assume a FFNN (like an MLP) with a single hidden layer.
- Dropout removes nodes during training to encourage the model to not rely on them.
- Only active during training, not at test time.

[Figure: an MLP over binary features (e.g., contains "like") with different hidden units dropped out on different training passes]
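A minimal sketch of dropout on a hidden layer (this uses the common "inverted dropout" scaling so nothing changes at test time; whether the course's examples scale at train or test time isn't specified here):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """During training, zero each hidden unit with probability p and rescale the
    survivors so the expected activation is unchanged; do nothing at test time."""
    if not train:
        return h
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p          # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))  # some units zeroed, rest scaled up
print(dropout(h, train=False))                           # unchanged at test time
```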

What's the difference between HMM, MEMM, and CRF? And when do we know to use which one?
- Hidden Markov Model: form Π_{i=1}^N P(x_i | y_i) P(y_i | y_{i-1}); label dependency on the previous label; no rich features.
- MEMM: form Π_{i=1}^N P(y_i | y_{i-1}, x, β); label dependency on the previous label; rich features.
- CRF: form P(y | x, β), normalized over the entire label sequence; label dependencies pairwise through the entire sequence; rich features.
- RNN: form Π_{i=1}^N P(y_i | x_{1:i}, β); no label dependency; distributed representations.

Can you go over the differences in the intuitions and implementations of Laplace and Kneser-Ney smoothing?

Laplace smoothing (α = 1):
P(w_i) = (c(w_i) + α) / (N + αV)
P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + α) / (c(w_{i-1}) + αV)
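A quick sketch of the add-α (Laplace) bigram estimate in Python, with a tiny made-up corpus; every unseen bigram still gets non-zero probability:

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V, alpha=1.0):
    """P(w | w_prev) = (c(w_prev, w) + alpha) / (c(w_prev) + alpha * V)."""
    return (bigram_counts[(w_prev, w)] + alpha) / (unigram_counts[w_prev] + alpha * V)

tokens = "the dog barked at the dog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)
print(laplace_bigram_prob("the", "dog", bigrams, unigrams, V))   # seen bigram: (2+1)/(2+4)
print(laplace_bigram_prob("the", "cat", bigrams, unigrams, V))   # unseen: (0+1)/(2+4)
```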

Kneser-Ney smoothing

Intuition: when backing off to a lower-order ngram, maybe the overall ngram frequency is not our best guess.

"I can't see without my reading ___": P("Francisco") vs. P("glasses")

Francisco is more frequent, but shows up in fewer unique bigrams ("San Francisco"); so we shouldn't expect it in new contexts. Glasses, however, does show up in many different bigrams.

Kneser-Ney smoothing

P_KN(w_i | w_{i-1}) = max{c(w_{i-1}, w_i) - d, 0} / c(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)

The first term is the discounted bigram probability; λ(w_{i-1}) carries the discounted mass; P_CONTINUATION is the continuation probability.

Kneser-Ney smoothing

w_{i-1}    w_i      c(w_{i-1}, w_i)    c(w_{i-1}, w_i) - d  (d = 1)
red        hook     3                  2
red        car      2                  1
red        watch    10                 9
sum(red)            15                 12

12/15 of the probability mass stays with the original counts; 3/15 is reallocated.
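A rough Python sketch of the interpolated Kneser-Ney bigram estimate implied by the formula and table above (textbook form, not a production language model; it assumes w_prev has been seen and uses a fixed discount d):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(w_prev, w, bigram_counts, d=0.75):
    """Discount every seen bigram count by d; hand the freed-up mass to a
    continuation probability that asks how many distinct bigram types w
    completes (so 'glasses' beats 'Francisco' in novel contexts)."""
    left_contexts = defaultdict(set)              # w -> set of words seen before it
    for (u, v) in bigram_counts:
        left_contexts[v].add(u)
    c_prev = sum(c for (u, _), c in bigram_counts.items() if u == w_prev)
    types_after_prev = sum(1 for (u, _) in bigram_counts if u == w_prev)

    discounted = max(bigram_counts[(w_prev, w)] - d, 0.0) / c_prev
    lam = d * types_after_prev / c_prev           # reallocated (discounted) mass
    p_continuation = len(left_contexts[w]) / len(bigram_counts)
    return discounted + lam * p_continuation

corpus = "I can't see without my reading glasses and my new reading glasses".split()
bigrams = Counter(zip(corpus, corpus[1:]))
print(kneser_ney_bigram("reading", "glasses", bigrams))
```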

Can you go over the information-theoretic view and how it relates to language modeling?

Y = "One morning I shot an elephant in my pajamas" → encode(Y) → decode(encode(Y)) (Shannon 1948)

Noisy channel

        X                  Y
ASR     speech signal      transcription
MT      source text        target text
OCR     pixel densities    transcription

P(Y | X) ∝ P(X | Y) P(Y), where P(X | Y) is the channel model and P(Y) is the source model.
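To make the decomposition concrete, here is a toy noisy-channel decoder for spelling correction; everything here (the candidate list, the crude edit-penalty channel model, and the tiny language-model scores) is made up for illustration.

```python
def noisy_channel_decode(x, candidates, channel_logprob, source_logprob):
    """Pick the Y maximizing P(X | Y) P(Y): channel model times source model,
    scored in log space."""
    return max(candidates, key=lambda y: channel_logprob(x, y) + source_logprob(y))

# toy spelling-correction example with invented scores
lm_scores = {"the dog barked": -2.0, "the dig barked": -9.0}     # hypothetical source model
def source_logprob(y):
    return lm_scores.get(y, -20.0)

def channel_logprob(x, y):
    # crude channel model: penalize each character mismatch (purely illustrative)
    edits = sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))
    return -1.5 * edits

observed = "the dig barked"
print(noisy_channel_decode(observed, ["the dog barked", "the dig barked"],
                           channel_logprob, source_logprob))     # -> "the dog barked"
```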

