Attention And Transformers Lecture 11 - Stanford University


Lecture 11: Attention and Transformers
Fei-Fei Li, Ranjay Krishna, Danfei Xu
May 06, 2021

Administrative: Midterm
- Midterm was this Tuesday.
- We will be grading this week and you should have grades by next week.

Administrative: Assignment 3
- A3 is due Friday May 25th, 11:59pm.
- Lots of applications of ConvNets.
- Also contains an extra credit notebook, which is worth an additional 5% of the A3 grade.
- Extra credit will not be used when curving the class grades.

Last Time: Recurrent Neural Networks

Last Time: Variable-length computation graph with shared weights
[Figure: unrolled RNN reusing the same weights W at every step, with hidden states h_0 ... h_T, inputs x_1 ... x_T, outputs y_1 ... y_T, and per-step losses L_1 ... L_T summed into L]

Let's jump to lecture 10, slide 43.

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Image Captioning using spatial features
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Extract spatial features from a pretrained CNN: a grid of features z (shape H x W x D), written z_{0,0} ... z_{2,2}.
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Image Captioning using spatial features
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Encoder: h_0 = f_W(z), where z is the spatial CNN features and f_W(.) is an MLP.
The H x W x D features from the pretrained CNN are passed through the MLP to produce the initial decoder state h_0.

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c), where the context vector c is often c = h_0.
At the first step, the decoder takes y_0 = [START], h_0, and c, and predicts the first word y_1 = "person".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
Unrolling the decoder: y_1 = "person", then y_2 = "wearing".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
Unrolling one more step: y_1 = "person", y_2 = "wearing", y_3 = "hat".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
The full caption "person wearing hat" is generated one word per time step, with the same context vector c fed in at every step.

Image Captioning using spatial features
Problem: the input is "bottlenecked" through c.
- The model needs to encode everything it wants to say within c.
This is a problem if we want to generate really long descriptions, e.g. hundreds of words long.

Image Captioning with RNNs & Attention
Attention idea: compute a new context vector at every time step. Each context vector will attend to different image regions.
(Attention is reminiscent of saccades in humans.)
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Image Captioning with RNNs & Attention
Compute alignment scores (scalars): e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), where f_att(.) is an MLP.
The scores form an H x W grid (e_{1,0,0} ... e_{1,2,2}), one per spatial feature.

Image Captioning with RNNs & Attention
Compute alignment scores (scalars): e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), where f_att(.) is an MLP.
Normalize with a softmax to get attention weights (H x W): 0 < a_{t,i,j} < 1, and the attention values sum to 1.

Image Captioning with RNNs & Attention
Compute alignment scores e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), normalize with a softmax to get attention weights a_{t,i,j} (0 < a_{t,i,j} < 1, summing to 1), then compute the context vector as a weighted sum of the features: c_t = Σ_{i,j} a_{t,i,j} z_{i,j}.
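
A minimal NumPy sketch of one such attention step over the CNN feature grid, assuming a toy two-layer MLP for f_att (the hidden size, the random features, and the random weights are illustrative stand-ins, not the lecture's model):

```python
import numpy as np

H, W, D = 3, 3, 512            # spatial grid of CNN features, as in the slides
rng = np.random.default_rng(0)

z = rng.standard_normal((H, W, D))   # features z_{i,j}
h = rng.standard_normal(D)           # previous decoder hidden state h_{t-1}

# f_att(h, z_ij): a small MLP on the concatenated [h, z_ij] (toy choice)
W1 = rng.standard_normal((2 * D, 128)) * 0.01
W2 = rng.standard_normal((128, 1)) * 0.01

def f_att(h, z_ij):
    hidden = np.tanh(np.concatenate([h, z_ij]) @ W1)
    return (hidden @ W2).item()          # scalar alignment score e_{t,i,j}

e = np.array([[f_att(h, z[i, j]) for j in range(W)] for i in range(H)])

# softmax over all H*W positions -> attention weights a_{t,i,j}
a = np.exp(e - e.max())
a /= a.sum()

# context vector: weighted sum of the spatial features
c = (a[..., None] * z).sum(axis=(0, 1))   # shape (D,)
print(a.sum(), c.shape)                   # 1.0 (512,)
```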

Image Captioning with RNNs & Attention
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector at every time step.
Each time step of the decoder uses a different context vector that looks at different parts of the input image.
Here c_1 and y_0 = [START] produce h_1 and the first word y_1 = "person".

Image Captioning with RNNs & Attention
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector at every time step.
Using h_1, recompute the alignment scores and attention weights over the feature grid to form the next context vector c_2.

Image Captioning with RNNs & Attention
Each time step of the decoder uses a different context vector that looks at different parts of the input image: c_1 leads to "person", c_2 leads to "wearing".

Image Captioning with RNNs & Attention
Continuing the unrolling: c_3 attends to new image regions and the decoder predicts y_3 = "hat".

Image Captioning with RNNs & Attention
The decoder finishes with c_4 and predicts y_4 = [END], completing the caption "person wearing hat".

Image Captioning with RNNs & Attention
This entire process is differentiable:
- the model chooses its own attention weights; no attention supervision is required.

Image Captioning with Attention
Soft attention vs. hard attention (hard attention requires reinforcement learning).
Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention
Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Attention can detect Gender Bias
All images are CC0 Public domain.
Burns et al., "Women also Snowboard: Overcoming Bias in Captioning Models", ECCV 2018. Figures from Burns et al., copyright 2018. Reproduced with permission.

Similar tasks in NLP - Language translation example
Input: Sequence x = x_1, x_2, ..., x_T
Output: Sequence y = y_1, y_2, ..., y_T
Example input (x_0 ... x_3): "personne portant un chapeau" (French for "person wearing hat").

Similar tasks in NLP - Language translation example
Input: Sequence x = x_1, x_2, ..., x_T
Output: Sequence y = y_1, y_2, ..., y_T
Encoder: h_0 = f_W(z), where z_t = RNN(x_t, u_{t-1}), f_W(.) is an MLP, and u is the hidden RNN state.
The encoder RNN reads "personne portant un chapeau" and produces states z_0 ... z_3 plus the initial decoder state h_0.

Similar tasks in NLP - Language translation example
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c), where the context vector c is often c = h_0.
Starting from [START], the decoder generates "person wearing hat" one word per step (hidden states h_1 ... h_4).

Attention in NLP - Language translation example
Compute alignment scores (scalars): e_t = f_att(h_0, z_t), where f_att(.) is an MLP; one score e_0 ... e_3 per encoder state z_0 ... z_3.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Attention in NLP - Language translation example
Compute alignment scores e_t = f_att(h_0, z_t) with an MLP f_att(.), then normalize with a softmax to get attention weights a_0 ... a_3: 0 < a_t < 1, and the attention values sum to 1.

Attention in NLP - Language translation example
Normalize the alignment scores with a softmax, then compute the context vector as a weighted sum of the encoder states: c_1 = Σ_t a_t z_t.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Attention in NLP - Language translation example
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector c_t at every time step.
Each context vector c_1 ... c_4 attends over the encoder states z_0 ... z_3 as the decoder generates "person wearing hat".

Similar visualization of attention weights
English to French translation example:
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."
Without any attention supervision, the model learns different word orderings for different languages.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Attention we just saw in image captioning
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
- Attention: a = softmax(e)
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
- Attention: a = softmax(e)
- Output: c = Σ_{i,j} a_{i,j} z_{i,j}
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)
Outputs:
- Context vector: c (shape: D)

General attention layer
Operations:
- Alignment: e_i = f_att(h, x_i)
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
Inputs:
- Input vectors: x (shape: N x D); stretch the H x W grid into N = H·W vectors
- Query: h (shape: D)
Outputs:
- Context vector: c (shape: D)
The attention operation is permutation invariant: it doesn't care about the ordering of the features.

General attention layer
Change f_att(.) to a simple dot product:
- Alignment: e_i = h · x_i
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
This only works well together with the key & value transformation trick (mentioned in a few slides).
Inputs: input vectors x (shape: N x D), query h (shape: D). Outputs: context vector c (shape: D).

General attention layer
Change f_att(.) to a scaled simple dot product:
- Alignment: e_i = h · x_i / √D
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
Why scale? Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher. Large-magnitude vectors produce much higher logits, so the post-softmax distribution has lower entropy (assuming the logits are IID); ultimately, these large-magnitude vectors cause the softmax to peak and assign very little weight to all the others. Dividing by √D reduces the effect of large-magnitude vectors.
Inputs: input vectors x (shape: N x D), query h (shape: D).
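
A quick NumPy check of this argument (sizes and random vectors are illustrative): for IID unit-variance components, the raw dot product h · x has variance that grows with D, while dividing by √D keeps it near 1, so the logits fed to the softmax stay in a reasonable range.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (32, 512):
    h = rng.standard_normal((10000, D))
    x = rng.standard_normal((10000, D))
    dots = (h * x).sum(axis=1)
    print(D, dots.var(), (dots / np.sqrt(D)).var())
# variance of h.x grows roughly linearly with D; after / sqrt(D) it stays near 1
```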

General attention layer
Multiple query vectors: each query creates a new output context vector.
Operations:
- Alignment: e_{i,j} = q_j · x_i / √D
- Attention: a = softmax(e), normalized over the inputs separately for each query
- Output: y_j = Σ_i a_{i,j} x_i
Inputs: input vectors x (shape: N x D), queries q (shape: M x D). Outputs: context vectors y (shape: M x D).

General attention layer
Notice that the input vectors are used for both the alignment and the attention (output) calculations. We can add more expressivity to the layer by adding a different FC layer before each of the two steps.
Operations: alignment e_{i,j} = q_j · x_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} x_i.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D).

General attention layer
Add a different FC layer before each of the two steps:
- Key vectors: k = xW_k (used for the alignment step)
- Value vectors: v = xW_v (used for the output step)
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k).

General attention layer
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
The input and output dimensions can now change depending on the key and value FC layers.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k). Outputs: context vectors y (shape: M x D_v).
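
A minimal NumPy sketch of this general attention layer with keys and values (toy sizes; the projection matrices are random stand-ins for learned FC layers):

```python
import numpy as np

N, M, D, Dk, Dv = 9, 4, 64, 32, 48   # toy sizes
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))           # input vectors
q = rng.standard_normal((M, Dk))          # query vectors
Wk = rng.standard_normal((D, Dk)) * 0.1   # key projection (learned in practice)
Wv = rng.standard_normal((D, Dv)) * 0.1   # value projection (learned in practice)

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=axis, keepdims=True)

k = x @ Wk                             # keys:   (N, Dk)
v = x @ Wv                             # values: (N, Dv)
e = q @ k.T / np.sqrt(Dk)              # alignment scores: (M, N)
a = softmax(e, axis=1)                 # attention weights: each row sums to 1
y = a @ v                              # outputs: (M, Dv), one context vector per query
print(y.shape)                         # (4, 48)
```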

General attention layer
Recall that in image captioning the query vector was itself a function of the input vectors: the encoder computed h_0 = f_W(z) from the spatial CNN features with an MLP.
Operations: keys k = xW_k, values v = xW_v, alignment e_{i,j} = q_j · k_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} v_i.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k).

Self-attention layer
We can calculate the query vectors from the input vectors, therefore defining a "self-attention" layer: there are no input query vectors anymore; instead, the query vectors are calculated using an FC layer.
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D).

Self-attention layer
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D). Outputs: context vectors y (shape: N x D_v), one per input.
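
The same sketch with the queries also computed from x, which is all it takes to turn the general attention layer into self-attention (again with toy sizes and random stand-ins for learned weights):

```python
import numpy as np

N, D, Dk, Dv = 6, 64, 32, 32
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))
Wq = rng.standard_normal((D, Dk)) * 0.1
Wk = rng.standard_normal((D, Dk)) * 0.1
Wv = rng.standard_normal((D, Dv)) * 0.1

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # all three come from the inputs
    e = q @ k.T / np.sqrt(Dk)                 # (N, N) alignment scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)         # softmax over the inputs, per query
    return a @ v                              # (N, Dv): one output per input vector

y = self_attention(x)
print(y.shape)   # (6, 32)
```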

Self-attention layer - attends over sets of inputs
One self-attention layer maps a set of input vectors x_0, x_1, x_2 to a set of context vectors y_0, y_1, y_2, using the same operations as above: keys k = xW_k, values v = xW_v, queries q = xW_q, alignment e_{i,j} = q_j · k_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} v_i.

Self-attention layer - attends over sets of inputs
Self-attention is permutation invariant: permuting the input vectors simply permutes the corresponding outputs.
Problem: how can we encode ordered sequences like language or spatially ordered image features?

Positional encoding
Concatenate a special positional encoding p_j to each input vector x_j. We use a function pos: N → R^d to process the position j of the vector into a d-dimensional vector, so p_j = pos(j).
Desiderata of pos(.):
1. It should output a unique encoding for each time step (word's position in a sentence).
2. Distance between any two time steps should be consistent across sentences with different lengths.
3. Our model should generalize to longer sentences without any effort; its values should be bounded.
4. It must be deterministic.

Positional encoding
Options for pos(.):
1. Learn a lookup table: learn the parameters to use for pos(t) for t ∈ [0, T); the lookup table contains T x d parameters.
Vaswani et al., "Attention is all you need", NeurIPS 2017

Positional encoding
Options for pos(.):
1. Learn a lookup table (T x d parameters).
2. Design a fixed function with the desiderata: p(t) = [sin(ω_1 t), cos(ω_1 t), sin(ω_2 t), cos(ω_2 t), ..., sin(ω_{d/2} t), cos(ω_{d/2} t)], where ω_k = 1 / 10000^{2k/d}.
Vaswani et al., "Attention is all you need", NeurIPS 2017

Positional encoding
Options for pos(.):
1. Learn a lookup table (T x d parameters).
2. Design a fixed function with the desiderata.
Intuition: [Figure: visualization of the fixed sinusoidal positional encoding values across positions and dimensions]
Vaswani et al., "Attention is all you need", NeurIPS 2017
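
A small NumPy sketch of both options, assuming the standard sinusoidal form from Vaswani et al. for option 2 (the length T and dimension d are illustrative):

```python
import numpy as np

T, d = 50, 16                      # max length and encoding dimension (illustrative)
rng = np.random.default_rng(0)

# Option 1: a learned lookup table with T x d parameters (here just randomly initialized)
learned_pos = rng.standard_normal((T, d)) * 0.02

# Option 2: fixed sinusoidal encoding (Vaswani et al., 2017)
def sinusoidal_pos(T, d):
    pos = np.arange(T)[:, None]                     # positions 0..T-1
    k = np.arange(0, d, 2)[None, :]                 # even dimension indices
    angles = pos / np.power(10000.0, k / d)         # one frequency per sin/cos pair
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(angles)                     # even dims: sin
    p[:, 1::2] = np.cos(angles)                     # odd dims:  cos
    return p                                        # bounded, unique, deterministic

p = sinusoidal_pos(T, d)
x = rng.standard_normal((T, d))                     # toy input vectors
x_with_pos = np.concatenate([x, p], axis=1)         # concatenate p_j to each x_j, as in the slide
print(x_with_pos.shape)                             # (50, 32)
```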

Masked self-attention layer
- Prevent vectors from looking at future vectors.
- Manually set those alignment scores to -infinity, so their attention weights become 0 after the softmax.
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D, with e_{i,j} = -∞ wherever input i comes after query position j
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D). Outputs: context vectors y (shape: N x D_v).
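
A NumPy sketch of the causal mask applied to the toy self-attention from above (random stand-ins for learned projections): scores for future positions are set to -inf before the softmax, so their attention weights come out exactly 0.

```python
import numpy as np

N, D, Dk = 5, 32, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, Dk)) * 0.1 for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
e = q @ k.T / np.sqrt(Dk)                 # e[j, i]: query j attending to input i

# mask: query j may only look at inputs i <= j; future positions get -inf
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
e[mask] = -np.inf

a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)
print(np.round(a, 2))   # upper triangle is exactly 0: no attention to the future
```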

Multi-head self-attention layer
- Run multiple self-attention heads in parallel: split the inputs across the heads, run self-attention independently in each head, then recombine (concatenate or add) the head outputs to form y_0, y_1, y_2.
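
A NumPy sketch of multi-head self-attention under the common design where the model dimension is split evenly across heads, each head runs scaled dot-product self-attention, and the head outputs are concatenated (the weights are random stand-ins for learned projections):

```python
import numpy as np

N, D, num_heads = 6, 64, 4
Dh = D // num_heads                       # per-head dimension
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))
# one set of projections per head (random stand-ins for learned weights)
Wq = rng.standard_normal((num_heads, D, Dh)) * 0.1
Wk = rng.standard_normal((num_heads, D, Dh)) * 0.1
Wv = rng.standard_normal((num_heads, D, Dh)) * 0.1

def softmax(e):
    e = e - e.max(axis=-1, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(num_heads):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]     # each head sees all N inputs
    a = softmax(q @ k.T / np.sqrt(Dh))            # (N, N) attention per head
    heads.append(a @ v)                           # (N, Dh)

y = np.concatenate(heads, axis=1)                 # concatenate head outputs -> (N, D)
print(y.shape)
```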

General attention versus self-attention
[Figure: a general attention layer takes separate query vectors q_0, q_1, q_2 and input vectors x_0, x_1, x_2; a self-attention layer computes its queries from the input vectors x_0, x_1, x_2 themselves]

Comparing RNNs to Transformers
RNNs:
(+) LSTMs work reasonably well for long sequences.
(-) Expects an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.
Transformers:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Requires a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Image Captioning using transformers
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Extract spatial features from a pretrained CNN (Features: H x W x D).

Image Captioning using transformers
Encoder: c = T_W(z), where z is the spatial CNN features and T_W(.) is the transformer encoder.
The grid of features z_{0,0} ... z_{2,2} is fed to the transformer encoder, which produces context vectors c_{0,0} ... c_{2,2}.

Image Captioning using transformers
Decoder: y_t = T_D(y_{0:t-1}, c), where T_D(.) is the transformer decoder.
Given [START] and the encoder outputs c_{0,0} ... c_{2,2}, the transformer decoder generates "person wearing hat [END]", conditioning each word on all previous words y_0 ... y_{t-1} and on c.
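
A rough PyTorch sketch of this encoder/decoder wiring using the library's built-in transformer layers; all sizes, the vocabulary, the learned positional embedding, and the projection layers are illustrative assumptions, not the lecture's exact model:

```python
import torch
import torch.nn as nn

D_feat, d_model, vocab, max_len = 512, 256, 1000, 20   # illustrative sizes

cnn_features = torch.randn(1, 9, D_feat)        # H*W = 9 spatial features from a CNN
tokens = torch.randint(0, vocab, (1, 5))        # words generated so far ([START], person, ...)

feat_proj = nn.Linear(D_feat, d_model)          # map CNN features into the model dimension
tok_embed = nn.Embedding(vocab, d_model)
pos_embed = nn.Embedding(max_len, d_model)      # learned positional encoding (option 1)

enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
to_vocab = nn.Linear(d_model, vocab)

# Encoder: c = T_W(z)
c = encoder(feat_proj(cnn_features))                       # (1, 9, d_model)

# Decoder: y_t = T_D(y_{0:t-1}, c), with a causal mask so each position
# only attends to earlier words
T = tokens.shape[1]
y_in = tok_embed(tokens) + pos_embed(torch.arange(T))[None]
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
out = decoder(y_in, c, tgt_mask=causal_mask)                # (1, 5, d_model)
next_word_logits = to_vocab(out[:, -1])                     # predict the next word
print(next_word_logits.shape)                               # (1, 1000)
```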

The Transformer encoder block
The transformer encoder is made up of N encoder blocks.
In Vaswani et al., N = 6 and D_q = 512.
Vaswani et al., "Attention is all you need", NeurIPS 2017

The Transformer encoder block
Let's dive into one encoder block, which takes the input vectors x_0, x_1, x_2, ...

The Transformer encoder block
Add positional encoding to the input vectors x_0, x_1, x_2, ...

The Transformer encoder block
Add positional encoding, then apply multi-head self-attention: the attention attends over all the vectors.

The Transformer encoder block
Add positional encoding, apply multi-head self-attention (which attends over all the vectors), and add a residual connection around the attention layer.
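
Per Vaswani et al., the full block applies multi-head self-attention with a residual connection and LayerNorm, followed by an MLP with another residual connection and LayerNorm; a minimal NumPy sketch of such a block, with toy sizes and random stand-ins for learned weights:

```python
import numpy as np

N, D, num_heads, D_ff = 6, 64, 4, 256
Dh = D // num_heads
rng = np.random.default_rng(0)

def softmax(e):
    e = e - e.max(axis=-1, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)        # learned scale/shift omitted for brevity

def multi_head_self_attention(x, Wq, Wk, Wv):
    heads = []
    for h in range(num_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        a = softmax(q @ k.T / np.sqrt(Dh))
        heads.append(a @ v)
    return np.concatenate(heads, axis=1)        # (N, D)

# random stand-ins for learned weights
Wq, Wk, Wv = (rng.standard_normal((num_heads, D, Dh)) * 0.1 for _ in range(3))
W1, W2 = rng.standard_normal((D, D_ff)) * 0.1, rng.standard_normal((D_ff, D)) * 0.1

def encoder_block(x):
    x = layer_norm(x + multi_head_self_attention(x, Wq, Wk, Wv))  # attention + residual + LayerNorm
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)                # MLP (ReLU) + residual + LayerNorm
    return x

x = rng.standard_normal((N, D))    # inputs (positional encoding already added)
print(encoder_block(x).shape)      # (6, 64): same shape in and out, so N blocks can be stacked
```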
