Attention And Transformers Lecture 11 - Stanford University


Lecture 11: Attention and Transformers
Fei-Fei Li, Ranjay Krishna, Danfei Xu
May 06, 2021

Administrative: Midterm
- Midterm was this Tuesday.
- We will be grading this week and you should have grades by next week.

Administrative: Assignment 3
- A3 is due Friday May 25th, 11:59pm.
- Lots of applications of ConvNets.
- Also contains an extra credit notebook, which is worth an additional 5% of the A3 grade.
- Extra credit will not be used when curving the class grades.

Last Time: Recurrent Neural Networks

Last Time: Variable-length computation graph with shared weights
[Figure: unrolled RNN reusing the same weights W at every step, with hidden states h_0 ... h_T, inputs x_1 ... x_T, outputs y_1 ... y_T, and per-step losses L_1 ... L_T summed into L]

Let's jump to lecture 10, slide 43.

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Image Captioning using spatial features
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Extract spatial features from a pretrained CNN: a grid of features z (shape H x W x D), written z_{0,0} ... z_{2,2}.
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Image Captioning using spatial features
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Encoder: h_0 = f_W(z), where z is the spatial CNN features and f_W(.) is an MLP.
The H x W x D features from the pretrained CNN are passed through the MLP to produce the initial decoder state h_0.

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c), where the context vector c is often c = h_0.
At the first step, the decoder takes y_0 = [START], h_0, and c, and predicts the first word y_1 = "person".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
Unrolling the decoder: y_1 = "person", then y_2 = "wearing".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
Unrolling one more step: y_1 = "person", y_2 = "wearing", y_3 = "hat".

Image Captioning using spatial features
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c) with c = h_0.
The full caption "person wearing hat" is generated one word per time step, with the same context vector c fed in at every step.

Image Captioning using spatial features
Problem: the input is "bottlenecked" through c.
- The model needs to encode everything it wants to say within c.
This is a problem if we want to generate really long descriptions, e.g. hundreds of words long.

Image Captioning with RNNs & Attention
Attention idea: compute a new context vector at every time step. Each context vector will attend to different image regions.
(Attention is reminiscent of saccades in humans.)
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

Image Captioning with RNNs & Attention
Compute alignment scores (scalars): e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), where f_att(.) is an MLP.
The scores form an H x W grid (e_{1,0,0} ... e_{1,2,2}), one per spatial feature.

Image Captioning with RNNs & Attention
Compute alignment scores (scalars): e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), where f_att(.) is an MLP.
Normalize with a softmax to get attention weights (H x W): 0 < a_{t,i,j} < 1, and the attention values sum to 1.

Image Captioning with RNNs & Attention
Compute alignment scores e_{t,i,j} = f_att(h_{t-1}, z_{i,j}), normalize with a softmax to get attention weights a_{t,i,j} (0 < a_{t,i,j} < 1, summing to 1), then compute the context vector as a weighted sum of the features: c_t = Σ_{i,j} a_{t,i,j} z_{i,j}.
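
A minimal NumPy sketch of one such attention step over the CNN feature grid, assuming a toy two-layer MLP for f_att (the hidden size, the random features, and the random weights are illustrative stand-ins, not the lecture's model):

```python
import numpy as np

H, W, D = 3, 3, 512            # spatial grid of CNN features, as in the slides
rng = np.random.default_rng(0)

z = rng.standard_normal((H, W, D))   # features z_{i,j}
h = rng.standard_normal(D)           # previous decoder hidden state h_{t-1}

# f_att(h, z_ij): a small MLP on the concatenated [h, z_ij] (toy choice)
W1 = rng.standard_normal((2 * D, 128)) * 0.01
W2 = rng.standard_normal((128, 1)) * 0.01

def f_att(h, z_ij):
    hidden = np.tanh(np.concatenate([h, z_ij]) @ W1)
    return (hidden @ W2).item()          # scalar alignment score e_{t,i,j}

e = np.array([[f_att(h, z[i, j]) for j in range(W)] for i in range(H)])

# softmax over all H*W positions -> attention weights a_{t,i,j}
a = np.exp(e - e.max())
a /= a.sum()

# context vector: weighted sum of the spatial features
c = (a[..., None] * z).sum(axis=(0, 1))   # shape (D,)
print(a.sum(), c.shape)                   # 1.0 (512,)
```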

Image Captioning with RNNs & Attention
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector at every time step.
Each time step of the decoder uses a different context vector that looks at different parts of the input image.
Here c_1 and y_0 = [START] produce h_1 and the first word y_1 = "person".

Image Captioning with RNNs & Attention
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector at every time step.
Using h_1, recompute the alignment scores and attention weights over the feature grid to form the next context vector c_2.

Image Captioning with RNNs & Attention
Each time step of the decoder uses a different context vector that looks at different parts of the input image: c_1 leads to "person", c_2 leads to "wearing".

Image Captioning with RNNs & Attention
Continuing the unrolling: c_3 attends to new image regions and the decoder predicts y_3 = "hat".

Image Captioning with RNNs & Attention
The decoder finishes with c_4 and predicts y_4 = [END], completing the caption "person wearing hat".

Image Captioning with RNNs & Attention
This entire process is differentiable:
- the model chooses its own attention weights; no attention supervision is required.

Image Captioning with Attention
Soft attention vs. hard attention (hard attention requires reinforcement learning).
Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention
Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Attention can detect Gender Bias
All images are CC0 Public domain.
Burns et al., "Women also Snowboard: Overcoming Bias in Captioning Models", ECCV 2018. Figures from Burns et al., copyright 2018. Reproduced with permission.

Similar tasks in NLP - Language translation example
Input: Sequence x = x_1, x_2, ..., x_T
Output: Sequence y = y_1, y_2, ..., y_T
Example input (x_0 ... x_3): "personne portant un chapeau" (French for "person wearing hat").

Similar tasks in NLP - Language translation example
Input: Sequence x = x_1, x_2, ..., x_T
Output: Sequence y = y_1, y_2, ..., y_T
Encoder: h_0 = f_W(z), where z_t = RNN(x_t, u_{t-1}), f_W(.) is an MLP, and u is the hidden RNN state.
The encoder RNN reads "personne portant un chapeau" and produces states z_0 ... z_3 plus the initial decoder state h_0.

Similar tasks in NLP - Language translation example
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c), where the context vector c is often c = h_0.
Starting from [START], the decoder generates "person wearing hat" one word per step (hidden states h_1 ... h_4).

Attention in NLP - Language translation example
Compute alignment scores (scalars): e_t = f_att(h_0, z_t), where f_att(.) is an MLP; one score e_0 ... e_3 per encoder state z_0 ... z_3.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Attention in NLP - Language translation example
Compute alignment scores e_t = f_att(h_0, z_t) with an MLP f_att(.), then normalize with a softmax to get attention weights a_0 ... a_3: 0 < a_t < 1, and the attention values sum to 1.

Attention in NLP - Language translation example
Normalize the alignment scores with a softmax, then compute the context vector as a weighted sum of the encoder states: c_1 = Σ_t a_t z_t.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Attention in NLP - Language translation example
Decoder: y_t = g_V(y_{t-1}, h_{t-1}, c_t), with a new context vector c_t at every time step.
Each context vector c_1 ... c_4 attends over the encoder states z_0 ... z_3 as the decoder generates "person wearing hat".

Similar visualization of attention weights
English to French translation example:
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."
Without any attention supervision, the model learns different word orderings for different languages.
Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Attention we just saw in image captioning
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
- Attention: a = softmax(e)
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Attention we just saw in image captioning
Operations:
- Alignment: e_{i,j} = f_att(h, z_{i,j})
- Attention: a = softmax(e)
- Output: c = Σ_{i,j} a_{i,j} z_{i,j}
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)
Outputs:
- Context vector: c (shape: D)

General attention layer
Operations:
- Alignment: e_i = f_att(h, x_i)
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
Inputs:
- Input vectors: x (shape: N x D); stretch the H x W grid into N = H·W vectors
- Query: h (shape: D)
Outputs:
- Context vector: c (shape: D)
The attention operation is permutation invariant: it doesn't care about the ordering of the features.

General attention layer
Change f_att(.) to a simple dot product:
- Alignment: e_i = h · x_i
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
This only works well together with the key & value transformation trick (mentioned in a few slides).
Inputs: input vectors x (shape: N x D), query h (shape: D). Outputs: context vector c (shape: D).

General attention layer
Change f_att(.) to a scaled simple dot product:
- Alignment: e_i = h · x_i / √D
- Attention: a = softmax(e)
- Output: c = Σ_i a_i x_i
Why scale? Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher. Large-magnitude vectors produce much higher logits, so the post-softmax distribution has lower entropy (assuming the logits are IID); ultimately, these large-magnitude vectors cause the softmax to peak and assign very little weight to all the others. Dividing by √D reduces the effect of large-magnitude vectors.
Inputs: input vectors x (shape: N x D), query h (shape: D).
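
A quick NumPy check of this argument (sizes and random vectors are illustrative): for IID unit-variance components, the raw dot product h · x has variance that grows with D, while dividing by √D keeps it near 1, so the logits fed to the softmax stay in a reasonable range.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (32, 512):
    h = rng.standard_normal((10000, D))
    x = rng.standard_normal((10000, D))
    dots = (h * x).sum(axis=1)
    print(D, dots.var(), (dots / np.sqrt(D)).var())
# variance of h.x grows roughly linearly with D; after / sqrt(D) it stays near 1
```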

General attention layer
Multiple query vectors: each query creates a new output context vector.
Operations:
- Alignment: e_{i,j} = q_j · x_i / √D
- Attention: a = softmax(e), normalized over the inputs separately for each query
- Output: y_j = Σ_i a_{i,j} x_i
Inputs: input vectors x (shape: N x D), queries q (shape: M x D). Outputs: context vectors y (shape: M x D).

General attention layer
Notice that the input vectors are used for both the alignment and the attention (output) calculations. We can add more expressivity to the layer by adding a different FC layer before each of the two steps.
Operations: alignment e_{i,j} = q_j · x_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} x_i.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D).

General attention layer
Add a different FC layer before each of the two steps:
- Key vectors: k = xW_k (used for the alignment step)
- Value vectors: v = xW_v (used for the output step)
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k).

General attention layer
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
The input and output dimensions can now change depending on the key and value FC layers.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k). Outputs: context vectors y (shape: M x D_v).
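
A minimal NumPy sketch of this general attention layer with keys and values (toy sizes; the projection matrices are random stand-ins for learned FC layers):

```python
import numpy as np

N, M, D, Dk, Dv = 9, 4, 64, 32, 48   # toy sizes
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))           # input vectors
q = rng.standard_normal((M, Dk))          # query vectors
Wk = rng.standard_normal((D, Dk)) * 0.1   # key projection (learned in practice)
Wv = rng.standard_normal((D, Dv)) * 0.1   # value projection (learned in practice)

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=axis, keepdims=True)

k = x @ Wk                             # keys:   (N, Dk)
v = x @ Wv                             # values: (N, Dv)
e = q @ k.T / np.sqrt(Dk)              # alignment scores: (M, N)
a = softmax(e, axis=1)                 # attention weights: each row sums to 1
y = a @ v                              # outputs: (M, Dv), one context vector per query
print(y.shape)                         # (4, 48)
```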

General attention layer
Recall that in image captioning the query vector was itself a function of the input vectors: the encoder computed h_0 = f_W(z) from the spatial CNN features with an MLP.
Operations: keys k = xW_k, values v = xW_v, alignment e_{i,j} = q_j · k_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} v_i.
Inputs: input vectors x (shape: N x D), queries q (shape: M x D_k).

Self-attention layer
We can calculate the query vectors from the input vectors, therefore defining a "self-attention" layer: there are no input query vectors anymore; instead, the query vectors are calculated using an FC layer.
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D).

Self-attention layer
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D). Outputs: context vectors y (shape: N x D_v), one per input.
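
The same sketch with the queries also computed from x, which is all it takes to turn the general attention layer into self-attention (again with toy sizes and random stand-ins for learned weights):

```python
import numpy as np

N, D, Dk, Dv = 6, 64, 32, 32
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))
Wq = rng.standard_normal((D, Dk)) * 0.1
Wk = rng.standard_normal((D, Dk)) * 0.1
Wv = rng.standard_normal((D, Dv)) * 0.1

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # all three come from the inputs
    e = q @ k.T / np.sqrt(Dk)                 # (N, N) alignment scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)         # softmax over the inputs, per query
    return a @ v                              # (N, Dv): one output per input vector

y = self_attention(x)
print(y.shape)   # (6, 32)
```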

Self-attention layer - attends over sets of inputs
One self-attention layer maps a set of input vectors x_0, x_1, x_2 to a set of context vectors y_0, y_1, y_2, using the same operations as above: keys k = xW_k, values v = xW_v, queries q = xW_q, alignment e_{i,j} = q_j · k_i / √D, attention a = softmax(e), output y_j = Σ_i a_{i,j} v_i.

Self-attention layer - attends over sets of inputs
Self-attention is permutation invariant: permuting the input vectors simply permutes the corresponding outputs.
Problem: how can we encode ordered sequences like language or spatially ordered image features?

Positional encoding
Concatenate a special positional encoding p_j to each input vector x_j. We use a function pos: N → R^d to process the position j of the vector into a d-dimensional vector, so p_j = pos(j).
Desiderata of pos(.):
1. It should output a unique encoding for each time step (word's position in a sentence).
2. Distance between any two time steps should be consistent across sentences with different lengths.
3. Our model should generalize to longer sentences without any effort; its values should be bounded.
4. It must be deterministic.

Positional encoding
Options for pos(.):
1. Learn a lookup table: learn the parameters to use for pos(t) for t ∈ [0, T); the lookup table contains T x d parameters.
Vaswani et al., "Attention is all you need", NeurIPS 2017

Positional encoding
Options for pos(.):
1. Learn a lookup table (T x d parameters).
2. Design a fixed function with the desiderata: p(t) = [sin(ω_1 t), cos(ω_1 t), sin(ω_2 t), cos(ω_2 t), ..., sin(ω_{d/2} t), cos(ω_{d/2} t)], where ω_k = 1 / 10000^{2k/d}.
Vaswani et al., "Attention is all you need", NeurIPS 2017

Positional encoding
Options for pos(.):
1. Learn a lookup table (T x d parameters).
2. Design a fixed function with the desiderata.
Intuition: [Figure: visualization of the fixed sinusoidal positional encoding values across positions and dimensions]
Vaswani et al., "Attention is all you need", NeurIPS 2017
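
A small NumPy sketch of both options, assuming the standard sinusoidal form from Vaswani et al. for option 2 (the length T and dimension d are illustrative):

```python
import numpy as np

T, d = 50, 16                      # max length and encoding dimension (illustrative)
rng = np.random.default_rng(0)

# Option 1: a learned lookup table with T x d parameters (here just randomly initialized)
learned_pos = rng.standard_normal((T, d)) * 0.02

# Option 2: fixed sinusoidal encoding (Vaswani et al., 2017)
def sinusoidal_pos(T, d):
    pos = np.arange(T)[:, None]                     # positions 0..T-1
    k = np.arange(0, d, 2)[None, :]                 # even dimension indices
    angles = pos / np.power(10000.0, k / d)         # one frequency per sin/cos pair
    p = np.zeros((T, d))
    p[:, 0::2] = np.sin(angles)                     # even dims: sin
    p[:, 1::2] = np.cos(angles)                     # odd dims:  cos
    return p                                        # bounded, unique, deterministic

p = sinusoidal_pos(T, d)
x = rng.standard_normal((T, d))                     # toy input vectors
x_with_pos = np.concatenate([x, p], axis=1)         # concatenate p_j to each x_j, as in the slide
print(x_with_pos.shape)                             # (50, 32)
```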

Masked self-attention layer
- Prevent vectors from looking at future vectors.
- Manually set those alignment scores to -infinity, so their attention weights become 0 after the softmax.
Operations:
- Key vectors: k = xW_k
- Value vectors: v = xW_v
- Query vectors: q = xW_q
- Alignment: e_{i,j} = q_j · k_i / √D, with e_{i,j} = -∞ wherever input i comes after query position j
- Attention: a = softmax(e)
- Output: y_j = Σ_i a_{i,j} v_i
Inputs: input vectors x (shape: N x D). Outputs: context vectors y (shape: N x D_v).
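
A NumPy sketch of the causal mask applied to the toy self-attention from above (random stand-ins for learned projections): scores for future positions are set to -inf before the softmax, so their attention weights come out exactly 0.

```python
import numpy as np

N, D, Dk = 5, 32, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, Dk)) * 0.1 for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
e = q @ k.T / np.sqrt(Dk)                 # e[j, i]: query j attending to input i

# mask: query j may only look at inputs i <= j; future positions get -inf
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
e[mask] = -np.inf

a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)
print(np.round(a, 2))   # upper triangle is exactly 0: no attention to the future
```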

Multi-head self-attention layer
- Run multiple self-attention heads in parallel: split the inputs across the heads, run self-attention independently in each head, then recombine (concatenate or add) the head outputs to form y_0, y_1, y_2.
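
A NumPy sketch of multi-head self-attention under the common design where the model dimension is split evenly across heads, each head runs scaled dot-product self-attention, and the head outputs are concatenated (the weights are random stand-ins for learned projections):

```python
import numpy as np

N, D, num_heads = 6, 64, 4
Dh = D // num_heads                       # per-head dimension
rng = np.random.default_rng(0)

x = rng.standard_normal((N, D))
# one set of projections per head (random stand-ins for learned weights)
Wq = rng.standard_normal((num_heads, D, Dh)) * 0.1
Wk = rng.standard_normal((num_heads, D, Dh)) * 0.1
Wv = rng.standard_normal((num_heads, D, Dh)) * 0.1

def softmax(e):
    e = e - e.max(axis=-1, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(num_heads):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]     # each head sees all N inputs
    a = softmax(q @ k.T / np.sqrt(Dh))            # (N, N) attention per head
    heads.append(a @ v)                           # (N, Dh)

y = np.concatenate(heads, axis=1)                 # concatenate head outputs -> (N, D)
print(y.shape)
```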

General attention versus self-attention
[Figure: a general attention layer takes separate query vectors q_0, q_1, q_2 and input vectors x_0, x_1, x_2; a self-attention layer computes its queries from the input vectors x_0, x_1, x_2 themselves]

Comparing RNNs to Transformers
RNNs:
(+) LSTMs work reasonably well for long sequences.
(-) Expects an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.
Transformers:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Requires a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).

Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers

Image Captioning using transformers
Input: Image I
Output: Sequence y = y_1, y_2, ..., y_T
Extract spatial features from a pretrained CNN (Features: H x W x D).

Image Captioning using transformers
Encoder: c = T_W(z), where z is the spatial CNN features and T_W(.) is the transformer encoder.
The grid of features z_{0,0} ... z_{2,2} is fed to the transformer encoder, which produces context vectors c_{0,0} ... c_{2,2}.

Image Captioning using transformers
Decoder: y_t = T_D(y_{0:t-1}, c), where T_D(.) is the transformer decoder.
Given [START] and the encoder outputs c_{0,0} ... c_{2,2}, the transformer decoder generates "person wearing hat [END]", conditioning each word on all previous words y_0 ... y_{t-1} and on c.
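
A rough PyTorch sketch of this encoder/decoder wiring using the library's built-in transformer layers; all sizes, the vocabulary, the learned positional embedding, and the projection layers are illustrative assumptions, not the lecture's exact model:

```python
import torch
import torch.nn as nn

D_feat, d_model, vocab, max_len = 512, 256, 1000, 20   # illustrative sizes

cnn_features = torch.randn(1, 9, D_feat)        # H*W = 9 spatial features from a CNN
tokens = torch.randint(0, vocab, (1, 5))        # words generated so far ([START], person, ...)

feat_proj = nn.Linear(D_feat, d_model)          # map CNN features into the model dimension
tok_embed = nn.Embedding(vocab, d_model)
pos_embed = nn.Embedding(max_len, d_model)      # learned positional encoding (option 1)

enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
to_vocab = nn.Linear(d_model, vocab)

# Encoder: c = T_W(z)
c = encoder(feat_proj(cnn_features))                       # (1, 9, d_model)

# Decoder: y_t = T_D(y_{0:t-1}, c), with a causal mask so each position
# only attends to earlier words
T = tokens.shape[1]
y_in = tok_embed(tokens) + pos_embed(torch.arange(T))[None]
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
out = decoder(y_in, c, tgt_mask=causal_mask)                # (1, 5, d_model)
next_word_logits = to_vocab(out[:, -1])                     # predict the next word
print(next_word_logits.shape)                               # (1, 1000)
```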

The Transformer encoder block
The transformer encoder is made up of N encoder blocks.
In Vaswani et al., N = 6 and D_q = 512.
Vaswani et al., "Attention is all you need", NeurIPS 2017

The Transformer encoder block
Let's dive into one encoder block, which takes the input vectors x_0, x_1, x_2, ...

The Transformer encoder block
Add positional encoding to the input vectors x_0, x_1, x_2, ...

The Transformer encoder block
Add positional encoding, then apply multi-head self-attention: the attention attends over all the vectors.

The Transformer encoder block
Add positional encoding, apply multi-head self-attention (which attends over all the vectors), and add a residual connection around the attention layer.
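
Per Vaswani et al., the full block applies multi-head self-attention with a residual connection and LayerNorm, followed by an MLP with another residual connection and LayerNorm; a minimal NumPy sketch of such a block, with toy sizes and random stand-ins for learned weights:

```python
import numpy as np

N, D, num_heads, D_ff = 6, 64, 4, 256
Dh = D // num_heads
rng = np.random.default_rng(0)

def softmax(e):
    e = e - e.max(axis=-1, keepdims=True)
    e = np.exp(e)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)        # learned scale/shift omitted for brevity

def multi_head_self_attention(x, Wq, Wk, Wv):
    heads = []
    for h in range(num_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        a = softmax(q @ k.T / np.sqrt(Dh))
        heads.append(a @ v)
    return np.concatenate(heads, axis=1)        # (N, D)

# random stand-ins for learned weights
Wq, Wk, Wv = (rng.standard_normal((num_heads, D, Dh)) * 0.1 for _ in range(3))
W1, W2 = rng.standard_normal((D, D_ff)) * 0.1, rng.standard_normal((D_ff, D)) * 0.1

def encoder_block(x):
    x = layer_norm(x + multi_head_self_attention(x, Wq, Wk, Wv))  # attention + residual + LayerNorm
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)                # MLP (ReLU) + residual + LayerNorm
    return x

x = rng.standard_normal((N, D))    # inputs (positional encoding already added)
print(encoder_block(x).shape)      # (6, 64): same shape in and out, so N blocks can be stacked
```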
