Spectracular Learning Algorithms For Natural Language Processing


Spectracular Learning Algorithms for Natural Language Processing

Shay Cohen
University of Edinburgh
June 10, 2014

Latent-variable Models

Latent-variable models are used in many areas of NLP, speech, etc.:
- Hidden Markov Models
- Latent-variable PCFGs
- Naive Bayes for clustering
- Lexical representations: Brown clustering, Saul and Pereira, etc.
- Alignments in statistical machine translation
- Topic modeling
- etc. etc.

The Expectation-Maximization (EM) algorithm is generally used for estimation in these models (Dempster et al., 1977).
Other relevant algorithms: co-training, clustering methods.

Example 1: Hidden Markov Models

[Figure: an HMM with hidden states S1, S2, S3, S4 emitting "the dog saw him"]

Parameterized by π(s), t(s | s′) and o(w | s)
- Spectral learning: Hsu et al. (2009)
- Dynamical systems: Siddiqi et al. (2009), Boots and Gordon (2011)
- Head-automaton grammars for dep. parsing: Luque et al. (2012)

Example 2: Latent-Variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)

[Figure: the parse tree (S (NP (D the) (N dog)) (VP (V saw) (P him))) shown twice: once with plain nonterminals and once with latent-state refinements S1, NP3, VP2, D1, N2, V4, P1]

Example 3: Naïve Bayes

[Figure: graphical model with latent variable H and observed variables X, Y]

p(h, x, y) = p(h) p(x | h) p(y | h)

Data: (the, dog), (I, saw), (ran, to), (John, was), ...

EM can be used to estimate parameters

Example 4: Language Modelling

[Figure: graphical model with w1 and w2 connected through a latent variable h]

p(w2 | w1) = Σ_h p(h | w1) p(w2 | h)    (Saul and Pereira, 1997)

Example 5: HMMs for Speech

Phoneme boundaries are hidden variables
Refinement HMMs (Stratos et al., 2013)

Example 6: Topic Models

Latent topics attached to a document or to each word in a document
Method-of-moments algorithms such as Arora et al. (2012; 2013)

Example 7: Unsupervised Parsing

[Figure: a sentence x = (DT, NN, VBD, DT, NN) ("The bear ate the fish") with latent bracketing u(x) = ((DT NN) (VBD (DT NN))), drawn as a tree with latent nodes z1, z2, z3 over the words w1 ... w5]

- Latent structure is a bracketing (Parikh et al., 2014)
- Similar in flavor to tree learning algorithms (e.g. Anandkumar, 2011)
- Very different in flavor from estimation algorithms

Example 8: Word Embeddings

[Figure: scatter plot of words (push, sleep, drink, carry, eat, listen, agree, disagree) along the first two principal components, PC 1 vs. PC 2]

- Embed a vocabulary into d-dimensional space
- Can later be used for various NLP problems downstream
- Related to canonical correlation analysis (Dhillon et al., 2012)

Spectral Methods

Basic idea: replace EM with methods based on matrix decompositions, in particular singular value decomposition (SVD)

SVD: given a matrix A with m rows and n columns, approximate it as A ≈ U Σ V⊤, which means

  A_jk ≈ Σ_{h=1}^{d} σ_h U_jh V_kh

where the σ_h are "singular values", and U and V are m × d and n × d matrices

Remarkably, we can find the optimal rank-d approximation efficiently

Similarity of SVD to Naïve Bayes

[Figure: the Naive Bayes graphical model with latent H and observed X, Y]

  P(X = x, Y = y) = Σ_{h=1}^{d} p(h) p(x | h) p(y | h)        A_jk ≈ Σ_{h=1}^{d} σ_h U_jh V_kh

- The SVD approximation minimizes squared loss, not log-loss
- σ_h are not interpretable as probabilities
- U_jh, V_kh may be positive or negative, not probabilities
- BUT we can still do a lot with SVD (and higher-order, tensor-based decompositions)

Outline

- Singular value decomposition
- Canonical correlation analysis
- Spectral learning of hidden Markov models
- Algorithm for latent-variable PCFGs

Singular Value Decomposition (SVD)

  A    =    Σ_{i=1}^{d}  σ^i  ·  u^i (v^i)⊤
(m×n)               (scalar)  (m×1)(1×n)

where d = min(m, n), and:
- σ^1 ≥ ... ≥ σ^d ≥ 0
- u^1 ... u^d ∈ R^m are orthonormal: ||u^i||_2 = 1 and u^i · u^j = 0 for i ≠ j
- v^1 ... v^d ∈ R^n are orthonormal: ||v^i||_2 = 1 and v^i · v^j = 0 for i ≠ j

SVD in Matrix Form

  A    =    U     Σ     V⊤
(m×n)    (m×d) (d×d) (d×n)

- U = [u^1 ... u^d] ∈ R^{m×d}
- V = [v^1 ... v^d] ∈ R^{n×d}
- Σ = diag(σ^1, ..., σ^d) ∈ R^{d×d}

Matrix Rank

For A ∈ R^{m×n}, rank(A) ≤ min(m, n)
- rank(A) = the number of linearly independent columns in A

  [ 1 1 2 ]                [ 1 1 2 ]
  [ 1 2 2 ]   rank = 2     [ 1 2 2 ]   rank = 3 (full-rank)
  [ 1 1 2 ]                [ 1 1 3 ]

Matrix Rank: Alternative Definition

- rank(A) = the number of positive singular values of A

  [ 1 1 2 ]                       [ 1 1 2 ]
  [ 1 2 2 ]                       [ 1 2 2 ]
  [ 1 1 2 ]                       [ 1 1 3 ]

        [ 4.53  0    0 ]                [ 5  0     0   ]
  Σ  =  [ 0     0.7  0 ]          Σ  =  [ 0  0.98  0   ]
        [ 0     0    0 ]                [ 0  0     0.2 ]

  rank = 2                        rank = 3 (full-rank)

SVD and Low-Rank Matrix Approximation

- Suppose we want to find B* such that

  B* = arg min_{B : rank(B) = r} Σ_{jk} (A_jk − B_jk)²

- Solution:

  B* = Σ_{i=1}^{r} σ^i u^i (v^i)⊤
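As a small, hedged illustration of the last few slides, the numpy sketch below computes the singular values of the two example matrices (recovering their ranks) and then builds the best rank-r approximation of the full-rank one; its squared error equals the sum of the squared discarded singular values.

```python
import numpy as np

# Singular values and rank of the two example matrices from the rank slides.
A1 = np.array([[1., 1., 2.], [1., 2., 2.], [1., 1., 2.]])
A2 = np.array([[1., 1., 2.], [1., 2., 2.], [1., 1., 3.]])
for A in (A1, A2):
    s = np.linalg.svd(A, compute_uv=False)
    print(np.round(s, 2), "rank =", int(np.sum(s > 1e-10)))
# prints roughly [4.53 0.7 0.] rank = 2, then [5. 0.98 0.2] rank = 3

# Best rank-r approximation of A2, built from the top r singular triplets.
U, s, Vt = np.linalg.svd(A2)          # A2 = U @ np.diag(s) @ Vt
r = 2
B = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print(np.sum((A2 - B) ** 2), np.sum(s[r:] ** 2))   # the two error terms agree
```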

SVD in Practice

- A black box, e.g., in Matlab
  - Input: a matrix A; output: scalars σ^1 ... σ^d, vectors u^1 ... u^d and v^1 ... v^d
  - Efficient implementations
  - Approximate, randomized approaches also available
- Can be used to solve a variety of optimization problems
  - For instance, Canonical Correlation Analysis (CCA)

SVD in Practice - Random Projections

For large matrices (Halko et al., 2011)

[Figure: illustration of computing an approximate SVD via random projections]
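The slide only cites Halko et al. (2011); as a hedged sketch (the function name and parameters below are illustrative, not from the slides), one common randomized variant projects A onto a random low-dimensional subspace and runs an exact SVD on the much smaller projected matrix:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD via a random projection (in the spirit of Halko et al., 2011)."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for the sketched column space
    B = Q.T @ A                           # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Usage: top-5 singular values of a large, approximately low-rank matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 400)) @ rng.standard_normal((400, 300))
U, s, Vt = randomized_svd(A, k=5)
print(s)
```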

Outline

- Singular value decomposition
- Canonical correlation analysis
- Spectral learning of hidden Markov models
- Algorithm for latent-variable PCFGs

Simplest Model in Complexity: Naive Bayes

[Figure: graphical model with latent H and observed X, Y]

p(h, x, y) = p(h) p(x | h) p(y | h)

Data: (the, dog), (I, saw), (ran, to), (John, was), ...

CCA helps identify H

Canonical Correlation Analysis (CCA)

- Data consists of paired samples: (x^(i), y^(i)) for i = 1 ... n
- As in co-training, x^(i) ∈ R^d and y^(i) ∈ R^{d′} are two "views" of a sample point

  View 1                        View 2
  x^(1) = (1, 0, 0, 0)          y^(1) = (1, 0, 0, 1, 0, 1, 0)
  x^(2) = (0, 0, 1, 0)          y^(2) = (0, 1, 0, 0, 0, 0, 1)
  ...                           ...
  x^(100000) = (0, 1, 0, 0)     y^(100000) = (0, 0, 1, 0, 1, 1, 1)

Projection Matrices

- Project samples to a lower-dimensional space: x ∈ R^d → x′ ∈ R^p
- If p is small, we can learn with far fewer samples!
- CCA finds projection matrices A ∈ R^{d×p}, B ∈ R^{d′×p}
- The new data points are a^(i) ∈ R^p, b^(i) ∈ R^p, where

  a^(i)  =  A⊤  x^(i)          b^(i)  =  B⊤  y^(i)
  (p×1)   (p×d)(d×1)           (p×1)   (p×d′)(d′×1)

Mechanics of CCA: Step 1

- Compute ĈXY ∈ R^{d×d′}, ĈXX ∈ R^{d×d}, and ĈYY ∈ R^{d′×d′}:

  [ĈXY]_jk = (1/n) Σ_{i=1}^{n} (x^(i)_j − x̄_j)(y^(i)_k − ȳ_k)
  [ĈXX]_jk = (1/n) Σ_{i=1}^{n} (x^(i)_j − x̄_j)(x^(i)_k − x̄_k)
  [ĈYY]_jk = (1/n) Σ_{i=1}^{n} (y^(i)_j − ȳ_j)(y^(i)_k − ȳ_k)

  where x̄ = Σ_i x^(i)/n and ȳ = Σ_i y^(i)/n

Mechanics of CCA: Step 2

- Do an SVD on ĈXX^{−1/2} ĈXY ĈYY^{−1/2} ∈ R^{d×d′}:

  ĈXX^{−1/2} ĈXY ĈYY^{−1/2} = U Σ V⊤

- Let U_p ∈ R^{d×p} be the top p left singular vectors and V_p ∈ R^{d′×p} the top p right singular vectors

Mechanics of CCA: Step 3

- Define projection matrices A ∈ R^{d×p} and B ∈ R^{d′×p}:

  A = ĈXX^{−1/2} U_p          B = ĈYY^{−1/2} V_p

- Use A and B to project each (x^(i), y^(i)) for i = 1 ... n:

  x^(i) ∈ R^d   →  A⊤ x^(i) ∈ R^p
  y^(i) ∈ R^{d′}  →  B⊤ y^(i) ∈ R^p
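Below is a minimal numpy sketch of the three steps, under the assumption that a small ridge term may be added before taking the inverse square roots (a numerical-stability detail that is not on the slides); it is tested on a synthetic pair of views that share a low-dimensional signal.

```python
import numpy as np

def cca(X, Y, p, ridge=1e-8):
    """Steps 1-3 of CCA. X: (n, d) view 1, Y: (n, d2) view 2, p: output dimension."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

    # Step 1: empirical covariance and cross-covariance matrices.
    Cxy = Xc.T @ Yc / n
    Cxx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])

    def inv_sqrt(C):
        # Symmetric inverse square root via an eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Step 2: SVD of the whitened cross-covariance; keep the top p directions.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    Up, Vp = U[:, :p], Vt[:p, :].T

    # Step 3: projection matrices A and B, and the projected data points.
    A, B = Wx @ Up, Wy @ Vp
    return A, B, Xc @ A, Yc @ B

# Two synthetic views driven by the same 2-dimensional hidden signal.
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((500, 10))
Y = Z @ rng.standard_normal((2, 20)) + 0.1 * rng.standard_normal((500, 20))
A, B, a, b = cca(X, Y, p=2)
print(np.corrcoef(a[:, 0], b[:, 0])[0, 1])   # first canonical correlation, close to 1
```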

Justification of CCA: Correlation Coefficients

- The sample correlation coefficient for a_1 ... a_n ∈ R and b_1 ... b_n ∈ R is

  Corr({a_i}_{i=1}^n, {b_i}_{i=1}^n) = Σ_{i=1}^n (a_i − ā)(b_i − b̄) / ( sqrt(Σ_{i=1}^n (a_i − ā)²) · sqrt(Σ_{i=1}^n (b_i − b̄)²) )

  where ā = Σ_i a_i/n and b̄ = Σ_i b_i/n

[Figure: scatter plot of b against a with correlation ≈ 1]
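A tiny worked example of the formula (with made-up data), checked against numpy's built-in correlation:

```python
import numpy as np

def corr(a, b):
    """Sample correlation coefficient, exactly as defined on the slide."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    ac, bc = a - a.mean(), b - b.mean()
    return np.sum(ac * bc) / np.sqrt(np.sum(ac ** 2) * np.sum(bc ** 2))

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = 2.0 * a + 0.1 * rng.standard_normal(1000)   # b is nearly a linear function of a
print(corr(a, b))                               # close to 1
print(np.corrcoef(a, b)[0, 1])                  # matches numpy's built-in value
```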

Simple Case: p = 1

- The CCA projection matrices are vectors u_1 ∈ R^d, v_1 ∈ R^{d′}
- Project x^(i) and y^(i) to the scalars u_1 · x^(i) and v_1 · y^(i)
- What vectors does CCA find? Answer:

  (u_1, v_1) = arg max_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)

Finding the Next Projections

- After finding u_1 and v_1, what vectors u_2 and v_2 does CCA find? Answer:

  (u_2, v_2) = arg max_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)

  subject to the constraints

  Corr({u_2 · x^(i)}_{i=1}^n, {u_1 · x^(i)}_{i=1}^n) = 0
  Corr({v_2 · y^(i)}_{i=1}^n, {v_1 · y^(i)}_{i=1}^n) = 0

CCA as an Optimization Problem

- CCA finds, for j = 1 ... p (each column of A and B),

  (u_j, v_j) = arg max_{u,v} Corr({u · x^(i)}_{i=1}^n, {v · y^(i)}_{i=1}^n)

  subject to the constraints

  Corr({u_j · x^(i)}_{i=1}^n, {u_k · x^(i)}_{i=1}^n) = 0
  Corr({v_j · y^(i)}_{i=1}^n, {v_k · y^(i)}_{i=1}^n) = 0

  for k < j

Guarantees for CCA

[Figure: the Naive Bayes graphical model with latent H and observed X, Y]

- Assume the data is generated from a Naive Bayes model
- The latent variable H is of dimension k; the variables X and Y are of dimension d and d′ (typically k ≪ d and k ≪ d′)
- Use CCA to project X and Y down to k dimensions (needs (x, y) pairs only!)
- Theorem: the projected samples are as good as the original samples for prediction of H (Foster, Johnson, Kakade, Zhang, 2009)
- Because k ≪ d and k ≪ d′, we can learn to predict H with far fewer labeled examples

Guarantees for CCA (continued)

Kakade and Foster (2007), a co-training-style setting:
- Assume we have a regression problem: predict some value z given two "views" x and y
- Assumption: either view x or y is sufficient for prediction
- Use CCA to project x and y down to a low-dimensional space
- Theorem: if the correlation coefficients drop off to zero quickly, we will need far fewer samples to learn when using the projected representation
- Very similar setting to co-training, but no assumption of independence between the two views

"Variants" of CCA

  ĈXX^{−1/2} ĈXY ĈYY^{−1/2} ∈ R^{d×d′}

- Centering leads to a non-sparse ĈXY
- Computing ĈXX^{−1/2} and ĈYY^{−1/2} leads to large non-sparse matrices

Outline

- Singular value decomposition
- Canonical correlation analysis
- Spectral learning of hidden Markov models
- Algorithm for latent-variable PCFGs

A Spectral Learning Algorithm for HMMs

- Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012)
- The algorithm relies on a singular value decomposition followed by very simple matrix operations
- Close connections to CCA
- Under assumptions on singular values arising from the model, it has PAC-learning style guarantees (contrast with EM, which has problems with local optima)
- It is a very different algorithm from EM

Hidden Markov Models (HMMs)

[Figure: an HMM with hidden states H1, H2, H3, H4 emitting "the dog saw him"]

p(the dog saw him, 1 2 1 3)        (x1 ... x4 = the dog saw him,  h1 ... h4 = 1 2 1 3)
  = π(1) · t(2|1) · t(1|2) · t(3|1) · o(the|1) · o(dog|2) · o(saw|1) · o(him|3)

- Initial parameters: π(h) for each latent state h
- Transition parameters: t(h′|h) for each pair of states h′, h
- Observation parameters: o(x|h) for each state h and observation x

Hidden Markov Models (HMMs)

Throughout this section:
- We use m to refer to the number of hidden states
- We use n to refer to the number of possible words (observations)
- Typically, m ≪ n (e.g., m = 20, n = 50,000)

HMMs: the forward algorithm

[Figure: an HMM with hidden states H1, H2, H3, H4 emitting "the dog saw him"]

p(the dog saw him) = Σ_{h1,h2,h3,h4} p(the dog saw him, h1 h2 h3 h4)

The forward algorithm:

  f⁰_h = π(h)
  f¹_h = Σ_{h′} t(h|h′) o(the|h′) f⁰_{h′}
  f²_h = Σ_{h′} t(h|h′) o(dog|h′) f¹_{h′}
  f³_h = Σ_{h′} t(h|h′) o(saw|h′) f²_{h′}
  f⁴_h = Σ_{h′} t(h|h′) o(him|h′) f³_{h′}
  p(the dog saw him) = Σ_h f⁴_h

HMMs: the forward algorithm in matrix form

[Figure: an HMM with hidden states H1, H2, H3, H4 emitting "the dog saw him"]

- For each word x, define the matrix A_x ∈ R^{m×m} as

  [A_x]_{h′,h} = t(h′|h) o(x|h)       e.g., [A_the]_{h′,h} = t(h′|h) o(the|h)

- Define π as the vector with elements π_h, and 1 as the vector of all ones
- Then

  p(the dog saw him) = 1⊤ A_him A_saw A_dog A_the π

  The forward algorithm through matrix multiplication!
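To make the matrix form concrete, here is a hedged numpy sketch on a made-up toy HMM; it checks the matrix-multiplication formula against a brute-force sum over all hidden-state sequences.

```python
import numpy as np
import itertools

rng = np.random.default_rng(0)
m, n = 3, 5                                    # toy sizes: 3 states, 5 words
pi = rng.dirichlet(np.ones(m))                 # pi[h]     = initial probability
T = rng.dirichlet(np.ones(m), size=m).T        # T[h2, h1] = t(h2 | h1)
O = rng.dirichlet(np.ones(n), size=m).T        # O[x, h]   = o(x | h)

def A_matrix(x):
    # [A_x]_{h', h} = t(h' | h) o(x | h): scale column h of T by o(x | h).
    return T * O[x]

def matrix_forward(words):
    # p = 1^T A_{x_T} ... A_{x_1} pi, applying the first word's matrix first.
    f = pi
    for x in words:
        f = A_matrix(x) @ f
    return f.sum()

def brute_force(words):
    # Sum over all hidden-state sequences, directly from the HMM definition.
    total = 0.0
    for hs in itertools.product(range(m), repeat=len(words)):
        p = pi[hs[0]] * O[words[0], hs[0]]
        for i in range(1, len(words)):
            p *= T[hs[i], hs[i - 1]] * O[words[i], hs[i]]
        total += p
    return total

words = [0, 3, 1, 2]                           # a sentence of word ids
print(matrix_forward(words), brute_force(words))   # the two values agree
```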

The Spectral Algorithm: definitions

Define the following matrix P_{2,1} ∈ R^{n×n}:

  [P_{2,1}]_{i,j} = P(X2 = i, X1 = j)

Easy to derive an estimate:

  [P̂_{2,1}]_{i,j} = Count(X2 = i, X1 = j) / N

The Spectral Algorithm: definitions

For each word x, define the following matrix P_{3,x,1} ∈ R^{n×n}:

  [P_{3,x,1}]_{i,j} = P(X3 = i, X2 = x, X1 = j)

Easy to derive an estimate, e.g.:

  [P̂_{3,dog,1}]_{i,j} = Count(X3 = i, X2 = dog, X1 = j) / N
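A minimal sketch of the count-based estimates, where each sentence in a made-up toy corpus contributes its first three words (the vocabulary and sentences are illustrative only):

```python
import numpy as np

vocab = {"the": 0, "dog": 1, "saw": 2, "him": 3, "cat": 4}
corpus = [["the", "dog", "saw", "him"],
          ["the", "cat", "saw", "the", "dog"],
          ["the", "dog", "saw", "the", "cat"]]

n, N = len(vocab), len(corpus)
P21 = np.zeros((n, n))               # [P21]_{i,j}     ~ P(X2 = i, X1 = j)
P3x1 = np.zeros((n, n, n))           # [P3x1][x][i,j]  ~ P(X3 = i, X2 = x, X1 = j)

for sent in corpus:
    x1, x2, x3 = (vocab[w] for w in sent[:3])
    P21[x2, x1] += 1.0 / N
    P3x1[x2, x3, x1] += 1.0 / N

print(P21[vocab["dog"], vocab["the"]])                   # estimate of P(X2 = dog, X1 = the)
print(P3x1[vocab["dog"], vocab["saw"], vocab["the"]])    # estimate of P(X3 = saw, X2 = dog, X1 = the)
```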

Main Result Underlying the Spectral Algorithm

- Define the matrix P_{2,1} ∈ R^{n×n}:  [P_{2,1}]_{i,j} = P(X2 = i, X1 = j)
- For each word x, define the matrix P_{3,x,1} ∈ R^{n×n}:  [P_{3,x,1}]_{i,j} = P(X3 = i, X2 = x, X1 = j)
- SVD(P_{2,1}) gives U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}
- Definition:

  B_x = (U⊤ P_{3,x,1} V) Σ^{−1}
          (m×m)          (m×m)

- Theorem: if P_{2,1} is of rank m, then

  B_x = G A_x G^{−1}

  where G ∈ R^{m×m} is invertible

Why does this matter?

- Theorem: if P_{2,1} is of rank m, then B_x = G A_x G^{−1}, where G ∈ R^{m×m} is invertible
- Recall p(the dog saw him) = 1⊤ A_him A_saw A_dog A_the π: the forward algorithm through matrix multiplication!
- Now note that

  B_him B_saw B_dog B_the
    = G A_him G^{−1} G A_saw G^{−1} G A_dog G^{−1} G A_the G^{−1}
    = G A_him A_saw A_dog A_the G^{−1}

  The G's cancel!
- It follows that if we have b_∞ = 1⊤ G^{−1} and b_0 = G π, then

  b_∞ B_him B_saw B_dog B_the b_0 = 1⊤ A_him A_saw A_dog A_the π

The Spectral Learning Algorithm

1. Derive the estimates

   [P̂_{2,1}]_{i,j} = Count(X2 = i, X1 = j) / N

   and, for all words x,

   [P̂_{3,x,1}]_{i,j} = Count(X3 = i, X2 = x, X1 = j) / N

2. SVD(P̂_{2,1}) gives U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}

3. For all words x, define B_x = (U⊤ P̂_{3,x,1} V) Σ^{−1}, a product of m×m matrices
   (similar definitions for b_0, b_∞; details omitted)

4. For a new sentence x1 ... xn, calculate its probability, e.g.,

   p̂(the dog saw him) = b_∞ B_him B_saw B_dog B_the b_0
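The sketch below runs steps 2-4 end to end. To keep it verifiable, it is run on the exact population matrices of a made-up toy HMM rather than on counts, so the result can be compared with the forward algorithm; the definitions of b_0 and b_∞, which the slides omit, are taken from Hsu et al. (2009) (b_0 = U⊤P_1 and b_∞ = Σ^{−1}V⊤P_1, where P_1 is the unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 6
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T            # T[h2, h1] = t(h2 | h1)
O = rng.dirichlet(np.ones(n), size=m).T            # O[x, h]   = o(x | h)
A = {x: T * O[x] for x in range(n)}                # [A_x]_{h', h} = t(h' | h) o(x | h)

# Population moments implied by the model (in practice: the count-based estimates of step 1).
P1 = O @ pi                                        # P(X1 = i)
P21 = O @ T @ np.diag(pi) @ O.T                    # P(X2 = i, X1 = j)
P3 = {x: O @ A[x] @ T @ np.diag(pi) @ O.T for x in range(n)}   # P(X3 = i, X2 = x, X1 = j)

# Step 2: thin SVD of P21, keeping the top m singular directions.
U, s, Vt = np.linalg.svd(P21)
U, s, Vt = U[:, :m], s[:m], Vt[:m, :]

# Step 3: observable operators (b_0 and b_inf as in Hsu et al., 2009).
B = {x: U.T @ P3[x] @ Vt.T @ np.diag(1.0 / s) for x in range(n)}
b0 = U.T @ P1
binf = np.diag(1.0 / s) @ Vt @ P1

# Step 4: sentence probability, spectral operators vs. the forward algorithm.
def spectral_prob(words):
    v = b0
    for x in words:
        v = B[x] @ v
    return binf @ v

def forward_prob(words):
    f = pi
    for x in words:
        f = A[x] @ f
    return f.sum()

words = [0, 4, 2, 1]
print(spectral_prob(words), forward_prob(words))   # the two values agree
```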

Guarantees

- Throughout the algorithm we've used the estimates P̂_{2,1} and P̂_{3,x,1} in place of P_{2,1} and P_{3,x,1}
- If P̂_{2,1} = P_{2,1} and P̂_{3,x,1} = P_{3,x,1}, then the method is exact. But we will always have estimation errors
- A PAC-style theorem: fix some length T. To have

  Σ_{x1 ... xT} |p(x1 ... xT) − p̂(x1 ... xT)| ≤ ε     (L1 distance between p and p̂)

  with probability at least 1 − δ, the number of samples required is polynomial in

  n, m, 1/ε, 1/δ, 1/σ, T

  where σ is the m'th largest singular value of P_{2,1}

Intuition behind the Theorem

- Define

  ||Â − A||_2 = sqrt( Σ_{j,k} (Â_{j,k} − A_{j,k})² )

- With N samples, with probability at least 1 − δ,

  ||P̂_{2,1} − P_{2,1}||_2 ≤ ε     ||P̂_{3,x,1} − P_{3,x,1}||_2 ≤ ε

  where ε ≈ sqrt( (1/N) log(1/δ) ) + sqrt(1/N)

- Then we need to carefully bound how the error ε propagates through the SVD step, the various matrix multiplications, etc. The "rate" at which ε propagates depends on T, m, n, 1/σ

Summary

- The problem solved by EM: estimate the HMM parameters π(h), t(h′|h), o(x|h) from observation sequences x1 ... xn
- The spectral algorithm:
  - Calculate the estimates P̂_{2,1} (bigram counts) and P̂_{3,x,1} (trigram counts)
  - Run an SVD on P̂_{2,1}
  - Calculate parameter estimates using simple matrix operations
- Guarantee: we recover the parameters up to linear transforms that cancel

Outline

- Singular value decomposition
- Canonical correlation analysis
- Spectral learning of hidden Markov models
- Algorithm for latent-variable PCFGs

Problems with the spectral HMM learning algorithm

Parameters are masked by an unknown linear transformation:
- Negative marginals (due to sampling error)
- Parameters cannot be easily interpreted
- Cannot improve parameters using, for example, EM

Hsu et al. suggest a way to extract probabilities, but the method is unstable

This part of the talk

- Like the spectral algorithm, has theoretical guarantees
- Estimates are actual probabilities
- More efficient than EM
- Can be used to initialize EM, which then converges in an iteration or two

L-PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)

[Figure: a parse tree with latent-state refinements, e.g. S1 at the root]

The probability of a tree

[Figure: the tree (S1 (NP3 (D1 the) (N2 dog)) (VP2 (V4 saw) (P1 him)))]

p(tree, 1 3 1 2 2 4 1)
  = π(S1)
    · t(S1 → NP3 VP2 | S1)
    · t(NP3 → D1 N2 | NP3)
    · t(VP2 → V4 P1 | VP2)
    · q(D1 → the | D1)
    · q(N2 → dog | N2)
    · q(V4 → saw | V4)
    · q(P1 → him | P1)

p(tree) = Σ_{h1 ... h7} p(tree, h1 h2 h3 h4 h5 h6 h7)

Inside and Outside Trees

At the node VP:
- Outside tree o = the rest of the tree around the VP node: (S (NP (D the) (N dog)) VP)
- Inside tree t = the subtree rooted at VP: (VP (V saw) (P him))

Conditionally independent given the label and the hidden state:

  p(o, t | VP, h) = p(o | VP, h) p(t | VP, h)

Designing Feature Functions

Design functions ψ and φ:
- φ maps any inside tree to a binary vector of length d
- ψ maps any outside tree to a binary vector of length d′

  Outside tree o  →  ψ(o) = [0, 1, 0, 0, ..., 0, 0] ∈ R^{d′}
  Inside tree t   →  φ(t) = [1, 0, 0, 0, ..., 0, 0] ∈ R^{d}

Think of ψ and φ as multinomials: p(f) for f ∈ [d] and p(g) for g ∈ [d′].

Latent State Distributions

Think of f and g as representing a whole inside/outside tree.

Say we had a way of getting:
- p(f | h, VP) for each h and inside feature f
- p(g | h, VP) for each h and outside feature g

Then we could run EM on a convex problem to find the parameters. How?

Binary rule estimation

Take M samples of nodes with the rule VP → V NP. At sample i:
- g^(i) = the outside feature at VP
- f2^(i) = the inside feature at V
- f3^(i) = the inside feature at NP

t̂(h1, h2, h3 | VP → V NP) =
  arg max_{t̂} Σ_{i=1}^{M} log Σ_{h1,h2,h3} t̂(h1, h2, h3 | VP → V NP) · p(g^(i) | h1, VP) p(f2^(i) | h2, V) p(f3^(i) | h3, NP)
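A minimal numpy sketch of the EM procedure described above, assuming the latent-state distributions have already been computed and are passed in as arrays (all inputs below are synthetic; the point is the shape of the update):

```python
import numpy as np

def estimate_rule(pg, pf2, pf3, iters=100):
    """EM for t_hat(h1, h2, h3 | VP -> V NP).
    pg[i, h1]  = p(g^(i)  | h1, VP)
    pf2[i, h2] = p(f2^(i) | h2, V)
    pf3[i, h3] = p(f3^(i) | h3, NP)   for samples i = 1..M of the rule."""
    M, m = pg.shape
    # Fixed per-sample component likelihoods q[i, h1, h2, h3].
    q = pg[:, :, None, None] * pf2[:, None, :, None] * pf3[:, None, None, :]
    t = np.full((m, m, m), 1.0 / m ** 3)           # uniform initialization
    for _ in range(iters):
        r = t[None] * q                            # E-step: responsibilities
        r /= r.sum(axis=(1, 2, 3), keepdims=True)
        t = r.mean(axis=0)                         # M-step: average the responsibilities
    return t

rng = np.random.default_rng(0)
M, m = 200, 4
pg, pf2, pf3 = (rng.dirichlet(np.ones(m), size=M) for _ in range(3))
t_hat = estimate_rule(pg, pf2, pf3)
print(t_hat.sum())                                 # sums to 1 over (h1, h2, h3)
```

Because only the mixture weights t̂ are re-estimated while the component likelihoods stay fixed, this EM optimizes a concave log-likelihood, which is why it reaches the global maximum, as the next slide notes.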

Binary Rule Estimation

- Use Bayes' rule to convert t̂(h1, h2, h3 | VP → V NP) to t̂(VP → V NP, h2, h3 | VP, h1)
- The log-likelihood function is convex, and therefore EM converges to the global maximum
- Estimation of π and q is similar in flavor

Main question: how do we get the latent state distributions p(h | f, VP) and p(h | g, VP)?

Vector Representation of Inside and Outside Trees

Design functions Z and Y:
- Y maps any inside feature value f ∈ [d] to a vector of length m
- Z maps any outside feature value g ∈ [d′] to a vector of length m

Convention: m is the number of hidden states under the L-PCFG.

  Outside tree o  →  Z(g) = [1, 0.4, 5.3, ..., 72] ∈ R^m
  Inside tree t   →  Y(f) = [−3, 17, 2, ..., 3.5] ∈ R^m

Z and Y reduce the dimensionality of φ and ψ using CCA

Identifying Latent State Distributions

- For each f ∈ [d], define:

  v(f) = Σ_{g=1}^{d′} p(g | f, VP) Z(g) = E[Z(g) | f, VP]

  v(f) ∈ R^m is "the expected value of an outside tree (representation) given an inside tree (feature)"
- By conditional independence:

  v(f) = Σ_{h=1}^{m} p(h | f, VP) w(h)

  where w(h) ∈ R^m and w(h) = Σ_{g=1}^{d′} p(g | h, VP) Z(g) = E[Z(g) | h, VP]
- w(h) is "the expected value of an outside tree (representation) given a latent state"

Pivot Assumption

- Reminder: v(f) = Σ_{h=1}^{m} p(h | f, VP) w(h)
- If we knew w(h), we could find the latent state distributions:
  - Given an inside tree (feature f) and a node such as VP, compute v(f)
  - Solve

    arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^{m} p(h | f, VP) w(h) ||²

Assumption: for each latent state there is a "pivot feature value" f ∈ [d] such that p(h | f, VP) = 1.

Result of this: v(f) = w(h) for any pivot feature f

Identifying Latent State Distributions

- There are m pivot features {f1, ..., fm} such that v(f_h) = w(h)
- Then, for all f ∈ [d],

  v(f) = Σ_{h=1}^{m} p(h | f, VP) v(f_h)

- Therefore, we can identify p(h | f, VP) for all f by solving:

  arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^{m} p(h | f, VP) v(f_h) ||²

Identifying Pivot Features

- The v(f) are observable quantities and can be calculated from data
- Arora et al. (2012) showed how to find the pivot features
- Basic idea: find the corners of the convex hull spanned by the d features

Identifying Latent State Distributions

Algorithm:
- Identify m pivot features f1, ..., fm by finding the vertices of ConvexHull(v1, ..., vd) (Arora et al., 2012)
- Solve for each f ∈ [d]:

  arg min_{p(h|f,VP)} || v(f) − Σ_{h=1}^{m} p(h | f, VP) v(f_h) ||²

Output:
- Latent state distributions p(h | f, VP) for any f ∈ [d]

Can analogously get:
- Latent state distributions p(h | g, VP) for any g ∈ [d′]

We managed to extract latent state probabilities from observed data only!
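Here is a hedged sketch of the final step, assuming the v(f) vectors have already been computed from data and the pivot features are already known (both are synthetic below); each p(· | f, VP) is obtained by a simplex-constrained least-squares fit, solved here with scipy's SLSQP for concreteness.

```python
import numpy as np
from scipy.optimize import minimize

def recover_posteriors(v, pivot_idx):
    """For each feature f, solve  min_p || v(f) - sum_h p(h) v(f_h) ||^2
    subject to p lying on the probability simplex."""
    W = v[pivot_idx]                                   # rows are w(h) = v(f_h)
    d, m = v.shape[0], len(pivot_idx)
    bounds = [(0.0, 1.0)] * m
    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    P = np.zeros((d, m))
    for f in range(d):
        res = minimize(lambda p: np.sum((v[f] - p @ W) ** 2),
                       np.full(m, 1.0 / m), method="SLSQP",
                       bounds=bounds, constraints=cons)
        P[f] = res.x                                   # P[f, h] ~ p(h | f, VP)
    return P

# Synthetic check: each v(f) really is a convex combination of m pivot rows.
rng = np.random.default_rng(0)
d, m = 20, 3
W_true = rng.standard_normal((m, m))                   # w(h) vectors, one per latent state
P_true = rng.dirichlet(np.ones(m), size=d)             # true p(h | f, VP)
P_true[:m] = np.eye(m)                                 # the first m features are the pivots
v = P_true @ W_true
P_hat = recover_posteriors(v, pivot_idx=list(range(m)))
print(np.max(np.abs(P_hat - P_true)))                  # close to 0
```

The convex-hull step that actually finds the pivot features (Arora et al., 2012) is not implemented here; the sketch takes the pivot indices as given.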

Experiments - Language Modeling

- Saul and Pereira (1997):

  p(w2 | w1) = Σ_h p(w2 | h) p(h | w1)

  [Figure: graphical model with w1 and w2 connected through a latent variable h]

  This model is a specific case of an L-PCFG
- Experimented with bigram modeling on two corpora: the Brown corpus and the Gigaword corpus
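Viewed as matrices, the Saul and Pereira model says that the table of conditionals p(w2 | w1) factors into two stochastic matrices and therefore has rank at most m; a tiny made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 1000, 128
Q = rng.dirichlet(np.ones(m), size=V)        # Q[w1, h] = p(h | w1)
R = rng.dirichlet(np.ones(V), size=m)        # R[h, w2] = p(w2 | h)
bigram = Q @ R                               # bigram[w1, w2] = p(w2 | w1)

print(bigram[3].sum())                       # each row is a distribution (sums to 1)
print(np.linalg.matrix_rank(bigram) <= m)    # rank is at most m: True
```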

Results: perplexity

Brown:
- bigram Kneser-Ney: 408 (test: 415)
- trigram Kneser-Ney: 386 (test: 394)
- EM: 388 (m = 128), 365 (m = 256); test: 364; 98 EM iterations at m = 128
- pivot: 426 (m = 128), 597 (m = 256); test: 560
- pivot + EM: 310 (m = 128), 327 (m = 256); test: 357; 11 EM iterations at m = 128

NYT:
- bigram Kneser-Ney: 271 (test: 279)
- trigram Kneser-Ney: 150 (test: 158)
- EM: 284 (m = 128), 265 (m = 256); test: 267; 35 and 32 EM iterations
- pivot: 782 (m = 128), 886 (m = 256); test: 715
- pivot + EM: 279 (m = 128), 292 (m = 256); test: 281; 19 and 12 EM iterations

- Initialize EM with the pivot algorithm's output
- EM then converges in far fewer iterations
- Still consistent; called "two-step estimation" (Lehmann and Casella, 1998)

Results with EM (section 22 of the Penn treebank)

- Performance with expectation-maximization (m = 32): 88.56%
- Vanilla binarized PCFG maximum-likelihood estimation performance: 68.62%
- Performance with the spectral algorithm (Cohen et al., 2013): 88.82%

Inside features used

Consider the VP node in the following tree:

  (S (NP (D the) (N cat)) (VP (V saw) (NP (D the) (N dog))))

The inside features consist of:
- The pairs (VP, V) and (VP, NP)
- The rule VP → V NP
- The tree fragment (VP (V saw) NP)
- The tree fragment (VP V (NP D N))
- The pair of the head part-of-speech tag with VP: (VP, V)

Outside features used

Consider the D node in the following tree:

  (S (NP (D the) (N cat)) (VP (V saw) (NP (D the) (N dog))))

The outside features consist of:
- The tree fragments above the D node: (NP D N), (VP V (NP D N)), and (S NP (VP V (NP D N))), each with the D position marked as the foot
- The pair (D, NP) and the triplet (D, NP, VP)
- The pair of the head part-of-speech tag with D: (D, N)

Results

[Table: parsing accuracy on section 22 for m = 8, 16, 24, 32, comparing EM (with its iteration counts), the spectral algorithm of Cohen et al. (2013), the pivot algorithm, and pivot + EM (with its iteration counts); e.g. EM reaches 88.56 at m = 32]

Again, EM converges in very few iterations

Conclusion

Formal guarantees:
- Statistical consistency
- No problem of local maxima

Advantages over traditional spectral methods:
- No negative probabilities
- More intuitive to understand

Things we did not talk about

These algorithms can be kernelized (e.g. Song et al., 2010)

Many other algorithms similar in flavor (see reading list)
- Rely on some decomposition of observable quantities to get a handle ...

