Spectral Learning Algorithms For Natural Language Processing


Spectral Learning Algorithms for Natural Language Processing

Shay Cohen (Columbia University), Michael Collins (Columbia University), Dean Foster (University of Pennsylvania), Karl Stratos (Columbia University), and Lyle Ungar (University of Pennsylvania)

June 9, 2013

Latent-variable Models

Latent-variable models are used in many areas of NLP, speech, etc.:
- Latent-variable PCFGs (Matsuzaki et al.; Petrov et al.)
- Hidden Markov models
- Naive Bayes for clustering
- Lexical representations: Brown clustering, Saul and Pereira, etc.
- Alignments in statistical machine translation
- Topic modeling
- etc.

The expectation-maximization (EM) algorithm is generally used for estimation in these models (Dempster et al., 1977). Other relevant algorithms: co-training, clustering methods.

Example 1: Latent-Variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)

[Figure: two parse trees for "the dog saw him": an unannotated tree (S, NP, VP, D, N, V, P) and the same tree with latent-state annotations on each nonterminal, e.g. S1, NP3, VP2, D1, N2, V4, P1.]

Example 2: Hidden Markov Models

[Figure: an HMM with hidden states S1, S2, S3, S4 emitting "the dog saw him".]

- Parameterized by $\pi(s)$, $t(s' \mid s)$ and $o(w \mid s)$
- EM is used for learning the parameters

Example 3: Naive Bayes

[Figure: a Naive Bayes model with hidden variable H and observed variables X, Y.]

$p(h, x, y) = p(h)\, p(x \mid h)\, p(y \mid h)$

Example data points (word pairs): (the, dog), (I, saw), (ran, to), (John, was), ...

EM can be used to estimate the parameters.

Example 4: Brown Clustering and Related Models

Brown et al. (1992): each word w is assigned to a class C(w), and
$p(w_2 \mid w_1) = p(C(w_2) \mid C(w_1))\, p(w_2 \mid C(w_2))$

Saul and Pereira (1997): a latent variable h mediates between the two words,
$p(w_2 \mid w_1) = \sum_h p(h \mid w_1)\, p(w_2 \mid h)$

Example 5: IBM Translation Models

[Figure: a word alignment between "Por favor , desearia reservar una habitacion ." and "Please , I would like to book a room .", including a null token.]

- Hidden variables are alignments
- EM is used to estimate the parameters

Example 6: HMMs for Speech

- Phoneme boundaries are hidden variables

Co-training (Blum and Mitchell, 1998)

- Examples come in pairs of "views"
- Each view is assumed to be sufficient for classification
- E.g., Collins and Singer (1999): ". . . , says Mr. Cooper, a vice president of . . ."
  - View 1. Spelling features: "Mr.", "Cooper"
  - View 2. Contextual features: appositive = president

Spectral Methods

Basic idea: replace EM (or co-training) with methods based on matrix decompositions, in particular the singular value decomposition (SVD).

SVD: given a matrix A with m rows and n columns, approximate it as

$A_{jk} \approx \sum_{h=1}^{d} \sigma_h U_{jh} V_{kh}$

where the $\sigma_h$ are "singular values", and U and V are $m \times d$ and $n \times d$ matrices.

Remarkably, the optimal rank-d approximation can be found efficiently.

Similarity of SVD to Naive Bayes

Naive Bayes: $P(X = x, Y = y) = \sum_{h=1}^{d} p(h)\, p(x \mid h)\, p(y \mid h)$

SVD: $A_{jk} \approx \sum_{h=1}^{d} \sigma_h U_{jh} V_{kh}$

- The SVD approximation minimizes squared loss, not log-loss
- The $\sigma_h$ are not interpretable as probabilities
- $U_{jh}$ and $V_{kh}$ may be positive or negative, so they are not probabilities
- BUT we can still do a lot with SVD (and with higher-order, tensor-based decompositions)

CCA vs. Co-training

- Co-training assumption: 2 views, each sufficient for classification
- Several heuristic algorithms have been developed for this setting
- Canonical correlation analysis (CCA):
  - Take paired examples $x^{(i),1}, x^{(i),2}$
  - Transform them to $z^{(i),1}, z^{(i),2}$
  - The z's are linear projections of the x's
  - The projections are chosen to maximize the correlation between $z^1$ and $z^2$
  - Solvable using SVD!
  - Strong guarantees in several settings

One Example of CCA: Lexical Representations

- $x \in \mathbb{R}^d$ is a word, e.g.
  dog = $(0, 0, \ldots, 0, 1, 0, \ldots, 0, 0) \in \mathbb{R}^{200{,}000}$
- $y \in \mathbb{R}^{d'}$ is its context information, e.g.
  dog-context = $(11, 0, \ldots, 0, 917, 3, 0, \ldots, 0) \in \mathbb{R}^{400{,}000}$
- Use CCA on x and y to derive a low-dimensional representation, e.g.
  dog = $(0.03, 1.2, \ldots, 1.5) \in \mathbb{R}^{100}$

Spectral Learning of HMMs and L-PCFGs

- Simple algorithms: an SVD, followed by the method of moments in a low-dimensional space
- Close connection to CCA
- Guaranteed to learn (unlike EM) under assumptions on the singular values in the SVD

Spectral Methods in NLP

- Balle, Quattoni, Carreras, ECML 2011 (learning of finite-state transducers)
- Luque, Quattoni, Balle, Carreras, EACL 2012 (dependency parsing)
- Dhillon et al., 2012 (dependency parsing)
- Cohen et al., 2012, 2013 (latent-variable PCFGs)

Overview

- Basic concepts
  - Linear algebra refresher
  - Singular value decomposition
  - Canonical correlation analysis: algorithm
  - Canonical correlation analysis: justification
- Lexical representations
- Hidden Markov models
- Latent-variable PCFGs
- Conclusion

Matrices

$A \in \mathbb{R}^{m \times n}$: "matrix of dimensions m by n"

Example: $A = \begin{pmatrix} 3 & 1 & 4 \\ 0 & 2 & 5 \end{pmatrix}$, so $A \in \mathbb{R}^{2 \times 3}$

Vectors

$u \in \mathbb{R}^n$: "vector of dimension n"

Example: $u = \begin{pmatrix} 0 \\ 2 \\ 1 \end{pmatrix}$, so $u \in \mathbb{R}^3$

Matrix Transpose

$A^\top \in \mathbb{R}^{n \times m}$ is the transpose of $A \in \mathbb{R}^{m \times n}$

Example: $A = \begin{pmatrix} 3 & 1 & 4 \\ 0 & 2 & 5 \end{pmatrix}$, $A^\top = \begin{pmatrix} 3 & 0 \\ 1 & 2 \\ 4 & 5 \end{pmatrix}$

Matrix Multiplication

For matrices $B \in \mathbb{R}^{m \times d}$ and $C \in \mathbb{R}^{d \times n}$, the product $A = BC$ has dimensions $m \times n$.
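As a quick sanity check on these definitions, here is a minimal NumPy sketch using the toy matrix and vector from the refresher above (the code itself is not part of the tutorial):

```python
import numpy as np

# The 2x3 example matrix and 3-dimensional vector from the refresher slides.
A = np.array([[3.0, 1.0, 4.0],
              [0.0, 2.0, 5.0]])      # A in R^{2x3}
u = np.array([0.0, 2.0, 1.0])        # u in R^3

print(A.T.shape)      # transpose: (3, 2)
print(A @ u)          # matrix-vector product: a vector in R^2
B, C = A, A.T
print((B @ C).shape)  # (m x d)(d x n) -> (m x n), here (2, 2)
```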

Overview (outline repeated; next: Singular Value Decomposition)

Singular Value Decomposition (SVD)

$A = \sum_{i=1}^{d} \sigma^i\, u^i (v^i)^\top$

where A is $m \times n$, each $\sigma^i$ is a scalar, each $u^i$ is $m \times 1$, and each $(v^i)^\top$ is $1 \times n$.

- $d = \min(m, n)$
- $\sigma^1 \geq \ldots \geq \sigma^d \geq 0$
- $u^1 \ldots u^d \in \mathbb{R}^m$ are orthonormal: $\|u^i\|_2 = 1$ and $u^i \cdot u^j = 0$ for $i \neq j$
- $v^1 \ldots v^d \in \mathbb{R}^n$ are orthonormal: $\|v^i\|_2 = 1$ and $v^i \cdot v^j = 0$ for $i \neq j$

SVD in Matrix Form

$A = U \Sigma V^\top$

where U is $m \times d$, $\Sigma$ is $d \times d$, and $V^\top$ is $d \times n$:

- $U = [u^1 \ldots u^d] \in \mathbb{R}^{m \times d}$
- $V = [v^1 \ldots v^d] \in \mathbb{R}^{n \times d}$
- $\Sigma = \mathrm{diag}(\sigma^1, \ldots, \sigma^d) \in \mathbb{R}^{d \times d}$

Matrix Rank

For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq \min(m, n)$.

- rank(A) = number of linearly independent columns in A

Example: $\begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 1 & 2 \end{pmatrix}$ has rank 2, while $\begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 1 & 3 \end{pmatrix}$ has rank 3 (full rank).

Matrix Rank: Alternative Definition

- rank(A) = number of positive singular values of A

Example: for the first matrix above, $\Sigma \approx \mathrm{diag}(4.53,\ 0.7,\ 0)$, so the rank is 2; for the second, $\Sigma \approx \mathrm{diag}(5,\ 0.98,\ 0.2)$, so the rank is 3 (full rank).

SVD and Low-Rank Matrix Approximation

- Suppose we want to find
  $B^* = \arg\min_{B:\ \mathrm{rank}(B) \leq r} \sum_{jk} (A_{jk} - B_{jk})^2$
- Solution:
  $B^* = \sum_{i=1}^{r} \sigma^i\, u^i (v^i)^\top$
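The following short NumPy sketch (not from the tutorial; the matrix A is an arbitrary random example) illustrates the statement above: keeping the top r singular triples gives the best rank-r approximation in squared error, and the residual equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))             # an arbitrary example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
B = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # B* = sum_{i<=r} sigma^i u^i (v^i)^T

print(np.linalg.matrix_rank(B))             # 2
print(np.sum((A - B) ** 2))                 # squared error of the best rank-2 approximation
print(np.sum(s[r:] ** 2))                   # equals the sum of the discarded squared singular values
```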

SVD in Practice

- Black box, e.g., in Matlab
  - Input: a matrix A; output: scalars $\sigma^1 \ldots \sigma^d$ and vectors $u^1 \ldots u^d$ and $v^1 \ldots v^d$
  - Efficient implementations
  - Approximate, randomized approaches also available
- Can be used to solve a variety of optimization problems
  - For instance, canonical correlation analysis (CCA)

Overview (outline repeated; next: Canonical Correlation Analysis: Algorithm)

Canonical Correlation Analysis (CCA)

- Data consists of paired samples $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots n$
- As in co-training, $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathbb{R}^{d'}$ are two "views" of a sample point

Example (view 1 / view 2):
  $x^{(1)} = (1, 0, 0, 0)$,  $y^{(1)} = (1, 0, 0, 1, 0, 1, 0)$
  $x^{(2)} = (0, 0, 1, 0)$,  $y^{(2)} = (0, 1, 0, 0, 0, 0, 1)$
  ...
  $x^{(100000)} = (0, 1, 0, 0)$,  $y^{(100000)} = (0, 0, 1, 0, 1, 1, 1)$

Example of Paired Data: Webpage Classification (Blum and Mitchell, 1998)

- Task: determine whether a webpage is a course home page

[Figure: a course home page containing "Announcements", "Lectures", "TAs", "Information", with incoming links from an instructor's home page and a TA's home page.]

- View 1. Words on the page: "Announcements", "Lectures", ...
- View 2. Identities of pages pointing to the page: the instructor's home page, related course home pages
- Each view is sufficient for the classification!

Example of Paired Data: Named Entity Recognition (Collins and Singer, 1999)

- Task: identify an entity's type as Organization, Person, or Location
  ". . . , says Mr. Cooper, a vice president of . . ."
- View 1. Spelling features: "Mr.", "Cooper"
- View 2. Contextual features: appositive = president
- Each view is sufficient to determine the entity's type!

Example of Paired Data: Bigram Model

Naive Bayes bigram model: $p(h, x, y) = p(h)\, p(x \mid h)\, p(y \mid h)$, with training pairs such as (the, dog), (I, saw), (ran, to), (John, was), ...

- EM can be used to estimate the parameters of the model
- Alternatively, CCA can be used to derive vectors that can be used in a predictor, e.g.
  the $\mapsto (0.3, \ldots, 1.5)$,  dog $\mapsto (1.1, \ldots, 0.4)$

Projection Matrices

- Project samples to a lower-dimensional space: $x \in \mathbb{R}^d \mapsto x' \in \mathbb{R}^p$
- If p is small, we can learn with far fewer samples!
- CCA finds projection matrices $A \in \mathbb{R}^{d \times p}$ and $B \in \mathbb{R}^{d' \times p}$
- The new data points are $a^{(i)} \in \mathbb{R}^p$ and $b^{(i)} \in \mathbb{R}^p$, where
  $a^{(i)} = A^\top x^{(i)}$  ($p \times 1$ = ($p \times d$)($d \times 1$))
  $b^{(i)} = B^\top y^{(i)}$  ($p \times 1$ = ($p \times d'$)($d' \times 1$))

Mechanics of CCA: Step 1

Compute $\hat{C}_{XY} \in \mathbb{R}^{d \times d'}$, $\hat{C}_{XX} \in \mathbb{R}^{d \times d}$, and $\hat{C}_{YY} \in \mathbb{R}^{d' \times d'}$:

$[\hat{C}_{XY}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)}_j - \bar{x}_j)(y^{(i)}_k - \bar{y}_k)$

$[\hat{C}_{XX}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)}_j - \bar{x}_j)(x^{(i)}_k - \bar{x}_k)$

$[\hat{C}_{YY}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)}_j - \bar{y}_j)(y^{(i)}_k - \bar{y}_k)$

where $\bar{x} = \sum_i x^{(i)}/n$ and $\bar{y} = \sum_i y^{(i)}/n$.

Mechanics of CCA: Step 2

Do an SVD of $\hat{C}_{XX}^{-1/2}\, \hat{C}_{XY}\, \hat{C}_{YY}^{-1/2} \in \mathbb{R}^{d \times d'}$:

$\hat{C}_{XX}^{-1/2}\, \hat{C}_{XY}\, \hat{C}_{YY}^{-1/2} = U \Sigma V^\top$

Let $U_p \in \mathbb{R}^{d \times p}$ be the top p left singular vectors, and $V_p \in \mathbb{R}^{d' \times p}$ the top p right singular vectors.

Mechanics of CCA: Step 3

- Define projection matrices $A \in \mathbb{R}^{d \times p}$ and $B \in \mathbb{R}^{d' \times p}$:
  $A = \hat{C}_{XX}^{-1/2} U_p$,   $B = \hat{C}_{YY}^{-1/2} V_p$
- Use A and B to project each $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots n$:
  $x^{(i)} \in \mathbb{R}^d \mapsto A^\top x^{(i)} \in \mathbb{R}^p$,   $y^{(i)} \in \mathbb{R}^{d'} \mapsto B^\top y^{(i)} \in \mathbb{R}^p$
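A compact NumPy sketch of steps 1-3, assuming dense views; the small ridge term added to the covariance matrices for numerical stability and the toy data at the end are my own additions, not part of the slides:

```python
import numpy as np

def cca(X, Y, p, ridge=1e-6):
    """CCA via whitening + SVD, following steps 1-3 above.
    X: n x d matrix of view-1 samples, Y: n x d' matrix of view-2 samples."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Step 1: empirical covariance matrices (ridge added so the inverses exist).
    Cxx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    # Inverse square roots via eigendecomposition of the symmetric covariances.
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Step 2: SVD of the whitened cross-covariance.
    U, s, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is)

    # Step 3: projection matrices A (d x p) and B (d' x p).
    A = Cxx_is @ U[:, :p]
    B = Cyy_is @ Vt[:p, :].T
    return A, B, s[:p]       # s[:p] are the top canonical correlations

# Toy usage: two random views that share a 3-dimensional latent signal.
rng = np.random.default_rng(0)
H = rng.standard_normal((500, 3))
X = H @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((500, 10))
Y = H @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((500, 8))
A, B, corr = cca(X, Y, p=3)
print(corr)                  # close to 1 for the shared directions
```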

Input and Output of CCA

$x^{(i)} = (0, 0, 0, 1, 0, 0, 0, 0, 0, \ldots, 0) \in \mathbb{R}^{50{,}000}$  $\mapsto$  $a^{(i)} = (0.3, \ldots, 0.1) \in \mathbb{R}^{100}$

$y^{(i)} = (497, 0, 1, 12, 0, 0, 0, 7, 0, 0, 0, 0, \ldots, 0, 58, 0) \in \mathbb{R}^{120{,}000}$  $\mapsto$  $b^{(i)} = (0.7, \ldots, 0.2) \in \mathbb{R}^{100}$

Overview (outline repeated; next: Canonical Correlation Analysis: Justification)

Justification of CCA: Correlation Coefficients

The sample correlation coefficient for $a_1 \ldots a_n \in \mathbb{R}$ and $b_1 \ldots b_n \in \mathbb{R}$ is

$\mathrm{Corr}(\{a_i\}_{i=1}^n, \{b_i\}_{i=1}^n) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\ \sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}}$

where $\bar{a} = \sum_i a_i / n$ and $\bar{b} = \sum_i b_i / n$.

[Figure: a scatter plot of b against a illustrating correlation close to 1.]

Simple Case: p = 1

- The CCA projection matrices are vectors $u_1 \in \mathbb{R}^d$, $v_1 \in \mathbb{R}^{d'}$
- Project $x^{(i)}$ and $y^{(i)}$ to the scalars $u_1 \cdot x^{(i)}$ and $v_1 \cdot y^{(i)}$
- What vectors does CCA find? Answer:
  $(u_1, v_1) = \arg\max_{u, v} \mathrm{Corr}\big(\{u \cdot x^{(i)}\}_{i=1}^n, \{v \cdot y^{(i)}\}_{i=1}^n\big)$

Finding the Next Projections

After finding $u_1$ and $v_1$, what vectors $u_2$ and $v_2$ does CCA find? Answer:

$(u_2, v_2) = \arg\max_{u, v} \mathrm{Corr}\big(\{u \cdot x^{(i)}\}_{i=1}^n, \{v \cdot y^{(i)}\}_{i=1}^n\big)$

subject to the constraints

$\mathrm{Corr}\big(\{u_2 \cdot x^{(i)}\}_{i=1}^n, \{u_1 \cdot x^{(i)}\}_{i=1}^n\big) = 0$
$\mathrm{Corr}\big(\{v_2 \cdot y^{(i)}\}_{i=1}^n, \{v_1 \cdot y^{(i)}\}_{i=1}^n\big) = 0$

CCA as an Optimization Problem

CCA finds, for $j = 1 \ldots p$ (each column of A and B),

$(u_j, v_j) = \arg\max_{u, v} \mathrm{Corr}\big(\{u \cdot x^{(i)}\}_{i=1}^n, \{v \cdot y^{(i)}\}_{i=1}^n\big)$

subject to the constraints, for $k < j$,

$\mathrm{Corr}\big(\{u_j \cdot x^{(i)}\}_{i=1}^n, \{u_k \cdot x^{(i)}\}_{i=1}^n\big) = 0$
$\mathrm{Corr}\big(\{v_j \cdot y^{(i)}\}_{i=1}^n, \{v_k \cdot y^{(i)}\}_{i=1}^n\big) = 0$

Guarantees for CCA

- Assume the data is generated from a Naive Bayes model with hidden variable H and views X and Y
- The latent variable H is of dimension k; the variables X and Y are of dimension d and d' (typically $k \ll d$ and $k \ll d'$)
- Use CCA to project X and Y down to k dimensions (needs (x, y) pairs only!)
- Theorem: the projected samples are as good as the original samples for prediction of H (Foster, Johnson, Kakade, Zhang, 2009)
- Because $k \ll d$ and $k \ll d'$, we can learn to predict H with far fewer labeled examples

Guarantees for CCA (continued)

Kakade and Foster, 2007 - a co-training-style setting:
- Assume we have a regression problem: predict some value z given two "views" x and y
- Assumption: either view x or y is sufficient for prediction
- Use CCA to project x and y down to a low-dimensional space
- Theorem: if the correlation coefficients drop off to zero quickly, we need far fewer samples to learn when using the projected representation
- Very similar setting to co-training, but:
  - No assumption of independence between the two views
  - CCA is an exact algorithm - no need for heuristics

Summary of the Section

- SVD is an efficient optimization technique
  - Low-rank matrix approximation
- CCA derives a new representation of paired data that maximizes correlation
  - Uses SVD as a subroutine
- Next: use of CCA in deriving vector representations of words ("eigenwords")

Overview

- Basic concepts
- Lexical representations
  - Eigenwords, found using the thin SVD between words and context:
    - capture distributional similarity
    - contain POS and semantic information about words
    - are useful features for supervised learning
- Hidden Markov models
- Latent-variable PCFGs
- Conclusion

Uses of Spectral Methods in NLP

- Word sequence labeling
  - Part-of-speech (POS) tagging
  - Named entity recognition (NER)
  - Word sense disambiguation (WSD)
  - Chunking, prepositional phrase attachment, ...
- Language modeling
  - What is the most likely next word given a sequence of words (or of sounds)?
  - What is the most likely parse given a sequence of words?

Uses of Spectral Methods in NLP

- Word sequence labeling: semi-supervised learning
  - Use CCA to learn vector representations of words (eigenwords) on a large unlabeled corpus
  - Eigenwords map from words to vectors, which are used as features for supervised learning
- Language modeling: spectral estimation of probabilistic models
  - Use eigenwords to reduce the dimensionality of generative models (HMMs, ...)
  - Use those models to compute the probability of an observed word sequence

The Eigenword Matrix U

- U contains the singular vectors from the thin SVD of the bigram count matrix

Toy corpus: "I ate ham", "You ate cheese", "You ate". Bigram count matrix (rows = $w_{t-1}$, columns = $w_t$):

            ate   cheese   ham    I    You
  ate        0      1       1     0     0
  cheese     0      0       0     0     0
  ham        0      0       0     0     0
  I          1      0       0     0     0
  You        2      0       0     0     0

The Eigenword Matrix U

- U contains the singular vectors from the thin SVD of the bigram matrix over $(w_{t-1}, w_t)$; this is analogous to LSA, but uses context instead of documents
  - The context can be multiple neighboring words (we often use the words before and after the target)
  - The context can be neighbors in a parse tree
  - Eigenwords can also be computed using the CCA between words and their contexts
- Words close in the transformed space are distributionally, semantically and syntactically similar
- We will later use U in HMMs and parse trees to project words to low-dimensional vectors
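A minimal sketch of the construction, using the toy corpus from the slide above; in practice one would rescale the counts or use the CCA variant described below, and the variable names here are illustrative:

```python
import numpy as np

# Toy corpus from the slide above.
corpus = [["I", "ate", "ham"], ["You", "ate", "cheese"], ["You", "ate"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Bigram count matrix: rows = w_{t-1}, columns = w_t.
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        M[index[prev], index[cur]] += 1

# Thin SVD; the left singular vectors give a k-dimensional "eigenword" per word type.
k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
eigenwords = U[:, :k]

for w in vocab:
    print(w, np.round(eigenwords[index[w]], 3))
```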

Two Kinds of Spectral Models

- Context oblivious (eigenwords)
  - Learn a vector representation of each word type based on its average context
- Context sensitive (eigentokens, or state)
  - Estimate a vector representation of each word token based on its particular context, using an HMM or a parse tree

Eigenwords in Practice

- Work well with corpora of 100 million words
- We often use trigrams from the Google n-gram collection
- We generally use 30-50 dimensions
- Compute using fast randomized SVD methods

How Big Should Eigenwords Be?

- A 40-dimensional cube has $2^{40}$ (about a trillion) vertices
- More precisely, in a 40-dimensional space about $1.5^{40} \approx 11$ million vectors can all be approximately orthogonal
- So 40 dimensions gives plenty of space for a vocabulary of a million words

Fast SVD: Basic Method

Problem: find a low-rank approximation to an $n \times m$ matrix M.
Solution: find an $n \times k$ matrix A such that $M \approx A A^\top M$.

Construction. A is constructed by:
1. Create a random $m \times k$ matrix $\Omega$ (i.i.d. normals)
2. Compute $M \Omega$
3. Compute the thin SVD of the result: $U D V^\top = M \Omega$
4. Set $A = U$
Better: iterate a couple of times.

"Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions" by N. Halko, P. G. Martinsson, and J. A. Tropp.
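A minimal NumPy sketch of this randomized construction, with a couple of power-iteration passes standing in for "iterate a couple times"; the sizes, iteration count, and function name are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def randomized_range(M, k, n_iter=2, seed=0):
    """Return an n x k orthonormal A such that M ~= A A^T M (Halko et al. style)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    Omega = rng.standard_normal((m, k))       # step 1: random test matrix
    Y = M @ Omega                             # step 2: sample the range of M
    for _ in range(n_iter):                   # "iterate a couple times"
        Y = M @ (M.T @ Y)
    U, _, _ = np.linalg.svd(Y, full_matrices=False)   # step 3: thin SVD
    return U                                  # step 4: A = U

# Usage: approximate a matrix that is exactly rank 5.
rng = np.random.default_rng(1)
M = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 300))
A = randomized_range(M, k=5)
print(np.linalg.norm(M - A @ (A.T @ M)) / np.linalg.norm(M))   # tiny relative error
```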

[Figure: "Eigenwords for 'Similar' Words are Close" - a 2-D projection onto principal components in which units of measure (miles, inches, acres, pounds, bytes, barrels), kinship terms (wife, sister, brother, daughter, uncle, son, father, husband, mother, girl), and physical quantities (temperature, permeability, density, viscosity) each form their own cluster.]

[Figure: "Eigenwords Capture Part of Speech" - verbs such as sleep, drink, carry, eat, listen, agree, disagree cluster together in the projected space.]

[Figure: "Eigenwords: Pronouns" - pronouns cluster together in the projected space.]

[Figure: "Eigenwords: Numbers" - spelled-out numbers (one through nine), digits (1-10), and years (2007, 2008, 2009) form separate nearby clusters.]

[Figure: "Eigenwords: Names" - first names (e.g., lisa, david, john, michael, karen, linda, nancy, susan, jennifer, charles, dorothy, william, joseph) cluster together in the projected space.]

CCA has Nice Properties for Computing Eigenwords

- When computing the SVD of a word-context matrix (as above), we need to decide how to scale the counts
- Using raw counts gives more emphasis to common words
- Better: rescale
  - Divide each row by the square root of the total count of the word in that row
  - Rescale the columns to account for their redundancy
- CCA between words and their contexts does this automatically and optimally
  - CCA "whitens" the word-context covariance matrix

Semi-supervised Learning Problems

- Sequence labeling (named entity recognition, POS, WSD, ...)
  - X = target word
  - Z = context of the target word
  - label = person / place / organization / ...
- Topic identification
  - X = words in title
  - Z = words in abstract
  - label = topic category
- Speaker identification
  - X = video
  - Z = audio
  - label = which character is speaking

Semi-supervised Learning using CCA

- Find the CCA between X and Z
- Recall: CCA finds projection matrices A and B such that
  $\hat{x} = A^\top x$  ($k \times 1$ = ($k \times d$)($d \times 1$)),   $\hat{z} = B^\top z$  ($k \times 1$ = ($k \times d'$)($d' \times 1$))
- Project X and Z to estimate the hidden state: $(x, z) \mapsto (\hat{x}, \hat{z})$
  - Note: if x is the word and z is its context, then A is the matrix of eigenwords, $\hat{x}$ is the (context-oblivious) eigenword corresponding to word x, and $\hat{z}$ gives a context-sensitive "eigentoken"
- Use supervised learning to predict the label from the hidden state
  - and from the hidden states of neighboring words
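A small illustrative sketch of the feature construction described above: look up the projected representation of the target word and its neighbors and concatenate them into a feature vector for a downstream classifier. The stand-in eigenwords and names here are placeholders, not the tutorial's actual pipeline:

```python
import numpy as np

# `eigenwords` and `index` are placeholders; in practice they would come from
# the CCA/SVD of a large word-context matrix as described above.
rng = np.random.default_rng(0)
vocab = ["the", "dog", "saw", "him", "You", "ate", "cheese"]
index = {w: i for i, w in enumerate(vocab)}
k = 4
eigenwords = rng.standard_normal((len(vocab), k))    # stand-in projections

def token_features(sentence, t, window=1):
    """Concatenate the eigenwords of the word at position t and its neighbors."""
    feats = []
    for pos in range(t - window, t + window + 1):
        if 0 <= pos < len(sentence) and sentence[pos] in index:
            feats.append(eigenwords[index[sentence[pos]]])
        else:
            feats.append(np.zeros(k))    # sentence boundary / unknown word
    return np.concatenate(feats)         # features for a CRF, perceptron, etc.

print(token_features(["You", "ate", "cheese"], t=1).shape)   # (3 * k,) = (12,)
```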

Theory: CCA has Nice Properties

- If one uses CCA to map from target word and context (two views, X and Z) to a reduced-dimension hidden state, and then uses that hidden state as features in a linear regression to predict y, then we provably get almost as good a fit in the reduced dimension (e.g., 40) as in the original dimension (e.g., a million-word vocabulary)
- In contrast, principal components regression (PCR: regression based on PCA, which does not "whiten" the covariance matrix) can miss all of the signal
  [Foster and Kakade, '06]

Semi-supervised Results

- Find spectral features on unlabeled data
  - RCV-1 corpus: newswire
  - 63 million tokens in 3.3 million sentences
  - Vocabulary size: 300k
  - Size of embeddings: k = 50
- Use them in a discriminative model
  - CRF for NER
  - Averaged perceptron for chunking
- Compare against state-of-the-art embeddings
  - C&W, HLBL, Brown, ASO and semi-supervised CRF
  - Baseline features based on the identity of the word and its neighbors
- Benefit
  - Named entity recognition (NER): 8% error reduction
  - Chunking: 29% error reduction
  - Adding spectral features to a discriminative parser: 2.6% error reduction

Section Summary

- Eigenwords, found using the thin SVD between words and context:
  - capture distributional similarity
  - contain POS and semantic information about words
  - perform competitively with a wide range of other embeddings
  - the CCA version provides provable guarantees when used as features in supervised learning
- Next: eigenwords form the basis for fast estimation of HMMs and parse trees

A Spectral Learning Algorithm for HMMs

- Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012)
- The algorithm relies on a singular value decomposition followed by very simple matrix operations
- Close connections to CCA
- Under assumptions on the singular values arising from the model, it has PAC-learning-style guarantees (contrast with EM, which has problems with local optima)
- It is a very different algorithm from EM

Hidden Markov Models (HMMs)

[Figure: an HMM with hidden states H1, H2, H3, H4 emitting "the dog saw him".]

$p(\underbrace{\text{the dog saw him}}_{x_1 \ldots x_4},\ \underbrace{1\,2\,1\,3}_{h_1 \ldots h_4}) = \pi(1)\, t(2 \mid 1)\, t(1 \mid 2)\, t(3 \mid 1)\, o(\text{the} \mid 1)\, o(\text{dog} \mid 2)\, o(\text{saw} \mid 1)\, o(\text{him} \mid 3)$

- Initial parameters: $\pi(h)$ for each latent state h
- Transition parameters: $t(h' \mid h)$ for each pair of states $h', h$
- Observation parameters: $o(x \mid h)$ for each state h and observation x

Hidden Markov Models (HMMs)

Throughout this section:
- We use m to refer to the number of hidden states
- We use n to refer to the number of possible words (observations)
- Typically, $m \ll n$ (e.g., m = 20, n = 50,000)

HMMs: the Forward Algorithm

$p(\text{the dog saw him}) = \sum_{h_1, h_2, h_3, h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$

The forward algorithm:

$f^0_h = \pi(h)$
$f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$
$f^2_h = \sum_{h'} t(h \mid h')\, o(\text{dog} \mid h')\, f^1_{h'}$
$f^3_h = \sum_{h'} t(h \mid h')\, o(\text{saw} \mid h')\, f^2_{h'}$
$f^4_h = \sum_{h'} t(h \mid h')\, o(\text{him} \mid h')\, f^3_{h'}$
$p(\ldots) = \sum_h f^4_h$
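A small NumPy sketch of this recursion, following the slide's convention that step i folds in the emission of word i; the toy parameter values are made up:

```python
import numpy as np

m = 3                                   # number of hidden states
vocab = {"the": 0, "dog": 1, "saw": 2, "him": 3}

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(m))          # pi[h]     = initial probability of state h
T = rng.dirichlet(np.ones(m), size=m).T # T[h2, h1] = t(h2 | h1)
O = rng.dirichlet(np.ones(len(vocab)), size=m).T  # O[x, h] = o(x | h)

def forward(words):
    f = pi.copy()                       # f^0_h = pi(h)
    for w in words:
        x = vocab[w]
        # f^i_h = sum_{h'} t(h | h') o(word_i | h') f^{i-1}_{h'}
        f = T @ (O[x, :] * f)
    return f.sum()                      # p(x_1 ... x_N) = sum_h f^N_h

print(forward(["the", "dog", "saw", "him"]))
```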

HMMs: the Forward Algorithm in Matrix Form

- For each word x, define the matrix $A_x \in \mathbb{R}^{m \times m}$ as
  $[A_x]_{h',h} = t(h' \mid h)\, o(x \mid h)$
  e.g., $[A_{\text{the}}]_{h',h} = t(h' \mid h)\, o(\text{the} \mid h)$
- Define $\pi$ as the vector with elements $\pi_h$, and 1 as the vector of all ones
- Then
  $p(\text{the dog saw him}) = 1^\top A_{\text{him}}\, A_{\text{saw}}\, A_{\text{dog}}\, A_{\text{the}}\, \pi$
  The forward algorithm through matrix multiplication!
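The same toy HMM, now scored with the matrix form; this sketch constructs the operators $A_x$ and reproduces the value computed by the recursion above (the parameters are the same made-up ones):

```python
import numpy as np

# Same toy HMM setup as in the previous sketch (values are made up).
m = 3
vocab = {"the": 0, "dog": 1, "saw": 2, "him": 3}
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T
O = rng.dirichlet(np.ones(len(vocab)), size=m).T

# [A_x]_{h',h} = t(h' | h) o(x | h): scale column h of T by o(x | h).
A = {w: T * O[x, :] for w, x in vocab.items()}

ones = np.ones(m)
sentence = ["the", "dog", "saw", "him"]
p = pi
for w in sentence:                      # apply A_the, then A_dog, ...
    p = A[w] @ p
print(ones @ p)                         # 1^T A_him A_saw A_dog A_the pi
                                        # same value as forward() in the previous sketch
```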

The Spectral Algorithm: Definitions

Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:

$[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$

It is easy to derive an estimate:

$[\hat{P}_{2,1}]_{i,j} = \frac{\mathrm{Count}(X_2 = i, X_1 = j)}{N}$

The Spectral Algorithm: Definitions

For each word x, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:

$[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$

It is easy to derive an estimate, e.g.,

$[\hat{P}_{3,\text{dog},1}]_{i,j} = \frac{\mathrm{Count}(X_3 = i, X_2 = \text{dog}, X_1 = j)}{N}$

Main Result Underlying the Spectral Algorithm

- Define the matrix $P_{2,1} \in \mathbb{R}^{n \times n}$: $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
- For each word x, define the matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$: $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$
- $\mathrm{SVD}(P_{2,1}) \Rightarrow U \in \mathbb{R}^{n \times m},\ \Sigma \in \mathbb{R}^{m \times m},\ V \in \mathbb{R}^{n \times m}$
- Definition:
  $B_x = \underbrace{U^\top P_{3,x,1} V}_{m \times m}\ \underbrace{\Sigma^{-1}}_{m \times m}$
- Theorem: if $P_{2,1}$ is of rank m, then $B_x = G A_x G^{-1}$, where $G \in \mathbb{R}^{m \times m}$ is invertible
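A sketch of how the estimates plug together, on a small simulated HMM: estimate $\hat{P}_{2,1}$ and $\hat{P}_{3,x,1}$ from sampled triples as in the definitions above, take the SVD, and form $\hat{B}_x$. The toy model and sample size are arbitrary choices:

```python
import numpy as np

m, n = 2, 4
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T      # T[h2, h1] = t(h2 | h1)
O = rng.dirichlet(np.ones(n), size=m).T      # O[x, h]   = o(x | h)

def sample_triple():
    h1 = rng.choice(m, p=pi); x1 = rng.choice(n, p=O[:, h1])
    h2 = rng.choice(m, p=T[:, h1]); x2 = rng.choice(n, p=O[:, h2])
    h3 = rng.choice(m, p=T[:, h2]); x3 = rng.choice(n, p=O[:, h3])
    return x1, x2, x3

N = 100_000
P21 = np.zeros((n, n))                       # [P21]_{i,j} ~ P(X2 = i, X1 = j)
P3x1 = np.zeros((n, n, n))                   # P3x1[x][i, j] ~ P(X3 = i, X2 = x, X1 = j)
for _ in range(N):
    x1, x2, x3 = sample_triple()
    P21[x2, x1] += 1.0 / N
    P3x1[x2, x3, x1] += 1.0 / N

U, s, Vt = np.linalg.svd(P21)
U, V, Sigma = U[:, :m], Vt[:m, :].T, np.diag(s[:m])

B = {x: U.T @ P3x1[x] @ V @ np.linalg.inv(Sigma) for x in range(n)}
# Each B[x] should be close to G A_x G^{-1} for a fixed invertible G,
# up to sampling error in the estimated P matrices.
```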

Why Does This Matter?

- Theorem: if $P_{2,1}$ is of rank m, then $B_x = G A_x G^{-1}$, where $G \in \mathbb{R}^{m \times m}$ is invertible
- Recall $p(\text{the dog saw him}) = 1^\top A_{\text{him}} A_{\text{saw}} A_{\text{dog}} A_{\text{the}}\, \pi$: the forward algorithm through matrix multiplication!
- Now note that
  $B_{\text{him}} B_{\text{saw}} B_{\text{dog}} B_{\text{the}} = (G A_{\text{him}} G^{-1})(G A_{\text{saw}} G^{-1})(G A_{\text{dog}} G^{-1})(G A_{\text{the}} G^{-1}) = G A_{\text{him}} A_{\text{saw}} A_{\text{dog}} A_{\text{the}}\, G^{-1}$
  The G's cancel!
- It follows that if we also have $b_\infty^\top = 1^\top G^{-1}$ and $b_0 = G\pi$, then
  $p(\text{the dog saw him}) = b_\infty^\top B_{\text{him}} B_{\text{saw}} B_{\text{dog}} B_{\text{the}}\, b_0$
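A self-contained numerical check of this cancellation argument, using exact population values of $P_{2,1}$ and $P_{3,x,1}$ computed from a made-up HMM (so the identity holds exactly rather than up to sampling error); taking $G = U^\top O$ is the choice of G that makes the identity hold for this construction (cf. Hsu, Kakade and Zhang):

```python
import numpy as np

m, n = 2, 4
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T          # t(h' | h)
O = rng.dirichlet(np.ones(n), size=m).T          # o(x | h)

A = {x: T * O[x, :] for x in range(n)}           # [A_x]_{h',h} = t(h'|h) o(x|h)
P21 = O @ T @ np.diag(pi) @ O.T                  # P(X2 = i, X1 = j)
P3x1 = {x: O @ A[x] @ T @ np.diag(pi) @ O.T for x in range(n)}   # P(X3 = i, X2 = x, X1 = j)

U, s, Vt = np.linalg.svd(P21)
U, V, Sigma = U[:, :m], Vt[:m, :].T, np.diag(s[:m])
B = {x: U.T @ P3x1[x] @ V @ np.linalg.inv(Sigma) for x in range(n)}

G = U.T @ O                                      # invertible under the rank-m condition
b0, b_inf = G @ pi, np.linalg.inv(G).T @ np.ones(m)   # b_0 = G pi, b_inf^T = 1^T G^{-1}

sentence = [0, 2, 1, 3]                          # an arbitrary word-id sequence
p_forward = np.ones(m) @ np.linalg.multi_dot([A[x] for x in reversed(sentence)]) @ pi
p_spectral = b_inf @ np.linalg.multi_dot([B[x] for x in reversed(sentence)]) @ b0
print(p_forward, p_spectral)                     # identical up to floating-point error
```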

