Structured Learning


Structured Learning
With Applications to Natural Language Processing

Ryan McDonald

Joint work with: Fernando Pereira, Koby Crammer, Jeff Reynar, Kerry Hannan, Tyler Neylon

Classification

• Goal: learn a function F : X → Y
  – Or learn P(Y | X) or s(X, Y)
  – F(X) = argmax_Y P(Y | X), or F(X) = argmax_Y s(X, Y)
• Supervised setting
  – Assume a training set T = {(x1, y1), (x2, y2), ..., (x|T|, y|T|)}
  – Where each (xi, yi) ∈ X × Y
• Atomic classification
  – Binary: Y = {-1, +1}
  – Multiclass: Y = {0, 1, ..., M}
• Atomic classification is suitable for many problems
  – Text classification and categorization, object recognition, ...
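
As a toy illustration (mine, not from the slides) of the atomic case, prediction is just an argmax of a per-class linear score over the small label set Y; the labels and weights below are made up.

```python
def predict(W, x):
    """F(x) = argmax_y s(x, y) with a per-class linear score s(x, y) = w_y · x.
    W maps each class label to its weight vector; x is a dense feature list."""
    def s(y):
        return sum(wi * xi for wi, xi in zip(W[y], x))
    return max(W, key=s)

# Made-up three-class example: the class with the highest score wins.
W = {"pos": [0.5, -0.2, 1.0], "neg": [-0.3, 0.4, -0.8], "neutral": [0.0, 0.1, 0.0]}
print(predict(W, [1.0, 0.0, 2.0]))   # -> "pos" under these toy weights
```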

Structured Classification

• Some problems require classification of structured outputs
• For example, part-of-speech tagging:
  x = John hit the ball with the stick
  y = N    V   D   N    P    D   N
• Outputs y are a structured set of atomic decisions
• The output space has exponential size relative to the input
• Q: Why not just reduce to atomic classification by predicting each component independently?
• A: This makes an unrealistic independence assumption between the components of the output

Other examples

• Syntactic parsing
  – Input x is a sentence
  – Output y is a syntactic parse tree
• Machine translation
  – Input x is a sentence: "I don't speak French"
  – Output y is another sentence: "Je ne parle pas francais"
• Summarization
  – Input x is a set of documents
  – Output y is a set of relevant sentences
• Many more examples

Hidden Markov Models (DBNs)

• Generative sequence labeling models
  [Figure: HMM graphical model, hidden label sequence y generating observations x]
• Prominent in speech, NLP, vision, etc.
• Easy to train in the supervised setting
• Easy to incorporate hidden (missing) values
• Can be extended beyond sequences (DBNs)
• Relatively good empirical performance

Discriminative Models

• For many text classification problems, discriminative models have superior performance
• Discriminative models optimize a function more correlated with prediction accuracy
  – SVMs: minimize prediction error
  – Logistic regression: maximize likelihood of the conditional distribution
  – Perceptron: minimize error
• Unlike generative models, they do not model the input x
  – No need to make drastic independence assumptions over different properties of x

Discriminative Models

[Figure: discriminative graphical model over label sequence y and input x, modeling P(y1 ... ym | x) directly]

Some Definitions

• x is always an input, y always the output
  – x = x1 x2 ... xn (usually a sequence of words)
  – y = y1 y2 ... ym (output variables; for sequences, n = m)
• Gen(x) is the space of valid outputs for input x
• Φ(x, y) ∈ R^p maps input/output pairs to a high-dimensional feature representation
• w ∈ R^p is a corresponding weight vector
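
To make this notation concrete, here is a minimal sketch (mine, not from the slides) of a sparse feature map Φ(x, y) and the linear score s(x, y) = w · Φ(x, y) for sequence labeling; the feature templates and the dict-based weight vector are illustrative assumptions. The later sketches below reuse these two helpers.

```python
from collections import Counter

def phi(x, y):
    """Sparse Φ(x, y) as a Counter. The templates here (word/tag pairs and
    tag bigrams) are illustrative, not the talk's actual feature set."""
    feats = Counter()
    prev = "<START>"
    for word, tag in zip(x, y):
        feats[f"word={word},tag={tag}"] += 1    # emission-style feature
        feats[f"prev={prev},tag={tag}"] += 1    # transition-style feature
        prev = tag
    return feats

def score(w, x, y):
    """s(x, y) = w · Φ(x, y); w is a dict from feature name to weight."""
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())
```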

Conditional Random Fields (Lafferty et al. '01)

Linear classifiers:
  P(y | x) = Z⁻¹ exp(w · Φ(x, y))
  Z = Σ_{y'} exp(w · Φ(x, y'))

Training: set w to maximize the conditional log-likelihood of the training data (x1, y1), (x2, y2), ...
  w = argmax_w Σ_{(xi, yi)} log P(yi | xi)

Conditional Random Fields (Lafferty et al. '01)

Inference:
  argmax_{y ∈ Gen(x)} P(y | x) = argmax_{y ∈ Gen(x)} Z⁻¹ exp(w · Φ(x, y))
                               = argmax_{y ∈ Gen(x)} w · Φ(x, y)
  Gen(x) is exponential in the size of x

Training:
  w = argmax_w Σ_{(xi, yi)} log P(yi | xi)   (convex)
  ∇L = Σ_{(xi, yi)} [ Φ(xi, yi) − E_{P(Y | xi)} Φ(xi, Y) ]
  E_{P(Y | xi)} Φ(xi, Y) = Σ_{y ∈ Gen(xi)} P(y | xi) Φ(xi, y)
Just like multinomial logistic regression!
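
As a sanity check of this objective and gradient, here is a minimal brute-force sketch (mine, not from the slides) that enumerates Gen(x) directly; it reuses the phi/score helpers sketched above and is only practical when Gen(x) is tiny. Real CRF implementations compute Z and the feature expectations with forward-backward instead.

```python
import math
from collections import Counter

def crf_loglik_and_grad(w, x, y_gold, gen):
    """Conditional log-likelihood log P(y_gold | x) and its gradient
    Φ(x, y_gold) − E_{P(Y|x)}[Φ(x, Y)], by enumerating gen(x)."""
    outputs = gen(x)
    scores = [score(w, x, y) for y in outputs]
    m = max(scores)
    log_Z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log Z
    grad = Counter(phi(x, y_gold))
    for y, s in zip(outputs, scores):
        p = math.exp(s - log_Z)                                  # P(y | x)
        for f, v in phi(x, y).items():
            grad[f] -= p * v
    return score(w, x, y_gold) - log_Z, grad
```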

Markov Random Fields

• P(y | x) factors over clique "potentials" (Hammersley and Clifford '71)
  P(y | x) = Z⁻¹ exp( Σ_{c ∈ y} w · Φ(x, c) )
  [Figure: chain-structured graph over y1, y2, y3, ..., yn, conditioned on x]
• Sequences (1st order):
  P(y | x) = Z⁻¹ exp( Σ_{i=1..n} w · Φ(x, y_{i-1}, y_i) )

CRFs

• The fact that the probability distribution factors by cliques in the graph is useful
• E.g., sequence labeling
  – Inference: Viterbi (like HMMs)
  – Training: forward-backward algorithm (like HMMs)
• In general
  – Junction tree algorithms
  – Trees: belief propagation
  – Approximations: loopy BP, variational methods, MC
• Good overview: Sutton and McCallum 2006
  – http://www.cs.umass.edu/~casutton/publications/crf-tutorial.pdf
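
For the sequence case, here is a minimal Viterbi sketch (mine) for the first-order factorization above; local_score(x, i, prev, tag) stands in for w · Φ(x, y_{i-1}=prev, y_i=tag), and '<START>' is a hypothetical initial state.

```python
def viterbi(x, tags, local_score):
    """Exact argmax over tag sequences under a first-order (chain) factorization.
    Returns (best score, best tag sequence)."""
    best = {"<START>": (0.0, [])}          # tag at position i-1 -> (score, path)
    for i in range(len(x)):
        new_best = {}
        for tag in tags:
            new_best[tag] = max(
                (s + local_score(x, i, prev, tag), path + [tag])
                for prev, (s, path) in best.items()
            )
        best = new_best
    return max(best.values())
```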

Large-Margin Learning

• SVMs, AdaBoost, winnow, and other large-margin classifiers often represent state-of-the-art performance
• Other benefits
  – Easily kernelizable
  – Well-understood generalization error bounds
• What about structured learning problems?

Binary Large-Margin Classification

  min  ½ ||w||²
  such that:
    yi (w · Φ(xi) + b) ≥ 1   ∀ (xi, yi) ∈ T

Large-Margin Multiclass Classification (Crammer and Singer 2001)

  min  ½ ||w||²
  such that:
    s(xi, yi) - s(xi, y) ≥ L(yi, y)   ∀ y, ∀ (xi, yi) ∈ T
  where s(xi, y) = w · Φ(xi, y)

  L(y1, y2) ≥ 0, and L(y1, y2) = 0 iff y1 = y2
  O(|T| |Y|) constraints

Large-Margin Structured Classification (Altun et al. '03, Taskar et al. '03)

  min  ½ ||w||²
  such that:
    s(xi, yi) - s(xi, y) ≥ L(yi, y)   ∀ y ∈ Gen(xi), ∀ (xi, yi) ∈ T

• Exponential number of constraints!! O(|T| |Gen(x)|) constraints
• L(yi, y) depends on the problem
  – Sequence labeling: usually Hamming loss
  – F1, BLEU, ROUGE, labeled P/R, ...
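
To see what these constraints mean operationally, here is a minimal sketch (mine) of the structured hinge loss for one example: find the most violated constraint over Gen(xi). It reuses the score helper from the earlier sketch and enumerates Gen(x) by brute force; in practice this max is computed with loss-augmented inference.

```python
def hamming(y1, y2):
    """Hamming loss for sequence labeling: number of differing positions."""
    return sum(a != b for a, b in zip(y1, y2))

def structured_hinge_loss(w, x, y_gold, gen, loss=hamming):
    """max_{y in Gen(x)} [ L(y_gold, y) + s(x, y) ] - s(x, y_gold), clipped at 0:
    the amount by which the most violated margin constraint is violated."""
    gold = score(w, x, y_gold)
    worst = max(loss(y_gold, y) + score(w, x, y) for y in gen(x))
    return max(0.0, worst - gold)
```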

How do we solve this?

  min  ½ ||w||²
  such that:
    s(xi, yi) - s(xi, y) ≥ L(yi, y)   ∀ y ∈ Gen(xi), ∀ (xi, yi) ∈ T

• Taskar et al. '03 (M3Ns)
• Factor the constraints by the output structure (like CRFs)
• Assumes the loss factors relative to the output
  – Not always true
• Time consuming
  – No real large-scale public implementations (yet)

How do we solve this?

  min  ½ ||w||²
  such that:
    s(xi, yi) - s(xi, y) ≥ L(yi, y)   ∀ y ∈ Gen(xi), ∀ (xi, yi) ∈ T

• Tsochantaridis et al. '04
• Sample constraints smartly (cutting-plane algorithm)
• Assumes the loss factors relative to the output
  – But empirically not that important
• Time consuming
  – A little better than M3Ns

Online Large-Margin Structured Learning

• Basically a large-margin perceptron algorithm
• MIRA (Crammer & Singer '03, McDonald '05)

  T = {(x1, y1), ..., (x|T|, y|T|)}
  w(0) = 0; i = 0
  repeat N times:
    for t = 1 .. |T|:
      consider instance (xt, yt)
      w(i+1) = argmin_{w*} ½ ||w* − w(i)||²
               s.t. s(xt, yt) − s(xt, y) ≥ L(yt, y)  ∀ y ∈ k-Best-Gen(xt)
      i = i + 1
  return w(N·|T|)

• Make a small change to w at each step
• Each QP has just k constraints
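
Here is a minimal sketch (mine) of the k = 1 special case of the update above: with a single constraint the QP has a closed-form solution, a clipped perceptron-style step. It reuses the phi/score helpers from the earlier sketch; the optional cap C is an assumption, not part of the slide.

```python
from collections import Counter

def mira_update_1best(w, x, y_gold, y_pred, loss_value, C=float("inf")):
    """Smallest change to w (in L2) that makes s(x, y_gold) - s(x, y_pred)
    >= loss_value, with the step size clipped at C. Mutates and returns w."""
    if y_pred == y_gold:
        return w
    diff = Counter(phi(x, y_gold))
    diff.subtract(phi(x, y_pred))                 # Φ(x, y_gold) − Φ(x, y_pred)
    margin = score(w, x, y_gold) - score(w, x, y_pred)
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return w
    tau = min(C, max(0.0, (loss_value - margin) / norm_sq))
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + tau * v
    return w
```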

Online Large-Margin Structured Learning

• Advantages of MIRA
  – Fast (scales better than almost all structured learning algorithms)
  – Relies only on inference (e.g., Viterbi with a first-order Markov factorization)
    Robust to approximate inference (McDonald '06)
  – Easy to implement
  – Arbitrary loss functions
  – Provable convergence and generalization error bounds
• Disadvantages
  – Needs k-best inference
    But approximations work fine
  – Sub-optimal parameter settings

Preliminary Experiments (Seq. Classification)

Review

• Structured learning generalizes multiclass learning
• Use factorizations of the output structure to:
  – Run inference (Viterbi, CKY, junction tree)
  – Learn parameters (CRFs, M3Ns)
• Conditional Random Fields
  – Max likelihood (must compute the normalization Z)
• Large margin (M3Ns, MIRA, cutting plane)
  – Min error
  – Often needs only inference
  – Kernels, slack variables, ... (no time!)

Applications to NLP

• Syntactic dependency parsing
  – 14 languages
• Sentiment analysis
  – Bag-of-words vs. fine-to-coarse models
• Summarization
  – Multi-document summarization
  – Dialogue summarization: meetings, IM, discussions

Syntactic Dependency Parsing

• Given a sentence, construct head-modifier relationships
• A parse is a weakly connected graph G
  – Usually satisfies a tree constraint (single parent)
• Projective parsing (English, Chinese)
• Non-projective parsing (Dutch, Czech, German)

Syntactic Dependency Parsing

• Let x = x1, ..., xn be a sentence
• Let y be a dependency parse (tree)
  – Say (i, j) ∈ y if xi is the head of xj (i.e., xj modifies xi)
• Let s(x, y) be the score of parse y for x
• Factor the score by dependency edges:
  s(x, y) = Σ_{(i,j) ∈ y} s(x, i, j) = Σ_{(i,j) ∈ y} w · Φ(x, i, j)
• Example ("John hit the ball with the bat"):
  s(x, root, hit) + s(x, hit, John) + s(x, hit, ball) + s(x, hit, with)
    + s(x, ball, the) + s(x, with, bat) + s(x, bat, the)

Inference Algorithms

• Non-projective case
  – Maximum r-arborescence
  – Chu-Liu-Edmonds algorithm: O(n²)
  – K-best: O(kn²)
• Projective case
  – Eisner algorithm: O(n³)
  – K-best: O(k log(k) n³)
• Higher-order edge factorization
  – Non-projective: NP-hard (some good approximations)
  – Projective: still remains polynomial

Feature Representation Φ

• Identity of head and modifier
• Part-of-speech of head and modifier
• Part-of-speech of surrounding words
• Part-of-speech of words between head and modifier
• Distance between head and modifier
• Direction of dependency edge
• Prefix, suffix of head and modifier
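
A minimal sketch (mine, not the system's actual templates) of a few such edge features and the edge-factored parse score s(x, y) = Σ_{(h,m) ∈ y} w · Φ(x, h, m); it assumes x is a list of (word, POS) tokens with a dummy root at index 0.

```python
from collections import Counter

def edge_phi(x, h, m):
    """Illustrative edge features Φ(x, h, m) for head index h, modifier index m."""
    hw, hp = x[h]                               # head word, head POS
    mw, mp = x[m]                               # modifier word, modifier POS
    direction = "R" if m > h else "L"
    dist = min(abs(h - m), 5)                   # bucketed distance
    return Counter({
        f"hw={hw},mw={mw}": 1,
        f"hp={hp},mp={mp}": 1,
        f"hp={hp},mp={mp},dir={direction}": 1,
        f"dist={dist},dir={direction}": 1,
    })

def tree_score(w, x, tree):
    """Edge-factored score: sum of w · Φ(x, h, m) over edges (h, m) in the tree."""
    return sum(w.get(f, 0.0) * v
               for (h, m) in tree
               for f, v in edge_phi(x, h, m).items())
```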

Experiments

[Results chart: comparison against the highest-reporting system at CoNLL 2006]

Sentiment Analysis

• Classify sentences as positive, negative, no sentiment, fact, opinion, etc.
• Classify documents, paragraphs, etc. as well
• Useful in many mining problems
  – Translation, summarization, Q&A, ...

Fine-to-Coarse Sentiment Analysis

• Most previous work analyzes sentiment at:
  – Phrase
  – Sentence
  – Paragraph
  – Document
  Mainly bag-of-words, multiclass classification (Pang et al. '03, '04, '06)
• What if we want sentiment on multiple levels?
• Jointly learn sentiment on all levels
  – Higher levels influence lower-level decisions
    "It is a lot harder to use than it looks" - pos or neg?
  – Lower levels help with high-level decisions (Pang '04)
    Ignore neutral sentences; discourse cues (beginning / end)

Sentence-Document Level Model

[Figure: graphical model with a document label node connected to a chain of sentence label nodes over the input sentences]

• The document label is dependent on the sentences
• Sentences are dependent on their neighbors and the document label
• Directly models the relationship between the different levels

Inference

• Let x = s1, ..., sn be the sentences of a document
• Let y = (yd, ys) = (yd, ys_1, ..., ys_n) be a joint document-sentence labeling of sentiment
• Let s(x, y) be the score of a labeling for input x
• Factor the score over cliques:
  s(x, y) = Σ_i s(x, ys_i, ys_{i+1}, yd) = Σ_i w · Φ(x, ys_i, ys_{i+1}, yd)
  y* = argmax_y s(x, y)

Inference

• Note that if we fix the document label yd, then inference reduces to sequence classification
  – I.e., the sentence-label model forms a chain
• Algorithm for the joint case
  – For each yd:
    Find ys = ys_1, ..., ys_n maximizing s(x, (yd, ys)) (with Viterbi)
    If s(x, (yd, ys)) is better than the previous best, record it
• Linear in document labels, quadratic in sentence labels
• K-best is easy (just merge k-best lists)
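
A minimal sketch (mine) of the joint search just described: for each candidate document label, run a Viterbi pass over the sentence-label chain and keep the best combination. clique_score(x, i, prev, cur, yd) stands in for w · Φ(x, ys_{i-1}, ys_i, yd), and '<START>' is a hypothetical initial sentence state.

```python
def joint_inference(x, doc_labels, sent_labels, clique_score):
    """Returns (best score, best document label, best sentence labeling)."""
    best_overall = (float("-inf"), None, None)
    for yd in doc_labels:                      # linear in document labels
        chart = {"<START>": (0.0, [])}
        for i in range(len(x)):                # Viterbi over the sentence chain
            chart = {
                cur: max(
                    (s + clique_score(x, i, prev, cur, yd), path + [cur])
                    for prev, (s, path) in chart.items()
                )
                for cur in sent_labels         # quadratic in sentence labels
            }
        sc, ys = max(chart.values())
        if sc > best_overall[0]:
            best_overall = (sc, yd, ys)
    return best_overall
```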

Beyond Two-Level Models

• E.g., sentence, paragraph, document
• Nested hierarchical models
  – Inference is tractable (bottom-up dynamic programming)
• Can have dependencies across levels
  – But this may make inference hard

Experiments

• Document-Classifier
  – Bag-of-words document classifier
• Sentence-Classifier
  – Bag-of-words sentence classifier
• Sentence-Structured
  – Sentence classifier with sequential structure
• Joint-Structured (the models just discussed)

The End

• CRFs
  – Sutton and McCallum. An Introduction to Conditional Random Fields for Relational Learning. 2006.
  – http://www.inference.phy.cam.ac.uk/hmw26/crf/
• Large margin
  – Ben Taskar. Learning Structured Prediction Models: A Large Margin Approach. Stanford University, CA, December 2004.
• Learning with structured outputs
  – BakIr et al., eds. Predicting Structured Data. The MIT Press, 2006.
