Conditional Random Fields: An Introduction


Hanna M. Wallach
February 24, 2004
University of Pennsylvania CIS Technical Report MS-CIS-04-21

1 Labeling Sequential Data

The task of assigning label sequences to a set of observation sequences arises in many fields, including bioinformatics, computational linguistics and speech recognition [6, 9, 12]. For example, consider the natural language processing task of labeling the words in a sentence with their corresponding part-of-speech (POS) tags. In this task, each word is labeled with a tag indicating its appropriate part of speech, resulting in annotated text, such as:

(1) [PRP He] [VBZ reckons] [DT the] [JJ current] [NN account] [NN deficit] [MD will] [VB narrow] [TO to] [RB only] [# #] [CD 1.8] [CD billion] [IN in] [NNP September] [. .]

Labeling sentences in this way is a useful preprocessing step for higher natural language processing tasks: POS tags augment the information contained within words alone by explicitly indicating some of the structure inherent in language.

One of the most common methods for performing such labeling and segmentation tasks is to employ hidden Markov models [13] (HMMs) or probabilistic finite-state automata to identify the most likely sequence of labels for the words in any given sentence. HMMs are a form of generative model that defines a joint probability distribution p(X, Y), where X and Y are random variables ranging over observation sequences and their corresponding label sequences, respectively. In order to define a joint distribution of this nature, generative models must enumerate all possible observation sequences – a task which, for most domains, is intractable unless observation elements are represented as isolated units, independent of the other elements in an observation sequence. More precisely, the observation element at any given instant in time may only directly

depend on the state, or label, at that time. This is an appropriate assumption for a few simple data sets; however, most real-world observation sequences are best represented in terms of multiple interacting features and long-range dependencies between observation elements.

This representation issue is one of the most fundamental problems when labeling sequential data. Clearly, a model that supports tractable inference is necessary; however, a model that represents the data without making unwarranted independence assumptions is also desirable. One way of satisfying both these criteria is to use a model that defines a conditional probability p(Y | x) over label sequences given a particular observation sequence x, rather than a joint distribution over both label and observation sequences. Conditional models are used to label a novel observation sequence x* by selecting the label sequence y* that maximizes the conditional probability p(y* | x*). The conditional nature of such models means that no effort is wasted on modeling the observations, and one is free from having to make unwarranted independence assumptions about these sequences; arbitrary attributes of the observation data may be captured by the model, without the modeler having to worry about how these attributes are related.

Conditional random fields [8] (CRFs) are a probabilistic framework for labeling and segmenting sequential data, based on the conditional approach described in the previous paragraph. A CRF is a form of undirected graphical model that defines a single log-linear distribution over label sequences given a particular observation sequence.
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem [8], a weakness exhibited by maximum entropy Markov models [9] (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks [8, 11, 15].

2 Undirected Graphical Models

A conditional random field may be viewed as an undirected graphical model, or Markov random field [3], globally conditioned on X, the random variable representing observation sequences. Formally, we define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G, then (Y, X) is a conditional random field. In theory the structure of graph G may be arbitrary, provided it represents the conditional independencies in the label sequences being modeled. However, when modeling sequences, the simplest and most common graph structure encountered is that in which the nodes corresponding to elements of

Y form a simple first-order chain, as illustrated in Figure 1.

[Figure 1: Graphical structure of a chain-structured CRF for sequences: label nodes Y_1, Y_2, ..., Y_{n-1}, Y_n form a chain, globally conditioned on the observation sequence X = X_1, ..., X_{n-1}, X_n. The variables corresponding to unshaded nodes are not generated by the model.]

2.1 Potential Functions

The graphical structure of a conditional random field may be used to factorize the joint distribution over elements Y_v of Y into a normalized product of strictly positive, real-valued potential functions, derived from the notion of conditional independence.¹ Each potential function operates on a subset of the random variables represented by vertices in G. According to the definition of conditional independence for undirected graphical models, the absence of an edge between two vertices in G implies that the random variables represented by these vertices are conditionally independent given all other random variables in the model. The potential functions must therefore ensure that it is possible to factorize the joint probability such that conditionally independent random variables do not appear in the same potential function. The easiest way to fulfill this requirement is to require each potential function to operate on a set of random variables whose corresponding vertices form a maximal clique within G. This ensures that no potential function refers to any pair of random variables whose vertices are not directly connected and, if two vertices appear together in a clique, this relationship is made explicit. In the case of a chain-structured CRF, such as that depicted in Figure 1, each potential function will operate on pairs of adjacent label variables Y_i and Y_{i-1}.

It is worth noting that an isolated potential function does not have a direct probabilistic interpretation, but instead represents constraints on the configurations of the random variables on which the function is defined.
This in turn affects the probability of global configurations – a global configuration with a high probability is likely to have satisfied more of these constraints than a global configuration with a low probability.

¹ The product of a set of strictly positive, real-valued functions is not guaranteed to satisfy the axioms of probability. A normalization factor is therefore introduced to ensure that the product of potential functions is a valid probability distribution over the random variables represented by vertices in G.
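To make the factorization concrete, the sketch below builds a joint distribution over a three-label chain from strictly positive pairwise potentials, one per maximal clique, and normalizes their product. The potential values are invented purely for illustration and do not come from the report.

```python
import itertools
import math

# A toy chain of three binary labels Y1, Y2, Y3. Each maximal clique of
# the chain is a pair of adjacent labels, so the joint factorizes into
# two pairwise potentials. The values below are arbitrary.
labels = [0, 1]

def potential(y_prev, y_curr):
    # Strictly positive potential favouring equal adjacent labels.
    return 2.0 if y_prev == y_curr else 1.0

def score(y):
    # Unnormalized score of a configuration: product over cliques.
    return math.prod(potential(y[i - 1], y[i]) for i in range(1, len(y)))

# Normalization factor Z: sum of scores over all configurations.
Z = sum(score(y) for y in itertools.product(labels, repeat=3))

# The normalized scores form a valid probability distribution.
probs = {y: score(y) / Z for y in itertools.product(labels, repeat=3)}
total = sum(probs.values())
```

Note that no single potential here is a probability; only after dividing by Z does the product satisfy the axioms of probability, which is the role of the normalization factor described in the footnote above.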

3 Conditional Random Fields

Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

    \exp\Big( \sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i) \Big),    (2)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence; s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence; and λ_j and µ_k are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features b(x, i) of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

    b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is the word ``September''} \\ 0 & \text{otherwise.} \end{cases}

Each feature function takes on the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

    t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{otherwise.} \end{cases}

In the remainder of this report, notation is simplified by writing

    s(y_i, x, i) = s(y_{i-1}, y_i, x, i)

and

    F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i),

where each f_j(y_{i-1}, y_i, x, i) is either a state function s(y_{i-1}, y_i, x, i) or a transition function t(y_{i-1}, y_i, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as

    p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_j \lambda_j F_j(y, x) \Big).    (3)

Z(x) is a normalization factor.
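A brute-force implementation of (3) on a toy example may help clarify these definitions. The labels, the observation feature, and the weights below are all invented for illustration, not drawn from the report; only the functional form follows equations (2) and (3).

```python
import itertools
import math

# Toy label alphabet and observation sequence (invented).
LABELS = ["IN", "NNP", "OTHER"]
x = ["in", "September"]
n = len(x)

def b(x, i):
    # Observation feature: 1 if the word at position i is "September".
    return 1.0 if x[i] == "September" else 0.0

def f_transition(y_prev, y_curr, x, i):
    # Transition feature gated on the previous/current label pair.
    return b(x, i) if (y_prev == "IN" and y_curr == "NNP") else 0.0

def f_state(y_prev, y_curr, x, i):
    # State feature depending only on the current label.
    return 1.0 if y_curr == "IN" else 0.0

features = [f_transition, f_state]
lam = [1.5, 0.5]  # weights lambda_j, fixed by hand rather than estimated

def F(j, y, x):
    # Global feature F_j(y, x): sum of f_j over positions, with a dummy
    # "start" label before position 0.
    ys = ["start"] + list(y)
    return sum(features[j](ys[i], ys[i + 1], x, i) for i in range(n))

def unnorm(y):
    return math.exp(sum(lam[j] * F(j, y, x) for j in range(len(lam))))

# Z(x) by brute-force summation over all |LABELS|^n label sequences.
Z = sum(unnorm(y) for y in itertools.product(LABELS, repeat=n))
p = {y: unnorm(y) / Z for y in itertools.product(LABELS, repeat=n)}
best = max(p, key=p.get)
```

With these invented weights the highest-probability labeling is ("IN", "NNP"), since it fires both the transition feature at the word "September" and the state feature at "in". The exponential cost of enumerating label sequences here is exactly what the matrix and dynamic-programming methods of the later sections avoid.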

4 Maximum Entropy

The form of a CRF, as given in (3), is heavily motivated by the principle of maximum entropy – a framework for estimating probability distributions from a set of training data. The entropy of a probability distribution [16] is a measure of uncertainty and is maximized when the distribution in question is as uniform as possible. The principle of maximum entropy asserts that the only probability distribution that can justifiably be constructed from incomplete information, such as finite training data, is that which has maximum entropy subject to a set of constraints representing the information available. Any other distribution will involve unwarranted assumptions [7].

If the information encapsulated within training data is represented using a set of feature functions such as those described in the previous section, the maximum entropy model distribution is that which is as uniform as possible while ensuring that the expectation of each feature function with respect to the empirical distribution of the training data equals the expected value of that feature function with respect to the model distribution. Identifying this distribution is a constrained optimization problem that can be shown [2, 10, 14] to be satisfied by (3).

5 Maximum Likelihood Parameter Inference

Assuming the training data {(x^{(k)}, y^{(k)})} are independently and identically distributed, the product of (3) over all training sequences, as a function of the parameters λ, is known as the likelihood, denoted by p({y^{(k)}} | {x^{(k)}}, λ). Maximum likelihood training chooses parameter values such that the logarithm of the likelihood, known as the log-likelihood, is maximized.
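The maximum-entropy principle of Section 4 can be checked numerically on a toy problem. Among all distributions over three outcomes whose expected feature value matches a target, the maximum-entropy one has the exponential form of (3) with a single parameter λ, which the sketch below tunes by bisection. All numbers are invented for illustration.

```python
import math

# Outcomes {0, 1, 2}, a single feature f(x) = x, and a made-up constraint
# E[f] = 1.2. The maximum-entropy solution has the form p(x) ∝ exp(λ x).
outcomes = [0, 1, 2]
target = 1.2

def model(lam):
    # Log-linear family member with parameter lam, normalized.
    w = [math.exp(lam * x) for x in outcomes]
    Z = sum(w)
    return [v / Z for v in w]

def expectation(p):
    return sum(px * x for px, x in zip(p, outcomes))

# Bisection on lam: expectation(model(lam)) is increasing in lam.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if expectation(model(mid)) < target:
        lo = mid
    else:
        hi = mid
p_star = model((lo + hi) / 2)

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

# Any other distribution satisfying the constraint has lower entropy,
# e.g. q, which puts mass only on outcomes 1 and 2 with E[f] = 1.2.
q = [0.0, 0.8, 0.2]
```

Here entropy(p_star) exceeds entropy(q) even though both satisfy the constraint, illustrating why the constrained entropy maximization singles out the log-linear form.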
For a CRF, the log-likelihood is given by

    L(\lambda) = \sum_k \Big[ \log \frac{1}{Z(x^{(k)})} + \sum_j \lambda_j F_j(y^{(k)}, x^{(k)}) \Big].

This function is concave, guaranteeing convergence to the global maximum. Differentiating the log-likelihood with respect to parameter λ_j gives

    \frac{\partial L(\lambda)}{\partial \lambda_j} = E_{\tilde{p}(Y, X)}[F_j(Y, X)] - \sum_k E_{p(Y \mid x^{(k)}, \lambda)}\big[ F_j(Y, x^{(k)}) \big],

where p̃(Y, X) is the empirical distribution of the training data and E_p[·] denotes expectation with respect to distribution p. Note that setting this derivative to zero yields the maximum entropy model constraint: the expectation of each feature with respect to the model distribution is equal to the expected value under the empirical distribution of the training data.

It is not possible to analytically determine the parameter values that maximize the log-likelihood – setting the gradient to zero and solving for λ does not always yield a closed form solution. Instead, maximum likelihood parameters must be identified using an iterative technique such as iterative scaling [5, 1, 10] or gradient-based methods [15, 17].

6 CRF Probability as Matrix Computations

For a chain-structured CRF in which each label sequence is augmented by start and end states, y_0 and y_{n+1}, with labels start and end respectively, the probability p(y | x, λ) of label sequence y given an observation sequence x may be efficiently computed using matrices.

Letting 𝒴 be the alphabet from which labels are drawn and y and y' be labels drawn from this alphabet, we define a set of n + 1 matrices {M_i(x) | i = 1, ..., n + 1}, where each M_i(x) is a |𝒴| × |𝒴| matrix with elements of the form

    M_i(y', y \mid x) = \exp\Big( \sum_j \lambda_j f_j(y', y, x, i) \Big).

The unnormalized probability of label sequence y given observation sequence x may be written as the product of the appropriate elements of the n + 1 matrices for that pair of sequences:

    p(y \mid x, \lambda) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x).

Similarly, the normalization factor Z(x) for observation sequence x may be computed from the set of M_i(x) matrices using closed semirings, an algebraic structure that provides a general framework for solving path problems in graphs. Omitting details, Z(x) is given by the (start, end) entry of the product of all n + 1 M_i(x) matrices:

    Z(x) = \Big[ \prod_{i=1}^{n+1} M_i(x) \Big]_{\text{start}, \text{end}}.    (4)
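Equation (4) can be checked numerically: the (start, end) entry of the matrix product equals the brute-force sum of path scores. In the sketch below the M_i entries are filled with arbitrary positive scores standing in for the exponentiated feature sums, and interior positions are summed over the full alphabet, which is exactly what the matrix product computes.

```python
import numpy as np

# Chain CRF with alphabet {start, A, B, end} over a length-2 observation.
# Each M_i entry stands in for exp(sum_j lambda_j f_j(y', y, x, i));
# here the scores are arbitrary positive numbers, for illustration only.
labels = ["start", "A", "B", "end"]
idx = {y: j for j, y in enumerate(labels)}
n = 2
rng = np.random.default_rng(0)

# One (|Y| x |Y|) matrix per position i = 1, ..., n + 1.
Ms = [np.exp(rng.normal(size=(4, 4))) for _ in range(n + 1)]

# Z(x) via equation (4): the (start, end) entry of the matrix product.
Z = np.linalg.multi_dot(Ms)[idx["start"], idx["end"]]

# Brute-force check: sum path scores over all interior label choices,
# with y_0 = start and y_{n+1} = end fixed. The matrix product sums over
# every label at interior positions, so the brute force must too.
brute = 0.0
for y1 in labels:
    for y2 in labels:
        path = ["start", y1, y2, "end"]
        score = 1.0
        for i in range(n + 1):
            score *= Ms[i][idx[path[i]], idx[path[i + 1]]]
        brute += score
```

The brute force costs |𝒴|^n path evaluations, whereas the matrix product costs only n + 1 matrix multiplications, which is the point of the construction.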

7 Dynamic Programming

In order to identify the maximum-likelihood parameter values – irrespective of whether iterative scaling or gradient-based methods are used – it must be possible to efficiently compute the expectation of each feature function with respect to the CRF model distribution for every observation sequence x^{(k)} in the training data, given by

    E_{p(Y \mid x^{(k)}, \lambda)}\big[ F_j(Y, x^{(k)}) \big] = \sum_y p(Y = y \mid x^{(k)}, \lambda) F_j(y, x^{(k)}).    (5)

Performing such calculations in a naïve fashion is intractable due to the required sum over label sequences: if observation sequence x^{(k)} has n elements, there are |𝒴|^n possible corresponding label sequences. Summing over this number of terms is prohibitively expensive.

Fortunately, the right-hand side of (5) may be rewritten as

    \sum_{i=1}^{n} \sum_{y', y} p(Y_{i-1} = y', Y_i = y \mid x^{(k)}, \lambda) f_j(y', y, x^{(k)}, i),    (6)

eliminating the need to sum over |𝒴|^n sequences. Furthermore, a dynamic programming method, similar to the forward-backward algorithm for hidden Markov models, may be used to calculate p(Y_{i-1} = y', Y_i = y | x^{(k)}, λ).

Defining forward and backward vectors – α_i(x) and β_i(x) respectively – by the base cases

    \alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}

and

    \beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \text{end} \\ 0 & \text{otherwise} \end{cases}

and the recurrence relations

    \alpha_i(x)^T = \alpha_{i-1}(x)^T M_i(x)

and

    \beta_i(x) = M_{i+1}(x) \beta_{i+1}(x),

the probability of Y_{i-1} and Y_i taking on labels y' and y given observation sequence x^{(k)} may be written as

    p(Y_{i-1} = y', Y_i = y \mid x^{(k)}, \lambda) = \frac{\alpha_{i-1}(y' \mid x) \, M_i(y', y \mid x) \, \beta_i(y \mid x)}{Z(x)}.

Z(x) is given by the (start, end) entry of the product of all n + 1 M_i(x) matrices, as in (4). Substituting this expression into (6) yields an efficient dynamic programming method for computing feature expectations.
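The forward-backward recurrences above can be sketched directly. The M_i matrices again hold arbitrary positive scores in place of exponentiated feature sums, and the computed pairwise marginals each sum to one, as they must.

```python
import numpy as np

# Forward-backward sketch for a chain CRF with labels {start, A, B, end}
# over a length-3 observation; the M_i entries are invented positive
# scores standing in for exp(sum_j lambda_j f_j(y', y, x, i)).
labels = ["start", "A", "B", "end"]
S, idx = len(labels), {y: j for j, y in enumerate(labels)}
n = 3
rng = np.random.default_rng(1)
Ms = [np.exp(rng.normal(size=(S, S))) for _ in range(n + 1)]  # M_1..M_{n+1}

# Base cases: alpha_0 is the indicator of "start", beta_{n+1} of "end".
alpha = [np.zeros(S) for _ in range(n + 2)]
beta = [np.zeros(S) for _ in range(n + 2)]
alpha[0][idx["start"]] = 1.0
beta[n + 1][idx["end"]] = 1.0

# Recurrences: alpha_i^T = alpha_{i-1}^T M_i and beta_i = M_{i+1} beta_{i+1}.
for i in range(1, n + 2):
    alpha[i] = alpha[i - 1] @ Ms[i - 1]
for i in range(n, -1, -1):
    beta[i] = Ms[i] @ beta[i + 1]

# Z(x) as in equation (4), recovered here from the forward pass.
Z = alpha[n + 1][idx["end"]]

def pair_marginal(i):
    # p(Y_{i-1} = y', Y_i = y | x) = alpha_{i-1}(y') M_i(y', y) beta_i(y) / Z.
    return np.outer(alpha[i - 1], beta[i]) * Ms[i - 1] / Z

# Each pairwise marginal sums to 1 over (y', y).
totals = [pair_marginal(i).sum() for i in range(1, n + 2)]
```

Plugging these pairwise marginals into (6) gives each feature expectation in time linear in n, rather than the exponential cost of summing over all label sequences in (5).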

References

[1] A. L. Berger. The improved iterative scaling algorithm: A gentle introduction, 1997.
[2] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[3] P. Clifford. Markov random fields in statistics. In G. Grimmett and D. Welsh, editors, Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pages 19–32. Oxford University Press, 1990.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill, 1990.
[5] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.
[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[7] E. T. Jaynes. Information theory and statistical mechanics. The Physical Review, 106(4):620–630, May 1957.
[8] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
[9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning, 2000.
[10] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. Technical Report CMU-CS-95-144, Carnegie Mellon University, 1995.
[11] D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Table extraction using conditional random fields. In Proceedings of ACM SIGIR, 2003.
[12] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice-Hall Signal Processing Series. Prentice-Hall, 1993.
[13] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–285, 1989.
[14] A. Ratnaparkhi. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, 1997.

[15] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, NAACL 2003, 2003.
[16] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, 1948.
[17] H. M. Wallach. Efficient training of conditional random fields. Master's thesis, University of Edinburgh, 2002.
