ICMLA 2010 Tutorial on Sequence Labeling: Generative and Discriminative Approaches


ICMLA 2010 Tutorial on Sequence Labeling: Generative and Discriminative Approaches Hidden Markov Models, Conditional Random Fields and Structured SVMs Hakan Erdogan Sabanci University Istanbul, Turkey December 2010 ICMLA 2010, Bethesda, MD Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Sequence labeling problem definition I Given a sequence of observations/feature vectors, determine an appropriate label/state for each observation. We will assume the observations can be discrete or continuous, scalar or vector. We assume the labels/states are discrete and from a finite set. Try to reduce errors by considering the relations between the observations and states (observation-state) and the relations between neighboring states (state-state). [Figure: a time axis t with an example state sequence y_t, e.g. 5 5 5 5 5 2 2 2 2, showing a run of Label 5 followed by a run of Label 2] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Sequence labeling applications Speech recognition Part-of-speech tagging Shallow parsing Handwriting recognition Protein secondary structure prediction Video analysis Facial expression dynamic modeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Urns and balls example Assume there are two urns with black and white balls [Rabiner, 1989]. One urn has more black than white (90% vs 10%) and vice versa. Someone pulls out one ball at a time, shows it to us without revealing which urn he uses, and puts it back into the urn. He is more likely to use the same urn (90% chance) once he starts using one. We are looking only at the sequence of balls and recording them. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
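A minimal sketch of the urns-and-balls process as a two-state HMM sampler, assuming only the numbers stated above (90%/10% ball mixes, 90% chance of staying with the same urn); the function and variable names are illustrative:

```python
import numpy as np

# Two-state HMM for the urns-and-balls example.
# Urn 0 holds 90% black / 10% white balls, urn 1 the reverse;
# whichever urn is in use is kept with probability 0.9.
rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],   # state-transition matrix P(urn_t | urn_{t-1})
              [0.1, 0.9]])
B = np.array([[0.9, 0.1],   # emission matrix P(ball | urn); columns: 0 = black, 1 = white
              [0.1, 0.9]])

def sample_sequence(T=20):
    """Draw a sequence of hidden urns and observed ball colors."""
    urns, balls = [], []
    urn = rng.integers(2)                  # start from a random urn
    for _ in range(T):
        ball = rng.choice(2, p=B[urn])     # draw a ball from the current urn
        urns.append(int(urn))
        balls.append(int(ball))
        urn = rng.choice(2, p=A[urn])      # possibly switch urns
    return urns, balls

urns, balls = sample_sequence()
print("hidden urns :", urns)
print("observed    :", balls)
```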

Sequence Labeling Questions about the urns and balls example Questions of interest: (1) Can we predict which urn is used at a given time? (2) What is the probability of observing the sequence of balls shown to us? (3) Can we estimate/learn the ratio of balls in each urn by looking at a long sequence of balls if we did not know the ratios beforehand? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Jason Eisner’s ice-cream example Try to guess whether the weather was hot or cold by observing only how many ice-creams (0, 1, 2 or 3) Jason ate each day in a sequence of 30 days. Two states and observations with 4 distinct values (discrete observations). Question: Can we determine if a day was hot or cold given the sequence of ice-creams consumed by Jason? An example Excel sheet is available online (it illustrates the forward-backward algorithm). The example is also adopted in [Jurafsky and Martin, 2008]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
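The forward-backward computation that the spreadsheet illustrates can be sketched in a few lines of numpy; the transition and emission probabilities below are illustrative placeholders, not the values used in Eisner's spreadsheet:

```python
import numpy as np

# Forward-backward sketch for a hot/cold weather HMM with
# observations = number of ice-creams eaten (0, 1, 2 or 3).
states = ["hot", "cold"]
pi = np.array([0.5, 0.5])                    # initial state probabilities (assumed)
A  = np.array([[0.8, 0.2],                   # P(state_t | state_{t-1}) (assumed)
               [0.2, 0.8]])
B  = np.array([[0.05, 0.15, 0.40, 0.40],     # P(#ice-creams | hot) (assumed)
               [0.40, 0.40, 0.15, 0.05]])    # P(#ice-creams | cold) (assumed)

def forward_backward(obs):
    T, S = len(obs), len(pi)
    alpha, beta = np.zeros((T, S)), np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # P(state_t | whole sequence)

obs = [2, 3, 3, 1, 0, 0, 2]                  # ice-creams eaten on each day
for day, p in enumerate(forward_backward(obs)):
    print(f"day {day}: P(hot)={p[0]:.2f}  P(cold)={p[1]:.2f}")
```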

Sequence Labeling Human activity labeling in an exercise video Assume we are given an exercise video of a single person and we are interested in labeling the actions of the person as either “standing”, “squatting” or “lying down” (assume for now that no other action is present). We track the subject and have a bounding box around her/him at each frame of the video. We consider as features x_t = [h_t, w_t]^T, where h_t is the height of the bounding box and w_t is the width of the bounding box. So, we have continuous (real) observations and three labels. Question: Given the height and width of the bounding boxes in all frames, can we determine the action type in each frame? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Approach, notation and variables We will first analyze binary and multi-class classification with linear models. Multi-class classification will be the basis for understanding the sequence labeling problem. Then, we will introduce HMM, CRF, and structured SVM approaches for sequence labeling. Notation: x is an observed feature vector, x_t a feature vector at sequence position t, x_{1:T} a sequence of feature vectors; y is a discrete label (or state), y ∈ Y, where Y = {−1, +1} for binary classification and Y = [M] = {1, 2, . . . , M} for multi-class classification; y_t is the label/state at sequence position t, y_{1:T} is a sequence of labels/states; w and w̃ are parameter vectors, w_j is the jth component; F(x_{1:T}, y_{1:T}) is a feature vector for CRF and structured SVM. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Linear binary classification I Given training data {(x_i, y_i) : i ∈ [N]}, where x_i ∈ ℝ^d, y_i ∈ Y = {−1, +1} and [N] = {1, . . . , N}; x_i = [x_{i,1}, x_{i,2}, . . . , x_{i,d}]^T are feature vectors and y_i are class identities. Find a weight vector w̃ ∈ ℝ^{d+1} such that for test data x, we obtain a score w̃^T x̃ = w^T x + b, where: x̃ = [x^T, 1]^T is the augmented feature vector, w̃ = [w^T, b]^T is the augmented weight vector and b is called the bias term. w̃ represents a hyperplane in ℝ^{d+1}; it is the normal vector to the hyperplane {x̃ : w̃^T x̃ = 0} that passes through the origin. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
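A tiny numpy sketch of the augmented-vector bookkeeping; the numbers are arbitrary placeholders standing in for a trained w and b:

```python
import numpy as np

# Scoring with the augmented feature and weight vectors.
x = np.array([1.5, -0.3, 2.0])     # a d-dimensional feature vector
w = np.array([0.4, -1.2, 0.7])     # weight vector (placeholder values)
b = -0.5                           # bias term (placeholder value)

x_tilde = np.append(x, 1.0)        # x̃ = [x^T, 1]^T
w_tilde = np.append(w, b)          # w̃ = [w^T, b]^T

score = w_tilde @ x_tilde          # w̃^T x̃ = w^T x + b
assert np.isclose(score, w @ x + b)
print("score:", score)
```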

Binary Classifiers Linear binary classification II [Figure: geometry of the separating hyperplane w^T x + b = 0, showing the normal vector w, a point x at signed distance (w^T x + b)/||w|| from the hyperplane, and the hyperplane at distance −b/||w|| from the origin] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Linear binary classification The score can be used to classify a new sample x into one of two classes by thresholding: ŷ = +1 if w̃^T x̃ ≥ T, and ŷ = −1 if w̃^T x̃ < T. We can obtain an ROC curve (or DET curve) by changing the threshold T. By default, we may assume T = 0 since the bias term is included in the model. One can also obtain posterior probabilities p(y|x) from the score w̃^T x̃. The problem: how to obtain the “best” weight vector w̃? We consider three different classifiers: Fisher’s linear discriminant, logistic regression and support vector machines. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Fisher’s linear discriminant (FLD) I Assume each class’ data are multi-variate Gaussian distributed with a common covariance matrix (homoscedastic): p(x|y) = N(x; µ_y, Σ), where Σ is the common covariance matrix, the µ_y are the class means, and N(x; µ_y, Σ) = ((2π)^d |Σ|)^{-1/2} exp{−(1/2)(x − µ_y)^T Σ^{-1} (x − µ_y)}. Using Bayes’ theorem, we can show that p(y=1|x) = p(x|y=1)p(y=1) / Σ_{y'∈Y} p(x|y')p(y') = σ(w^T x + b), where σ(t) = (1 + exp(−t))^{-1} Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Fisher’s linear discriminant (FLD) II is the logistic sigmoid function, and w = Σ^{-1}(µ_1 − µ_{−1}), b = −(1/2) µ_1^T Σ^{-1} µ_1 + (1/2) µ_{−1}^T Σ^{-1} µ_{−1} + log(p(y=1)/p(y=−1)) (1) Fisher’s linear discriminant yields a linear boundary for classification. Fisher’s linear discriminant is a generative model for x conditioned on y. It seems a waste to find two class means (of size 2d) and a common covariance matrix (of size d(d+1)/2) when in the end what matters for classification is the weight vector w and the bias b (of size d+1 only). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
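A minimal sketch of fitting FLD via the closed form in equation (1), assuming a pooled-covariance estimate as in the comments; the function name and the toy data are illustrative:

```python
import numpy as np

def fld_fit(X, y, prior_pos=None):
    """FLD weights from equation (1); X is N x d, y holds labels in {-1, +1}."""
    X_pos, X_neg = X[y == 1], X[y == -1]
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # pooled (common) covariance estimate over both classes
    Sigma = (np.cov(X_pos.T) * (len(X_pos) - 1)
             + np.cov(X_neg.T) * (len(X_neg) - 1)) / (len(X) - 2)
    if prior_pos is None:
        prior_pos = len(X_pos) / len(X)        # class prior p(y = 1) from counts
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu_pos - mu_neg)
    b = (-0.5 * mu_pos @ Sigma_inv @ mu_pos
         + 0.5 * mu_neg @ Sigma_inv @ mu_neg
         + np.log(prior_pos / (1 - prior_pos)))
    return w, b

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = fld_fit(X, y)
print("w =", w, " b =", b)
```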

Binary Classifiers FLD [Figure: two scatter plots (axes: Feature 1, Feature 2) with the FLD decision boundary; left: modeling assumption valid, right: modeling assumption invalid] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Logistic regression I Discriminative linear model, as opposed to FLD which is a generative model. Assume p(y=1|x) = σ(w̃^T x̃), where σ(.) is the logistic sigmoid [Bishop, 2006]. Find parameters w̃ by maximizing the conditional log-likelihood (CLL) of the class identities y, p(y|x) (instead of p(x|y) as in FLD, a generative model), using the training data. Define π_i(w̃) = σ(w̃^T x̃_i); the estimate of w̃ is then arg max_w̃ { Σ_{i: y_i=1} log(π_i(w̃)) + Σ_{i: y_i=−1} log(1 − π_i(w̃)) } Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Logistic regression II The gradient of the negative log-likelihood with respect to the parameters is [Bishop, 2006]: ∇L(w̃) = Σ_{i: y_i=1} x̃_i (π_i(w̃) − 1) + Σ_{i: y_i=−1} x̃_i π_i(w̃). There is an iterative reweighted least squares (IRLS) algorithm to solve for w̃, given in [Bishop, 2006]. We only estimate d+1 parameters directly from the data. There is no need to assume a distribution p(x|y) since the x’s are given to us in a training scenario. So, we only model p(y|x), which becomes “discriminative modeling”. The assumed parametric form of p(y|x) comes from the generative FLD model though! Question: Can I get back discriminatively trained class-conditional Gaussian distributions from the logistic regression result w̃? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
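A minimal sketch that minimizes the negative CLL by plain gradient descent using the gradient above (IRLS would converge faster in practice); the toy data, step size and iteration count are arbitrary assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logreg_fit(X_tilde, y, lr=0.1, n_iter=500):
    """Binary logistic regression; X_tilde is N x (d+1) with a trailing 1-column,
    y holds labels in {-1, +1}."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X_tilde @ w)                          # pi_i = sigma(w̃^T x̃_i)
        grad = (X_tilde[y == 1].T @ (pi[y == 1] - 1)       # gradient of the NLL
                + X_tilde[y == -1].T @ pi[y == -1])
        w -= lr * grad / len(y)                            # gradient-descent step
    return w

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
X_tilde = np.hstack([X, np.ones((100, 1))])
print("learned w̃ =", logreg_fit(X_tilde, y))
```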

Binary Classifiers Logistic regression III Answer is yes, choosing µy and Σ consistent with w̃ obtained through logistic regression using equation 2 will give discriminatively trained parameters for Gaussians (note that they will not be ML trained parameters) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers FLD vs. LR [Figure: two scatter plots (axes: Feature 1, Feature 2); left: FLD when the modeling assumption is invalid, right: LR with the same data] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines I A discriminative classifier that tries to maximize the soft margin between classes to increase generalizability: min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^N ξ_i subject to: y_i(w^T x_i + b) ≥ 1 − ξ_i (2), ξ_i ≥ 0 (3) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines II The constrained optimization is simply equivalent to minimizing the following objective function wrt w and b without constraints (called the primal formulation): min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^N (1 − y_i(w^T x_i + b))_+ where (x)_+ = max(x, 0) and C is a fixed parameter to be determined. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
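A sketch that minimizes this unconstrained primal objective by simple (sub)gradient descent, only to make the objective concrete; a QP or SMO solver would be used in practice, and all numbers below are toy values:

```python
import numpy as np

def svm_primal_fit(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Subgradient descent on 1/2||w||^2 + C * sum_i (1 - y_i(w^T x_i + b))_+ ."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                                  # samples with nonzero hinge loss
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = svm_primal_fit(X, y)
print("w =", w, " b =", b)
```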

Binary Classifiers Support vector machines III Usually, the dual formulation is used to solve the SVM problem since it appears to be easier, results in sparsity due to support vectors and enables using kernel functions for nonlinear separability: max_α Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j subject to Σ_{i=1}^N α_i y_i = 0 (4), 0 ≤ α_i ≤ C (5), where the α_i are the dual variables (Lagrange multipliers, one for each constraint, that is for each training sample). The optimal weight vector can be found by w = Σ_{i=1}^N α_i y_i x_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines IV [Figure: SVM margin geometry with hyperplanes w^T x + b = +1, 0, −1 and margin width 2/||w||; points with ξ_i = 0, α_i = 0 (irrelevant), ξ_i = 0, 0 < α_i < C (non-boundary SV), and ξ_i > 0, α_i = C (boundary SV)] There is sparsity in the α_i and most α_i turn out to be zero. The x_i that correspond to nonzero α_i are called support vectors. If α_i = C, then the x_i are boundary support vectors. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines V The support vectors are the only training samples that matter in determining the optimal weights (hence the separating hyperplane). To find the optimal b, take any nonzero α_i and use the equation α_i (y_i(w^T x_i + b) − 1) = 0 (which comes from the KKT conditions), or average the values obtained from all nonzero α_i ’s. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Regularized empirical risk (RER) minimization I Unifying framework for different linear classifiers. Consider a predictor function f_θ(x): ℝ^d → ℝ^m that can be used to predict output labels y from features x, where m is the number of prediction values (discriminants). For the linear binary classification problem, m = 1, θ = w̃ and f_w̃(x) = w̃^T x̃. Define a “loss function” L: ℝ^m × Y → ℝ that measures the cost of a prediction. We define the expected risk as follows: R(θ) = ∫ L(f_θ(x), y) p(x, y) dx dy Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Regularized empirical risk (RER) minimization II Since the joint distribution p(x, y) is usually unknown, we can find an estimate of the risk from the training data, called the “empirical risk”, which needs to be minimized wrt the parameters θ: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i). This can be interpreted as the training set error rate. Usually we add a penalty term Ω(θ) for the parameters, which penalizes complex (or large) predictor functions, to arrive at what is called the “regularized empirical risk”: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) + Ω(θ) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
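A short sketch of the regularized empirical risk for a linear predictor, with an assumed L2 penalty λ||w̃||^2 standing in for Ω(θ) and the loss passed in as a callable:

```python
import numpy as np

def regularized_empirical_risk(w_tilde, X_tilde, y, loss, lam=0.01):
    """(1/N) * sum_i loss(w̃^T x̃_i, y_i) + lam * ||w̃||^2 (L2 penalty assumed)."""
    scores = X_tilde @ w_tilde
    return np.mean([loss(s, yi) for s, yi in zip(scores, y)]) + lam * (w_tilde @ w_tilde)

# usage with the hinge loss of a later slide
hinge = lambda s, y: max(0.0, 1.0 - y * s)
# risk = regularized_empirical_risk(w_tilde, X_tilde, y, hinge)
```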

Binary Classifiers Regularized empirical risk (RER) minimization III Note that “structured risk” minimization is similar where the regularization function is replaced by the Vapnik-Chervonenkis (VC) confidence of the predictor function (which is also a measure of complexity of the predictor) All previous methods (FLD, LR and SVM) can be seen as variants of regularized empirical risk minimization by utilization of a different loss function as we elaborate in the following slides Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Loss functions I RER for linear binary classification is as follows: R̂(w̃) = (1/N) Σ_{i=1}^N L(w̃^T x̃_i, y_i) + Ω(w̃). Using different loss functions yields different methods for linear binary classification. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers LS loss L(w̃^T x̃, y) = (w̃^T x̃ − t_y)^2 Here the t_y are regression targets for each class in least squares regression. Using the LS loss with no regularization is equivalent to FLD when t_y = y N/N_y, where N_y is the number of samples in class y [Bishop, 2006]. LS-SVM uses t_y = y and a quadratic regularizer [Suykens and Vandewalle, 1999]. Without regularization, LS-SVM is similar to FLD. Regularized LDA [Friedman, 1989] is equivalent to using a regularization function in the RER framework. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers BNLL loss L(w̃^T x̃, y) = log{1 + exp(−y w̃^T x̃)} Using the binomial negative log-likelihood (BNLL) loss function with no regularization is equivalent to performing a logistic regression classification. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Hinge loss L(w̃^T x̃, y) = (1 − y w̃^T x̃)_+ Using the hinge loss function and a regularization function Ω(w̃) = λ w̃^T D w̃ (where D is a diagonal matrix with all ones on the diagonal except for the last entry, which is zero) is equivalent to performing an SVM classification. Note that Ω(w̃) = λ w̃^T w̃ is also used, which is slightly different (as in liblinear [Fan et al., 2008]). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Other loss functions There are many other loss functions defined in the literature [LeCun et al., 2006] Huber-hinge loss is one which smooths the edge of the hinge loss so that we end up with a differentiable loss function Squared-hinge loss is the square of the hinge loss which is also differentiable (called L2 -SVM) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Loss function plots [Figure: illustration of the loss value L(z) as a function of the margin z = y w^T x for the absolute, hinge, squared hinge, Huber hinge, logistic and LS losses] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
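The plotted losses can be written directly as functions of the margin z = y w^T x so their shapes can be compared numerically (or plotted); the Huber-hinge smoothing below is one common variant and its parameter h is an assumption:

```python
import numpy as np

# Loss functions of the margin z = y * w^T x (labels in {-1, +1}).
def absolute(z):      return np.abs(1.0 - z)
def hinge(z):         return np.maximum(0.0, 1.0 - z)
def squared_hinge(z): return np.maximum(0.0, 1.0 - z) ** 2
def huber_hinge(z, h=0.5):
    # quadratically smoothed hinge around z = 1 (one common variant, h assumed)
    out = np.zeros_like(z)
    out = np.where(z < 1 - h, 1.0 - z, out)
    out = np.where((z >= 1 - h) & (z < 1 + h), (1 + h - z) ** 2 / (4 * h), out)
    return out
def logistic(z):      return np.log1p(np.exp(-z))
def ls(z):            return (1.0 - z) ** 2     # LS loss with targets t_y = y, in terms of z

z = np.linspace(-0.5, 2.0, 6)
for name, f in [("absolute", absolute), ("hinge", hinge), ("sq-hinge", squared_hinge),
                ("huber-hinge", huber_hinge), ("logistic", logistic), ("ls", ls)]:
    print(f"{name:12s}", np.round(f(z), 3))
```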

Binary Classifiers Regularization functions I Although FLD and LR originally do not use regularization, it has been shown to be beneficial in all cases. Typically an L2-norm regularization Ω(w̃) = λ||w̃||^2 is used; λ is a hyper-parameter that needs to be determined (next slide). Regularization arguably helps in all classifiers because it improves test accuracy by penalizing the complexity of the classifier, avoids overtraining/overfitting, and avoids numerical problems in the implementation. The L1-norm is used when sparse parameter vectors are desired; for example, the L1-regularized L1-SVM is considered in [Mangasarian, 2006]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Estimating the hyperparameters I The hyperparameter λ can be found by a grid search on validation data, that is, by evaluating the performance of the trained classifier on the validation data. Multiple-fold cross-validation on the training data can also be performed for this purpose. The Bayesian “evidence framework” may be used with some losses to find the hyperparameters; this requires re-interpreting the empirical risk optimization as a maximum a posteriori probability (MAP) estimation problem [Hoffmann, 2007] - Bayesian LDA. Perturbation analysis may be used [Zheng et al., 2009]. Other ad-hoc methods exist as well. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers RER optimization methods LS and logistic losses are differentiable, so primal algorithms such as Newton’s method work very well LS loss has closed form solution For logistic loss, Newton’s method results in iterative re-weighted least squares (IRLS) formulation For hinge loss, one needs to go to the dual QP problem and solve it in the dual For squared-hinge or huber-hinge losses, one can solve it in the primal domain as well libsvm and liblinear are good libraries for solving RER formulations Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers How to obtain nonlinear classifiers? I Sometimes, class data are not separable by a line! [Figure: a 2-D dataset (axes: Feature 1, Feature 2) that is not linearly separable] Replace x with φ(x), which is a nonlinear mapping into a higher dimensional space. If φ(x) is known explicitly, you may use it instead of x to solve the problem. Use the kernel trick to replace inner products with the kernel function: φ(x_i)^T φ(x_j) = K(x_i, x_j) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers How to obtain nonlinear classifiers? II No need to know what φ(.) is, if we use the kernel trick (φ can be infinite dimensional) just replace inner products with the kernel function Kernel versions usually solved in the dual, but it was shown it is possible to solve in the primal as well [Chapelle, 2007] We will not focus on kernel versions of these classifiers in this talk! linear classifiers are as good as or better than kernel versions for large scale learning [Fan et al., 2008] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multi-class classification Using multiple binary classifiers and combining them (multiple machine): one-vs-all, one-vs-one, error correcting output codes (ECOC). Direct multi-class classification (single machine) (explanation next slide). A paper compared these approaches and found that one-vs-one gave the best results (used in libsvm) [Hsu and Lin, 2002]. For structured classification, we need direct multi-class classification since it is almost impossible to enumerate all possible output labels (topic of future lectures). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Linear multi-class classification Given training data {(x_i, y_i) : i ∈ [N]}, where x_i ∈ ℝ^d and y_i ∈ Y = [M]; x_i = [x_{i,1}, x_{i,2}, . . . , x_{i,d}]^T are feature vectors, y_i are class identities. Find a set of weight vectors w̃_y ∈ ℝ^{d+1} such that for test data x, we obtain a score for class y as w̃_y^T x̃ = w_y^T x + b_y. Classification is done by ŷ = arg max_y w̃_y^T x̃. Each w̃_y represents a hyperplane in ℝ^{d+1}. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
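A tiny sketch of multi-class scoring and prediction with the M augmented weight vectors stacked as rows; the weights here are random placeholders standing in for trained values:

```python
import numpy as np

M, d = 3, 2
rng = np.random.default_rng(0)
W_tilde = rng.normal(size=(M, d + 1))      # one augmented weight vector w̃_y per class (placeholder)

def predict(x, W_tilde):
    x_tilde = np.append(x, 1.0)            # augmented feature vector x̃
    scores = W_tilde @ x_tilde             # score for class y is w̃_y^T x̃
    return int(np.argmax(scores)), scores  # ŷ = arg max_y w̃_y^T x̃

y_hat, scores = predict(np.array([0.5, -1.0]), W_tilde)
print("scores:", scores, " predicted class:", y_hat)
```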

Multi-class classification Geometry of linear multi-class discriminants [Figure: decision regions R_blue, R_red and R_green induced by the linear multi-class discriminants] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification I Given class-conditional probability distributions p(x|y), obtain class posteriors as: p(y|x) = p(x|y)p(y) / Σ_{y'} p(x|y')p(y'). If the class-conditional probability distributions are from the exponential family, p(x|y; θ_y) = h(x) exp{θ_y^T t(x) − A(θ_y)}, where θ are the parameters (or transformed parameters) of the pdf, t(x) is the sufficient statistics vector, A(θ) is the log-partition function and h(x) is a base measure independent of the parameters. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification II Then, the conditional likelihood has the form: p(y|x) = exp{w̃_y^T φ(x)} / Σ_{y'} exp{w̃_{y'}^T φ(x)}, where w̃_y = [θ_y^T, −A(θ_y) + log(p(y))]^T and φ(x) = [t(x)^T, 1]^T. This form of the conditional likelihood is called the normalized exponential or softmax function. Hence, if we map the features x using the sufficient statistics and add a constant feature, the conditional likelihood will have a log-linear form. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification III For example, for the Gaussian setup with means µ_y and equal covariances (known Σ) (multi-class FLD), we get the following: θ_y = µ_y, t(x) = x, w_y = Σ^{-1} µ_y, b_y = −(1/2) µ_y^T Σ^{-1} µ_y + log(p(y)) (6) Exercise: Perform the same analysis when both µ_y and Σ_y are different for each class. What are the sufficient statistics and parameters then? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multi-class (multinomial) logistic regression I Assuming p(y|x) = exp{w̃_y^T φ(x)} / Σ_{y'} exp{w̃_{y'}^T φ(x)}, we can maximize the conditional log-likelihood of the training data: log p(y_1, . . . , y_N | x_1, . . . , x_N) = Σ_{i=1}^N log p(y_i|x_i). This yields the following negative log-likelihood (NLL) objective function to minimize: Σ_{i=1}^N [ −w̃_{y_i}^T φ(x_i) + log Σ_{y'} exp{w̃_{y'}^T φ(x_i)} ]. This objective can be minimized using gradient-based techniques. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
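A sketch of the NLL and its gradient in a form any gradient-based optimizer can use; the layout of Phi, y and W is an assumption stated in the comments:

```python
import numpy as np

def softmax_nll_and_grad(W, Phi, y):
    """NLL and gradient for multinomial LR.
    Phi: N x p matrix whose rows are phi(x_i); y: class indices in {0,...,M-1};
    W: M x p matrix whose rows are the w̃_y."""
    scores = Phi @ W.T                                   # N x M class scores
    scores -= scores.max(axis=1, keepdims=True)          # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1))           # log-partition per sample
    nll = np.sum(log_Z - scores[np.arange(len(y)), y])   # sum_i [log Z_i - w̃_{y_i}^T phi(x_i)]
    P = np.exp(scores - log_Z[:, None])                  # posteriors p(y'|x_i)
    P[np.arange(len(y)), y] -= 1.0                       # P - one-hot(y)
    grad = P.T @ Phi                                     # M x p gradient of the NLL
    return nll, grad

# a gradient step would then be: W -= step_size * grad
```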

Multi-class classification Multiclass SVM I Multiclass SVM was formulated in [Crammer and Singer, 2001] as follows. First define w = [w_1^T, w_2^T, . . . , w_M^T]^T as the concatenation of the weight vectors: min (1/2)||w||^2 + C Σ_{i=1}^N ξ_i subject to: w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ 1 − ξ_i, ∀i, y' ≠ y_i; ξ_i ≥ 0 (7) Let ȳ_i = arg max_{y' ≠ y_i} w̃_{y'}^T x̃_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM II The first set of inequalities can also be written as: w̃_{y_i}^T x̃_i − w̃_{ȳ_i}^T x̃_i ≥ 1 − ξ_i, ∀i. The problem can be formulated as an unconstrained minimization problem as follows: min_w (1/2)||w||^2 + C Σ_{i=1}^N (1 − w̃_{y_i}^T x̃_i + max_{y' ≠ y_i} w̃_{y'}^T x̃_i)_+ In multi-class classification, some class pairs may be less costly to confuse, so we can define a label-loss function ∆(y, y') which gives a cost to replacing y with y' during classification; this can be incorporated in the SVM constraints formulation [Tsochantaridis et al., 2005]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM III “Margin rescaling” yields the set of constraints w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ ∆(y_i, y') − ξ_i, ∀i, y' ≠ y_i, whereas “slack rescaling” yields the following set of constraints: w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ 1 − ξ_i/∆(y_i, y'), ∀i, y' ≠ y_i. There is a cutting-plane algorithm in [Tsochantaridis et al., 2005] which solves a series of constrained optimization problems to solve these formulations. Furthermore, [Joachims et al., 2009] introduced 1-slack constraints; the 1-slack formulation with margin rescaling uses the constraints (1/N) Σ_{i=1}^N (w̃_{y_i}^T x̃_i − w̃_{y'_i}^T x̃_i) ≥ (1/N) Σ_{i=1}^N ∆(y_i, y'_i) − ξ, ∀(y'_i)_{i=1}^N ∈ Y^N Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM IV 1-slack formulation with slack rescaling is also provided in the same paper There are efficient cutting-plane algorithms in [Joachims et al., 2009] for solving 1-slack problems in the dual space These solvers are provided in the svm-struct package Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class I Remembering RER: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) + Ω(θ). For multi-class, the parameters are w̃ = [w̃_1^T, . . . , w̃_M^T]^T and the loss functions for each type are as follows: multi-class FLD: L(f_w̃(x_i), y_i) = Σ_{y=1}^M (w̃_y^T x̃_i − t(y, y_i))^2, where t(y, y_i) are appropriate targets for class y for least squares regression when the true class is y_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class II [Ye, 2007] shows that using t(y, y_i) = √(N/N_y) − √(N_y/N) when y = y_i and t(y, y_i) = −√(N_{y_i}/N) otherwise is equivalent to linear discriminant analysis; FLD is equivalent to using a least squares loss function. multi-class LR: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + log Σ_{y'} exp{w̃_{y'}^T x̃_i}. multi-class SVM: L(f_w̃(x_i), y_i) = (1 − w̃_{y_i}^T x̃_i + max_{y' ≠ y_i} w̃_{y'}^T x̃_i)_+ Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class III Re-writing the SVM loss function is possible as follows: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + max_{y'} (w̃_{y'}^T x̃_i + 1 − δ_{y',y_i}). This form is equivalent to the one before since it is guaranteed to be nonnegative. We can generalize using the label-loss ∆(y, y') and the “margin rescaling” introduced before to get L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + max_{y'} (w̃_{y'}^T x̃_i + ∆(y_i, y')). We can replace the max with a softmax (log-sum-exp) to get an “approximation”, SVM-softmax: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + log Σ_{y'} exp{w̃_{y'}^T x̃_i + ∆(y_i, y')} Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
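A small sketch computing the margin-rescaled max loss and its SVM-softmax approximation for a single sample; the 0/1 label-loss matrix and the toy weights below are assumptions used only for illustration:

```python
import numpy as np

def svm_margin_rescaled(W_tilde, x_tilde, y_i, Delta):
    """max_{y'} (w̃_{y'}^T x̃ + Delta(y_i, y')) - w̃_{y_i}^T x̃ for one sample."""
    scores = W_tilde @ x_tilde
    return np.max(scores + Delta[y_i]) - scores[y_i]

def svm_softmax(W_tilde, x_tilde, y_i, Delta):
    """log sum_{y'} exp(w̃_{y'}^T x̃ + Delta(y_i, y')) - w̃_{y_i}^T x̃ (softmax approximation)."""
    scores = W_tilde @ x_tilde
    return np.log(np.exp(scores + Delta[y_i]).sum()) - scores[y_i]

M = 3
Delta01 = 1.0 - np.eye(M)                 # 0/1 label loss: Delta(y, y') = 1 if y != y', else 0
W_tilde = np.array([[ 1.0, 0.0,  0.2],    # toy stacked augmented weight vectors
                    [ 0.0, 1.0, -0.1],
                    [-0.5, 0.5,  0.0]])
x_tilde = np.array([0.3, -0.4, 1.0])      # toy augmented feature vector
print(svm_margin_rescaled(W_tilde, x_tilde, 1, Delta01),
      svm_softmax(W_tilde, x_tilde, 1, Delta01))
```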

Multi-class classification Regularized empirical risk for multi-class IV Comparing the LR loss function with the SVM-softmax one, we see that they are similar except for the addition of the label-loss function to provide additional margin for confusable classes in the SVM-softmax It is possible to get the RER expression for “slack rescaling” and 1-slack versions of margin and slack rescaling as well For other possible loss functions, for example see [LeCun et al., 2006] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Optimization algorithms I Multi-class FLD has closed form solution Multi-class LR can be solved using Newton’s method in the primal yielding an iterative re-weighted least squares algorithm Multi-class SVM can be directly solved from the dual problem [Crammer and Singer, 2001, Fan et al., 2008] especially if number of training samples N is small Cutting-plane algorithms are attractive alternatives [Tsochantaridis et al., 2005, Joachims et al., 2009] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Optimization algorithms II Multi-class SVM-softmax can be solved using Newton’s method in the primal If Newton algorithm’s Hessian is too big to be computed efficiently, L-BFGS can be used Stochastic gradient algorithms are fast and applicable, but need to be careful with convergence and choosing step sizes [Bottou, 2004] The optimal choice of the algorithm depends on the values of feature dimension d, number of classes M and number of training samples N Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Log-linear models A general log-linear conditional model is given by: p(y|x; w) = (1/Z(x, w)) exp{Σ_j w_j F_j(x, y)}, where Z(x, w) = Σ_{y'} exp{Σ_j w_j F_j(x, y')} is called the partition function. Note that multi-class LR is a log-linear model where we define a concatenated weight vector w = [w̃_1^T, . . . , w̃_M^T]^T and where the feature vector F(x, y) = x̃ ⊗ e_y, where e_y is the unit vector with a one in position y and zeros elsewhere, and ⊗ denotes the Kronecker product. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
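A sketch of the Kronecker-product joint feature map and the resulting log-linear posterior; the component ordering induced by np.kron (and hence the layout of w) is an implementation detail noted in the comments:

```python
import numpy as np

def F(x_tilde, y, M):
    """Joint feature vector F(x, y) = x̃ ⊗ e_y of length M*(d+1)."""
    e_y = np.zeros(M)
    e_y[y] = 1.0
    return np.kron(x_tilde, e_y)

def log_linear_posterior(w, x_tilde, M):
    """p(y | x; w) = exp{w^T F(x, y)} / Z(x, w)."""
    scores = np.array([w @ F(x_tilde, y, M) for y in range(M)])
    scores -= scores.max()                 # shift for numerical stability
    Z = np.exp(scores).sum()               # partition function Z(x, w)
    return np.exp(scores) / Z

M, d = 3, 2
rng = np.random.default_rng(0)
w = rng.normal(size=M * (d + 1))           # plays the role of the concatenated w̃_y (up to ordering)
x_tilde = np.array([0.5, -1.0, 1.0])       # augmented feature vector
print(log_linear_posterior(w, x_tilde, M))
```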

Hidden Markov Models Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
