ICMLA 2010 Tutorial on Sequence Labeling: Generative and Discriminative Approaches


ICMLA 2010 Tutorial on Sequence Labeling: Generative and Discriminative Approaches Hidden Markov Models, Conditional Random Fields and Structured SVMs Hakan Erdogan Sabanci University Istanbul, Turkey December 2010 ICMLA 2010, Bethesda, MD Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Sequence labeling problem definition I Given a sequence of observations/feature vectors, determine an appropriate label/state for each observation. We will assume the observations can be discrete or continuous, scalar or vector. We assume the labels/states are discrete and from a finite set. Try to reduce errors by considering the relations between the observations and states (observation-state) and the relations between neighboring states (state-state). [Figure: a time axis t with an example state sequence y_t, e.g. 5 5 5 5 5 2 2 2 2, showing a run of Label 5 followed by a run of Label 2] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Sequence labeling applications Speech recognition Part-of-speech tagging Shallow parsing Handwriting recognition Protein secondary structure prediction Video analysis Facial expression dynamic modeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Urns and balls example Assume there are two urns with black and white balls [Rabiner, 1989]. One urn has more black than white (90% vs 10%) and vice versa. Someone pulls out one ball at a time, shows it to us without revealing which urn he uses, and puts it back into the urn. He is more likely to use the same urn (90% chance) once he starts using one. We are looking only at the sequence of balls and recording them. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
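A minimal sketch of the urns-and-balls process as a two-state HMM sampler, assuming only the numbers stated above (90%/10% ball mixes, 90% chance of staying with the same urn); the function and variable names are illustrative:

```python
import numpy as np

# Two-state HMM for the urns-and-balls example.
# Urn 0 holds 90% black / 10% white balls, urn 1 the reverse;
# whichever urn is in use is kept with probability 0.9.
rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],   # state-transition matrix P(urn_t | urn_{t-1})
              [0.1, 0.9]])
B = np.array([[0.9, 0.1],   # emission matrix P(ball | urn); columns: 0 = black, 1 = white
              [0.1, 0.9]])

def sample_sequence(T=20):
    """Draw a sequence of hidden urns and observed ball colors."""
    urns, balls = [], []
    urn = rng.integers(2)                  # start from a random urn
    for _ in range(T):
        ball = rng.choice(2, p=B[urn])     # draw a ball from the current urn
        urns.append(int(urn))
        balls.append(int(ball))
        urn = rng.choice(2, p=A[urn])      # possibly switch urns
    return urns, balls

urns, balls = sample_sequence()
print("hidden urns :", urns)
print("observed    :", balls)
```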

Sequence Labeling Questions about the urns and balls example Questions of interest: (1) Can we predict which urn is used at a given time? (2) What is the probability of observing the sequence of balls shown to us? (3) Can we estimate/learn the ratio of balls in each urn by looking at a long sequence of balls if we did not know the ratios beforehand? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Jason Eisner’s ice-cream example Try to guess whether the weather was hot or cold by observing only how many ice-creams (0, 1, 2 or 3) Jason ate each day in a sequence of 30 days. Two states and observations with 4 distinct values (discrete observations). Question: Can we determine if a day was hot or cold given the sequence of ice-creams consumed by Jason? An example Excel sheet is available online (it illustrates the forward-backward algorithm). The example is also adopted in [Jurafsky and Martin, 2008]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
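The forward-backward computation that the spreadsheet illustrates can be sketched in a few lines of numpy; the transition and emission probabilities below are illustrative placeholders, not the values used in Eisner's spreadsheet:

```python
import numpy as np

# Forward-backward sketch for a hot/cold weather HMM with
# observations = number of ice-creams eaten (0, 1, 2 or 3).
states = ["hot", "cold"]
pi = np.array([0.5, 0.5])                    # initial state probabilities (assumed)
A  = np.array([[0.8, 0.2],                   # P(state_t | state_{t-1}) (assumed)
               [0.2, 0.8]])
B  = np.array([[0.05, 0.15, 0.40, 0.40],     # P(#ice-creams | hot) (assumed)
               [0.40, 0.40, 0.15, 0.05]])    # P(#ice-creams | cold) (assumed)

def forward_backward(obs):
    T, S = len(obs), len(pi)
    alpha, beta = np.zeros((T, S)), np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # P(state_t | whole sequence)

obs = [2, 3, 3, 1, 0, 0, 2]                  # ice-creams eaten on each day
for day, p in enumerate(forward_backward(obs)):
    print(f"day {day}: P(hot)={p[0]:.2f}  P(cold)={p[1]:.2f}")
```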

Sequence Labeling Human activity labeling in an exercise video Assume we are given an exercise video of a single person and we are interested in labeling the actions of the person as either “standing”, “squatting” or “lying down” (assume for now that no other action is present). We track the subject and have a bounding box around her/him at each frame of the video. We consider as features x_t = [h_t, w_t]^T, where h_t is the height of the bounding box and w_t is the width of the bounding box. So, we have continuous (real) observations and three labels. Question: Given the height and width of the bounding boxes in all frames, can we determine the action type in each frame? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Sequence Labeling Approach, notation and variables We will first analyze binary and multi-class classification with linear models. Multi-class classification will be the basis for understanding the sequence labeling problem. Then, we will introduce HMM, CRF, and structured SVM approaches for sequence labeling. Notation: x is an observed feature vector, x_t a feature vector at sequence position t, x_{1:T} a sequence of feature vectors; y is a discrete label (or state), y ∈ Y, where Y = {−1, +1} for binary classification and Y = [M] = {1, 2, . . . , M} for multi-class classification; y_t is the label/state at sequence position t, y_{1:T} is a sequence of labels/states; w and w̃ are parameter vectors, w_j is the jth component; F(x_{1:T}, y_{1:T}) is a feature vector for CRF and structured SVM. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Linear binary classification I Given training data {(x_i, y_i) : i ∈ [N]}, where x_i ∈ ℝ^d, y_i ∈ Y = {−1, +1} and [N] = {1, . . . , N}; x_i = [x_{i,1}, x_{i,2}, . . . , x_{i,d}]^T are feature vectors and y_i are class identities. Find a weight vector w̃ ∈ ℝ^{d+1} such that for test data x, we obtain a score w̃^T x̃ = w^T x + b, where: x̃ = [x^T, 1]^T is the augmented feature vector, w̃ = [w^T, b]^T is the augmented weight vector and b is called the bias term. w̃ represents a hyperplane in ℝ^{d+1}; it is the normal vector to the hyperplane {x̃ : w̃^T x̃ = 0} that passes through the origin. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
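A tiny numpy sketch of the augmented-vector bookkeeping; the numbers are arbitrary placeholders standing in for a trained w and b:

```python
import numpy as np

# Scoring with the augmented feature and weight vectors.
x = np.array([1.5, -0.3, 2.0])     # a d-dimensional feature vector
w = np.array([0.4, -1.2, 0.7])     # weight vector (placeholder values)
b = -0.5                           # bias term (placeholder value)

x_tilde = np.append(x, 1.0)        # x̃ = [x^T, 1]^T
w_tilde = np.append(w, b)          # w̃ = [w^T, b]^T

score = w_tilde @ x_tilde          # w̃^T x̃ = w^T x + b
assert np.isclose(score, w @ x + b)
print("score:", score)
```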

Binary Classifiers Linear binary classification II [Figure: geometry of the separating hyperplane w^T x + b = 0, showing the normal vector w, a point x at signed distance (w^T x + b)/||w|| from the hyperplane, and the hyperplane at distance −b/||w|| from the origin] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Linear binary classification The score can be used to classify a new sample x into one of two classes by thresholding: ŷ = +1 if w̃^T x̃ ≥ T, and ŷ = −1 if w̃^T x̃ < T. We can obtain an ROC curve (or DET curve) by changing the threshold T. By default, we may assume T = 0 since the bias term is included in the model. One can also obtain posterior probabilities p(y|x) from the score w̃^T x̃. The problem: how to obtain the “best” weight vector w̃? We consider three different classifiers: Fisher’s linear discriminant, logistic regression and support vector machines. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Fisher’s linear discriminant (FLD) I Assume each class’ data are multi-variate Gaussian distributed with a common covariance matrix (homoscedastic): p(x|y) = N(x; µ_y, Σ), where Σ is the common covariance matrix, the µ_y are the class means, and N(x; µ_y, Σ) = ((2π)^d |Σ|)^{-1/2} exp{−(1/2)(x − µ_y)^T Σ^{-1} (x − µ_y)}. Using Bayes’ theorem, we can show that p(y=1|x) = p(x|y=1)p(y=1) / Σ_{y'∈Y} p(x|y')p(y') = σ(w^T x + b), where σ(t) = (1 + exp(−t))^{-1} Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Fisher’s linear discriminant (FLD) II is the logistic sigmoid function, and w = Σ^{-1}(µ_1 − µ_{−1}), b = −(1/2) µ_1^T Σ^{-1} µ_1 + (1/2) µ_{−1}^T Σ^{-1} µ_{−1} + log(p(y=1)/p(y=−1)) (1) Fisher’s linear discriminant yields a linear boundary for classification. Fisher’s linear discriminant is a generative model for x conditioned on y. It seems a waste to find two class means (of size 2d) and a common covariance matrix (of size d(d+1)/2) when in the end what matters for classification is the weight vector w and the bias b (of size d+1 only). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
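A minimal sketch of fitting FLD via the closed form in equation (1), assuming a pooled-covariance estimate as in the comments; the function name and the toy data are illustrative:

```python
import numpy as np

def fld_fit(X, y, prior_pos=None):
    """FLD weights from equation (1); X is N x d, y holds labels in {-1, +1}."""
    X_pos, X_neg = X[y == 1], X[y == -1]
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # pooled (common) covariance estimate over both classes
    Sigma = (np.cov(X_pos.T) * (len(X_pos) - 1)
             + np.cov(X_neg.T) * (len(X_neg) - 1)) / (len(X) - 2)
    if prior_pos is None:
        prior_pos = len(X_pos) / len(X)        # class prior p(y = 1) from counts
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu_pos - mu_neg)
    b = (-0.5 * mu_pos @ Sigma_inv @ mu_pos
         + 0.5 * mu_neg @ Sigma_inv @ mu_neg
         + np.log(prior_pos / (1 - prior_pos)))
    return w, b

# toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = fld_fit(X, y)
print("w =", w, " b =", b)
```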

Binary Classifiers FLD [Figure: two scatter plots (axes: Feature 1, Feature 2) with the FLD decision boundary; left: modeling assumption valid, right: modeling assumption invalid] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Logistic regression I Discriminative linear model, as opposed to FLD which is a generative model. Assume p(y=1|x) = σ(w̃^T x̃), where σ(.) is the logistic sigmoid [Bishop, 2006]. Find parameters w̃ by maximizing the conditional log-likelihood (CLL) of the class identities y, p(y|x) (instead of p(x|y) as in FLD, a generative model), using the training data. Define π_i(w̃) = σ(w̃^T x̃_i); the estimate of w̃ is then arg max_w̃ { Σ_{i: y_i=1} log(π_i(w̃)) + Σ_{i: y_i=−1} log(1 − π_i(w̃)) } Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Logistic regression II The gradient of the negative log-likelihood with respect to the parameters is [Bishop, 2006]: ∇L(w̃) = Σ_{i: y_i=1} x̃_i (π_i(w̃) − 1) + Σ_{i: y_i=−1} x̃_i π_i(w̃). There is an iterative reweighted least squares (IRLS) algorithm to solve for w̃, given in [Bishop, 2006]. We only estimate d+1 parameters directly from the data. There is no need to assume a distribution p(x|y) since the x’s are given to us in a training scenario. So, we only model p(y|x), which becomes “discriminative modeling”. The assumed parametric form of p(y|x) comes from the generative FLD model though! Question: Can I get back discriminatively trained class-conditional Gaussian distributions from the logistic regression result w̃? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
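A minimal sketch that minimizes the negative CLL by plain gradient descent using the gradient above (IRLS would converge faster in practice); the toy data, step size and iteration count are arbitrary assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logreg_fit(X_tilde, y, lr=0.1, n_iter=500):
    """Binary logistic regression; X_tilde is N x (d+1) with a trailing 1-column,
    y holds labels in {-1, +1}."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X_tilde @ w)                          # pi_i = sigma(w̃^T x̃_i)
        grad = (X_tilde[y == 1].T @ (pi[y == 1] - 1)       # gradient of the NLL
                + X_tilde[y == -1].T @ pi[y == -1])
        w -= lr * grad / len(y)                            # gradient-descent step
    return w

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
X_tilde = np.hstack([X, np.ones((100, 1))])
print("learned w̃ =", logreg_fit(X_tilde, y))
```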

Binary Classifiers Logistic regression III Answer is yes, choosing µy and Σ consistent with w̃ obtained through logistic regression using equation 2 will give discriminatively trained parameters for Gaussians (note that they will not be ML trained parameters) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers FLD vs. LR [Figure: two scatter plots (axes: Feature 1, Feature 2); left: FLD when the modeling assumption is invalid, right: LR with the same data] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines I A discriminative classifier that tries to maximize the soft margin between classes to increase generalizability: min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^N ξ_i subject to: y_i(w^T x_i + b) ≥ 1 − ξ_i (2), ξ_i ≥ 0 (3) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines II The constrained optimization is simply equivalent to minimizing the following objective function wrt w and b without constraints (called the primal formulation): min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^N (1 − y_i(w^T x_i + b))_+ where (x)_+ = max(x, 0) and C is a fixed parameter to be determined. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
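A sketch that minimizes this unconstrained primal objective by simple (sub)gradient descent, only to make the objective concrete; a QP or SMO solver would be used in practice, and all numbers below are toy values:

```python
import numpy as np

def svm_primal_fit(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Subgradient descent on 1/2||w||^2 + C * sum_i (1 - y_i(w^T x_i + b))_+ ."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                                  # samples with nonzero hinge loss
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = svm_primal_fit(X, y)
print("w =", w, " b =", b)
```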

Binary Classifiers Support vector machines III Usually, the dual formulation is used to solve the SVM problem since it appears to be easier, results in sparsity due to support vectors and enables using kernel functions for nonlinear separability: max_α Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j subject to Σ_{i=1}^N α_i y_i = 0 (4), 0 ≤ α_i ≤ C (5), where the α_i are the dual variables (Lagrange multipliers, one for each constraint, that is for each training sample). The optimal weight vector can be found by w = Σ_{i=1}^N α_i y_i x_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines IV [Figure: SVM margin geometry with hyperplanes w^T x + b = +1, 0, −1 and margin width 2/||w||; points with ξ_i = 0, α_i = 0 (irrelevant), ξ_i = 0, 0 < α_i < C (non-boundary SV), and ξ_i > 0, α_i = C (boundary SV)] There is sparsity in the α_i and most α_i turn out to be zero. The x_i that correspond to nonzero α_i are called support vectors. If α_i = C, then the x_i are boundary support vectors. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Support vector machines V The support vectors are the only training samples that matter in determining the optimal weights (hence the separating hyperplane). To find the optimal b, take any nonzero α_i and use the equation α_i (y_i(w^T x_i + b) − 1) = 0 (which comes from the KKT conditions), or average the values obtained from all nonzero α_i ’s. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Regularized empirical risk (RER) minimization I Unifying framework for different linear classifiers. Consider a predictor function f_θ(x): ℝ^d → ℝ^m that can be used to predict output labels y from features x, where m is the number of prediction values (discriminants). For the linear binary classification problem, m = 1, θ = w̃ and f_w̃(x) = w̃^T x̃. Define a “loss function” L: ℝ^m × Y → ℝ that measures the cost of a prediction. We define the expected risk as follows: R(θ) = ∫ L(f_θ(x), y) p(x, y) dx dy Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Regularized empirical risk (RER) minimization II Since the joint distribution p(x, y) is usually unknown, we can find an estimate of the risk from the training data, called the “empirical risk”, which needs to be minimized wrt the parameters θ: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i). This can be interpreted as the training set error rate. Usually we add a penalty term Ω(θ) for the parameters, which penalizes complex (or large) predictor functions, to arrive at what is called the “regularized empirical risk”: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) + Ω(θ) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
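A short sketch of the regularized empirical risk for a linear predictor, with an assumed L2 penalty λ||w̃||^2 standing in for Ω(θ) and the loss passed in as a callable:

```python
import numpy as np

def regularized_empirical_risk(w_tilde, X_tilde, y, loss, lam=0.01):
    """(1/N) * sum_i loss(w̃^T x̃_i, y_i) + lam * ||w̃||^2 (L2 penalty assumed)."""
    scores = X_tilde @ w_tilde
    return np.mean([loss(s, yi) for s, yi in zip(scores, y)]) + lam * (w_tilde @ w_tilde)

# usage with the hinge loss of a later slide
hinge = lambda s, y: max(0.0, 1.0 - y * s)
# risk = regularized_empirical_risk(w_tilde, X_tilde, y, hinge)
```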

Binary Classifiers Regularized empirical risk (RER) minimization III Note that “structured risk” minimization is similar where the regularization function is replaced by the Vapnik-Chervonenkis (VC) confidence of the predictor function (which is also a measure of complexity of the predictor) All previous methods (FLD, LR and SVM) can be seen as variants of regularized empirical risk minimization by utilization of a different loss function as we elaborate in the following slides Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Loss functions I RER for linear binary classification is as follows: R̂(w̃) = (1/N) Σ_{i=1}^N L(w̃^T x̃_i, y_i) + Ω(w̃). Using different loss functions yields different methods for linear binary classification. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers LS loss L(w̃^T x̃, y) = (w̃^T x̃ − t_y)^2 Here the t_y are regression targets for each class in least squares regression. Using the LS loss with no regularization is equivalent to FLD when t_y = y N/N_y, where N_y is the number of samples in class y [Bishop, 2006]. LS-SVM uses t_y = y and a quadratic regularizer [Suykens and Vandewalle, 1999]. Without regularization, LS-SVM is similar to FLD. Regularized LDA [Friedman, 1989] is equivalent to using a regularization function in the RER framework. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers BNLL loss L(w̃^T x̃, y) = log{1 + exp(−y w̃^T x̃)} Using the binomial negative log-likelihood (BNLL) loss function with no regularization is equivalent to performing a logistic regression classification. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Hinge loss L(w̃^T x̃, y) = (1 − y w̃^T x̃)_+ Using the hinge loss function and a regularization function Ω(w̃) = λ w̃^T D w̃ (where D is a diagonal matrix with all ones on the diagonal except for the last entry, which is zero) is equivalent to performing an SVM classification. Note that Ω(w̃) = λ w̃^T w̃ is also used, which is slightly different (as in liblinear [Fan et al., 2008]). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Other loss functions There are many other loss functions defined in the literature [LeCun et al., 2006] Huber-hinge loss is one which smooths the edge of the hinge loss so that we end up with a differentiable loss function Squared-hinge loss is the square of the hinge loss which is also differentiable (called L2 -SVM) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Loss function plots [Figure: illustration of the loss value L(z) as a function of the margin z = y w^T x for the absolute, hinge, squared hinge, Huber hinge, logistic and LS losses] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
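The plotted losses can be written directly as functions of the margin z = y w^T x so their shapes can be compared numerically (or plotted); the Huber-hinge smoothing below is one common variant and its parameter h is an assumption:

```python
import numpy as np

# Loss functions of the margin z = y * w^T x (labels in {-1, +1}).
def absolute(z):      return np.abs(1.0 - z)
def hinge(z):         return np.maximum(0.0, 1.0 - z)
def squared_hinge(z): return np.maximum(0.0, 1.0 - z) ** 2
def huber_hinge(z, h=0.5):
    # quadratically smoothed hinge around z = 1 (one common variant, h assumed)
    out = np.zeros_like(z)
    out = np.where(z < 1 - h, 1.0 - z, out)
    out = np.where((z >= 1 - h) & (z < 1 + h), (1 + h - z) ** 2 / (4 * h), out)
    return out
def logistic(z):      return np.log1p(np.exp(-z))
def ls(z):            return (1.0 - z) ** 2     # LS loss with targets t_y = y, in terms of z

z = np.linspace(-0.5, 2.0, 6)
for name, f in [("absolute", absolute), ("hinge", hinge), ("sq-hinge", squared_hinge),
                ("huber-hinge", huber_hinge), ("logistic", logistic), ("ls", ls)]:
    print(f"{name:12s}", np.round(f(z), 3))
```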

Binary Classifiers Regularization functions I Although FLD and LR originally do not use regularization, it has been shown to be beneficial in all cases. Typically an L2-norm regularization Ω(w̃) = λ||w̃||^2 is used; λ is a hyper-parameter that needs to be determined (next slide). Regularization arguably helps in all classifiers because it improves test accuracy by penalizing the complexity of the classifier, avoids overtraining/overfitting, and avoids numerical problems in the implementation. The L1-norm is used when sparse parameter vectors are desired; for example, the L1-regularized L1-SVM is considered in [Mangasarian, 2006]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers Estimating the hyperparameters I The hyperparameter λ can be found by a grid search on validation data, that is, by evaluating the performance of the trained classifier on the validation data. Multiple-fold cross-validation on the training data can also be performed for this purpose. The Bayesian “evidence framework” may be used with some losses to find the hyperparameters; this requires re-interpreting the empirical risk optimization as a maximum a posteriori probability (MAP) estimation problem [Hoffmann, 2007] - Bayesian LDA. Perturbation analysis may be used [Zheng et al., 2009]. Other ad-hoc methods exist as well. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers RER optimization methods LS and logistic losses are differentiable, so primal algorithms such as Newton’s method work very well LS loss has closed form solution For logistic loss, Newton’s method results in iterative re-weighted least squares (IRLS) formulation For hinge loss, one needs to go to the dual QP problem and solve it in the dual For squared-hinge or huber-hinge losses, one can solve it in the primal domain as well libsvm and liblinear are good libraries for solving RER formulations Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers How to obtain nonlinear classifiers? I Sometimes, class data are not separable by a line! [Figure: a 2-D dataset (axes: Feature 1, Feature 2) that is not linearly separable] Replace x with φ(x), which is a nonlinear mapping into a higher dimensional space. If φ(x) is known explicitly, you may use it instead of x to solve the problem. Use the kernel trick to replace inner products with the kernel function: φ(x_i)^T φ(x_j) = K(x_i, x_j) Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Binary Classifiers How to obtain nonlinear classifiers? II No need to know what φ(.) is, if we use the kernel trick (φ can be infinite dimensional) just replace inner products with the kernel function Kernel versions usually solved in the dual, but it was shown it is possible to solve in the primal as well [Chapelle, 2007] We will not focus on kernel versions of these classifiers in this talk! linear classifiers are as good as or better than kernel versions for large scale learning [Fan et al., 2008] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multi-class classification Using multiple binary classifiers and combining them (multiple machine): one-vs-all, one-vs-one, error correcting output codes (ECOC). Direct multi-class classification (single machine) (explanation next slide). A paper compared these approaches and found that one-vs-one gave the best results (used in libsvm) [Hsu and Lin, 2002]. For structured classification, we need direct multi-class classification since it is almost impossible to enumerate all possible output labels (topic of future lectures). Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Linear multi-class classification Given training data {(x_i, y_i) : i ∈ [N]}, where x_i ∈ ℝ^d and y_i ∈ Y = [M]; x_i = [x_{i,1}, x_{i,2}, . . . , x_{i,d}]^T are feature vectors, y_i are class identities. Find a set of weight vectors w̃_y ∈ ℝ^{d+1} such that for test data x, we obtain a score for class y as w̃_y^T x̃ = w_y^T x + b_y. Classification is done by ŷ = arg max_y w̃_y^T x̃. Each w̃_y represents a hyperplane in ℝ^{d+1}. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
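A tiny sketch of multi-class scoring and prediction with the M augmented weight vectors stacked as rows; the weights here are random placeholders standing in for trained values:

```python
import numpy as np

M, d = 3, 2
rng = np.random.default_rng(0)
W_tilde = rng.normal(size=(M, d + 1))      # one augmented weight vector w̃_y per class (placeholder)

def predict(x, W_tilde):
    x_tilde = np.append(x, 1.0)            # augmented feature vector x̃
    scores = W_tilde @ x_tilde             # score for class y is w̃_y^T x̃
    return int(np.argmax(scores)), scores  # ŷ = arg max_y w̃_y^T x̃

y_hat, scores = predict(np.array([0.5, -1.0]), W_tilde)
print("scores:", scores, " predicted class:", y_hat)
```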

Multi-class classification Geometry of linear multi-class discriminants [Figure: decision regions R_blue, R_red and R_green induced by the linear multi-class discriminants] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification I Given class-conditional probability distributions p(x|y), obtain class posteriors as: p(y|x) = p(x|y)p(y) / Σ_{y'} p(x|y')p(y'). If the class-conditional probability distributions are from the exponential family, p(x|y; θ_y) = h(x) exp{θ_y^T t(x) − A(θ_y)}, where θ are the parameters (or transformed parameters) of the pdf, t(x) is the sufficient statistics vector, A(θ) is the log-partition function and h(x) is a base measure independent of the parameters. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification II Then, the conditional likelihood has the form: p(y|x) = exp{w̃_y^T φ(x)} / Σ_{y'} exp{w̃_{y'}^T φ(x)}, where w̃_y = [θ_y^T, −A(θ_y) + log(p(y))]^T and φ(x) = [t(x)^T, 1]^T. This form of the conditional likelihood is called the normalized exponential or softmax function. Hence, if we map the features x using the sufficient statistics and add a constant feature, the conditional likelihood will have a log-linear form. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Generative multi-class classification III For example, for the Gaussian setup with means µ_y and equal covariances (known Σ) (multi-class FLD), we get the following: θ_y = µ_y, t(x) = x, w_y = Σ^{-1} µ_y, b_y = −(1/2) µ_y^T Σ^{-1} µ_y + log(p(y)) (6) Exercise: Perform the same analysis when both µ_y and Σ_y are different for each class. What are the sufficient statistics and parameters then? Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multi-class (multinomial) logistic regression I Assuming p(y|x) = exp{w̃_y^T φ(x)} / Σ_{y'} exp{w̃_{y'}^T φ(x)}, we can maximize the conditional log-likelihood of the training data: log p(y_1, . . . , y_N | x_1, . . . , x_N) = Σ_{i=1}^N log p(y_i|x_i). This yields the following negative log-likelihood (NLL) objective function to minimize: Σ_{i=1}^N [ −w̃_{y_i}^T φ(x_i) + log Σ_{y'} exp{w̃_{y'}^T φ(x_i)} ]. This objective can be minimized using gradient-based techniques. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
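A sketch of the NLL and its gradient in a form any gradient-based optimizer can use; the layout of Phi, y and W is an assumption stated in the comments:

```python
import numpy as np

def softmax_nll_and_grad(W, Phi, y):
    """NLL and gradient for multinomial LR.
    Phi: N x p matrix whose rows are phi(x_i); y: class indices in {0,...,M-1};
    W: M x p matrix whose rows are the w̃_y."""
    scores = Phi @ W.T                                   # N x M class scores
    scores -= scores.max(axis=1, keepdims=True)          # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1))           # log-partition per sample
    nll = np.sum(log_Z - scores[np.arange(len(y)), y])   # sum_i [log Z_i - w̃_{y_i}^T phi(x_i)]
    P = np.exp(scores - log_Z[:, None])                  # posteriors p(y'|x_i)
    P[np.arange(len(y)), y] -= 1.0                       # P - one-hot(y)
    grad = P.T @ Phi                                     # M x p gradient of the NLL
    return nll, grad

# a gradient step would then be: W -= step_size * grad
```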

Multi-class classification Multiclass SVM I Multiclass SVM was formulated in [Crammer and Singer, 2001] as follows. First define w = [w_1^T, w_2^T, . . . , w_M^T]^T as the concatenation of the weight vectors: min (1/2)||w||^2 + C Σ_{i=1}^N ξ_i subject to: w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ 1 − ξ_i, ∀i, y' ≠ y_i; ξ_i ≥ 0 (7) Let ȳ_i = arg max_{y' ≠ y_i} w̃_{y'}^T x̃_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM II The first set of inequalities can also be written as: w̃_{y_i}^T x̃_i − w̃_{ȳ_i}^T x̃_i ≥ 1 − ξ_i, ∀i. The problem can be formulated as an unconstrained minimization problem as follows: min_w (1/2)||w||^2 + C Σ_{i=1}^N (1 − w̃_{y_i}^T x̃_i + max_{y' ≠ y_i} w̃_{y'}^T x̃_i)_+ In multi-class classification, some class pairs may be less costly to confuse, so we can define a label-loss function ∆(y, y') which gives a cost to replacing y with y' during classification; this can be incorporated in the SVM constraints formulation [Tsochantaridis et al., 2005]. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM III “Margin rescaling” yields the set of constraints w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ ∆(y_i, y') − ξ_i, ∀i, y' ≠ y_i, whereas “slack rescaling” yields the following set of constraints: w̃_{y_i}^T x̃_i − w̃_{y'}^T x̃_i ≥ 1 − ξ_i/∆(y_i, y'), ∀i, y' ≠ y_i. There is a cutting-plane algorithm in [Tsochantaridis et al., 2005] which solves a series of constrained optimization problems to solve these formulations. Furthermore, [Joachims et al., 2009] introduced 1-slack constraints; the 1-slack formulation with margin rescaling uses the constraints (1/N) Σ_{i=1}^N (w̃_{y_i}^T x̃_i − w̃_{y'_i}^T x̃_i) ≥ (1/N) Σ_{i=1}^N ∆(y_i, y'_i) − ξ, ∀(y'_i)_{i=1}^N ∈ Y^N Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Multiclass SVM IV 1-slack formulation with slack rescaling is also provided in the same paper There are efficient cutting-plane algorithms in [Joachims et al., 2009] for solving 1-slack problems in the dual space These solvers are provided in the svm-struct package Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class I Remembering RER: R̂(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) + Ω(θ). For multi-class, the parameters are w̃ = [w̃_1^T, . . . , w̃_M^T]^T and the loss functions for each type are as follows: multi-class FLD: L(f_w̃(x_i), y_i) = Σ_{y=1}^M (w̃_y^T x̃_i − t(y, y_i))^2, where t(y, y_i) are appropriate targets for class y for least squares regression when the true class is y_i Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class II [Ye, 2007] shows that using t(y, y_i) = √(N/N_y) − √(N_y/N) when y = y_i and t(y, y_i) = −√(N_{y_i}/N) otherwise is equivalent to linear discriminant analysis; FLD is equivalent to using a least squares loss function. multi-class LR: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + log Σ_{y'} exp{w̃_{y'}^T x̃_i}. multi-class SVM: L(f_w̃(x_i), y_i) = (1 − w̃_{y_i}^T x̃_i + max_{y' ≠ y_i} w̃_{y'}^T x̃_i)_+ Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Regularized empirical risk for multi-class III Re-writing the SVM loss function is possible as follows: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + max_{y'} (w̃_{y'}^T x̃_i + 1 − δ_{y',y_i}). This form is equivalent to the one before since it is guaranteed to be nonnegative. We can generalize using the label-loss ∆(y, y') and the “margin rescaling” introduced before to get L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + max_{y'} (w̃_{y'}^T x̃_i + ∆(y_i, y')). We can replace the max with a softmax (log-sum-exp) to get an “approximation”, SVM-softmax: L(f_w̃(x_i), y_i) = −w̃_{y_i}^T x̃_i + log Σ_{y'} exp{w̃_{y'}^T x̃_i + ∆(y_i, y')} Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
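A small sketch computing the margin-rescaled max loss and its SVM-softmax approximation for a single sample; the 0/1 label-loss matrix and the toy weights below are assumptions used only for illustration:

```python
import numpy as np

def svm_margin_rescaled(W_tilde, x_tilde, y_i, Delta):
    """max_{y'} (w̃_{y'}^T x̃ + Delta(y_i, y')) - w̃_{y_i}^T x̃ for one sample."""
    scores = W_tilde @ x_tilde
    return np.max(scores + Delta[y_i]) - scores[y_i]

def svm_softmax(W_tilde, x_tilde, y_i, Delta):
    """log sum_{y'} exp(w̃_{y'}^T x̃ + Delta(y_i, y')) - w̃_{y_i}^T x̃ (softmax approximation)."""
    scores = W_tilde @ x_tilde
    return np.log(np.exp(scores + Delta[y_i]).sum()) - scores[y_i]

M = 3
Delta01 = 1.0 - np.eye(M)                 # 0/1 label loss: Delta(y, y') = 1 if y != y', else 0
W_tilde = np.array([[ 1.0, 0.0,  0.2],    # toy stacked augmented weight vectors
                    [ 0.0, 1.0, -0.1],
                    [-0.5, 0.5,  0.0]])
x_tilde = np.array([0.3, -0.4, 1.0])      # toy augmented feature vector
print(svm_margin_rescaled(W_tilde, x_tilde, 1, Delta01),
      svm_softmax(W_tilde, x_tilde, 1, Delta01))
```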

Multi-class classification Regularized empirical risk for multi-class IV Comparing the LR loss function with the SVM-softmax one, we see that they are similar except for the addition of the label-loss function to provide additional margin for confusable classes in the SVM-softmax It is possible to get the RER expression for “slack rescaling” and 1-slack versions of margin and slack rescaling as well For other possible loss functions, for example see [LeCun et al., 2006] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Optimization algorithms I Multi-class FLD has closed form solution Multi-class LR can be solved using Newton’s method in the primal yielding an iterative re-weighted least squares algorithm Multi-class SVM can be directly solved from the dual problem [Crammer and Singer, 2001, Fan et al., 2008] especially if number of training samples N is small Cutting-plane algorithms are attractive alternatives [Tsochantaridis et al., 2005, Joachims et al., 2009] Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Optimization algorithms II Multi-class SVM-softmax can be solved using Newton’s method in the primal If Newton algorithm’s Hessian is too big to be computed efficiently, L-BFGS can be used Stochastic gradient algorithms are fast and applicable, but need to be careful with convergence and choosing step sizes [Bottou, 2004] The optimal choice of the algorithm depends on the values of feature dimension d, number of classes M and number of training samples N Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010

Multi-class classification Log-linear models A general log-linear conditional model is given by: p(y|x; w) = (1/Z(x, w)) exp{Σ_j w_j F_j(x, y)}, where Z(x, w) = Σ_{y'} exp{Σ_j w_j F_j(x, y')} is called the partition function. Note that multi-class LR is a log-linear model where we define a concatenated weight vector w = [w̃_1^T, . . . , w̃_M^T]^T and where the feature vector F(x, y) = x̃ ⊗ e_y, where e_y is the unit vector with a one in position y and zeros elsewhere, and ⊗ denotes the Kronecker product. Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
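A sketch of the Kronecker-product joint feature map and the resulting log-linear posterior; the component ordering induced by np.kron (and hence the layout of w) is an implementation detail noted in the comments:

```python
import numpy as np

def F(x_tilde, y, M):
    """Joint feature vector F(x, y) = x̃ ⊗ e_y of length M*(d+1)."""
    e_y = np.zeros(M)
    e_y[y] = 1.0
    return np.kron(x_tilde, e_y)

def log_linear_posterior(w, x_tilde, M):
    """p(y | x; w) = exp{w^T F(x, y)} / Z(x, w)."""
    scores = np.array([w @ F(x_tilde, y, M) for y in range(M)])
    scores -= scores.max()                 # shift for numerical stability
    Z = np.exp(scores).sum()               # partition function Z(x, w)
    return np.exp(scores) / Z

M, d = 3, 2
rng = np.random.default_rng(0)
w = rng.normal(size=M * (d + 1))           # plays the role of the concatenated w̃_y (up to ordering)
x_tilde = np.array([0.5, -1.0, 1.0])       # augmented feature vector
print(log_linear_posterior(w, x_tilde, M))
```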

Hidden Markov Models Outline 1 Sequence Labeling 2 Binary Classifiers 3 Multi-class classification 4 Hidden Markov Models 5 Generative vs Discriminative Models 6 Conditional random fields 7 Training CRFs 8 Structured SVM for sequence labeling Hakan Erdogan, A tutorial on sequence labeling, ICMLA 2010, Bethesda MD, December 2010
