
Generative Models: Gaussian Discriminative Analysis and Naïve Bayes

Author: Sami Abu-El-Haija (samihaija@umich.edu)
October 10, 2013

In this document, we briefly review the concept of generative models and the derivations of GDA and Naïve Bayes.

1 Generative vs Discriminative

Broadly, there are two classes of machine learning models: generative models and discriminative models. Discriminative models aim to find a "good separator" between the classes. Generative models assume that the data was generated from some probability density, and fitting a generative model means estimating that density from the training data.

1.1 Recap: Discriminative Model

Recall Logistic Regression, which is a discriminative model. In Logistic Regression, one learns the parameters $w$ by Maximum Likelihood Estimation (MLE), maximizing the likelihood of the conditional probability:

$$L(w) = \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)}; w)$$

The above likelihood can be maximized by optimization methods such as Gradient Ascent or Newton's method. The learned $w^* = \arg\max_w L(w)$ [1] is then used for classification. The Logistic classifier takes a test example (an unseen example) and computes $p(t = 1 \mid x; w) = \sigma(w^T x)$ and $p(t = 0 \mid x; w) = 1 - p(t = 1 \mid x; w)$, then assigns the example to the class with the larger probability. Please read the Logistic Regression handout if you need a refresher.

Let's dig deeper into what happens during classification. The argument of $\sigma(\cdot)$ is the inner product $w^T x$. Since $\sigma(w^T x) > 0.5$ exactly when $w^T x > 0$, the set of points with $w^T x = 0$ acts as a linear separator. It can be plotted as a straight line if the features are 2-dimensional ($M = 2$), as a plane if $M = 3$, and as a hyperplane if $M > 3$. The Logistic classifier classifies a test example according to the side of the hyperplane on which it lies.

1.2 Generative Models

In contrast, generative models, like GDA, fit probability distributions to the input feature vectors.
Their likelihood is the joint distribution [2]:

$$L(\text{parameters}) = \prod_{i=1}^{N} p(x^{(i)}, t^{(i)}; \text{parameters})$$

[1] It is common to denote with a superscript $*$ the value returned by an arg max.
[2] As taught later in EECS445, some generative algorithms are used to learn the probability of the data in a non-classification setting, i.e., optimizing for $\prod_i p(x^{(i)})$, whereas generative models used for classification generally learn $\prod_i p(x^{(i)}, t^{(i)})$.
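To make the logistic classification rule of Section 1.1 concrete, here is a minimal Python sketch. The weight vector `w` and the two test points are hypothetical values chosen for illustration (they are not learned from data), and the function names are our own.

```python
import math

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_classify(w, x):
    """Return (predicted class, p(t=1 | x; w)) for weights w and features x."""
    a = sum(wj * xj for wj, xj in zip(w, x))  # inner product w^T x
    p1 = sigmoid(a)                           # p(t=1 | x; w)
    p0 = 1.0 - p1                             # p(t=0 | x; w)
    return (1 if p1 > p0 else 0), p1

# Hypothetical weights for M = 2 features (illustration only).
w = [2.0, -1.0]
label_a, p_a = logistic_classify(w, [1.0, 0.5])  # w^T x = 1.5, positive side
label_b, p_b = logistic_classify(w, [0.0, 3.0])  # w^T x = -3.0, negative side
```

The predicted class changes exactly when the sign of $w^T x$ changes, which is the "side of the hyperplane" reading of the classifier.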

Such models can generate examples that look realistic with respect to the training data. Because generative models fit probability distributions to the data (rather than learning a separating hyperplane, as discriminative models do), one can draw samples from the fitted distributions, much as one can sample a number from a Gaussian distribution with known parameters. Discriminative models cannot sample or generate realistic examples, since there is no plausible way of generating a realistic example from a separating hyperplane (a.k.a. decision boundary).

2 Gaussian Discriminative Analysis (GDA)

GDA factors the joint distribution with the product rule (the factorization that underlies Bayes' rule):

$$p(x, t) = p(x \mid t)\, p(t)$$

When restricted to a binary classification task, GDA models $p(t)$ as a Bernoulli distribution with parameter $\phi$. Note: this $\phi$ has nothing to do with the feature mapping function $\phi(\cdot)$. We apologize for mixing notations. Throughout this document, we use $x$ as a feature vector. GDA models:

$$p(t = 1) = \phi$$
$$p(t = 0) = 1 - p(t = 1) = 1 - \phi$$

Or, combining the two cases into one:

$$p(t) = \phi^{t} (1 - \phi)^{1 - t}$$

Further, GDA models $p(x \mid t = c)$ as a Gaussian distribution with parameters $\mu_c$ and $\Sigma$, known as the "mean vector for class $c$" and the "covariance matrix":

$$p(x \mid t = c) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c)\right)$$

In the binary classification case:

$$p(x \mid t = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$$
$$p(x \mid t = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$

Note that every class has its own mean vector but all classes share a single covariance matrix. Here $\mu_c \in \mathbb{R}^{M}$ and $\Sigma \in \mathbb{R}^{M \times M}$, where $M$ is the number of features (the dimension of $x$). In the binary case, which we restrict ourselves to in this document, $c \in \{0, 1\}$.
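The claim that a generative model can produce new examples can be sketched directly from the definitions above: first draw $t$ from the Bernoulli prior, then draw $x$ from the Gaussian of the chosen class. The following is a minimal NumPy sketch; the function name `sample_gda` and all parameter values are hypothetical, picked only for illustration.

```python
import numpy as np

def sample_gda(n, phi, mu0, mu1, Sigma, seed=0):
    """Draw n pairs (x, t) from the GDA joint p(x, t) = p(t) p(x | t)."""
    rng = np.random.default_rng(seed)
    t = (rng.random(n) < phi).astype(int)        # t ~ Bernoulli(phi)
    means = np.where(t[:, None] == 1, mu1, mu0)  # row i holds mu_{t_i}
    L = np.linalg.cholesky(Sigma)                # Sigma = L L^T
    z = rng.standard_normal((n, len(mu0)))       # z ~ N(0, I)
    x = means + z @ L.T                          # x ~ N(mu_t, Sigma)
    return x, t

# Hypothetical parameters for M = 2 features (illustration only).
phi = 0.3
mu0 = np.array([0.0, 0.0])
mu1 = np.array([3.0, 3.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x, t = sample_gda(1000, phi, mu0, mu1, Sigma)
```

Both classes are drawn with the same `Sigma`, mirroring the shared-covariance assumption; only the mean depends on the sampled label.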
We use the same exponent trick as for $p(t)$ to combine the probabilities conditioned on $t = 0$ and $t = 1$:

$$p(x \mid t) = p(x \mid t = 1)^{t}\, p(x \mid t = 0)^{1 - t}$$

Finally, a GDA classifier takes an example $x$ and assigns the class that maximizes the joint distribution:

$$\arg\max_{t \in \{0,1\}} p(x, t; \phi, \mu_0, \mu_1, \Sigma) = \arg\max_{t \in \{0,1\}} p(t; \phi)\, p(x \mid t; \mu_0, \mu_1, \Sigma)$$

2.1 Maximum Likelihood Estimates (MLE)

The parameters of the model are $\phi, \mu_0, \mu_1, \Sigma$. Learning a GDA model corresponds to finding the parameters that maximize the likelihood, i.e., solving for:

$$\arg\max_{\{\phi, \mu_0, \mu_1, \Sigma\}} L(\phi, \mu_0, \mu_1, \Sigma)$$

where

$$L(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^{N} p(x^{(i)}, t^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^{N} p(x^{(i)} \mid t^{(i)}; \mu_0, \mu_1, \Sigma)\, p(t^{(i)}; \phi)$$

This is equivalent to solving for the parameters that maximize the log-likelihood:

$$\arg\max_{\{\phi, \mu_0, \mu_1, \Sigma\}} \ell(\phi, \mu_0, \mu_1, \Sigma), \qquad \ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} \mid t^{(i)}; \mu_0, \mu_1, \Sigma)\, p(t^{(i)}; \phi)$$

For each of the parameters $\phi, \mu_0, \mu_1, \Sigma$, one can take the derivative of the log-likelihood with respect to that parameter, set it to zero, and solve for the parameter. This gives the maximum likelihood estimates:

$$\phi = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 0\}}$$

$$\mu_1 = \frac{\sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 1\}}$$

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{t^{(i)}})(x^{(i)} - \mu_{t^{(i)}})^T$$

Note: the estimate for $\Sigma$ depends on the computed estimates for $\mu_0$ and $\mu_1$, so the parameters can be learned in the listed order.

2.2 Meaning of the MLE parameters

The expressions above are very interpretable. In particular:

- The maximum likelihood $\phi \in \mathbb{R}$ is equal to the fraction of training examples with class 1.
- $\mu_0 \in \mathbb{R}^{M}$ is equal to the mean (centroid) of the feature vectors that have the label 0.
- $\mu_1 \in \mathbb{R}^{M}$ is equal to the mean (centroid) of the feature vectors that have the label 1.
- $\Sigma \in \mathbb{R}^{M \times M}$ is the covariance of the features about their class means, averaged across the training data. If features $k$ and $l$ are correlated in the training data, $\Sigma_{kl} = \Sigma_{lk}$ will be large in magnitude (large and positive if they are positively correlated, large and negative if they are negatively correlated). If they are uncorrelated, $\Sigma_{kl} = \Sigma_{lk} = 0$.

2.3 Matrix Identities for deriving MLE of GDA

In order to derive the MLE for $\Sigma$, it is necessary to use matrix identities that have not been taught (yet) in the course. All the formulas below are given (and proven) in Stanford's Linear Algebra Review Notes for Machine Learning.

- Recall the trace operator $\operatorname{tr}(A)$, which takes a square matrix and returns the sum of the elements along the diagonal.

- If $A$ is a real number (i.e., $A \in \mathbb{R}^{1 \times 1}$, or simply $A \in \mathbb{R}$), then $A = \operatorname{tr}(A)$. In other words, the trace of a real number is the number itself. More relevant to us, if some matrix product $x^T A x$ produces a real number, then we can apply the trace operator, since:

$$\operatorname{tr}(x^T A x) = x^T A x$$

- In addition, one is allowed to "rotate" the arguments of a trace:

$$\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)$$

- Further, the matrix derivative of a trace is:

$$\nabla_A \operatorname{tr}(AB) = B^T$$

where $B$ is any matrix for which $AB$ is square, and the gradient is arranged to have the same shape as $A$.

- The derivative of the determinant of some matrix $A$ is given by:

$$\nabla_A |A| = |A|\, A^{-T}$$

- Chaining the derivative of the log and the derivative of the determinant (please verify this yourself using the chain rule), for a symmetric matrix $A$ such as a covariance matrix:

$$\nabla_A \log |A| = A^{-1}$$

2.4 GDA example

Will be added to the document later.

2.5 GDA or Logistic Regression?

Will be added to the document later.

3 Naïve Bayes (NB)

Naïve Bayes is also a generative model. It is commonly used for text classification, to classify documents into one of $K$ classes. Let's start by summarizing the model parameters:

- $\phi_1, \ldots, \phi_K$, one for each class. These are called the priors. $\phi_k$ equals "the probability of any document belonging to class $k$". The term prior in machine learning refers to "prior knowledge" [4]. In general, $\sum_{k=1}^{K} \phi_k = 1$. Therefore, it is common to estimate $K - 1$ priors $\phi_1, \ldots, \phi_{K-1}$, since we can define $\phi_K = 1 - \sum_{k=1}^{K-1} \phi_k$.
- $\mu_{kj} \in \mathbb{R}$ for $j \le M$, $k \in [1, K]$. This means that every word $j$ and class $k$ have a measure $\mu_{kj}$, which is equal to the probability of word $j$ appearing in class $k$.

From this point onwards in the document, we restrict ourselves to binary classification, where the document class is in $\{0, 1\}$. Therefore, we use $p(t = 1) = \phi$ to denote the prior of the positive class, and the prior for the negative class is implicitly $p(t = 0) = 1 - \phi$. The derivations are easily extensible to $K$ classes.

[4] Most times, priors are measured from the training set. In some cases where one has prior knowledge that spam occurs 20% of the time, it is possible to assign $\phi_{\text{spam}} = 0.2$ and $\phi_{\text{notspam}} = 0.8$.
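Before turning to Naïve Bayes classification, the closed-form GDA estimates of Section 2.1 and the arg max decision rule of Section 2 can be made concrete. Section 2.4 defers the author's worked example, so the following is only a minimal NumPy sketch under our own assumptions: the function names and the six-point toy dataset are hypothetical. The classifier compares log joint probabilities; the Gaussian normalizer is dropped because the shared $\Sigma$ makes it identical for both classes.

```python
import numpy as np

def fit_gda(X, t):
    """Closed-form MLE for binary GDA. X: (N, M) features, t: (N,) labels in {0, 1}."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=int)
    phi = t.mean()                                      # fraction of class-1 examples
    mu0 = X[t == 0].mean(axis=0)                        # centroid of class 0
    mu1 = X[t == 1].mean(axis=0)                        # centroid of class 1
    centered = X - np.where(t[:, None] == 1, mu1, mu0)  # x_i - mu_{t_i}
    Sigma = centered.T @ centered / len(X)              # shared covariance
    return phi, mu0, mu1, Sigma

def predict_gda(x, phi, mu0, mu1, Sigma):
    """Return arg max over t of log p(t) + log p(x | t), up to a shared constant."""
    def log_joint(mu, prior):
        d = x - mu
        return np.log(prior) - 0.5 * d @ np.linalg.solve(Sigma, d)
    return int(log_joint(mu1, phi) > log_joint(mu0, 1.0 - phi))

# Hypothetical toy data: three points per class, M = 2 features.
X = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]], dtype=float)
t = np.array([0, 0, 0, 1, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, t)
near_class0 = predict_gda(np.array([0.5, 0.5]), phi, mu0, mu1, Sigma)
near_class1 = predict_gda(np.array([4.5, 4.5]), phi, mu0, mu1, Sigma)
```

The estimates are computed in the listed order ($\phi$, then the means, then $\Sigma$), since $\Sigma$ needs both class means.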

3.1 Classification in Naïve Bayes

Given an example document ($x$ = "life is good"), the Naïve Bayes classifier computes $p(x, t = 0)$ and $p(x, t = 1)$, then assigns the example to the class with the larger probability. As in GDA, Naïve Bayes does not compute this expression directly, but decomposes it into two factors using the product rule (the factorization that underlies Bayes' rule):

$$\begin{aligned}
p(x, t = 1) &= p(t = 1)\, p(x \mid t = 1) \\
&= p(t = 1)\, p(\text{word}_1 = \text{``life''}, \text{word}_2 = \text{``is''}, \text{word}_3 = \text{``good''} \mid t = 1) && \text{(decomposing the document into words)} \\
&= p(t = 1)\, p(\text{word}_1 = \text{``life''} \mid t = 1)\, p(\text{word}_2 = \text{``is''} \mid t = 1)\, p(\text{word}_3 = \text{``good''} \mid t = 1) && \text{(the Naïve assumption)} \\
&= p(t = 1) \prod_{\text{word}_j \in x} p(\text{word}_j \mid t = 1) \\
&= \phi \prod_{\text{word}_j \in x} \mu_{1j}
\end{aligned}$$

Going from the second line to the third line is the core of the Naïve Bayes algorithm. It is a very strong assumption (known as the Naïve assumption, which gives the model its name). In real language, the order and co-occurrence of consecutive words matter. The Naïve Bayes model instead assumes that, given the class, the words of a document are all independent of one another. Nonetheless, this oversimplified model of language does reasonably well on some tasks. Similarly,

$$p(x, t = 0) = (1 - \phi) \prod_{\text{word}_j \in x} \mu_{0j}$$

3.2 Event Models for Naïve Bayes

There are two types of Naïve Bayes models. Each has a different interpretation and a different way of modeling and computing $\mu_{kj}$.

3.2.1 Multinomial Event Model

Here, documents are represented as integer vectors of size $M$, where $M$ is the size of the vocabulary of the English language. Concretely, $x \in \mathbb{Z}^{M}$. The $j$-th entry of the document vector ($x_j$) is the number of times that the $j$-th word appears in the document.

Here, the maximum likelihood estimate for the parameter $\mu_{kj}$ is:

$$\mu_{kj} = \frac{\text{the number of times word } j \text{ appears in class } k}{\text{total number of words in class } k}$$

Note: this is the model in the lecture slides and also in the homework.

3.2.2 Multivariate Bernoulli Event Model

Here, documents are represented as binary vectors of size $M$, where $M$ is the size of the vocabulary of the English language. Concretely, $x \in \{0, 1\}^{M}$.
The $j$-th entry of the document vector ($x_j$) is set to 1 if the $j$-th word appears in the document.

Here, the maximum likelihood estimate for the parameter $\mu_{kj}$ is the fraction of documents from class $k$ that contain word $j$:

$$\mu_{kj} = \frac{\text{the number of documents in class } k \text{ containing word } j}{\text{number of documents in class } k}$$
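The multinomial event model of Section 3.2.1 and the decision rule of Section 3.1 can be sketched in a few lines of Python. This is an illustrative sketch, not the course's reference implementation: the function names and the four-document toy corpus are hypothetical, documents are given as lists of tokens rather than count vectors, and the joint probability is compared in log space (our choice, to avoid underflow when many small probabilities are multiplied). The sketch also assumes both classes occur in the training set.

```python
import math
from collections import Counter

def fit_multinomial_nb(docs, labels):
    """MLE for the multinomial event model. docs: token lists, labels: 0/1 per document."""
    phi = sum(labels) / len(labels)       # prior p(t=1)
    counts = {0: Counter(), 1: Counter()}
    for words, t in zip(docs, labels):
        counts[t].update(words)           # occurrences of each word in class t
    mu = {}
    for k in (0, 1):
        total = sum(counts[k].values())   # total number of words in class k
        mu[k] = {w: c / total for w, c in counts[k].items()}
    return phi, mu

def log_joint(words, k, phi, mu):
    """log p(x, t=k) = log p(t=k) + sum over words of log mu_kj (-inf for unseen words)."""
    prior = phi if k == 1 else 1.0 - phi
    total = math.log(prior)
    for w in words:
        p = mu[k].get(w, 0.0)
        if p == 0.0:
            return float("-inf")          # a zero factor zeroes the whole product
        total += math.log(p)
    return total

def classify(words, phi, mu):
    """Pick the class with the larger joint probability."""
    return 1 if log_joint(words, 1, phi, mu) > log_joint(words, 0, phi, mu) else 0

# Hypothetical toy corpus: class 0 = ordinary text, class 1 = spam.
docs = [["life", "is", "good"], ["life", "is", "life"],
        ["buy", "pills", "now"], ["buy", "now"]]
labels = [0, 0, 1, 1]
phi, mu = fit_multinomial_nb(docs, labels)
pred_a = classify(["life", "is", "good"], phi, mu)
pred_b = classify(["buy", "now"], phi, mu)
```

In this toy corpus, "life" accounts for 3 of the 6 words in class 0, so `mu[0]["life"]` is 0.5.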

3.3 Maximum Likelihood Estimates

The likelihood of the generative Naïve Bayes classifier is:

$$\begin{aligned}
L(\phi, \mu_{01}, \ldots, \mu_{0M}, \mu_{11}, \ldots, \mu_{1M}) &= \prod_{i=1}^{N} p(x^{(i)}, t^{(i)}; \phi, \mu_{01}, \ldots, \mu_{0M}, \mu_{11}, \ldots, \mu_{1M}) \\
&= \prod_{i=1}^{N} p(t^{(i)}; \phi)\, p(x^{(i)} \mid t^{(i)}; \mu_{01}, \ldots, \mu_{0M}, \mu_{11}, \ldots, \mu_{1M}) \\
&= \prod_{i=1}^{N} \left[\phi \prod_{\text{word}_j \in x^{(i)}} \mu_{1j}\right]^{t^{(i)}} \left[(1 - \phi) \prod_{\text{word}_j \in x^{(i)}} \mu_{0j}\right]^{1 - t^{(i)}}
\end{aligned}$$

Taking the derivative of the log-likelihood with respect to $\phi$, setting it to zero, and solving for $\phi$ yields:

$$\phi = \frac{\sum_{i=1}^{N} \mathbb{I}\{t^{(i)} = 1\}}{N}$$

Similarly, taking the derivative with respect to each $\mu_{kj}$ yields the estimates described in the two previous subsections (depending on which event model is being used).

3.4 Laplace Smoothing

In the current construction of Naïve Bayes, if some word (say "homework") was never observed in the spam class during training, then even an obvious spam email that contains the word "homework" will be assigned zero probability of being spam. This is because $\mu_{\text{spam},\text{homework}} = 0$, and a product that contains a zero factor is zero. Laplace smoothing removes this problem by adding a "fake document" to both classes that contains every word exactly once. Adding this fake document is equivalent to modifying the MLE estimates $\mu_{kj}$ (of the multinomial event model) to:

$$\mu_{kj} = \frac{1 + \text{the number of times word } j \text{ appears in class } k}{M + \text{total number of words in class } k}$$

This removes the bogus conclusion that, because the word "homework" never appeared in the spam class of the training set, it can never appear in spam.

4 Conclusion

In this document, we briefly introduced the difference between discriminative and generative models. We also discussed the formulation of GDA and Naïve Bayes.
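As a closing illustration of the Laplace smoothing rule in Section 3.4, the following minimal Python sketch computes the smoothed multinomial estimates for one class. The function name, the four-word vocabulary, and the two spam documents are hypothetical, chosen so that "homework" never appears in the spam class.

```python
from collections import Counter

def smoothed_mu(class_docs, vocabulary):
    """Laplace-smoothed estimates: (1 + count of word j) / (M + total words in class)."""
    counts = Counter(w for words in class_docs for w in words)
    total = sum(counts.values())   # total number of words in the class
    M = len(vocabulary)            # vocabulary size
    return {w: (1 + counts[w]) / (M + total) for w in vocabulary}

# Hypothetical vocabulary (M = 4) and spam training documents.
vocabulary = ["buy", "pills", "now", "homework"]
spam_docs = [["buy", "pills", "now"], ["buy", "now"]]
mu_spam = smoothed_mu(spam_docs, vocabulary)
```

With 5 observed words and $M = 4$, the unseen word "homework" receives probability $1/9$ instead of zero, and the four estimates still sum to one.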
