Probabilistic Graphical Models - University Of Toronto


Probabilistic Graphical Models
Raquel Urtasun and Tamir Hazan
TTI Chicago
May 23, 2011

Summary

Previously in class:
- Representation of directed and undirected networks
- Inference in these networks
  - Exact inference in trees via message passing
  - Inference via sampling
  - Inference via optimization
- Two tasks of inference: marginals and the MAP assignment

The rest of this course:
- Parameter learning
- Structure learning (if time)

Today we will refresh your memory about what learning is.

How to acquire a model?

Possible things to do:
- Use expert knowledge to determine the graph and the potentials.
- Use learning to determine the potentials, i.e., parameter learning.
- Use learning to determine the graph, i.e., structure learning.

Manual design is difficult and can take a long time for an expert.
We usually have access to a set of examples from the distribution we wish to model, e.g., a set of images segmented by a labeler.
We call this task of constructing a model from a set of instances model learning.

More rigorous definition

- Let's assume that the domain is governed by some underlying distribution P∗, which is induced by some network model M∗ = (K∗, θ∗).
- We are given a dataset D of M samples from P∗.
- The standard assumption is that the data instances are independent and identically distributed (IID).
- We are also given a family of models M, and our task is to learn some model M̂ in this family that defines a distribution P_M̂.
- We can learn the model parameters for a fixed structure, or both the structure and the model parameters.
- We might be interested in returning a single model, a set of hypotheses that are likely, a probability distribution over models, or even a confidence in the model we return.

Goal of learning

- The goal of learning is to return a model M̂ that precisely captures the distribution P∗ from which our data was sampled.
- This is in general not achievable because
  - of computational reasons, and
  - limited data only provides a rough approximation of the true underlying distribution.
- We need to select M̂ to construct the "best" approximation to M∗.
- What is "best"?

What is "best"?

This depends on what we want to do:
1. Density estimation: we are interested in marginals.
2. Specific prediction tasks: we are interested in conditional probabilities.
3. Structure or knowledge discovery: we are interested in the model itself.

1) Learning as density estimation

- We want to answer probabilistic inference queries.
- In this setting we can reformulate the learning problem as density estimation.
- We want to construct M̂ as "close" as possible to P∗.
- How do we evaluate "closeness"?
- Relative entropy is one possibility:

    D(P∗ ‖ P̂) = E_{ξ∼P∗}[ log( P∗(ξ) / P̂(ξ) ) ]

- D(P∗ ‖ P̂) ≥ 0, with equality iff the two distributions are the same.
- It measures the "compression loss" (in bits) of using P̂ instead of P∗.
- Problem: in general we do not know P∗.
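As a concrete illustration, here is a minimal sketch (not part of the slides) that computes the relative entropy between two discrete distributions given as probability vectors; the particular numbers are hypothetical.

```python
# Minimal sketch (not from the slides): relative entropy D(P* || P_hat) between
# two discrete distributions represented as probability vectors, in bits.
import numpy as np

def relative_entropy(p_true, p_hat):
    """D(P* || P_hat) = sum_x P*(x) * log2(P*(x) / P_hat(x))."""
    p_true, p_hat = np.asarray(p_true, float), np.asarray(p_hat, float)
    mask = p_true > 0                      # terms with P*(x) = 0 contribute nothing
    return float(np.sum(p_true[mask] * np.log2(p_true[mask] / p_hat[mask])))

p_star = [0.5, 0.25, 0.25]                 # hypothetical true distribution P*
p_hat  = [0.4, 0.4, 0.2]                   # hypothetical learned model P_hat
print(relative_entropy(p_star, p_hat))     # > 0; it is 0 exactly when the two match
print(relative_entropy(p_star, p_star))    # 0.0
```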

Expected log-likelihood

- We can simplify this metric for any two distributions over X:

    D(P∗ ‖ P̂) = E_{ξ∼P∗}[ log( P∗(ξ) / P̂(ξ) ) ] = −H_{P∗}(X) − E_{ξ∼P∗}[ log P̂(ξ) ]

- The first term does not depend on P̂.
- We can then maximize the expected log-likelihood

    E_{ξ∼P∗}[ log P̂(ξ) ]

- It assigns high probability to instances sampled from P∗, so as to reflect the true distribution.
- We can now compare models, but since we are not computing H_{P∗}(X), we don't know how close we are to the optimum.
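The decomposition can be checked numerically; the following small sketch (illustrative numbers, not from the slides) verifies that the relative entropy equals minus the entropy of P∗ minus the expected log-likelihood of P̂ under P∗.

```python
# Numeric check (illustration only) of
# D(P* || P_hat) = -H_{P*}(X) - E_{xi~P*}[log P_hat(xi)].
import numpy as np

p_star = np.array([0.5, 0.25, 0.25])       # hypothetical true distribution
p_hat  = np.array([0.4, 0.4, 0.2])         # hypothetical learned model

kl          = np.sum(p_star * np.log(p_star / p_hat))
entropy     = -np.sum(p_star * np.log(p_star))   # H_{P*}(X), in nats
expected_ll = np.sum(p_star * np.log(p_hat))     # E_{xi~P*}[log P_hat(xi)]

print(kl, -entropy - expected_ll)          # both print the same value
```

Since H_{P∗}(X) does not depend on P̂, maximizing the expected log-likelihood is equivalent to minimizing the relative entropy.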

Likelihood, Loss and Risk

- We are interested in the (log-)likelihood of the data given a model; the corresponding loss is the log-loss, −log P(D | M).
- This is an example of a loss function.
- A loss function loss(ξ, M) measures the loss that a model M makes on a particular instance ξ.
- When instances are sampled from some distribution P∗, our goal is to find the model that minimizes the expected loss, or risk:

    E_{ξ∼P∗}[ loss(ξ, M) ]

- P∗ is unknown, but we can approximate the expectation using the empirical average, i.e., the empirical risk:

    E_D[ loss(ξ, M) ] = (1/|D|) Σ_{ξ∈D} loss(ξ, M)

- This is intuitive in the case of the log-loss, where

    P(D | M) = ∏_{m=1}^{M} P(ξ_m | M)
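The empirical risk under the log-loss is just the average negative log-probability of the training instances; a small sketch with made-up numbers (not from the slides):

```python
# Sketch (hypothetical model and data): empirical risk under the log-loss.
import numpy as np

p_hat = {0: 0.5, 1: 0.3, 2: 0.2}           # hypothetical learned model over {0, 1, 2}

def log_loss(xi, model):
    """loss(xi, M) = -log P(xi | M)."""
    return -np.log(model[xi])

D = [0, 0, 1, 2, 0, 1]                     # hypothetical IID training instances

empirical_risk = np.mean([log_loss(xi, p_hat) for xi in D])
total_log_lik = sum(np.log(p_hat[xi]) for xi in D)   # log P(D | M) = sum_m log P(xi_m | M)
print(empirical_risk, -total_log_lik / len(D))       # identical by construction
```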

2) Specific Prediction Task

- We want to predict a set of variables Y given some others X, e.g., segmentation.
- We concentrate on predicting P(Y | X).
- A trained model should be able to produce P̂(Y | x) and the MAP assignment

    argmax_y P̂(y | x)

- An example of a loss metric is the classification error

    E_{(x,y)∼P∗}[ 1{ y ≠ argmax_{y'} P̂(y' | x) } ]

  which is the probability, over all (x, y) pairs sampled from P∗, that our classifier selects the wrong label.
- This metric is not well suited for situations with multiple labels. Why?

Better Metrics

- The Hamming loss counts the number of variables in which the MAP assignment differs from the ground-truth label.
- The conditional log-likelihood takes into account the confidence in the prediction:

    E_{(x,y)∼P∗}[ log P̂(y | x) ]

- Unlike density estimation, we do not have to predict the distribution over X.
- We negate this expression to get a loss, and compute an empirical estimate by taking the average with respect to D.
- This is a good choice if we know that we are only going to care about this task.
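To make the contrast concrete, the following sketch (hypothetical labels and probabilities, not from the slides) compares the 0/1 classification error, the Hamming loss, and the conditional log-likelihood for a prediction with several binary output variables.

```python
# Sketch comparing loss metrics for a structured output y with 5 binary variables.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])         # hypothetical ground-truth labels
y_map  = np.array([1, 0, 0, 1, 0])         # hypothetical MAP assignment

# 0/1 classification error: counts 1 unless *every* variable is predicted correctly,
# which is why it is poorly suited to outputs with many variables.
zero_one_error = float(not np.array_equal(y_true, y_map))

# Hamming loss: number of variables where the MAP assignment differs from the truth.
hamming_loss = int(np.sum(y_true != y_map))

# Conditional log-likelihood: rewards confidence in the correct joint assignment.
p_hat_y_given_x = 0.3                      # hypothetical model probability P_hat(y_true | x)
conditional_ll = np.log(p_hat_y_given_x)

print(zero_one_error, hamming_loss, conditional_ll)   # 1.0, 1, -1.20...
```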

3) Knowledge Discovery

- We hope that by looking at the learned model we can discover something about P∗:
  - What are the direct and indirect dependencies?
  - The nature of the dependencies, e.g., positive or negative correlation.
- We may want to learn the structure of the model.
- Simple statistical models (e.g., looking at correlations) can be used.
- But the learned network can have a direct causal interpretation and reveal finer structure, e.g., distinguish between direct and indirect dependencies.
- In this setting we care about discovering the correct model M∗, rather than a different model M̂ that induces a distribution similar to M∗.
- The metric is in terms of the differences between M∗ and M̂.

This is not always achievable

- The true model might not be identifiable, e.g., a Bayesian network with several I-equivalent structures.
  - In this case the best we can hope for is to discover an I-equivalent structure.
- The problem is worse when the amount of data is limited and the relationships are weak.
- When the number of variables is large relative to the amount of training data, pairs can appear strongly correlated just by chance.
- In knowledge discovery it is very important to assess the confidence in a prediction, taking into account the amount of data available and the number of hypotheses.

Learning as optimization

- We define a numerical criterion that we would like to optimize.
- Learning is generally treated as an optimization problem, where we have
  - Hypothesis space: a set of candidate models.
  - Objective function: a criterion expressing our preference over the models.
- We can formulate learning as finding a high-scoring model within our model class.
- Different approaches choose different hypothesis spaces and different objective functions.

Empirical Risk

- Choose a model M that optimizes the expectation of a particular loss:

    E_{ξ∼P∗}[ loss(ξ, M) ]

- We don't know P∗, so we use an empirical estimate by defining the empirical distribution

    P̂_D(A) = (1/M) Σ_m 1{ξ_m ∈ A}

- The probability of the event A is the fraction of training examples that satisfy A.
- P̂_D is a probability distribution.
- Let ξ_1, ξ_2, · · · be a sequence of IID samples from P∗(X), and let D_M = ⟨ξ_1, · · · , ξ_M⟩; then

    lim_{M→∞} P̂_{D_M}(A) = P∗(A)

- For a sufficiently large training set, P̂_D is close to P∗ with high probability.
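The convergence statement is easy to see in simulation; a sketch (not from the slides, with a hypothetical three-valued distribution):

```python
# Illustration (a sketch, not from the slides): the empirical probability of an
# event A converges to P*(A) as the number of IID samples M grows.
import numpy as np

rng = np.random.default_rng(0)
p_star = np.array([0.2, 0.5, 0.3])         # hypothetical true distribution over {0, 1, 2}

def empirical_prob(samples, event):
    """P_hat_D(A) = (1/M) * sum_m 1{xi_m in A}."""
    return float(np.mean([xi in event for xi in samples]))

event_A = {1, 2}                           # the event "xi equals 1 or 2"; P*(A) = 0.8
for M in (10, 100, 10_000):
    D_M = rng.choice(3, size=M, p=p_star)
    print(M, empirical_prob(D_M, event_A))
# The printed estimates approach P*(A) = 0.8 as M increases.
```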

Empirical Risk and Overfitting

- We can use P̂_D as a proxy for P∗.
- Unfortunately, a naive implementation will not work. Consider, e.g., the case of N random binary variables and M training examples, with N = 100 and M = 1000: the empirical distribution can place mass on at most 1000 of the 2^100 possible joint configurations.
- Empirical risk minimization tends to overfit the data.
- Problem when using the empirical risk as a surrogate for our true risk: generalization to unseen examples.
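A small sketch (not from the slides; N is shrunk to 20 so it runs instantly) makes the failure mode explicit: the empirical distribution assigns probability zero to almost every fresh sample.

```python
# Sketch (not from the slides): why the raw empirical distribution overfits. With
# N binary variables and M samples, almost every configuration is unseen, so
# P_hat_D assigns it probability 0 and the log-loss on fresh data is infinite.
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 1000                            # N kept small so the example runs instantly

train = [tuple(row) for row in rng.integers(0, 2, size=(M, N))]
test = tuple(rng.integers(0, 2, size=N))   # a fresh sample from the same distribution

p_hat = train.count(test) / M              # empirical probability of the test instance
print(p_hat)                               # almost surely 0.0, since 2**20 >> 1000
# The log-loss -log(p_hat) is then infinite, even though the empirical distribution
# achieves the best possible training loss.
```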

Bias-Variance trade-off

- If the hypothesis space is very limited, it might not be able to represent P∗, even with unlimited data.
  - This type of limitation is called bias, as the learning is limited in how closely it can approximate the target distribution.
- If we select a highly expressive hypothesis class, we might represent the data better.
  - When we have a small amount of data, multiple models can fit it well, or even better than the true model.
  - Moreover, small perturbations of D will result in very different estimates.
  - This limitation is called the variance.
- There is an inherent bias-variance trade-off when selecting the hypothesis class.
- The error is due to both things: bias and variance.

How to avoid overfitting?

- Hard constraints, by selecting a less expressive hypothesis class.
- Soft preference for simpler models: Occam's Razor.
- Augment the objective function with a regularization term:

    objective(ξ, M) = loss(ξ, M) + R(M)
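A toy sketch of the regularized objective (not from the slides; the Bernoulli model, the L2 penalty, and the data are all illustrative choices):

```python
# Hedged sketch (not from the slides): a regularized objective of the form
# objective = empirical loss + R(M), here with an illustrative L2 penalty on the
# single parameter of a Bernoulli model.
import numpy as np

def nll(theta, data):
    """Empirical log-loss of a Bernoulli(theta) model."""
    return -np.mean(data * np.log(theta) + (1 - data) * np.log(1 - theta))

def objective(theta, data, lam=0.1):
    """loss(D, M) + R(M), with R(M) = lam * theta**2 chosen purely for illustration."""
    return nll(theta, data) + lam * theta ** 2

data = np.array([1, 1, 1, 1, 0])           # hypothetical (tiny) training set
thetas = np.linspace(0.05, 0.95, 19)
best = thetas[np.argmin([objective(t, data) for t in thetas])]
print(best)                                # below the unregularized MLE of 0.8: the
                                           # penalty pulls the estimate away from the
                                           # pure empirical fit
```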

Evaluating Generalization Performance

Goodness of fit

- Cross-validation and held-out tests do not allow us to evaluate whether our learned model captures everything that we need in the distribution.
- In statistics, this is called goodness of fit.
- After learning the model parameters, we can evaluate whether the data behaves as if it were sampled from this distribution.
- Compare properties of the training data, f(D_train), with those of datasets of the same size generated from the model, f(D).
- There are many choices of f, e.g., the empirical log-loss E_D[loss(ξ, M)], whose expectation under the model is the model's entropy.
- Look at the tails to compute the significance.
  - This can be approximated using the variance of the log-loss as a function of M.
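One way to read this as a procedure is sketched below (not from the slides; the Bernoulli model, the chosen statistic, and the data are hypothetical): compute f on the training data, compute it on many datasets sampled from the learned model, and see how far into the tail the training value falls.

```python
# Sketch (not from the slides): a goodness-of-fit check for a Bernoulli model.
# Compare a statistic f on the training data with its distribution over datasets
# of the same size generated from the learned model.
import numpy as np

rng = np.random.default_rng(0)

def f(dataset, theta):
    """Statistic: empirical log-loss of a Bernoulli(theta) model on the dataset."""
    return -np.mean(dataset * np.log(theta) + (1 - dataset) * np.log(1 - theta))

train = rng.binomial(1, 0.4, size=200)     # hypothetical training data
theta_hat = train.mean()                   # learned parameter (MLE)

f_train = f(train, theta_hat)
f_model = np.array([f(rng.binomial(1, theta_hat, size=len(train)), theta_hat)
                    for _ in range(1000)]) # datasets sampled from the learned model

center = f_model.mean()
tail = np.mean(np.abs(f_model - center) >= np.abs(f_train - center))
print(tail)                                # a tiny value would flag a poor fit
```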

PAC bounds I

- We hope that a model that achieves low training loss also achieves low expected loss (risk).
- We cannot guarantee the quality of our learned model with certainty.
- This is because the data is sampled stochastically from P∗, and it might be an unlucky sample.
- The goal is to prove that the model is approximately correct: for most D, the learning procedure returns a model whose error is low.
- Assume that we take the relative entropy to the true distribution as our loss function.
- Let P_M be the distribution over datasets D of size M sampled IID from P∗.
- Assume that we have a learner L that, given D, returns M_{L(D)}.

PAC bounds II

- We want to prove results of the form: with M large enough,

    P_M({ D : D(P∗ ‖ P_{M_{L(D)}}) ≤ ε }) ≥ 1 − δ

  with ε > 0 the approximation parameter and δ our confidence parameter.
- For sufficiently large M, for most datasets D of size M sampled from P∗, the learning procedure applied to D will learn a close approximation to P∗.
- This bound can only be obtained if the hypothesis class can correctly represent P∗.
  - Such a setting is called consistent.
- In many cases P∗ is not included in the hypothesis class.
  - In this case, the best we can hope for is an error at most ε worse than the lowest error achievable within our hypothesis space.
  - The expected loss beyond the minimal possible error is called the excess risk.
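The statement can be explored empirically for a very simple learner; the following sketch (not from the slides; the Bernoulli setting, the smoothed estimator, and all constants are illustrative assumptions) estimates the probability over datasets that the learned model lands within ε of P∗ in relative entropy.

```python
# Sketch (not from the slides): empirically estimating a PAC-style statement for a
# simple learner. P* is Bernoulli(0.3); the learner returns a smoothed MLE. We
# estimate how often, over datasets of size M, D(P* || P_hat) <= eps holds.
import numpy as np

rng = np.random.default_rng(0)
p_star, eps, M, trials = 0.3, 0.01, 200, 2000

def kl_bernoulli(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

hits = 0
for _ in range(trials):
    D = rng.binomial(1, p_star, size=M)
    theta_hat = (D.sum() + 1) / (M + 2)    # smoothing keeps theta_hat strictly inside (0, 1)
    hits += kl_bernoulli(p_star, theta_hat) <= eps

print(hits / trials)                       # estimate of P_M({D : D(P* || P_{M_L(D)}) <= eps});
                                           # raising M drives it toward 1 for any fixed eps
```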

Generative vs Discriminative Training I

- We often know in advance that we want to perform a particular task, e.g., predicting Y from X.
- The training procedure we have described is to estimate the joint distribution P(Y, X).
- This is called generative training, as we train the model to generate all the variables.
- We can also do discriminative training, where the goal is to get P̂(Y | X) as close as possible to P∗(Y | X).
- The generatively trained model can be used for the prediction task.
- However, the discriminatively trained model does not model P(X), and cannot say anything about these variables.
- Discriminative training in a BN changes the meaning of the parameters: they no longer correspond to conditional distributions of P∗.
- Discriminative training is done in the context of undirected models, i.e., conditional random fields (CRFs).
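The contrast can be made concrete with a small sketch (not from the slides; the naive-Bayes-style and logistic-regression-style models, the data-generating process, and the learning rate are all illustrative assumptions): one model fits the joint P(Y, X) by counting, the other fits P(Y | X) directly.

```python
# Hedged sketch (not from the slides) contrasting generative and discriminative
# training on the same data.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two binary features, binary label depending on the first one.
X = rng.integers(0, 2, size=(500, 2))
Y = (rng.random(500) < 0.2 + 0.6 * X[:, 0]).astype(int)

# Generative training: estimate P(Y) and P(X_j | Y) by (smoothed) counting.
p_y1 = Y.mean()
p_xj1_given_y = np.array([[(X[Y == y, j].sum() + 1) / ((Y == y).sum() + 2)
                           for j in range(2)] for y in (0, 1)])

def generative_p_y1_given_x(x):
    """P(Y = 1 | x) obtained from the learned joint model via Bayes' rule."""
    def joint(y):
        prior = p_y1 if y == 1 else 1 - p_y1
        return prior * np.prod(np.where(x == 1, p_xj1_given_y[y], 1 - p_xj1_given_y[y]))
    return joint(1) / (joint(0) + joint(1))

# Discriminative training: fit P(Y = 1 | x) = sigmoid(w.x + b) directly.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.1 * (X.T @ (Y - p)) / len(Y)    # gradient ascent on the conditional log-likelihood
    b += 0.1 * np.mean(Y - p)

x_query = np.array([1, 0])
print(generative_p_y1_given_x(x_query), 1.0 / (1.0 + np.exp(-(x_query @ w + b))))
```

Both models answer the query P(Y | x), but only the generative one also defines a distribution over X; the discriminative one says nothing about the inputs.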

Generative vs Discriminative Training II

- Generative models have a higher bias, as they make assumptions about the form of P̂(X).
- Discriminative models make assumptions only about P̂(Y | X).
- The bias reduces the ability of the model to overfit the data, and thus generative models usually work better with small training sets.
- Discriminative models make fewer assumptions and are thus less impacted by incorrect assumptions; they work better with larger training sets.
- Discriminative models can make use of a much richer feature set. This can result in much higher classification performance, e.g., in segmentation.

Learning tasks

The input to the learning is:
- Some prior knowledge, or constraints, about M̂.
- A set D of IID samples.

The output of the learning is a model M̂, which may include the structure, the parameters, or both.

The learning problem varies along 3 axes:
- The output: the type of graphical model we are trying to learn, i.e., a BN or a Markov network.
- The extent of the constraints we are given on the model.
- The extent to which the data in our training set is fully observed.

Model Constraints

The extent to which our input constrains the hypothesis space, i.e., the class of models that we are allowed to learn:
- We are given a graph structure, and we have to learn only (some of) the parameters.
- We learn both the parameters and the structure.
- We might not even know the complete set of variables over which P∗ is defined, i.e., we might only observe some subset of the variables.

The less prior knowledge we are given, the larger the hypothesis space. This has both a
- statistical cost: if we restrict the space too much, it might be unable to represent P∗; if the model class is too flexible, we might have models with a high score and a bad fit to P∗; and a
- computational cost: the richer the hypothesis class, the more difficult the search.

Data observability

- Fully observed: each training instance observes all the variables.
- Partially observed: in each training instance, some variables are not observed, e.g., patients and medical tests.
- Hidden variables: the value of certain variables is never observed in any of the training instances. This arises if we don't know all the variables, but also if we simply don't have observations for some of them.

Missing data

- If data is missing, we have to hypothesize the missing values.
- The larger the number of these variables, the less reliably we can hypothesize them.
- For the task of knowledge discovery, the hidden variables might be very important.

Taxonomy of Learning Tasks in BN

Taxonomy of Learning Tasks in Markov Networks
