Today’s Lecture


Today's lecture

Exact inference in graphical models:
Variable elimination
Elimination as graph transformation
Treewidth
Sum-product belief propagation
Max-product belief propagation
Junction tree algorithm (during tutorial)

Figures from the Koller book as well as Nowozin et al.

Inference: conditional probabilities

Today we will look into exact inference in graphical models. We will focus on conditional probability queries

p(Y | E = e) = p(Y, e) / p(e)

Let W = X \ (Y ∪ E) be the random variables that are neither the query nor the evidence. Each of the distributions can be computed by marginalizing over the other variables:

p(Y, e) = \sum_w p(Y, e, w),    p(e) = \sum_{y,w} p(y, e, w)

Naively marginalizing over all unobserved variables requires an exponential number of computations. Why? Is there a more efficient algorithm?

Probabilistic inference in practice

Probabilistic inference is NP-hard (for both directed and undirected models). Should we give up?
NP-hardness simply says that there exist difficult inference problems; real-world inference problems are not necessarily as hard as these worst-case instances.
It is easy to do inference in some graphs, e.g., inference in hidden Markov models (HMMs) and other tree-structured models.

Variable Elimination (VE)

An exact algorithm for probabilistic inference in any graphical model.
Its running time depends on the graph structure.
It uses dynamic programming to avoid enumerating all assignments.
Let's first look at computing marginal probabilities p(X_i) in directed models, and then generalize this to conditional queries on MRFs.

Basic idea of variable elimination

Let's start with a simple chain A -> B -> C -> D, and suppose we want to compute P(D). This is nothing but computing a table.
By the chain rule we can write the joint distribution

p(A, B, C, D) = p(A) p(B | A) p(C | B) p(D | C)

To compute p(D), we can marginalize out all the other variables:

p(D) = \sum_{a,b,c} p(A = a, B = b, C = c, D)

This looks exponential, but ...

Let's be a bit more explicit.

There is structure in the summation (repeated terms). Let's modify the computation to first compute

P(a^1) P(b^1 | a^1) + P(a^2) P(b^1 | a^2)
P(a^1) P(b^2 | a^1) + P(a^2) P(b^2 | a^2)

Let's be a bit more explicit.

Let's modify the computation to first compute

P(a^1) P(b^1 | a^1) + P(a^2) P(b^1 | a^2)

and

P(a^1) P(b^2 | a^1) + P(a^2) P(b^2 | a^2)

These partial sums can then be reused. We define τ_1 : Val(B) -> R by

τ_1(b^i) = P(a^1) P(b^i | a^1) + P(a^2) P(b^i | a^2)

Let's be a bit more explicit.

We now have

P(D) = \sum_{b,c} τ_1(b) P(c | b) P(D | c)

We can once more reverse the order of the product and the sum and get

P(D) = \sum_c P(D | c) \sum_b P(c | b) τ_1(b)

Again we have the same kind of repetitive pattern.

Let's be a bit more explicit.

We define τ_2 : Val(C) -> R, with

τ_2(c^1) = τ_1(b^1) P(c^1 | b^1) + τ_1(b^2) P(c^1 | b^2)
τ_2(c^2) = τ_1(b^1) P(c^2 | b^1) + τ_1(b^2) P(c^2 | b^2)

Thus we can compute the marginal P(D) as

P(D) = \sum_c P(D | c) τ_2(c)

Even more explicit.

The joint is

P(D) = \sum_{a,b,c} p(a, b, c, D) = \sum_{a,b,c} P(a) P(b | a) P(c | b) P(D | c) = \sum_c \sum_b \sum_a P(D | c) P(c | b) P(b | a) P(a)

We can push the summations inside:

P(D) = \sum_c P(D | c) \sum_b P(c | b) \sum_a P(b | a) P(a)

Let's call ψ_1(A, B) = P(A) P(B | A) and τ_1(B) = \sum_A ψ_1(A, B).
We can then define ψ_2(B, C) = τ_1(B) P(C | B) and τ_2(C) = \sum_B ψ_2(B, C).
This procedure is dynamic programming: the computation is done inside out instead of outside in.

Complexity of a chain

Generalizing to a chain X_1 -> X_2 -> ... -> X_n, where each node has k states.
We can compute, at each step i = 1 up to n - 1,

P(X_{i+1}) = \sum_{x_i} P(X_{i+1} | x_i) P(x_i)

We need to multiply P(x_i) with the entries of the CPD P(X_{i+1} | X_i) for each value of x_i. P(X_i) has k values and the CPD P(X_{i+1} | X_i) has k^2 values, so each step takes k^2 multiplications and k(k - 1) additions.
The cost of the whole chain is O(n k^2).
In comparison, generating the full joint and summing it up has complexity O(k^n).
We have done inference over the joint without ever generating it explicitly.
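
The short sketch below (not part of the original slides) runs this O(n k^2) recursion in numpy on random CPDs and checks it against the O(k^n) naive sum over the full joint. The chain length, state count, and variable names are made up for illustration.

```python
import numpy as np

# A minimal sketch of the chain recursion P(X_{i+1}) = sum_{x_i} P(X_{i+1} | x_i) P(x_i),
# compared against the naive sum over the full joint table.
rng = np.random.default_rng(0)
n, k = 5, 3                                          # chain X1 -> ... -> Xn, k states each

prior = rng.random(k); prior /= prior.sum()          # P(X1)
cpds = []                                            # cpds[i][x_i, x_{i+1}] = P(X_{i+1} | X_i)
for _ in range(n - 1):
    T = rng.random((k, k)); T /= T.sum(axis=1, keepdims=True)
    cpds.append(T)

# Dynamic programming: push each sum in, one variable at a time (O(n k^2)).
marginal = prior
for T in cpds:
    marginal = marginal @ T                          # sum_{x_i} P(x_i) P(x_{i+1} | x_i)

# Naive check: build the full joint (k^n entries) and sum out X1..X_{n-1}.
joint = prior
for T in cpds:
    joint = joint[..., None] * T                     # joint over one more variable
naive = joint.reshape(-1, k).sum(axis=0)

assert np.allclose(marginal, naive)
print(marginal)
```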

Summary so far

Worst-case analysis says that computing the joint is NP-hard.
In practice, due to the structure of the Bayesian network, some subexpressions in the joint depend only on a subset of the variables.
We can cache computations that would otherwise be repeated exponentially many times.
This depends on our having a good variable elimination ordering.

Sum-product Inference

We want an algorithm to compute p(Y) for directed and undirected models. This can be reduced to the following sum-product inference task:

τ(Y) = \sum_z \prod_{φ ∈ Φ} φ(z_{Scope[φ] ∩ Z}, y_{Scope[φ] ∩ Y})

where Φ is a set of potentials or factors and the sum is over the set Z = X \ Y.
For directed models, Φ is given by the conditional probability distributions for all variables,

Φ = {φ_{X_i}}_{i=1}^n = {p(X_i | X_{Pa(X_i)})}_{i=1}^n

For MRFs, the factors Φ correspond to the set of potentials.
Sum-product returns unnormalized distributions, so we normalize by dividing by \sum_y τ(y).

Variable elimination

Let φ(X, Y) be a factor, where X is a set of variables and Y ∉ X.
Factor marginalization of φ over Y (also called "summing out Y in φ") produces a new factor

τ(X) = \sum_Y φ(X, Y)

We only add up the entries whose assignment to X matches.
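
As a tiny illustration (not from the slides), here is factor marginalization on a table stored as a numpy array; the numbers are made up.

```python
import numpy as np

# A factor phi(X, Y) stored as a table phi[x, y]; summing out Y keeps one entry per value of X.
phi = np.array([[0.5, 0.2],      # phi(x=0, y=0), phi(x=0, y=1)
                [0.1, 0.7]])     # phi(x=1, y=0), phi(x=1, y=1)

tau = phi.sum(axis=1)            # tau(X) = sum_Y phi(X, Y)
print(tau)                       # [0.7 0.8]
```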

Sum-product Variable Elimination

Order the variables Z (this is called the elimination ordering), and iteratively marginalize out one variable Z_i at a time (a code sketch follows the steps below).
For each i:
1. Multiply all factors that have Z_i in their scope, generating a new product factor.
2. Marginalize this product factor over Z_i, generating a smaller factor.
3. Remove the old factors from the set of all factors, and add the new one.
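
A minimal sketch of this loop on table factors follows. The factor representation, the helper names (multiply, sum_out, variable_elimination), and the small binary chain used to test it are illustrative choices, not code from the course.

```python
import numpy as np
from functools import reduce

# A factor is (vars, table): `vars` is a tuple of variable names, `table` has one axis per variable.

def multiply(f, g):
    """Product of two factors, aligning shared variables by broadcasting."""
    fv, ft = f
    gv, gt = g
    out_vars = tuple(fv) + tuple(v for v in gv if v not in fv)
    def expand(vars_, table):
        shape = [table.shape[vars_.index(v)] if v in vars_ else 1 for v in out_vars]
        perm = [vars_.index(v) for v in out_vars if v in vars_]
        return np.transpose(table, perm).reshape(shape)
    return out_vars, expand(fv, ft) * expand(gv, gt)

def sum_out(f, var):
    """Marginalize `var` out of factor f."""
    fv, ft = f
    return tuple(v for v in fv if v != var), ft.sum(axis=fv.index(var))

def variable_elimination(factors, elim_order):
    for z in elim_order:
        involved = [f for f in factors if z in f[0]]
        rest = [f for f in factors if z not in f[0]]
        psi = reduce(multiply, involved)       # step 1: product factor
        tau = sum_out(psi, z)                  # step 2: marginalize out z
        factors = rest + [tau]                 # step 3: replace old factors
    return reduce(multiply, factors)

# Chain A -> B -> C -> D with binary variables; query P(D) by eliminating A, B, C.
pA  = (("A",), np.array([0.6, 0.4]))
pBA = (("A", "B"), np.array([[0.7, 0.3], [0.2, 0.8]]))
pCB = (("B", "C"), np.array([[0.9, 0.1], [0.4, 0.6]]))
pDC = (("C", "D"), np.array([[0.5, 0.5], [0.3, 0.7]]))

vars_, table = variable_elimination([pA, pBA, pCB, pDC], ["A", "B", "C"])
print(vars_, table / table.sum())              # ('D',) and the normalized marginal P(D)
```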


Some properties

If we sum out all the variables in a normalized distribution, what do we get?
If we sum out all the variables in an unnormalized distribution, what do we get?
An important property is that sum and product are commutative, and the product is associative, i.e., (φ_1 φ_2) φ_3 = φ_1 (φ_2 φ_3).
Therefore, if X ∉ Scope(φ_1), then

\sum_X (φ_1 φ_2) = φ_1 \sum_X φ_2
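
A quick numerical check of this last identity (illustrative, not from the slides); the shapes and values are arbitrary.

```python
import numpy as np

# If X is not in the scope of phi1, the sum over X can be pushed past phi1.
rng = np.random.default_rng(0)
phi1 = rng.random(3)          # phi1(B), no dependence on X
phi2 = rng.random((3, 4))     # phi2(B, X)

lhs = (phi1[:, None] * phi2).sum(axis=1)   # sum_X (phi1 * phi2)
rhs = phi1 * phi2.sum(axis=1)              # phi1 * sum_X phi2
assert np.allclose(lhs, rhs)
```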

Chain example again

Let's look at the chain again:

P(A, B, C, D) = φ_A φ_B φ_C φ_D

The marginal distribution over D is

P(D) = \sum_{A,B,C} φ_A φ_B φ_C φ_D = \sum_C φ_D ( \sum_B φ_C ( \sum_A φ_B φ_A ) )

where we have used the limited scope of the factors.
Marginalizing involves taking the product of all CPDs and summing over all but the query variables.
We can do this in any order we want; some orders are more efficient than others.
This is effective because the scopes are limited, which lets us push in some of the summations.

Example of Directed Graph

The joint distribution is

p(C, D, I, G, S, L, H, J) = p(C) p(D | C) p(I) p(G | D, I) p(L | G) p(S | I) p(J | S, L) p(H | J, G)

with factors

Φ = {φ_C(C), φ_D(C, D), φ_I(I), φ_G(G, D, I), φ_L(L, G), φ_S(S, I), φ_J(J, S, L), φ_H(H, J, G)}

Let's do variable elimination with ordering {C, D, I, H, G, S, L} on the board!

Example of Directed Graph

(Figure: the variable-elimination steps for the ordering above.)

Elimination Ordering

We can pick any order we want, but some orderings introduce factors with much larger scope.
(Figure: the elimination steps for an alternative ordering.)

Semantics of Factors

In the chain example the intermediate factors were marginal or conditional probabilities, but this is not true in general.
The result of eliminating X is not, in general, a marginal or conditional probability of the network:

τ(A, B, C) = \sum_X P(X) P(A | X) P(C | B, X)

How to introduce evidence?

But we wanted to answer conditional probability queries

p(Y | E = e) = p(Y, e) / p(e)

Apply the variable elimination algorithm to the task of computing p(Y, e).
Replace each factor φ ∈ Φ with E ∩ Scope[φ] ≠ ∅ by

φ'(x_{Scope[φ] \ E}) = φ(x_{Scope[φ] \ E}, e_{E ∩ Scope[φ]})

Then eliminate the variables in X \ Y \ E. The returned factor φ*(Y) is p(Y, e).
To obtain the conditional p(Y | e), normalize the resulting product of factors; the normalization constant is p(e).
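
A minimal sketch of this factor-reduction step; the function name reduce_factor and the numbers are illustrative, not from the course.

```python
import numpy as np

# Reduce a factor to observed evidence: fix the evidence variable's axis to its
# observed value, producing the smaller factor phi'(Scope \ E) = phi(Scope \ E, e).
def reduce_factor(vars_, table, evidence):
    for var, val in evidence.items():
        if var in vars_:
            axis = vars_.index(var)
            table = np.take(table, val, axis=axis)   # drop that axis at the observed value
            vars_ = tuple(v for v in vars_ if v != var)
    return vars_, table

# phi(S, I): rows index S, columns index I.
phi = (("S", "I"), np.array([[0.95, 0.2],
                             [0.05, 0.8]]))
print(reduce_factor(*phi, evidence={"I": 1}))        # (('S',), array([0.2, 0.8]))
```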

Sum-product VE for conditional distributions

(Figure: pseudocode for sum-product VE with evidence.)

Example

We want to compute P(J).
To compute P(J | i^1, h^0), we first reduce the factors to the evidence and then run variable elimination as before.

Complexity of Variable Elimination

Let n be the number of random variables and m the number of initial factors.
At each step we pick a variable X_i and multiply all factors involving that variable, resulting in a single factor ψ_i. The variable is then summed out of ψ_i, resulting in a new factor τ_i with Scope(τ_i) = Scope(ψ_i) \ {X_i}.
Let N_i be the number of variables in the scope of ψ_i, and let N_max = max_i N_i; the factor ψ_i then has up to k^{N_i} entries.
Therefore the total cost is O(m k^{N_max}), with k = |Val(X)|.
Problem: N_max can be as large as n.

Complexity in Graph terms

Let's try to analyze the complexity in terms of the graph structure.
The algorithm does not care whether the graph is directed or undirected; it only depends on the scopes of the factors.
Let G_Φ be the undirected graph with one node per variable, where we put an edge between X_i and X_j iff there exists a factor φ ∈ Φ such that X_i, X_j ∈ Scope(φ).
Ignoring evidence, this is either the original MRF or the "moralized" directed graph.

Elimination as Graph Transformation

When a variable X is eliminated:
We create a single factor ψ that contains X and all of the variables Y with which it appears in factors.
We eliminate X from ψ, replacing it with a new factor τ that contains all of the variables Y but not X. Call the resulting set of factors Φ_X.
How does this modify the graph, from G_Φ to G_{Φ_X}?
Constructing ψ generates edges between all of the variables Y ∈ Y.
Some of these edges were already in G_Φ; some are new. The new edges are called fill edges.
The step of removing X from ψ to construct τ removes X and all its incident edges from the graph.
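
A minimal sketch of this graph transformation (illustrative): the adjacency-dict representation and function name are mine, and the edge list is the moralized graph of the directed example above, eliminated in the order used there.

```python
from itertools import combinations

def eliminate(adj, x):
    """adj: dict mapping node -> set of neighbours. Eliminates x and returns the fill edges added."""
    neighbours = adj[x]
    fill = [(u, v) for u, v in combinations(sorted(neighbours), 2) if v not in adj[u]]
    for u, v in fill:                      # connect the neighbours of x
        adj[u].add(v)
        adj[v].add(u)
    for u in neighbours:                   # remove x and its incident edges
        adj[u].discard(x)
    del adj[x]
    return fill

# Moralized graph of the directed example above (co-parent edges D-I, S-L, G-J included).
edges = [("C","D"), ("D","G"), ("I","G"), ("D","I"), ("I","S"), ("G","L"),
         ("S","J"), ("L","J"), ("S","L"), ("G","H"), ("J","H"), ("G","J")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for x in ["C", "D", "I", "H", "G", "S", "L"]:
    print(f"eliminate {x}: fill edges {eliminate(adj, x)}")   # only eliminating I adds a fill edge (G, S)
```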

Example

(Figures: the moralized graph, and the graphs after eliminating C, after eliminating D, and after eliminating I.)

Induced Graph

We can summarize the computation cost using a single graph that is the union of all the graphs resulting from each step of the elimination.
We call this the induced graph I_{Φ,≺}, where ≺ is the elimination ordering.
N_max is the size of the largest clique in I_{Φ,≺}.
The running time is O(m k^{N_max}), i.e., exponential in the size of the largest clique of the induced graph.

Example

(Figures: the induced graph and its maximal cliques.)

Example

(Figures: the maximal cliques and the VE steps.)
The maximal cliques in I_{G,≺} are

C_1 = {C, D}
C_2 = {D, I, G}
C_3 = {G, L, S, J}
C_4 = {G, J, H}

Induced Width

The width of an induced graph is the number of nodes in its largest clique minus 1.
We define the induced width w_{G,≺} to be the width of the graph I_{G,≺} induced by applying VE to G using ordering ≺.
We define the treewidth, or minimal induced width, of a graph G to be

w_G = min_≺ w(I_{G,≺})

The treewidth provides a bound on the best running time achievable by VE on a distribution that factorizes over G: O(m k^{w_G + 1}).
Finding the best elimination ordering (equivalently, computing the treewidth) is NP-hard.
In practice we use heuristics to find a good elimination ordering.

Choosing an Elimination Ordering

Some possible heuristics (a greedy sketch of the first one follows below):
Min-fill: the cost of a vertex is the number of edges that need to be added to the graph due to its elimination.
Weighted-min-fill: the cost of a vertex is the sum of the weights of the edges that need to be added to the graph due to its elimination, where the weight of an edge is the product of the weights of its constituent vertices.
Min-neighbors: the cost of a vertex is the number of neighbors it has in the current graph.
Min-weight: the cost of a vertex is the product of the weights (domain cardinality) of its neighbors.
Which one is better? None of these criteria dominates the others; in practice one often tries several.
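
A minimal sketch of greedy min-fill (illustrative, not course code): repeatedly eliminate the vertex whose elimination adds the fewest fill edges, using the same moralized graph as above.

```python
from itertools import combinations

def fill_cost(adj, x):
    """Number of fill edges that eliminating x would add."""
    return sum(1 for u, v in combinations(adj[x], 2) if v not in adj[u])

def min_fill_ordering(adj):
    adj = {u: set(vs) for u, vs in adj.items()}        # work on a copy
    order = []
    while adj:
        x = min(adj, key=lambda u: fill_cost(adj, u))  # cheapest vertex to eliminate next
        nbrs = adj.pop(x)
        for u, v in combinations(nbrs, 2):             # add the fill edges
            adj[u].add(v); adj[v].add(u)
        for u in nbrs:                                 # drop x's incident edges
            adj[u].discard(x)
        order.append(x)
    return order

edges = [("C","D"), ("D","G"), ("I","G"), ("D","I"), ("I","S"), ("G","L"),
         ("S","J"), ("L","J"), ("S","L"), ("G","H"), ("J","H"), ("G","J")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
print(min_fill_ordering(adj))
```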

Sum-product belief propagation (BP)

What if we want to compute marginals for many variables, e.g., p(X_i) for all X_i ∈ X?
We could run VE once per variable, but can we do something more efficient?
Consider a tree. We have

p(x_1, ..., x_n) = (1/Z) \prod_i φ_i(x_i) \prod_{(i,j) ∈ T} φ_{i,j}(x_i, x_j)

Sum-product BP computes all marginals with just two passes.
It is based on passing "messages" (tables of partial summations) between neighboring vertices of the graph.

Sum-product message

The message sent from variable j to a neighbor i ∈ N(j) is

m_{j→i}(x_i) = \sum_{x_j} φ_j(x_j) φ_{ij}(x_i, x_j) \prod_{k ∈ N(j) \ i} m_{k→j}(x_j)

Each message m_{j→i} is a vector with one value for each state of x_i.
In order to compute m_{j→i}, we must already have m_{k→j}(x_j) for every k ∈ N(j) \ i.
Thus we need a specific ordering of the messages.

Example

Suppose we want to compute p(x_1), with x_1 the root of a small tree with edges (1,2), (1,5), (2,3), (2,4):

p(x_1) = \sum_{x_2,...,x_n} \prod_i φ_i(x_i) \prod_{(i,j) ∈ T} φ_{i,j}(x_i, x_j)

m_{5→1}(x_1) = \sum_{x_5} φ_5(x_5) φ_{15}(x_1, x_5)
m_{3→2}(x_2) = \sum_{x_3} φ_3(x_3) φ_{23}(x_2, x_3)
m_{4→2}(x_2) = \sum_{x_4} φ_4(x_4) φ_{24}(x_2, x_4)
m_{2→1}(x_1) = \sum_{x_2} φ_2(x_2) φ_{12}(x_1, x_2) m_{3→2}(x_2) m_{4→2}(x_2)

p(x_1) = (1/Z) φ_1(x_1) m_{2→1}(x_1) m_{5→1}(x_1),    Z = \sum_{x_1} φ_1(x_1) m_{2→1}(x_1) m_{5→1}(x_1)

The elimination algorithm in trees is equivalent to message passing.
What if we also want p(x_5)? Do we rerun the algorithm?

Belief Propagation (BP)

Input: tree T with potentials φ_i(x_i) and φ_{ij}(x_i, x_j) for (i, j) ∈ T.
Choose a root r arbitrarily.
Pass messages from the leaves to r.
Pass messages from r to the leaves.
Compute

p(x_i) ∝ φ_i(x_i) \prod_{j ∈ N(i)} m_{j→i}(x_i),    for all i

The running time is twice the cost of VE, O(n k^2), with n = #nodes and k = #states per node.
The procedure can be numerically difficult: renormalize the messages to sum to 1, since the constants are absorbed by the normalization at the end:

m_{j→i}^{new}(x_i) = m_{j→i}^{old}(x_i) / \sum_{x̂_i} m_{j→i}^{old}(x̂_i)

It is even better to work in the log domain.
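
A minimal sketch of the two-pass schedule on the five-node tree from the example, with message renormalization. The random potentials and helper names are illustrative, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
edges = [(1, 2), (1, 5), (2, 3), (2, 4)]
nodes = [1, 2, 3, 4, 5]
nbrs = {i: [] for i in nodes}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

phi = {i: rng.random(k) for i in nodes}                       # unary potentials
psi = {(i, j): rng.random((k, k)) for i, j in edges}          # pairwise potentials
def pair(i, j):                                               # pairwise table oriented as (x_i, x_j)
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

def message(j, i, msgs):
    """m_{j->i}(x_i) = sum_{x_j} phi_j(x_j) psi_ij(x_i, x_j) prod_{k in N(j)\\i} m_{k->j}(x_j)."""
    prod = phi[j].copy()
    for l in nbrs[j]:
        if l != i:
            prod *= msgs[(l, j)]
    m = pair(i, j) @ prod
    return m / m.sum()                                        # renormalize for stability

root = 1
order = []                                                    # (child, parent) edges in post-order
def dfs(u, parent):
    for v in nbrs[u]:
        if v != parent:
            dfs(v, u)
            order.append((v, u))
dfs(root, None)

msgs = {}
for j, i in order:                                            # leaf-to-root pass
    msgs[(j, i)] = message(j, i, msgs)
for j, i in reversed(order):                                  # root-to-leaf pass
    msgs[(i, j)] = message(i, j, msgs)

for i in nodes:                                               # node marginals
    b = phi[i].copy()
    for j in nbrs[i]:
        b *= msgs[(j, i)]
    print(i, b / b.sum())
```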

Generalization to Tree-Structured Factor Graphs

Iteratively update and pass two kinds of messages:
r_{F→y_i}: factor-to-variable message
q_{y_i→F}: variable-to-factor message
(Figure: a tree-structured factor graph.)

Variable to factor

Let M(i) be the set of factors adjacent to variable i, M(i) = {F ∈ F : (i, F) ∈ E}.
The variable-to-factor message is

q_{y_i→F}(y_i) = \sum_{F' ∈ M(i) \ {F}} r_{F'→y_i}(y_i)

Factor to variable

The factor-to-variable message is

r_{F→y_i}(y_i) = log \sum_{y'_F ∈ Y_F, y'_i = y_i} exp( θ(y'_F) + \sum_{j ∈ N(F) \ {i}} q_{y'_j→F}(y'_j) )

Message Scheduling

1. Select one variable as the tree root.
2. Compute leaf-to-root messages.
3. Compute root-to-leaf messages.

Computing marginals

The partition function can be evaluated at the root:

log Z = log \sum_{y_r} exp( \sum_{F ∈ M(r)} r_{F→y_r}(y_r) )

Marginal distributions, for each factor:

µ_F(y_F) = p(y_F) = (1/Z) exp( θ_F(y_F) + \sum_{i ∈ N(F)} q_{y_i→F}(y_i) )

MAP Inference

Recall the MAP inference task

arg max_x p(x),    p(x) = (1/Z) \prod_{c ∈ C} φ_c(x_c)

where we assume any evidence has been subsumed into the potentials.
As the partition function is a constant, we can equivalently solve

arg max_x \prod_{c ∈ C} φ_c(x_c)

This is the max-product inference task.
Since the log is monotonic, let θ_c(x_c) = log φ_c(x_c) and solve

arg max_x \sum_{c ∈ C} θ_c(x_c)

This is called max-sum.

Semi-rings

Compare the sum-product problem with the max-product problem (equivalently, max-sum in log space):

sum-product:    \sum_x \prod_{c ∈ C} φ_c(x_c)
max-sum:        max_x \sum_{c ∈ C} θ_c(x_c)

We can exchange the operators (+, ×) for (max, +) and, because both pairs form semirings satisfying associativity and commutativity, everything works!
We get "max-product variable elimination" and "max-product belief propagation".
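
A small sketch of the semiring view (illustrative, not from the slides): the same chain-elimination routine run once with (sum, ×) and once with (max, +) on log potentials. The random potentials and the helper name eliminate_chain are mine.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 4
phis = [rng.random((k, k)) for _ in range(n - 1)]      # pairwise potentials on a chain, oriented (x_i, x_{i+1})

def eliminate_chain(tables, reduce_op, combine_op, unit):
    """Eliminate variables right-to-left; only the (reduce, combine) semiring changes."""
    msg = unit
    for t in reversed(tables):
        msg = reduce_op(combine_op(t, msg), axis=1)    # eliminate the rightmost remaining variable
    return reduce_op(msg, axis=0)

Z = eliminate_chain(phis, np.sum, lambda t, m: t * m, np.ones(k))                     # sum-product
map_val = eliminate_chain([np.log(t) for t in phis], np.max, lambda t, m: t + m, np.zeros(k))  # max-sum
print("sum-product (Z):", Z, " max-sum (best log-score):", map_val)
```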

Simple Example

Suppose we have a chain A - B - C - D and we want the MAP assignment

max_{a,b,c,d} φ_AB(a, b) φ_BC(b, c) φ_CD(c, d)

We can push the maximizations inside:

max_{a,b} φ_AB(a, b) max_c φ_BC(b, c) max_d φ_CD(c, d)

or equivalently, in log space,

max_{a,b} ( θ_AB(a, b) + max_c ( θ_BC(b, c) + max_d θ_CD(c, d) ) )

To find the actual maximizing assignment, we do a traceback.
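
A minimal sketch of this max-sum recursion with a traceback, checked against brute force; the random log-potentials and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
theta_AB, theta_BC, theta_CD = (np.log(rng.random((k, k))) for _ in range(3))

# Inward (right-to-left) maximizations, keeping argmax pointers.
m_D = theta_CD.max(axis=1)                 # m_D(c)   = max_d theta_CD(c, d)
ptr_D = theta_CD.argmax(axis=1)            # best d for each c
m_C = (theta_BC + m_D).max(axis=1)         # m_C(b)   = max_c [theta_BC(b, c) + m_D(c)]
ptr_C = (theta_BC + m_D).argmax(axis=1)    # best c for each b

score = theta_AB + m_C                     # joint score over (a, b)
a, b = np.unravel_index(score.argmax(), score.shape)
c = ptr_C[b]                               # traceback
d = ptr_D[c]
print("MAP assignment:", (a, b, c, d), "log-score:", score[a, b])

# Sanity check against brute force over all k^4 assignments.
best = max(((theta_AB[a2, b2] + theta_BC[b2, c2] + theta_CD[c2, d2], (a2, b2, c2, d2))
            for a2 in range(k) for b2 in range(k) for c2 in range(k) for d2 in range(k)))
assert best[1] == (a, b, c, d) or np.isclose(best[0], score[a, b])
```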

Max Product VE

(Figure: the max-product variable-elimination algorithm.)

Max-product belief propagation (for tree-structured MRFs)

Same as sum-product BP, except that the messages are now

m_{j→i}(x_i) = max_{x_j} φ_j(x_j) φ_{ij}(x_i, x_j) \prod_{k ∈ N(j) \ i} m_{k→j}(x_j)

After passing all messages, we can compute the single-node max-marginals

m_i(x_i) = φ_i(x_i) \prod_{j ∈ N(i)} m_{j→i}(x_i),    for all i

If the MAP assignment x* is unique, we can find it by locally decoding each of the single-node max-marginals:

x*_i = arg max_{x_i} m_i(x_i)

Max-sum belief propagation (for tree-structured MRFs)

Same as sum-product BP, except that the messages are now

m_{j→i}(x_i) = max_{x_j} ( θ_j(x_j) + θ_{ij}(x_i, x_j) + \sum_{k ∈ N(j) \ i} m_{k→j}(x_j) )

After passing all messages, we can compute the single-node max-marginals

m_i(x_i) = θ_i(x_i) + \sum_{j ∈ N(i)} m_{j→i}(x_i),    for all i

If the MAP assignment x* is unique, we can find it by locally decoding each of the single-node max-marginals:

x*_i = arg max_{x_i} m_i(x_i)

Working in log space prevents numerical underflow/overflow.

Factor Graph Max Product

Iteratively update and pass two kinds of messages:
r_{F→y_i}: factor-to-variable message
q_{y_i→F}: variable-to-factor message
(Figure: a tree-structured factor graph.)

Variable to factor

Let M(i) be the set of factors adjacent to variable i, M(i) = {F ∈ F : (i, F) ∈ E}.
The variable-to-factor message is

q_{y_i→F}(y_i) = \sum_{F' ∈ M(i) \ {F}} r_{F'→y_i}(y_i)

Factor to variable

The factor-to-variable message is now

r_{F→y_i}(y_i) = max_{y'_F ∈ Y_F, y'_i = y_i} ( θ(y'_F) + \sum_{j ∈ N(F) \ {i}} q_{y'_j→F}(y'_j) )

Message Scheduling

1. Select one variable as the tree root.
2. Compute leaf-to-root messages.
3. Compute root-to-leaf messages.

Max Product vs. Sum Product (log domain)

Max-sum version of max-product:
1. Compute leaf-to-root messages

q_{y_i→F}(y_i) = \sum_{F' ∈ M(i) \ {F}} r_{F'→y_i}(y_i)

2. Compute root-to-leaf messages

r_{F→y_i}(y_i) = max_{y'_F ∈ Y_F, y'_i = y_i} ( θ(y'_F) + \sum_{j ∈ N(F) \ {i}} q_{y'_j→F}(y'_j) )

Sum-product:
1. Compute leaf-to-root messages

q_{y_i→F}(y_i) = \sum_{F' ∈ M(i) \ {F}} r_{F'→y_i}(y_i)

2. Compute root-to-leaf messages

r_{F→y_i}(y_i) = log \sum_{y'_F ∈ Y_F, y'_i = y_i} exp( θ(y'_F) + \sum_{j ∈ N(F) \ {i}} q_{y'_j→F}(y'_j) )

