2y ago

42 Views

2 Downloads

1.31 MB

19 Pages

Transcription

IntroductionBayesian Networks (aka Bayes Nets, Belief Nets,Directed Graphical Models) Chapter 14.1, 14.2, and 14.4plus optional paper “Bayesiannetworks without tears” [based on slides by Jerry Zhu and Andrew Moore]12Full Joint Probability DistributionUsing the Full Joint Distribution Making a joint distribution of N variables:1. List all combinations of values (if each variablehas k values, there are kN combinations)2. Assign each combination a probability 3. They should sum to 13Probabilistic models allow us to useprobabilistic inference (e.g., Bayes’s rule) tocompute the probability distribution over a setof unobserved (“hypothesis”) variables givena set of observed variablesFull joint probability distribution table is greatfor inference in an uncertain world, but isterrible to obtain and storeBayesian Networks allow us to represent jointdistributions in manageable chunks using§ Independence, conditional independenceBayesian Networks can do any inferenceOnce you have the joint distribution, you can doanything, e.g. marginalization:P(E) rows matching E P(row)e.g., P(Sunny or Hot) (150 50 40 5)/365Convince yourself this is the same as P(sunny) P(hot) - P(sunny and inyCold60/365RainyCold60/36541

Using the Joint Distribution The Bad NewsYou can also do inference: rows matching Q AND E P(row) P(Q E) rows matching E P(row)P(Hot 365RainyCold60/3655 For N variables, each taking k values, the jointdistribution has kN numbers (and kN – 1 degrees offreedom)It would be nice to use fewer numbers Bayesian Networks to the rescue!§ Provides a decomposed / factorizedrepresentation of the FJPD§ Encodes a collection of conditionalindependence relations6Bayesian Networks 14Full Joint distribution requires a lot of storage spaceExampleRepresent (direct) dependencies graphically§ A: your alarm soundsDirected, acylic graphs (DAGs)Nodes random variables§ J: your neighbor John calls you§ M: your other neighbor Mary calls you§ “CPT” stored at each node quantifies conditionalprobability of node’s r.v. given all its parentsDirected arc from A to B means A “has a directinfluence on” or “causes” B§ Evidence for A increases likelihood of B(deductive influence from causes to effects) § John and Mary do not communicate (they promisedto call you whenever they hear the alarm)What kind of independence do we have?What does the Bayes Net look like?§ Evidence for B increases likelihood of A(abductive influence from effects to causes)Encodes conditional independence assumptions152

Conditional IndependenceConditional Independence Random variables can be dependent, butconditionally independentExample: Your house has an alarm§ Neighbor John will call when he hears the alarm§ Neighbor Mary will call when she hears the alarm § Assume John and Mary don’t talk to each other Is JohnCall independent of MaryCall?§ No – If John called, it is likely the alarm went off,which increases the probability of Mary calling§ P(MaryCall JohnCall) P(MaryCall)16But, if we know the status of the alarm, JohnCallwill not affect whether or not Mary callsP(MaryCall Alarm, JohnCall) P(MaryCall Alarm)We say JohnCall and MaryCall areconditionally independent given AlarmIn general, “A and B are conditionallyindependent given C” means:P(A B, C) P(A C)P(B A, C) P(B C)P(A, B C) P(A C) P(B C)17ExampleExamples§ A: your alarm sounds§ J: your neighbor John calls you§ M: your other neighbor Mary calls you § John and Mary do not communicate (they promisedto call you whenever they hear the alarm)What kind of independence do we have? Conditional independence: P(J,M A) P(J A)P(M A)What does the Bayes Net look like? Our BN: P(A,J,M) P(A) P(J A) P(M A)§ A: yoursoundsChainrule:alarmP(A,J,M) P(A) P(J A) P(M A,J)§ J: your neighbor John calls youOurassumesconditionalindependence,§ M:BNyourother neighborMarycalls youso P(M A,J) P(M A)§ John and Mary do not communicate (they promisedto call you whenever they hear the alarm)What kind of independence do we have? Conditional independence P(J,M A) P(J A)P(M A)What does the Bayes Net look like?AJ18AMJM193

A Simple Bayesian NetworkS Î {no , light , heav y}Smoking0.80P(S no)P(S light) 0.15P(S heavy) 0.05CancerC Î {none , benign , m alignant }noDon’t needto storeA Bayesian Networklight0.96 0.88P(C benign S ) 0.03 0.08P(C malig S ) 0.01 0.04P(C none S )21GenderExposureto ngTumor27Applications 28AgePathfinder Pathfinder was one of the first BN systems It performed diagnosis of lymph-nodeMedical diagnosis systemsManufacturing system diagnosisComputer systems diagnosisNetwork systems diagnosisHelpdesk troubleshootingInformation retrievalCustomer modelingdiseases It dealt with over 60 diseases and 100symptoms and test results 14,000 probabilities Commercialized and applied to about 20tissue types324

PathfinderBayesNet448 nodes,906 arcs3335Conditional Independence in Bayes NetsConditional Independence§ A node is conditionally independent of itsnon-descendants, given its parents§ A node is conditionally independent of all othernodes, given its “Markov blanket” (i.e., its parents,children, and children’s parents)AgeGenderSmokingCancer is conditionallyindependent of Ageand Gender givenSmokingCancer36375

Interpreting Bayesian NetsConditional Independence CancerSerumCalcium2 nodes are (unconditionally) independent ifthere’s no undirected path between themIf there is an undirected path between 2 nodes, thenwhether or not they are independent or dependentdepends on what other evidence is knownSerum Calcium isconditionally independent ofLung Tumor, given CancerLungTumorAA and B are independentgiven nothing else, but aredependent given CCP(L SC, C) P(L C)38B39Example with 5 VariablesCreating a Bayes Net § B: there’s burglary in your house§ E: there’s an earthquake§ A: your alarm soundsStep 1: Add variables. Choose the variables youwant to include in the Bayes NetB§ J: your neighbor John calls you§ M: your other neighbor Mary calls you EAB, E are independentJJ is directly influenced by only A (i.e., J isconditionally independent of B, E, M, given A)MB: there’s burglary in your houseM is directly influenced by only A (i.e., M isconditionally independent of B, E, J, given A)E: there’s an earthquakeA: your alarm soundsJ: your neighbor John calls youM: your other neighbor Mary calls you40416

Creating a Bayes Net Creating a Bayes Net Step 2: Add directed edges The graph must be acyclic If node X is given parents Q1, , Qm, you areThe table at node X must list P(X Parent values) forall combinations of parent valuese.g., you must specifyP(J A) AND P(J A)P(B) 0.001BEsince they don’t haveP(E) 0.002to sum to 1!saying that any variable that’s not a descendant ofX is conditionally independent of X given Q 1, , Q mBStep 3: Add CPTs (Conditional Probability Tables)P(A P(A P(A P(A EB, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJB: there’s burglary in your houseAME: there’s an earthquakeJME: there’s an earthquakeP(J A) 0.9P(J A) 0.05A: your alarm soundsJ: your neighbor John calls youM: your other neighbor Mary calls you42B: there’s burglary in your houseP(M A) 0.7P(M A) 0.01A: your alarm soundsJ: your neighbor John calls youM: your other neighbor Mary calls you43The Bayesian Network Created from aDifferent Variable OrderingCreating a Bayes Net1. Choose a set of relevant variables2. Choose an ordering of them, call them X1, , XN3. for i 1 to N:1. Add node Xi to the graph2. Set parents(Xi) to be the minimal subset of{X1 Xi-1}, such that xi is conditionallyindependent of all other members of {X1 Xi-1}given parents(Xi)3. Define the CPTs forP(Xi assignments of parents(Xi)) Different ordering leads to different graph, in general Best ordering when each variable is considered after allvariables that directly influence it44457

The Bayesian Network Created from aDifferent Variable OrderingCompactness of Bayes Nets A Bayesian Network is a graph structure forrepresenting conditional independence relations ina compact wayA Bayes net encodes the full joint distribution (FJPD),often with far less parameters (i.e., numbers)A full joint table needs kN parameters (N variables, kvalues per variable)§ grows exponentially with NIf the Bayes net is sparse, e.g., each node has at mostM parents (M N), only needs O(NkM) parameters§ grows linearly with N§ can’t have too many parents, though4647Variable Dependencies Directed arc from one variable to another variableA Causal ChainBAA Is A guaranteed to be independent of B?§ No – Information can be transmitted over 1 arc (ineither direction) Example: My knowing the Alarm went off,increases my belief there has been a Burglary,and, similarly, my knowing there has been aBurglary increases my belief the Alarm went off48This local configuration is called a “causal chain:”BBCIs A guaranteed to be independent of C?§ No – Information can be transmitted between A andC through B if B is not observed Example: B à A à MNot knowing Alarm means that my knowing thata Burglary has occurred increases my belief thatMary calls, and similarly, knowing that MaryCalls increases my belief that there has been aBurglary498

Causal Chain This local configuration is called a “causal chain:”AA Common CauseBBThis configuration is called “common cause:”ACIs A independent of C given B? § Yes – Once B is observed, information cannot betransmitted between A and C through B; B“blocks” the information path; “C is conditionallyindependent of A given B” Example: B à A à M CIs it guaranteed that B and C are independent?§ No – Information can be transmitted through A to thechildren of A if A is not observedIs it guaranteed that B and C are independent given A?§ Yes – Observing the cause, A, blocks the influencebetween effects B and C; “B is conditionallyindependent of C given A”Knowing that the Alarm went off means thatalso knowing that a Burglary has taken placewill not increase my belief that Mary Calls50B51Common Effect This configuration is called “common effect:”A Common EffectBACAre A and B independent? Are A and B independent given C?§ No – Information can be transmitted through Camong the parents of C if C is observed Example: B à A ß EIf I already know that the Alarm went off, my further knowingthat there has been an Earthquake, decreases my belief thatthere has been a Burglary. Called “explaining away.”§ Proof: P(a,b) Σc P(a,b,c) by marginalization Σc P(a) P(b a) P(c a,b) by chain rule Σc P(a) P(b) P(c a,b) by cond. indep. P(a) P(b) Σc P(c a,b)52BC§ Yes Example: B à A ß EBurglary and Earthquake cause the Alarm togo off, but they are not correlated P(a) P(b)This configuration is called “common effect:”§ Similarly, if C has descendant D and D is given,then A and B are not independentsince last term 1539

Computing a Joint Entry from a Bayes NetD-SeparationHow to compute an entry in the joint distribution (FJPD)?Determining if two variables in a BayesianNetwork are independent or conditionallyindependent given a set of observed evidencevariables, is determined using “d-separation”E.g., what is P(S, M, L, R, T)?P(S) 0.3P(L M,S) 0.05P(L M, S) 0.1P(L M,S) 0.1P(L M, S) 0.2P(R M) 0.3P(R M) 0.6LP(T L) 0.3P(T L) 0.8R55Computing with Bayes NetP(S) 0.3P(L M,S) 0.05P(L M, S) 0.1P(L M,S) 0.1P(L M, S) 0.2MSVariable OrderingP(M) 0.6P(R M) 0.3P(R M) 0.6LP(T L) 0.3P(T L) 0.8Before applying chain rule, best to reorder all ofthe variables, listing first the leaf nodes, then allthe parents of the leaves, etc. Last variableslisted are those that have no parents, i.e., theroot nodes.RTApply the Chain Rule conditional independence!P(T, R, L, M, S) P(T R, L, M, S) * P( R, L, M, S) P(T L) * P( R, L, M, S) P(T L) * P( R L, M, S) * P(L, M, S) P(T L) * P( R M) * P(L, M, S) P(T L) * P( R M) * P(L M, S) * P( M, S) P(T L) * P( R M) * P(L M, S) * P( M S) * P(S) P(T L) * P( R M) * P(L M, S) * P( M) * P(S)57P(M) 0.6TD-separation is covered in CS 76054MSSo, for previous example,P(S,L,M,T,R) P(T,R,L,S,M)5810

The General CaseComputing Joint Probabilitiesusing a Bayesian Network AP(X1 x1 , X2 x2 , ., Xn-1 xn-1 , Xn xn) P(Xn xn , Xn-1 xn-1 , ., X2 x2 , X1 x1) P(Xn xn Xn-1 xn-1 , ., X2 x2 , X1 x1) * P(Xn-1 xn-1 , , X2 x2 , X1 x1)How is any marginal probability computed?Sum the relevant joint probabilities: P(Xn xn Xn-1 xn-1 , ., X2 x2 , X1 x1) * P(Xn-1 xn-1 X2 x2 , X1 x1) *P(Xn-2 xn-2 , ., X2 x2 , X1 x1) P (( Xn i 1n(i xi )(( Xi 1 xi 1 ) , ,( X1 x1 ))) P ( Xi xi ) Assignments of Parents ( Xi )i 1 A BN can answer any query (i.e., marginal probability)about the domain by marginalization (“summing out”) overthe relevant joint probabilities60Inference by EnumerationWhere Are We Now? 61DCompute: P(c) P(a,b,c,d) P(a, b,c,d) P( a,b,c,d) P( a, b,c,d) P(a,b,c, d) P(a, b,c, d) P( a,b,c, d) P( a, b,c, d))59CCompute: P(a,b) P(a,b,c,d) P(a,b,c, d) P(a,b, c,d) P(a,b, c, d):B joint matching Q AND E P(joint)We defined a Bayes net, using a small numberof parameters, to describe the joint probabilityAny joint probability can be computed asP(x1, , xN) i P(xi parents(xi))The above joint probability can be computed intime linear in the number of nodes, NWith this joint distribution, we can compute anyconditional probability, P(Q E); thus we canperform any inferenceHow?by def. of cond. prob.P(Q E) joint matching E P(joint)P(E) 0.002For example: P(B J, M)1. Compute P(B,J, M)2. Compute P(J, M)P(B) 0.001P(A P(A P(A P(A BB, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.0013. Return P(B,J, M)/P(J, M)JP(J A) 0.9P(J A) 0.05EAMP(M A) 0.7P(M A) 0.016211

Inference byenumerationCompute the joints (4 of them)Inference by enumerationCompute the joints (8 of them)P(B, J, M, A, E) joint matching Q AND E P(joint)P(Q E) joint matching EFor example: P(B J, M)1. Compute P(B,J, M)2. Compute P(J, M)P(B, J, M, A, E)P(B, J, M, A, E)P(joint)P(B, J, M, A, E)P(E) 0.002Each is O(N) for sparse graphP(x1, xN) P(B) 0.001 i P(xi parents(xi))Sum them upP(A P(A P(A P(A BB, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.0013. Return P(B,J, M)/P(J, M)E joint matching Q AND E P(joint)P(J, M,B,A,E)P(Q E) joint matching EFor example: P(B J, M)1. Compute P(B,J, M)A2. Compute P(J, M)JP(J A) 0.9P(J A) 0.05MP(AP(x B,, x E) 0.0011N) i P(xi parents(xi))3. Return P(B,J, M)/P(J, M)P(M A) 0.7P(M A) 0.0163P(J, M,B,A, E)P(J, M,B, A,E)P(joint)P(J, M,B, A, E)P(E) 0.002P(J, M, B,A,E)P(B) 0.001P(J, M, B,A, E)BEP(J, M, B, A,E)P(A B, P(J, M, B, A, E)E) 0.95P(A B, E) 0.94A graphEachisE) 0.29O(N) for sparseP(A B,Sum them upJMP(J A) 0.9P(J A) 0.05P(M A) 0.7P(M A) 0.0164Another ExampleInference by Enumeration joint matching Q AND E P(joint)Sum up 4 jointsP(Q E) joint matching E P(joint)Sum up 8 jointsFor example: P(B J, M)1. Compute P(B,J, M)2. Compute P(J, M)3. Return P(B,J, M)/P(J, M)P(E) 0.002P(B) 0.001BEIn general, if there areBoolean variables, jP(A B,NE) 0.95P(A B,query E) 0.94vars, and kAP(A B, E) 0.29P(A B,evidence E) 0.001 vars, howmany joints to sum up?JMP(J A) 0.9P(J A) 0.0565Compute P(R T, S) from the following Bayes NetP(S) 0.3MSP(M) 0.6P(R M) 0.3P(R M) 0.6P(LP(LP(LP(LP(M A) 0.7P(M A) 0.01 M, S) 0.05M, S) 0.1 M, S) 0.1 M, S) 0.2LP(T L) 0.3P(T L) 0.8RT6612

Another ExampleAnother ExampleStep 1: Compute P(R, T, S)Step 1: Compute P(R, T, S)Step 2: Compute P(T, S)Step 2: Compute P(T, S)Step 3: ReturnStep 3: ReturnP(R, T, S)-----------------P(T, S)P(s) 0.3P(LP(LP(LP(L Sum of all the rows in theJoint that match T SP(R, T, S)-----------------P(T, S)MSM, S) 0.05M, S) 0.1 M, S) 0.1 M, S) 0.2Sum of all the rows in theJoint that match R T SP(M) 0.6P(s) 0.3P(R M) 0.3P(R M) 0.6LP(T L) 0.3P(T L) 0.8P(LP(LP(LP(LRT M, S) 0.05M, S) 0.1 M, S) 0.1 M, S) 0.2MSP(M) 0.6P(R M) 0.3P(R M) 0.6LP(T L) 0.3P(T L) 0.8RTCompute P(R T, S)Compute P(R T, S)6869Another ExampleStep 1: Compute P(R, T, S)Sum of all the rows in theJoint that match R T SStep 2: Compute P(T, S)Step 3: ReturnP(L P(L P(L P(L M, S) 0.05M, S) 0.1 M, S) 0.1 M, S) 0.2S P(T, S)Note: Inference through a Bayes Net can goboth “forward” and “backward” across arcsSSum of all the rows in theJoint that match T SP(R, T, S)------------------------------------P(R, T, S) P( R, T, S)P(s) 0.34 joint computes 8 joint computesMEach of these obtained bythe “computing a jointP(M) 0.6probability entry” method inthe earlierslidesP(R½M) 0.3 P(R M) 0.6LP(T½L) 0.3P(T L) 0.8RTMCausal (top-down) inferenceL§ Given a cause, infer its effectsT§ E.g., P(T S)RDiagnostic (bottom-up) inference§ Given effects/symptoms, infer a cause§ E.g., P(S T)Compute P(R T, S)707113

The Good NewsThe Bad News We can do inference. That is, we can computeany conditional probability:P( Some variables Some other variable values )P( E1 Ù E2 )P( E1 E2 ) P ( E2 ) å P( joint entry)joint entries matching E1 and E2å P( joint entry) joint entries matching E2 “Inference by Enumeration” Algorithm 72In general if there are N variables and evidence contains jvariables, and each variable has k values, how manyjoints to sum up? k(N-j)It is this summation that makes inference byenumeration inefficient Computing conditional probabilities by enumerating allmatching entries in the joint is expensive:Exponential in the number of variablesSome computation can be saved by carefully ordering theterms and re-using intermediate results (variableelimination algorithm)A more complex algorithm called a join tree (junction tree)can save even more computationBut, even so, exact inference with an arbitrary BayesNet is NP-Complete77Compute P(d)Variable Elimination AlgorithmVTGeneral idea: Write query in the formxkx3x2XBDInitial factors:P(v, s, t, l, a, b, x, d) P(v)P(s)P(t v)P(l s)P(b s)P(a t,l)P(x a)P(d a,b)i IterativelyInference by Enumeration (i.e., brute force) approach:§ Move all irrelevant terms outside of innermost sum§ Perform innermost sum, getting a new termP(d) P(v, s,t,l,a,b, x,d)§ Insert the new term into the product79LANeed to eliminate: v, s, x, t, l, a, bP(xn , e ) ! P(xi pai )Sxbaltsv8014

VWe want to compute P(d)Need to eliminate: v, s, x, t, l, a, bSTAInitial factorsParameter (CPT) Learning for BN L§ Ask domain experts, or§ Learn from dataBXWhere do you get these CPT numbers?DP(v)P(s)P(t v)P(l s)P(b s)P(a t,l)P(x a)P(d a,b)Eliminate: vCompute:P(B) 0.001f v (t) P(v)P(t v)P(A P(A P(A P(A v f v (t)P(s)P(l s)P(b s)P(a t,l)P(x a)P(d a,b)BEP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJNote: fv(t) P(t)P(J A) 0.9P(J A) 0.05Idea behind VariableElimination Algorithm81Parameter (CPT) Learning for BN§ Learn from a data set like this:( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)94P(M A) 0.7P(M A) 0.0193Parameter (CPT) Learning for BN M§ Learn from a data set like this:How to learn this CPT?P(B) 0.001P(A P(A P(A P(A BP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJP(J A) 0.9P(J A) 0.05EMP(M A) 0.7P(M A) 0.01( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M) Count #(B) and #( B) in dataset.P(B) #(B) / [#(B) #( B)]P(B) 0.001P(A P(A P(A P(A BP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJP(J A) 0.9P(J A) 0.05EMP(M A) 0.7P(M A) 0.019515

Parameter (CPT) Learning for BNParameter (CPT) Learning for BN§ Learn from a data set like this:( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)§ Learn from a data set like this:Count #(E) and #( E) in dataset.P(E) #(E) / [#(E) #( E)]P(B) 0.001P(A P(A P(A P(A BEP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJMP(J A) 0.9P(J A) 0.05 P(M A) 0.7P(M A) 0.0196( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)P(A P(A P(A P(A BEP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJMP(J A) 0.9P(J A) 0.05P(M A) 0.7P(M A) 0.0197Parameter (CPT) Learning for BN§ Learn from a data set like this:( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)98P(B) 0.001 Parameter (CPT) Learning for BN Count #(A) and #( A) in datasetwhere B true and E true.P(A B,E) #(A) / [#(A) #( A)]§ Learn from a data set like this:Count #(A) and #( A) in datasetwhere B true and E false.P(A B, E) #(A) / [#(A) #( A)]P(B) 0.001P(A P(A P(A P(A BP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJP(J A) 0.9P(J A) 0.05EMP(M A) 0.7P(M A) 0.01( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M) Count #(A) and #( A) in datasetwhere B false and E true.P(A B,E) #(A) / [#(A) #( A)]P(B) 0.001P(A P(A P(A P(A BP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJP(J A) 0.9P(J A) 0.05EMP(M A) 0.7P(M A) 0.019916

Parameter (CPT) Learning for BNParameter (CPT) Learning for BN§ Learn from a data set like this:( B, E, A, J, M)( B, E, A, J, M)( B,p E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B,p E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B,p E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M) p§ ‘Unseen event’ problemCount #(A) and #( A) in datasetwhere B false and E false.P(A B, E) #(A) / [#(A) #( A)]P(B) 0.001P(A P(A P(A P(A BEP(E) 0.002B, E) 0.95B, E) 0.94 B, E) 0.29 B, E) 0.001AJP(J A) 0.9P(J A) 0.05MP(M A) 0.7P(M A) 0.01100( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)( B, E, A, J, M)(B, E, A, J, M)( B, E, A, J, M)Count #(A) and #( A) in datasetwhere B true and E true.P(A B,E) #(A) / [#(A) #( A)]What if there’s no row with(B, E, A, *, *) in the dataset?Do you want to setP(A B,E) 1P( A B,E) 0?Why or why not? 101Smoothing CPTsNaive Bayes Classifier Testing Phase§ “Add-1” smoothing: add 1 to all counts § In the previous example, count #(A) and #( A) indataset where B true and E true§ P(A B,E) [#(A) 1] / [#(A) 1 #( A) 1]For a given test instance defined by X1 v1, , Xn vn,compute𝑎𝑟𝑔𝑚𝑎𝑥( P(Y c) -* , 𝑃(𝑋* 𝑣* Y c)§ If #(A) 1, #( A) 0:§ without smoothing P(A B,E) 1, P( A B,E) 0Class variable§ with smoothing P(A B,E) 0.67, P( A B,E) 0.33§ If #(A) 100, #( A) 0: § without smoothing P(A B,E) 1, P( A B,E) 0§ with smoothing P(A B,E) 0.99, P( A B,E) 0.01§ Smoothing saves you when you don’t have enoughdata, and hides away when you do Evidence variableAssumes all evidence variables are conditionallyindependent of each other given the class variableRobust because it gives the right answer as long as thecorrect class is more likely than all others§ It’s a form of Maximum a posteriori (MAP) estimation10310417

A Special BN: Naïve Bayes ClassifiersBN Special Case: Naïve Bayes A special Bayes Net structure:J§ a ‘class’ variable Y at root, compute P(Y X1, , XN)§ evidence nodes Xi (observed features) are allleavesC§ conditional independence between all evidenceassumed. Usually not valid, but often empiricallyOK ZHJPerson is JuniorCBrought coat to classZLives in zipcode 53706HSaw “Hunger Games 1” morethan onceWhat’s stored in the CPTs?Y X1X2XN105106A Special BN: Naïve Bayes ClassifiersP(J) CP(C J) P(C J) 107JZHJPerson is JuniorCBrought coat to classZLives in zipcode 53706HSaw “Hunger Games 1” morethan onceBayesian Network PropertiesP(Z J) P(H J) P(Z J) P(H J) Bayesian Networks compactly encode jointdistributions Topology of a Bayesian Network is onlyguaranteed to encode conditionalindependencies§ Arcs do not necessarily represent causalrelations12918

What You Should Know Inference with full joint distributionProblems of full joint distributionBayesian Networks: representation (nodes,arcs, CPT) and meaningCompute joint probabilities from Bayes netInference by enumerationNaïve Bayes classifierYou are NOT responsible for the VariableElimination Algorithm13019

1.List all combinations of values (if each variable has kvalues, there are kNcombinations) 2.Assign each combination a probability 3.They should sum to 1 Weather Temperature Prob. Sunny Hot 150/365 Sunny Cold 50/365 Cloudy Hot 40/365 Cloudy Cold 60/365 Rainy Hot 5/365 Rainy Cold 60/365 3 Using the Full Joint Distribution

Related Documents: