Learning Optimal Bayesian Networks With Heuristic Search


Learning Optimal Bayesian Networks with Heuristic Search
Brandon Malone

Acknowledgements and Collaborators: Changhe Yuan, Susan Bridges, Eric Hansen, Zhaohua Peng, Andy Perkins, Feng Tan, Zhifa Liu, Nan Wang, and others.

1/49: What is a Bayesian network?
A Bayesian network has a DAG, which encodes the conditional dependencies among the variables, and conditional probability distributions, which quantify the relations among the variables. [Pearl, 1988]

2/49: Structure Learning
Many datasets are available online.

3/49: What is structure learning?
Given a dataset, structure learning means finding the network structure that optimizes a scoring function. Decomposability: the score of a network is the sum of local scores, one for each variable and its parent set. [Heckerman et al., Machine Learning '95, v. 20, 197-243]
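Written out (standard notation, assumed here rather than copied from the slide), decomposability says the network score is a sum of local terms, one per variable and its parent set:

\mathrm{Score}(G) \;=\; \sum_{i=1}^{n} \mathrm{Score}\left(X_i \mid \mathrm{PA}^{G}_{X_i}\right)

so each variable's parent set can be evaluated against the data independently, which is what the dynamic programming and search formulations below exploit.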

4/49: Exact Structure Learning Contributions
- Heuristic Search Perspective, A* [Yuan et al., IJCAI '11]
- Memory-Efficient DP [Malone et al., AAAI '11]
- Frontier BFBnB with DDD [Malone et al., UAI '11]
- k-Cycle Conflict Heuristic, PD [Yuan and Malone, UAI '12] (sub.)
- Anytime DFBnB [Malone and Yuan, KDD '12] (sub.)
- Parallel Anytime Weighted A* [Malone et al., PGM '12] (maybe)

5/49: Dynamic Programming
Begin with a single variable. Pick one variable as leaf and find its optimal parents. Pick another leaf and find its optimal parents from the current set. Continue picking leaves and finding optimal parents. [Silander, T. and Myllymäki, P., UAI-06, AUAI Press, 2006]
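As a rough illustration of the dynamic program (a minimal sketch, not the Silander-Myllymäki implementation; best_score is a caller-supplied function that returns the best local score of a variable given a set of candidate parents):

from itertools import combinations

def optimal_network_score(variables, best_score):
    """Dynamic programming over variable subsets: score[S] is the optimal
    score of a network over the subset S, built by choosing its last leaf."""
    score = {frozenset(): 0.0}
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            S = frozenset(subset)
            # Try each X in S as the leaf added last; its parents must come
            # from the remaining variables S - {X}.
            score[S] = min(score[S - {X}] + best_score(X, S - {X}) for X in S)
    return score[frozenset(variables)]

This returns only the optimal score; recovering the network itself requires keeping back-pointers to the chosen leaves and parent sets.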

6/49: Heuristic Search, Order Graph
A path from the top to the bottom node corresponds to a complete network. The shortest path corresponds to the optimal network. [Yuan et al., IJCAI '11]


7/49: Heuristic Search, Order Graph
Each node stores Score(V) for the given subset of variables. Following an arc in the graph corresponds to evaluating BestScore(X, V). [Yuan et al., IJCAI '11]

8/49: Full Parent Graphs
Each node stores BestScore(X, V) for the given subset of variables. Following an arc in the graph corresponds to evaluating Score(X | V). [Yuan et al., IJCAI '11]
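For reference, the quantity used throughout (standard definition, assumed rather than quoted from the slide):

\mathrm{BestScore}(X, V) \;=\; \min_{\mathrm{PA} \subseteq V} \mathrm{Score}(X \mid \mathrm{PA})

that is, the best local score X can achieve when its parents are restricted to the candidate set V.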

9/49: Sparse Parent Graphs
We have to calculate BestScore(X, U). Many parent sets are never optimal: if S ⊂ T and MDL(X | S) ≤ MDL(X | T), then T is never optimal. So do we need those suboptimal entries?
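That pruning rule translates directly into a filter over a variable's candidate parent sets. The following is a minimal sketch under assumed data structures (a dict from frozenset parent sets to MDL scores), not the code behind the presentation:

def prune_parent_sets(scores):
    """Keep only possibly optimal parent sets: drop T whenever some subset S
    of T already scores at least as well (MDL: lower is better).
    `scores` maps frozenset(parents) -> MDL score."""
    kept = {}
    # Process smaller sets first so every pruning witness is already in `kept`.
    for T, score_T in sorted(scores.items(), key=lambda kv: len(kv[0])):
        if any(S < T and score_S <= score_T for S, score_S in kept.items()):
            continue                      # some cheaper-or-equal subset exists
        kept[T] = score_T
    return kept

On the next slide's example, only four of the eight candidate sets survive this filter.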

10/49: Sparse Parent Graphs
Most parent sets are not optimal for any network. Existing methods (parent graphs) consider all sets. Our idea: only retain the possibly optimal scores.

Example scores for X1, all candidate parent sets:
Parents:              {X2,X3}  {X3}  {X2}  {X2,X3,X4}  {X2,X4}  ϕ   {X3,X4}  {X4}
Score(X1 | Parents):  5        6     8     8           12       10  9        13

After pruning, only the possibly optimal sets remain:
Parents:              {X2,X3}  {X3}  {X2}  ϕ
Score(X1 | Parents):  5        6     8     10

We can use bitwise operators to efficiently scan the list.

11/49: A* and Pruning
We expand nodes according to f(U) = g(U) + h(U), where g(U) is the cost of the edges so far and h(U) is a lower bound on the cost still to come. Therefore, we never explore unpromising network structures. [Yuan et al., IJCAI '11]

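A minimal sketch of the A* search over the order graph, assuming the best_score interface from the DP sketch and an admissible, consistent heuristic h over variable subsets; this is illustrative rather than the released implementation:

import heapq
from itertools import count

def astar_order_graph(variables, best_score, h):
    """Shortest path from the empty set to the full variable set; the path
    cost equals the optimal network score."""
    goal = frozenset(variables)
    start = frozenset()
    g = {start: 0.0}
    tie = count()                                  # tie-breaker for the heap
    open_list = [(h(start), next(tie), start)]
    closed = set()
    while open_list:
        _, _, U = heapq.heappop(open_list)
        if U == goal:
            return g[U]
        if U in closed:
            continue
        closed.add(U)
        for X in goal - U:                         # place one more variable as a leaf
            V = U | {X}
            new_g = g[U] + best_score(X, U)
            if new_g < g.get(V, float("inf")):
                g[V] = new_g
                heapq.heappush(open_list, (new_g + h(V), next(tie), V))
    return None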

12/49: A* Runtime Results
[Chart: running time in seconds (log scale) per dataset (name, variables), comparing BB, DP, A* with full parent graphs, and A* with sparse parent graphs.]

13/49: A* Order Graph Pruning Results
[Chart: nodes expanded (log scale) per dataset (name, variables), comparing the complete order graph with the nodes A* actually expands.]

14/49: Sparse Parent Graph Pruning
[Chart: maximum number of parent sets (log scale) for selected datasets (name, variables), comparing the full representation, the largest layer, and the sparse representation.]

15/49: Exploiting Layers and BFBnB
Problem: The size of the order graph is exponential in the number of variables.
Observation: To generate a node, only nodes in the previous layer of the order graph are required.
Solution: Only store a layer as long as necessary.
The technique also applies to parent graphs: before expanding each layer of the order graph, the corresponding layer of the parent graphs is expanded.
[Malone et al., AAAI '11]
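A minimal sketch of the layer-by-layer expansion (ignoring the branch-and-bound pruning and the external-memory machinery), assuming the same best_score interface as the earlier sketches:

def bfbnb_layers(variables, best_score):
    """Breadth-first expansion of the order graph one layer at a time;
    only the previous layer's costs are kept in memory."""
    goal = frozenset(variables)
    layer = {frozenset(): 0.0}                 # layer 0: the empty set
    for _ in range(len(variables)):
        next_layer = {}
        for U, g in layer.items():
            for X in goal - U:
                V = U | {X}
                cost = g + best_score(X, U)
                if cost < next_layer.get(V, float("inf")):
                    next_layer[V] = cost       # duplicate detection within a layer
        layer = next_layer                     # the previous layer can now be discarded
    return layer[goal]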

16/49: Delayed Duplicate Detection
Immediate duplicate detection keeps a hash table in RAM; delayed duplicate detection uses external-memory sorting. [Korf, AAAI '04]
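The idea behind delayed duplicate detection, as a toy in-memory sketch (not Korf's external-memory implementation): stream out generated nodes with their costs, sort, then merge runs of equal nodes. Node keys are assumed to be sortable, e.g., subsets encoded as bit-mask integers.

def delayed_duplicate_detection(generated):
    """`generated` is an iterable of (node, g) pairs, possibly with duplicates;
    sort by node, then keep only the cheapest g for each node."""
    merged = []
    for node, g in sorted(generated, key=lambda item: item[0]):
        if merged and merged[-1][0] == node:
            merged[-1] = (node, min(merged[-1][1], g))   # merge the duplicate
        else:
            merged.append((node, g))
    return merged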

17/49: BFBnB Runtime Results
[Chart: running time in seconds (log scale) per dataset, from adult (14 variables) to wdbc (31 variables), comparing DP, BFBnB with full parent graphs, A* with sparse parent graphs, and BFBnB with sparse parent graphs.]

18/49: BFBnB Order Graph Memory Results
[Chart: nodes stored (log scale) for the same datasets, from adult (14 variables) to wdbc (31 variables), comparing the complete order graph with the largest layer.]

19/49: Why did A* expand so many nodes?
Problem: The heuristic for pruning does not consider cycles, so it ignores interactions among parent sets (akin to the Manhattan distance heuristic for the sliding tile puzzle).
Observation: Breaking cycles among groups of variables would still give a lower bound on the network score.
Solution: Perform a BFS through the reverse order graph and select optimal parents by removing leaves. The result will have no cycles within each group.

20/49: The simple heuristic and cycles
By allowing all variables to select optimal parents, we often introduce many cycles among the variables:
h(n) = h(X1) + h(X2) + h(X3) + h(X4)
To address this problem, we split the variables into groups and break cycles within each group*:
h(n) = h(X1, X2) + h(X3, X4)
* We also use another technique based on the same intuition that does not split the variables.

21/49: Splitting Cycles
We then figure out the optimal way to break cycles. This could introduce new edges, so we use a search through the reverse order graph to calculate it.
For the group {X1, X2} (candidate subnetworks scoring 6, 8, 30, and 20): h(X1,X2) = h(X1) + h(X2) = 6 + 8 = 14, while the best cycle-free choice gives h'(X1,X2) = 20.
For the group {X3, X4} (candidate subnetworks scoring 5, 12, 22, and 27): h(X3,X4) = h(X3) + h(X4) = 5 + 12 = 17, while the best cycle-free choice gives h'(X3,X4) = 22.

22/49: Additive Pattern Databases
After splitting into groups, we calculate the heuristic for each group independently and add them together. h' is guaranteed to always be ≥ h.
h(X1,X2,X3,X4) = h(X1) + h(X2) + h(X3) + h(X4) = 6 + 8 + 5 + 12 = 31
h'(X1,X2,X3,X4) = h'(X1,X2) + h'(X3,X4) = 20 + 22 = 42
[Felner et al., JAIR '04]
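In general terms (notation assumed rather than copied from the slides): the simple heuristic lets every remaining variable take its unrestricted optimal parents, while the pattern-database heuristic partitions the remaining variables into groups and charges the cheapest cycle-free assignment within each group. Breaking cycles can only raise the cost, and cycles across groups are still ignored, so

h(U) \;\le\; h'(U) \;\le\; h^{*}(U)

where h^{*}(U) denotes the true remaining cost; h' therefore prunes more than h while remaining admissible.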

23/49: k-Cycle Pruning – Autos, 26
[Bar chart: nodes expanded by A* and BFBnB under each heuristic strategy: loose; dynamic with k = 2, 3, 4; static 9-9-8; static 13-13.]

24/49: k-Cycle Runtimes – Autos, 26
[Bar chart: running time in seconds for A* and BFBnB under the same strategies.]

25/49: k-Cycle Pruning – Flag, 29
[Bar chart: nodes expanded by A* and BFBnB under each heuristic strategy: loose; dynamic with k = 2, 3, 4; static 10-10-9; static 15-14.]

26/49: k-Cycle Runtimes – Flag, 29
[Bar chart: running time in seconds for A* and BFBnB under the same strategies.]

27/49: k-Cycle Runtimes – A*
[Chart: running time in seconds (log scale) per dataset (name, variables), comparing A* with the simple heuristic and A* with the pattern-database (PD) heuristic.]

28/49: k-Cycle Runtimes – BFBnB
[Chart: running time in seconds (log scale) per dataset (name, variables), comparing BFBnB with the simple heuristic and BFBnB with the pattern-database (PD) heuristic.]

29/49: 20,000 seconds seems like a long time
Problem: DP, A*, and BFBnB return no solution until the very end of the search.
Observation: With the sparse parent graphs, we can easily remove one candidate parent at a time.
Solution: Perform a DFBnB search through the reverse order graph and let the call stack do the heavy lifting.

30/49: Anytime Behavior and DFBnB
Leave breadcrumbs of incremental sparse parent graphs on the call stack. Find multiple paths from the start to the goal that improve the score: each time the search reaches the goal we have a solution, and after backtracking it continues looking for better ones. [Malone and Yuan, KDD '12]

31/49: Caching and DFBnB
Problem: DFS strategies suffer from expanding duplicates.
Observation: After backtracking from a node, we know the exact cost from it to the goal.
Solution: Use a closed list to store that, so we never need to re-expand a node. [Malone and Yuan, KDD '12]


32/49: Caching and DFBnB
Problem: If we find a better path to an already-expanded node (say g_old = 20 but later g_new = 15), we may inadmissibly prune.
Observation: We may find multiple better paths.
Solution: Use the closed list to store that, too, and iteratively "repair" such nodes. [Likhachev et al., NIPS '03] [Malone and Yuan, KDD '12]
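A minimal sketch of the DFBnB search with a cost-to-go cache, reusing the best_score interface assumed earlier and a heuristic h(U) that lower-bounds the cost of placing the variables still in U; the repairing of nodes reached later by better paths is omitted, and this is illustrative rather than the KDD '12 implementation:

def dfbnb_reverse_order(variables, best_score, h):
    """Depth-first branch and bound over the reverse order graph: start from
    the full variable set and repeatedly remove a leaf.  Returns the best
    (incumbent) network score found; with an admissible h it is optimal."""
    best = {"score": float("inf")}
    cache = {}                                    # exact cost-to-go for finished nodes

    def visit(U, g):
        if not U:                                 # every variable has been placed
            best["score"] = min(best["score"], g)
            return 0.0
        if U in cache:                            # solved exactly before: reuse it
            best["score"] = min(best["score"], g + cache[U])
            return cache[U]
        if g + h(U) >= best["score"]:             # bound: cannot beat the incumbent
            return None
        exact, complete = float("inf"), True
        for X in U:                               # choose the next leaf to remove
            edge = best_score(X, U - {X})         # X's best parents among the others
            below = visit(U - {X}, g + edge)
            if below is None:
                complete = False                  # that subtree was pruned
            else:
                exact = min(exact, edge + below)
        if complete:
            cache[U] = exact                      # after backtracking, cost-to-go is exact
            return exact
        return None

    visit(frozenset(variables), 0.0)
    return best["score"]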

33/49: DFBnB Anytime Results – Autos, 26. No iterations of repairing were performed.

34/49: DFBnB Anytime Results – Flag, 29. No iterations of repairing were performed.

35/49: DFBnB Anytime Results – Water, 38. No iterations of repairing were performed.

36/49: DFBnB Anytime Results – Lung, 57. No iterations of repairing were performed.

37/49: The future of computing

38/49: Parallel Window Search
IDA* is a depth-first (limited-memory) search algorithm that mimics A* with a threshold. Parallel window search distributes a series of IDA* searches with different thresholds to multiple cores. [Powley, C. & Korf, R. E., IJCAI 1989, 36-41]

39/49: Parallel Dovetailing
Many algorithms have tunable parameters: weighting, beam size, k, and so on. Limitation of the study: it quits after finding a single solution (regardless of optimality). [Valenzano, R. et al., ICAPS 2010]

40/49: Leveraging Distributed Memory
Observation: Weighted A* gives an ε-bounded solution (but typically only a single solution).
Simple idea: Run WA* with a gradient of ε values on multiple cores/processors (dovetailing).

41/49: PAWN*

42/49: PAWN* Advantages
- Embarrassingly parallel (no communication overhead)
- Proper selection of the ε values gives good anytime behavior for improving the upper bound
- Running with ε = 1 ensures eventual optimality (assuming resources are available)
- Running with ε = 1 also gives good anytime behavior for improving the lower bound
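A minimal sketch of the dovetailing idea (not the PAWN* implementation): run the same weighted A* search with several ε values in parallel and keep the best result. The search routine is passed in; weighted_astar here stands for any callable that maps (problem, ε) to (score, network).

from concurrent.futures import ProcessPoolExecutor

def parallel_dovetailing(weighted_astar, problem, epsilons=(1.0, 1.1, 1.25, 1.5, 2.0)):
    """Run one weighted A* search per epsilon in parallel; keep the best network."""
    with ProcessPoolExecutor(max_workers=len(epsilons)) as pool:
        futures = [pool.submit(weighted_astar, problem, eps) for eps in epsilons]
        results = [(f.result(), eps) for f, eps in zip(futures, epsilons)]
    # Each epsilon > 1 run is within that factor of optimal; the epsilon = 1 run
    # is plain A*, so if it finishes, its answer is provably optimal.
    (best_score, best_network), best_eps = min(results, key=lambda r: r[0][0])
    return best_score, best_network, best_eps

With a spread of ε values, the large-ε workers supply good incumbent solutions (upper bounds) early, while the ε = 1 worker tightens the lower bound, which is the anytime behavior shown on the following slides.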

43/49: Anytime Convergence Results – Flag, 29 variables
[Chart: MDL-score upper and lower bounds versus running time in seconds.] Error bound: 0% (optimal).

44/49: Anytime Convergence Results – Alarm, 37 variables
[Chart: MDL-score upper and lower bounds versus running time in seconds.] Error bound: 1.7%.

45/49: Anytime Convergence Results – SPECTF, 45 variables
[Chart: MDL-score upper and lower bounds versus running time in seconds.] Error bound: 0.3%.

46/49: Anytime Convergence Results – Lung Cancer, 57 variables
[Chart: MDL-score upper and lower bounds versus running time in seconds.] Error bound: 4.9%.

47/49: Proven Error – Anytime WA* vs. BB
[Chart comparing the proven error bounds of anytime weighted A* and branch and bound.]

48/49: Conclusions
Efficient data structures and regularities in problem structure can give orders-of-magnitude improvements for learning Bayesian network structures. A variety of search strategies are applicable, each with its own advantages and challenges. Bayesian network structure learning is usable in any domain in which we wish to reason but are unsure of the relationships among the variables.

49/49: Future Work
- Expert knowledge is available in many domains, and taking it into account could ease learning.
- Phase transitions: making hard problems easy.
- Hybrid search techniques combine score- and constraint-based learning; statistical tests can restrict the search space significantly.
- Computing local scores is often the most time-consuming step of the algorithm.
- Better data structures and suboptimal pruning can make sparse parent graphs much more efficient.

CALCULATING OPTIMAL PARENT SCORES WITHOUT PARENT GRAPHS

Say we have the scores for one variable. Sort them in ascending order of score. Mark which scores use which parents: for each candidate parent Xi, a bit vector uses[i] has one bit per entry in the sorted list, set if Xi appears in that entry's parent set.

The last leaf can use all scores, so the usable bit vector starts with every bit set. Say {X1} is then used as a leaf: the optimal score cannot include it. Find all scores which do not use {X1} with (usable & ~uses[1]). Look up the first set bit in the score-and-parents table: the optimal parents are {X2} and the score is 10.

Say {X2} is then used as a leaf. Then the optimal score cannot include it either. Find all remaining scores which do not use {X2} with (usable & ~uses[2]). Look up the first set bit in the score-and-parents table: the optimal parents are {X3} and the score is 13.
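The scan can be written directly with machine-word (here, Python integer) bit operations. A minimal sketch with illustrative data: the entries and masks below are hypothetical and simply reuse the pruned list from slide 10, not the numbers behind this example.

def best_usable_entry(entries, uses, usable, new_leaves):
    """entries[j] = (score, parent_set), sorted by ascending score;
    uses[i] is a bit mask whose j-th bit is set iff X_i is a parent in entries[j];
    usable is the current mask of still-admissible entries.
    Returns the updated mask and the best entry avoiding the new leaves."""
    for x in new_leaves:
        usable &= ~uses[x]                       # drop every entry that uses X_x
    if usable == 0:
        return usable, None                      # no admissible parent set remains
    j = (usable & -usable).bit_length() - 1      # index of the lowest set bit
    return usable, entries[j]

# Illustrative data: the pruned score list for X1 from slide 10.
entries = [(5, {2, 3}), (6, {3}), (8, {2}), (10, set())]
uses = {2: 0b0101, 3: 0b0011, 4: 0b0000}         # bit j set iff X_i is in entries[j]
usable = 0b1111                                  # a fresh leaf may use every entry
usable, best = best_usable_entry(entries, uses, usable, new_leaves=[3])
print(best)                                      # (8, {2}): best score not using X3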

Memory requirements: one usable data structure for each variable not yet added as a leaf, in each order-graph node*; one copy each of parents, scores, and uses for the entire search.
* Some preliminary results suggest that these can be cached and reused.

Exact Structure Learning Contributions
- Heuristic Search Perspective, A* [Yuan et al., IJCAI '11]
- Memory-Efficient DP [Malone et al., AAAI '11]
- Frontier BFBnB with DDD [Malone et al., UAI '11]
- k-Cycle Conflict Heuristic, PD [Yuan and Malone, UAI '12] (sub.)
- Anytime DFBnB [Malone and Yuan, KDD '12] (sub.)
- Parallel Anytime Weighted A* [Malone et al., PGM '12] (maybe)
