Management Of Probabilistic Data: Foundations And Challenges

1y ago
7 Views
1 Downloads
602.15 KB
62 Pages
Last View : 23d ago
Last Download : 3m ago
Upload by : Mia Martinelli
Transcription

Management of Probabilistic Data:Foundations and ChallengesNilesh Dalvi and Dan SuciuUniverisity of Washington11

Databases Are Deterministic Applications since 1970’s requiredprecise semantics– Accounting, inventory Database tools are deterministic– A tuple is an answer or is not Underlying theory assumes determinism– FO (First Order Logic)22

Future of Data ManagementWe need to cope with uncertainties ! Represent uncertainties as probabilities Extend data management tools to handleprobabilistic dataMajor paradigm shift affecting both foundationsand systems33

Uncertainties Everywhere In the schema mappings:– Data spaces– Pay as you go data integration[Halevy’2007] In the data mapping– Life science data integration– Object reconciliation, fuzzy joins[Philippi&Kohler’2006][Arasu’06] In the data itself– Data “by the masses”– Information Extraction– RFID data, sensor data[Gupta&Sarawagi’2006][Welbourne’2007]44

[B.Louie et al.2007]Example 1Data Integration in Life Sciences U2 integrates several biological databasesExample: find functional annotations of ABCD1User types: “Gene ABCD1”U2 finds 80 “related” proteinsRanks them by uncertainty scoreCorrect 9 functions are among top 11Need to represent uncertainties explicitlyEntrezProtein,Pfam,TIGRFAM,NCBI Blast,EntrezGene55

Example 2Information Extraction.52 A Goregaon West Mumbai .[Gupta&Sarawagi’2006]IDHouse-No StreetCity152Goregaon West Mumbai0.1152-AGoregaon West Mumbai0.4152GoregaonWest Mumbai0.2152-AGoregaonWest Mumbai0.22. . . . 20%2.Here probabilities are meaningfulPof suchextractionsare correct66

Example 3RFID Ecosystem at UW[Welbourne’2007]77

RFID data noisy– SIGHTING(tagID, antennaID, time) Derived data Probabilistic– “John entered Room 524 at 9:15” prob 0.6– “John carried laptop x77 at 11:03” prob 0.8– . Queries– “Which people were in Room 478 yesterday ?”Massive amounts of probabilistic data from RFIDs, 8sensors8

A Model for Uncertainties Data is probabilistic Queries formulated in a standard language Answers are annotated with probabilitiesThis talk: Probabilistic Databases99

Probabilistic databases:Long , Fuhr&Roellke:1997Dalvi&S:2004Widom:2005Focus today: the Query Evaluation Problem1010

Has this been solved by AI ?Fix qInput: DBInput: singProbabilisticProbabilisticinference[this talk]1111

Outline Data model Query evaluation Challenges1212

[Barbara et al.1992]What is a ProbabilisticDatabase (PDB) PJohn0.62Jim0.34Mary0.45John0.33Fred0.111313

[Barbara et al.1992]What is a ProbabilisticDatabase (PDB) n0.33Fred0.11What does it mean ? 1313

BackgroundFinite probability space (Ω, P)Ω {ω1, . . ., ωn} set of outcomesP : Ω [0,1]P(ω1) . . . P(ωn) 1Event: E Ω, P(E) ω E P(ω)“Independent”:P(E1 E2) P(E1) P(E2)“Mutual exclusive” or “disjoint”: P(E1E2) 014

Possible Worlds SemanticsObjectLaptop77Book302Ω PDBPossibleworlds1515

Possible Worlds SemanticsObjectTimeLaptop779:07Book302Ω {p 1p worlds1515

Possible Worlds SemanticsObjectLaptop77Book302Ω {Time9:079:18ObjectTime PersonObjectTime aryBook3029:18Johnp 1p 3p 1p orlds1515

Possible Worlds ohnp1Jimp2Maryp3Johnp4Fredp5ObjectTime PersonObjectTime PersonLaptop77 Object9:07JohnTime PersonLaptop77 Object9:07JohnTime PersonBook302 Laptop779:18MaryObject9:07JohnTime PersonObjectTime PersonBook302 Laptop779:18John9:07JimBook302 Laptop779:18Fred9:07Object9:07JimTimeLaptop77Jim PersonBook3029:18Mary1 3Book302 Laptop779:18John1 4Book3029:18 9:07Fred JohnΩ {pp}PDBPossibleworldsppp1(1- p3-p4-p5)1515

Possible Worlds ohnp1Jimp2Maryp3Johnp4Fredp5PDBObjectTime PersonObjectTime PersonLaptop77 Object9:07JohnTime PersonLaptop77 Object9:07JohnTime PersonBook302 Laptop779:18MaryObject9:07JohnTime PersonObjectTime PersonBook302 Laptop779:18John9:07JimBook302 7JimBook3029:18Mary1 3ObjectTime PersonBook302 Laptop779:18John1 49:07Fred JohnTime PersonBook3029:18ObjectLaptop77 Object9:07JimTime PersonBook302 Object9:18MaryTime Person15Time PersonBook302 Object9:18John13 4 5Book3029:18FredΩ {pp}Possibleworldsppp (1- p -p -p )15

DefinitionsDefinition: A tuple-disjoint/independent table is:R(A1, A2, , Am, B1, , Bn, P)Definition: A tuple-independent table is:R(A1, A2, , Am, P)Definition: Semantics is given by possible worlds1616

HasObject(Object, Time, Person, ets(Person1, Person2, Time, nMary9:20p3Independent1717

Query SemanticsA boolean query q is an event: {ω ω q }P(q) ω q P(ω)Did someone take MyBook to the CoffeeRoom ?q HasObject(‘MyBook’,x,t), EnterRoom(x,’CoffeeRoom’,t) P(q) 0.96 (meaning: quite likely18!)18

Discussion of Data ModelTuple-disjoint/independent tables: Simple model, can store in any DBMSMore advanced models:Fuhr and Roellke Symbolic boolean expressions[Widom05, Das Sarma’06, Benjelloun 06] Trio: add lineage[Getoor’2006] Probabilistic Relational Models Graphical models[Sen&Desphande’07]1919

Outline Data model Query evaluation– Probability of Boolean expressions– From queries to Boolean expressions– Data complexity of query evaluation Challenges2020

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Compute P(Φ)2121

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Ω )1111p 1p 2p 31P(1-p1)p2p3Compute P(Φ)Φ102121

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Ω )1111p 1p 2p 31P(1-p1)p2p3Compute P(Φ)Φ10Pr(Φ) (1-p1)p2p3 p1(1-p2)p3 p1p2(1-p3) p1p2p32121

BackgroundFix P(X1) P(X2) . . . P(Xn) 1/2[Valiant:1979]TheoremExact evaluation of Pr(Φ) is #P-completeTheorem For DNF ΦApproximation of Pr(Φ) is in PTIME(FPTRAS)[Karp&Luby:1983]Both theorems extend to rational P(X1), . . . , P(Xn)[Graedel,Gurevitch,Hirsch:1998]2222

Query q Database PDB Φq R(x, y), S(x, z)SpRpPDB ABPa1b1p1a2b2p2ACPa1c1q1X1 a1X2 aY1c2q2Y22c3q3Y3a2c4q4Y4a2c5q5Y52323

Query q Database PDB Φq R(x, y), S(x, z)SpRpPDB ABPa1b1p1a2b2p2ACPa1c1q1X1 a1X2 aY1c2q2Y22c3q3Y3a2c4q4Y4a2c5q5Y5 Φ X1Y1 Ç X1Y2 Ç X2Y3 Ç X2Y4 Ç X2Y52323

Application to Query EvaluationCorollary Fix FO query qExact evaluation of Pr(q) on input PDB is in #PCorollary Fix a conjunctive query q.Approximation of Pr(q) on input PDB is in PTIME(FPTRAS)[Graedel,Gurevitch,Hirsch:1998]2424

Background:Probabilistic NetworksR(x, y), S(x, z)Φ X1Y1ÇX1Y2ÇX2Y3ÇX2Y4ÇX2Y5Inference: hard in generalKR techniques exploit localÇproperties:ÇÇE.g. bounded treewidth PTIME[Zabiyaka&Darwiche’06]Note: for this querythe treewidth 5

[D&S’2004]Π q R(x, y), S(x, z)JoinΠAΠARp(A,B)Sp(A,C)2626

[D&S’2004]Π q R(x, y), S(x, a2c3q3a2b2p2a2c4q4a2c5q52626

[D&S’2004]Π q R(x, y), S(x, p2a2c4q4a2c5q52626

P(q) [D&S’2004]1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))Π q R(x, y), S(x, a2c4q4a2c5q52626

P(q) [D&S’2004]1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))Π “safe plan”q JoinR(x, y), S(x, z)ΠAΠARp(A,B)ABPa1b1p1a2b2p2Sp(A,C)The data complexityof this query is q1a1c2q2a2c3q3a2c4q4a2c5q52626

Dichotomy TheoremLet q be a conjunctive query without self-joinsTheorem One of the following holds:[D&S’2004](1) Either q is in PTIME(2) Or q is #P hardIn Case (1) q can be computed by a “safe plan” and wecall it a “safe query”[Andritsos 27et al’2006]27

PTIME Queries #P-Hard Queries28

PTIME Queries #P-Hard QueriesR(x, y), S(x, z)R(x, y), S(y), T(‘a’, y)R(x), S(x, y), T(y), U(u, y), W(‘a’, u). . .h1 R(x), S(x, y), T(y)h2 R(x,y), S(y)h3 R(x,y), S(x,y). . .How do we decide if a query is in PTIME or #P hard ?28

Hierarchical Queriessg(x) set of subgoals containing the variable x in key positionsDefinition A query q is hierarchical if forall x, y:sg(x) sg(y) or sg(x) sg(y) or sg(x) sg(y) 2929

Hierarchical Queriessg(x) set of subgoals containing the variable x in key positionsDefinition A query q is hierarchical if forall x, y:sg(x) sg(y) or sg(x) sg(y) or sg(x) sg(y) Non-hierarchicalHierarchicalq R(x, y), S(x, z)xy RSzh1 R(x), S(x, y), T(y)xyRST2929

[D&S’2004]Case 1: Independent Tuples OnlyPTIME Queries:Fact If q is hierarchical then q is in PTIME3030

[D&S’2004]Case 1: Independent Tuples OnlyPTIME Queries:Fact If q is hierarchical then q is in PTIMEq R(x, y), S(x, z)xy RSIndependentprojectΠ-x JoinxzThe hierarchy gives the safe plan !1. Root variable u Π-u2. Connected components JoinΠ-yΠ-zRp(x,y)Sp(x,z)3030

[D&S’2004]Case 1: Independent Tuples Only#P-hard Queries:Recall:h1 R(x), S(x, y), T(y)h1 is #P-hard (reduction from Partitioned Positive 2DNF)[Provan&Ball’83]Fact If q is non-hierarchical then it is #P-hard.Proof: it “contains” h1:q . . . R(x, . .), S(x, y, . . .), T(y, . . .) . . .3131

[D&S’2004]Case 1: Independent Tuples Only#P-hard Queries:Recall:h1 R(x), S(x, y), T(y)h1 is #P-hard (reduction from Partitioned Positive 2DNF)[Provan&Ball’83]Fact If q is non-hierarchical then it is #P-hard.Proof: it “contains” h1:q . . . R(x, . .), S(x, y, . . .), T(y, . . .) . . .Theorem Testing if q is PTIME or #P-hard is in AC31031

Case 2: Independent/disjointTuplesΠPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)TSRuUJoinuΠ-yDyxDJoinyΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs ΠDWp(‘a’,u)Tp(y)Up(u,y)JoinxRp(x)Sp(x,y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)TSRuUWp(‘a’,u)JoinyΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)yxTSRuUΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs jectTp(y)Up(u,y)JoinxRp(x)Sp(x,y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)yxTSRuUΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs )3232

Case 2: Independent/disjointTuples#P-hard Queries:Recall:h1 R(x), S(x, y), T(y)h2 R(x,y), S(y)h3 R(x,y), S(x,y)#P-hard by reductionfrom PERMANENTIf the safe-plan algorithm fails on q, then q can be“rewritten” to either h1 or h2 or h3 and hence is #P-hard(see paper for details)Theorem Testing if q is PTIME or #P-hard is PTIME 33complete33

Summary on Query EvaluationWe understand completely only queries w/o self-joinsLessons learned from our system MystiQ: When the query is safe:– Evaluate it exactly, in the database engine– Performance: close to regular SQL When the query is unsafe[Re’2007]– Approximate it, compute only top-k– Performance: one or two orders of magnitude worse3434

Outline Data model Query evaluation Challenges3535

[Re’2007,Re’2007b]Query OptimizationEven a #P-hard query often has subqueriesthat are in PTIME. Needed: Combine safe plans probabilistic inference “Interesting independence/disjointness” Model a probabilistic engine as black-boxCHALLENGE: Integrate a black-boxprobabilistic inference in a query processor.3636

Probabilistic Inference AlgorithmsOpen the box ! Logical to physicalExamine specific algorithms from KR: Variable elimination[Sen&Deshpande’2007] Junction trees[Bravo&Ramakrishnan’2007] Bounded treewidthCHALLENGE: (1) Study the space ofoptimization alternatives. (2) Estimate the costof specific probabilistic inference algorithms.3737

Open Theory Problems Self-joins are much harder to study– Solved only for independent tuples[D&S’2007] Extend to richer query language– Unions, predicates ( , , ), aggregates Do hardness results still hold for Pr 1/2 ?CHALLENGE: Complete the analysis of thequery complexity over probabilistic databases3838

Complex Probabilistic Model Independent and disjoint tuples areinsufficient for real applications Capturing complex correlations:– Lineage– Graphical models[Das nde’07]CHALLENGE: Explore the connectionbetween complex models and views39[Verma&Pearl’1990]39

[Shen’06, Andritsos’06, Richardson’06,Chaudhuri’07]ConstraintsNeeded to clean uncertainties in the data Hard constraints:– Semantics conditional probability Soft constraints:– What is the semantics ?Lots of prior work, but still little understoodCHALLENGE: Study the impact of hard/soft constraints on query evaluation 4040

on LeakageA view V should not leak information abouta secret SP(S) P(S V) Issues: Which prior P ? What is ?Probability Logic: U V means P(V U) 1[Pearl’88, Adams’98]CHALLENGE: Define a probability logic forreasoning about information leakage 4141

Conclusions Prohibitive cost of cleaning data Represent uncertainties explicitly Need to re-examine many assumptions4242

Conclusions Prohibitive cost of cleaning data Represent uncertainties explicitly Need to re-examine many assumptionsA call to arms:The management of probabilistic data4242

A Model for Uncertainties Data is probabilistic Queries formulated in a standard language Answers are annotated with probabilities This talk: Probabilistic Databases 9. 10 Probabilistic databases: Long History Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:1997

Related Documents:

deterministic polynomial-time algorithms. However, as argued next, we can gain a lot if we are willing to take a somewhat non-traditional step and allow probabilistic veriflcation procedures. In this primer, we shall survey three types of probabilistic proof systems, called interactive proofs, zero-knowledge proofs, and probabilistic checkable .

non-Bayesian approach, called Additive Regularization of Topic Models. ARTM is free of redundant probabilistic assumptions and provides a simple inference for many combined and multi-objective topic models. Keywords: Probabilistic topic modeling · Regularization of ill-posed inverse problems · Stochastic matrix factorization · Probabilistic .

Oct 04, 2010 · COS 513: Foundations of Probabilistic Modeling Young-suk Lee Lecture 5 1 Administrative Midterm report is due Oct. 29th. Recitation is at 4:26pm in Friend 108. R is a computer language for statistical computing and graphics, and is highly recommended for this class. RSee

In contrast, pile-supported foundations transmit design loads into the adjacent soil mass through pile friction, end bearing, or both. This chapter addresses footing foundations. Pile foundations are covered in Chapter 5, Pile Foundations-General. Each individual footing foundation must be sized so that the maximum soil-bearing pressure does not exceed the allowable soil bearing capacity of .

It is an honour for Assifero to present this guide to community foundations in Italy. The community philanthropy movement is growing rapidly all over the world. In Italy, the establishment of community foundations began in 1999 with foundations in Lecco and Como. There are now 37 registered Italian community foundations (based on the atlas of

Probability and Computing Randomized Algorithms and Probabilistic Analysis '. '. . . \ Michael Mitzenmacher Eli Upfal . Probability and Computing Randomization and probabilistic techniques play an important role in modern com .

vides a language for expressing probabilistic polynomial-time protocol steps, a speci cation method based on a compositional form of equivalence, and a logical basis for reasoning about equivalence. The process calculus is a variant of CCS, with bounded replication and probabilistic polynomial-time expressions allowed in messages and boolean tests.

JAPANESE SECOND LANGUAGE Written examination Thursday 26 November 2020 Reading time: 11.45 am to 12.00 noon (15 minutes) Writing time: 12.00 noon to 2.00 pm (2 hours) QUESTION AND ANSWER BOOK Structure of book Section Number of questions Number of questions to be answered Number of marks 1 – Part A – Part B 1 1 1 1 10 10 2 – Part A – Part B 1 1 1 1 20 15 3 4 1 20 Total 75 Students .