Management Of Probabilistic Data: Foundations And Challenges

1y ago

7 Views

1 Downloads

602.15 KB

62 Pages

Last View : 23d ago

Last Download : 3m ago

Upload by : Mia Martinelli

Report this link

Download PDF

Transcription

Management of Probabilistic Data:Foundations and ChallengesNilesh Dalvi and Dan SuciuUniverisity of Washington11

Databases Are Deterministic Applications since 1970’s requiredprecise semantics– Accounting, inventory Database tools are deterministic– A tuple is an answer or is not Underlying theory assumes determinism– FO (First Order Logic)22

Future of Data ManagementWe need to cope with uncertainties ! Represent uncertainties as probabilities Extend data management tools to handleprobabilistic dataMajor paradigm shift affecting both foundationsand systems33

Uncertainties Everywhere In the schema mappings:– Data spaces– Pay as you go data integration[Halevy’2007] In the data mapping– Life science data integration– Object reconciliation, fuzzy joins[Philippi&Kohler’2006][Arasu’06] In the data itself– Data “by the masses”– Information Extraction– RFID data, sensor data[Gupta&Sarawagi’2006][Welbourne’2007]44

[B.Louie et al.2007]Example 1Data Integration in Life Sciences U2 integrates several biological databasesExample: find functional annotations of ABCD1User types: “Gene ABCD1”U2 finds 80 “related” proteinsRanks them by uncertainty scoreCorrect 9 functions are among top 11Need to represent uncertainties explicitlyEntrezProtein,Pfam,TIGRFAM,NCBI Blast,EntrezGene55

Example 2Information Extraction.52 A Goregaon West Mumbai .[Gupta&Sarawagi’2006]IDHouse-No StreetCity152Goregaon West Mumbai0.1152-AGoregaon West Mumbai0.4152GoregaonWest Mumbai0.2152-AGoregaonWest Mumbai0.22. . . . 20%2.Here probabilities are meaningfulPof suchextractionsare correct66

Example 3RFID Ecosystem at UW[Welbourne’2007]77

RFID data noisy– SIGHTING(tagID, antennaID, time) Derived data Probabilistic– “John entered Room 524 at 9:15” prob 0.6– “John carried laptop x77 at 11:03” prob 0.8– . Queries– “Which people were in Room 478 yesterday ?”Massive amounts of probabilistic data from RFIDs, 8sensors8

A Model for Uncertainties Data is probabilistic Queries formulated in a standard language Answers are annotated with probabilitiesThis talk: Probabilistic Databases99

Probabilistic databases:Long , Fuhr&Roellke:1997Dalvi&S:2004Widom:2005Focus today: the Query Evaluation Problem1010

Has this been solved by AI ?Fix qInput: DBInput: singProbabilisticProbabilisticinference[this talk]1111

Outline Data model Query evaluation Challenges1212

[Barbara et al.1992]What is a ProbabilisticDatabase (PDB) PJohn0.62Jim0.34Mary0.45John0.33Fred0.111313

[Barbara et al.1992]What is a ProbabilisticDatabase (PDB) n0.33Fred0.11What does it mean ? 1313

BackgroundFinite probability space (Ω, P)Ω {ω1, . . ., ωn} set of outcomesP : Ω [0,1]P(ω1) . . . P(ωn) 1Event: E Ω, P(E) ω E P(ω)“Independent”:P(E1 E2) P(E1) P(E2)“Mutual exclusive” or “disjoint”: P(E1E2) 014

Possible Worlds SemanticsObjectLaptop77Book302Ω PDBPossibleworlds1515

Possible Worlds SemanticsObjectTimeLaptop779:07Book302Ω {p 1p worlds1515

Possible Worlds SemanticsObjectLaptop77Book302Ω {Time9:079:18ObjectTime PersonObjectTime aryBook3029:18Johnp 1p 3p 1p orlds1515

Possible Worlds ohnp1Jimp2Maryp3Johnp4Fredp5ObjectTime PersonObjectTime PersonLaptop77 Object9:07JohnTime PersonLaptop77 Object9:07JohnTime PersonBook302 Laptop779:18MaryObject9:07JohnTime PersonObjectTime PersonBook302 Laptop779:18John9:07JimBook302 Laptop779:18Fred9:07Object9:07JimTimeLaptop77Jim PersonBook3029:18Mary1 3Book302 Laptop779:18John1 4Book3029:18 9:07Fred JohnΩ {pp}PDBPossibleworldsppp1(1- p3-p4-p5)1515

Possible Worlds ohnp1Jimp2Maryp3Johnp4Fredp5PDBObjectTime PersonObjectTime PersonLaptop77 Object9:07JohnTime PersonLaptop77 Object9:07JohnTime PersonBook302 Laptop779:18MaryObject9:07JohnTime PersonObjectTime PersonBook302 Laptop779:18John9:07JimBook302 7JimBook3029:18Mary1 3ObjectTime PersonBook302 Laptop779:18John1 49:07Fred JohnTime PersonBook3029:18ObjectLaptop77 Object9:07JimTime PersonBook302 Object9:18MaryTime Person15Time PersonBook302 Object9:18John13 4 5Book3029:18FredΩ {pp}Possibleworldsppp (1- p -p -p )15

DefinitionsDefinition: A tuple-disjoint/independent table is:R(A1, A2, , Am, B1, , Bn, P)Definition: A tuple-independent table is:R(A1, A2, , Am, P)Definition: Semantics is given by possible worlds1616

HasObject(Object, Time, Person, ets(Person1, Person2, Time, nMary9:20p3Independent1717

Query SemanticsA boolean query q is an event: {ω ω q }P(q) ω q P(ω)Did someone take MyBook to the CoffeeRoom ?q HasObject(‘MyBook’,x,t), EnterRoom(x,’CoffeeRoom’,t) P(q) 0.96 (meaning: quite likely18!)18

Discussion of Data ModelTuple-disjoint/independent tables: Simple model, can store in any DBMSMore advanced models:Fuhr and Roellke Symbolic boolean expressions[Widom05, Das Sarma’06, Benjelloun 06] Trio: add lineage[Getoor’2006] Probabilistic Relational Models Graphical models[Sen&Desphande’07]1919

Outline Data model Query evaluation– Probability of Boolean expressions– From queries to Boolean expressions– Data complexity of query evaluation Challenges2020

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Compute P(Φ)2121

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Ω )1111p 1p 2p 31P(1-p1)p2p3Compute P(Φ)Φ102121

Probability of BooleanExpressionsΦ X1X2 Ç X1X3 Ç X2X3P(X1) p1 , P(X2) p2, P(X3) p3Ω )1111p 1p 2p 31P(1-p1)p2p3Compute P(Φ)Φ10Pr(Φ) (1-p1)p2p3 p1(1-p2)p3 p1p2(1-p3) p1p2p32121

BackgroundFix P(X1) P(X2) . . . P(Xn) 1/2[Valiant:1979]TheoremExact evaluation of Pr(Φ) is #P-completeTheorem For DNF ΦApproximation of Pr(Φ) is in PTIME(FPTRAS)[Karp&Luby:1983]Both theorems extend to rational P(X1), . . . , P(Xn)[Graedel,Gurevitch,Hirsch:1998]2222

Query q Database PDB Φq R(x, y), S(x, z)SpRpPDB ABPa1b1p1a2b2p2ACPa1c1q1X1 a1X2 aY1c2q2Y22c3q3Y3a2c4q4Y4a2c5q5Y52323

Query q Database PDB Φq R(x, y), S(x, z)SpRpPDB ABPa1b1p1a2b2p2ACPa1c1q1X1 a1X2 aY1c2q2Y22c3q3Y3a2c4q4Y4a2c5q5Y5 Φ X1Y1 Ç X1Y2 Ç X2Y3 Ç X2Y4 Ç X2Y52323

Application to Query EvaluationCorollary Fix FO query qExact evaluation of Pr(q) on input PDB is in #PCorollary Fix a conjunctive query q.Approximation of Pr(q) on input PDB is in PTIME(FPTRAS)[Graedel,Gurevitch,Hirsch:1998]2424

Background:Probabilistic NetworksR(x, y), S(x, z)Φ X1Y1ÇX1Y2ÇX2Y3ÇX2Y4ÇX2Y5Inference: hard in generalKR techniques exploit localÇproperties:ÇÇE.g. bounded treewidth PTIME[Zabiyaka&Darwiche’06]Note: for this querythe treewidth 5

[D&S’2004]Π q R(x, y), S(x, z)JoinΠAΠARp(A,B)Sp(A,C)2626

[D&S’2004]Π q R(x, y), S(x, a2c3q3a2b2p2a2c4q4a2c5q52626

[D&S’2004]Π q R(x, y), S(x, p2a2c4q4a2c5q52626

P(q) [D&S’2004]1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))Π q R(x, y), S(x, a2c4q4a2c5q52626

P(q) [D&S’2004]1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))Π “safe plan”q JoinR(x, y), S(x, z)ΠAΠARp(A,B)ABPa1b1p1a2b2p2Sp(A,C)The data complexityof this query is q1a1c2q2a2c3q3a2c4q4a2c5q52626

Dichotomy TheoremLet q be a conjunctive query without self-joinsTheorem One of the following holds:[D&S’2004](1) Either q is in PTIME(2) Or q is #P hardIn Case (1) q can be computed by a “safe plan” and wecall it a “safe query”[Andritsos 27et al’2006]27

PTIME Queries #P-Hard Queries28

PTIME Queries #P-Hard QueriesR(x, y), S(x, z)R(x, y), S(y), T(‘a’, y)R(x), S(x, y), T(y), U(u, y), W(‘a’, u). . .h1 R(x), S(x, y), T(y)h2 R(x,y), S(y)h3 R(x,y), S(x,y). . .How do we decide if a query is in PTIME or #P hard ?28

Hierarchical Queriessg(x) set of subgoals containing the variable x in key positionsDefinition A query q is hierarchical if forall x, y:sg(x) sg(y) or sg(x) sg(y) or sg(x) sg(y) 2929

Hierarchical Queriessg(x) set of subgoals containing the variable x in key positionsDefinition A query q is hierarchical if forall x, y:sg(x) sg(y) or sg(x) sg(y) or sg(x) sg(y) Non-hierarchicalHierarchicalq R(x, y), S(x, z)xy RSzh1 R(x), S(x, y), T(y)xyRST2929

[D&S’2004]Case 1: Independent Tuples OnlyPTIME Queries:Fact If q is hierarchical then q is in PTIME3030

[D&S’2004]Case 1: Independent Tuples OnlyPTIME Queries:Fact If q is hierarchical then q is in PTIMEq R(x, y), S(x, z)xy RSIndependentprojectΠ-x JoinxzThe hierarchy gives the safe plan !1. Root variable u Π-u2. Connected components JoinΠ-yΠ-zRp(x,y)Sp(x,z)3030

[D&S’2004]Case 1: Independent Tuples Only#P-hard Queries:Recall:h1 R(x), S(x, y), T(y)h1 is #P-hard (reduction from Partitioned Positive 2DNF)[Provan&Ball’83]Fact If q is non-hierarchical then it is #P-hard.Proof: it “contains” h1:q . . . R(x, . .), S(x, y, . . .), T(y, . . .) . . .3131

Case 2: Independent/disjointTuplesΠPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)TSRuUJoinuΠ-yDyxDJoinyΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs ΠDWp(‘a’,u)Tp(y)Up(u,y)JoinxRp(x)Sp(x,y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)TSRuUWp(‘a’,u)JoinyΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)yxTSRuUΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs jectTp(y)Up(u,y)JoinxRp(x)Sp(x,y)3232

Case 2: Independent/disjointTuplesΠDisjointPTIME Queries:-uR(x), S(x, y), T(y), U(u, y), W(‘a’, u)yxTSRuUΠ-xIW1. Root variable ΠI2. CC’s Join3. Constant key attrs )3232

Case 2: Independent/disjointTuples#P-hard Queries:Recall:h1 R(x), S(x, y), T(y)h2 R(x,y), S(y)h3 R(x,y), S(x,y)#P-hard by reductionfrom PERMANENTIf the safe-plan algorithm fails on q, then q can be“rewritten” to either h1 or h2 or h3 and hence is #P-hard(see paper for details)Theorem Testing if q is PTIME or #P-hard is PTIME 33complete33

Summary on Query EvaluationWe understand completely only queries w/o self-joinsLessons learned from our system MystiQ: When the query is safe:– Evaluate it exactly, in the database engine– Performance: close to regular SQL When the query is unsafe[Re’2007]– Approximate it, compute only top-k– Performance: one or two orders of magnitude worse3434

Outline Data model Query evaluation Challenges3535

[Re’2007,Re’2007b]Query OptimizationEven a #P-hard query often has subqueriesthat are in PTIME. Needed: Combine safe plans probabilistic inference “Interesting independence/disjointness” Model a probabilistic engine as black-boxCHALLENGE: Integrate a black-boxprobabilistic inference in a query processor.3636

Probabilistic Inference AlgorithmsOpen the box ! Logical to physicalExamine specific algorithms from KR: Variable elimination[Sen&Deshpande’2007] Junction trees[Bravo&Ramakrishnan’2007] Bounded treewidthCHALLENGE: (1) Study the space ofoptimization alternatives. (2) Estimate the costof specific probabilistic inference algorithms.3737

Open Theory Problems Self-joins are much harder to study– Solved only for independent tuples[D&S’2007] Extend to richer query language– Unions, predicates ( , , ), aggregates Do hardness results still hold for Pr 1/2 ?CHALLENGE: Complete the analysis of thequery complexity over probabilistic databases3838

Complex Probabilistic Model Independent and disjoint tuples areinsufficient for real applications Capturing complex correlations:– Lineage– Graphical models[Das nde’07]CHALLENGE: Explore the connectionbetween complex models and views39[Verma&Pearl’1990]39

[Shen’06, Andritsos’06, Richardson’06,Chaudhuri’07]ConstraintsNeeded to clean uncertainties in the data Hard constraints:– Semantics conditional probability Soft constraints:– What is the semantics ?Lots of prior work, but still little understoodCHALLENGE: Study the impact of hard/soft constraints on query evaluation 4040

on LeakageA view V should not leak information abouta secret SP(S) P(S V) Issues: Which prior P ? What is ?Probability Logic: U V means P(V U) 1[Pearl’88, Adams’98]CHALLENGE: Define a probability logic forreasoning about information leakage 4141

Conclusions Prohibitive cost of cleaning data Represent uncertainties explicitly Need to re-examine many assumptions4242

Conclusions Prohibitive cost of cleaning data Represent uncertainties explicitly Need to re-examine many assumptionsA call to arms:The management of probabilistic data4242

A Model for Uncertainties Data is probabilistic Queries formulated in a standard language Answers are annotated with probabilities This talk: Probabilistic Databases 9. 10 Probabilistic databases: Long History Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:1997

Related Documents:

Probabilistic Proof Systems: A Primer

deterministic polynomial-time algorithms. However, as argued next, we can gain a lot if we are willing to take a somewhat non-traditional step and allow probabilistic veriﬂcation procedures. In this primer, we shall survey three types of probabilistic proof systems, called interactive proofs, zero-knowledge proofs, and probabilistic checkable .

26 Views

3y ago

Tutorial on Probabilistic Topic Modeling: Additive Regularization for ...

non-Bayesian approach, called Additive Regularization of Topic Models. ARTM is free of redundant probabilistic assumptions and provides a simple inference for many combined and multi-objective topic models. Keywords: Probabilistic topic modeling · Regularization of ill-posed inverse problems · Stochastic matrix factorization · Probabilistic .

8 Views

1y ago

COS 513: Foundations of Probabilistic Modeling

Oct 04, 2010 · COS 513: Foundations of Probabilistic Modeling Young-suk Lee Lecture 5 1 Administrative Midterm report is due Oct. 29th. Recitation is at 4:26pm in Friend 108. R is a computer language for statistical computing and graphics, and is highly recommended for this class. RSee

5 Views

2y ago

Foundation Manual Chapter4, Footing Foundations

In contrast, pile-supported foundations transmit design loads into the adjacent soil mass through pile friction, end bearing, or both. This chapter addresses footing foundations. Pile foundations are covered in Chapter 5, Pile Foundations-General. Each individual footing foundation must be sized so that the maximum soil-bearing pressure does not exceed the allowable soil bearing capacity of .

80 Views

3y ago

A guide to Community Foundations in Italy - ECFI

It is an honour for Assifero to present this guide to community foundations in Italy. The community philanthropy movement is growing rapidly all over the world. In Italy, the establishment of community foundations began in 1999 with foundations in Lecco and Como. There are now 37 registered Italian community foundations (based on the atlas of

18 Views

1y ago

Randomized Algorithms and Probabilistic Analysis Michael ...

Probability and Computing Randomized Algorithms and Probabilistic Analysis '. '. . . \ Michael Mitzenmacher Eli Upfal . Probability and Computing Randomization and probabilistic techniques play an important role in modern com .

78 Views

3y ago

A PROBABILISTIC POLYNOMIAL-TIME PROCESS CALCULUS FOR THE ...

vides a language for expressing probabilistic polynomial-time protocol steps, a speci cation method based on a compositional form of equivalence, and a logical basis for reasoning about equivalence. The process calculus is a variant of CCS, with bounded replication and probabilistic polynomial-time expressions allowed in messages and boolean tests.

29 Views

3y ago

JAPANESE SECOND LANGUAGE

JAPANESE SECOND LANGUAGE Written examination Thursday 26 November 2020 Reading time: 11.45 am to 12.00 noon (15 minutes) Writing time: 12.00 noon to 2.00 pm (2 hours) QUESTION AND ANSWER BOOK Structure of book Section Number of questions Number of questions to be answered Number of marks 1 – Part A – Part B 1 1 1 1 10 10 2 – Part A – Part B 1 1 1 1 20 15 3 4 1 20 Total 75 Students .

52 Views

3y ago

Recent Views

AUTOMOTIVE INDUSTRY ANALYSIS REPORT and GUIDE

3.1 General Outlook of the Automotive Industry in the World 7 3.2 Overview of the Automotive Industry in Turkey 10 3.3 Overview of the Automotive Industry in TR42 Region 12 4 Effects of COVID-19 Outbreak on the Automotive Industry 15 5 Trends Specific to the Automotive Industry 20 5.1 Special Trends in the Automotive Industry in the World 20

1y ago

86 Views

Automotive Pathway Automotive Services Fundamentals

Automotive Pathway Automotive Services Fundamentals Course Number: IT11 Prerequisite: None Aligned Industry Credential: S/P2- Safety and Pollution Prevention and SP2- Mechanical and Pollution Prevention Description: This course introduces automotive safety, basic automotive terminology, system & component identification, knowledge and int

2y ago

228 Views

Articulation Agreements: College of Applied Technologies .

Hernando High School FL Automotive . Central Nine Career Center IN Automotive Elkhart Area Career Center IN Automotive . Kokomo Area Career Center IN Automotive North Lawrence Vo-Tech IN MLR Porter County Career Center IN Automotive Richmond High School IN Automotive Southeastern Career

2y ago

376 Views

Automotive Basics - Auto Upkeep

Automotive Basics - Course Description "Automotive Basics includes knowledge of the basic automotive systems and the theory and principles of the components that make up each system and how to service these systems. Automotive Basics includes applicable safety and environmental rules and regulations. In Automotive Basics, students will gain

1y ago

197 Views

Automotive Automotive Automotive - HSBC Bank Malaysia

This Merchant list is subject to change from time to time. Merchant(s) who are terminated from the Instalment program after the published date might still be reflected in this list. HSBC Cardholder(s) are advised to confirm the availability of HSBC Card Instalment Plan with the merchant. Automotive Automotive Automotive

1y ago

173 Views

On the Road: U.S. Automotive Parts Industry Annual Assessment

Table 12: Acquisitions of U.S. Automotive Parts Companies (SIC 3714) Table 13: Automotive Parts Exports, 2000-2010 Table 14: Automotive Parts Imports, 2000-2010 . Automotive parts consumption is linked to the demand for new vehicles, since roughly 70 percent of U.S. automotive parts production is for Original Equipment (OE) products. .

10m ago

72 Views

EMC TEST SYSTEMS FOR AUTOMOTIVE

AUTOMOTIVE EMC TEST SYSTEMS FOR AUTOMOTIVE ELECTRONICS AUTOMOTIVE EMC TEST SYSTEMS FOR AUTOMOTIVE ELECTRONICS Step 1 Step 2 Step 3: Set the parameters Step 4: Active test. Load dump pulses have high pulse energy, which can be highly destructive to electrical or electronic equipment. The LD 200N series simulates these pulses with high energy in a range of up to 1.2 seconds. The LD 200N .

3y ago

266 Views

Automotive Manufacturing - Select Georgia

Jobs created by Georgia’s automotive-related locations Toyo Tire North America Manufacturing and expansions in the last three years 32,000 Automotive-related engineers and production workers in Georgia Sources: EMSI 2020.3, press releases and Automotive Database, Georgia Power Community & Economic Development, 2020 Automotive Manufacturing

2y ago

166 Views

#1 OSAT for Automotive Packaging and Test

We Know Automotive Amkor has extensive experience with automotive process requirements shipping billions of units every year for automotive applications. Our packages meet or exceed automotive quality, reliability, burn-in and safe launch plan criteria. Amkor also has failure analysis, tri-temp test and statistical process capability in all .

1y ago

145 Views

Ipsos Automotive Center of Excellence

Global Automotive Center of Excellence -2014 Ipsos Automotive 9 Automotive Center of Excellence As global automotive markets get more sophisticated, they require vehicle manufacturers to offer the most relevant market propositions to match consumer needs. There is greater value than ever before for a global research partner, who understands

1y ago

126 Views

All about automotive engineering in a pocketbook The 8th edition has .

Automotive Automotive Handbook Handbook All about automotive engineering in a pocketbook The 8th edition has been revised and extended. Automotive Handbook Reference handbook for academic and personal use. ISBN 978--7680-4851-3 Contents - central themes Basic principles: physics, materials, machine parts, joining and bonding techniques

1y ago

135 Views

Brochure: Advanced Flash Storage Solutions for Automotive Applications

iNAND Automotive Embedded Flash Drives (EFDs) are designed to support the harsh environments, high reliability and quality required by the automotive industry. The automotive iNAND product portfolio supports both UFS and e.MMC interfaces in a small 11.5x13mm package with a wide range of capacities to provide automotive OEMs and Tier-1

1y ago

161 Views

Industry Skills Forecast and Proposed Schedule of Work Automotive

Executive summary The Automotive Retail, Service and Repair (AUR) and Automotive Manufacturing (AUM) Training Packages are critical elements in the Vocational Education and Training (VET) system, playing central roles in the training of learners that engage in the automotive industries. A productive and valuable Automotive Training

1y ago

134 Views

Automotive Programs Student Handbook - SCCIowa

include a basic knowledge of all facets of the automotive repair industry, followed by classroom practice and drills of basic skills utilized in the automotive repair industry. The curriculum includes an internship experience in an automotive repair business. The curriculum is evaluated and revised as automotive repair needs change in the industry.

10m ago

72 Views

automotIve

automotive manufacturers worldwide. Those companies that take a forward-thinking approach will gain a competitive advantage and secure a leadership position in a realigned automotive value chain. At Seco, we partner with OEMs and other vehicle-based organisations around the globe to help automotive manufacturers overcome their

3y ago

145 Views

Management Of Probabilistic Data: Foundations And Challenges

It looks like you're using an ad-blocker