Frequent Itemset Mining Methods - Texas State University


Chapter 6: Mining Frequent Patterns, Associations and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (ch7)
- Association rules

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a subset of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Why Is Frequent Pattern Mining Important?
- Frequent pattern: an intrinsic and important property of data sets
- Foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
- Broad applications

Basic Concepts: Frequent Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and their overlap.)

- Itemset: a set of items
- k-itemset: an itemset X = {x1, ..., xk} with k items
- (Absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
- (Relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold
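
As a small illustration of these definitions (mine, not the slides'; support_count is a made-up helper name), the support of {Beer, Diaper} in the table above can be computed directly in Python:

    # Transactions from the table above
    db = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support_count(itemset, transactions):
        # Absolute support: number of transactions that contain the itemset
        return sum(1 for t in transactions if itemset <= t)

    X = {"Beer", "Diaper"}
    print(support_count(X, db))            # 3 (absolute support)
    print(support_count(X, db) / len(db))  # 0.6 (relative support); frequent for minsup <= 3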

Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is a closed pattern if X is frequent and there exists no super-pattern with the same support
  - i.e., all proper super-patterns must have strictly smaller support
- An itemset X is a max-pattern if X is frequent and there exists no super-pattern that is frequent
- Relationship between the two? Closed patterns are a lossless compression of frequent patterns, whereas max-patterns are a lossy compression
  - Lossless: can derive all frequent patterns as well as their support values
  - Lossy: can derive all frequent patterns, but not their support values

Closed Patterns and Max-Patterns: Example
- DB = { <a1, ..., a100>, <a1, ..., a50> }, min_sup = 1
- What is the set of closed patterns?
  - {a1, ..., a100}: support 1
  - {a1, ..., a50}: support 2
  - How to derive the frequent patterns and their support values?
- What is the set of max-patterns?
  - {a1, ..., a100}: support 1
  - How to derive the frequent patterns?
- What is the set of all frequent patterns?
  - {a1}: 2, ..., {a1, a2}: 2, ..., {a1, a51}: 1, ..., {a1, a2, ..., a100}: 1
  - A big number: 2^100 - 1

Closed Patterns: Derivation
For a given dataset with min_sup = 8 (absolute support), the closed patterns are {a,b,c,d} with support 10, {a,b,c} with support 12, and {a,b,d} with support 14. Derive the frequent 2-itemsets together with their support values:

    {a,b}: 14    {a,c}: 12    {a,d}: 14
    {b,c}: 12    {b,d}: 14    {c,d}: 10
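
The derivation relies on the lossless property: the support of any frequent itemset equals the maximum support among the closed patterns that contain it. A minimal Python check of the answer above (my own sketch; derived_support is a made-up name):

    from itertools import combinations

    # Closed patterns from the exercise, with their supports
    closed = {frozenset("abcd"): 10, frozenset("abc"): 12, frozenset("abd"): 14}

    def derived_support(itemset):
        # Support of a frequent itemset = max support over closed patterns containing it
        sups = [s for c, s in closed.items() if itemset <= c]
        return max(sups) if sups else 0  # 0: not contained in any closed pattern

    items = sorted(set().union(*closed))
    for pair in combinations(items, 2):
        print("".join(pair), derived_support(frozenset(pair)))
    # ab 14, ac 12, ad 14, bc 12, bd 14, cd 10 -- matching the answer above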

Chapter 6: Mining Frequent Patterns, Associations and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (ch7)
- Association rules

Scalable Frequent Itemset Mining Methods
Three major approaches:
- Apriori: a candidate generation-and-test approach (Agrawal & Srikant @ VLDB'94)
- FP-Growth: a frequent pattern-growth approach (Han, Pei & Yin @ SIGMOD'00)
- ECLAT: frequent pattern mining with vertical data format (Zaki & Hsiao @ SDM'02)
FP-Growth and ECLAT improve efficiency over Apriori.

Apriori: Downward Closure Property
- Apriori leverages the downward closure property to prune the search space and gain efficiency
- The downward closure (anti-monotonic) property of frequent patterns:
  - Any subset of a frequent itemset must also be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}

Apriori: High-level Description
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

Apriori: Example

DB (min_sup = 2):
Tid  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

1st scan -> C1: {a}: 2, {b}: 3, {c}: 3, {d}: 1, {e}: 3
Prune by min_sup -> L1: {a}: 2, {b}: 3, {c}: 3, {e}: 3

Generate from L1 -> C2: {a,b}, {a,c}, {a,e}, {b,c}, {b,e}, {c,e}
2nd scan -> counts: {a,b}: 1, {a,c}: 2, {a,e}: 1, {b,c}: 2, {b,e}: 3, {c,e}: 2
Prune by min_sup -> L2: {a,c}: 2, {b,c}: 2, {b,e}: 3, {c,e}: 2

Generate from L2 -> C3: {b,c,e}
3rd scan -> L3: {b,c,e}: 2

The Apriori Algorithm (Pseudo-code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != Ø; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min support;
    end
    return ∪k Lk;
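
For concreteness, here is a minimal runnable Python sketch of this loop (my own illustration, not the course's reference implementation; the name apriori is made up, and candidate generation uses the self-join-and-prune idea detailed on the next slide):

    from itertools import combinations

    def apriori(db, min_sup):
        # db: list of transactions (sets); returns {itemset: support_count}
        counts = {}
        for t in db:  # first scan: count 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}  # L1
        result, k = dict(Lk), 1
        while Lk:
            frequent = set(Lk)
            # self-join Lk, then prune candidates with an infrequent k-subset
            candidates = set()
            for a in frequent:
                for b in frequent:
                    u = a | b
                    if len(u) == k + 1 and all(
                        frozenset(s) in frequent for s in combinations(u, k)
                    ):
                        candidates.add(u)
            # scan the DB once to count the surviving candidates
            counts = {c: sum(1 for t in db if c <= t) for c in candidates}
            Lk = {s: c for s, c in counts.items() if c >= min_sup}
            result.update(Lk)
            k += 1
        return result

    db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
    print(apriori(db, min_sup=2))  # reproduces L1, L2, and L3 = {b,c,e}: 2

Running it on the example DB above reproduces every Lk from the previous slide.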

Implementation of Apriori
- Generate candidates, then count support for the generated candidates
- How to generate candidates? (see the sketch after this list)
  - Step 1: self-join Lk
  - Step 2: prune
- Example:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3*L3:
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
- This procedure does not miss any legitimate candidate; thus Apriori mines the complete set of frequent patterns.
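
The join-and-prune steps can be checked on this example with a short helper. This is a hedged sketch, not the slides' code: generate_candidates is a made-up name, and it uses a set-union join rather than the classic prefix-based self-join, which yields the same pruned result here:

    from itertools import combinations

    def generate_candidates(Lk, k):
        # Self-join Lk (size-k frequent itemsets), then prune any candidate
        # that has a size-k subset outside Lk
        frequent = set(Lk)
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        return {c for c in joined
                if all(frozenset(s) in frequent for s in combinations(c, k))}

    L3 = [frozenset(p) for p in ("abc", "abd", "acd", "ace", "bcd")]
    print(generate_candidates(L3, 3))
    # C4 = {abcd}; acde was joined but pruned because ade is not in L3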

How to Count Support of Candidates?
- Why is counting support of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Example: Counting Support of Candidates
(Figure: a hash-tree over candidate 3-itemsets, hashing items into buckets 1,4,7 / 2,5,8 / 3,6,9 at each interior node; the subset function walks the tree to find all candidates contained in the transaction 1 2 3 5 6.)
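
A full hash-tree is too long to sketch here, but the effect of the subset function (finding every stored candidate inside a transaction) can be imitated with a plain dictionary. This is my own simplification of the slides' structure, with made-up names and example candidates; the hash-tree's advantage is pruning whole groups of subsets instead of enumerating all of them:

    from itertools import combinations

    # Candidate 3-itemsets, each with a running support count
    candidates = {frozenset(c): 0
                  for c in [(1, 2, 3), (1, 3, 5), (2, 3, 5), (3, 5, 6), (1, 4, 7)]}

    def count_transaction(t, k=3):
        # Enumerate the transaction's k-subsets and bump matching candidates
        for sub in combinations(sorted(t), k):
            key = frozenset(sub)
            if key in candidates:
                candidates[key] += 1

    count_transaction({1, 2, 3, 5, 6})
    print(candidates)  # all but (1,4,7) are contained in the transaction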

Further Improvement of the Apriori Method
- Major computational challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Apriori Applications Beyond Pattern Mining
- Example: given a set S of students, find each subset of S such that the age range of the subset is less than 5
  - The Apriori algorithm applies: a level-wise search using the downward closure property for pruning to gain efficiency
- Apriori can be used to search for any subsets satisfying a property with downward closure (i.e., an anti-monotone constraint)
- CLIQUE for subspace clustering uses the same Apriori principle, where the one-dimensional cells are the items

Chapter 6: Mining Frequent Patterns, Associations and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (ch7)
- Association rules

Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many and not focused!
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - Optimization: the system exploits such constraints for efficient mining (constraint pushing), similar to pushing selections first in DB query processing
  - Note: we still find all the answers satisfying the constraints, not just some answers as in heuristic search

Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning in AI
  - Both aim at reducing the search space
  - Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search
  - Constraint pushing vs. heuristic search
  - How to integrate the two is an interesting research problem
- Constrained mining vs. query processing in DBMS
  - Database query processing also requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

Constraints
- Pattern space pruning constraints:
  - Anti-monotonic: if constraint c is violated, further mining along that branch can be terminated
  - Monotonic: if c is satisfied, there is no need to check c again
  - Succinct: c must be satisfied, so one can start with only the data sets satisfying c
  - Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of them if the items in a transaction can be properly ordered
- Data space pruning constraints:
  - Data succinct: the data space can be pruned at the start of the pattern mining process
  - Data anti-monotonic: if a transaction t does not satisfy c, then t can be pruned from further mining

Anti-Monotonicity in Constraint Pushing
- Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
- sum(S.Price) <= v is anti-monotonic
- sum(S.Price) >= v is not anti-monotonic
- support_count >= min_sup is anti-monotonic: the core property used in Apriori

TDB (min_sup = 2):
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Example: C: range(S.profit) <= 15 is anti-monotonic
- Itemset ab violates C (range = 40 - 0 = 40 > 15)
- So does every superset of ab
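
A quick check of the range constraint against this profit table (my own sketch; satisfies_range is a made-up name):

    profit = {"a": 40, "b": 0, "c": -20, "d": 10,
              "e": -30, "f": 30, "g": 20, "h": -10}

    def satisfies_range(S, v=15):
        # C: range(S.profit) <= v; anti-monotonic, since adding items can only widen the range
        vals = [profit[i] for i in S]
        return max(vals) - min(vals) <= v

    print(satisfies_range("ab"))  # False: range = 40, so every superset of ab is pruned
    print(satisfies_range("fg"))  # True: range = 10, so fg itself cannot be pruned by C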

Monotonicity for Constraint Pushing
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.Price) >= v is monotonic
- min(S.Price) <= v is monotonic
- Example (using the Item/Profit table from the previous slide): C: range(S.profit) >= 15
  - Itemset ab satisfies C
  - So does every superset of ab

Succinctness
- Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
- If a constraint is succinct, we can directly generate precisely the sets that satisfy it, even before support counting begins
  - This avoids the substantial overhead of generate-and-test; such a constraint is pre-counting pushable
- min(S.Price) <= v is succinct
- sum(S.Price) >= v is not succinct

Constraint-Based Mining: A General Picture

Constraint                        Anti-monotone   Monotone      Succinct
v ∈ S                             no              yes           yes
S ⊇ V                             no              yes           yes
S ⊆ V                             yes             no            yes
min(S) ≤ v                        no              yes           yes
min(S) ≥ v                        yes             no            yes
max(S) ≤ v                        yes             no            yes
max(S) ≥ v                        no              yes           yes
count(S) ≤ v                      yes             no            weakly
count(S) ≥ v                      no              yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes             no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no              yes           no
range(S) ≤ v                      yes             no            no
range(S) ≥ v                      no              yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible     convertible   no
support(S) ≥ ξ                    yes             no            no
support(S) ≤ ξ                    no              yes           no

Chapter 6: Mining Frequent Patterns, Associations and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (ch7)
- Association rules

Basic Concepts: Association Rules
- An association rule is of the form X => Y, where X, Y ⊂ I and X ∩ Y = Ø
  - Example: {onions, potatoes} => {burger} from sales data would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat
  - Such information can be used for marketing activities such as promotional pricing or product placements
- support(X => Y): probability that a transaction contains X ∪ Y
  - i.e., support(X => Y) = P(X ∪ Y)
  - Can be estimated as support_count(X ∪ Y) / total number of transactions (the percentage of transactions in the DB that contain X ∪ Y)
- confidence(X => Y): conditional probability that a transaction having X also contains Y
  - i.e., confidence(X => Y) = P(Y | X)
  - Can be estimated as support_count(X ∪ Y) / support_count(X)
- A rule is strong if it satisfies both the support and confidence thresholds
- confidence(X => Y) can easily be derived from the support counts of X and X ∪ Y; thus association rule mining can be reduced to frequent pattern mining
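
These estimates are easy to verify numerically. A minimal sketch (mine, not the slides'; rule_metrics is a made-up name), reusing the five-transaction table from the Basic Concepts slide:

    db = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def rule_metrics(X, Y, transactions):
        # support = P(X u Y); confidence = P(Y | X), both estimated from counts
        n_xy = sum(1 for t in transactions if X | Y <= t)
        n_x = sum(1 for t in transactions if X <= t)
        return n_xy / len(transactions), n_xy / n_x

    print(rule_metrics({"Beer"}, {"Diaper"}, db))  # (0.6, 1.0): Beer => Diaper (60%, 100%)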

Association Rules: Example

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Let minsup = 50%, minconf = 50%.
- Frequent patterns: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3, {Beer, Diaper}: 3
- Association rules (among many more):
  - Beer => Diaper (support 60%, confidence 100%)
  - Diaper => Beer (support 60%, confidence 75%)
- If {a} => {b} is an association rule, is {b} => {a} also an association rule?
  - Same support, different confidence
- If {a,b} => {c} is an association rule, is {b} => {c} also an association rule?
- If {b} => {c} is an association rule, is {a,b} => {c} also an association rule?

Interestingness Measure: Correlations (Lift)
- play basketball => eat cereal [40%, 66.7%] is misleading
  - The overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
- Support and confidence are not good indicators of correlation
- Measure of dependent/correlated events: lift

    lift(A, B) = P(A ∪ B) / (P(A) P(B))

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

    lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89
    lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33
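
A one-line check of these lift values (my own sketch):

    def lift(p_ab, p_a, p_b):
        # lift(A, B) = P(A u B) / (P(A) * P(B)); > 1: positive correlation, < 1: negative
        return p_ab / (p_a * p_b)

    n = 5000
    print(round(lift(2000 / n, 3000 / n, 3750 / n), 2))  # 0.89: basketball and cereal negatively correlated
    print(round(lift(1000 / n, 3000 / n, 1250 / n), 2))  # 1.33: basketball and not-cereal positively correlated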

