Data Mining Algorithms - Stanford University

1y ago
11 Views
2 Downloads
626.66 KB
39 Pages
Last View : 25d ago
Last Download : 3m ago
Upload by : Laura Ramon
Transcription

Data Mining Algorithms CS102 Spring 2020 Data Mining CS102

Data Tools and Techniques § Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions (“queries”) § Data Mining Looking for patterns in data § Machine Learning Using data to build models and make predictions § Data Visualization Graphical depiction of data § Data Collection and Preparation Data Mining CS102

Data Mining Looking for patterns in data Similar to unsupervised machine learning Popularity predates popularity of machine learning “Data mining” often associated with specific data types and patterns We will focus on “market-basket” data Widely applicable (despite the name) And two types of data mining patterns Frequent item-sets Association rules Data Mining CS102

Other Data and Patterns Other types of data Networks/graphs Streams Text (“text mining”) Specific techniques for each one Other patterns Similar items Structural patterns in large graphs/networks Clusters, anomalies Data Mining CS102

(In)Famous Early Success Stories Victoria’s Secret Walmart Beer & Diapers Data Mining CS102

Market-Basket Data Originated with retail data Each shopper buys “market basket” of groceries Mine data for patterns in buying habits General definition Domain of items Transaction - one or more items occurring together Dataset – set of transactions (usually large) Data Mining CS102

Market-Basket Examples Items Data Mining Transaction CS102

Market-Basket Examples Items Transaction Groceries Data Mining CS102

Market-Basket Examples Data Mining Items Transaction Groceries Grocery cart CS102

Market-Basket Examples Data Mining Items Transaction Groceries Online goods Grocery cart CS102

Market-Basket Examples Data Mining Items Transaction Groceries Online goods Grocery cart Virtual shopping cart CS102

Market-Basket Examples Data Mining Items Transaction Groceries Online goods University courses Grocery cart Virtual shopping cart CS102

Market-Basket Examples Data Mining Items Transaction Groceries Online goods University courses Grocery cart Virtual shopping cart Student transcript CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Patient Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Patient Menu items Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Patient Menu items Restaurant customer Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Patient Menu items Restaurant customer Words Data Mining CS102

Market-Basket Examples Items Transaction Groceries Grocery cart Online goods Virtual shopping cart University courses Student transcript University students Party Movies Person Symptoms Patient Menu items Restaurant customer Words Document Data Mining CS102

Data Mining Algorithms Frequent Item-Sets – sets of items that occur frequently together in transactions Groceries bought together Courses taken by same students Students going to parties together Movies watched by same people Association Rules – When certain items occur together, another item frequently occurs with them Shoppers who buy phone charger also buy case Students who take Databases also take Machine Learning Diners who order curry and rice also order bread Data Mining CS102

Frequent Item-Sets Sets of items that occur frequently together in transactions Ø How large is a “set”? Ø What does “frequently” mean? Data Mining CS102

Frequent Item-Sets Sets of items that occur frequently together in transactions Ø How large is a “set”? Usually specify a minimum min-set-size Possibly also a maximum max-set-size Ø What does “frequently” mean? Notion of support Data Mining CS102

Support Support for a set of items S in a dataset of transactions is the fraction of the transactions containing S: # of transactions containing S total # of transactions Specify support-threshold for frequent item-sets Only return sets where support support-threshold Data Mining CS102

Your Turn Transactions: T1: T2: T3: T4: T5: milk, eggs, juice milk, juice, cookies eggs, chips milk, eggs milk, juice, cookies, chips What are the frequent item-sets if: min-set-size 2 (no max-set-size) support-threshold 0.3 Support: Data Mining # of transactions containing S total # of transactions CS102

Computing Frequent Item-Sets “Apriori” algorithm Efficiency relies on the following property: If S is a frequent item-set satisfying support-threshold t, then every subset of S is also a frequent item-set satisfying support-threshold t. Or the inverse: If S is not a frequent item-set satisfying support-threshold t, then no superset of S can be a frequent item-set satisfying support-threshold t. Data Mining CS102

Association Rules When a set of items S occurs together, another item i frequently occurs with them Sài Ø How large is a “set”? Ø What does “occurs together” mean? Ø What does “frequently occurs with them” mean? Data Mining CS102

Association Rules When a set of items S occurs together, another item i frequently occurs with them Sài Ø How large is a “set”? Usually specify a minimum min-set-size for S Possibly also a maximum max-set-size for S Ø What does “occurs together” mean? Ø What does “frequently occurs with them” mean? Data Mining CS102

Association Rules When a set of items S occurs together, another item i frequently occurs with them Sài Ø How large is a “set”? Usually specify a minimum min-set-size for S Possibly also a maximum max-set-size for S Ø What does “occurs together” mean? Notion of support Ø What does “frequently occurs with them” mean? Notion of confidence Data Mining CS102

Support and Confidence Support for association rule S à i in a dataset of transactions is fraction of transactions containing S: # of transactions containing S total # of transactions Confidence for association rule S à i in a dataset of transactions is the fraction of transactions containing S that also contain i: # of transactions containing S and i # of transactions containing S Data Mining CS102

Support and Confidence Specify support-threshold and confidence-threshold for association rules Only return rules where: support support-threshold and confidence confidence-threshold Data Mining CS102

Your Turn Transactions: T1: T2: T3: T4: T5: Support: milk, eggs, juice # of transactions containing S milk, juice, cookies total # of transactions eggs, chips Reminder: support and milk, eggs confidence must be threshold, not milk, juice, cookies, chips What are the association rules S à i if: min-set-size 1 (no max-set-size) support-threshold 0.5 confidence-threshold 0.5 Confidence: Data Mining # of transactions containing S and i # of transactions containing S CS102

Computing Association Rules 1. Use frequent item-sets to find left-hand sides S satisfying support threshold 2. Then extend to find right-hand sides S à i satisfying confidence threshold NOT a property: Why Not? If S à i is an association rule satisfying support-threshold t and confidence-threshold c, and S’Í S, then S’ à i is an an association rule satisfying support-threshold t and confidence-threshold c. Data Mining CS102

Association Rules: Lift Association rule S à i might have high confidence because item i appears frequently, not because it’s associated with S. Confidence for association rule S à i in a dataset of Lift transactions is the fraction of transactions containing S that also contain i :, divided by the overall frequency of i : # of transactions containing S and i #trans containing S and i #trans containing i # of transactions containing #trans containing S totalS#trans Data Mining CS102

Lift: Examples Transactions: T1: T2: T3: T4: T5: milk, eggs, juice milk, juice, cookies eggs, chips milk, eggs milk, juice, cookies, Lift 1: no association Lift 1: association Lift 1: anti-association chips juice à cookies Lift (2/3) (2/5) 10/6 1.67 eggs à milk Lift (2/3) (4/5) 10/12 0.83 Lift: Data Mining #trans containing S and i #trans containing S #trans containing i total #trans CS102

Data Mining Algorithms CS102 Spring 2020 Data Mining CS102

Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning Popularity predates popularity of machine learning "Data mining" often associated with specific data types and patterns We will focus on "market-basket" data Widely applicable (despite the name) And two types of data mining patterns

Related Documents:

SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity Qingyuan Zhao Stanford University qyzhao@stanford.edu Murat A. Erdogdu Stanford University erdogdu@stanford.edu Hera Y. He Stanford University yhe1@stanford.edu Anand Rajaraman Stanford University anand@cs.stanford.edu Jure Leskovec Stanford University jure@cs.stanford .

DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23

Data Mining and its Techniques, Classification of Data Mining Objective of MRD, MRDM approaches, Applications of MRDM Keywords Data Mining, Multi-Relational Data mining, Inductive logic programming, Selection graph, Tuple ID propagation 1. INTRODUCTION The main objective of the data mining techniques is to extract .

Computer Science Stanford University ymaniyar@stanford.edu Madhu Karra Computer Science Stanford University mkarra@stanford.edu Arvind Subramanian Computer Science Stanford University arvindvs@stanford.edu 1 Problem Description Most existing COVID-19 tests use nasal swabs and a polymerase chain reaction to detect the virus in a sample. We aim to

October 20, 2009 Data Mining: Concepts and Techniques 7 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization October 20, 2009 Data Mining: Concepts and Techniques 8 Why Not Traditional Data Analysis? Tremendous amount of data

no unique set of data mining algorithms that can be used in all application domains. But we can apply different types of the data mining algorithms as an integrated architecture or hybrid models to data sets to increase the robustness of the mining system. GeoMiner, a spatial data mining system prototype was developed on the top of the DBMiner .

Araling Panlipunan Grade 10 . Alternative Delivery Mode . Ikalawang Markahan- Modyul 3: Mga Dahilan at Epekto ng Migrasyon . Unang Edisyon, 2020 . Isinasaad ng Batas Republika 8293, Seksiyon 176na “Hindi maaaring magkaroon ng karapatang- sipi sa anomang akda ang Pamahalaan ng Pilipinas. Gayon pa man, kailangan muna ang pahintulot ng ahensiya o tanggapan ng pamahalaan na naghanda ng akda kung .