Introduction to Data Mining with R


Introduction to Data Mining with R [1]
Yanchang Zhao
http://www.RDataMining.com
Statistical Modelling and Computing Workshop at Geoscience Australia
8 May 2015

[1] Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in Sept 2014, and at University of Canberra in Sept 2013

Questions
- Do you know data mining and its algorithms and techniques?
- Have you heard of R?
- Have you ever used R in your work?

Outline
- Introduction
- Classification with R
- Clustering with R
- Association Rule Mining with R
- Text Mining with R
- Time Series Analysis with R
- Social Network Analysis with R
- R and Big Data
- Online Resources

What is R?
- R is a free software environment for statistical computing and graphics.
- R can be easily extended with 6,600 packages available on CRAN (as of May 2015).
- Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
- R manuals on CRAN (cran.r-project.org/manuals.html):
  - An Introduction to R
  - The R Language Definition
  - R Data ...

Why R?
- R is widely used in both academia and industry.
- R was ranked no. 1 in the KDnuggets 2014 poll on Top Languages for analytics, data mining, data science (actually, no. 1 in 2011, 2012 & 2013!).
- The CRAN Task Views (...n.r-project.org/web/views/) provide collections of packages for different tasks:
  - Machine learning & statistical learning
  - Cluster analysis & finite mixture models
  - Time series analysis
  - Multivariate statistics
  - Analysis of spatial ...

Classification with R
- Decision trees: rpart, party
- Random forest: randomForest, party
- SVM: e1071, kernlab
- Neural networks: nnet, neuralnet, RSNNS
- Performance evaluation: ROCR

The Iris Dataset
# iris data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",...
# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace = T, prob = c(0.7, 0.3))
iris.train <- iris[ind == 1, ]
iris.test <- iris[ind == 2, ]
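The split above assigns each row to a group at random, so the 70/30 proportion holds only approximately. As an alternative, here is a minimal sketch of a split with exact sizes, using base R only. The names n.train, train.idx, train.exact and test.exact are illustrative and do not appear in the slides.

```r
# Exact-size 70/30 split (a sketch; not the method used in the slides)
set.seed(1234)
n.train <- round(0.7 * nrow(iris))        # number of training rows
train.idx <- sample(nrow(iris), n.train)  # row indices drawn without replacement
train.exact <- iris[train.idx, ]
test.exact <- iris[-train.idx, ]
c(train = nrow(train.exact), test = nrow(test.exact))
```

Sampling indices without replacement guarantees that the two sets are disjoint and together cover every row.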

Build a Decision Tree
# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data = iris.train)

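plot(iris.ctree)
[Figure: conditional inference tree. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes: Node 2 (n = 40), Node 5 (n = 21), Node 6 (n = 19) and Node 7, each with a bar chart of class proportions.]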

Prediction
# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred  setosa versicolor ... 14
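A confusion table like the one above is usually summarised as accuracy and per-class recall. The sketch below shows that calculation with base R only. So that it runs without the party package, it replaces the fitted tree with a hand-written rule based on the first two splits of the plotted tree (Petal.Length at 1.9 and Petal.Width at 1.7) and scores it on the full iris data. The names classify, pred.rule, conf.mat, accuracy and recall are illustrative assumptions.

```r
# Hand-written rule approximating the top two splits of the plotted tree.
# This is an illustration only; the fitted ctree has one further split.
classify <- function(d) {
  ifelse(d$Petal.Length <= 1.9, "setosa",
         ifelse(d$Petal.Width <= 1.7, "versicolor", "virginica"))
}

pred.rule <- factor(classify(iris), levels = levels(iris$Species))

# rows: predicted class, columns: actual class
conf.mat <- table(predicted = pred.rule, actual = iris$Species)

accuracy <- sum(diag(conf.mat)) / sum(conf.mat)  # share of correct predictions
recall <- diag(conf.mat) / colSums(conf.mat)     # per-class share recovered

conf.mat
accuracy
recall
```

The same three lines after the rule apply unchanged to table(pred, iris.test$Species) from the slide.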

Clustering with R
- k-means: kmeans(), kmeansruns() [10]
- k-medoids: pam(), pamk()
- Hierarchical clustering: hclust(), agnes(), diana()
- DBSCAN: fpc
- BIRCH: birch
- Cluster validation: packages clv, clValid, NbClust

[10] Functions are followed with "()", and others are packages.

k-means Clustering
set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
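The slide fixes k = 3 because the number of species is known. When k is not known, one common heuristic is to compare the total within-cluster sum of squares across several values of k and look for the point where the improvement levels off. The sketch below is an addition to the slides, not part of them; x, ks and wss are illustrative names, and nstart = 20 is an arbitrary choice to reduce the effect of random starting centres.

```r
# Total within-cluster sum of squares for k = 2..6 (a sketch)
set.seed(8953)
x <- iris[, 1:4]   # numeric columns only, class IDs removed
ks <- 2:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
names(wss) <- ks
round(wss, 1)
```

Plotting wss against ks with plot(ks, wss, type = "b") gives the usual elbow plot.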

# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = "*", cex = 5)
[Figure: scatter plot of Sepal.Width against Sepal.Length, coloured by cluster, with the three cluster centers marked "*".]

Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33
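The two DBSCAN parameters define what counts as a dense region: a point is a core point when at least MinPts points lie within distance eps of it. The sketch below illustrates only that definition with base R; it is not the fpc implementation and does not form clusters. Counting the point itself among its neighbours is an assumption here, since implementations differ on that detail. The names x, eps, min.pts, d, n.neighbours and is.core are illustrative.

```r
# Core-point test behind DBSCAN (a sketch of the definition only)
x <- as.matrix(iris[, 1:4])
eps <- 0.42
min.pts <- 5

d <- as.matrix(dist(x))            # pairwise Euclidean distances
n.neighbours <- rowSums(d <= eps)  # neighbours within eps, including the point itself
is.core <- n.neighbours >= min.pts

table(core = is.core, species = iris$Species)
```

Points that are neither core points nor within eps of one are the ones DBSCAN reports as noise.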

# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)
[Figure: discriminant projection plot with axes dc 1 and dc 2; each point is drawn as its cluster ID (0 to 3).]

Association Rule Mining with R
- Association rules: apriori(), eclat() in package arules
- Sequential patterns: arulesSequence
- Visualisation of associations: arulesViz

The Titanic ...
... .raw)
## [1] 2201    4
idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##  Class    Sex   Age Survived
##    3rd   Male Adult       No
##    3rd   Male Adult       No
##    3rd   Male Adult       No
##   Crew   Male Adult       No
##    3rd Female Adult       No
##    2nd Female Adult       No
##    3rd   Male Adult       No
##    3rd   Male Adult       No

Association Rule Mining
# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw, control = list(verbose = F),
                 parameter = list(minlen = 2, supp = 0.005, conf = 0.8),
                 appearance = list(rhs = c("Survived=No", "Survived=Yes"),
                                   default = "lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits = 3)
rules.sorted <- sort(rules, by = "lift")
# have a look at rules
# inspect(rules.sorted)

##    lhs                                    rhs             support confidence  lift
## 1  {Class=2nd, Age=Child}              => {Survived=Yes}    0.011      1.000 3.096
## 2  {Class=2nd, Sex=Female, Age=Child}  => {Survived=Yes}    0.006      1.000 3.096
## 3  {Class=1st, Sex=Female}             => {Survived=Yes}    0.064      0.972 3.010
## 4  {Class=1st, Sex=Female, Age=Adult}  => {Survived=Yes}    0.064      0.972 3.010
## 5  {Class=2nd, Sex=Male, Age=Adult}    => {Survived=No}     0.070      0.917 1.354
## 6  {Class=2nd, Sex=Female}             => {Survived=Yes}    0.042      0.877 2.716
## 7  {Class=Crew, Sex=Female}            => {Survived=Yes}    0.009      0.870 2.692
## 8  {Class=Crew, Sex=Female, Age=Adult} => {Survived=Yes}    0.009      0.870 2.692
## 9  {Class=2nd, Sex=Male}               => {Survived=No}     0.070      0.860 1.271
## 10 {Class=2nd, ...
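The three quality measures can be checked by hand. The sketch below recomputes them for the rule {Class=2nd, Age=Child} => {Survived=Yes} using the Titanic contingency table that ships with base R (datasets::Titanic), on the assumption that it holds the same 2201 passengers as titanic.raw. The names tt, n, lhs, rhs, support, confidence and lift are illustrative.

```r
# Support, confidence and lift for one rule, computed by hand (a sketch)
tt <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq
n <- sum(tt$Freq)             # number of passengers

lhs <- tt$Class == "2nd" & tt$Age == "Child"
rhs <- tt$Survived == "Yes"

support <- sum(tt$Freq[lhs & rhs]) / n                    # P(lhs and rhs)
confidence <- sum(tt$Freq[lhs & rhs]) / sum(tt$Freq[lhs]) # P(rhs | lhs)
lift <- confidence / (sum(tt$Freq[rhs]) / n)              # P(rhs | lhs) / P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 3)
#    support confidence       lift
#      0.011      1.000      3.096
```

A lift above 1 means the left-hand side makes survival more likely than it is across all passengers.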

library(arulesViz)
plot(rules, method = "graph")
[Figure: graph for 12 rules. Width: support (0.006 to 0.192); color: lift (1.222 to 3.096). Rule antecedents such as {Class=3rd, Sex=Male, Age=Adult} and {Class=1st, Sex=Female} are linked to {Survived=No} or {Survived=Yes}.]

Text Mining with R
- Text mining: tm
- Topic modelling: topicmodels, lda
- Word cloud: wordcloud
- Twitter data access: twitteR

Retrieve Tweets
Retrieve recent tweets by @RDataMining
## Option 1: retrieve tweets from Twitter
library(twitteR)
tweets <- userTimeline("RDataMining", n = 3200)
## Option 2: download @RDataMining tweets from RDataMining.com
url <- "..."
download.file(url, destfile = "./data/rdmTweets.RData")
## load tweets into R
load(file = "./data/rdmTweets.RData")
(n.tweet <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 55)
## [1] "An R Reference Card for Data Mining is now available"
## [2] "on CRAN. It lists many useful R functions and packages"
## [3] "for data mining applications."

Text Cleaning
library(tm)
# convert tweets to a data frame
df <- twListToDF(tweets)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuations and numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs, 'http' followed by non-space characters
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
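To see what each cleaning step does, the same transformations can be tried on a plain character vector with base R. This sketch does not use tm. The function clean_text and the sample sentence are illustrative assumptions. Unlike the slide, it removes URLs before punctuation, so that a URL is still intact when its pattern is matched.

```r
# Base R version of the cleaning steps, applied to a character vector (a sketch)
clean_text <- function(x) {
  x <- tolower(x)                        # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)  # remove URLs
  x <- gsub("[[:punct:]]", "", x)        # remove punctuation
  x <- gsub("[[:digit:]]", "", x)        # remove numbers
  x <- gsub("[[:space:]]+", " ", x)      # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text("R Reference Card 2.0 is on CRAN: http://cran.r-project.org !")
cleaned
# "r reference card is on cran"
```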

Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
                   dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern = "miners",
                   replacement = "mining")
strwrap(myCorpus[320], width = 55)
## [1] "r reference card data mining now available cran list"
## [2] "used r functions package data mining applications"

Frequent Terms
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(wordLengths = c(1, Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq = 20))
## [1] "analysis" ...
## [5] "examples" ...
## [9] "position" ...
## [13] "slides" ...
## [17] ... "universi...

Associations
# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
##             r
## examples 0.32
## code     0.29
## package  0.20
# which words are associated with 'mining'?
findAssocs(myTdm, "mining", ...
## ... 0.26
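The association scores above are correlations between the rows of the term-document matrix: two terms score highly when they tend to occur in the same documents. The sketch below illustrates that idea on a tiny hand-made matrix with base R. The matrix tdm, its counts and the 0.2 limit are made up for illustration, and the code is an approximation of the idea rather than the tm implementation.

```r
# Term associations as row correlations of a toy term-document matrix (a sketch)
tdm <- rbind(
  r        = c(1, 1, 0, 1, 0),  # occurrences of each term in documents 1..5
  examples = c(1, 1, 0, 0, 0),
  mining   = c(0, 0, 1, 1, 1)
)

cors <- cor(t(tdm))["r", ]                      # correlation of every term with "r"
assoc <- cors[names(cors) != "r" & cors >= 0.2] # keep other terms above the limit
round(assoc, 2)
```

Here "examples" appears in two of the three documents that contain "r" and nowhere else, so it passes the limit, while "mining" is negatively correlated with "r" and is dropped.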

Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term = freq.terms, corThreshold = 0.1, weighting = ...
[Figure: network of frequent terms, with edges between correlated terms.]

Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing = T)
wordcloud(words = names(freq), freq = freq, min.freq = 4, random.order = F)
[Figure: word cloud of the terms in the @RDataMining tweets.]

Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k = 8)
terms(myLda, ...
##      Topic 1    Topic 2  Topic 3    Topic 4
## [1,] "mining"   "data"   "r"        "position"
## [2,] "data"     "free"   "examples" "research"
## [3,] "analysis" "course" "code"     "university"
## [4,] "network"  "online" "book"     "data"
## [5,] "social"   "ausdm"  "mining"   "postdoctoral"
##      Topic 5        Topic 6     Topic 7     Topic 8
## [1,] "data"         "data"      "r"         "r"
## [2,] "r"            "scientist" "package"   "data"
## [3,] "mining"       "research"  "computing" "clustering"
## [4,] "applications" "r"         "slides"    "mining"
## [5,] "series"       "package"   "parallel"  "detection"

Time Series Analysis with R
- Time series decomposition: decomp(), decompose(), arima(), stl()
- Time series forecasting: forecast
- Time Series Clustering: TSclust
- Dynamic Time Warping (DTW): dtw
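The slides list the decomposition functions without an example. As a minimal sketch, decompose() and stl() from base R can be applied to the built-in AirPassengers series. The choice of dataset, the multiplicative model, the log transform and the names ap, dec and fit are all assumptions made for illustration.

```r
# Time series decomposition with base R (a sketch on a built-in dataset)
ap <- AirPassengers                            # monthly totals, frequency 12

dec <- decompose(ap, type = "multiplicative")  # classical decomposition
round(dec$figure, 2)                           # the 12 seasonal factors

fit <- stl(log(ap), s.window = "periodic")     # loess-based decomposition
head(fit$time.series)                          # columns: seasonal, trend, remainder
```

Calling plot(dec) or plot(fit) shows the trend, seasonal and remainder components in stacked panels.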

Social Network Analysis with R
- Packages: igraph, sna
- Centrality measures: degree(), betweenness(), closeness(), transitivity()
- Clusters: clusters(), no.clusters()
- Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
- Community detection: fastgreedy.community(), spinglass.community()
- Graph database Neo4j: package RNeo4j
  http://nicolewhite.github.io/RNeo4j/
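Degree, the simplest of the centrality measures listed above, is the number of edges attached to a node. The sketch below computes it from an adjacency matrix with base R, without igraph or sna. The four-node edge list and the names edges, nodes, adj and deg are made up for illustration.

```r
# Degree centrality from an adjacency matrix (a sketch on a toy graph)
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"))
nodes <- sort(unique(as.vector(edges)))

adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1           # edge from first column to second column
adj[edges[, 2:1]] <- 1    # mirror it, because the graph is undirected

deg <- rowSums(adj)       # degree of each node
deg
# a b c d
# 2 2 3 1
```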

R and Big Data Platforms
- Hadoop
  - Hadoop (or YARN) - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
  - R Packages: RHadoop, RHIPE
- Spark
  - Spark - a fast and general engine for large-scale data processing, which can be 100 times faster than Hadoop
  - SparkR - R frontend for Spark
- H2O
  - H2O - an open source in-memory prediction engine for big data science
  - R Package: h2o
- MongoDB
  - MongoDB - an open-source document database
  - R packages: rmongodb, RMongo

R and Hadoop
- Packages: RHadoop, RHive
- RHadoop is a collection of R packages:
  - rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
  - rhdfs - connect to Hadoop Distributed File System (HDFS)
  - rhbase - connect to the NoSQL HBase database
- You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)!
- Step-by-Step Guide to Setting Up an R-Hadoop ... cs/RHadoop/wiki

An Example of MapReducing with R [12]
library(rmr2)
map <- function(k, lines) {
  words.list <- strsplit(lines, "\\s")
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)

[12] From Jeffrey Breen's presentation on Using R with ... ts/free-webinars/2013/using-r-with-hadoop/
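The word count above needs rmr2 and a Hadoop installation. To follow the logic without either, the map and reduce steps can be imitated in plain R: the map step emits a (word, 1) pair for every word, the pairs are grouped by word, and the reduce step sums each group. This sketch only imitates the data flow and is not how rmr2 runs a job. The input lines and the names map.line, pairs and counts are illustrative.

```r
# Word count in the MapReduce style, simulated in base R (a sketch)
lines <- c("r and hadoop", "r and big data")

# map: one (key, value) pair per word
map.line <- function(line) {
  words <- unlist(strsplit(line, "\\s+"))
  data.frame(key = words, val = 1, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(lines, map.line))

# shuffle and reduce: group values by key, then sum each group
counts <- tapply(pairs$val, pairs$key, sum)
counts
```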

Online Resources
- RDataMining website: http://www.rdatamining.com
  - R Reference Card for Data Mining
  - RDataMining Slides Series
  - R and Data Mining: Examples and Case Studies
- RDataMining Group on LinkedIn (12,000 members): http://group.rdatamining.com
- RDataMining on Twitter (2,000 followers): @RDataMining
- Free online s...
- Online ... nedocs

The End
Thanks!
Email: yanchang(at)rdatamining.com
