Introduction to Data Mining with R [1]

Yanchang Zhao
http://www.RDataMining.com

Statistical Modelling and Computing Workshop at Geoscience Australia
8 May 2015

[1] Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in Sept 2014, and at University of Canberra in Sept 2013

Questions
- Do you know data mining and its algorithms and techniques?
- Have you heard of R?
- Have you ever used R in your work?

Outline
- Introduction
- Classification with R
- Clustering with R
- Association Rule Mining with R
- Text Mining with R
- Time Series Analysis with R
- Social Network Analysis with R
- R and Big Data
- Online Resources

What is R?
- R is a free software environment for statistical computing and graphics.
- R can be easily extended with 6,600 packages available on CRAN (as of May 2015).
- Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
- R manuals on CRAN (cran.r-project.org/manuals.html):
  - An Introduction to R
  - The R Language Definition
  - R Data ...

Why R?
- R is widely used in both academia and industry.
- R was ranked no. 1 in the KDnuggets 2014 poll on Top Languages for analytics, data mining, data science (actually, no. 1 in 2011, 2012 & 2013!).
- The CRAN Task Views (...n.r-project.org/web/views/) provide collections of packages for different tasks:
  - Machine learning & statistical learning
  - Cluster analysis & finite mixture models
  - Time series analysis
  - Multivariate statistics
  - Analysis of spatial ...

Classification with R
- Decision trees: rpart, party
- Random forest: randomForest, party
- SVM: e1071, kernlab
- Neural networks: nnet, neuralnet, RSNNS
- Performance evaluation: ROCR

The Iris Dataset
    # iris data
    str(iris)
    ## 'data.frame': 150 obs. of 5 variables:
    ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
    ## $ Species     : Factor w/ 3 levels "setosa","versicolor",..

    # split into training and test datasets
    set.seed(1234)
    ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
    iris.train <- iris[ind == 1, ]
    iris.test <- iris[ind == 2, ]

Build a Decision Tree
    # build a decision tree
    library(party)
    iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    iris.ctree <- ctree(iris.formula, data = iris.train)

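    plot(iris.ctree)

[Figure: the fitted tree. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes: 2 (n = 40), 5 (n = 21), 6 (n = 19) and 7, each shown as a bar chart of class proportions.]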

Prediction
    # predict on test data
    pred <- predict(iris.ctree, newdata = iris.test)
    # check prediction result
    table(pred, iris.test$Species)
    ##
    ## pred         setosa versicolor ... 14
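The train/test split and prediction steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: it swaps in rpart (one of the decision tree packages listed earlier, shipped with standard R installations) for party's ctree(), so the tree and the resulting numbers will not match the slides. The names fit, conf.mat and accuracy are made up for the example.

```r
# Sketch: decision tree on iris with rpart instead of party::ctree (assumption)
library(rpart)

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
iris.train <- iris[ind == 1, ]
iris.test <- iris[ind == 2, ]

# fit a classification tree on the training set
fit <- rpart(Species ~ ., data = iris.train, method = "class")

# predict class labels for the held-out test set
pred <- predict(fit, newdata = iris.test, type = "class")

# confusion matrix: rows are predictions, columns are true classes
conf.mat <- table(predicted = pred, actual = iris.test$Species)
print(conf.mat)

# overall accuracy is the share of test cases on the diagonal
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
print(accuracy)
```

Reading the diagonal of the confusion matrix against the column totals gives per-class recall in the same way.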

Clustering with R
- k-means: kmeans(), kmeansruns() [10]
- k-medoids: pam(), pamk()
- Hierarchical clustering: hclust(), agnes(), diana()
- DBSCAN: fpc
- BIRCH: birch
- Cluster validation: packages clv, clValid, NbClust

[10] Functions are followed by "()"; the others are packages.
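Of the functions listed above, hclust() is part of base R, so hierarchical clustering can be tried without installing anything. The sketch below is an illustration only: the sample size of 40, the seed, and average linkage are arbitrary assumptions, not choices taken from the slides.

```r
# Sketch: hierarchical clustering on a random sample of iris (base R only)
set.seed(2835)                      # arbitrary seed, assumption
idx <- sample(1:nrow(iris), 40)     # small sample keeps a dendrogram readable
iris.sample <- iris[idx, 1:4]       # drop the Species column

# agglomerative clustering on Euclidean distances with average linkage
hc <- hclust(dist(iris.sample), method = "average")

# cut the dendrogram into 3 groups and compare with the true species
groups <- cutree(hc, k = 3)
print(table(groups, iris$Species[idx]))

# plot(hc, labels = iris$Species[idx]) would draw the dendrogram
```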

k-means Clustering
    set.seed(8953)
    iris2 <- iris
    # remove class IDs
    iris2$Species <- NULL
    # k-means clustering
    iris.kmeans <- kmeans(iris2, 3)
    # check result
    table(iris$Species, iris.kmeans$cluster)
    ##
    ##               1  2  3
    ##   setosa      0 50  0
    ##   versicolor  2  0 48
    ##   virginica  36  0 14

    # plot clusters and their centers
    plot(iris2[c("Sepal.Length", "Sepal.Width")], col = iris.kmeans$cluster)
    points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
           col = 1:3, pch = "*", cex = 5)

[Figure: scatter plot of Sepal.Width against Sepal.Length, coloured by cluster, with the three cluster centers marked by "*".]
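The example above fixes k = 3 in advance. One common way to sanity-check that choice, sketched here as an illustration only, is to compare the total within-cluster sum of squares over a range of k. The range 1 to 6 and nstart = 10 are assumptions made for this example.

```r
# Sketch: within-cluster sum of squares for k = 1..6 (an "elbow" check)
iris2 <- iris[, 1:4]                # numeric columns only
set.seed(8953)

# nstart = 10 reruns k-means from several random starts and keeps the best
wss <- sapply(1:6, function(k) {
  kmeans(iris2, centers = k, nstart = 10)$tot.withinss
})
print(round(wss, 2))

# plot(1:6, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```

The value of k where the curve stops dropping sharply is a reasonable candidate. It is a heuristic, not a test.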

Density-based Clustering
    library(fpc)
    iris2 <- iris[-5] # remove class IDs
    # DBSCAN clustering
    ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
    # compare clusters with original class IDs
    table(ds$cluster, iris$Species)
    ##
    ##     setosa versicolor virginica
    ##   0      2         10        17
    ##   1     48          0         0
    ##   2      0         37         0
    ##   3      0          3        33

    # 1-3: clusters; 0: outliers or noise
    plotcluster(iris2, ds$cluster)

[Figure: discriminant projection of the data, with axes dc 1 and dc 2 and each point drawn as its cluster number (0 to 3).]
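To see what eps and MinPts mean in the DBSCAN call above, the sketch below counts, in base R, how many points lie within eps of each observation and flags those with at least MinPts neighbours as dense ("core") points. It assumes the point itself counts toward MinPts; the package's exact convention may differ, and this is not the full DBSCAN algorithm.

```r
# Sketch: which iris observations are "core" points for eps = 0.42, MinPts = 5
iris2 <- as.matrix(iris[, 1:4])
eps <- 0.42
MinPts <- 5

# full matrix of pairwise Euclidean distances
d <- as.matrix(dist(iris2))

# number of points within eps of each observation (includes the point itself)
n.neighbours <- rowSums(d <= eps)

# core points have a dense enough neighbourhood
is.core <- n.neighbours >= MinPts
print(table(is.core, iris$Species))
```

DBSCAN then grows clusters outward from core points. Points that are neither core nor within eps of a core point end up in cluster 0 (noise).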

Association Rule Mining with R
- Association rules: apriori(), eclat() in package arules
- Sequential patterns: arulesSequence
- Visualisation of associations: arulesViz

The Titanic Dataset
    ... .raw)
    ## [1] 2201 4
    idx <- sample(1:nrow(titanic.raw), 8)
    titanic.raw[idx, ]
    ##   Class    Sex   Age Survived
    ##     3rd   Male Adult       No
    ##     3rd   Male Adult       No
    ##     3rd   Male Adult       No
    ##    Crew   Male Adult       No
    ##     3rd Female Adult       No
    ##     2nd Female Adult       No
    ##     3rd   Male Adult       No
    ##     3rd   Male Adult       No

Association Rule Mining
    # find association rules with the APRIORI algorithm
    library(arules)
    rules <- apriori(titanic.raw, control = list(verbose = F),
                     parameter = list(minlen = 2, supp = 0.005, conf = 0.8),
                     appearance = list(rhs = c("Survived=No", "Survived=Yes"),
                                       default = "lhs"))
    # sort rules
    quality(rules) <- round(quality(rules), digits = 3)
    rules.sorted <- sort(rules, by = "lift")
    # have a look at rules
    # inspect(rules.sorted)

    ##    lhs                                  rhs            support confidence  lift
    ## 1  {Class=2nd,Age=Child}                {Survived=Yes}   0.011      1.000 3.096
    ## 2  {Class=2nd,Sex=Female,Age=Child}     {Survived=Yes}   0.006      1.000 3.096
    ## 3  {Class=1st,Sex=Female}               {Survived=Yes}   0.064      0.972 3.010
    ## 4  {Class=1st,Sex=Female,Age=Adult}     {Survived=Yes}   0.064      0.972 3.010
    ## 5  {Class=2nd,Sex=Male,Age=Adult}       {Survived=No}    0.070      0.917 1.354
    ## 6  {Class=2nd,Sex=Female}               {Survived=Yes}   0.042      0.877 2.716
    ## 7  {Class=Crew,Sex=Female}              {Survived=Yes}   0.009      0.870 2.692
    ## 8  {Class=Crew,Sex=Female,Age=Adult}    {Survived=Yes}   0.009      0.870 2.692
    ## 9  {Class=2nd,Sex=Male}                 {Survived=No}    0.070      0.860 1.271
    ## 10 {Class=2nd, ...
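The support, confidence and lift columns can be checked by hand. The sketch below recomputes them for the first rule, {Class=2nd, Age=Child} with right-hand side {Survived=Yes}. It uses the aggregated Titanic table that ships with base R rather than titanic.raw, on the assumption that both describe the same 2201 passengers and crew.

```r
# Sketch: support, confidence and lift of one rule, computed from counts
titanic <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq
n <- sum(titanic$Freq)              # total number of people

lhs <- titanic$Class == "2nd" & titanic$Age == "Child"
rhs <- titanic$Survived == "Yes"

n.lhs <- sum(titanic$Freq[lhs])         # people matching the left-hand side
n.both <- sum(titanic$Freq[lhs & rhs])  # people matching both sides
n.rhs <- sum(titanic$Freq[rhs])         # all survivors

supp <- n.both / n              # how often the whole rule occurs
conf <- n.both / n.lhs          # P(rhs | lhs)
lift <- conf / (n.rhs / n)      # confidence relative to the base survival rate

print(round(c(support = supp, confidence = conf, lift = lift), 3))
```

A lift well above 1 means the left-hand side makes survival far more likely than the overall survival rate.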

    library(arulesViz)
    plot(rules, method = "graph")

[Figure: graph for 12 rules. Width: support (0.006 to 0.192); color: lift (1.222 to 3.096). Item sets such as {Class=1st,Sex=Female} and {Class=3rd,Sex=Male,Age=Adult} are linked to {Survived=Yes} or {Survived=No}.]

Text Mining with R
- Text mining: tm
- Topic modelling: topicmodels, lda
- Word cloud: wordcloud
- Twitter data access: twitteR

Retrieve Tweets
Retrieve recent tweets by @RDataMining
    ## Option 1: retrieve tweets from Twitter
    library(twitteR)
    tweets <- userTimeline("RDataMining", n = 3200)
    ## Option 2: download @RDataMining tweets from RDataMining.com
    url <- "..."
    download.file(url, destfile = "./data/rdmTweets.RData")
    ## load tweets into R
    load(file = "./data/rdmTweets.RData")
    (n.tweet <- length(tweets))
    ## [1] 320
    strwrap(tweets[[320]]$text, width = 55)
    ## [1] "An R Reference Card for Data Mining is now available"
    ## [2] "on CRAN. It lists many useful R functions and packages"
    ## [3] "for data mining applications."

Text Cleaning
    library(tm)
    # convert tweets to a data frame
    df <- twListToDF(tweets)
    # build a corpus
    myCorpus <- Corpus(VectorSource(df$text))
    # convert to lower case
    myCorpus <- tm_map(myCorpus, tolower)
    # remove punctuations and numbers
    myCorpus <- tm_map(myCorpus, removePunctuation)
    myCorpus <- tm_map(myCorpus, removeNumbers)
    # remove URLs, 'http' followed by non-space characters
    removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
    myCorpus <- tm_map(myCorpus, removeURL)
    # remove 'r' and 'big' from stopwords
    myStopwords <- setdiff(stopwords("english"), c("r", "big"))
    # remove stopwords
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
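The cleaning steps above can be tried on a plain character vector without tm or Twitter access. The sketch below is a base-R approximation only: clean_text() is a hypothetical helper, its tiny stopword list stands in for stopwords("english"), the sample sentence and URL are made up, and URLs are stripped before punctuation so the pattern still sees the full address.

```r
# Sketch: the same cleaning steps in base R (hypothetical helper, not tm)
clean_text <- function(x, stop.words = c("a", "an", "the", "is", "for", "on")) {
  x <- tolower(x)                              # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)        # remove URLs
  x <- gsub("[[:punct:]]", "", x)              # remove punctuation
  x <- gsub("[[:digit:]]", "", x)              # remove numbers
  # remove whole-word stopwords; "r" is deliberately not in the list
  pattern <- paste0("\\b(", paste(stop.words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)                    # collapse repeated spaces
  trimws(x)
}

cleaned <- clean_text(
  "An R Reference Card for Data Mining is on CRAN: http://example.org/x 2015"
)
print(cleaned)
```

Keeping "r" out of the stopword list mirrors the setdiff() step in the slides: in this corpus the single letter is a meaningful term.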

Stemming
    # keep a copy of corpus
    myCorpusCopy <- myCorpus
    # stem words
    myCorpus <- tm_map(myCorpus, stemDocument)
    # stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion,
                       dictionary = myCorpusCopy)
    # replace "miners" with "mining", because "mining" was
    # first stemmed to "mine" and then completed to "miners"
    myCorpus <- tm_map(myCorpus, gsub, pattern = "miners",
                       replacement = "mining")
    strwrap(myCorpus[320], width = 55)
    ## [1] "r reference card data mining now available cran list"
    ## [2] "used r functions package data mining applications"

Frequent Terms
    myTdm <- TermDocumentMatrix(myCorpus,
                                control = list(wordLengths = c(1, Inf)))
    # inspect frequent words
    (freq.terms <- findFreqTerms(myTdm, lowfreq = 20))
    ## [1] "analysis" ...
    ## [5] "examples" ...
    ## [9] "position" ...
    ## [13] "slides" ...
    ## [17] ... "universi...
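A term-document matrix is simply a table of term counts with one row per term and one column per document. The sketch below builds one in base R for three made-up documents and then picks the frequent terms, which is roughly what findFreqTerms() does. The documents and the threshold of 2 are assumptions for illustration.

```r
# Sketch: a tiny term-document matrix and its frequent terms (base R only)
docs <- c("r data mining", "data mining with r", "big data analysis")

tokens <- strsplit(docs, "\\s+")          # one character vector per document
terms <- sort(unique(unlist(tokens)))     # the vocabulary

# count each vocabulary term in each document: rows = terms, cols = documents
tdm <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
dimnames(tdm) <- list(terms, paste0("doc", seq_along(docs)))
print(tdm)

# terms that occur at least twice across all documents
freq <- rowSums(tdm)
freq.terms <- names(freq)[freq >= 2]
print(freq.terms)
```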

Associations
    # which words are associated with 'r'?
    findAssocs(myTdm, "r", 0.2)
    ##             r
    ## examples 0.32
    ## code     0.29
    ## package  0.20
    # which words are associated with 'mining'?
    findAssocs(myTdm, "mining", ...
    ## ... 0.26
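The association scores above are correlations between rows of the term-document matrix: two terms score high when they tend to appear in the same documents. The sketch below shows the idea on a made-up 3 x 5 count matrix using Pearson correlation. It assumes findAssocs() works along these lines and is not a reimplementation of it.

```r
# Sketch: term association as correlation between term rows (toy data)
tdm <- rbind(
  r        = c(1, 1, 0, 1, 0),
  examples = c(1, 1, 0, 0, 0),
  mining   = c(0, 1, 1, 0, 1)
)
colnames(tdm) <- paste0("doc", 1:5)

# correlate every other term with "r" across the documents
others <- tdm[rownames(tdm) != "r", , drop = FALSE]
assoc.r <- apply(others, 1, cor, y = tdm["r", ])
print(round(sort(assoc.r, decreasing = TRUE), 2))

# keep only terms whose correlation reaches a lower limit, here 0.2
print(names(assoc.r)[assoc.r >= 0.2])
```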

Network of Terms
    library(graph)
    library(Rgraphviz)
    plot(myTdm, term = freq.terms, corThreshold = 0.1, weighting = ...

[Figure: network of the frequent terms; visible node labels include "computing" and "slides".]

Word Cloud
    library(wordcloud)
    m <- as.matrix(myTdm)
    freq <- sort(rowSums(m), decreasing = T)
    wordcloud(words = names(freq), freq = freq, min.freq = 4, random.order = F)

[Figure: word cloud of the terms that occur at least 4 times in the tweets.]

Topic Modelling
    library(topicmodels)
    set.seed(123)
    myLda <- LDA(as.DocumentTermMatrix(myTdm), k = 8)
    terms(myLda, ...)
    ##      Topic 1    Topic 2  Topic 3    Topic 4
    ## [1,] "mining"   "data"   "r"        "position"
    ## [2,] "data"     "free"   "examples" "research"
    ## [3,] "analysis" "course" "code"     "university"
    ## [4,] "network"  "online" "book"     "data"
    ## [5,] "social"   "ausdm"  "mining"   "postdoctoral"
    ##      Topic 5        Topic 6     Topic 7     Topic 8
    ## [1,] "data"         "data"      "r"         "r"
    ## [2,] "r"            "scientist" "package"   "data"
    ## [3,] "mining"       "research"  "computing" "clustering"
    ## [4,] "applications" "r"         "slides"    "mining"
    ## [5,] "series"       "package"   "parallel"  "detection"

Time Series Analysis with R
- Time series decomposition: decomp(), decompose(), arima(), stl()
- Time series forecasting: forecast
- Time Series Clustering: TSclust
- Dynamic Time Warping (DTW): dtw
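The slides list these functions without an example. As a rough illustration, the sketch below applies decompose(), stl() and arima() from base R to the built-in AirPassengers series. The dataset and the model orders (the classic "airline model" on the log scale) are assumptions chosen for this example, not taken from the slides.

```r
# Sketch: decomposition and a simple seasonal ARIMA forecast (base R only)
ap <- AirPassengers                 # monthly totals, 1949-1960

# classical decomposition into trend, seasonal and random components
ap.dec <- decompose(ap)
print(round(ap.dec$figure, 1))      # the 12 estimated seasonal effects

# loess-based decomposition; the log makes the seasonality roughly additive
ap.stl <- stl(log(ap), s.window = "periodic")

# seasonal ARIMA(0,1,1)(0,1,1)[12] on the log scale (assumed orders)
fit <- arima(log(ap), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# forecast 24 months ahead and transform back to the original scale
fore <- predict(fit, n.ahead = 24)
ap.forecast <- exp(fore$pred)
print(round(head(ap.forecast), 1))
```

The forecast package listed above wraps the same kind of model with automatic order selection and plotting helpers.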

Social Network Analysis with R
- Packages: igraph, sna
- Centrality measures: degree(), betweenness(), closeness(), transitivity()
- Clusters: clusters(), no.clusters()
- Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
- Community detection: fastgreedy.community(), spinglass.community()
- Graph database Neo4j: package RNeo4j
  http://nicolewhite.github.io/RNeo4j/
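To make two of the centrality measures above concrete, the sketch below computes degree and closeness for a made-up five-node undirected graph in base R. It does not use igraph, and closeness is taken here as 1 over the sum of shortest-path distances, which is one common definition and an assumption about what a given package returns.

```r
# Sketch: degree and closeness centrality from an adjacency matrix (toy graph)
edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("D", "E"))
nodes <- sort(unique(as.vector(edges)))

# symmetric adjacency matrix of the undirected graph
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# degree: number of neighbours of each node
deg <- rowSums(adj)
print(deg)

# shortest-path lengths via the Floyd-Warshall algorithm
d <- ifelse(adj == 1, 1, Inf)
diag(d) <- 0
for (k in nodes) for (i in nodes) for (j in nodes) {
  d[i, j] <- min(d[i, j], d[i, k] + d[k, j])
}

# closeness: inverse of the total distance to all other nodes
closeness <- 1 / rowSums(d)
print(round(closeness, 3))
```

Node A has the highest degree and closeness here because every other node is at most two steps away from it.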

R and Big Data Platforms
- Hadoop
  - Hadoop (or YARN) - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
  - R Packages: RHadoop, RHIPE
- Spark
  - Spark - a fast and general engine for large-scale data processing, which can be 100 times faster than Hadoop
  - SparkR - R frontend for Spark
- H2O
  - H2O - an open source in-memory prediction engine for big data science
  - R Package: h2o
- MongoDB
  - MongoDB - an open-source document database
  - R packages: rmongodb, RMongo

R and Hadoop
- Packages: RHadoop, RHive
- RHadoop [11] is a collection of R packages:
  - rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
  - rhdfs - connect to Hadoop Distributed File System (HDFS)
  - rhbase - connect to the NoSQL HBase database
- You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)!
- Step-by-Step Guide to Setting Up an R-Hadoop ...

[11] ...cs/RHadoop/wiki

An Example of MapReducing with R [12]
    library(rmr2)
    map <- function(k, lines) {
      words.list <- strsplit(lines, "\\s")
      words <- unlist(words.list)
      return(keyval(words, 1))
    }
    reduce <- function(word, counts) {
      keyval(word, sum(counts))
    }
    wordcount <- function(input, output = NULL) {
      mapreduce(input = input, output = output, input.format = "text",
                map = map, reduce = reduce)
    }
    ## Submit job
    out <- wordcount(in.file.path, out.file.path)

[12] From Jeffrey Breen's presentation on Using R with ...ts/free-webinars/2013/using-r-with-hadoop/
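The word count above needs a Hadoop setup to run. To see the map, shuffle and reduce phases without one, the sketch below simulates them in plain R on three made-up lines of text. It only illustrates the data flow and does not use rmr2; the function names map.line() and reduce.key() are hypothetical.

```r
# Sketch: local simulation of the MapReduce word count (no Hadoop, no rmr2)
lines <- c("data mining with r", "r and hadoop", "mining big data")

# map: turn one line into (word, 1) key-value pairs
map.line <- function(line) {
  words <- unlist(strsplit(line, "\\s+"))
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(lines, map.line))

# shuffle: group all values that share the same key
grouped <- split(mapped$value, mapped$key)

# reduce: sum the values of each key
reduce.key <- function(counts) sum(counts)
counts <- sapply(grouped, reduce.key)
print(counts)
```

On a cluster the map and reduce functions run on different machines, and the framework performs the shuffle step between them.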

Online Resources
- RDataMining website: http://www.rdatamining.com
  - R Reference Card for Data Mining
  - RDataMining Slides Series
  - R and Data Mining: Examples and Case Studies
- RDataMining Group on LinkedIn (12,000 members)
  http://group.rdatamining.com
- RDataMining on Twitter (2,000 followers)
  @RDataMining
- Free online ...s
- Online ...nedocs

The End
Thanks!
Email: yanchang(at)rdatamining.com
