Statistical Data Mining and Machine Learning: Random Forests


HT2015: SC4
Statistical Data Mining and Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford
http://www.stats.ox.ac.uk/~sejdinov/sdmml.html

Random Forests

From Breiman, "Statistical Modelling: the two cultures" (2001):

"... that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed. The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, 'I knew those guys in District N were dragging their feet.'

While trees rate an A on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).

9.1 Growing Forests for Prediction
Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani).

9.2 Forests Compared to Trees
We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI data bases. A summary of the data sets is given in Table 1. Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the ..."

Table 2 (excerpt): Test set misclassification error (%)

Data set           Forest    Single tree
Breast cancer        2.9         5.9
Ionosphere           5.5        11.2
Glass               22.0        30.4
Soybean              5.7         8.6
----------------------------------------
Shuttle (x10^3)      7.0        62.0
DNA                  3.9         6.2
Digit                6.2        17.1

Random Forests and Extremely Randomized Trees

Random forests are similar to bagged decision trees, with a few key differences:
- For each split point, the search is not over all p variables but just over mtry randomly chosen ones (where, e.g., mtry = ⌊p/3⌋).
- No pruning necessary. Trees can be grown until each node contains just very few observations (1 or 5).
- Random forests tend to produce better predictions than bagging.
- Results are often not sensitive to the only tuning parameter mtry.
- Implemented in the randomForest library.

Even more random methods, e.g. extremely randomized trees:
- For each split point, sample mtry variables each with a random value to split on, and pick the best one.
- Often works even when mtry equals 1!

Random forests often produce state-of-the-art results, and are among the top performing methods in machine learning competitions.

Breiman (2001), Geurts et al (2006)
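To make the single-tree versus forest comparison concrete, here is a minimal R sketch (not from the slides) in the spirit of Breiman's Table 2: hold out a random 10% of a small dataset, fit a CART tree and a random forest on the remaining 90%, and compare the test errors. The rpart package and the iris data are illustrative choices.

library(rpart)          # CART-style single tree
library(randomForest)   # Breiman and Cutler's random forest

set.seed(1)
test <- sample(nrow(iris), round(0.1 * nrow(iris)))   # random 10% held out

tree   <- rpart(Species ~ ., data = iris[-test, ])
forest <- randomForest(Species ~ ., data = iris[-test, ])

mean(predict(tree, iris[test, ], type = "class") != iris$Species[test])   # single tree test error
mean(predict(forest, iris[test, ]) != iris$Species[test])                 # forest test error

In Breiman's experiment this procedure is repeated 100 times and the errors averaged; a single split, as here, only illustrates the mechanics.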

Comparison of 179 classifiers on 121 datasets: random forests come top, with SVMs close behind. (From Delgado et al, 2014.)

Boston Housing data, again. Looking at the Boston Housing data (and at the help page for randomForest):

library(MASS)            # provides the Boston data
library(randomForest)
y <- Boston[, 14]
x <- Boston[, 1:13]
?randomForest

randomForest             package:randomForest             R Documentation

Classification and Regression with Random Forest

Description:

     'randomForest' implements Breiman's random forest algorithm (based
     on Breiman and Cutler's original Fortran code) for classification
     and regression. It can also be used in unsupervised mode for
     assessing proximities among data points.

Usage:

     ## S3 method for class 'formula':
     randomForest(formula, data = NULL, ..., subset, na.action = na.fail)

     ## Default S3 method:
     randomForest(x, y = NULL, xtest = NULL, ytest = NULL, ntree = 500,
                  mtry = if (!is.null(y) && !is.factor(y))
                             max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
                  replace = TRUE, classwt = NULL, cutoff, strata,
                  sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
                  nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                  importance = FALSE, localImp = FALSE, nPerm = 1,
                  proximity = FALSE, oob.prox = proximity,
                  norm.votes = TRUE, do.trace = FALSE,
                  keep.forest = !is.null(y) && is.null(xtest), corr.bias = FALSE,
                  keep.inbag = FALSE, ...)
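A quick sanity check (not from the slides) on the mtry default in the Usage block above: for the 13 Boston predictors, regression mode uses max(floor(13/3), 1) = 4 variables per split, which is exactly the "No. of variables tried at each split: 4" reported in the output on the next slide; classification mode would use floor(sqrt(13)) = 3.

p <- ncol(x)                  # 13 predictors in the Boston data
max(floor(p / 3), 1)          # default mtry in regression mode: 4
floor(sqrt(p))                # default mtry in classification mode: 3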

Boston Housing data, again.

rf <- randomForest(x, y)
print(rf)

Call:
 randomForest(x = x, y = y)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 10.26161
                    % Var explained: 87.84

Can plot the predicted values (out-of-bag estimation) vs. true values by

plot(predict(rf), y)
abline(c(0, 1), col = 2)

Same if treating the training data as new data:

plot(predict(rf, newdata = x), y)
abline(c(0, 1), col = 2)

[Plots: out-of-bag predictions vs. y (out-of-bag error), and predictions on the training data vs. y (training error).]

Try mtry = 2:

(rf <- randomForest(x, y, mtry = 2))

Call:
 randomForest(x = x, y = y, mtry = 2)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 12.17176
                    % Var explained: 85.58

Try mtry = 4:

(rf <- randomForest(x, y, mtry = 4))

Call:
 randomForest(x = x, y = y, mtry = 4)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 10.01574
                    % Var explained: 88.14

And mtry = 8 and 10:

(rf <- randomForest(x, y, mtry = 8))

Call:
 randomForest(x = x, y = y, mtry = 8)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 8

          Mean of squared residuals: 9.552806
                    % Var explained: 88.68

(rf <- randomForest(x, y, mtry = 10))

Call:
 randomForest(x = x, y = y, mtry = 10)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 9.774435
                    % Var explained: 88.42

mtry is the only real tuning parameter and typically performance is not sensitive to its choice (tuneRF can be used to select it automatically).
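Since the slide mentions tuneRF, here is a minimal sketch (not from the slides) of letting it search over mtry by out-of-bag error; the argument values are illustrative choices, see ?tuneRF.

set.seed(1)
tuned <- tuneRF(x, y, ntreeTry = 200, stepFactor = 2, improve = 0.01)
tuned          # OOB error for each value of mtry tried

tuneRF starts from the default mtry and multiplies or divides it by stepFactor for as long as the out-of-bag error keeps improving by at least the improve fraction.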

Variable "Importance"

Despite the better predictive performance, single trees seem to have an edge over tree ensembles in terms of interpretability. How to interpret a forest of trees?

Idea: denote by ê the out-of-bag estimate of the loss when using the original data samples. For each variable k ∈ {1, ..., p}:
- permute randomly the k-th predictor variable to generate a new set of samples (X̃_1^(k), Y_1), ..., (X̃_n^(k), Y_n), i.e. the k-th coordinate of X̃_i^(k) is X_{τ(i),k} for a random permutation τ (all other coordinates are left unchanged);
- compute the out-of-bag estimate ê_k of the prediction error with these new samples.

A measure of importance of variable k is then ê_k − ê, the increase in error rate due to a random permutation of the k-th variable. (A simplified sketch of this idea is given after this slide.)

Example for the Boston Housing data:

rf <- randomForest(x, y, importance = TRUE)
varImpPlot(rf)

[Variable importance plot (%IncMSE), largest to smallest: rm, lstat, dis, nox, crim, ptratio, age, tax, black, indus, rad, chas, zn.]

Random Forests and Local Smoothing

Let P(x, x_i) ∈ [0, 1] be the proportion of trees for which a vector x falls into the same final leaf node as the training vector x_i. P(x, x_i) is a proximity value, and tends to be large when x and x_i are close.

If every leaf node contains the same number of training samples, the prediction of random forests (in regression mode) at x is

    \hat{f}^{RF}(x) = \frac{\sum_{i=1}^{n} P(x, x_i)\, y_i}{\sum_{i=1}^{n} P(x, x_i)},

which is a local smoothing estimate.

If the nodes contain different numbers of original observations, P(x, x_i) is a weighted proportion of trees, where the weight of a tree is inversely proportional to the number of samples in the leaf node containing x.

For classification, the prediction will be the weighted majority vote, where again the weights are proportional to the proximities P(x, x_i).

Nearest neighbours and local smoothing techniques do not scale to very large datasets, and approximate techniques for these often rely on tree data structures, e.g. kd-trees, cover trees, ball trees. Random forests and other randomized trees can be interpreted in this way.
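A simplified sketch (not the exact computation inside randomForest, which permutes within each tree's out-of-bag samples) of the permutation-importance idea above: permute one predictor at a time and record the increase in prediction error. For brevity the baseline here uses predictions on the training data rather than out-of-bag estimates, so it only illustrates the mechanics.

set.seed(1)
rf <- randomForest(x, y)
base_mse <- mean((predict(rf, newdata = x) - y)^2)     # baseline error (training data as new data)
imp <- sapply(seq_len(ncol(x)), function(k) {
  xk <- x
  xk[, k] <- sample(xk[, k])                           # randomly permute the k-th predictor
  mean((predict(rf, newdata = xk) - y)^2) - base_mse   # increase in error, ê_k - ê
})
names(imp) <- colnames(x)
sort(imp, decreasing = TRUE)                           # compare with importance(rf) / varImpPlot(rf)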

Ensemble Methods

Bagging and random forests are examples of ensemble methods, where predictions are based on an ensemble of many individual predictors.
Many other ensemble learning methods: boosting, stacking, mixture of experts, Bayesian model combination, Bayesian model averaging, etc.
Often gives a significant boost to predictive performance.

Stacking

Also called stacked generalization. Use the outputs of M learning algorithms as inputs to a combiner learner. Often, logistic regression is used as the combiner.

[Diagram: base-learner outputs q^(1) = P̂^(1)(Y = 1 | X = x), ..., q^(M) = P̂^(M)(Y = 1 | X = x) are combined with weights w_1, ..., w_M through a sigmoid s(·) to give the stacked estimate P̂(Y = 1 | X = x).]

Top entries for the $1M Netflix competition used a form of stacking (Sill et al, 2009).

Dropout Training of Neural Networks

Neural network with a single layer of hidden units:
- Hidden unit activations: h_{ik} = s\left( b^h_k + \sum_{j=1}^{p} W^h_{jk} x_{ij} \right)
- Output probability: \hat{y}_i = s\left( b^o + \sum_{k=1}^{m} W^o_k h_{ik} \right)

[Diagram: inputs x_{i1}, ..., x_{ip}, hidden units h_{i1}, ..., h_{im}, output ŷ_i; dropout removes a random subset of the hidden units.]

Large, overfitted networks often have co-adapted hidden units: what each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units. Co-adaptation can be prevented by randomly dropping out units from the network.

Model as an ensemble of networks:

    p(y_i = 1 \mid x_i, \theta) = \sum_{b \subseteq \{1, \dots, m\}} q^{|b|} (1 - q)^{m - |b|}\, p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)

Weight sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.

Hinton et al (2012).
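A minimal stacking sketch (not from the slides): out-of-fold class-1 probabilities from two base learners (a random forest and a single tree, illustrative choices) are used as inputs to a logistic regression combiner, on a binary task carved out of the iris data.

library(randomForest)
library(rpart)

set.seed(1)
d <- droplevels(subset(iris, Species != "setosa"))       # binary task: versicolor vs virginica
d$label <- factor(as.integer(d$Species == "virginica"))
d$Species <- NULL
folds <- sample(rep(1:5, length.out = nrow(d)))          # 5-fold split for out-of-fold predictions

q1 <- q2 <- numeric(nrow(d))                             # q^(m) = P̂^(m)(Y = 1 | X = x), m = 1, 2
for (v in 1:5) {
  train <- d[folds != v, ]
  test  <- d[folds == v, ]
  rf <- randomForest(label ~ ., data = train)
  tr <- rpart(label ~ ., data = train, method = "class")
  q1[folds == v] <- predict(rf, test, type = "prob")[, "1"]
  q2[folds == v] <- predict(tr, test)[, "1"]
}

ybin <- as.integer(d$label == "1")
combiner <- glm(ybin ~ q1 + q2, family = binomial)       # logistic regression as the combiner
coef(combiner)                                           # learned stacking weights

In practice the base learners would then be refit on all of the data, and the combiner applied to their predictions on new inputs.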

Dropout Training of Neural Networks

[Figure from Hinton et al: dropout results on classification of phonemes in speech.]

Boosting

Boosting is an iterative ensemble learning technique. At iteration t, the predictor is (with 0 < ν ≤ 1, typically small, say ν = 0.1)

    \hat{f}_t(x) = \sum_{\tau=1}^{t} \nu\, \hat{f}_\tau(x),

where \hat{f}_\tau denotes the function fitted by the base learner at stage τ.

For regression, L2-boosting works as follows (a code sketch is given after the loss functions below):
1. Fit a first function to the data {(x_i, y_i)}_{i=1}^{n} with the base learner, yielding f̂_1.
2. For t = 2, 3, ..., T:
   (a) compute the current residuals u_i = y_i - f̂_{t-1}(x_i);
   (b) fit the residuals {(x_i, u_i)}_{i=1}^{n}, obtaining f̂_t(x).

Boosting is a bias-reduction technique, as opposed to bagging and dropout.
Boosting works well with simple base learners with low variance and high bias, e.g. decision stumps.
Implemented in the mboost library.

Types of Boosting

L2-Boosting: the squared loss function in regression,

    R(f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.

LogitBoost: the logistic loss function (binary classification, y_i ∈ {-1, 1}),

    R(f) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i f(x_i))).

AdaBoost: the exponential loss function (binary classification, y_i ∈ {-1, 1}),

    R(f) = \frac{1}{n} \sum_{i=1}^{n} \exp(-y_i f(x_i)).

Freund and Schapire (1995).
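A minimal sketch of the L2-boosting loop above, with tree stumps (rpart trees of depth 1) as the base learner; T and ν are illustrative choices, and x, y are the Boston predictors and response from earlier.

library(rpart)

l2boost <- function(x, y, T = 100, nu = 0.1) {
  fhat <- rep(0, length(y))                       # current ensemble prediction
  stumps <- vector("list", T)
  for (t in 1:T) {
    u <- y - fhat                                 # current residuals
    stumps[[t]] <- rpart(u ~ ., data = data.frame(x, u = u),
                         control = rpart.control(maxdepth = 1))
    fhat <- fhat + nu * predict(stumps[[t]], newdata = x)   # nu-shrunken update
  }
  list(stumps = stumps, nu = nu, fitted = fhat)
}

fit <- l2boost(x, y)
mean((fit$fitted - y)^2)      # training MSE after T iterations

With a small ν many iterations are needed; how to choose when to stop is discussed on the following slides.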

blackboost: Boosting of Regression Trees

library(mboost)
n <- length(y)                 ## number of observations
Mvec <- 1:500                  ## Mvec is vector with various stopping times
nM <- length(Mvec)             ## number of possible stopping times
loss <- numeric(nM)            ## loss contains the training error
losscv <- numeric(nM)          ## losscv contains the validation error
V <- 10                        ## 10-fold cross-validation

for (mc in 1:nM){              ## loop over stopping times (not efficient)
  yhat <- numeric(n)           ## yhat are the fitted values
  yhatcv <- numeric(n)         ## yhatcv the cross-validated fitted values
  M <- Mvec[mc]                ## use M iterations

  ## indCV contains the 'block' in 1,...,10 each observation falls into
  indCV <- sample(rep(1:V, each = ceiling(n/V)), n)
  for (cv in 1:V){             ## loop over all blocks
    bb <- blackboost(y[indCV != cv] ~ ., data = x[indCV != cv, ],
                     control = boost_control(mstop = M))
    ## predict the unused observations
    yhatcv[indCV == cv] <- predict(bb, x[indCV == cv, ])
  }
  losscv[mc] <- sqrt(mean((y - yhatcv)^2))   ## CV test error

  bb <- blackboost(y ~ ., data = x, control = boost_control(mstop = M))
  yhat <- predict(bb, x)
  loss[mc] <- sqrt(mean((y - yhat)^2))       ## training error
}

[Figure: the 0-1 loss and surrogate loss functions (exponential, hinge, logistic, truncated quadratic) plotted against the margin α = y f(x). A classification error is made if and only if the margin is negative; "logistic" is the (scaled) negative log-likelihood under a logistic regression model, "hinge" the piecewise-linear loss used in the support vector machine, and "exponential" the loss used by the AdaBoost algorithm.]

Plot of validation error in red and training error in black as functions of the number of boosting iterations:

matplot(cbind(loss, losscv), type = "p", lwd = 2, col = c(1, 2), lty = 1)
abline(h = sqrt(mean((predict(rf) - y)^2)), lwd = 1, lty = 2)

[Plot: training RMSE (black) and cross-validated RMSE (red) against the number of boosting iterations; the dashed horizontal line is the out-of-bag RMSE of the random forest.]

RF & Boosting: Summary

Both RF and Boosting are tree ensembles.
Like RF, Boosting does not seem to overfit (the CV curve stays flat). This is not quite true, though: consider lim_{t→∞} f̂_t(X_i). Needs early stopping!
The stopping parameter T needs to be adjusted by either
- cross-validation, which is computationally expensive, or
- model selection, which does not work very well for trees as base learners (what are the degrees of freedom of a tree?).
Predictive performance is usually similar.
Properties of Boosting (and why it is successful) are rather well understood (e.g. by bias reduction), but remain more of a mystery for RF.
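Instead of the hand-written cross-validation loop above, mboost can select the stopping iteration with cvrisk. A minimal sketch, assuming x and y are as before; the fold type and the initial mstop are illustrative choices (see ?cvrisk):

library(mboost)
bb <- blackboost(y ~ ., data = x, control = boost_control(mstop = 500))
cvm <- cvrisk(bb, folds = cv(model.weights(bb), type = "kfold"))  # 10-fold CV over the boosting iterations
plot(cvm)                      # CV risk as a function of the number of iterations
mstop(cvm)                     # cross-validated stopping iteration
bb[mstop(cvm)]                 # set the model to the selected stopping iteration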

