Prediction and Interpretation for Machine Learning Regression Methods


Paper 1967-2018

Prediction and Interpretation for Machine Learning Regression Methods

D. Richard Cutler, Utah State University

ABSTRACT

The last 30 years have seen extraordinary development of new tools for the prediction of numerical and binary responses. Examples include the LASSO and elastic net for regularization in regression and variable selection, quantile regression for heteroscedastic data, and machine learning predictive methods such as classification and regression trees (CART), multivariate adaptive regression splines (MARS), random forests, gradient boosting machines (GBM), and support vector machines (SVM). All these methods are implemented in SAS, giving the user an amazing toolkit of predictive methods. In fact, the set of available methods is so rich that it begs the question, "When should I use one or a subset of these methods instead of the other methods?" In this talk I hope to provide a partial answer to this question through the application of several of these methods in the analysis of several real datasets with numerical and binary response variables.

INTRODUCTION

Over the last 30 years there has been substantial development of regression methodology for regularization of the estimation in the multiple linear regression model and for carrying out non-linear regression of various kinds. Notable contributions in the area of regularization include the LASSO (Tibshirani 1996), the elastic net (Zou and Hastie 2005), and least angle regression (Efron et al. 2002), which is both a regularization method and a series of algorithms that can be used to efficiently compute LASSO and elastic net estimates of regression coefficients.

An early paper on non-linear regression via scatter plot smoothing and the alternating conditional expectations (ACE) algorithm is due to Breiman and Friedman (1985). Hastie and Tibshirani (1986) extended this approach to create generalized additive models (GAM). An alternative approach to non-linear regression, using binary partitioning, is regression trees (Breiman et al. 1984). Multivariate adaptive regression splines (MARS) (Friedman 1991) extended generalized linear and generalized additive models in the direction of modeling interactions, and considerable research on tree methods, notably ensembles of trees, resulted in the development of gradient boosting machines (GBM) (Friedman 2000) and random forests (Breiman 2001). A completely different approach, based on non-linear projections, is support vector machines, the modern development of which is usually credited to Vapnik (1995) and Cortes and Vapnik (1995).

All of the methods listed above, and more, are implemented in SAS and other statistical packages, giving statisticians a very large toolkit for analyzing and understanding data with a continuous (interval-valued) response variable. In SAS, using the LASSO or fitting a regression tree or random forests is no harder than fitting an ordinary multiple regression with some traditional variable selection. The LASSO has rapidly become a "standard" method for variable selection in regression, and all of these methods lend themselves to larger datasets, where there is so much information that statistical significance by itself is not a useful guide.

In this paper I hope to illustrate the use of some of these methods for the analysis of real datasets.

GETTING STARTED

In the spirit of the "Getting Started" section of SAS procedure manual entries, we begin with a simple example that illustrates how tree methods can provide insight in situations where linear methods are less effective. The data concern credit card applications to a bank in Australia (Quinlan, 1987). The response variable is coded as "Yes" if the application was approved and "No" if it was not approved. There are 15 predictor variables, denoted by A1-A15, some categorical and some numerical. For proprietary reasons the nature of the variables is not available. We note that variables A9 and A10 are coded as 't' and 'f', which we take to mean 'true' and 'false.' A total of 666 observations had no missing values, and of those 299 persons were approved for credit cards and 367 were not.

A first step in a traditional analysis might be to fit a logistic regression, perhaps with some form of variable selection. For this example I used backward elimination with a significance level to stay of 0.05. The code is given below:

proc logistic data=CRX;
   class A1 A4-A7 A9 A10 A12 A13 / param=glm;
   model Approved (event='Yes') = A1-A15 / ctable pprob=0.5 selection=b slstay=0.05;
   roc;
run;

Eight variables were removed from the model. From the output for the ctable option we obtain the classification accuracy metrics for the fitted model.

Table 1. Classification accuracy for the logistic regression on the credit card approval data.

   Prob Level   Correct (%)   Sensitivity   Specificity   False POS   False NEG
   0.500        87.4          90.7          84.8          17.3        8.1

The accuracy of the predictions is quite good, with an overall percent correct of 87.4% (which means the overall error rate is 12.6%, or 0.126), and both the sensitivity (percent of approvals correctly predicted) and specificity (percent of non-approvals correctly predicted) are quite high, at 90.7% and 84.8%, respectively. The receiver operating characteristic (ROC) curve is a graphical representation of the quality of the fit of a predictive model for a binary target (response). Figure 1 shows the ROC curves for all the steps in the variable elimination process overlaid. It is clear from this graph that the variables eliminated from the model were not contributing to the predictive accuracy and that the overall fit of the logistic regression model is rather good. The AUC value of 0.9463 for the model is high.
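For reference, the percentages reported in Table 1 correspond to the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at the 0.5 probability cutoff, with the false positive and false negative percentages taken relative to the predicted classes:

    \text{Correct} = \frac{TP + TN}{n}, \qquad
    \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
    \text{Specificity} = \frac{TN}{TN + FP},

    \text{False POS} = \frac{FP}{TP + FP}, \qquad
    \text{False NEG} = \frac{FN}{TN + FN}.

As a quick check, with 299 approvals and 367 non-approvals (n = 666), a sensitivity of 90.7% and a specificity of 84.8% correspond to roughly 271 correctly predicted approvals and 311 correctly predicted non-approvals, or about 582/666 = 87.4% correct overall, matching the reported accuracy.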

Figure 1. ROC curves for the logistic regression model with variable selection.

[Overlaid ROC curves, sensitivity against 1 - specificity, for all model-building steps. Areas under the curves: Step 0 (0.9506), Step 1 (0.9505), Step 2 (0.9504), Step 3 (0.9504), Step 4 (0.9502), Step 5 (0.9498), Step 6 (0.9470), Step 7 (0.9465), final model (0.9463).]

Table 2 contains the estimated coefficients for the variables remaining in the model. From this table it is relatively difficult to tell which variables are most important for determining whether a credit card application will be approved or denied.

Table 2. Variable coefficient estimates, standard errors, and P-values.

   Analysis of Maximum Likelihood Estimates

   Parameter   DF   Estimate    Standard Error   Wald Chi-Square   Pr > ChiSq
   ...
   A4 u         1    0.8882         0.3224             7.5884         0.0059
   A4 y         0    0                   .                  .              .
   A6 ?         1    1.9622         1.0636             3.4034         0.0651
   A6 ff        1   -4.2731         1.0310            17.1793         <.0001
   ...
   A6 x         0    0                   .                  .              .
   A9 f         1   -3.7630         0.3205           137.8444         <.0001
   A9 t         0    0                   .                  .              .
   A11          1    0.1644         0.0463            12.5904         0.0004
   A14          1   -0.00220        0.000881           6.2597         0.0124
   A15          1    0.000562       0.000182           9.5926         0.0020

An alternative method that one might apply in this situation is a decision tree (Breiman et al. 1984). Decision trees (also known as classification and regression trees) work by recursive partitioning of the data into groups ("nodes") that are increasingly homogeneous with respect to some criterion, such as mean squared error for regression trees and either entropy or the Gini index for classification trees.
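For a classification node in which the K classes occur in proportions p_1, ..., p_K, these two impurity measures are

    \text{Gini} = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2, \qquad
    \text{Entropy} = -\sum_{k=1}^{K} p_k \log p_k,

and each candidate split is evaluated by how much it reduces the impurity of the parent node relative to the child nodes it creates.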

Ultimately the fitted tree is "pruned" back to remove branches and leaves of the tree that are just fitting noise in the data. The pruning process is a critical part of fitting a classification tree: unpruned trees overfit the data and are less accurate predictors for new data. The approach of segmenting the data space is quite different from that of fitting linear, quadratic, or additive functions to the predictor variables. In cases where there are strong interactions among predictor variables, classification trees can outperform linear and quasi-linear methods.

The first step in the fitting of a decision tree is to determine the appropriate size of the fitted tree. A plot of the cross-validated error rate against the size of the fitted tree is obtained using the code below:

proc hpsplit data=CRX cvmethod=random(10) seed=123
             cvmodelfit plots(only)=cvcc;
   class Approved A1 A4-A7 A9 A10 A12 A13;
   model Approved (event='Yes') = A1-A15;
   grow gini;
run;

Figure 2. Cross-validated error plotted against the size (number of leaves) of the fitted trees.
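The same size-selection idea can be sketched outside SAS. The code below is a minimal illustration, not a reproduction of PROC HPSPLIT: it assumes the predictors have already been numerically encoded into an array X, with the approval status in y, and it scores trees of increasing size by 10-fold cross-validation.

# Minimal sketch: cross-validated error rate as a function of tree size.
# X (encoded predictors) and y (approval status, 0/1) are assumed to be
# numeric arrays; this mirrors the idea behind the plot in Figure 2.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_error_by_size(X, y, max_leaves=20, folds=10, seed=123):
    """Return {number of leaves: 10-fold cross-validated error rate}."""
    errors = {}
    for n_leaves in range(2, max_leaves + 1):
        tree = DecisionTreeClassifier(criterion="gini",
                                      max_leaf_nodes=n_leaves,
                                      random_state=seed)
        accuracy = cross_val_score(tree, X, y, cv=folds, scoring="accuracy")
        errors[n_leaves] = 1.0 - accuracy.mean()
    return errors

Plotting these error rates against the number of leaves gives a curve analogous to Figure 2, from which the minimum-error size and the 1-SE size can be read off.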

The plot shows that the minimum cross-validated error rate is achieved by a tree with just 5 leaves, which is a very small tree, and the 1-SE rule of Breiman et al. (1984) selects a tree with just two leaves. (The 1-SE rule chooses the smallest tree whose cross-validated error is within one standard error of the minimum cross-validated error.) That is, the tree splits the data just once. Usually, for large datasets, one would not expect such small trees to be effective predictors, but for these data they are, and they provide us with some insight into the data.

The tree with just two leaves (terminal nodes) splits on the variable A9. Among the persons with a value of 't' on A9, 79.55% were approved for a credit card, whereas of the persons with a value of 'f' on this variable, only 6.45% were approved for credit cards. One can only speculate as to what this question was, with 't' and 'f' being its only possible responses. The overall error rate for this simple split of the data is 13.74%, which is very comparable to the 12.6% for the logistic regression model.

How much can the error rate be reduced by using additional variables? The surprising answer is, "not much." The decision trees with 5 and 10 leaves have error rates of 14.36% and 14.44%, respectively, no better than, and perhaps a smidge worse than, the error rate for the simplest decision tree with just two leaves. Even random forests, one of the most accurate machine learning predictive methods, can only reduce the error rate to 12.5%. What this means is that nearly all of the information in these data about the approval or lack of approval of a credit card application is contained in the single variable A9, and that a very simple decision tree identified this piece of information immediately.

PREDICTION OF WINE QUALITY

The second example of applying machine learning methods for prediction concerns data on the quality of white wine in Portugal (Cortez et al. 2009). The response is the quality of the wine sample on a scale of 0-10, with 10 being the highest quality. The median value of the scores of 3 experts was used. Predictor variables are chemical and physical characteristics of the wine samples, including pH, density, alcohol content (as a percentage), chlorides, sulfates, total and free sulfur dioxide, citric acid, residual sugar, and volatile acidity.

In ordinary multiple linear regression we minimize the residual sum of squares,

    \sum_{i=1}^{n} \Big( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \Big)^2,

with respect to \beta_0, \beta_1, \ldots, \beta_p to obtain the least squares estimates of \beta_0, \beta_1, \ldots, \beta_p. The LASSO adds a penalty term to the residual sum of squares. That is, we minimize

    \sum_{i=1}^{n} \Big( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| .
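The penalty parameter λ is nonnegative and controls the amount of shrinkage, with two limiting cases:

    \lambda = 0 \;\Rightarrow\; \hat{\beta}^{\text{LASSO}} = \hat{\beta}^{\text{OLS}}, \qquad
    \lambda \to \infty \;\Rightarrow\; \hat{\beta}_1 = \cdots = \hat{\beta}_p = 0,

and for intermediate values of λ some coefficients are shrunk exactly to zero, which is what makes the LASSO useful for variable selection.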

The parameter λ is varied, and a specific value may be chosen by some criterion, such as AIC or SBC, or to minimize cross-validated prediction error. In the SAS code below, the value of the LASSO parameter is selected by minimizing cross-validated error. Fifty distinct values of λ are tried. A plot of the coefficients as a function of λ follows the code.

title2 "Regression with LASSO and 10-fold Cross-validation";
proc glmselect data=sasgf.WhiteWine plots=coefficients;
   model Quality = fixed acidity volatile acidity citric acid
                   residual sugar chlorides free sulfur dioxide
                   total sulfur dioxide density pH sulphates
                   alcohol / selection=LASSO(choose=cvex steps=50)
                             cvmethod=split(10);
run;

Figure 3. Values of the regression coefficients for different values of the LASSO parameter λ.

From this plot we see that alcohol is the first variable to have a non-zero coefficient as λ decreases, and volatile acidity is the second such variable. For this model the cross-validated prediction error (CVEX PRESS) is 0.5679, and the final model contains all the predictor variables except citric acid. The LASSO estimates of the regression coefficients are given in Table 3.
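The same strategy, choosing the LASSO penalty by 10-fold cross-validation over a grid of candidate values, can also be sketched outside SAS. The code below is an illustration with scikit-learn rather than a reproduction of PROC GLMSELECT's CVEX criterion; the array names X (the eleven chemical measurements) and y (the quality scores) are assumptions.

# Minimal sketch: LASSO with the penalty chosen by 10-fold cross-validation.
# X (predictors) and y (quality scores) are assumed to be numeric arrays.
# Predictors are standardized first because the L1 penalty is scale sensitive.
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),
    LassoCV(n_alphas=50, cv=10, random_state=123),  # 50 candidate penalties, 10 folds
)
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("chosen penalty:", lasso.alpha_)
print("coefficients (standardized scale):", lasso.coef_)

Because the predictors are standardized inside the pipeline, the resulting coefficients are on a different scale from the SAS estimates in Table 3, although the pattern of signs and of coefficients shrunk to zero should be similar.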

Increased quality is associated with larger values of alcohol, residual sugar, and pH. Increased quality is associated with smaller values of density, chlorides, and volatile acidity. The coefficient for density is very large relative to the other regression coefficients, but that is only a reflection of the fact that the differences in densities among the wines are very, very small. Fixing the random seed for the 10-fold cross-validation ensures that we are able to replicate results exactly when repeating the analysis.

Table 3. LASSO estimates of regression coefficients for the white wine data.

   Parameter              DF   Estimate
   fixed acidity           1    0.037306
   volatile acidity        1   -1.873477
   residual sugar          1    0.069450
   chlorides               1   -0.357294
   free sulfur dioxide     1    0.003608
   total sulfur dioxide    1        ...
   density                 1        ...
   pH                      1        ...
   sulphates               1    0.570229
   alcohol                 1    0.224899

The second step in the analysis is to fit a regression tree to the data. As was the case in the first example, the first step in fitting the tree is to determine how large the tree should be. Sample code for doing this is provided below:

title2 "Determining Appropriate Size of the Tree";
proc hpsplit data=sasgf.WhiteWine cvmethod=random(10) seed=123
             cvmodelfit intervalbins=10000;
   model Quality = fixed acidity volatile acidity citric acid
                   residual sugar chlorides free sulfur dioxide
                   total sulfur dioxide density pH sulphates alcohol;
run;

By default PROC HPSPLIT "bins" the values of each numerical predictor variable into 100 bins of equal width across the range of the predictor variable. This is a small departure from the original algorithm of Breiman et al. (1984), in which the values of each numerical predictor variable are completely sorted. This modification makes perfect sense for very large datasets for which the cost of complete sorting would be prohibitive. For moderate sample sizes I prefer the original algorithm, and by selecting intervalbins=10000 I am effectively making it so there are only 1 or 2 observations per bin, and hence coming close to a full sort of the predictor variables.
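As a purely illustrative sketch of what equal-width binning does (this is not PROC HPSPLIT's internal code), the candidate split points for one numeric predictor x can be generated as follows; the function name is hypothetical.

# Illustrative sketch of equal-width binning for the split search.
# x is assumed to be a 1-D numpy array holding one numeric predictor.
import numpy as np

def equal_width_split_points(x, n_bins=100):
    """Interior edges of n_bins equal-width bins over the range of x."""
    return np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]

With 100 bins only 99 candidate split points are examined per variable, whereas a complete sort considers a potential split between every pair of adjacent distinct values; making the number of bins very large, as intervalbins=10000 does above, effectively recovers the latter behavior.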

The plot of cross-validated error against tree size is given below (Figure 4). The cross-validated error is minimized for a large tree that has 57 leaves, but the 1-SE rule of Breiman et al. (1984) selects a much smaller tree with only 5 leaves (Figure 5).

Figure 4. Cross-validated error against tree size for the white wine data.

In Figure 5 we see that at the root node, node 0, there are 4898 observations and the average quality score is 5.8779. The first split is on alcohol, at a value of 10.801. For the 3085 wines with alcohol < 10.801 the average quality score is 5.6055, whereas for the 1813 wines with alcohol ≥ 10.801 the average quality score is 6.3414. Thus the wines with higher alcohol content are rated higher, on average, and this result is consistent with the positive coefficient for alcohol in the regression. The difference between these two values may seem modest, but the vast majority of the wines have scores in the range 5-8.

For the wines with alcohol < 10.801 the next split is on volatile acidity at a value of 0.250. The 1475 wines with volatile acidity < 0.250 have an average quality score of 5.8725, while the 1610 wines with volatile acidity ≥ 0.250 have an average score of 5.3609. This is consistent with the regression results, in which volatile acidity had a negative coefficient.

The second split, for the wines with alcohol ≥ 10.801, is on free sulfur dioxide and is much less interesting, because only 114 out of the 1813 observations end up in one of the two nodes created by the split at free sulfur dioxide = 11.012.

The cross-validated prediction error for the regression tree with 5 leaves is 0.5892, which is slightly larger than the value of 0.5679 for the regression using LASSO estimates of the coefficients. By fitting a much larger regression tree, the prediction error may be reduced to 0.5485.

Figure 5. First two levels of the regression tree with 5 leaves (terminal nodes).

[Tree diagram: the root node splits on alcohol at 10.801; the alcohol < 10.801 branch splits on volatile acidity at 0.250, and the alcohol ≥ 10.801 branch splits on free sulfur dioxide at 11.012.]

The third step in the analysis is to apply random forests to determine whether higher predictive accuracy might be achieved. Random forests (Breiman, 2001) takes predictions from many classification or regression trees and combines them to construct more accurate predictions. The basic algorithm is as follows (a minimal sketch of these steps is given after the list):

1. Many random samples are drawn from the original dataset. Observations in the original dataset that are not in a particular random sample are said to be out-of-bag for that sample.
2. To each random sample a classification or regression tree is fit without any pruning.
3. The fitted tree is used to make predictions for all the observations that are out-of-bag for the sample the tree is fit to.
4. For a given observation, the predictions from the trees on all of the samples for which the observation was out-of-bag are combined. In regression this is accomplished by averaging the out-of-bag predictions; in classification it is achieved by "voting" the out-of-bag predictions, so the class that is predicted by the largest number of trees for which the observation is out-of-bag is the overall predicted value for that observation.

Many details are omitted from the discussion here, including the number of samples to be drawn from the original data, the size of those samples, whether the samples are drawn with or without replacement, and the number of variables available for the binary partitioning in each tree and at each node.
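To make steps 1-4 concrete, here is a minimal sketch of bagged regression trees with out-of-bag averaging. It is an illustration of the algorithm rather than of PROC HPFOREST, it omits the per-node random selection of candidate variables that distinguishes a random forest from plain bagging, and the names X and y for the predictor and response arrays are assumptions.

# Minimal sketch of steps 1-4: bagged regression trees with out-of-bag
# (OOB) averaging. Per-node random variable selection is omitted.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_bagged_trees(X, y, n_trees=200, seed=123):
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_sum = np.zeros(n)    # running sum of OOB predictions per observation
    oob_count = np.zeros(n)  # number of trees for which each observation is OOB
    for _ in range(n_trees):
        # Step 1: draw a bootstrap sample; observations not drawn are out-of-bag.
        in_bag = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), in_bag)
        # Step 2: fit an unpruned regression tree to the bootstrap sample.
        tree = DecisionTreeRegressor().fit(X[in_bag], y[in_bag])
        # Step 3: predict only the observations that are out-of-bag for this tree.
        if len(oob) > 0:
            oob_sum[oob] += tree.predict(X[oob])
            oob_count[oob] += 1
    # Step 4: average the OOB predictions for each observation.
    oob_pred = oob_sum / np.maximum(oob_count, 1)
    oob_ase = np.mean((y - oob_pred) ** 2)  # out-of-bag average square error
    return oob_pred, oob_ase

The quantity oob_ase in this sketch plays the same role as the out-of-bag average square error reported by PROC HPFOREST in Table 4 below.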

Random forests may be fit using PROC HPFOREST in SAS Enterprise Miner. Here is some sample code for the white wine data:

title1 "Fitting Regression Random Forests to White Wine Data";
proc hpforest data=sasgf.WhiteWine maxtrees=200 scoreprole=oob;
   input fixed acidity volatile acidity citric acid residual sugar
         chlorides free sulfur dioxide total sulfur dioxide
         density pH sulphates alcohol / level=interval;
   target Quality / level=interval;
run;

All the predictor variables are interval valued and go in a single input statement. If there were categorical variables, we would need a second input statement for those variables with the option level=nominal. The response variable, Quality, is also interval valued and goes into a target statement. The default number of subsets of the data and number of trees to fit is 200. The option scoreprole=oob asks for the out-of-bag error to be reported. One advantage of random forests over other machine learning algorithms is that nearly everything is automated, and default settings produce good results in a large number of problems and settings. Table 4 below contains some accuracy results.

Table 4. Random forests predictive accuracies for selected numbers of trees.

   Fit Statistics

   Number      Number       Average Square   Average Square
   of Trees    of Leaves    Error (Train)    Error (OOB)
    10           10472         0.08523          0.47121
    50           51571         0.06476          0.36657
   100          102746         0.06214          0.35143
   200          205276         0.06076          0.34452

For the full 200 trees the out-of-bag average square error, which plays the same role as the cross-validated prediction error for the multiple linear regression and regression trees, is 0.3445. This is quite a bit lower than the value of 0.5679 obtained from the regression using the LASSO and the values of 0.5892 and 0.5485 obtained for the regression trees with 5 and 57 leaves, respectively. Thus, this is one situation where the use of a high-level machine learning algorithm, such as random forests, gradient boosting machines, or support vector machines, can result in much higher predictive accuracy than that which traditional regression methods provide.

