Introduction To Statistical Learning


Introduction to Statistical Learning
Bin Li
IIT Lecture Series

What is statistical learning?

- Statistical learning is the science of learning from data using statistical methods.
  - Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data.
  - Predict whether a patient, hospitalized due to a heart attack, will have a second attack, based on the patient's demographics, diet and clinical measurements.
  - Identify the risk factors for prostate cancer.
  - Given a collection of text documents, organize them according to their content similarities.
- Statistical learning plays a key role in data mining, artificial intelligence and machine learning.
- We can divide statistical learning problems into supervised and unsupervised settings.
  - Supervised learning: both the predictors, the Xi's, and the response, Yi, are observed (e.g. regression/classification).
  - Unsupervised learning: only the Xi's are observed (e.g. clustering/market basket analysis).

Handwritten Digit Recognition

- Data come from handwritten ZIP codes on envelopes from U.S. postal mail.
- Each image is a segment from a five-digit ZIP code, isolating a single digit.
- The images are 16 x 16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255.
- Images are normalized to have approximately the same size and orientation.
- Task: predict, from the 16 x 16 matrix of pixel intensities, the identity of each image (0, 1, ..., 9).
- Results:
  - Single-layer neural network: 80.0%
  - Two-layer network: 87%
  - Constrained neural network: 98.4%
  - Tangent distance with 1-NN: 98.9%
  - Support vector machine: 99.2%

Handwritten Digit Recognition (cont.)

[Figure 11.9 from EOSL 2009: examples of training cases from ZIP code data; each image is a 16 x 16 8-bit grayscale representation.]

A Recent Project with Dr. Chakraborty

[Figure: spectral curves plotted against wavelength (nm), over the range 500-2500 nm.]

Statistical Science, 2001, Vol. 16, No. 3, 199-231
Statistical Modeling: The Two Cultures
Leo Breiman

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables. The values of the parameters are estimated from the data and the model is then used for information and/or prediction. Thus the black box is filled in like this:

    y <-- [linear regression / logistic regression / Cox model] <-- x

Model validation: yes-no using goodness-of-fit tests.

Data and the Black Box

Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

    y <-- nature <-- x

There are two goals in analyzing the data:

- Prediction: to be able to predict what the responses are going to be to future input variables.
- Information: to extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals.

The Data Modeling Culture

- Start by assuming a stochastic data model for the inside of the black box:

    y <-- [linear regression / logistic regression / Cox model] <-- x

- Estimate the parameters from the data; use the fitted model to do prediction and to do inference.
- Model validation: yes-no using goodness-of-fit tests and residual examination.
- Estimated culture population: 98% of all statisticians.

The Algorithmic Modeling Culture

- The analysis in this culture considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y:

    y <-- [decision trees / neural nets] <-- x

- Approximate the black box by some complicated function; estimate the function from the data by some algorithm. Both prediction and information are based on the fitted functions.
- Model validation: measured by predictive accuracy.
- Estimated culture population: 2% of statisticians, many in other fields.

Ozone Project

- Predictors: daily and hourly readings of over 450 meteorological variables for a period of seven years.
- Response: hourly values of ozone concentration in the Basin.
- Objective: predict ozone concentration 12 hours in advance.
- Training set: the first five years of data. Test set: the last two years of data.
- Model: multiple linear regressions (including quadratic terms and interactions) with variable selection.
- Results: a failure. The false alarm rate of the final predictor was too high.
- Q: What are the possible reasons that made MLR unsuccessful in the Ozone project?

Chlorine Project

- Predictors: mass spectrum, with molecular weight ranging from 30 to over 10,000.
- Response: contains chlorine or not.
- Training set: 25,000 compounds with known chemical structure and mass spectra. Test set: 5,000 known compounds.
- Models: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and decision trees.
- Results: LDA and QDA were difficult to adapt to the variable dimensionality. A decision tree with 1,500 yes-no questions succeeded, with 95% prediction accuracy.
- Q: What are the possible reasons that made the tree successful in the Chlorine project?

Perceptions on Statistical Analysis

- Focus on finding a good solution; that's what consultants get paid for.
- Live with the data before you plunge into modeling.
- Search for a model that gives a good solution, either algorithmic or a data model.
- Predictive accuracy on test sets is the criterion for how good the model is.
- Computers are an indispensable partner. Programming is a necessary skill for statisticians.

What was research in the university like?

- A friend of Leo Breiman, a prominent statistician from the Berkeley Statistics Department, visited him in Los Angeles in the late 1970s. After Breiman described the decision tree method, his friend's first question was, "What's the model for the data?"
- In the Annals of Statistics and JASA, almost every article contains a statement of the form: "Assume that the data are generated by the following model ..."
- Data modeling is treated as the template for statistical analysis.
- The conclusions are about the model's mechanism, not nature's mechanism.
- If the model is a poor emulation of nature, the conclusions may be wrong.

A Study for Gender Discrimination

A study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty.

All personnel files were examined and a database was set up, with salary as the response variable and 25 other variables which characterized academic performance, such as papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data, and the gender coefficient was significant at the 5% level. This was taken as strong evidence of sex discrimination.

A Study for Gender Discrimination (cont.)

- Can the data gathered answer the question posed?
- Is inference justified when your sample is the entire population?
- Should a data model be used?
- The deficiencies in the analysis occurred because the focus was on the model and not on the problem.

Problems in Current Data Modeling

- The linear regression model led to many erroneous conclusions that appeared in journal articles, waving the 5% significance level without knowing whether the model fit the data.
- The author set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit (i.e. the lack-of-fit test) did not reject linearity until the nonlinearity was extreme.
- An acceptable residual plot does not imply that the model is a good fit to the data.
- Published applications to data often show little care in checking model fit ... The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

Limitations of Data Modeling

- Data modeling enforces the form of the model.
- Relatively low prediction accuracy on data generated from complex systems.
- Old saying: "If all a man has is a hammer, then every problem looks like a nail."
- Approaching problems by looking for a data model imposes an a priori straitjacket that restricts the ability of statisticians to deal with a wide range of statistical problems.
- Takeaway message: to solve a wider range of data problems, we need a larger set of tools!

Estimating unknown function f

- Suppose we observe Yi and Xi = (Xi1, Xi2, ..., Xip) for i = 1, ..., n.
- We believe that there is a relationship between Y and at least one of the X's, so we model the relationship as

    Yi = f(Xi) + εi,  with E{εi} = 0,

  where f is an unknown function and ε is a random error.

[Figure from ISLR 2013: Income versus Years of Education, raw data (left) and with a fitted curve (right).]

Income vs. education and seniority

[Figure from ISLR 2013: 3-D surface of Income as a function of Years of Education and Seniority.]

Estimating unknown function f (cont.)

The accuracy of estimating f depends on

- the size of the variation of the εi's (figure panels with SD = 1.06 vs. SD = 0.10);
- the complexity of the fitted function f̂ (smoothing spans 1/4 vs. 2/3).

[Figure: scatterplots of Y against X with smoothing fits of span 1/4 and span 2/3, at noise SD 1.06 and 0.10.]

Why do we estimate f?

- Two main reasons: prediction and inference.
  - Prediction: make accurate predictions of Y based on a new value of X.
  - Inference: Which particular predictors actually affect the response? Is the relationship positive or negative? Is the relationship a simple linear one, or is it more complicated?
- Two examples:
  - Interested in predicting how much money an individual will donate, based on observations from 90,000 people on which we have recorded over 400 different characteristics. For a given individual, should I send out a mailing?
  - Wish to predict median house price based on 14 variables. Understand which factors have the biggest effect on the response and how big the effect is. For example, how much impact does a river view have on the house value?

How Do We Estimate f?

- Use the training data {(X1, Y1), (X2, Y2), ..., (Xn, Yn)} and a statistical method to estimate f.
- Two groups of statistical learning methods:
  - Parametric methods:
    - Make some assumption about the functional form of f (e.g. MLR).
    - Pros: estimating f reduces to estimating a set of parameters (a relatively easy task). Easy to interpret the model.
    - Cons: the form of the model is too rigid. Low prediction accuracy when f is complicated.
  - Non-parametric methods:
    - Do not make an explicit assumption about the functional form of f (e.g. neural network, tree).
    - Pros: can accurately fit a wider range of possible shapes of f.
    - Cons: a large number of observations is required to obtain an accurate estimate of f.
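The parametric vs. non-parametric contrast can be sketched in a few lines of R. The toy data, sample sizes and loess settings below are invented for illustration (they are not from the slides): a straight-line fit (parametric) and a loess smoother (non-parametric) are trained on the same nonlinear data and compared on held-out observations.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + 0.1 * x^2 + rnorm(n, sd = 0.5)   # a nonlinear "truth" plus noise
train <- data.frame(x = x[1:100],  y = y[1:100])
test  <- data.frame(x = x[101:200], y = y[101:200])

# Parametric: assume f is linear, so only two parameters are estimated.
fit.lm <- lm(y ~ x, data = train)
# Non-parametric: loess makes no global assumption about the form of f.
fit.lo <- loess(y ~ x, data = train,
                control = loess.control(surface = "direct"))

mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
c(linear = mse(fit.lm), loess = mse(fit.lo))
```

Because the underlying f here is far from linear, the rigid linear model carries large bias; with 100 training points the flexible smoother pays only a small variance price, matching the pros and cons listed above.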

A linear regression estimate

[Figure from ISLR 2013: a linear regression plane for Income against Years of Education and Seniority.]

Even if the standard deviation is low, we will still get a bad answer if we use the wrong model.

A thin-plate spline estimate

[Figure from ISLR 2013: a thin-plate spline surface for Income against Years of Education and Seniority.]

Non-linear regression methods are more flexible and can potentially provide more accurate estimates.

A poor estimate

[Figure from ISLR 2013: an overly wiggly surface for Income against Years of Education and Seniority.]

Non-linear regression methods can also be too flexible and produce poor estimates of f.

Trade-off between model flexibility and interpretability

[Figure from ISLR 2013: interpretability (vertical axis, low to high) against flexibility (horizontal axis, low to high). Subset Selection and the Lasso are highly interpretable but inflexible; Least Squares, Generalized Additive Models and Trees sit in the middle; Bagging, Boosting and Support Vector Machines are flexible but hard to interpret.]

Training vs. test error: Example 1

[Figure from ISLR 2013] Left: linear regression fit (orange) and two smoothing spline fits (blue and green). Right: training MSE (grey), test MSE (red), and the minimum possible test MSE (dashed), against flexibility.

Example 2 (f is close to linear)

[Figure from ISLR 2013] Left: linear regression fit (orange) and two smoothing spline fits (blue and green). Right: training MSE (grey), test MSE (red), and the minimum possible test MSE (dashed), against flexibility.

Example 3 (f is far from linear)

[Figure from ISLR 2013] Left: linear regression fit (orange) and two smoothing spline fits (blue and green). Right: training MSE (grey), test MSE (red), and the minimum possible test MSE (dashed), against flexibility.

Bias variance tradeoff

- Two competing forces govern the choice of learning method: bias and variance.
- Bias refers to the error introduced by modeling a real-life problem (which is usually extremely complicated) by a much simpler problem.
  - For example, linear regression assumes that there is a linear relationship between Y and X, which is unlikely in real life.
  - In general, the more flexible/complex a method is, the less bias it will have.
- Variance refers to how much your estimate of f would change if you had a different training data set.
  - In general, the more flexible a method is, the more variance it has.
- It can be shown that the expected MSE for a new Y at x_new is:

    E[MSE(x_new)] = Irreducible Error + Bias^2 + Variance
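The decomposition can be checked numerically. The following sketch is a toy setup of my own (not from the slides): at a fixed test point x.new, a deliberately rigid estimator (the training mean of Y, which ignores x) is refit over many training sets, and the simulated expected MSE is compared with irreducible error + bias^2 + variance.

```r
set.seed(1)
sigma <- 0.5                 # SD of the irreducible error
f <- function(x) x^2         # the assumed true regression function (made up)
x.new <- 0.9
reps <- 5000; n <- 30
fhat <- y.new <- numeric(reps)
for (r in 1:reps) {
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = sigma)
  fhat[r]  <- mean(y)                         # rigid estimate of f(x.new)
  y.new[r] <- f(x.new) + rnorm(1, sd = sigma) # a fresh response at x.new
}
mse   <- mean((y.new - fhat)^2)   # simulated E[MSE(x.new)]
bias2 <- (mean(fhat) - f(x.new))^2
vari  <- var(fhat)
c(MSE = mse, decomposition = sigma^2 + bias2 + vari)  # agree up to MC error
```

The rigid estimator has a large bias term but a small variance term; a more flexible estimator would trade the one for the other, which is exactly the tension described above.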

The Bias-Variance decomposition

- To minimize the expected loss, there is a tradeoff between the bias and the variance of a learning algorithm.
- Flexible models (e.g., many parameters or low regularization) have low bias and high variance.
- Rigid models (e.g., few parameters or large regularization) have high bias and low variance.

[Figure: bias-variance tradeoff in splines; a model with 24 Gaussian basis functions fit by regularized least squares to 100 datasets of N = 25 points each, at varying regularization (ln λ = 2.6, -0.31, -2.4).]

Bias, variance and MSE curves in examples 1-3

[Figure from ISLR 2013] Squared bias (blue), variance (orange) and test MSE (red) for examples 1-3. The vertical dotted line marks the flexibility level with the minimum test MSE.

The classification setting

- For a classification problem we can use the error rate, i.e.

    Error rate = (1/n) Σ_{i=1}^{n} I(yi ≠ ŷi)

- The error rate represents the misclassification rate.
- The Bayes error rate is the lowest possible error rate that could be achieved if we somehow knew exactly what the "true" probability distribution of the data looked like.
- By the Bayes rule:

    f̂(x) = arg max_k Pr(Y = k | X = x).

- The decision boundary between class k and class l is determined by the equation:

    Pr(Y = k | X = x) = Pr(Y = l | X = x).

- In real-life problems the Bayes error rate can't be calculated exactly.
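For a concrete case where the Bayes rule is known in closed form, take a made-up example (not from the slides): two equally likely classes with X | Y = 1 ~ N(1, 1) and X | Y = 0 ~ N(-1, 1). The posterior comparison reduces to the sign of x, and the Bayes error rate is pnorm(-1) ≈ 0.159. A quick R check of the error-rate formula:

```r
set.seed(1)
n <- 10000
y <- rbinom(n, 1, 0.5)                        # equal priors on classes 0 and 1
x <- rnorm(n, mean = ifelse(y == 1, 1, -1))   # X | Y = k ~ N(+/-1, 1)
yhat <- as.integer(x > 0)   # Bayes rule here: Pr(Y=1|X=x) > Pr(Y=0|X=x) iff x > 0
err <- mean(yhat != y)      # error rate = (1/n) * sum of I(y_i != yhat_i)
c(empirical = err, bayes = pnorm(-1))
```

Even the Bayes classifier misclassifies about 16% of cases in this setup; no classifier trained on data can do better on average, which is why the Bayes error rate serves as the benchmark.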

K-Nearest Neighbors (KNN)

- K-nearest neighbors is a flexible approach to estimating the Bayes classifier.
- For any given X we find the k closest neighbors to X in the training data, and examine their corresponding Y's.
- If the majority of the Y's are orange we predict orange; otherwise we predict blue.
- The smaller k is, the more flexible the method will be.
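The rule above is short enough to code directly. This from-scratch sketch uses invented two-class data (it does not reproduce the slides' orange/blue example) and classifies new points by a majority vote among the k nearest training points:

```r
# Majority-vote k-NN classifier: for each new point, find the k closest
# training points (Euclidean distance) and return the most common label.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  apply(new_x, 1, function(p) {
    d <- sqrt(rowSums((train_x - matrix(p, nrow(train_x), ncol(train_x),
                                        byrow = TRUE))^2))
    votes <- train_y[order(d)[1:k]]   # labels of the k nearest neighbors
    names(which.max(table(votes)))    # majority vote
  })
}

set.seed(1)
train_x <- matrix(rnorm(200), ncol = 2)
train_y <- ifelse(train_x[, 1] + train_x[, 2] > 0, "orange", "blue")
new_x <- matrix(c(2, 2, -2, -2), ncol = 2, byrow = TRUE)
knn_predict(train_x, train_y, new_x, k = 3)
```

The knn(train, test, cl, k) function in the class package implements the same rule; smaller k gives a wigglier, more flexible boundary, as the slide notes.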

KNN example with k = 3

[Figure from ISLR 2013: a test point and its three nearest training neighbors, used for a majority vote.]

KNN with k = 1 and k = 100

[Figure from ISLR 2013: class regions for KNN with K = 1 (left) and K = 100 (right).]

The dashed line is the class boundary from the Bayes classifier. k = 1 overfits (too complex) and k = 100 underfits (too simple).

A good choice of k

[Figure from ISLR 2013: class regions for KNN with K = 10.]

The class boundary for KNN with k = 10 is very similar to the one from the Bayes classifier.

Training vs. test error rates in the KNN example

[Figure from ISLR 2013: training error and test error rates against 1/K.]

Training error rates keep going down as k decreases. The test error rate at first decreases but then starts to increase.

A fundamental picture

[Figure 7.1 from EOSL 2001: behavior of test-sample and training-sample prediction error as model complexity increases; high bias/low variance at low complexity, low bias/high variance at high complexity.]

A cautionary note

- George Box, a famous statistician and son-in-law of R. A. Fisher, once said: "All models are wrong, but some are useful."
- In practice, there is really NO true model, only good models. A good model should achieve at least one of the following:
  - an interpretable model that can be explained by some known facts or knowledge;
  - reveal some unknown truth or relationship among the variables or observations;
  - accurate prediction on new samples.
- The optimal model depends on:
  - the purpose of the study;
  - the complexity of the underlying mechanism;
  - the quality of the data and the signal-to-noise ratio;
  - the sample size.

Simulation study I

- Data: 500 samples with 25 input variables and 1 numeric response Y.
- Data-generating mechanism: yi = Σ_{j=1}^{15} xij + εi, where εi ~ N(0, 3^2).
- Input variables: X = (X1, ..., X25) ~ MVN(0, Σ), where ρ(Xi, Xj) = 0.5 for i ≠ j and 1 otherwise.

    library(MASS)  # mvrnorm is in the MASS package
    mu <- rep(0, 25)
    Sigma <- matrix(0.5, 25, 25) + diag(.5, 25)
    n <- 500
    set.seed(1)
    x <- mvrnorm(n, mu, Sigma)
    y <- as.vector(x %*% c(rep(1, 15), rep(0, 10))) + rnorm(n, sd = 3)
    data1 <- data.frame(x, y)[1:50, ]; data2 <- data.frame(x, y)

- Best subset selection is applied here, using the regsubsets function in the leaps package in R.
- Two groups of models are generated, using the first 50 observations (data1) and the full data (n = 500, data2).

Simulation study I (cont.)

- nvmax: the maximum size of subsets to examine.
- nbest: the number of subsets of each size to record.
- There are some other useful options. For details, type ?regsubsets in R.

    library(leaps)
    sout1 <- summary(regsubsets(y ~ ., data = data1, nvmax = 15, nbest = 5))
    res1 <- cbind(apply(sout1$which[, -1], 1, sum), Cp = sout1$cp, bic = sout1$bic)
    sout2 <- summary(regsubsets(y ~ ., data = data2, nvmax = 25, nbest = 5))
    res2 <- cbind(apply(sout2$which[, -1], 1, sum), Cp = sout2$cp, bic = sout2$bic)
    par(mfrow = c(2, 2))
    plot(res1[, 1], res1[, 2], xlim = c(1, 15), ylim = c(0, 50),
         xlab = "Model size", ylab = "Mallow Cp")
    plot(res1[, 1], res1[, 3], xlim = c(1, 15), ylim = range(res1[, 3]),
         xlab = "Model size", ylab = "BIC")
    plot(res2[, 1], res2[, 2], xlim = c(1, 25), ylim = c(0, 200),
         xlab = "Model size", ylab = "Mallow Cp")
    plot(res2[, 1], res2[, 3], xlim = c(1, 25), ylim = range(res2[, 3]),
         xlab = "Model size", ylab = "BIC")

Sample size effect

[Figure: Mallow's Cp and BIC against model size, for n = 50 (top row) and n = 500 (bottom row).]

Noise effect

- We set two levels of standard deviation on εi: 1 and 6, with SNR = 122 and 3.4, respectively.
- We use the BIC (a common criterion for selecting models) to select the optimal model size (highlighted by a red vertical line).
- Everything else is kept the same as before (n = 500).

[Figure: BIC against model size for SD = 1 (left) and SD = 6 (right), with the BIC-optimal model size marked.]

Simulation study II: bias-variance tradeoff

    yi = 2 sin(1.5 xi) + xi + εi,  where εi ~ N(0, 1)

- Data: the training set dat has 20 observations.
- dat2 has X values on a fine grid, plus the true function values without noise.
- Fit the data using polynomial regressions.

    n <- 20
    set.seed(1)
    dat <- data.frame(x = runif(n, 0, 9.5))
    dat$y <- with(dat, 2 * sin(1.5 * x) + x + rnorm(n, sd = 1))
    dat2 <- data.frame(x = seq(from = 1, to = 9, le = 81))
    dat2$y <- with(dat2, 2 * sin(1.5 * x) + x)
    plot(dat$x, dat$y, xlab = "X", ylab = "Y")
    lines(dat2$x, dat2$y, col = "red", lwd = 2)

Fitting on various orders of polynomial regressions

- Fit the data using polynomial regressions from order 1 to 10.
- Predict on the fine grid of X in dat2.

    pred <- matrix(0, length(dat2$x), 10)
    for (i in 1:10) {
      poly.fit <- lm(y ~ poly(x, i, raw = T), dat)
      pred[, i] <- predict(poly.fit, dat2)
    }
    matplot(dat2$x, pred, xlab = "X", ylab = "Y",
            xlim = c(0, 9.5), ylim = range(c(dat$y, pred)),
            lty = 1:10, lwd = 2, type = "l",
            col = rainbow(10, start = 3/6, end = 4/6))
    points(dat$x, dat$y)
    lines(dat2$x, dat2$y, col = "red", lwd = 2)

Repeat 50 times on randomly generated Y

[Figure: fitted curves over 50 replicated datasets for polynomial orders 1, 5 and 10 (top row), and the sampling distribution of the estimate at X = 3 (bottom row; blue line = true value, red dashed line = mean of estimates). At X = 3: order 1 has bias 2.006, SD 0.284; order 5 has bias 1.319, SD 0.322; order 10 has bias 0.064, SD 1.013.]

Remarks on the previous figure

- Variance: how much ŷ varies from one training set D to another.
- Bias: the difference between the true value at X = x and the expected value of ŷ | X = x (averaged over datasets).
- A model that is too "simple" does not fit the data well (a biased solution).
- A model that is too "complex": a small change in the data makes a big change in ŷ (a high-variance solution).

    iter <- 50
    pred <- list()
    for (it in 1:iter) {
      set.seed(it)
      dat$y <- 2 * sin(1.5 * dat$x) + dat$x + rnorm(n, sd = 1)
      pred[[it]] <- matrix(0, length(dat2$x), 10)
      for (i in 1:10) {
        pred[[it]][, i] <- predict(lm(y ~ poly(x, i, raw = T), dat), dat2)
      }
    }
    par(mfcol = c(2, 3))
    plot(dat2$x, pred[[1]][, 1], xlab = "X", ylab = "Y", type = "n")
    for (i in 1:iter) { lines(dat2$x, pred[[i]][, 1]) }
    lines(dat2$x, dat2$y, col = "red", lwd = 2)
    segments(3, -2, 3, 6, lwd = 2, col = rgb(0, 0, 1, alpha = 0.5))
    title("Order 1")
    # pred.2 and ind below come from code not shown on this slide
    plot(density(pred.2[, 1], bw = 0.1), main = "", xlab = "X")
    lines(rep(dat2$y[ind], 2), c(0, 0.2), col = "blue")
    lines(rep(mean(pred.2[, 1]), 2), c(0, 0.2), col = "red", lty = 2)

MSE curves among 50 repetitions

- The curves in the background are the MSE for each sample against polynomial order.
- The solid red line is the average MSE among the 50 samples.
- Left: low variance but high bias. Right: high variance, low bias.
- The optimal order is around 6 (the true function has 4 reflection points).

    mse <- matrix(0, iter, 10)
    FUN1 <- function(x) mean((x - dat2$y)^2)
    for (it in 1:iter) { mse[it, ] <- apply(pred[[it]], 2, FUN1) }
    plot(1:10, mse[1, ], log = "y", ylab = "MSE", xlab = "Polynomial order",
         xlim = c(1, 10), ylim = range(mse), type = "n")
    for (it in 1:iter) { lines(1:10, mse[it, ], col = "blue", lwd = 0.3) }
    lines(1:10, apply(mse, 2, mean), col = "red")

Bias-variance tradeoff in MSE

- Since we know the true function, here MSE = bias^2 + variance.
- Bias is estimated by using the average over the 50 replications as E(f̂).
- Variance is estimated by using the variance of f̂ over the 50 replications.

    bias2 <- vari <- rep(0, 10)
    for (i in 1:10) {
      tmp1 <- matrix(0, length(dat2$x), iter)
      for (it in 1:iter) { tmp1[, it] <- pred[[it]][, i] }
      tmp2 <- apply(tmp1, 1, mean)
      # bias2[i]: mean bias^2 for the ith order
      bias2[i] <- mean((dat2$y - tmp2)^2)
      # tmp3: variance of the estimates on the grid for the ith order
      tmp3 <- apply(tmp1, 1, var)
      vari[i] <- mean(tmp3)
    }
    plot(1:10, apply(mse, 2, mean), xlab = "Polynomial order", ylab = "",
         col = "blue", ylim = range(c(bias2, vari)), type = "l", lwd = 2)
    lines(1:10, bias2, col = "red", lwd = 2, lty = 2)
    lines(1:10, vari, col = "orange", lwd = 2, lty = 4)

