3y ago

45 Views

2 Downloads

739.95 KB

21 Pages

Transcription

09s1: COMP9417 Machine Learning and Data MiningMachine Learning for NumericPredictionAcknowledgement: Material derived from slides for the bookMachine Learning, Tom M. Mitchell, McGraw-Hill, 1997http://www-2.cs.cmu.edu/ tom/mlbook.htmland slides by Andrew W. Moore available athttp://www.cs.cmu.edu/ awm/tutorialsand the book Data Mining, Ian H. Witten and Eibe Frank,Morgan Kauffman, 2000. http://www.cs.waikato.ac.nz/ml/wekaand the book Pattern Classification, Richard O. Duda, Peter E. Hartand David G. Stork. Copyright (c) 2001 by John Wiley & Sons, Inc.April 1, 2009AimsThis lecture will enable you to describe and reproduce machine learningapproaches to the problem of numerical prediction. Following it you shouldbe able to:Recommended reading: Mitchell, Chapter 4 (6.4);Witten & Frank, pp. 119–126, 223-233.Suggested exercises: Mitchell, 4.1-4.3, 4.5, 4.7 (4.11)Relevant WEKA programs:Linear Regression, Logistic, Multi-Layer Perceptron, M5P define linear regression reproduce the basic method of implementing linear regression describe least mean squares define the problem of non-linear regression define neural network learning in terms of non-linear regression reproduce the method of back-propagation with sigmoid function andhidden layers reproduce the regression and model tree approaches for non-linearregressionCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 1COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 2

IntroductionIntroductionSo far we have considered mainly discrete representations for data andhypotheses in Machine Learning . . .For this class of representations, machine learning is viewed as:. . . however, often find tasks where the most natural representation is thatof prediction of numeric valuesCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 3searching a space of functions . . .COMP9417: April 1, 2009IntroductionMachine Learning for Numeric Prediction: Slide 4IntroductionSome methods: linear regression (statistics) the process of computing an expressionthat predicts a numeric quantity perceptron (machine learning) a biologically-inspired linear predictionmethod multi-layer neural networks (machine learning)learning non-linearpredictors via hidden nodes between input and output regression trees (statistics / machine learning)predicts a numeric quantitytree where each leaf– the average value of training instances that reach the leaf– internal nodes test discrete or continuous attributes model trees (statistics / machine learning) regression tree with linearregression models at the leaf nodes– can fit with non-axis-orthogonal slopes– smoothing at internal nodes to approximate continuous functionCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 5COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 6

Dataset: predicting CPU performanceExamples: 209 different computer configurationsLinear Regression for CPU datasetPRP - 56.1 0.049 MYCT 0.015 MMIN 0.006 MMAX 0.630 CACH- 0.270 CHMIN 1.46 CHMAXRegression equationOutcome: linear sum of attribute values with appropriate weights.RegressionThe process of determining the weights for the regression equation.COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 7COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 8Linear RegressionLinear Regression Numeric attributes and numeric prediction Linear models, i.e. outcome is linear combination of attributesy w0 w1x1 w2x2 . . . wmxm Weights are calculated from the training data Predicted value for first training instance x(1) is:Linear regression assumes that the expected value of the output given aninput, E[y x], is linear.Simplest case: Out(x) wx for some unknown w.Given the data, we can estimate w.COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 9(1)(1)(1)w0x0 w1x1 w2x2 . . . wmx(1)m mX(1)wj xjj 0COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 10

Minimizing Squared ErrorMultiple Regression Difference between predicted and actual values is the error !m 1 coefficients are chosen so that sum of squared error on all instancesin training data is minimizedSquared error:nXi 1 y (i) mX 2(i)wj xj j 0Coefficients can be derived using standard matrix operationsCan be done if there are more instances than attributes (roughly speaking).“Ordinary Least Squares” (OLS) regression – minimizing the sum ofsquared distances of data points to the estimated regression line.Linear least squares fitting with 2 input variables.COMP9417: April 1, 2009COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 11Regression for Classification Any regression technique can be used for classification– Training: perform a regression for each class, setting the output to 1for training instances that belong to class, and 0 for those that don’t– Prediction: predict class corresponding to model with largest outputvalue (membership value) For linear regression this is known as multiresponse linear regressionMachine Learning for Numeric Prediction: Slide 12Pairwise regression Another way of using regression for classification:– A regression function for every pair of classes, using only instancesfrom these two classes– An output of 1 is assigned to one member of the pair, an outputof -1 to the other Prediction is done by voting– Class that receives most votes is predicted– Alternative: “don’t know” if there is no agreement More likely to be accurate but more expensiveHow many regressions for k classes ?COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 13COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 14

Logistic regressionDiscussion of linear models Not appropriate if data exhibits non-linear dependencies Problem: some assumptions violated when linear regression is appliedto classification problems (not really fitting a line) Alternative: logistic regression– Designed for classification problems– Tries to estimate class probabilities directly using maximum likelihoodmethod– Uses the linear model to predict the log of the odds of the classprobability Plog(P/(1 P )) w0x0 w1x1 . . . wk xk But: can serve as building blocks for more complex schemes (i.e. modeltrees) Example: multi-response linear regression defines a separatinghyperplane for any two given classes:(1)(1)(1)(2)(2)(2)(1)(2)xm w0 w1 x1 w2 x2 . . . wmxmw0 w1 x1 w2 x2 . . . wmwhich can be rewritten as(1)(2)(1)(2)(1)(2) wm)xm 0(w0 w0 ) (w1 w1 )x1 . . . (wmwhere w (1) and w (2) are the weight vectors for the two classes.COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 15COMP9417: April 1, 2009Discussion of linear modelsMachine Learning for Numeric Prediction: Slide 16Discussion of linear modelsThink of the hyperplane as separating all the positives on one side fromthe negatives on the other.For pairwise linear regression this also applies – the only difference is thatonly the instances in each of the two classes is consideredHowever, sometimes it is not possible to define a separating hyperplane,even for some very simple functions . . .Filled circle – output one; hollow circle – output zero.AND – divide 1’s from 0’s with single line; XOR – not possible.COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 17COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 18

Discussion of linear modelsDiscussion of linear modelsDealing with noise – robust regression Statistical methods that address problem of outliers are called robust Possible way of making regression more robust:– Minimize absolute error instead of squared error– Remove outliers (i.e. 10% of points farthest from the regressionplane)– Minimize median of squares instead of mean of squares (copes withoutliers in x and y direction) Geometric interpretation: finds narrowest strip covering half theobservations (see figure) Thickness measured in vertical direction Least median of squares lies in centre of fuzzy grey band in figureCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 19COMP9417: April 1, 2009Artificial Neural NetworksMachine Learning for Numeric Prediction: Slide 20Connectionist ModelsConsider humans: Threshold units Gradient descent Neuron switching time .001 second Multilayer networks Number of neurons 1010 Backpropagation Connections per neuron 104 5 Hidden layer representations Scene recognition time .1 second Advanced topics 100 inference steps doesn’t seem like enough much parallel computationCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 21COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 22

Connectionist ModelsProperties of artificial neural nets (ANN’s):When to Consider Neural Networks Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Many neuron-like threshold switching units Output is a vector of values Many weighted interconnections among units Possibly noisy data Highly parallel, distributed process Form of target function is unknown Emphasis on tuning weights automatically Human readability of result is unimportantExamples: Speech phoneme recognition (NetTalk) Image classification (see face recognition data) many others . . .COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 23COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 24ALVINNALVINN drives 70 mph on highwaysSharpLeftStraightAheadSharpRight30 OutputUnits4 HiddenUnits30x32 SensorInput RetinaCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 25COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 26

PerceptronPerceptronx1w1x2w2.x0 1w0Sometimes simpler vector notation used: Σo( x) nΣ wi xii 0wnxn o(x1, . . . , xn) {o 1 if w · x 0 1 otherwise.n1 if S w x 0i 0 i i-1 otherwise1 if w0 w1x1 · · · wnxn 0 1 otherwise.COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 27COMP9417: April 1, 2009Decision Surface of a PerceptronDecision Surface of a Perceptronx2x2Machine Learning for Numeric Prediction: Slide 28But some functions not representable e.g., not linearly separable- Therefore, we’ll want networks of these.x1x1---(a) (b)Represents some useful functions What weights represent g(x1, x2) AN D(x1, x2)?COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 29COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 30

Perceptron training rulePerceptron training ruleCan prove it will convergewi wi wi If training data is linearly separablewhere wi η(t o)xi and η sufficiently smallWhere: t c( x) is target value o is perceptron output η is small constant (e.g., .1) called learning rateCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 31COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 32Gradient DescentGradient Descent25To understand, consider simpler linear unit, where20Let’s learn wi’s that minimize the squared error1X(td od)2E[w] 215E[w]o w0 w1x1 · · · wnxn105021d D0-1Where D is set of training examplesCOMP9417: April 1, 2009w0Machine Learning for Numeric Prediction: Slide 33COMP9417: April 1, 20093210-1-2w1Machine Learning for Numeric Prediction: Slide 34

Gradient DescentGradient DescentGradient E[w] E E E,,··· w0 w1 wn E wi 1X(td od)2 wi 2 1X (td od)22 wi 1X 2(td od)(td od)2 wi X XdTraining rule:d w η E[w] i.e.,d E wi η wid E wiCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 35Gradient Descent(td od) (td w · x d) wi(td od)( xi,d)dCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 36Training Perceptron vs. Linear unitGradient-Descent(training examples, η)Perceptron training rule guaranteed to succeed ifEach training example is a pair h x, ti, where x is the vector of inputvalues, and t is the target output value. η is the learning rate (e.g., .05). Training examples are linearly separableInitialize each wi to some small random valueUntil the termination condition is met, DoInitialize each wi to zeroFor each h x, ti in training examples, DoInput the instance x to the unit and compute the output oFor each linear unit weight wi wi wi η(t o)xiFor each linear unit weight wiwi wi wiCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 37 Sufficiently small learning rate ηLinear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate η Even when training data contains noise Even when training data not separable by HCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 38

Incremental (Stochastic) Gradient DescentIncremental (Stochastic) Gradient DescentBatch mode Gradient Descent:Do until satisfiedED [w] 1X(td od)22d D1. Compute the gradient ED [w] 1Ed[w] (td od)222. w w η ED [w] Incremental Gradient Descent can approximate Batch Gradient Descentarbitrarily closely if η made small enoughIncremental mode Gradient Descent:Do until satisfied For each training example d in D1. Compute the gradient Ed[w] 2. w w η Ed[w] COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 39Multilayer Networks of Sigmoid Unitsheadhid.F1COMP9417: April 1, 2009.who’dCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 40Multilayer Networks of Sigmoid UnitshoodF2Machine Learning for Numeric Prediction: Slide 41COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 42

Sigmoid Unitx1w1x2w2.Sigmoid Unitx0 1Nice property:w0dσ(x)dx σ(x)(1 σ(x))We can derive gradient descent rules to trainΣ One sigmoid unitnnet S wi xio s (net) i 0wnxn1 Multilayer networks of sigmoid units Backpropagation-net1 eσ(x) is the sigmoid function11 e xCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 43COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 44Error Gradient for a Sigmoid UnitError Gradient for a Sigmoid UnitBut we know: E wi 1X(td od)2 wi 2 1X (td od)22 wi od σ(netd) od(1 od) netd netdd D netd (w · xd) xi,d wi wid 1X2(td od)(td od) 2 wid X od(td od) wiSo: E wi X(td od)od(1 od)xi,dd Dd XdCOMP9417: April 1, 2009(td od) od netd netd wiMachine Learning for Numeric Prediction: Slide 45COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 46

Backpropagation Algorithm Gradient descent over entire network weight vectorInitialize all weights to small random numbers. Easily generalized to arbitrary directed graphsUntil satisfied, Do Will find a local, not necessarily global error minimumFor each training example, DoInput the training example to the network andcompute the network outputs– In practice, often works well (can run multiple times) Often include weight momentum αFor each output unit kδk ok (1 ok )(tk ok ) wi,j (n) ηδj xi,j α wi,j (n 1)For each hidden unit Phδh oh(1 oh) k outputs wh,k δk Minimizes error over training examplesUpdate each network weight wi,jwi,j wi,j wi,jwhere wi,j ηδj xi,jCOMP9417: April 1, 2009More on Backpropagation– Will it generalize well to subsequent examples? Training can take thousands of iterations slow! Using network after training is very fastMachine Learning for Numeric Prediction: Slide 47COMP9417: April 1, 2009Learning Hidden Layer RepresentationsLearning Hidden Layer RepresentationsInputsOutputsMachine Learning for Numeric Prediction: Slide 48A target 1000000001000000001000000001 01000000001000000001Can this be learned?COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 49COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 50

Learning Hidden Layer RepresentationsLearning Hidden Layer RepresentationsAn autoassociator network:Learned hidden layer P9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 51Sum of squared errors for each output OMP9417: April 1, 200950010001500.08.88.27.71.02.99.98.01 00001000000001Machine Learning for Numeric Prediction: Slide 522000Hidden unit encoding for input 0100000010.70OutputTraining the network0.80.89.01.01.99.03.22.80.60COMP9417: April 1, 2009Training0.9 HiddenValues.04.11.97.97.05.99.01.942500Machine Learning for Numeric Prediction: Slide 530COMP9417: April 1, 20095001000150020002500Machine Learning for Numeric Prediction: Slide 54

Training the networkConvergence of BackpropagationGradient descent to some local minimumWeights from inputs to one hidden unit43 Perhaps not global minimum.2 Add momentum1 Stochastic gradient descent0-1 Train multiple nets with different initial weights-2-3Nature of convergence-4-505001000150020002500 Initialize weights near zero Therefore, initial networks near-linear Increasingly non-linear functions possible as training progressesCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 55COMP9417: April 1, 2009Expressive Capabilities of ANNsOverfitting in ANNsBoolean functions:Error versus weight updates (example 1)0.01 Every Boolean function can be represented by network with singlehidden layerTraining set errorValidation set error0.0090.0080.007Error but might require exponential (in number of inputs) hidden unitsMachine Learning for Numeric Prediction: Slide 560.0060.005Continuous functions:0.0040.003 Every bounded continuous function can be approximated with arbitrarilysmall error, by network with one hidden layer [Cybenko 1989; Horniket al. 1989]0.002050001000015000Number of weight updates20000 Any function can be approximated to arbitrary accuracy by a networkwith two hidden layers [Cybenko 1988].COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 57COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 58

Overfitting in ANNsAlternative Error FunctionsError versus weight updates (example 2)0.08Penalize large weights:Training set errorValidation set error0.070.06E(w) Error0.051X2X(tkd okd)2 γX2wjii,jd D k outputs0.040.03Train on target slopes as well as values:0.020.01001000200030004000Number of weight updates50006000 1XE(w) 2XX tkdj inputs xjd (tkd okd)2 µd D k outputs okd xjd!2 Tie together weights, e.g., in phoneme recognition networkCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 59COMP9417: April 1, 2009Recurrent NetworksEvaluating numeric predictiony(t 1)y(t 1)bx(t)x(t)(a) Feedforward networkc(t)(b) Recurrent networky(t 1)x(t)Machine Learning for Numeric Prediction: Slide 60 Same strategies: independent test set, crossvalidation, significancetests, etc. Difference: error measures Actual target values: a1, a2, . . . , an Predicted target values: p1, p2, . . . , pnc(t)y(t)x(t – 1) Most popular measure: mean-squared error(p1 a1)2 . . . (pn an)2nEasy to manipulate mathematically.c(t – 1)y(t – 1)(c) Recurrent networkx(t – 2)c(t – 2)unfolded in timeCOMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 61COMP9417: April 1, 2009Machine Learning for Numeric Prediction: Slide 62

Other measuresNon-linear Regression with TreesThe root mean-squared error:r(p1 a1)2 . . . (pn an)2nThe average (mean) absolute error is less sensitive to outliers than themean-squared error:Despite some nice properties of ANNs, such as generalization to dealsensibly with unseen input patterns and robustness to losing neurons(pre

COMP9417: April 1, 2009 Machine Learning for Numeric Prediction: Slide 27 Perceptron Sometimes simpler vector notation used: o(x ) 1 ifw x 0 1 otherwise. COMP9417: April 1, 2009 Machine Learning for Numeric Prediction: Slide 28 Decision Surface of a Perceptron x1 x2 -- -x1 x2 (a) (b)- - Represents some useful functions

Related Documents: