Lecture 2: More On Linear Methods For Regression

Outline

- Overfitting and the bias-variance trade-off
- Linear basis function models
- Sequential (on-line, incremental) learning
- Why least-squares? A probabilistic analysis
- If we have time: regularization

Recall: Linear and polynomial regression

- Our first assumption was that it is good to minimize the sum- (or mean-)squared error.
- Algorithms that minimize this function are called least-squares.
- Our second assumption was the linear form of the hypothesis class.
- The terms were powers of the input variables (and possibly cross-terms of these powers).

Recall: Overfitting

[Figure: polynomial fits of degree M = 0, 1, 3, and 9, plotted as t against x for the same training data]

The higher the degree of the polynomial, the more degrees of freedom, and the more capacity to "overfit" (think: memorize) the training data.

Recall: Typical overfitting plot

[Figure: root-mean-square error $E_{RMS}$ on the training and test sets as a function of the polynomial degree M]

- The training error decreases with the degree of the polynomial, i.e. the complexity of the hypothesis.
- The test error, measured on independent data, decreases at first, then starts increasing.
- Cross-validation helps us:
  - find a good hypothesis class;
  - report unbiased results.

The anatomy of the error

- Suppose we have examples $\langle x, y \rangle$ where $y = f(x) + \varepsilon$ and $\varepsilon$ is Gaussian noise with zero mean and standard deviation $\sigma$.
- Reminder: the normal (Gaussian) distribution $\mathcal{N}(x \mid \mu, \sigma^2)$.

[Figure: the Gaussian density, centered at $\mu$, with $2\sigma$ marking its width]

The anatomy of the error: Linear regression

- In linear regression, given a set of examples $\langle x_i, y_i \rangle$, $i = 1, \ldots, m$, we fit a linear hypothesis $h(x) = w^T x$, so as to minimize the sum-squared error over the training data:
  $$\sum_{i=1}^m (y_i - h(x_i))^2$$
- Because of the hypothesis class that we chose (linear hypotheses), for some functions $f$ we will have a systematic prediction error.
- Depending on the data set we have, the parameters $w$ that we find will be different.

An example (Tom Dietterich)

- The sine is the true function.
- The circles are the data points.
- The straight line is the linear regression fit.

Example continued

With different sets of 20 points, we get different lines.

Bias-variance analysis

- Given a new data point $x$, what is the expected prediction error?
- Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution $P(\langle x, y \rangle)$.
- The goal of the analysis is to compute, for an arbitrary new point $x$,
  $$E_P\left[(y - h(x))^2\right]$$
  where $y$ is the target value that could accompany $x$ in a data set, and the expectation is over all training sets drawn according to $P$.
- We will decompose this expectation into three components.

Recall: Statistics 101

- Let $X$ be a random variable with possible values $x_i$, $i = 1, \ldots, n$, and with probability distribution $P(X)$.
- The expected value or mean of $X$ is:
  $$E[X] = \sum_{i=1}^n x_i P(x_i)$$
- If $X$ is continuous, roughly speaking, the sum is replaced by an integral, and the distribution by a density function.
- The variance of $X$ is:
  $$\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

The variance lemma

$$
\begin{aligned}
\mathrm{Var}[X] &= E[(X - E[X])^2] \\
&= \sum_{i=1}^n (x_i - E[X])^2 P(x_i) \\
&= \sum_{i=1}^n \left(x_i^2 - 2 x_i E[X] + (E[X])^2\right) P(x_i) \\
&= \sum_{i=1}^n x_i^2 P(x_i) - 2 E[X] \sum_{i=1}^n x_i P(x_i) + (E[X])^2 \sum_{i=1}^n P(x_i) \\
&= E[X^2] - 2 E[X] E[X] + (E[X])^2 \cdot 1 \\
&= E[X^2] - (E[X])^2
\end{aligned}
$$

We will use the form:
$$E[X^2] = (E[X])^2 + \mathrm{Var}[X]$$
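As a quick numerical sanity check (not part of the original slides), the identity can be verified for any small discrete distribution; the values and probabilities below are made-up illustrations.

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0])   # possible values x_i (illustrative)
p = np.array([0.2, 0.5, 0.3])   # probabilities P(x_i), summing to 1

mean = np.sum(x * p)                        # E[X]
var_direct = np.sum((x - mean) ** 2 * p)    # E[(X - E[X])^2]
var_lemma = np.sum(x ** 2 * p) - mean ** 2  # E[X^2] - (E[X])^2

print(var_direct, var_lemma)  # the two values agree
```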

Bias-variance decomposition

$$
E_P\left[(y - h(x))^2\right] = E_P\left[(h(x))^2 - 2\, y\, h(x) + y^2\right]
= E_P\left[(h(x))^2\right] + E_P\left[y^2\right] - 2\, E_P[y]\, E_P[h(x)]
$$

Let $\bar{h}(x) = E_P[h(x)]$ denote the mean prediction of the hypothesis at $x$, when $h$ is trained with data drawn from $P$.

For the first term, using the variance lemma, we have:
$$E_P[(h(x))^2] = E_P[(h(x) - \bar{h}(x))^2] + (\bar{h}(x))^2$$

Note that $E_P[y] = E_P[f(x) + \varepsilon] = f(x)$.

For the second term, using the variance lemma, we have:
$$E[y^2] = E[(y - f(x))^2] + (f(x))^2$$

Bias-variance decomposition (2)

Putting everything together, we have:
$$
\begin{aligned}
E_P\left[(y - h(x))^2\right] &= E_P[(h(x) - \bar{h}(x))^2] + (\bar{h}(x))^2 - 2 f(x) \bar{h}(x) + E_P[(y - f(x))^2] + (f(x))^2 \\
&= E_P[(h(x) - \bar{h}(x))^2] + (f(x) - \bar{h}(x))^2 + E[(y - f(x))^2]
\end{aligned}
$$

- The first term is the variance of the hypothesis $h$ when trained with finite data sets sampled randomly from $P$.
- The second term is the squared bias (or systematic error), which is associated with the class of hypotheses we are considering.
- The last term is the noise, which is due to the problem at hand, and cannot be avoided.
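To make the decomposition concrete, here is a minimal simulation sketch (not part of the original slides): we repeatedly sample training sets from $P$, fit a hypothesis by least squares, and estimate the variance, squared bias, and noise terms at a single test point. The true function, noise level, sample size, and polynomial degree are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin          # assumed "true" function f
sigma = 0.3         # noise standard deviation
degree = 3          # hypothesis class: degree-3 polynomials
x0 = 1.5            # test point at which the error is decomposed

preds = []
for _ in range(2000):                       # many training sets drawn from P
    x = rng.uniform(0, 2 * np.pi, size=20)
    y = f(x) + rng.normal(0, sigma, size=x.shape)
    w = np.polyfit(x, y, degree)            # least-squares fit on this data set
    preds.append(np.polyval(w, x0))         # h(x0) for this training set
preds = np.array(preds)

variance = preds.var()                      # E[(h(x0) - hbar(x0))^2]
bias_sq = (f(x0) - preds.mean()) ** 2       # (f(x0) - hbar(x0))^2
noise = sigma ** 2                          # E[(y - f(x0))^2]
print(bias_sq, variance, noise, bias_sq + variance + noise)
```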

Example revisited: Bias

[Figure]

Example revisited: Variance

[Figure]

Example revisited: Noise

[Figure]

A point with low bias

[Figure]

A point with high bias

[Figure]

Error decomposition

[Figure: $(\mathrm{bias})^2$, variance, $(\mathrm{bias})^2 + \mathrm{variance}$, and test error as a function of $\ln \lambda$]

- The bias-variance sum approximates the test error well over a set of 1000 points.
- The x-axis is a measure of the hypothesis complexity (decreasing left-to-right).
- Simple hypotheses have high bias (the bias will be high at many points).
- Complex hypotheses have high variance: the hypothesis is very dependent on the data set on which it was trained.

Bias-variance trade-off

- Consider fitting a small-degree vs. a high-degree polynomial.
- Which one do you expect to have higher bias? Higher variance?

Bias-variance trade-off

- Typically, bias comes from not having good hypotheses in the considered class.
- Variance results from the hypothesis class containing "too many" hypotheses.
- Hence, we are faced with a trade-off: choose a more expressive class of hypotheses, which will generate higher variance, or a less expressive class, which will generate higher bias.
- The trade-off also depends on how much data you have.

More on overfitting

- Overfitting depends on the amount of data, relative to the complexity of the hypothesis.
- With more data, we can explore more complex hypothesis spaces and still find a good solution.

[Figure: fits of the same polynomial with N = 15 and N = 100 training points]

Linear models in general

- By linear models, we mean that the hypothesis function $h_w(x)$ is a linear function of the parameters $w$.
- This does NOT mean that $h_w(x)$ is a linear function of the input vector $x$ (e.g., polynomial regression).
- In general,
  $$h_w(x) = \sum_{k=0}^{K-1} w_k \phi_k(x) = w^T \phi(x)$$
  where the $\phi_k$ are called basis functions.
- As usual, we will assume that $\phi_0(x) = 1$, $\forall x$, to create a bias term.
- The hypothesis can alternatively be written as:
  $$h_w(x) = \Phi w$$
  where $\Phi$ is a matrix with one row per instance; row $j$ contains $\phi(x_j)$.
- Basis functions are fixed.
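A small sketch of how the design matrix $\Phi$ can be assembled for arbitrary fixed basis functions (not from the slides; the polynomial basis below is an illustrative choice):

```python
import numpy as np

def design_matrix(X, basis_functions):
    """Return Phi, one row per instance; row j contains phi(x_j)."""
    return np.column_stack([phi(X) for phi in basis_functions])

# Polynomial basis phi_k(x) = x^k for 1-D inputs; k = 0 gives the bias term.
X = np.linspace(-1, 1, 5)
basis = [lambda x, k=k: x ** k for k in range(4)]
Phi = design_matrix(X, basis)

w = np.array([0.5, -1.0, 2.0, 0.3])   # some parameter vector (illustrative)
predictions = Phi @ w                 # h_w(x) = w^T phi(x) for every instance
print(Phi.shape, predictions)
```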

Example basis functions: Polynomials

[Figure: polynomial basis functions on $[-1, 1]$]

$$\phi_k(x) = x^k$$

"Global" functions: a small change in $x$ may cause a large change in the output of many basis functions.

Example basis functions: Gaussians

[Figure: Gaussian basis functions on $[-1, 1]$]

$$\phi_k(x) = \exp\left(-\frac{(x - \mu_k)^2}{2 s^2}\right)$$

- $\mu_k$ controls the position along the x-axis.
- $s$ controls the width (activation radius).
- $\mu_k$ and $s$ are fixed for now (later we discuss adjusting them).
- Usually thought of as "local" functions: a small change in $x$ only causes a change in the output of the bases with means close to $x$.
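A sketch of the Gaussian basis as defined above (the centers $\mu_k$ and width $s$ below are illustrative choices, fixed in advance as in the slides):

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_k(x) = exp(-(x - mu_k)^2 / (2 s^2)), evaluated for every center mu_k."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))

x = np.linspace(-1, 1, 100)
mu = np.linspace(-1, 1, 9)     # evenly spaced centers (assumption)
s = 0.2                        # width / activation radius (assumption)

Phi = gaussian_basis(x, mu, s)                  # shape (100, 9)
Phi = np.column_stack([np.ones(len(x)), Phi])   # prepend phi_0(x) = 1
print(Phi.shape)
```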

Example basis functions: Sigmoidal

[Figure: sigmoidal basis functions on $[-1, 1]$]

$$\phi_k(x) = \sigma\left(\frac{x - \mu_k}{s}\right), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}$$

- $\mu_k$ controls the position along the x-axis.
- $s$ controls the slope.
- $\mu_k$ and $s$ are fixed for now (later we discuss adjusting them).
- "Local" functions: a small change in $x$ only causes a change in the output of a few basis functions (the others will be close to 0 or 1).

Minimizing the mean-squared error

- Recall from last time: we want $\min_w J_D(w)$, where:
  $$J_D(w) = \frac{1}{2} \sum_{i=1}^m (h_w(x_i) - y_i)^2 = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$$
- Compute the gradient and set it to 0:
  $$\nabla_w J_D(w) = \frac{1}{2} \nabla_w \left( w^T \Phi^T \Phi w - w^T \Phi^T y - y^T \Phi w + y^T y \right) = \Phi^T \Phi w - \Phi^T y = 0$$
- Solve for $w$:
  $$w = (\Phi^T \Phi)^{-1} \Phi^T y$$
- What if $\Phi$ is too big to compute this explicitly?
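A minimal sketch of the closed-form solution (not from the slides). In practice it is numerically safer to solve the normal equations, or use np.linalg.lstsq, rather than forming the inverse explicitly:

```python
import numpy as np

def least_squares(Phi, y):
    # Solve the normal equations Phi^T Phi w = Phi^T y for w.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Tiny illustrative example: recover y = 1 + 2x with a [1, x] design matrix.
x = np.array([0.0, 1.0, 2.0, 3.0])
Phi = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
print(least_squares(Phi, y))   # approximately [1., 2.]
```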

Gradient descent

- The gradient of $J$ at a point $\langle w_0, w_1, \ldots, w_k \rangle$ can be thought of as a vector indicating which way is "uphill".

[Figure: surface plot of the sum-squared error (SSQ) over $w_0$ and $w_1$]

- If this is an error function, we want to move "downhill" on it, i.e., in the direction opposite to the gradient.

Example gradient descent traces

- In general, there may be many local optima.
- The final solution depends on the initial parameters.

Gradient descent algorithm

- The basic algorithm assumes that $\nabla J$ is easily computed.
- We want to produce a sequence of vectors $w^1, w^2, w^3, \ldots$ with the goal that:
  - $J(w^1) > J(w^2) > J(w^3) > \ldots$
  - $\lim_{i \to \infty} w^i = w$, and $w$ is locally optimal.
- The algorithm: given $w^0$, do for $i = 0, 1, 2, \ldots$
  $$w^{i+1} = w^i - \alpha_i \nabla J(w^i),$$
  where $\alpha_i > 0$ is the step size or learning rate for iteration $i$.
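A minimal batch gradient descent sketch for the least-squares objective $J_D(w) = \frac{1}{2}(\Phi w - y)^T(\Phi w - y)$, whose gradient is $\Phi^T(\Phi w - y)$. The constant step size and iteration count are illustrative, not prescribed by the slides:

```python
import numpy as np

def gradient_descent(Phi, y, alpha=0.01, num_iters=1000):
    w = np.zeros(Phi.shape[1])           # w^0: initial parameter vector
    for _ in range(num_iters):
        grad = Phi.T @ (Phi @ w - y)     # gradient of J_D at w^i
        w = w - alpha * grad             # w^{i+1} = w^i - alpha_i * grad J(w^i)
    return w
```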

Step size and convergence

- Convergence to a local minimum depends in part on the $\alpha_i$.
- If they are too large (such as a constant), oscillation or "bubbling" may occur. (This suggests the $\alpha_i$ should tend to zero as $i \to \infty$.)
- If they are too small, the $w^i$ may not move far enough to reach a local minimum, or may do so very slowly.

Robbins-Monro conditions

- The $\alpha_i$ form a Robbins-Monro sequence if:
  $$\sum_{i=0}^{\infty} \alpha_i = \infty \quad \text{and} \quad \sum_{i=0}^{\infty} \alpha_i^2 < \infty$$
- E.g., $\alpha_i = \frac{1}{i+1}$ (averaging).
- E.g., $\alpha_i = \frac{1}{2}$ for $i = 1, \ldots, T$, then $\alpha_i = \frac{1}{2^2}$ for $i = T+1, \ldots, (T+1) + 2T$, etc.
- These conditions, along with appropriate conditions on $J$, are sufficient to ensure convergence of the $w^i$ to a point $w^*$ such that $\nabla J(w^*) = 0$.
- Many variants are possible: e.g., we may use at each step a random vector with mean $\nabla J(w^i)$; this is stochastic gradient descent.
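A tiny numerical illustration of the averaging schedule $\alpha_i = \frac{1}{i+1}$ (not a proof; just a hint that the sum of the $\alpha_i$ keeps growing while the sum of their squares levels off):

```python
import numpy as np

i = np.arange(1, 100001)
alpha = 1.0 / i
print(alpha.sum())          # grows without bound (roughly the log of the horizon)
print((alpha ** 2).sum())   # approaches pi^2 / 6, so the sum of squares is finite
```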

"Batch" versus "on-line" optimization

- The error function, $J_D$, is a sum of errors attributed to each instance: $J_D = J_1 + J_2 + \ldots + J_m$.
- In batch gradient descent, the true gradient is computed at each step: $\nabla J_D = \nabla J_1 + \nabla J_2 + \ldots + \nabla J_m$.
- In on-line gradient descent, at each iteration one instance, $i \in \{1, \ldots, m\}$, is chosen at random and only $\nabla J_i$ is used in the update.
- Linear case (least-mean-square, LMS, or Widrow-Hoff rule): pick instance $i$ and update:
  $$w^{i+1} = w^i + \alpha_i \left(y_i - (w^i)^T \phi(x_i)\right) \phi(x_i)$$
- Why prefer one or the other?
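A sketch of the LMS (Widrow-Hoff) on-line update: at each step a random instance is picked and $w$ is moved along that single instance's gradient. The schedule $\alpha_t = \frac{1}{t+1}$ is one possible Robbins-Monro choice:

```python
import numpy as np

def lms(Phi, y, num_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for t in range(num_steps):
        i = rng.integers(len(y))                        # pick a random instance
        alpha = 1.0 / (t + 1)                           # Robbins-Monro step size
        w = w + alpha * (y[i] - w @ Phi[i]) * Phi[i]    # Widrow-Hoff update
    return w
```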

"Batch" versus "on-line" optimization

- Batch is simple, repeatable.
- On-line:
  - Requires less computation per step.
  - Randomization may help escape poor local minima.
  - Allows working with a stream of data, rather than a static set (hence "on-line").

Termination

There are many heuristics for deciding when to stop gradient descent.

1. Run until $\|\nabla J\|$ is smaller than some threshold.
2. Run it for as long as you can stand.
3. Run it for a short time from 100 different starting points, see which one is doing best, goto 2.
4. ...

Gradient descent in linear models and beyond

- In linear models, gradient descent can be used with larger data sets than the exact solution method.
- Very useful if the data is non-stationary (i.e., the data distribution changes over time).
- In this case, use constant learning rates (not obeying the Robbins-Monro conditions).
- Crucial method for non-linear function approximation (where closed-form solutions are impossible).

Annoyances:

- Speed of convergence depends on the learning rate schedule.
- In the non-linear case, randomizing the initial parameter vector is crucial.

Another algorithm for optimization

- Recall Newton's method for finding the zero of a function $g : \mathbb{R} \to \mathbb{R}$.
- At point $w^i$, approximate the function by a straight line (its tangent).
- Solve the linear equation for where the tangent equals 0, and move the parameter to this point:
  $$w^{i+1} = w^i - \frac{g(w^i)}{g'(w^i)}$$

Application to machine learning

- Suppose for simplicity that the error function $J$ has only one parameter.
- We want to optimize $J$, so we can apply Newton's method to find the zeros of $J' = \frac{dJ}{dw}$.
- We obtain the iteration:
  $$w^{i+1} = w^i - \frac{J'(w^i)}{J''(w^i)}$$
- Note that there is no step size parameter!
- This is a second-order method, because it requires computing the second derivative.
- But if our error function is quadratic, this will find the global optimum in one step!

Second-order methods: Multivariate setting

- If we have an error function $J$ that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of $J$:
  $$H_{ij} = \frac{\partial^2 J}{\partial w_i \partial w_j}$$
- The inverse of the Hessian gives the "optimal" learning rates.
- The weights are updated as:
  $$w \leftarrow w - H^{-1} \nabla_w J$$
- This is also called the Newton-Raphson method.
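A sketch of one Newton-Raphson step for the least-squares objective (not from the slides). Here the Hessian is $H = \Phi^T \Phi$, a constant, so a single step from any starting point lands on the global optimum, consistent with the remark above about quadratic error functions:

```python
import numpy as np

def newton_step(w, Phi, y):
    grad = Phi.T @ (Phi @ w - y)           # gradient of J at w
    H = Phi.T @ Phi                        # Hessian of J (constant for quadratics)
    return w - np.linalg.solve(H, grad)    # w <- w - H^{-1} grad J(w)
```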

Which method is better?

- Newton's method usually requires significantly fewer iterations than gradient descent.
- Computing the Hessian requires a batch of data, so there is no natural on-line algorithm.
- Inverting the Hessian explicitly is expensive, but there is a very cute trick for computing the product we need in linear time (Schraudolph, 1996).

Coming back to the mean-squared error function

- Good intuitive feel (small errors are ignored, large errors are penalized).
- Nice math (closed-form solution, unique global optimum).
- Geometric interpretation (in our notation, t is y and y is $h_w(x)$).

[Figure: the target vector t projected onto the subspace S spanned by $\varphi_1$ and $\varphi_2$, giving y]

- Any other interpretation?

A probabilistic assumption

- Assume $y_i$ is a noisy target value, generated from a hypothesis $h_w(x)$.
- More specifically, assume that there exists $w$ such that:
  $$y_i = h_w(x_i) + e_i$$
  where $e_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian (normal) distribution with mean zero and variance $\sigma^2$.
- How should we choose the parameter vector $w$?

Bayes theorem in learning

Let $h$ be a hypothesis and $D$ be the set of training data. Using Bayes theorem, we have:
$$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)},$$
where:

- $P(h)$ = prior probability of hypothesis $h$
- $P(D)$ = prior probability of training data $D$ (normalization, independent of $h$)
- $P(h \mid D)$ = probability of $h$ given $D$
- $P(D \mid h)$ = probability of $D$ given $h$ (likelihood of the data)

Choosing hypotheses

$$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$$

What is the most probable hypothesis given the training data?

The maximum a posteriori (MAP) hypothesis $h_{MAP}$ is:
$$
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h \mid D) \\
&= \arg\max_{h \in H} \frac{P(D \mid h) P(h)}{P(D)} \quad \text{(using Bayes theorem)} \\
&= \arg\max_{h \in H} P(D \mid h) P(h)
\end{aligned}
$$

This is the Bayesian answer (more detail next time).

Maximum likelihood estimation

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h) P(h)$$

- If we assume $P(h_i) = P(h_j)$ (all hypotheses are equally likely a priori), then we can further simplify and choose the maximum likelihood (ML) hypothesis:
  $$h_{ML} = \arg\max_{h \in H} P(D \mid h) = \arg\max_{h \in H} L(h)$$
- Standard assumption: the training examples are independently and identically distributed (i.i.d.).
- This allows us to simplify $P(D \mid h)$:
  $$P(D \mid h) = \prod_{i=1}^m P(\langle x_i, y_i \rangle \mid h) = \prod_{i=1}^m P(y_i \mid x_i; h)$$

The log trick

- We want to maximize:
  $$L(h) = \prod_{i=1}^m P(y_i \mid x_i; h)$$
  This is a product, and products are hard to maximize!
- Instead, we will maximize $\log L(h)$ (the log-likelihood function):
  $$\log L(h) = \sum_{i=1}^m \log P(y_i \mid x_i; h)$$

Maximum likelihood for regression

- Adopt the assumption that:
  $$y_i = h_w(x_i) + e_i,$$
  where the $e_i$ are normally distributed with mean 0 and variance $\sigma^2$.
- The best hypothesis maximizes the likelihood of $y_i - h_w(x_i) = e_i$.
- Hence,
  $$L(w) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{y_i - h_w(x_i)}{\sigma}\right)^2}$$
  because the noise variables $e_i$ are drawn from a Gaussian distribution.

Applying the log trick

$$
\begin{aligned}
\log L(w) &= \sum_{i=1}^m \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\frac{(y_i - h_w(x_i))^2}{\sigma^2}} \right) \\
&= \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \sum_{i=1}^m \frac{(y_i - h_w(x_i))^2}{\sigma^2}
\end{aligned}
$$

Maximizing the right-hand side is the same as minimizing:
$$\frac{1}{2} \sum_{i=1}^m \frac{(y_i - h_w(x_i))^2}{\sigma^2}$$

This is our old friend, the sum-squared-error function!
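A small numeric sanity check (assuming Gaussian noise with known $\sigma$; the data below is synthetic and illustrative) that the log-likelihood and the scaled sum-squared error differ only by a constant, so maximizing one is the same as minimizing the other:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.uniform(-1, 1, 50)
y = 1.0 - 2.0 * x + rng.normal(0, sigma, size=x.shape)   # y = h_w*(x) + e

def log_likelihood(w0, w1):
    pred = w0 + w1 * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (y - pred) ** 2 / (2 * sigma ** 2))

def sse(w0, w1):
    return np.sum((y - (w0 + w1 * x)) ** 2)

for w0, w1 in [(0.0, 0.0), (1.0, -2.0)]:
    # log L(w) + SSE(w) / (2 sigma^2) is the same constant for every w
    print(log_likelihood(w0, w1) + sse(w0, w1) / (2 * sigma ** 2))
```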

Maximum likelihood hypothesis for least-squares estimators

- Under the assumption that the training examples are i.i.d. and that we have Gaussian target noise, the maximum likelihood parameters $w$ are those minimizing the sum-squared error:
  $$w^* = \arg\min_w \sum_{i=1}^m (y_i - h_w(x_i))^2$$
- This makes explicit the hypothesis behind minimizing the sum-squared error.
- If the noise is not normally distributed, maximizing the likelihood will not be the same as minimizing the sum-squared error (see homework).
- In practice, different loss functions may be needed.

Regularization

- Remember the intuition: complicated hypotheses lead to overfitting.
- Idea: change the error function to penalize hypothesis complexity:
  $$J(w) = J_D(w) + \lambda J_{pen}(w)$$
  This is called regularization in machine learning and shrinkage in statistics.
- $\lambda$ is called the regularization coefficient and controls how much we value fitting the data well vs. having a simple hypothesis.
- One can view this as making complex hypotheses a priori less likely (though there are some subtleties).

Regularization for linear models

- A squared penalty on the weights would make the math work nicely in our case:
  $$\frac{1}{2} (\Phi w - y)^T (\Phi w - y) + \frac{\lambda}{2} w^T w$$
- This regularization term is also known as weight decay in neural networks.
- Optimal solution:
  $$w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$$
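A minimal sketch of the regularized (ridge) solution above; the value of the regularization coefficient passed in is an illustrative choice:

```python
import numpy as np

def ridge(Phi, y, lam):
    K = Phi.shape[1]
    # w = (Phi^T Phi + lambda I)^{-1} Phi^T y, solved without an explicit inverse
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# Usage: with a design matrix Phi and targets y, larger lam shrinks the
# weights towards zero; lam = 0 recovers the unregularized solution.
```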
