15-830 Machine Learning 2: Nonlinear Regression


15-830 – Machine Learning 2: Nonlinear Regression
J. Zico Kolter
September 18, 2012

Non-linear regression

[Figure: scatter plot of peak hourly demand (GW) against high temperature (F); high temperature / peak demand observations for all days in 2008-2011]

Central idea of non-linear regression: same as linear regression, just with non-linear features, e.g.

$\phi(x_i) = [x_i^2, \; x_i, \; 1]^T$

Two ways to construct non-linear features: explicitly (construct the actual feature vector), or implicitly (using kernels)
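As a concrete MATLAB sketch (not from the slides): fitting least squares with the explicit degree-2 feature vector above, assuming the inputs are stored in an m x 1 column vector X with targets y.

% explicit degree-2 features: row i is [x_i^2, x_i, 1]
Phi = [X.^2, X, ones(size(X))];
theta = Phi \ y;      % least-squares fit
y_hat = Phi*theta;    % fitted values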

[Figure: degree 2 polynomial fit to the peak demand vs. high temperature data]

[Figure: degree 3 polynomial fit to the peak demand vs. high temperature data]

[Figure: degree 4 polynomial fit to the peak demand vs. high temperature data]

Constructing explicit feature vectors

Polynomial features (max degree $d$)

– Special case, $n = 1$: $\phi(z) = [z^d, z^{d-1}, \ldots, z, 1]^T \in \mathbb{R}^{d+1}$ (a MATLAB sketch of this case follows the list)

– General case: $\phi(z) = \left\{ \prod_{i=1}^n z_i^{b_i} : \sum_{i=1}^n b_i \le d \right\} \in \mathbb{R}^{\binom{n+d}{d}}$
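For the $n = 1$ special case, the feature matrix can be built in one line; a minimal sketch, assuming X is an m x 1 column vector and MATLAB R2016b+ (implicit expansion):

% degree-d polynomial features for scalar inputs: row i is [x_i^d, ..., x_i, 1]
d = 4;
Phi = X.^(d:-1:0);    % m x (d+1) via implicit expansion
theta = Phi \ y;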

Radial basis function (RBF) features

– Defined by bandwidth $\sigma$ and $k$ RBF centers $\mu_j \in \mathbb{R}^n$, $j = 1, \ldots, k$:

$\phi_j(z) = \exp\left( -\frac{\|z - \mu_j\|_2^2}{2\sigma^2} \right)$

[Figure: the k bell-shaped RBF feature values plotted over the input range [0, 1]]
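A minimal sketch of constructing RBF features for scalar inputs; the number of centers (k = 10), their uniform placement on [0, 1], and the bandwidth are assumed values for illustration:

% RBF features: Phi(i,j) = exp(-(x_i - mu_j)^2/(2*sigma^2))
sigma = 0.1;
mu = linspace(0, 1, 10);                  % 1 x k row of centers
Phi = exp(-(X - mu).^2/(2*sigma^2));      % m x k via implicit expansion
theta = Phi \ y;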

Difficulties with non-linear features

Problem #1: Computational difficulties

– Polynomial features: $k = \binom{n+d}{d} = O(d^n)$

– RBF features: suppose we want centers on a uniform grid over the input space (with $d$ centers along each dimension); then $k = d^n$

– In both cases, the number of features is exponential in the input dimension, and quickly becomes intractable even to store in memory (a quick numeric illustration follows the list)
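To get a feel for these counts, a quick check with assumed sizes n = 10 and d = 4 (polynomial) or 10 grid centers per dimension (RBF):

% illustrative feature counts for n = 10 input dimensions
k_poly = nchoosek(10 + 4, 4)   % degree-4 polynomial: 1001 features
k_rbf = 10^10                  % RBF grid, 10 centers per dimension: 10^10 features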

Problem #2: Representational difficulties

– With many features, our prediction function becomes very expressive

– This can lead to overfitting (low error on the input data points, but high error nearby)

[Figure: least-squares fits for polynomial features of different degrees (d = 1, 2, 4, 50) on the peak demand vs. high temperature data]

[Figure: least-squares fits for different numbers of RBFs (2, 4, 10, and 50, with λ = 0) on the peak demand vs. high temperature data]

A few ways to deal with the representational problem:

– Choose a less expressive function (e.g., a lower degree polynomial, fewer RBF centers, a larger RBF bandwidth)

– Regularization: penalize large parameters $\theta$ (see the sketch after this list):

$\text{minimize}_\theta \; \sum_{i=1}^m \ell(\hat{y}_i, y_i) + \lambda \|\theta\|_2^2$

$\lambda$: regularization parameter, trades off between low loss and small values of $\theta$ (often, we don't regularize the constant term)

– We'll come back to this issue when talking about evaluating machine learning methods
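A minimal sketch of the regularized least-squares (ridge regression) solution via the normal equations, assuming the constant feature sits in the last column of Phi so it can be left unregularized:

% ridge regression: minimize ||Phi*theta - y||_2^2 + lambda*||theta||_2^2
lambda = 1;
k = size(Phi, 2);
R = lambda*eye(k); R(k, k) = 0;    % no penalty on the constant feature
theta = (Phi'*Phi + R) \ (Phi'*y);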

[Figure: Pareto optimal surface for 20 RBF functions, plotting the loss J(θ) against $\|\theta\|_2^2$]

[Figure: RBF fits (50 RBFs) for regularization parameters λ = 0, 2, 50, and 1000, not regularizing the constant term]

Implicit feature vectors (kernels)

One of the main trends in machine learning over the past 15 years

Kernels let us work in high-dimensional feature spaces without explicitly constructing the feature vector

This addresses the first problem, the computational difficulty

Simple example: polynomial features, $n = 2$, $d = 2$:

$\phi(z) = [1, \; \sqrt{2}z_1, \; \sqrt{2}z_2, \; z_1^2, \; \sqrt{2}z_1 z_2, \; z_2^2]^T$

Let's look at the inner product between two different feature vectors:

$\phi(z)^T \phi(z') = 1 + 2z_1 z_1' + 2z_2 z_2' + z_1^2 z_1'^2 + 2 z_1 z_2 z_1' z_2' + z_2^2 z_2'^2$
$\quad = 1 + 2(z_1 z_1' + z_2 z_2') + (z_1 z_1' + z_2 z_2')^2$
$\quad = 1 + 2(z^T z') + (z^T z')^2 = (1 + z^T z')^2$
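A quick numeric sanity check of this identity, with arbitrary assumed vectors:

% verify phi(z)'*phi(zp) == (1 + z'*zp)^2 for the n = 2, d = 2 feature map
z = [1; 2]; zp = [3; -1];
phi = @(v) [1; sqrt(2)*v(1); sqrt(2)*v(2); v(1)^2; sqrt(2)*v(1)*v(2); v(2)^2];
phi(z)'*phi(zp)    % prints 4
(1 + z'*zp)^2      % prints 4 as well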

General case: $(1 + z^T z')^d$ is the inner product between two polynomial feature vectors of max degree $d$ ($\binom{n+d}{d}$-dimensional)

– But it can be computed in only $O(n)$ time

We use the notation of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ that computes these inner products:

$K(z, z') = \phi(z)^T \phi(z')$

Some common kernels:

– polynomial (degree $d$): $K(z, z') = (1 + z^T z')^d$

– Gaussian (bandwidth $\sigma$): $K(z, z') = \exp\left( -\frac{\|z - z'\|_2^2}{2\sigma^2} \right)$
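Both kernels are one-liners; a sketch as MATLAB anonymous functions for column-vector inputs (the values of d and sigma here are assumptions for illustration):

% common kernels for column vectors z, zp
d = 6; sigma = 0.1;
K_poly = @(z, zp) (1 + z'*zp)^d;
K_gauss = @(z, zp) exp(-norm(z - zp)^2/(2*sigma^2));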

Using kernels in optimization

We can compute inner products, but how does this help us solve optimization problems?

Consider the (regularized) optimization problem we've been using:

$\text{minimize}_\theta \; \sum_{i=1}^m \ell(\theta^T \phi(x_i), y_i) + \lambda \|\theta\|_2^2$

Representer theorem: the solution to the above problem is given by

$\theta^\star = \sum_{i=1}^m \alpha_i \phi(x_i)$ for some $\alpha \in \mathbb{R}^m$ (or, equivalently, $\theta^\star = \Phi^T \alpha$)

Notice that $(\Phi \Phi^T)_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$

Abusing notation a bit, we'll define the kernel matrix $K \in \mathbb{R}^{m \times m}$,

$K = \Phi \Phi^T, \quad K_{ij} = K(x_i, x_j)$

which can be computed without constructing the feature vectors or $\Phi$

Let's take the (regularized) least squares objective

$J(\theta) = \|\Phi\theta - y\|_2^2 + \lambda \|\theta\|_2^2$

and substitute $\theta = \Phi^T \alpha$:

$J(\alpha) = \|\Phi\Phi^T\alpha - y\|_2^2 + \lambda \alpha^T \Phi\Phi^T \alpha = \|K\alpha - y\|_2^2 + \lambda \alpha^T K \alpha$
$\quad = \alpha^T K K \alpha - 2 y^T K \alpha + y^T y + \lambda \alpha^T K \alpha$

Taking the gradient with respect to $\alpha$ and setting it to zero,

$\nabla_\alpha J(\alpha) = 2KK\alpha - 2Ky + 2\lambda K \alpha = 0 \;\Rightarrow\; \alpha^\star = (K + \lambda I)^{-1} y$

How do we compute the prediction on a new input $x'$?

$\hat{y} = \theta^T \phi(x') = \left( \sum_{i=1}^m \alpha_i \phi(x_i) \right)^T \phi(x') = \sum_{i=1}^m \alpha_i K(x_i, x')$

We need to keep around all examples $x_i$ in order to make a prediction: a non-parametric method

MATLAB code for polynomial kernel

% computing alphas
d = 6; lambda = 1;
K = (1 + X*X').^d;
alpha = (K + lambda*eye(m)) \ y;

% computing prediction
k_test = (1 + x_test*X').^d;
y_hat = k_test*alpha;

Gaussian kernel

sigma = 0.1; lambda = 1;
K = exp(-0.5*sqdist(X', X')/sigma^2);
alpha = (K + lambda*eye(m)) \ y;
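As a sanity check (not from the slides): for n = 1 and d = 2, the kernel solution exactly reproduces ridge regression on the explicit features [1, sqrt(2) x, x^2], since (1 + z z')^2 is their inner product:

% kernel ridge vs. explicit-feature ridge, X assumed m x 1
d = 2; lambda = 1;
K = (1 + X*X').^d;
alpha = (K + lambda*eye(m)) \ y;
Phi = [ones(m,1), sqrt(2)*X, X.^2];
theta = (Phi'*Phi + lambda*eye(3)) \ (Phi'*y);
max(abs(K*alpha - Phi*theta))    % ~0 up to numerical precision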

[Figure: fits from polynomial kernels (d = 4 and d = 10, λ = 0.01) and Gaussian kernels (σ = 2 and σ = 40, λ = 0.01) on the peak demand vs. high temperature data]

Kernels can also be used with other loss functions; the key element is just the transformation $\theta = \Phi^T \alpha$. Absolute loss (via YALMIP):

alpha = sdpvar(m,1);
solvesdp([], sum(abs(K*alpha - y)));

Some advanced algorithms are possible here: deadband loss + kernels = "support vector regression"
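If YALMIP isn't available, the same absolute-loss problem can be posed as a linear program by hand; a sketch using linprog, stacking the variables as [alpha; t] with auxiliary t bounding the absolute residuals:

% absolute loss as an LP: min sum(t) s.t. -t <= K*alpha - y <= t
f = [zeros(m,1); ones(m,1)];        % objective: sum of t
A = [K, -eye(m); -K, -eye(m)];      % encodes |K*alpha - y| <= t
b = [y; -y];
sol = linprog(f, A, b);
alpha = sol(1:m);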
