1m ago

0 Views

0 Downloads

633.19 KB

53 Pages

Tags:

Transcription

Predictive Learningfrom DataLECTURE SET 2Problem Setting, Basic LearningProblems and Inductive PrinciplesElectrical and Computer Engineering1

OUTLINE2.0 Objectives Background- formalization of inductive learning- classical statistics vs predictive approach2.1 Terminology and Learning Problems2.2 Basic Learning Methods and ComplexityControl2.3 Inductive Principles2.4 Alternative Learning Formulations2.5 Summary2

2.0 Objectives- To quantify the notions of explanation,prediction and model- Introduce terminology- Describe common learning problems Past observations data pointsExplanation (model) functionLearning function estimation (from data)Prediction using the model to predict new inputs3

Example: classification problemtraining samples, modelGoal 1: explain training data min training errorGoal 2: generalization (for future data) Learning (model estimation) is ill-posed4

Mathematical formalization Learning machine predictive systemGeneratorof samplesxˆyLearningMachineySystem Unknown joint distribution P(x,y)Set of functions (possible models) f (x, )Pre-specified Loss function L(y, f (x, ))(by convention, non-negative Loss )5

Inductive Learning Setting The learning machine observes samples (x ,y), andreturns an estimated response yˆ f (x, w) Two types of inference: identification vs imitation Risk Loss(y, f(x,w)) dP(x,y) minGeneratorof y6

Two Views of Empirical Inference Two approaches to empirical or statistical inferenceEMPIRICAL DATAKNOWLEDGE,ASSUMPTIONSSTATISTICAL INFERENCEPROBABILISTICMODELING RISK-MINIMIZATIONAPPROACHThese two approaches are different both technically andconceptually7

Classical Approaches to Inductive InferenceGeneric problem: finite data Model(1) Classical Science hypothesis testingexperimental data is generated by a given model(single function scientific theory)(2) Classical statistics max likelihood data generated by a parametric model for density.Note: loss fct likelihood (not problem-specific) The same solution approach for all types of problemsR. Fisher: “uncertain inferences” from finite datasee: R. Fisher (1935), The Logic of Inductive Inference, J. Royal StatisticalSociety, available at http://www.dcscience.net/fisher-1935.pdf8

Discussion Math formulation useful for quantifying- explanation fitting error (training data)- generalization prediction errorNatural assumptions- future similar to past: stationary P(x,y),i.i.d.data- discrepancy measure or loss function,i.e. mean squared error (MSE)What if these assumptions do not hold?9

OUTLINE2.0 Objectives2.1 Terminology and Learning Problems- supervised/ unsupervised- classification- regression etc.2.2 Basic Learning Methods and ComplexityControl2.3 Inductive Principles2.4 Alternative Learning Formulations2.5 Summary10

Supervised Learning: Regression Data in the form (x,y), where- x is multivariate input (i.e. vector)- y is univariate output (‘response’)2()Regression: y is real-valued L( y, f (x)) y f (x) Estimation of real-valued function11

Regression Estimation ProblemGiven: training data (xi , yi ), i 1,2,.nFind a function f (x, w ) that minimizes squarederror for a large number (N) of future samples:N k 1[( y k f (x k , w)]2 min2 dP(x,y) min(y f(x,w)) BUT future data is unknown P(x,y) unknown All estimation problems are ill-posed12

Supervised Learning: Classification Data in the form (x,y), where- x is multivariate input (i.e. vector)- y is univariate output (‘response’)Classification: y is categorical (class label) 0 if y f (x )L( y, f (x )) 1 if y f (x ) Estimation of indicator function13

Density Estimation Data in the form (x), where- x is multivariate input (feature vector)Parametric form of density is given: f (x, )The loss function is likelihood or, morecommon, the negative log-likelihoodL( f (x, )) ln f (x, ) The goal of learning is minimization ofR( ) lnf (x, ) p (x )dxfrom finite training data, yielding f (x, 0 )14

Unsupervised Learning 1 Data in the form (x), where- x is multivariate input (i.e. feature vector)Goal: data reduction or clustering Clustering estimation of mapping X C,2L(x,f(x)) x f(x)where C c1, c2 ,.,c m and15

Unsupervised Learning 2 Data in the form (x), where- x is multivariate input (i.e. vector)Goal: dimensionality reduction Mapping f (x) is projection of the data ontolow-dimensional subspace, minimizing lossL(x, f (x)) x f (x)216

OUTLINE2.0 Objectives2.1 Terminology and Learning Problems2.2 Basic Learning Methods andComplexity Control- Parametric modeling- Non-parametric modeling- Data reduction- Complexity control2.3 Inductive Principles2.4 Alternative Learning Formulations2.5 Summary17

Basic learning methodsGeneral idea Specify a wide set of possible models f (x, )where is an abstract set of ‘parameters’*Estimate model parameters by minimizinggiven loss function for training data ( ERM)Learning methods differ in Chosen parameterizationLoss function usedOptimization method used for parameterestimation18

Parametric Modeling ( ERM)Given training data (xi , yi ), i 1,2,.n(1) Specify parametric model(2) Estimate its parameters (via fitting to data) Example: Linear regression F(x) (w x) bn21Remp ( w, b) yi ( w xi ) b minn i 119

Parametric Modeling: classificationGiven training data(xi , yi ), i 1,2,.n(1) Specify parametric model(2) Estimate its parameters (via fitting to data)Example: univariate classification data set(a) Linear decision boundary(b) third-order polynomialf ( x) sign (x b )(f ( x) sign x wx b220)

Parametric Methods in Classical Statistics Learning density estimation, i.i.d. data Maximum Likelihood inductive principle:Given n training samples X, find w* maximizingP data model P (X w ) p(x i ; w )ni 1equivalently, minimize negative log-likelihood-See textbook, Section 2.2, for example:Estimate two parameters of normal distributionfrom i.i.d. data samples via max likelihood empirical mean and empirical variance)21

Maximum Likelihood (cont’d) Similar approach for regression for knownparametric distribution (normal noise) maximizing likelihood min squared loss Similar approach for classification: forknown class distributions (Gaussian)maximizing likelihood second-orderdecision boundaryGeneral approach: (statistical decision theory) Start with parametric form of a distribution Estimate its parameters via max likelihood Use estimated distributions for makingdecision (prediction)22

Non-Parametric ModelingGiven training data(xi , yi ), i 1,2,.nEstimate the model (for given x 0) as‘local average’ of the data.Note: need to define ‘local’, ‘average’ Example: k-nearest neighbors regressionkf (x 0 ) yj 1jk23

Data Reduction ApproachGiven training data, estimate the model as ‘compactencoding’ of the data.Note: ‘compact’ # of bits to encode the modelor# of bits to encode the data (MDL) Example: piece-wise linear regressionHow many parameters neededfor two-linear-component model?24

Data Reduction Approach (cont’d)Data Reduction approaches are commonly usedfor unsupervised learning tasks. Example: clustering.Training data encoded by 3 points (cluster centers)HIssues:- How to find centers?- How to select thenumber of clusters?25

Diverse terminology (for learning methods) Many methods differ in parameterizationof admissible models or approximatingfunctions yˆ f (x, w)- neural networks- decision trees- signal processing ( wavelets) How training samples are used:Batch methodsOn-line or flow-through methods26

Motivation for Complexity ControlEffect of model control on generalization(a) Classification(b) Regression27

Complexity Control: parametric modelingConsider regression estimation Ten training samplesy x 2 N (0, 2 ), where 2 0.25 Fitting linear and 2-nd order polynomial:28

Complexity Control: local estimationConsider regression estimation Ten training samples fromy x 2 N (0, 2 ), where 2 0.25 Using k-nn regression with k 1 and k 4:29

Complexity Control (summary) Complexity (of admissible models) affectsgeneralization (for future data) Specific complexity indices for– Parametric models: # of parameters– Local modeling: size of local region– Data reduction: # of clusters Complexity control choosing optimalcomplexity ( good generalization) forgiven (training) data set not well-understood in classical statistics30

OUTLINE2.0 Objectives2.1 Terminology and Learning Problems2.2 Basic Learning Methods andComplexity Control2.3 Inductive Principles- Motivation- Inductive Principles: Penalization,SRM, Bayesian Inference, MDL2.4 Alternative Learning Formulations2.5 Summary31

Conceptual Motivation Generalization from finite data requires:a priori knowledge any info outsidetraining data, e.g. ?inductive principle general strategies forcombining a priori knowledge and datalearning method constructiveimplementation of inductive principle Example: Empirical Risk Minimization parametric modeling approachQuestion: what are possible limitations of ERM?32

Motivation (cont’d) Need for flexible (adaptive) methods- wide ( flexible) parameterization ill-posed estimation problems- need provisions for complexity control Inductive Principles originate fromstatistics, applied math, info theory,learning theory – and they adopt distinctlydifferent terminology & concepts33

Inductive Principles Inductive Principles differ in terms of- representation of a priori knowledge- mechanism for combining a prioriknowledge with training data- applicability when the true model doesnot belong to admissible models- availability of constructive procedures(learning methods/ algorithms)Note: usually prior knowledge about parameterization34

PENALIZATION Overcomes the limitations of ERMPenalized empirical risk functionalR pen ( ) Remp ( ) f (x, ) f (x, ) is non-negative penalty functionalspecified a priori (independent of the data); itslarger values penalize complex functions.is regularization parameter (non-negativenumber) tuned to training dataExample: ridge regression 35

Structural Risk Minimization Overcomes the limitations of ERMComplexity ordering on a set of admissiblemodels, as a nested structureS0 S1 S2 .Examples: a set of polynomial models, Fourierexpansion etc. Goal of learning minimization of empiricalrisk for an optimally selected elementSk36

Bayesian Inference Probabilistic approach to inference Explicitly defines a priori knowledge as priorprobability (distribution) on a set of modelparametersBayes formula for updating prior probabilityusing the evidence given by training data: P data model P model P model data P data P model data posterior probabilityP data model likelihood (probability that the data aregenerated by a model)37

Bayesian Density Estimation Consider parametric density estimation whereprior probability distribution P model p(w)Given training data X, the posterior probabilitydistribution is updatedP(X w ) p(w )p(w X ) P(X )P model data P model 0w38

Implementation of Bayesian Inference Maximum Likelihood,i.e. choose w* maximizingP data model P (X w ) p(x i ; w )ni 1 (equivalent to ERM)True Bayesian inference (averaging) (x X ) p(x ; w ) p(w X )dwWhere p(x; w ) is a set of admissible densitiesP(X w ) p(w )andp(w X ) P(X )39

Minimum Description Length (MDL) Information-theoretic approach- any training data set can be optimally encoded- code length generalization capabilityRelated to the Data Reduction approachintroduced (informally) earlier.Two possible implementations:- lossy encoding- lossless encoding of the data (as in MDL)40

Binary Classification under MDL Consider training data setX {xk,yk} (k 1,2,.n) where y {0,1}Given data object X {x1,., xn} is a binary stringy1,.,yn random?if there is a dependency then the output stringcan be encoded by a shorter code:- the model having code length L (model)- the error term L( data model) the total length of such a code for string y is:b L (model) L( data model)and the compression coefficient is K b / n41

Comparison of Inductive Principles Representation of a priori knowledge/ complexity:penalty term, structure, prior distribution,codebookFormal procedure for complexity control:penalized risk, optimal element of a structure,posterior distributionConstructive implementation of complexity control:resampling, analytic bounds, marginalization,minimum code length***See Table 2.1 in [Cherkassky & Mulier, 2007]***42

OUTLINE2.0 Objectives2.1 Terminology and Learning Problems2.2 Basic Learning Methods and ComplexityControl2.3 Inductive Principles2.4 Alternative Learning Formulations- Motivation- Examples of non-standard formulations- Formalization of application domain2.5 Summary43

Motivation Estimation of predictive modelStep 1: Problem specification/ FormalizationStep 2: Model estimation, learning, inference Standard Inductive Formulation- usually assumed in all ML algorithms- certainly may not be the best formalization forgiven application problem44

Standard Supervised LearningGeneratorof samplesxLearningMachineSystem f(x.w)LossL(f(x,w),y)yAvailable (training) data format (x,y)Test samples (x-values) are unknownStationary distribution, i.i.d samplesSingle model needs to be estimatedSpecific loss functions adopted for common tasks(classification, regression etc.)45

Non-standard Learning Settings Available Data Format- x-values of test samples are knownduring training Transduction, semi-supervised learningDifferent (non-standard) Loss Function- see later example ‘learning the sign of afunction’Univariate Output ( a single model)- multiple outputs may need to beestimated from available data46

Transduction predicting function values at given points:Given labeled training set x-values of test dataEstimate (predict) y-values for given test inputsa priori ontrainingdatatransductionpredictedoutput47

Learning sign of a function Given training data (xi , yi ), i 1,2,.ny [ 2, 2]with y-values in a bounded rangeEstimate function f (x) predicting sign of yLoss function L( y, f ( x)) yf (x)If prediction is wrong real-valued loss yIf prediction is correct real-valued gain y Neither standard regression, nor classificationPractical application: frequent trading48

Multiple Model Estimation Training data in the form (x,y), where- x is multivariate input- y is univariate real-valued output (‘response’)Similar to standard regression, but subsets ofdata may be described by different models49

Formalization of Application Problems Problem Specification Step cannot be formalizedBut Several guidelines can be helpful duringformalization process Mapping process:Application requirements Learning formulationSpecific components of this mapping processare shown next50

APPLICATIONLossFunctionNEEDSInput, output,other variablesTraining/test dataAdmissibleModelsFORMAL PROBLEM STATEMENTLEARNING THEORY51

Summary Standard Inductive Learning function estimation Goal of learning (empirical inference):to act/perform well, not system identification Important concepts:- training data, test data- loss function, prediction error ( prediction risk)- basic learning problems Complexity control Inductive principles – which one is the ‘best’ ?52

Summary (cont’d) Assumptions for inductive learning Non-standard learning formulationsAside: predictive modeling ofphysical systems vs social systemsNote: main assumption (stationarity) does not hold insocial systems (business data, financial data etc.) For discussion think of example application thatrequires non-standard learning formulationNote: (a) do not use examples similar to onespresented in my lectures and/or text book(b) you can email your example to instructor(maximum half-a-page)53

R. Fisher: "uncertain inferences" from finite data see: R. Fisher (1935), The Logic of Inductive Inference, . - Parametric modeling - Non-parametric modeling - Data reduction - Complexity control 2.3 Inductive Principles 2.4 Alternative Learning Formulations 2.5 Summary. 18

Related Documents: