
Modern Statistical Methods

Rajen D. Shah
r.shah@statslab.cam.ac.uk
Course webpage: http://www.statslab.cam.ac.uk/~rds37/modern_stat_methods.html

In this course we will study a selection of important modern statistical methods. This selection is heavily biased towards my own interests, but I hope it will nevertheless give you a flavour of some of the most important recent methodological developments in statistics.

Over the last 25 years, the sorts of datasets that statisticians have been challenged to study have changed greatly. Where in the past we were used to datasets with many observations and a few carefully chosen variables, we now see datasets where the number of variables can run into the thousands and greatly exceed the number of observations. For example, with microarray data, we typically have gene expression values measured for several thousand genes, but only for a few hundred tissue samples. Classical statistical methods are often simply not applicable in these "high-dimensional" situations.

The course is divided into 4 chapters (of unequal size). Our first chapter will start by introducing ridge regression, a simple generalisation of ordinary least squares. Our study of this will lead us to some beautiful connections with functional analysis and ultimately to one of the most successful and flexible classes of learning algorithms: kernel machines.

The second chapter concerns the Lasso and its extensions. The Lasso has been at the centre of much of the development that has occurred in high-dimensional statistics, and will allow us to perform regression in the seemingly hopeless situation where the number of parameters we are trying to estimate is larger than the number of observations.

In the third chapter we will study graphical modelling and provide an introduction to the exciting field of causal inference.
Where the previous chapters consider methods for relating a particular response to a large collection of (explanatory) variables, graphical modelling will give us a way of understanding relationships between the variables themselves. Ultimately we would like to infer causal relationships between variables based on (observational) data. This may seem like a fundamentally impossible task, yet we will show how, by developing the graphical modelling framework further, we can begin to answer such causal questions.

Statistics is not only about developing methods that can predict well in the presence of noise, but also about assessing the uncertainty in our predictions and estimates. In the final chapter we will tackle the problem of performing thousands of hypothesis tests at the same time and, more generally, the task of quantifying uncertainty in high-dimensional settings.

Before we begin the course proper, we will briefly review two key classical statistical methods: ordinary least squares and maximum likelihood estimation. This will help to set the scene and provide a warm-up for the modern methods to come later.

Classical statistics

Ordinary least squares

Imagine data are available in the form of observations $(Y_i, x_i) \in \mathbb{R} \times \mathbb{R}^p$, $i = 1, \dots, n$, and the aim is to infer a simple regression function relating the average value of a response, $Y_i$, and a collection of predictors or variables, $x_i$. This is an example of regression analysis, one of the most important tasks in statistics.

A linear model for the data assumes that it is generated according to
\[
Y = X\beta^0 + \varepsilon, \tag{0.0.1}
\]
where $Y \in \mathbb{R}^n$ is the vector of responses; $X \in \mathbb{R}^{n \times p}$ is the predictor matrix (or design matrix) with $i$th row $x_i^T$; $\varepsilon \in \mathbb{R}^n$ represents random error; and $\beta^0 \in \mathbb{R}^p$ is the unknown vector of coefficients.

Provided $p \le n$, a sensible way to estimate $\beta^0$ is by ordinary least squares (OLS). This yields an estimator $\hat{\beta}^{OLS}$ with
\[
\hat{\beta}^{OLS} := \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 = (X^T X)^{-1} X^T Y, \tag{0.0.2}
\]
provided $X$ has full column rank.

Under the assumptions that (i) $E(\varepsilon_i) = 0$ and (ii) $\mathrm{Var}(\varepsilon) = \sigma^2 I$, we have that:

- $E_{\beta^0, \sigma^2}(\hat{\beta}^{OLS}) = E\{(X^T X)^{-1} X^T (X\beta^0 + \varepsilon)\} = \beta^0$.
- $\mathrm{Var}_{\beta^0, \sigma^2}(\hat{\beta}^{OLS}) = (X^T X)^{-1} X^T \mathrm{Var}(\varepsilon) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$.

The Gauss–Markov theorem states that OLS is the best linear unbiased estimator in our setting: for any other unbiased estimator $\tilde{\beta}$ that is linear in $Y$ (so $\tilde{\beta} = AY$ for some fixed matrix $A$), the matrix
\[
\mathrm{Var}_{\beta^0, \sigma^2}(\tilde{\beta}) - \mathrm{Var}_{\beta^0, \sigma^2}(\hat{\beta}^{OLS})
\]
is positive semi-definite.

Maximum likelihood estimation

The method of least squares is just one way to construct an estimator. A more general technique is that of maximum likelihood estimation. Here, given data $y \in \mathbb{R}^n$ that we take as a realisation of a random variable $Y$, we specify its density $f(y; \theta)$ up to some unknown vector of parameters $\theta \in \Theta \subseteq \mathbb{R}^d$, where $\Theta$ is the parameter space. The likelihood function is a function of $\theta$ for each fixed $y$ given by
\[
L(\theta) := L(\theta; y) = c(y) f(y; \theta),
\]
where $c(y)$ is an arbitrary constant of proportionality. The maximum likelihood estimate of $\theta$ maximises the likelihood, or equivalently it maximises the log-likelihood
\[
\ell(\theta) := \ell(\theta; y) = \log f(y; \theta) + \log(c(y)).
\]
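The OLS computation in (0.0.2) can be checked numerically. Below is a minimal sketch in NumPy; the simulated design, the coefficient values and all variable names are illustrative choices, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Simulate from the linear model Y = X beta^0 + eps, with E(eps_i) = 0
# and Var(eps) = sigma^2 I, matching assumptions (i) and (ii) above.
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -2.0, 0.5])
sigma = 0.3
Y = X @ beta0 + sigma * rng.standard_normal(n)

# OLS estimator (X^T X)^{-1} X^T Y; numerically it is preferable to
# solve the normal equations rather than form the inverse explicitly.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)  # close to beta0, reflecting unbiasedness of OLS
```

With a well-conditioned design and small noise level, the estimate lands close to the true coefficient vector, as the unbiasedness and variance formulas above predict.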

A very useful quantity in the context of maximum likelihood estimation is the Fisher information matrix, with $jk$th ($1 \le j, k \le d$) entry
\[
i_{jk}(\theta) := -E_\theta \left\{ \frac{\partial^2}{\partial \theta_j \partial \theta_k} \ell(\theta) \right\}.
\]
It can be thought of as a measure of how hard it is to estimate $\theta$ when it is the true parameter value. The Cramér–Rao lower bound states that if $\tilde{\theta}$ is an unbiased estimator of $\theta$, then under regularity conditions,
\[
\mathrm{Var}_\theta(\tilde{\theta}) - i^{-1}(\theta)
\]
is positive semi-definite.

A remarkable fact about maximum likelihood estimators (MLEs) is that (under quite general conditions) they are asymptotically normally distributed, asymptotically unbiased and asymptotically achieve the Cramér–Rao lower bound.

Assume that the Fisher information matrix when there are $n$ observations, $i^{(n)}(\theta)$ (where we have made the dependence on $n$ explicit), satisfies $i^{(n)}(\theta)/n \to I(\theta)$ for some positive definite matrix $I(\theta)$. Then, denoting the maximum likelihood estimator of $\theta$ when there are $n$ observations by $\hat{\theta}^{(n)}$, under regularity conditions, as the number of observations $n \to \infty$ we have
\[
\sqrt{n}(\hat{\theta}^{(n)} - \theta) \xrightarrow{d} N_d(0, I^{-1}(\theta)).
\]

Returning to our linear model, if we assume in addition that $\varepsilon_i \sim N(0, \sigma^2)$, then the log-likelihood for $(\beta, \sigma^2)$ is
\[
\ell(\beta, \sigma^2) = -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2.
\]
We see that the maximum likelihood estimate of $\beta$ and OLS coincide. It is easy to check that
\[
i(\beta, \sigma^2) = \begin{pmatrix} \sigma^{-2} X^T X & 0 \\ 0 & n\sigma^{-4}/2 \end{pmatrix}.
\]
The general theory for MLEs would suggest that approximately
\[
\sqrt{n}(\hat{\beta} - \beta) \sim N_p(0, n\sigma^2 (X^T X)^{-1});
\]
in fact it is straightforward to show that this distributional result is exact.
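The coincidence of the Gaussian MLE of $\beta$ with OLS can be seen concretely by maximising the log-likelihood numerically and comparing with the closed-form OLS solution. A small sketch on simulated data (the data-generating settings and step-size choice are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.standard_normal((n, p))
beta0 = np.array([0.5, -1.0])
Y = X @ beta0 + rng.standard_normal(n)  # sigma^2 = 1

# Closed-form OLS solution.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Maximise l(beta) = -||Y - X beta||^2 / 2 (sigma^2 held fixed at 1)
# by gradient ascent; the gradient of l is X^T (Y - X beta).
beta = np.zeros(p)
step = 1.0 / np.linalg.eigvalsh(X.T @ X)[-1]  # 1 / largest eigenvalue
for _ in range(5000):
    beta = beta + step * (X.T @ (Y - X @ beta))

print(np.max(np.abs(beta - beta_ols)))  # essentially zero: MLE = OLS
```

Because the log-likelihood in $\beta$ is a concave quadratic, gradient ascent with this step size converges to its unique maximiser, which is exactly the OLS estimate.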

Contents

1 Kernel machines
  1.1 Ridge regression
    1.1.1 The singular value decomposition and principal components analysis
  1.2 v-fold cross-validation
  1.3 The kernel trick
  1.4 Kernels
    1.4.1 Examples of kernels
    1.4.2 Reproducing kernel Hilbert spaces
    1.4.3 The representer theorem
  1.5 Kernel ridge regression
  1.6 Other kernel machines
    1.6.1 The support vector machine
    1.6.2 Logistic regression
  1.7 Large-scale kernel machines

2 The Lasso and beyond
  2.1 Model selection
  2.2 The Lasso estimator
    2.2.1 Prediction error of the Lasso with no assumptions on the design
    2.2.2 Basic concentration inequalities
    2.2.3 Some facts from optimisation theory and convex analysis
    2.2.4 Lasso solutions
    2.2.5 Variable selection
    2.2.6 Prediction and estimation
    2.2.7 Computation
  2.3 Extensions of the Lasso
    2.3.1 Structural penalties
    2.3.2 Reducing the bias of the Lasso

3 Graphical modelling and causal inference
  3.1 Graphs
  3.2 Conditional independence graphs
  3.3 Gaussian graphical models
    3.3.1 Normal conditionals
    3.3.2 Nodewise regression
    3.3.3 The precision matrix and conditional independence
    3.3.4 The Graphical Lasso
  3.4 Structural equation models
  3.5 Interventions
  3.6 The Markov properties on DAGs
  3.7 Causal structure learning
    3.7.1 Three obstacles
    3.7.2 The PC algorithm

4 Multiple testing and high-dimensional inference
  4.1 The closed testing procedure
  4.2 The False Discovery Rate
  4.3 Inference in high-dimensional regression
    4.3.1 Using the debiased Lasso in practice

Chapter 1

Kernel machines

Let us revisit the linear model with
\[
Y_i = x_i^T \beta^0 + \varepsilon_i.
\]
For unbiased estimators of $\beta^0$, their variance gives a way of comparing their quality in terms of squared error loss. For a potentially biased estimator, $\tilde{\beta}$, the relevant quantity is
\[
E_{\beta^0, \sigma^2}\{(\tilde{\beta} - \beta^0)(\tilde{\beta} - \beta^0)^T\} = E[\{\tilde{\beta} - E(\tilde{\beta}) + E(\tilde{\beta}) - \beta^0\}\{\tilde{\beta} - E(\tilde{\beta}) + E(\tilde{\beta}) - \beta^0\}^T] = \mathrm{Var}(\tilde{\beta}) + \{E(\tilde{\beta} - \beta^0)\}\{E(\tilde{\beta} - \beta^0)\}^T,
\]
a sum of variance and squared bias terms (the cross terms vanish since $E\{\tilde{\beta} - E(\tilde{\beta})\} = 0$ and $E(\tilde{\beta}) - \beta^0$ is deterministic). A crucial part of the optimality arguments for OLS and MLEs was unbiasedness. Do there exist biased methods whose variance is reduced compared to OLS such that their overall prediction error is lower? Yes! In fact the use of biased estimators is essential in dealing with settings where the number of parameters to be estimated is large compared to the number of observations. In the first two chapters we'll explore two important methods for variance reduction based on different forms of penalisation: rather than forming estimators via optimising a least squares or log-likelihood term, we will introduce an additional penalty term that encourages estimates to be shrunk towards 0 in some sense. This will allow us to produce reliable estimators that work well when classical MLEs are infeasible, and in other situations can greatly out-perform the classical approaches.

1.1 Ridge regression

One way to reduce the variance of $\hat{\beta}^{OLS}$ is to shrink the estimated coefficients towards 0. Ridge regression [Hoerl and Kennard, 1970] does this by solving the following optimisation problem:
\[
(\hat{\mu}^R_\lambda, \hat{\beta}^R_\lambda) = \arg\min_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \{\|Y - \mu 1 - X\beta\|_2^2 + \lambda \|\beta\|_2^2\}.
\]
Here $1$ is an $n$-vector of 1's. We see that the usual OLS objective is penalised by an additional term proportional to $\|\beta\|_2^2$. The parameter $\lambda \ge 0$, which controls the severity of
the penalty and therefore the degree of the shrinkage towards 0, is known as a regularisation parameter or tuning parameter. We have explicitly included an intercept term which is not penalised. The reason for this is that were the variables to have their origins shifted, so that e.g. a variable representing temperature is given in units of Kelvin rather than Celsius, the fitted values would not change. However, $X\hat{\beta}$ is not invariant under scale transformations of the variables, so it is standard practice to centre each column of $X$ (hence making them orthogonal to the intercept term) and then scale them to have $\ell_2$-norm $\sqrt{n}$.

It is straightforward to show that after this standardisation of $X$, $\hat{\mu}^R_\lambda = \bar{Y} := \sum_{i=1}^n Y_i / n$, so we may assume that $\sum_{i=1}^n Y_i = 0$ by replacing $Y_i$ with $Y_i - \bar{Y}$, and then we can remove $\mu$ from our objective function. In this case
\[
\hat{\beta}^R_\lambda = (X^T X + \lambda I)^{-1} X^T Y.
\]
In this form, we can see how the addition of the $\lambda I$ term helps to stabilise the estimator. Note that when $X$ does not have full column rank (such as in high-dimensional situations), we can still compute this estimator. On the other hand, when $X$ does have full column rank, we have the following theorem.

Theorem 1. For $\lambda > 0$ sufficiently small (depending on $\beta^0$ and $\sigma^2$),
\[
E(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T - E(\hat{\beta}^R_\lambda - \beta^0)(\hat{\beta}^R_\lambda - \beta^0)^T
\]
is positive definite.

Proof. First we compute the bias of $\hat{\beta}^R_\lambda$.
We drop the subscript $\lambda$ and superscript $R$ for convenience.
\[
E(\hat{\beta}) - \beta^0 = (X^T X + \lambda I)^{-1} X^T X \beta^0 - \beta^0 = (X^T X + \lambda I)^{-1} (X^T X + \lambda I - \lambda I) \beta^0 - \beta^0 = -\lambda (X^T X + \lambda I)^{-1} \beta^0.
\]
Now we look at the variance of $\hat{\beta}$:
\[
\mathrm{Var}(\hat{\beta}) = E[\{(X^T X + \lambda I)^{-1} X^T \varepsilon\}\{(X^T X + \lambda I)^{-1} X^T \varepsilon\}^T] = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}.
\]
Thus $E(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T - E(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)^T$ is equal to
\[
\sigma^2 (X^T X)^{-1} - \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} - \lambda^2 (X^T X + \lambda I)^{-1} \beta^0 \beta^{0T} (X^T X + \lambda I)^{-1}.
\]
After some simplification, we see that this is equal to
\[
\lambda (X^T X + \lambda I)^{-1} [\sigma^2 \{2I + \lambda (X^T X)^{-1}\} - \lambda \beta^0 \beta^{0T}] (X^T X + \lambda I)^{-1}.
\]
Thus $E(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T - E(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)^T$ is positive definite for $\lambda > 0$ if and only if
\[
\sigma^2 \{2I + \lambda (X^T X)^{-1}\} - \lambda \beta^0 \beta^{0T}
\]
is positive definite, which is true for $\lambda > 0$ sufficiently small (we can take $0 < \lambda < 2\sigma^2 / \|\beta^0\|_2^2$).
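Theorem 1 can be illustrated by simulation: with $\lambda$ inside the range allowed by the proof ($\lambda < 2\sigma^2 / \|\beta^0\|_2^2$), the ridge estimator has smaller Monte Carlo risk than OLS. The particular values of $n$, $p$, $\sigma$ and $\lambda$ below are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.standard_normal((n, p))  # fixed design with full column rank
beta0 = np.full(p, 0.5)          # ||beta0||_2^2 = 2.5
sigma = 2.0
lam = 2.0                        # below 2 sigma^2 / ||beta0||_2^2 = 3.2

def ridge(X, Y, lam):
    """Ridge estimator (X^T X + lam I)^{-1} X^T Y (centred data assumed)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Monte Carlo estimates of the risk E ||beta_hat - beta0||_2^2.
n_rep = 2000
err_ols = err_ridge = 0.0
for _ in range(n_rep):
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
    err_ols += np.sum((b_ols - beta0) ** 2)
    err_ridge += np.sum((ridge(X, Y, lam) - beta0) ** 2)

print(err_ols / n_rep, err_ridge / n_rep)  # ridge risk is the smaller
```

Since positive definiteness of the matrix difference in Theorem 1 implies its trace is positive, the ridge estimator's total squared-error risk must be below that of OLS for such $\lambda$, which is what the simulation shows.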

The theorem says that β̂λR outperforms β̂ OLS provided λ is chosen appropriately. Tobe able to use ridge
