A Tutorial on Restricted Maximum Likelihood Estimation in Linear Regression and Linear Mixed-Effects Models

Xiuming Zhang
zhangxiuming@u.nus.edu
A*STAR-NUS Clinical Imaging Research Center

October 12, 2015

Summary

This tutorial derives in detail an estimation procedure, restricted maximum likelihood (ReML) [Patterson and Thompson, 1971] [Harville, 1974], that is able to produce unbiased estimates of the variance components of a linear model. We first introduce the concept of bias in variance components under maximum likelihood (ML) estimation in simple linear regression, and then discuss a post hoc correction. Next, we apply ReML to the same model and compare the ReML estimate with the ML estimate followed by the post hoc correction. Finally, we explain the linear mixed-effects (LME) model for longitudinal analysis [Bernal-Rusiel et al., 2013] and demonstrate how to obtain unbiased estimates of its parameters with ReML.

1 Linear Regression

Familiarity with basic linear regression facilitates the understanding of more complex linear models. We therefore start here and introduce the concept of bias in estimating variance components.

1.1 The Model

The simplest linear regression is of the form y = β_0 + β_1 x + ε, where y is called the response (or dependent variable, or prediction), x is called the regressor (or explanatory variable, or independent variable), the β's are regression coefficients, and ε is called the residual (or error), which is assumed to be distributed as a zero-mean Gaussian with an unknown variance, i.e., ε ∼ N(0, σ²).

When we have more than one regressor (a.k.a. multiple linear regression, a single-response special case of the general linear model, which is itself a special case of the generalized linear model with identity link and normally distributed responses), the model comes in its matrix form

    y = Xβ + ε,    (1)

where y is the response vector, X is the design matrix with each of its rows specifying under what design or conditions the corresponding response is observed (hence the name), β is the vector of regression coefficients, and ε is the residual vector, distributed as a zero-mean multivariate Gaussian with a diagonal covariance matrix, ε ∼ N(0, σ² I_N), where I_N is the N × N identity matrix. Therefore

    y ∼ N(Xβ, σ² I_N),    (2)

meaning that the linear combination Xβ explains (or predicts) the response y, with uncertainty characterized by a variance of σ².

As seen, this assumption on the covariance matrix mandates that (i) residuals among responses are independent, and (ii) residuals of all the responses have the same variance σ². This assumption makes parameter estimation straightforward, but it also imposes some limitations on the model. For example, the model cannot properly handle intercorrelated responses, such as the longitudinal measurements of one individual. This motivates the use of the linear mixed-effects (LME) model in analyzing longitudinal data [Bernal-Rusiel et al., 2013].

1.2 Parameter Estimation

Under the model assumptions, we aim to estimate the unknown parameters (β and σ²) from the data available (X and y). Maximum likelihood (ML) estimation is the most common approach. We maximize the log-likelihood w.r.t. β and σ²,

    L(β, σ² | y, X) = -(N/2) log 2π - (N/2) log σ² - (1/(2σ²)) (y - Xβ)^T (y - Xβ),

and obtain

    β̂ = (X^T X)^{-1} X^T y,    (3)
    σ̂² = (1/N) (y - Xβ̂)^T (y - Xβ̂),    (4)

where N is the number of responses. β̂ is simply the ordinary least squares (OLS) estimator of β, and we compute σ̂² by plugging in the value of β̂.
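To make Equations (3) and (4) concrete, here is a minimal NumPy sketch (not part of the original tutorial; the design, coefficients, and noise level are arbitrary illustrative choices) that simulates data from the model in Equation (2) and computes the ML estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y ~ N(X beta, sigma^2 I_N) for an arbitrary design (illustrative values).
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, k - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
sigma2_true = 4.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

# Equation (3): the ML (= OLS) estimate of beta.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equation (4): the ML estimate of sigma^2, which divides by N and is therefore biased.
resid = y - X @ beta_hat
sigma2_ml = resid @ resid / N

print("beta_hat:", beta_hat)
print("sigma2_ml:", sigma2_ml)
```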

As we will prove in the following section, the estimator σ̂² is biased downwards compared with the true value σ², because we neglect the loss of degrees of freedom (DoF) incurred by estimating β.

1.3 Estimation Bias in the Variance Component

The bias of an estimator refers to the difference between the estimator's expectation (here E{σ̂²}) and the true value (here σ²). To facilitate computing E{σ̂²}, we define the matrix A = X(X^T X)^{-1} X^T, so that Ay = Xβ̂ = ŷ. In fact, A is an orthogonal projection that satisfies A² = A (idempotent) and A^T = A (Hermitian, or self-adjoint, or simply symmetric for real matrices). We can then verify that (I_N - A)^T (I_N - A) = I_N - A. Hence

    E{σ̂²} = (1/N) E{(y - Xβ̂)^T (y - Xβ̂)}
          = (1/N) E{(y - Ay)^T (y - Ay)}
          = (1/N) E{y^T (I - A)^T (I - A) y}
          = (1/N) E{y^T (I - A) y}
          = (1/N) [ E{y^T y} - E{y^T A y} ].

Theorem 1.1. If y ∼ N(0, I_N) and A is an orthogonal projection, then y^T A y ∼ χ²(k) with k = rank(A).

Proof. Since A is idempotent, its eigenvalues satisfy λ² = λ. Since A is also Hermitian, its eigenvalues are real (see http://proofwiki.org/wiki/Hermitian_Matrix_has_Real_Eigenvalues). Hence, in the eigendecomposition A = QΛQ^{-1} = QΛQ^T (eigenvectors of a real symmetric matrix are orthogonal, i.e., Q^{-1} = Q^T), Λ is a diagonal matrix containing only 0's and 1's.

(i) When A is full-rank, Λ = I_N, so A has to be I_N and y^T A y = y^T y. The result then follows immediately from the definition of the chi-square distribution.

(ii) When A is of rank k < N,

    y^T A y = y^T QΛQ^T y = W^T [ I_k  0 ; 0  0 ] W = W_k^T W_k,

where W = Q^T y, and W_k denotes the vector containing the first k elements of W. Since Q is orthogonal, W ∼ N(0, I_N), so W_k ∼ N(0, I_k). The result again follows directly from the definition of the chi-square distribution.
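As a quick numerical check of Theorem 1.1 (a sketch added for illustration; X and the sample sizes are arbitrary), the following code draws y ∼ N(0, I_N), forms the orthogonal projection A = X(X^T X)^{-1} X^T, and compares the first two moments of the simulated quadratic form y^T A y with those of a χ²(k) distribution (mean k, variance 2k).

```python
import numpy as np

rng = np.random.default_rng(1)

N, k = 50, 4
X = rng.standard_normal((N, k))                       # arbitrary full-column-rank design
A = X @ np.linalg.solve(X.T @ X, X.T)                 # orthogonal projection onto col(X)
assert np.allclose(A @ A, A) and np.allclose(A, A.T)  # idempotent and symmetric

# Monte Carlo: quadratic forms y^T A y for y ~ N(0, I_N).
n_rep = 20000
Y = rng.standard_normal((n_rep, N))
q = np.einsum("ij,jk,ik->i", Y, A, Y)                 # one quadratic form per replicate

print("sample mean, variance: ", q.mean(), q.var())
print("chi2(k) mean, variance:", k, 2 * k)
```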

Recall Equation (2). By Theorem 1.1,

    ((y - Xβ)/σ)^T ((y - Xβ)/σ) ∼ χ²(N).

Hence,

    E{y^T y} = Nσ² + (Xβ)^T (Xβ).

Similarly,

    ((y - Xβ)/σ)^T A ((y - Xβ)/σ) ∼ χ²(k),

where k = rank(A). Hence,

    E{y^T A y} = kσ² + (Xβ)^T (Xβ).

Substituting E{y^T y} and E{y^T A y} gives

    E{σ̂²} = ((N - k)/N) σ² ≠ σ²,    (5)

where k = rank(A) is just the number of columns (regressors) in X. Therefore, our estimate of the variance component is biased downwards. This bias is especially severe when we have many regressors (a large k), in which case we need to correct it by simply multiplying by a factor of N/(N - k). Hence, the corrected, unbiased estimator becomes

    σ̂²_unbiased = (1/(N - k)) (y - Xβ̂)^T (y - Xβ̂)
                = (1/(N - k)) (y - X(X^T X)^{-1} X^T y)^T (y - X(X^T X)^{-1} X^T y),    (6)

which is classically used in linear regression [Verbeke and Molenberghs, 2009].

2 Restricted Maximum Likelihood

In simple problems where the solutions for the variance components are closed-form (as in the linear regression above), we can remove the bias post hoc by multiplying by a correction factor. However, for complex problems where closed-form solutions do not exist, we need to resort to a more general method to obtain bias-free estimates of variance components. Restricted maximum likelihood (ReML) [Patterson and Thompson, 1971] [Harville, 1974] is one such method.

2.1 The Theory

Generally, estimation bias in variance components originates from the DoF loss incurred in estimating the mean components. If we estimated the variance components with the true mean component values, the estimates would be unbiased.
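Before moving on, the bias factor in Equation (5) and the correction in Equation (6) can be illustrated with a short Monte Carlo sketch (added for illustration; N, k, and σ² are arbitrary): averaging the ML estimate σ̂² over many simulated data sets should give roughly ((N - k)/N) σ², while the corrected estimator should average to roughly σ².

```python
import numpy as np

rng = np.random.default_rng(2)

N, k, sigma2 = 30, 5, 2.0
X = rng.standard_normal((N, k))
beta = rng.standard_normal(k)

ml_est, unbiased_est = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    ml_est.append(rss / N)              # Equation (4): biased downwards
    unbiased_est.append(rss / (N - k))  # Equation (6): bias-corrected

print("mean ML estimate:       ", np.mean(ml_est), " vs (N-k)/N * sigma2 =", (N - k) / N * sigma2)
print("mean corrected estimate:", np.mean(unbiased_est), " vs sigma2 =", sigma2)
```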

The intuition behind ReML is to maximize a modified likelihood that is free of the mean components, instead of the original likelihood as in ML.

Consider a general linear regression model

    y = Xβ + ε,

where y is still an N-vector of responses and X is still an N × k design matrix, but the residual ε is no longer assumed to be distributed as N(0, σ² I_N); rather, ε ∼ N(0, H(θ)), where H(θ) is a general covariance matrix parametrized by θ. For simplicity, H(θ) is often written as just H. Previously, we have been referring to θ as the "variance components."

If a vector a is orthogonal to all columns of X, i.e., a^T X = 0, then a^T y is known as an error contrast. We can find at most N - k such vectors that are linearly independent. (Imagine a plane spanned by k = 2 linearly independent vectors in three-dimensional space, N = 3; we can find at most N - k = 1 vector, passing through the origin, that is orthogonal to the plane.) Define A = [a_1  a_2  ...  a_{N-k}]. It follows that A^T X = 0 and E{A^T y} = 0. S = I_N - X(X^T X)^{-1} X^T is a candidate for A, as SX = 0. Furthermore, A can be chosen such that AA^T = S and A^T A = I_{N-k}.

The error contrast vector

    w = A^T y = A^T (Xβ + ε) = A^T ε ∼ N(0, A^T H A)

is free of β. [Patterson and Thompson, 1971] proved that, in the absence of information on β, no information about θ is lost when inference is based on w rather than on y. We can now directly estimate θ by maximizing a "restricted" log-likelihood function L_w(θ | A^T y). This bypasses estimating β first and can therefore produce unbiased estimates of θ.

Once H(θ) is known, the generalized least squares (GLS) solution for β, which minimizes the squared Mahalanobis length of the residual, (y - Xβ)^T H^{-1} (y - Xβ), is just

    β̂ = (X^T H^{-1} X)^{-1} X^T H^{-1} y.    (7)
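The ingredients above are easy to compute numerically. The sketch below (an illustration added here, with an arbitrary design X and an arbitrary AR(1)-style H) builds S = I_N - X(X^T X)^{-1} X^T, extracts N - k orthonormal error contrasts from its eigenvectors, checks that they annihilate X, and evaluates the GLS estimate of Equation (7).

```python
import numpy as np

rng = np.random.default_rng(3)

N, k = 12, 3
X = rng.standard_normal((N, k))

# S = I - X (X^T X)^{-1} X^T projects onto the orthogonal complement of col(X), so S X = 0.
S = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(S @ X, 0)

# Eigenvectors of S with eigenvalue 1 give N - k orthonormal vectors a with a^T X = 0,
# i.e., the columns of an error-contrast matrix A satisfying A^T A = I_{N-k}.
eigval, eigvec = np.linalg.eigh(S)
A = eigvec[:, eigval > 0.5]
assert A.shape == (N, N - k) and np.allclose(A.T @ X, 0)

# GLS estimate of Equation (7) for a given covariance H (here an AR(1)-like example).
H = 0.6 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
y = rng.standard_normal(N)               # placeholder response, for illustration only
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(H, X), X.T @ np.linalg.solve(H, y))
print("beta_gls:", beta_gls)
```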

We now derive a convenient expression for L_w(θ | A^T y) [Harville, 1974]. Let G = H^{-1} X (X^T H^{-1} X)^{-1}, so that G^T y = β̂ by Equation (7). Then

    L_w(θ | A^T y) = log f_w(A^T y | θ)
                   = log f_w(A^T y | θ) ∫ f_β̂(β̂ | β, θ) dβ̂          (the integral of a density is 1)
                   = log f_w(A^T y | θ) ∫ f_β̂(G^T y | β, θ) dβ        (β̂ and β exchangeable here)
                   = log ∫ f_w(A^T y | θ) f_β̂(G^T y | β, θ) dβ
                   = log ∫ f_{w,β̂}(A^T y, G^T y | β, θ) dβ            (w and β̂ are jointly Gaussian and uncorrelated, hence independent)
                   = log ∫ f_y([A  G]^T y | β, θ) dβ
                   = log (1 / |det [A  G]|) ∫ f_y(y | β, θ) dβ.

Interlude 2.1. We express det [A  G] in terms of X:

    |det [A  G]| = ( det([A  G]^T [A  G]) )^{1/2}
                 = ( det [ A^T A   A^T G ; G^T A   G^T G ] )^{1/2}
                 = ( det(A^T A) )^{1/2} ( det(G^T G - G^T A (A^T A)^{-1} A^T G) )^{1/2}
                 = ( det I )^{1/2} ( det(G^T G - G^T A I^{-1} A^T G) )^{1/2}
                 = ( det(G^T G - G^T S G) )^{1/2}
                 = ( det(X^T X) )^{-1/2},

where the last equality holds because G^T G - G^T S G = G^T (I_N - S) G = G^T X (X^T X)^{-1} X^T G = (X^T X)^{-1}, using X^T G = I_k.

We continue deriving:

    L_w(θ | A^T y) = - log |det [A  G]| + log ∫ f_y(y | β, θ) dβ
                   = (1/2) log det(X^T X) + log ∫ f_y(y | β, θ) dβ
                   = (1/2) log det(X^T X) + log ∫ (1 / sqrt((2π)^N det H)) exp( -(1/2) (y - Xβ)^T H^{-1} (y - Xβ) ) dβ
                   = (1/2) log det(X^T X) + log [ (2π)^{-N/2} (det H)^{-1/2} ∫ exp( -(1/2) (y - Xβ)^T H^{-1} (y - Xβ) ) dβ ].

Interlude 2.2. Using Equation (7), we can decompose (y - Xβ)^T H^{-1} (y - Xβ) into

    (y - Xβ̂)^T H^{-1} (y - Xβ̂) + (β - β̂)^T (X^T H^{-1} X) (β - β̂).

We resume:

    L_w(θ | A^T y) = (1/2) log det(X^T X)
                     + log [ (2π)^{-N/2} (det H)^{-1/2} exp( -(1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂) ) ∫ exp( -(1/2) (β - β̂)^T (X^T H^{-1} X) (β - β̂) ) dβ ]
                   = (1/2) log det(X^T X)
                     + log [ (2π)^{-N/2} (det H)^{-1/2} exp( -(1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂) ) (2π)^{k/2} (det X^T H^{-1} X)^{-1/2} ]        (a Gaussian integral)
                   = (1/2) log det(X^T X)
                     + log [ (2π)^{-(N-k)/2} (det H)^{-1/2} (det X^T H^{-1} X)^{-1/2} exp( -(1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂) ) ]
                   = -(1/2)(N - k) log(2π) + (1/2) log det(X^T X) - (1/2) log det H - (1/2) log det(X^T H^{-1} X)
                     - (1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂),    (8)

where

    β̂ = (X^T H^{-1} X)^{-1} X^T H^{-1} y.    (9)

With this convenient expression, we can maximize the restricted log-likelihood L_w(θ | A^T y) w.r.t. the variance components θ to obtain an unbiased estimate of the covariance matrix, H(θ̂), and the corresponding regression coefficient estimates β̂. The Newton-Raphson method is usually employed. For more computational details, see [Lindstrom and Bates, 1988].

2.2 Applied to the Simplest Linear Regression

We have seen the estimation bias in θ under ML in Equation (5). In the simplest form of linear regression, where we assume H = σ² I_N, the estimate σ̂² is closed-form (Equation (4)), allowing us to correct the bias simply with a multiplicative factor. In this section, we verify that, in this simplest form of linear regression, the ReML method produces exactly the same solution as the ML method followed by the post hoc correction.

Dropping the terms of Equation (8) that do not depend on σ², we set

    (d/dσ²) L_w(σ² | A^T y) = (d/dσ²) [ -(1/2) log det H - (1/2) log det(X^T H^{-1} X) - (1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂) ]
                            = (d/dσ²) [ -(1/2)(N - k) log σ² - (1/(2σ²)) (y - Xβ̂)^T (y - Xβ̂) ]
                            = 0,

which yields σ̂² = (1/(N - k)) (y - Xβ̂)^T (y - Xβ̂). We obtain exactly the same result as Equation (6), produced by the post hoc correction.

It is worth noticing that, in this simplest linear regression case, the mean estimate β̂ is independent of the variance component θ (Equation (3)). This implies that although the ML and ReML estimates of σ² are different, the estimates of β are the same. This is no longer true for more complex regression models, such as the linear mixed-effects model, as will be seen in the next section. Thus, for those complex models, we have a ReML estimate of θ and also a "ReML" estimate of β, both being different from their ML estimates.
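This equivalence is easy to check numerically. The sketch below (an added illustration with arbitrary simulated data) evaluates the restricted log-likelihood of Equation (8) with H = σ² I_N, maximizes it over σ² with a bounded scalar optimizer, and compares the result with the corrected estimator of Equation (6).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)

# Arbitrary simulated data with H = sigma^2 I_N (illustrative values).
N, k, sigma2_true = 40, 4, 3.0
X = rng.standard_normal((N, k))
beta = rng.standard_normal(k)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2_true), size=N)

def restricted_loglik(sigma2):
    """Equation (8) with H = sigma2 * I_N."""
    H = sigma2 * np.eye(N)
    Hinv = np.linalg.inv(H)
    XtHinvX = X.T @ Hinv @ X
    beta_hat = np.linalg.solve(XtHinvX, X.T @ Hinv @ y)   # Equation (9)
    r = y - X @ beta_hat
    return (-0.5 * (N - k) * np.log(2 * np.pi)
            + 0.5 * np.linalg.slogdet(X.T @ X)[1]
            - 0.5 * np.linalg.slogdet(H)[1]
            - 0.5 * np.linalg.slogdet(XtHinvX)[1]
            - 0.5 * r @ Hinv @ r)

# Maximize the restricted log-likelihood over sigma^2.
opt = minimize_scalar(lambda s2: -restricted_loglik(s2), bounds=(1e-6, 50.0), method="bounded")

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_ols) ** 2)
print("ReML sigma^2: ", opt.x)
print("RSS / (N - k):", rss / (N - k))   # Equation (6); the two values should agree
```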

3 Linear Mixed-Effects Model

Longitudinal data are (usually non-uniformly) ordered in time, and missing data are very common. Furthermore, serial measurements of one subject are positively correlated, and between-subject variance is not constant over time due to possibly diverging trajectories [Bernal-Rusiel et al., 2013]. The linear mixed-effects (LME) model [Laird and Ware, 1982] is a suitable model for handling such data.

3.1 The Model

The N_i serial measurements y_i of subject i are modeled as

    y_i = X_i β + Z_i b_i + ε_i,

where X_i is the N_i × p subject design matrix for the fixed effects (e.g., gender, education, clinical group), β is a p-vector of fixed-effects regression coefficients to be estimated, Z_i is an N_i × q design matrix for the random effects, b_i is a q-vector of random effects, and ε_i is an N_i-vector of residuals. The columns of Z_i are a subset of the columns of X_i, linking the random effects b_i to y_i. That is, any component of β can be allowed to vary randomly by simply including the corresponding columns of X_i in Z_i [Bernal-Rusiel et al., 2013]. For example, to allow each subject to have its own trajectory intercept, we set Z_i to be an N_i-vector of 1's. The following distributional assumptions are made:

    b_i ∼ N(0, D),
    ε_i ∼ N(0, σ² I_{N_i}),
    ε_1, ..., ε_M, b_1, ..., b_M independent,

where D and σ² I_{N_i} are covariance matrices of multivariate Gaussian distributions, M is the total number of subjects, b_i reflects how the subset of regression coefficients for subject i deviates from those of the population, and ε_i represents residuals not explained by the fixed or random effects. This allows subjects to deviate from the population, accounting for inter-subject variability.

Intuitively, the LME model extends simple multiple regression y = Xβ + ε by allowing the addition of zero-mean Gaussian noise b_i to a subset of the regression coefficients.
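As a concrete (and purely illustrative) picture of these assumptions, the sketch below simulates a random-intercept LME data set: each subject has its own number of non-uniformly spaced measurements, a random intercept b_i ∼ N(0, D) with q = 1, and i.i.d. residual noise; all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

M = 8                                  # number of subjects
beta = np.array([2.0, 0.5])            # fixed effects: population intercept and slope over time
D = np.array([[1.5]])                  # covariance of the random effect (random intercept, q = 1)
sigma2 = 0.4                           # residual variance

subjects = []
for i in range(M):
    n_i = int(rng.integers(4, 8))              # unbalanced design: each subject has its own N_i
    t = np.sort(rng.uniform(0, 5, size=n_i))   # non-uniform measurement times
    X_i = np.column_stack([np.ones(n_i), t])   # fixed-effects design: intercept + time
    Z_i = X_i[:, :1]                           # random effect on the intercept only
    b_i = rng.multivariate_normal(np.zeros(1), D)
    eps_i = rng.normal(scale=np.sqrt(sigma2), size=n_i)
    y_i = X_i @ beta + Z_i @ b_i + eps_i       # y_i = X_i beta + Z_i b_i + eps_i
    subjects.append((X_i, Z_i, y_i))

print("subject 0 responses:", np.round(subjects[0][2], 2))
```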

More formally, the introduction of random effects helps distinguish the conditional (subject-specific) mean E{y_i | b_i} from the marginal (population-average) mean E{y_i}:

    E{y_i | b_i} = X_i β + Z_i b_i,
    E{y_i} = X_i β,

as well as the subject-specific covariance Cov{y_i | b_i} from the population-average covariance Cov{y_i}:

    Cov{y_i | b_i} = Cov{ε_i} = σ² I_{N_i},
    Cov{y_i} = Cov{Z_i b_i} + Cov{ε_i} = Z_i D Z_i^T + σ² I_{N_i},

which is not a diagonal matrix (cf. the diagonal covariance matrix σ² I_N in linear regression). Therefore, Z_i and D give a "structure" to the originally diagonal covariance matrix, in turn allowing us to model intra-subject measurement correlations.

Finally, for each subject we have

    y_i ∼ N(X_i β, H_i(θ)),

where H_i(θ) = Z_i D Z_i^T + σ² I_{N_i}. See [Bernal-Rusiel et al., 2013] and [Verbeke and Molenberghs, 2009] for real-world applications of the LME model.

3.2 Estimation by Restricted Maximum Likelihood

We aim to estimate the fixed-effects regression coefficients β as well as the model parameters σ² and D from the data. The solutions for the variance components θ (i.e., σ² and D) are not closed-form. Therefore, we need to perform ReML, rather than ML followed by a post hoc correction, to obtain unbiased estimates of θ.

To estimate these quantities, which are shared across all subjects, we stack up all subjects' data:

    y = Xβ + Zb + ε,

where the stacked quantities are

    y = [y_1; y_2; ...; y_M],  X = [X_1; X_2; ...; X_M],  Z = blkdiag(Z_1, Z_2, ..., Z_M),
    b = [b_1; b_2; ...; b_M],  ε = [ε_1; ε_2; ...; ε_M],

with [·; ·] denoting vertical stacking and blkdiag(·) a block-diagonal matrix. That is,

    y ∼ N(Xβ, H(θ)),  with  H(θ) = blkdiag(H_1(θ), H_2(θ), ..., H_M(θ)),

where the stacked H(θ) (or H for short) is a block-diagonal matrix.
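Continuing in the same illustrative vein (a standalone toy sketch, not from the original), the per-subject covariance H_i(θ) = Z_i D Z_i^T + σ² I_{N_i} and the stacked block-diagonal H can be assembled directly; here scipy.linalg.block_diag does the stacking.

```python
import numpy as np
from scipy.linalg import block_diag

def marginal_cov(Z_i, D, sigma2):
    """H_i(theta) = Z_i D Z_i^T + sigma^2 I_{N_i}."""
    return Z_i @ D @ Z_i.T + sigma2 * np.eye(Z_i.shape[0])

# Two toy subjects with a random intercept (q = 1); all values are arbitrary.
D = np.array([[1.5]])
sigma2 = 0.4
Z_list = [np.ones((3, 1)), np.ones((5, 1))]   # Z_i is a column of ones; N_1 = 3, N_2 = 5

H_blocks = [marginal_cov(Z_i, D, sigma2) for Z_i in Z_list]
H = block_diag(*H_blocks)                     # covariance of the stacked response vector y

print(np.round(H_blocks[0], 2))               # within a subject, off-diagonal entries equal D[0, 0]
print("stacked H shape:", H.shape)            # (8, 8), block diagonal across subjects
```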

Substituting the stacked y, X, and H into Equation (8) and dropping constant terms gives

    L_w(θ | A^T y) = -(1/2) log det H - (1/2) log det(X^T H^{-1} X) - (1/2) (y - Xβ̂)^T H^{-1} (y - Xβ̂)
                   = -(1/2) Σ_{i=1}^{M} log det H_i - (1/2) log det( Σ_{i=1}^{M} X_i^T H_i^{-1} X_i ) - (1/2) Σ_{i=1}^{M} (y_i - X_i β̂)^T H_i^{-1} (y_i - X_i β̂),

since H is block diagonal (so its determinant is the product of the determinants of its blocks) and X^T H^{-1} X = Σ_{i=1}^{M} X_i^T H_i^{-1} X_i, where

    β̂ = (X^T H^{-1} X)^{-1} X^T H^{-1} y = ( Σ_{i=1}^{M} X_i^T H_i^{-1} X_i )^{-1} Σ_{i=1}^{M} X_i^T H_i^{-1} y_i.

We can then maximize L_w(θ | A^T y) w.r.t. β, σ², and D. Computational details can be found in [Lindstrom and Bates, 1988]. Details on hypothesis testing can be found in [Bernal-Rusiel et al., 2013].
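A direct, naive implementation of this per-subject restricted log-likelihood is sketched below for the random-intercept case (θ = (σ², d) with D = [[d]]); it is an added illustration with arbitrary simulated data, and it simply hands the criterion to a generic optimizer rather than using the Newton-Raphson/EM schemes of [Lindstrom and Bates, 1988].

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

# Simulate random-intercept data (arbitrary illustrative values).
M, beta_true, d_true, sigma2_true = 30, np.array([2.0, 0.5]), 1.5, 0.4
data = []
for _ in range(M):
    n_i = int(rng.integers(4, 8))
    t = np.sort(rng.uniform(0, 5, size=n_i))
    X_i = np.column_stack([np.ones(n_i), t])
    Z_i = X_i[:, :1]                                   # random intercept only (q = 1)
    y_i = (X_i @ beta_true + Z_i @ rng.normal(scale=np.sqrt(d_true), size=1)
           + rng.normal(scale=np.sqrt(sigma2_true), size=n_i))
    data.append((X_i, Z_i, y_i))

def gls_beta(theta):
    """beta_hat = (sum_i X_i^T H_i^{-1} X_i)^{-1} sum_i X_i^T H_i^{-1} y_i."""
    sigma2, d = theta
    XtHX, XtHy = 0.0, 0.0
    for X_i, Z_i, y_i in data:
        H_i = d * (Z_i @ Z_i.T) + sigma2 * np.eye(len(y_i))   # H_i = Z_i D Z_i^T + sigma^2 I
        Hinv_X = np.linalg.solve(H_i, X_i)
        XtHX = XtHX + X_i.T @ Hinv_X
        XtHy = XtHy + Hinv_X.T @ y_i
    return np.linalg.solve(XtHX, XtHy), XtHX

def neg_restricted_loglik(theta):
    """Per-subject restricted log-likelihood (up to an additive constant), negated."""
    sigma2, d = theta
    beta_hat, XtHX = gls_beta(theta)
    val = -0.5 * np.linalg.slogdet(XtHX)[1]
    for X_i, Z_i, y_i in data:
        H_i = d * (Z_i @ Z_i.T) + sigma2 * np.eye(len(y_i))
        r = y_i - X_i @ beta_hat
        val += -0.5 * np.linalg.slogdet(H_i)[1] - 0.5 * r @ np.linalg.solve(H_i, r)
    return -val

# Maximize over theta = (sigma^2, d) with a generic bounded quasi-Newton optimizer.
res = minimize(neg_restricted_loglik, x0=[1.0, 1.0], method="L-BFGS-B",
               bounds=[(1e-6, None), (1e-6, None)])
sigma2_hat, d_hat = res.x
beta_hat, _ = gls_beta(res.x)
print("ReML sigma^2, d:", sigma2_hat, d_hat)   # should be near 0.4 and 1.5
print("ReML beta:", beta_hat)                  # should be near [2.0, 0.5]
```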

References

[Bernal-Rusiel et al., 2013] Bernal-Rusiel, J. L., Greve, D. N., Reuter, M., Fischl, B., and Sabuncu, M. R. (2013). Statistical analysis of longitudinal neuroimage data with linear mixed effects models. NeuroImage, 66:249-260.

[Harville, 1974] Harville, D. A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61(2):383-385.

[Laird and Ware, 1982] Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, pages 963-974.

[Lindstrom and Bates, 1988] Lindstrom, M. J. and Bates, D. M. (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014-1022.

[Patterson and Thompson, 1971] Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545-554.

[Verbeke and Molenberghs, 2009] Verbeke, G. and Molenberghs, G. (2009). Linear Mixed Models for Longitudinal Data. Springer Science & Business Media.

