A PROPOSAL FOR HANDLING MISSING DATA*


PSYCHOMETRIKA--VOL. 40, NO. 2, JUNE 1975

TERRY C. GLEASON AND RICHARD STAELIN
CARNEGIE-MELLON UNIVERSITY

A method for dealing with the problem of missing observations in multivariate data is developed and evaluated. The method uses a transformation of the principal components of the data to estimate missing entries. The properties of this method and four alternative methods are investigated by means of a Monte Carlo study of 42 computer-generated data matrices. The methods are compared with respect to their ability to predict correlation matrices as well as missing entries. The results indicate that whenever there exist modest intercorrelations among the variables (i.e., average off-diagonal correlation above .2), the proposed method is at least as good as the best alternative (a regression method) while being considerably faster and simpler computationally. Models for determining the best alternative based upon easily calculated characteristics of the matrix are given. The generality of these models is demonstrated using the previously published results of Timm.

Investigations that seek to employ multivariate data analysis techniques--multiple regression, discriminant and canonical correlation analysis, or any of the various clustering techniques--commonly encounter the problem of dealing with missing measurements on one or more variables for one or more subjects. If the number of missing entries in the data matrix is very small, then it is reasonable to discard all of the measurements for a given individual and proceed with the analysis as if that data had never been collected.
However, even if the percentage of missing observations is as small as 1 or 2 percent of the entire data matrix, the number of individuals excluded by this procedure can become substantial if the number of variables is large. Hence it is common to seek ways of estimating missing observations; indeed, numerous proposals are available in the literature for computing suitable values for missing data. After a brief review of several of these proposals, a new approach will be presented for estimating missing observations, together with the results of a Monte Carlo study of the relative strengths and weaknesses of this new technique and three other available methods. In addition, the new procedure and four other procedures are examined with respect to

* This is an extension and elaboration of a paper read at the Spring 1973 meetings of the Psychometric Society. We wish to express our appreciation to Timothy McGuire for his helpful comments.

their ability to use incomplete data to estimate the correlation matrix obtained using a full complement of observations.

It should be noted at the outset that all of the procedures discussed here are specifically intended for use on data matrices whose entries are missing by virtue of some process that is unrelated to any of the relationships between the variables in the matrix. For example, in filling out questionnaires respondents occasionally turn two pages at the same time, thus missing a set of questions. Or sometimes a subject will be distracted and look away from the questionnaire for a moment, only to return to a point beyond the next scheduled response. Whatever the cause, it is clear that in instances of this type the failure to answer a question is not related to the content of the question; hence the methods discussed in this paper will be appropriate. However, if it is discovered, for example, that high-income people do not answer certain questions, or that questions on delicate topics are consistently not answered by certain subjects, then the procedures considered here may or may not be useful in estimating what those missing responses should be. Their application entails a risk of artificially building into the data relationships which are based only on the observed respondents and thus may not be applicable to the subset of non-respondents.

Previous Work

The available alternatives for estimating missing data can be divided into two approaches. The first is the common statistical procedure, which begins by assuming the observed data to be a sample drawn from a multivariate distribution of known form (usually the multivariate normal) but with unknown parameters. The parameters of the population are estimated using the available data and an estimating procedure such as maximum likelihood.
Once estimated, these parameters may then be used to compute particular missing observations by employing the conditional distribution of the variable whose datum is missing, given the variables whose observations are not missing.

The major advantage of the statistical method is that it yields an unambiguous model of the estimation process whose characteristics can be evaluated analytically. However, this analytical power is bought at the price of rather stringent assumptions about the distributional form of the population. Such assumptions may be inappropriate for large classes of data to which it would be desirable to apply the estimation procedure. For example, psychological investigations often employ questionnaires whose items may have only two or at most a few responses. The data produced by such questions are clearly not normal. Moreover, to approximate a normal distribution with a collection of dichotomous scores is to introduce errors into the analysis whose effects are not described by the analytical model.
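The conditional-distribution step described above can be made concrete. The sketch below is an illustration, not the authors' procedure: the function name is hypothetical, and a bivariate normal with assumed known mean and covariance stands in for parameters that would in practice be estimated from the data.

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma, missing):
    """Fill the missing entries of one observation x with their conditional
    mean under a multivariate normal with mean mu and covariance sigma.
    `missing` is a boolean mask marking the missing coordinates."""
    obs = ~missing
    # Partition the covariance by missing vs. observed coordinates.
    s12 = sigma[np.ix_(missing, obs)]
    s22 = sigma[np.ix_(obs, obs)]
    # E[X_mis | X_obs = x_obs] = mu_mis + S12 S22^{-1} (x_obs - mu_obs)
    filled = x.copy()
    filled[missing] = mu[missing] + s12 @ np.linalg.solve(s22, x[obs] - mu[obs])
    return filled
```

With mu = (0, 0), unit variances, correlation .5, and the first coordinate missing from x = (?, 1), the imputed value is the regression prediction .5.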

It also appears to be the case that, although the logic of the maximum likelihood procedure is straightforward, the resulting analysis is very complicated. Indeed, really successful applications of the procedure have been made only to problems in which the form of the data is well understood, or in which other features of the situation, such as properties of the experimental design, place additional constraints on the model (for example, Edgett [1956], Anderson [1957], Trawinski and Bargmann [1964], Srivastava and McDonald [1973]).

To escape the arbitrary distributional assumptions of the statistical approach, as well as the computational complexities introduced by procedures such as the method of maximum likelihood, the problem of estimating missing data is often approached by a second avenue which does not explicitly model the generation of the data but instead seeks ways of utilizing the information in variables without missing values to construct reasonable estimates for the missing entries. This is a pragmatic approach guided by heuristics rather than assumptions. Likewise, its results are evaluated not by studying the properties of the model, but by measuring how well different methods can reconstruct unknown values from available data using real examples.

Timm [1970] has presented a comparative study of three such techniques, proposed by Buck [1960], Dear [1959], and Wilks [1932]. The method of Buck computes estimates for any missing entry by using a regression equation. The nonmissing entries for the individual with a missing entry are the independent variables, and the regression coefficients are calculated using all individuals in the sub-matrix which has no missing entries. This regression method is extended in the next section.
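Buck's procedure, as just described, can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not Buck's program: the function name is invented, and the covariance of the complete-case rows stands in for his initial estimates.

```python
import numpy as np

def buck_impute(Z):
    """Buck-style regression imputation (a sketch): estimate each missing
    entry from the subject's observed variables, with regression weights
    computed from the rows of Z that have no missing data."""
    Z = np.asarray(Z, dtype=float)
    complete = ~np.isnan(Z).any(axis=1)
    C = np.cov(Z[complete], rowvar=False)   # covariance from complete rows
    means = Z[complete].mean(axis=0)
    out = Z.copy()
    for i in range(Z.shape[0]):
        mis = np.isnan(Z[i])
        if not mis.any() or mis.all():
            continue                        # nothing to do, or nothing to go on
        obs = ~mis
        # Regression weights of the missing variables on the observed ones.
        B = np.linalg.solve(C[np.ix_(obs, obs)], C[np.ix_(obs, mis)])
        out[i, mis] = means[mis] + (Z[i, obs] - means[obs]) @ B
    return out
```

Rows with identical patterns of missing entries would, as noted later in the paper, share one set of weights; the sketch recomputes them per row for clarity.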
It will be shown that the form of the solution is the same as the maximum likelihood solution assuming a multivariate normal distribution, thus establishing a relationship between the statistical and the pragmatic approaches. The Dear method decomposes the matrix into its known and unknown parts and uses the first principal component and its associated loadings, derived from the known data, to estimate the unknown elements. Wilks proposed the handy expedient of using the mean of all nonmissing values for any variable to estimate the missing values. Timm assessed the ability of each method to predict either the correlation matrix or the covariance matrix, and he showed that the Dear and Buck methods were generally superior to the Wilks method for the data matrices he investigated. In addition, the Dear method was computationally faster than the Buck method by a factor of two.

Christoffersson [1965] and Wold [1966] proposed a method for estimating all the principal components from an incomplete matrix using an algorithm called Nonlinear Iterative Least Squares (NILES). They state that a missing entry can be estimated as a linear combination of the principal components,

although they show how to predict the principal components only for a one-component model. For this case the NILES method is equivalent to Dear's procedure.

Walsh [1961] suggested a generalization of the Buck technique that utilizes subjects with incomplete data as well as those with complete data. The method is extremely complex and does not appear to have been studied for its performance characteristics.

There are three general approaches that can be used to estimate the correlation matrix for a set of data: 1) compute the correlations using only those subjects with complete data, 2) estimate the missing entries in the matrix and then compute the correlations, or 3) compute the correlation using all pairs of observations that are available for each pair of variables. This latter approach appears to have been first suggested by Glasser [1964]. He shows that for regression analysis this method yields consistent estimators of the regression coefficients. However, Monte Carlo work by Haitovsky [1968] indicates that even with sample sizes of 1000 the regression coefficients derived from Glasser's matrix may not be as good as least squares estimates obtained using only those subjects with complete data.

Of all the work cited above, the paper by Timm is most useful in permitting the assessment of comparative advantages of different procedures. However, Timm's results are quite complicated as presented, and there appears to be no clear pattern from which general statements could be derived to guide future work. Moreover, Timm's study dealt only with the estimation of correlation and covariance matrices. These results provide no information on the problem of estimating the missing entries themselves. Subsequent sections of this paper will extend, elaborate, and organize the work begun by Timm.
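The third approach above, Glasser's pairwise-available estimate, can be sketched as follows (an illustration with an invented function name, not Glasser's formulation):

```python
import numpy as np

def pairwise_corr(Z):
    """Glasser-style correlation estimate: for each pair of variables,
    correlate using every row where both entries are present."""
    Z = np.asarray(Z, dtype=float)
    p = Z.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            r = np.corrcoef(Z[both, j], Z[both, k])[0, 1]
            R[j, k] = R[k, j] = r
    return R
```

Unlike the complete-case estimate, each entry of R here may be based on a different subset of rows, which is why the resulting matrix need not be positive definite.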
It is useful, however, to first develop some theoretical considerations regarding two new procedures.

Some Theoretical Considerations

Let $Z$ be an $m \times p$ data matrix representing the scores of $m$ individuals on $p$ variables. For convenience and without loss of generality, $Z$ is taken to be standardized by columns, so that its correlation matrix is simply $R = (1/m)Z'Z$. Suppose $Z$ is partitioned by columns into two submatrices

(1)    $Z = [Z_1 \quad Z_2]$

where $Z_1$ is $m \times q$ ($q < p$) and $Z_2$ is $m \times (p - q)$.

In the subsequent discussion $Z_1$ will be regarded as missing data and $Z_2$ as observed data. Such a partition is always possible for any single individual in a real data matrix. Here, however, it is convenient to treat the entire matrix $Z_1$ as missing, so that the problem of handling incomplete data can be formulated in terms of a specification of the conditions under which the entries of $Z_1$ may be reconstructed from $Z_2$ and the intercorrelations between

$Z_1$ and $Z_2$.

One obvious approach to reconstructing $Z_1$ is to use the regressions of $Z_1$ onto $Z_2$. This procedure, which is an extension of the Buck method, may be developed as follows. The partition given in (1) induces a partition of $R$,

(2)    $R = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix}$.

It is well known that the regression weights for estimating $Z_1$ from $Z_2$ are $R_{22}^{-1}R_{21}$. Hence,

(3)    $\hat{Z}_1 = Z_2 R_{22}^{-1} R_{21}$.

Moreover, if $Z_1$ can be written in the form $Z_1 = Z_2 T$, where $T$ is any linear transformation, then (3) will reproduce $Z_1$ exactly. This technique is referred to subsequently as the Regression method.

It is interesting to note that the conditional expectation of $Z_1$ given $Z_2$, assuming that $Z$ has a multivariate normal distribution, also is given by (3). Thus, assuming multivariate normality leads via a maximum likelihood argument to the same conclusion (3) as is here developed by a heuristic argument that avoids distributional assumptions.

Another approach to reconstructing $Z_1$ from $Z_2$ and $R$ is to utilize the singular value decomposition of $Z$ and, by using only the largest principal components of $Z$, estimate $Z$ by means of the least squares properties of the singular value decomposition [Eckart and Young, 1936; Johnson, 1965]. The procedure is as follows. The matrix $Z$ may be written in the form

(4)    $Z = U D^{1/2} V'$

where $V$ is the matrix of eigenvectors of $R$, $D$ is a diagonal matrix of eigenvalues of $R$, and

(5)    $U = Z V D^{-1/2}$.

Now suppose only the largest $n$ eigenvalues of $R$ are retained. Denote the $n \times n$ submatrix of $D$ which contains just these eigenvalues by $\hat{D}$. Likewise retain only those eigenvectors in $V$ that correspond to the eigenvalues in $\hat{D}$, and denote the resulting submatrix by $\hat{V}$. Then by (5), $\hat{U} = Z\hat{V}\hat{D}^{-1/2}$ and

(6)    $\hat{Z} = \hat{U}\hat{D}^{1/2}\hat{V}' = (Z\hat{V}\hat{D}^{-1/2})\hat{D}^{1/2}\hat{V}' = Z\hat{V}\hat{V}'$.

The matrix $\hat{Z}$ is of rank $n$ and a least-squares approximation to $Z$. This fact suggests an interesting possibility. If the variables of $Z$ are redundant in the sense that the intercorrelations among the variables are large, then $\hat{Z}$ will contain most of the information or variance in $Z$. Consequently the

matrix $Z_1$ ought to be reconstructable from the matrix $Z_2$ and the eigenvectors $\hat{V}$ of $R$. To show how this can be done, it is first necessary to point out that the partition of $Z$ induces a partition of $\hat{V}$, i.e.,

$\hat{V} = \begin{bmatrix} \hat{V}_1 \\ \hat{V}_2 \end{bmatrix}$

where $\hat{V}_1$ is $q \times n$ and $\hat{V}_2$ is $(p - q) \times n$. Let $W = \hat{V}\hat{V}'$; then

(7)    $W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} = \begin{bmatrix} \hat{V}_1\hat{V}_1' & \hat{V}_1\hat{V}_2' \\ \hat{V}_2\hat{V}_1' & \hat{V}_2\hat{V}_2' \end{bmatrix}$.

Equation (6) may now be rewritten

(8)    $[\hat{Z}_1 \quad \hat{Z}_2] = [Z_1 \quad Z_2]\,W$.

Since $Z_1$ is to be computed from $Z_2$, the apparent next step is to replace $Z_1$ with $\hat{Z}_1$ on the right side of (8) and solve for $\hat{Z}_1$:

$\hat{Z}_1 = \hat{Z}_1 W_{11} + Z_2 W_{21}$,

and

(9)    $\hat{Z}_1 = Z_2 W_{21} (I - W_{11})^{-1}$,

provided $(I - W_{11})$ is nonsingular.

Equation (9) represents the Principal Components method for reconstructing $Z_1$ from $Z_2$. To see how well this method works in theory, it is instructive to examine the covariance matrix of $Z_1$ with $\hat{Z}_1$,

(10)    $\frac{1}{m} Z_1'\hat{Z}_1 = \frac{1}{m} Z_1'Z_2 W_{21}(I - W_{11})^{-1} = R_{12}\hat{V}_2\hat{V}_1'(I - W_{11})^{-1}$.

Because $\hat{V}$ is a matrix of eigenvectors of $R$, it follows that

(11)    $R_{11}\hat{V}_1 + R_{12}\hat{V}_2 = \hat{V}_1\hat{D}$.

Multiplying (11) on the right by $\hat{V}_1'$ and using the definition of $W$ given in (7) yields

(12)    $R_{11}W_{11} + R_{12}W_{21} = \hat{V}_1\hat{D}\hat{V}_1'$.

It is useful to introduce the notion of the residual matrix, $R^*$, which is obtained by removing that portion of $R$ which is accounted for by $\hat{V}$ and $\hat{D}$, i.e.,

(13)    $R^* = R - \hat{V}\hat{D}\hat{V}'$.

Notice that this matrix depends on the $p - n$ smallest eigenvalues of $R$ (and their associated eigenvectors). If the rank of $R$ is less than or equal

to $n$, then $R^*$ will be the zero matrix.

Using (13) in (12) yields

$R_{11}W_{11} + R_{12}W_{21} = R_{11} - R_{11}^*$,

or

(14)    $R_{12}W_{21} = R_{11}(I - W_{11}) - R_{11}^*$.

Substituting (14) into (10) produces the final result,

(15)    $\frac{1}{m} Z_1'\hat{Z}_1 = R_{11} - R_{11}^*(I - W_{11})^{-1}$.

Equation (15) and its derivation from (10) suggest that if $Z$ is of rank $n$ or less, then $\hat{Z}_1$ reproduces $Z_1$ exactly, provided a) $n \le p - q$ or $q \le p/2$, and b) $I - W_{11}$ is nonsingular. The first set of constraints prevents the possibility that the rank of $\hat{Z}_1$ would be forced to be necessarily smaller than the rank of $Z_1$. The conditions which lead to violation of constraint b) are much less obvious. If the terms of (14) are rearranged, then

(16)    $R_{11}(I - W_{11}) = R_{12}W_{21} + R_{11}^*$.

From (16) it is apparent that the singularity of $I - W_{11}$ depends primarily upon $R_{12}$, particularly when $R$ has rank $n$ or less (i.e., $R_{11}^* = 0$). For example, if any subset of the variables in $Z_1$ is unrelated to all the variables in $Z_2$, then the rows of $R_{12}$ corresponding to these variables will be zeroes throughout; hence $R_{12}W_{21}$ will be singular. Now if $R_{11}$ is nonsingular, then it follows that $I - W_{11}$ must be singular. But if $R_{11}$ is singular (which may happen if $n < q$), then the situation is more complex. It is easily shown that $W$ is an idempotent matrix; hence it has $n$ eigenvalues equal to 1. The submatrix $W_{11}$ must have eigenvalues less than or equal to those of $W$. The matrix $I - W_{11}$ will be singular only if $W_{11}$ has an eigenvalue equal to one. Such a circumstance occurs whenever some of the variables in $Z_1$ are unrelated to all the variables in $Z_2$. Therefore, to require that $I - W_{11}$ be nonsingular is to require that each of the variables in $Z_1$ be correlated with at least one of the variables of $Z_2$.

The question naturally arises as to the connection between the Regression method (3) and the Principal Components method (9) for reconstructing $Z_1$ from $Z_2$.
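The reconstruction in (9) is short enough to sketch numerically. The following is an illustration, not the authors' program: the function name is invented, and the exact-recovery check uses an assumed rank-one synthetic matrix, the case in which (15) promises exact reproduction.

```python
import numpy as np

def pc_reconstruct(Z2, R, q, n):
    """Principal Components reconstruction, eq. (9): rebuild the first q
    (missing) columns of Z from the observed columns Z2, using the n
    largest principal components of the p x p correlation matrix R."""
    vals, vecs = np.linalg.eigh(R)        # eigenvalues in ascending order
    V = vecs[:, ::-1][:, :n]              # eigenvectors of the n largest
    V1, V2 = V[:q], V[q:]                 # partition rows: missing / observed
    W11 = V1 @ V1.T
    W21 = V2 @ V1.T
    # Z1_hat = Z2 W21 (I - W11)^{-1}
    return Z2 @ W21 @ np.linalg.inv(np.eye(q) - W11)
```

For a standardized matrix whose three columns are all equal (rank one, so n = 1 suffices and R is a matrix of ones), the first column is recovered exactly from the other two.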
The relation is straightforward and may be obtained as follows. Because $\hat{V}$ contains eigenvectors of $R$,

$R_{21}\hat{V}_1 + R_{22}\hat{V}_2 = \hat{V}_2\hat{D}$.

Multiplying on the right by $\hat{V}_1'$ and rearranging terms yields

$R_{22}W_{21} = R_{21} - R_{21}^* - R_{21}W_{11}$.

Multiplying this equation on the left by $R_{22}^{-1}$ and on the right by $(I - W_{11})^{-1}$ gives

(17)    $W_{21}(I - W_{11})^{-1} = R_{22}^{-1}R_{21} - R_{22}^{-1}R_{21}^*(I - W_{11})^{-1}$.

The left-hand side of (17) is the transformation used by the Principal Components method. The right-hand side is the transformation used by the Regression method, but diminished by a term that depends on the smallest $p - n$ eigenvalues of $R$. If the rank of $Z$ exceeds $n$, as is the case with most ordinary data matrices, then the Principal Components method will differ somewhat from a least squares regression approach to reconstructing $Z_1$. However, if the first $n$ eigenvalues of $R$ are large relative to the rest, then $R_{21}^*$ will be quite small and the two methods will yield very nearly the same results. Indeed, if the rank of $Z$ is $n$ or less, then the two procedures will produce identical results, although it will be necessary in the Regression method to replace $R_{22}^{-1}$ with a suitable generalized inverse of $R_{22}$.

Practical Considerations

The preceding discussion gives some theoretical speculations as to how the Regression and Principal Components methods might work. In the analysis of any real matrix, however, the assumptions made in the theoretical discussion will never be met exactly. For example, the correlation matrix is never known but must always be estimated, using techniques such as those of Wilks [1932] or of Glasser [1964]. (Just which of these procedures ought to be used is not clear a priori; in the sequel, evidence is given which suggests the conditions under which each of the above-mentioned methods is to be preferred.) Likewise, in the Principal Components method it is not clear how great an effect $R_{12}^*$ will have when the rank of $Z$ is $p$ but most of the variance is concentrated in the first $n$ principal components of $Z$. Consequently, in this section some practical considerations are given for using these techniques on real data.
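The rank-deficient case of (17) can be checked numerically. The script below is a sketch under assumed synthetic data (seed, dimensions, and variable names are all choices of this illustration): it builds a rank-n matrix, so that the residual term vanishes, and verifies that the Principal Components transform and the Regression transform, with a Moore-Penrose generalized inverse in place of $R_{22}^{-1}$, reconstruct $Z_1$ identically.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n, q = 200, 5, 2, 2
F = rng.standard_normal((m, n))          # n underlying components
A = rng.standard_normal((n, p))
Z = F @ A                                # rank-n data matrix
Z = (Z - Z.mean(0)) / Z.std(0)           # standardize columns (rank preserved)
R = (Z.T @ Z) / m
Z1, Z2 = Z[:, :q], Z[:, q:]
R21, R22 = R[q:, :q], R[q:, q:]

# Principal Components transform: W21 (I - W11)^{-1}
vals, vecs = np.linalg.eigh(R)
V = vecs[:, ::-1][:, :n]                 # eigenvectors of the n largest eigenvalues
V1, V2 = V[:q], V[q:]
T_pc = V2 @ V1.T @ np.linalg.inv(np.eye(q) - V1 @ V1.T)

# Regression transform with a generalized inverse, since R22 is singular here
T_reg = np.linalg.pinv(R22) @ R21
```

Both `Z2 @ T_pc` and `Z2 @ T_reg` recover `Z1` to machine precision, as the equality in (17) with a zero residual implies.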
In the subsequent discussion these procedures are evaluated by means of a Monte Carlo study.

Regression Method. To use the Regression method, one begins by estimating $R$ with either the Wilks or the Glasser method. Then, whenever $q$ elements are missing for any individual (row) of $Z$, they may be estimated by use of (3), with the nonmissing elements of the row becoming $Z_2$ and the matrices $R_{21}$ and $R_{22}$ being drawn from the relevant portions of the estimated correlation matrix. Some efficiencies may be effected by collecting all those rows of $Z$ which have identical patterns of missing entries. For any such collection the regression coefficients need be calculated only once.

It should be noted that this method differs from that proposed by Buck [1960] in the initial correlation matrix employed. Buck used the correlation matrix developed from the complete submatrix of $Z$, i.e., those rows of $Z$ which contained no missing data. Unfortunately, this method of estimating $R$

is very ineffective for matrices with large numbers of variables, even when the proportion of missing data is comparatively small. For example, let $r$ be the proportion of all the entries in $Z$ which are missing. The probability that any row of $Z$ will have no missing entry is approximately $(1 - r)^p$. Thus as $p$ increases, the proportion of rows with complete data decreases rapidly. As can be seen from Table 1, the number of rows available in the complete portion of $Z$ at the level of missing data
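The shrinkage just described is a one-line computation (the function name is an invention of this sketch, and the approximation assumes entries go missing independently):

```python
def complete_row_fraction(r, p):
    """Approximate fraction of rows with no missing entries when each
    entry of an m x p matrix is missing independently with probability r."""
    return (1 - r) ** p
```

Even at a 2 percent missing rate, a 20-variable matrix retains only about two thirds of its rows in the complete-case submatrix, since 0.98 ** 20 is roughly 0.67.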
