Using Multiple Imputation to Simulate Time Series: A Proposal to Solve the Distance Effect


JORDI ANDREU
Universitat Rovira i Virgili
Departament de Gestió d'Empreses
Av. Universitat 1, 43204 Reus
SPAIN
jordi.andreuc@urv.net

SEBASTIAN CANO
Universitat Rovira i Virgili
Departament d'Economia
Av. Universitat 1, 43204 Reus
SPAIN
sebastian.cano@urv.net

Abstract: Multiple Imputation (MI) is a Markov chain Monte Carlo technique developed to work out missing data problems, especially in cross-section approaches. This paper uses Multiple Imputation from a different point of view: it applies the technique to time series, developing further the simpler framework presented in previous papers. Here, the authors' idea consists basically of an endogenous construction of the database (the use of lags as supporting variables is a new approach to deal with the distance effect). This construction strategy avoids noise in the simulations and forces the chain to converge well to its limit distribution. Using this approximation, estimated plausible values are closer to real values, and missing data can be handled with more accuracy. The new proposal solves the main problem detected by the authors in [1] when using MI with time series: the aforementioned distance effect. An endogenous construction when analyzing time series avoids this undesired effect and allows Multiple Imputation to benefit from information from the whole database. Finally, new R computer code was designed to carry out all the simulations and is presented in the Appendix to be analyzed and updated by researchers.

Key-Words: Missing data, Multiple Imputation, Time Series, MCMC, Simulation Algorithms, R Programming.

1 Introduction

Complex probability distributions, where the number of dimensions was a serious issue, were solved by physicists by simulation instead of direct calculation. One of the most important papers in that direction was published in 1953 [13] and was the beginning of a new fertile field. The main output of that article was the presentation of the Metropolis algorithm, later generalized by Hastings [12]. The key innovation of this algorithm was the use of Markov chains to look for probability distributions, making the target distribution the limit distribution of the chain.

In the early years, Markov Chain Monte Carlo (MCMC) was only developed theoretically due to the lack of computational power to run these extraordinarily complex algorithms. But since the late 1980s, the huge development of computers and technology has allowed the empirical use of MCMC in many sciences. Nowadays, the empirical application of Markov Chain Monte Carlo is available in the great majority of technical software (Stata, R, SAS, S-Plus).

As advanced before, MCMC implied a real and important revolution in multiple fields, not only in physics. We may find it in Statistical Mechanics, Bayesian Statistics or Image Reconstruction Theory, for example. Within Bayesian Statistics, MCMC has been used to solve missing data problems (see [14]), an application that is the main objective of this paper.

Missing values in a database represent a huge issue, because the data cannot then be analyzed directly. The absence of some values obliges researchers to decide how to deal with that situation: missings can be deleted or artificially substituted by a chosen value. The issue has been and still is a hot topic of investigation (see for example [5], [8], [7]). As can be seen in the specialized literature, the decision taken by the researcher when facing missings is not innocuous, and introduces biases in calculations. To overcome this problem, Rubin proposed a new perspective. The author's idea was to combine MCMC algorithms such as EM, Data Augmentation or Gibbs Sampling to approximate the joint probability distribution of missing data and observed data. Applying Rubin's strategy it is possible to analyze the database and the underlying missing structure to provide not one alternative value to replace each missing value but m values. This simulation technique, known as Multiple Imputation (MI), offers m plausible values to fill in every empty cell. Once this variety of simulated results is available, MI faces another problem: this multiplicity of values must be managed and pooled. Therefore, special inference rules are designed to combine the simulated values and take the uncertainty into account.¹

¹ The actual value is impossible to calculate; that is the reason why inference is necessary.

Literature regarding Multiple Imputation has focused on cross-section studies, where missing data are more likely to appear. In that direction, most applications of MI deal with surveys or incomplete cross-section databases (see for example [4], [18]). In [1] the authors tried to apply MI in a different scenario. They tested MI with financial time series, paying attention to how the simulations change when one varies the main parameters of the technique. The authors wanted to see if the simulated values really fit the financial time series.² If the values do not match the actual time series, inference will lead to wrong results and will be inaccurate. In the cited paper, after almost 200 simulations,³ it was possible to draw some conclusions regarding MI sensitivity:

1. As expected from the theoretical framework, the difference between simulated values and real ones rises when the database suffers from a higher percentage of missing data. It is obvious to conclude that the lower the available data (higher number of missings), the worse the estimation of plausible values (higher estimation error).

2. The estimation of plausible values becomes better when increasing the number of imputations, but not in a significant way. In our empirical tests, the increase in estimation accuracy is not worth the increase in computation time. Although from the theoretical framework the number of imputations seems a key parameter, empirical results do not seem to support this importance. The simulation improvement using more than 40 imputations is negligible, because 80-90% of the error reduction is obtained using between 20 and 40 imputations.

3. Finally, estimated plausible values become worse when missings are distant values in time. This idea was called the distance effect in our original paper. Errors, when estimating distant values, rise exponentially due to the use of long time series to apply MI algorithms. Indeed, after a deep analysis, we can now conclude that the distance effect might be generated by an inappropriate design of the database when using MI with time series. A wrong database structure leads to a faulty convergence of the algorithm, and therefore to non-plausible simulations. So, a new perspective must be considered to improve MI performance when using the technique with time series.

In this paper, results from [1] are summarized and extended. Some graphs and tables are presented for a better understanding of the application of MI to time series. After some analysis of sensitivity to different parameters, this article proposes a solution to the distance effect, based on time series lags as supporting variables of the main series. Doing the simulations in this fashion leads to better results, and the distance effect not only decreases but almost disappears. As in many situations, the 'lag solution' brings a trade-off: although the estimation of plausible values becomes better, higher multiplicity (more plausible values) is generated with this solution, and the necessity to pool results becomes overwhelming.

The paper is structured as follows. In Section 2 we summarize methodology, paying attention to Markov chains, Markov Chain Monte Carlo, Gibbs Sampling and especially Multiple Imputation. In Section 3, conclusions of previous papers are presented, and the distance effect is analyzed in depth; some graphs from simulations support the explanation. In Section 4 a new point of view to deal with the distance effect is proposed. Section 5 develops empirical tests for this new approach, using different time series (economic and physical time series). Finally, Section 6 draws conclusions and future research is proposed in Section 7. An Appendix with the R code is presented.

² The authors' original idea was to use new mathematical tools to estimate and predict future prices or financial values to improve the Minimum Risk Index calculations developed in [3] and [2].
³ 10 historical components of the Dow Jones Industrial Average were used in the simulation. Data for the period January 1962-December 2006 was downloaded from the Yahoo Finance Database. The authors used 541 monthly, 2,347 weekly, and 11,328 daily observations to perform almost 200 simulations.

2 Methodology Review

2.1 Markov Chains

A Markov chain, named after Andrey Markov, is a discrete-time stochastic process which follows the Markov property: conditional on the current state, past and future states are independent. Formally, this definition is written as

$$\Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

The process followed by a Markov chain starts with a state vector called u that includes the probability values of the different states. To go one step further, u must be multiplied by the transition matrix P, which includes every relation between all the possible states of the chain. So,

$$u^{(n)} = u^{(0)} \cdot P^n$$

One of the most important properties of Markov chains consists in the calculation of a time-independent transition matrix. Due to the nature of P it is possible to look for a limit of the transition matrix, following the expression below:

$$W = \lim_{n \to \infty} P^n \quad (1)$$

To carry out Multiple Imputation the calculation of W is a must, because the stationary matrix of the chain is the target distribution we are looking for. Also, the chain cannot be absorbing; in that case W only gives information about the absorbing states. The limit is then as follows:

$$\lim_{n \to \infty} P^n = \begin{pmatrix} 0 & 0 \\ B & I \end{pmatrix} \quad (2)$$

where the identity block I corresponds to the absorbing states and B collects the probabilities of ending in each absorbing state.
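A minimal numeric sketch of this limit (with illustrative values, not taken from the paper): iterating a two-state transition matrix quickly approaches W, and the state vector u converges to the stationary distribution of the chain.

# Two-state chain: rows of P sum to 1; u0 starts the chain in state 1.
P  <- matrix(c(0.9, 0.1,
               0.2, 0.8), nrow = 2, byrow = TRUE)
u0 <- c(1, 0)

Pn <- diag(2)
for (i in 1:1000) Pn <- Pn %*% P   # approximates W = lim P^n
u0 %*% Pn                          # converges to the stationary distribution (2/3, 1/3)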

2.2 Markov chain Monte Carlo

Markov Chain Monte Carlo (MCMC) is a collection of methods to generate pseudorandom numbers via Markov chains. MCMC works by constructing a Markov chain whose steady state is the distribution of interest. Random-walk Markov chains are closely attached to MCMC; indeed, they mark a division within the classification of MCMC algorithms. The well-known Metropolis-Hastings and Gibbs Sampling are part of the random-walk algorithms, and their success depends on the number of iterations needed to explore the space, while Hybrid Monte Carlo tries to avoid the random walk using Hamiltonian dynamics.

The literature related to MCMC has grown in recent decades due to the improvement of computational tools.⁴ Following these improvements and developments, new fields for these methods have been discovered. For example, one can find MCMC applications in Statistical Mechanics, Image Reconstruction and Bayesian Statistics.

⁴ See [9] and [10].

2.3 Gibbs Sampling

Gibbs Sampling, named after Josiah Willard Gibbs, is an MCMC algorithm created by Geman and Geman in 1984 [11]. Due to its simplicity it is a common option for those who implement Multiple Imputation in a software package. Furthermore, Gibbs Sampling has had a vital importance in the later development of Bayesian inference thanks to the BUGS software (Bayesian Inference Using Gibbs Sampling). Owing to this fact, some authors have suggested renaming the algorithm Bayesian Sampling.

The process of the algorithm is as follows: let π(θ) be the target distribution, where θ = (θ₁, θ₂, ..., θ_d). Also let πᵢ(θᵢ) = π(θᵢ | θ₋ᵢ) be the conditional distributions for i = 1, 2, ..., d. Then, if the conditional distributions are available, we may approximate π(θ) through an iterative process. Gibbs Sampling is performed in 3 steps:

1. Choose the initial values at moment j = 0:

$$\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_d^{(0)})$$

2. Calculate a new value of θ^(j) from θ^(j-1) by the following process:

$$\theta_1^{(j)} \sim \pi(\theta_1 \mid \theta_2^{(j-1)}, \ldots, \theta_d^{(j-1)})$$
$$\theta_2^{(j)} \sim \pi(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_d^{(j-1)})$$
$$\vdots$$
$$\theta_d^{(j)} \sim \pi(\theta_d \mid \theta_1^{(j)}, \ldots, \theta_{d-1}^{(j)})$$

3. Change the counter from j to j + 1 and go back to the second step until convergence is reached.
2.4 Multiple Imputation

The presence of missings is a big issue when processing data. Every empty cell in a database is represented by software as "na", which cannot be treated until it is replaced by a number or eliminated. In such a scenario, the literature has developed many approaches to deal with this problem. One traditional approach is case deletion, meaning the na is directly erased. Another solution is single imputation, meaning the missing is substituted by a value selected by the researcher; this value can be, for example, the mean, or the next or previous value. Finally, a more complex solution to missing data is Multiple Imputation (MI). For a brief introduction to this technique see [15] and [17], and for a complete and detailed description see [14] and [16]. Multiple Imputation is an MCMC technique which tries to solve missing data problems in a different fashion. Instead of calculating missing values directly (as in single imputation), it carries out many simulations to achieve plausible values. After this simulation, the researcher has many plausible values for every missing datum. This multiplicity of information needs to be summarized somehow, and special rules of inference are defined to pool the results.⁵

MI is a 3-stage process:⁶

imputation: The number m of imputations is set. The probability distribution Pr(X_mis | X_obs) is approximated through MCMC algorithms, where X_mis means the missing data and X_obs means the observed data. Later on it is used in the Monte Carlo simulations.

analysis: Every simulated data set is analyzed using standard methods.

pool: At this point m results are available. They are combined with special inference rules into one result, but including the uncertainty.

Figure 1: Three Multiple Imputation stages.

⁵ Inference rules calculate missing data uncertainty using degrees of freedom.
⁶ The MI stages can be seen in Figure 1.
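A compact sketch of the three stages with the mice package (the same library used in the Appendix code); nhanes is a small example dataset shipped with mice, and the linear model is only a stand-in analysis. Following the paper's findings, few iterations and 20 imputations are enough.

library(mice)
imp <- mice(nhanes, m = 20, maxit = 10, printFlag = FALSE)  # imputation: m plausible values per cell
fit <- with(imp, lm(bmi ~ age))                             # analysis: each completed data set
pool(fit)                                                   # pool: combine with Rubin's rules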

Multiple Imputation performs fine when the data missing mechanism is random. To see that, the probability distribution of the dummy R (it represents the missing data pattern) has to be analyzed. To do so, the connection between R (known information), the missing information of the sample and a nuisance parameter ξ is studied through conditional probabilities:

$$\Pr(R \mid X_{obs}, X_{mis}, \xi) \quad (3)$$

If R turns out to be independent of X_mis, the missing data process is considered to be random. Nowadays, this analysis lacks a formal test to be sure about the missing data process.

3 Definition of the problem

Andreu and Cano (2008) [1] performed some tests with Multiple Imputation and time series. The authors designed a database with prices of 10 stocks of the DJIA and performed several MI tests.⁷ The main objective of the paper was to show MI accuracy when used with time series. To perform this sensitivity analysis, the authors performed 200 simulations changing important parameters such as the number of imputations, the length of the time series and the number of iterations. After this deep empirical study, convergence is shown to be reached with only a small number of iterations.⁸ Secondly, MI accuracy is directly related to the number of missings the researcher is facing in the database. Thirdly, results become better when forcing MI to perform a higher number of imputations. Finally, one important issue was defined from the analysis: the distance effect, a problem that appears when simulating distant missing values. These main conclusions are presented in the next paragraphs.

⁷ Alcoa Inc (AA), Boeing Co (BA), Caterpillar Inc (CAT), Dupont (DD), Walt Disney (DIS), General Electric (GE), General Motors (GM), Hewlett Packard (HPQ), IBM and CocaCola (KO). Available data for the 200 simulations are close prices for the period January 1962-December 2006. 541 monthly, 2,347 weekly and 11,328 daily observations are used in the calculations.
⁸ Fast convergence is known to be one of the main properties of Markov chains.

3.1 MI estimations modifying imputation percentage

Multiple Imputation accuracy depends on the percentage of missing data in the database. If missing data represent 10% of the available dataset, the plausible values provided by Multiple Imputation will be closer to real values than if the missing ratio is, for example, 50%. Figure 2 shows Absolute Average Errors (AAEs) of weekly simulations when increasing the percentage to simulate from 5 to 50%. It can be seen in the graph, as expected, that AAEs grow when the database suffers from a higher missing ratio, although the increase is not linear.

Figure 2: Errors when increasing the proportion of missing data. Weekly data.
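The paper does not print a formula for the Absolute Average Error used in Figures 2-7; a minimal sketch, under our assumption that it is the mean absolute deviation of simulated values from actual ones, expressed relative to the actual values:

# Assumed definition of the AAE; the paper reports it in percentage terms,
# so the ratio is multiplied by 100 here.
aae <- function(actual, simulated) {
  100 * mean(abs(simulated - actual) / abs(actual))
}
aae(c(100, 102, 105), c(101, 100, 104))   # about 1.3 (%)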

WSEAS TRANSACTIONS on COMPUTERSSebastian Cano, Jordi Andreutions. The difference between a simulation using 40 and1.000 imputations is tiny in AAEs, but huge in computational time (computational time increases 20 times whenusing 1.000 HPQ0.656 0.20.5080.6540.502IBM 0.40.4960.53005000.520KO0.2GE0.428 0.422CAT0.4340.710BA0.7400.507AA0.5110.565ERRORS FROM 1962 TO 14TimeFigure 5: The Distance effect with Disney share’s prices.Weekly data.Figure 3: Errors when increasing the number of imputations. Weekly data.0.81.0ERRORS FROM 1962 TO O0.6ERRORS FROM 1962 TO 200660000.0OBSERVATIONSFigure 4: The Distance effect with CocaCola share’s prices.Daily data.0100200300400OBSERVATIONS3.3Figure 6: The Distance effect with Hewlett Packard share’sprices. Monthly data.MI estimations modifying time serieslengthContrary to one of the most known principles of statistics, more data in our case might be negative. Using MIto estimate very distant missings (and using that way longISSN: 1109-2750772Issue 7, Volume 9, July 2010

3.3 MI estimations modifying time series length

Contrary to one of the best-known principles of statistics, more data might in our case be negative. Using MI to estimate very distant missings (and thereby using long time series) is dangerous for our purposes. Estimation errors grow when the time series' length increases, as can be observed in Figure 7. It is easy to see that this increase is exponential and disturbs MI estimations. The selected figure shows errors of weekly data simulations for the entire period and for each analyzed time series as we increase the time series' length. Paying attention to the details, AAEs grow in all stocks between 5 and 90%, showing that results are more sensitive to this parameter than to the percentage to impute. In Andreu and Cano (2008) [1] we called this problem the distance effect. From a theoretical point of view, these results can be explained as follows. Multiple Imputation accuracy depends on the quality of the available data. The MCMC algorithm approximates the probability distribution function, generating values of variables taking into account all the available information and the correlations among variables. Augmenting the length of the time series provides MI with more data to simulate missing values. The simulated value is worse if we accept that structural breaks and changes in the probability distribution functions generating the time series are feasible: giving further data as an input obliges the probability distribution to be the same during the period of analysis, and this is not necessarily true. Putting emphasis on the distance effect, the error analysis shows that after the 5th or 6th missing the simulation is not a plausible value, because the Markov chain does not converge to the right limit distribution, and the error rises. In Figures 4, 5 and 6 the distance effect can be seen in detail. The figures show that the initial missing estimations are good, so the error is close to 0%. The error increases when MI tries to simulate more distant values, and this effect is similar using daily, weekly or monthly data. Usually, the distance effect becomes worse after the 6th missing, and errors can increase from 0% to 100%.

Figure 7: Errors when increasing time series length. Weekly data.

4 A new point of view

The distance effect is a huge issue when using MI with time series. After the conclusions in [1] and [6], some empirical tests were carried out. We conclude here that the problem can be solved by applying a different approach. Multiple Imputation was designed for working on cross-section databases, and a design close to a cross-section appearance seems not to work with time series, mainly due to the distance effect and the extraordinary increase in errors when estimating distant values. A new point of view is needed in order to use the technique with time series. In this new approach 2 issues need to be considered:

1. Proper construction of the Markov chain.

2. Noise from other variables of the database.

Let us see a normal time series X, which is a matrix with t rows and one column,

$$X = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_t \end{pmatrix}$$

One can add an auxiliary variable to the matrix, which is actually the first lag of X. We call the new data structure X'; it has the shape

$$X' = \begin{pmatrix} x_2 & x_1 \\ x_3 & x_2 \\ x_4 & x_3 \\ \vdots & \vdots \\ x_t & x_{t-1} \end{pmatrix}$$

Arranging the time series in this fashion we let the values of the past influence the recent values. One can add as many artificial variables as one considers appropriate; a short sketch of this construction follows.
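This lagged layout can be produced directly with base R's embed(), the same function used by the Appendix code; here the embedding dimension l = 3 gives the series plus two artificial variables (the toy values are ours).

# A short series with one missing value; embed(x, 3) returns rows (x_t, x_{t-1}, x_{t-2}).
x <- c(1.0, 1.2, NA, 1.5, 1.7, 1.6)
embed(x, 3)
# The missing value now appears once per column, along a diagonal of the matrix,
# so MI can draw on both past and future neighbours when simulating it.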

Now suppose we have a missing value in our time series,

$$X = (x_1, x_2, na, x_4, x_5, x_6, x_7, \ldots, x_t)'$$

One can build the matrix X' with two artificial variables,

$$X' = \begin{pmatrix} na & x_2 & x_1 \\ x_4 & na & x_2 \\ x_5 & x_4 & na \\ \vdots & \vdots & \vdots \\ x_t & x_{t-1} & x_{t-2} \end{pmatrix}$$

Notice that now the missing value appears 3 times and lies along one diagonal of the matrix. In a more complete case one might have a matrix like

$$X' = \begin{pmatrix} x_{t-2} & x_{t-3} & x_{t-4} & x_{t-5} \\ x_{t-1} & x_{t-2} & x_{t-3} & x_{t-4} \\ na & x_{t-1} & x_{t-2} & x_{t-3} \\ na & na & x_{t-1} & x_{t-2} \\ na & na & na & x_{t-1} \end{pmatrix}$$

Here one can identify 2 different submatrices: one is known information and the other one is missing information. Some considerations about the triangular matrix:

- if convergence is reached, the values along the diagonal should be close to each other.
- the best simulation should be the one with more supporting information.
- when the missing is far away from the known information, uncertainty will grow.

Now there are many plausible values which have to be pooled using Rubin's inference. The scalar of interest Q is the value of the missing cell we are looking for. First we need to calculate the average value of the scalar of interest,

$$\bar{Q} = \frac{1}{m} \sum_{t=1}^{m} \hat{Q}^{(t)} \quad (4)$$

and the total variance associated with Q,

$$T = \frac{1}{m} \sum_{t=1}^{m} \hat{S}^{(t)} + \left(1 + \frac{1}{m}\right) \frac{1}{m-1} \sum_{t=1}^{m} \left(\hat{Q}^{(t)} - \bar{Q}\right)^2 \quad (5)$$

The next step is to calculate the degrees of freedom for a small sample to carry out the inference,

$$df = \frac{m-1}{f^2}, \qquad f = \frac{(1 + 1/m)\,B}{T} \quad (6)$$

where f is the estimated fraction of missing information and B = (1/(m-1)) Σ (Q̂⁽ᵗ⁾ − Q̄)² is the between-imputation variance in (5); for small samples a corrected value df_c, which also accounts for the observed-data degrees of freedom through the factor (1 − f), is used instead. After all this process, inference based on the t distribution can be carried out,

$$T^{-1/2} \left(Q - \bar{Q}\right) \sim t_{df} \quad (7)$$

After doing this process the information can be pooled: there is a vector for each missing value with the following information,⁹

(Lower value of CI, Central value (Q), Upper value of CI, Degrees of freedom)

⁹ Lower stands for the left side of the confidence interval, Q stands for the average simulated value of the scalar of interest, and Upper stands for the right side of the confidence interval.

At this point one fact needs to be considered. If one adds too many artificial variables the results might be faulty again, especially when the frequency is low. Suppose we have annual frequency: if one adds, for instance, 15 artificial variables, then the simulated values may be smooth again. It is like saying that the value today is influenced by the value 15 years ago. We can see this fact in Figure 8: the performance rises until the optimal point L*, and after that the performance decreases.

Figure 8: MI efficiency using lags (efficiency against the number of artificial variables; L* marks the optimum).
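A sketch of the pooling computations (4)-(7) for one missing cell, assuming Qhat holds the m simulated values and Shat their within-imputation variances; the function name and arguments are ours, not the library's.

pool_missing <- function(Qhat, Shat, conf = 0.95) {
  m    <- length(Qhat)
  Qbar <- mean(Qhat)                               # equation (4)
  B    <- var(Qhat)                                # between-imputation variance
  Tvar <- mean(Shat) + (1 + 1/m) * B               # equation (5)
  f    <- (1 + 1/m) * B / Tvar                     # fraction of missing information
  df   <- (m - 1) / f^2                            # equation (6)
  half <- qt(1 - (1 - conf)/2, df) * sqrt(Tvar)    # equation (7): t-based interval
  c(lower = Qbar - half, Q = Qbar, upper = Qbar + half, df = df)
}
pool_missing(Qhat = c(10.2, 9.8, 10.5, 10.1), Shat = c(0.20, 0.25, 0.22, 0.18))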

5 Empirical tests

To illustrate what this paper has explained, we perform some simulations with different time series. We use the R language to do the programming and to perform 5 tests, using a library named multiple imputation simulation for time series.¹⁰ There are 2 main commands in the library:

mists() performs the whole Multiple Imputation process and builds the X' matrix. The syntax is: mists(data, iterations, number of data simulations, number of artificial variables). Note that in the calls below the last argument is the embedding dimension l, that is, one more than the number of artificial variables.

rubin.value() pools the simulations for each missing value and makes the inference calculation. The syntax is: rubin.value(object, missing number to make the inference).

The tests have the following structure: first of all we run the simulations with the mists() instruction, and second we make the inference over each simulated value using 95% confidence intervals.¹¹

¹⁰ This library has been programmed for the unpublished PhD thesis "Imputación Múltiple: definición y aplicaciones", see [5].
¹¹ Several missing values have been simulated by the 'Lag approach'. Only 5 examples are provided here to show the application of this technique.

5.1 Test 1

Time series: IBM prices. Frequency: daily. Period: 2007. Sample: 260. Iterations: 50. Artificial variables: 9. Simulation: the 7 last values.

mists(IBM,50,c(253:260),10)

5.2 Test 2

Time series: Apple prices. Frequency: daily. Period: 2007. Sample: 260. Iterations: 50. Artificial variables: 9. Simulation: the 7 last values.

5.3 Test 3

Time series: water temperature of the Pacific coast in the USA. Frequency: 10 minutes. Period: July 1974. Sample: 400. Iterations: 50. Artificial variables: 13. Simulation: the 9 first values.

mists(aqua,50,c(1:9),14)

5.4 Test 4

Time series: US output aggregate. Frequency: annual. Period: 1909-1949. Sample: 40. Iterations: 50. Artificial variables: 8. Simulation: the 5 first values. The actual values of the five simulated observations are 0.680, 0.652, 0.647, 0.616 and 0.623.

5.5 Test 5

The last test in this paper is different from the above. In this particular case we compare simulations obtained by the 'Lag methodology' with the ones obtained in [1]. Andreu and Cano (2008) found a pattern in the error which can be clearly noticed. To carry out this test we simulate again plausible values for the CocaCola Company time series.¹² The results for these simulations (in Figure 9) are clearly better if we compare them with the first 50 simulations obtained in Figure 4. The simulations do not show a rising pattern in the error, and the statistics are quite satisfying. The following table shows the main statistics of the Absolute Average Errors (AAEs) of Test 5:

Mean      0.0445
Median    0.0401
St.Dev    0.0312
Minimum   0.0018
Maximum   0.1212

The AAE (and also the median error) is around 4%, with a standard deviation of 3%.

¹² The first two months in 1962 are simulated using daily frequency, 50 iterations and 14 artificial variables. Available sample: 260.

Similar results can be obtained by repeating the whole simulation for the entire period with all the time series in [1]. It is possible to see in those simulations that AAEs decrease exponentially, showing that the new perspective (the 'lag approach') indeed helps to improve the performance of Multiple Imputation. The proposed approach to using MI with time series seems to avoid the two previously mentioned problems: an improperly constructed Markov chain and noise from other variables.

Figure 9: Errors of the lag-approach simulations for the CocaCola time series.

The lag construction becomes less reliable when few supporting observations are available (for example, when time series are on annual frequency). In this situation, more uncertainty appears in the simulations, so the loss of degrees of freedom is quite noticeable. More effort has to be put on this side to obtain a better application of MI.

Appendix: Code for the instruction mists()

mists <- function(x, y, z, l) {
  require(mice)                 # Multiple Imputation engine
  data <- x
  MRT  <- embed(data, l)        # lagged matrix of the complete series
  x[z] <- NA                    # delete the cells whose values will be simulated
  MR   <- embed(x, l)           # lagged matrix containing the missing diagonal
  MRS  <- mice(MR, maxit = y)   # run Multiple Imputation on the lagged matrix
  complete(MRS)                 # return the completed data
}
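A hypothetical session with the Appendix function; the series prices is a stand-in (the paper's stock data and the rubin.value() pooling routine belong to the unpublished library and are not reproduced here).

set.seed(1)
prices <- 100 + cumsum(rnorm(260, sd = 0.5))  # random-walk stand-in, 260 observations
filled <- mists(prices, 50, 253:260, 10)      # maxit = 50, simulate cells 253-260,
head(filled)                                  # embed dimension 10 (9 artificial variables)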
