Simulating Dependent Discrete Data


L. Madsen and D. Birkes
Department of Statistics, Oregon State University, Corvallis, Oregon 97331, USA
Corresponding author. Email: madsenl@onid.orst.edu

This article describes a method for simulating n-dimensional multivariate non-normal data, with emphasis on count-valued data. Dependence is characterised by either Pearson correlation or Spearman correlation. The simulation is accomplished by simulating a vector of correlated standard normal variates. The elements of this vector are then transformed to achieve target marginal distributions. We prove that the method corresponds to simulating data from a multivariate Gaussian copula. The simulation method does not restrict pairwise dependence beyond the limits imposed by the marginal distributions and can achieve any Pearson or Spearman correlation within those limits. Two examples are included. In the first example, marginal means, variances, Pearson correlations, and Spearman correlations are estimated from the epileptic seizure data set of Diggle, Liang, and Zeger [P. Diggle, P. Heagerty, K.Y. Liang, and S. Zeger, Analysis of Longitudinal Data, Oxford University Press, 2002]. Data with these means and variances are simulated, first to achieve the estimated Pearson correlations, and then to achieve the estimated Spearman correlations. The second example is a hypothetical time series of Poisson counts with seasonal mean ranging between 1 and 9 and an autoregressive(1) dependence structure.

Keywords: Count data; Pearson correlation; Rank correlation; Spearman correlation
AMS Subject Classification: 65C10

1. Introduction

Dependent non-normal data, particularly count-valued data, arise in many fields of study. The ability to simulate data resembling observed data is necessary to compare and investigate the behaviour of analytical procedures. It is customary to include simulation studies in statistical methodology research articles. These studies can be used to compare statistical procedures, to conduct power analyses, and to explore robustness. Another use of simulated data is the parametric bootstrap, where one simulates data according to a null hypothesized model, and the distribution of a test statistic emerges from repeated simulations.

It is surprisingly difficult to simulate dependent discrete random variables, particularly count-valued random variables with infinite support such as negative binomial or Poisson. One of the challenges is finding a method capable of simulating data over the entire range of possible dependence. Limits to Pearson correlation between Bernoulli random variables are well known. These limits are a consequence of the Fréchet-Hoeffding bounds [1], which induce margin-dependent bounds on correlation and on other measures of monotone dependence.

Another challenge is characterising dependence. For normal data, Pearson correlation perfectly describes dependence. For highly skewed distributions, researchers often choose to characterise monotone dependence by nonparametric measures such as Kendall's tau or Spearman's rho.

In this article, we describe a method to simulate random vectors of arbitrary length with specified discrete univariate marginal distributions and pairwise dependence, which may be specified by either Pearson correlation or Spearman correlation. Our method simulates data from a multivariate Gaussian copula and can achieve any Pearson or Spearman correlation within the constraints imposed by the Fréchet-Hoeffding bounds.

Other methods of simulating dependent discrete data suffer from more restrictive limitations on the degree of achievable dependence than those imposed by the theoretical bounds. Park et al. [2] develop a method for simulating correlated binary random variables, based on the observation of Holgate [3] that if Y1, Y2, and Y are independent Poisson with means λ1, λ2, and λ, then Y1 + Y and Y2 + Y are dependent Poisson with covariance λ. Park and Shin [4] extend the method to classes of distributions closed under summation. Madsen and Dalthorp [5] build on the algorithm of Park and Shin [4] to develop an "overlapping sums" method for generating vectors of count random variables with given mean, variance, and Pearson correlation. This method allows for high correlations between count random variables with similar means, but suffers from correlation limits well below the Fréchet-Hoeffding bounds when means are only moderately different. Furthermore, the method does not allow negative correlations.

Simulating a lognormal-Poisson hierarchy is a simple method to generate dependent counts, but cannot achieve even moderate correlation levels when the means are small. With this method, a vector of correlated normal random variables is generated, then exponentiated to form a vector of lognormal random variables. These lognormal random variables serve as means for a vector of conditionally independent Poisson random variables (see the sketch at the end of this section). Madsen and Dalthorp [5] give formulas for moments and correlations of the normal vector that will yield a lognormal-Poisson vector with specified moments and correlations.

Pearson and Spearman correlation are discussed in Section 2. Section 3 describes the simulation method. In Section 4 we show that the method can achieve any Pearson or Spearman correlation within the Fréchet-Hoeffding bounds. Section 5 gives two examples. The first example employs the epileptic seizure data of Diggle et al. [6]. We estimate marginal means and variances, as well as Pearson and Spearman correlations, from the data, then simulate data with these moments as targets. For the second example, we simulate data from a hypothetical Poisson time series with seasonally varying mean and AR(1) Pearson correlation.

For the special case when the target marginal distributions are Bernoulli, the simulation method developed in this article is given by Emrich and Piedmonte [7]. Spearman correlation, when rescaled as in Section 2, is equal to Pearson correlation for Bernoulli random variables. Shin and Pasupathy [8] give the method for Poisson random variables with specified Pearson correlation. We generalize the method to count-valued random variables with infinite support and either Pearson or Spearman correlation.
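The lognormal-Poisson hierarchy mentioned above can be sketched in a few lines of R. This is our own illustration, not the authors' code; the normal means, covariance, and sample size below are arbitrary placeholders rather than the matched values whose formulas Madsen and Dalthorp [5] derive.

```r
# Minimal sketch of the lognormal-Poisson hierarchy (illustrative parameters).
library(MASS)                       # for mvrnorm()

n     <- 10000                      # number of bivariate count vectors
mu    <- c(0, 0)                    # means of the underlying normals
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), 2, 2)  # covariance of the underlying normals

Z      <- mvrnorm(n, mu, Sigma)     # correlated normal vectors
lambda <- exp(Z)                    # lognormal conditional Poisson means
Y      <- matrix(rpois(length(lambda), lambda), ncol = 2)

cor(Y)   # achieved count correlation; attenuated relative to cor(Z),
         # severely so when the means exp(mu + diag(Sigma)/2) are small
```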
2. Pearson correlation and Spearman correlation

The linear association between random variables X and Y is described by the population correlation coefficient, also called the Pearson product-moment correlation coefficient,

\rho(X, Y) = \frac{E(XY) - E(X)E(Y)}{[\mathrm{var}(X)\,\mathrm{var}(Y)]^{1/2}}.   (1)

For bivariate normal (X, Y), ρ perfectly describes the dependence between X and Y. For non-normal distributions, nonparametric measures of monotone dependence such as Kendall's tau or Spearman's rho may more accurately capture the degree of dependence unless X and Y have a straight-line relationship. Mari and Kotz [9, Chapter 2] discuss drawbacks and limitations of ρ.

Kruskal [10] details several measures of dependence between random variables X and Y, including the Spearman correlation coefficient

\rho_S(X, Y) = 3\{P[(X_1 - X_2)(Y_1 - Y_3) > 0] - P[(X_1 - X_2)(Y_1 - Y_3) < 0]\},   (2)

where (X_1, Y_1) \stackrel{d}{=} (X, Y), X_2 \stackrel{d}{=} X, and Y_3 \stackrel{d}{=} Y such that X_2 and Y_3 are independent of one another and of (X_1, Y_1). For continuous marginals, (2) provides a satisfactory measure of monotone dependence. For discrete marginals, however, the non-zero probability of ties (two or more observations sharing the jth largest value in the sample) creates some difficulties. In particular, it can happen that the Spearman correlation of X with itself is less than 1 [11, Example 8]. To remedy this, we can rescale ρS as in Nešlehová [12, Definition 11]. For any pair of jointly distributed random variables X and Y, let p(x) = P(X = x) and q(y) = P(Y = y). Define the rescaled Spearman correlation coefficient to be

\rho_{RS}(X, Y) = \frac{\rho_S(X, Y)}{\{[1 - \sum_x p(x)^3][1 - \sum_y q(y)^3]\}^{1/2}}.   (3)

Note that when X and Y are continuous, the probability of ties is zero, and no rescaling is necessary. Accordingly, the denominator of (3) is 1 because p(x) = q(y) = 0 for all x, y. When X and Y are discrete, p(x) and q(y) are the respective probability mass functions.

An appealing feature of ρRS is that its sample analog is equal to the sample Pearson correlation coefficient of the midranks. Specifically, for a bivariate sample (X_1, Y_1), ..., (X_n, Y_n), if the distribution of (X, Y) is taken to be the empirical distribution function of the sample, (3) coincides with the sample Pearson correlation coefficient of the midranks [12, Theorem 15], commonly called the sample rank correlation. Midranks are used for ranking in the presence of ties and are computed as follows. If X_{i+1} = \cdots = X_{i+u} would have been assigned ranks p_1, ..., p_u had they not been tied, then for j = i+1, ..., i+u assign r(X_j) = u^{-1} \sum_{k=1}^{u} p_k, the average rank of these u observations in the absence of ties.
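As a quick illustration (ours, not the paper's): R's rank() function assigns midranks to ties by default (ties.method = "average"), so the sample version of (3) is simply the Pearson correlation of the ranks.

```r
# Sample rescaled Spearman correlation = Pearson correlation of midranks.
# rank() uses ties.method = "average" by default, i.e. midranks.
sample_rho_RS <- function(x, y) cor(rank(x), rank(y))

x <- c(0, 1, 1, 2, 3, 3)   # small count sample with ties
y <- c(1, 0, 2, 2, 4, 3)
sample_rho_RS(x, y)
```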

3. Simulation method

This section describes the method for simulating a vector Y of length n where each Yi has a given discrete marginal distribution function Fi, and each pair (Yi, Yj) has a given Pearson correlation (1) or rescaled Spearman correlation coefficient (3).

The simulation method begins by generating an n-vector Z of standard normal random variables with Pearson correlation matrix ΣZ, i.e. the ijth element of ΣZ is ρ(Zi, Zj). Each Zi is then transformed to Ui = Φ(Zi), where Φ is the univariate standard normal distribution function. The Ui are uniform on (0, 1) [13, Theorem 2.1.10], and ρS(Zi, Zj) = ρS(Ui, Uj). Each Ui is then transformed to Yi = Fi^{-1}(Ui), where

F_i^{-1}(u) = \inf\{y : F_i(y) \ge u\},   (4)

ensuring that Yi ~ Fi, even when Fi is not continuous.

The elements of ΣZ are chosen to yield the desired Pearson or Spearman correlations among the Yi. Details are given below for count-valued Yi and for Bernoulli Yi.

When the Yi are discrete, one must take care to distinguish Spearman correlation ρS from its rescaled version ρRS. In particular, if target Spearman correlations are estimated from the midranks of data, the resulting estimate is of ρRS and must be multiplied by \{[1 - \sum_x p(x)^3][1 - \sum_y q(y)^3]\}^{1/2}, the denominator of (3), to obtain the target ρS. This is the situation illustrated by the seizure example of Section 5.

3.1. Connection of the simulation method to the Gaussian copula

A bivariate copula is a bivariate distribution function with uniform marginals. The bivariate Gaussian copula is given by C(u, v) = Φδ[Φ^{-1}(u), Φ^{-1}(v)], where Φ is the univariate standard normal distribution function, and Φδ is the bivariate standard normal distribution function with correlation parameter δ. By Sklar's theorem [14], H(y1, y2) = Φδ{Φ^{-1}[F1(y1)], Φ^{-1}[F2(y2)]} defines a bivariate probability distribution with marginals F1 and F2. The multivariate Gaussian copula is the logical extension to n-dimensional distributions, and, since Sklar's theorem holds for arbitrary n, yields a joint distribution function for a random vector [Y1, ..., Yn] with given marginal distribution functions F1, ..., Fn:

H(y_1, \ldots, y_n) = \Phi_\Sigma\{\Phi^{-1}[F_1(y_1)], \ldots, \Phi^{-1}[F_n(y_n)]\},   (5)

where ΦΣ represents the n-variate standard normal distribution function with correlation matrix Σ.

The relationship between standard normal Zi and count-valued Yi is Yi = Fi^{-1}[Φ(Zi)]. Equation (4) implies that for integer y,

Y_i \le y \text{ if and only if } Z_i \le \Phi^{-1}[F_i(y)],
Y_i \ge y \text{ if and only if } Z_i > \Phi^{-1}[F_i(y-1)].   (6)

Zi and Zj are elements of the multivariate normal vector Z, so (Zi, Zj) is bivariate normal.

Proposition 3.1: The simulation method proposed in this section produces [Y1, ..., Yn] with marginal distribution functions F1, ..., Fn and joint distribution given by (5).

Proof: Let Yi, Zi, and Fi^{-1}, i = 1, ..., n, be defined as above. By (6), P(Y1 ≤ y1, ..., Yn ≤ yn) = P{Z1 ≤ Φ^{-1}[F1(y1)], ..., Zn ≤ Φ^{-1}[Fn(yn)]}, which is (5).

Any 2-dimensional marginal H(yi, yj) of (5) is given by a bivariate Gaussian copula. The elements of the n × n copula correlation matrix Σ in (5) are determined by finding the correlation parameter δ for each 2-dimensional marginal.
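Once ΣZ is in hand, the simulation step itself is only a few lines. The following R sketch uses three illustrative Poisson marginals and an illustrative ΣZ (in practice ΣZ comes from solving the equations of Section 3.2 for each pair); it relies on the fact that R's quantile functions such as qpois() implement the generalized inverse (4).

```r
# Simulate N replicates of Y = (Y1, Y2, Y3) from the Gaussian copula.
# Sigma_Z and the Poisson means are illustrative placeholders.
Sigma_Z <- matrix(c(1.0, 0.6, 0.3,
                    0.6, 1.0, 0.6,
                    0.3, 0.6, 1.0), nrow = 3)
lambda  <- c(2, 5, 9)                    # target Poisson means

N <- 10000
Z <- matrix(rnorm(N * 3), N, 3) %*% chol(Sigma_Z)  # rows ~ N(0, Sigma_Z)
U <- pnorm(Z)                                      # uniform(0, 1) marginals
Y <- sapply(1:3, function(i) qpois(U[, i], lambda[i]))  # Y_i = F_i^{-1}(U_i)

colMeans(Y)   # approximately lambda
cor(Y)        # count correlations; note these differ from Sigma_Z
```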

3.2. Simulating counts

Suppose the target marginals are count-valued with distribution functions Fi and probability mass functions fi, i = 1, ..., n. Let µi and σi² denote E(Yi) and var(Yi) respectively. We first describe the method to simulate Yi ~ Fi, i = 1, ..., n, with specified pairwise Pearson correlations ρ(Yi, Yj).

For count-valued random variables Yi and Yj, E(Y_i Y_j) = \sum_{r=0}^{\infty} \sum_{s=0}^{\infty} P(Y_i > r, Y_j > s). Thus, using the two-dimensional marginal distribution function of (5), Pearson correlation (1) can be written as

\rho(Y_i, Y_j) = \frac{1}{\sigma_i \sigma_j} \left( \sum_{r=0}^{\infty} \sum_{s=0}^{\infty} \big( 1 - F_i(r) - F_j(s) + \Phi_\delta\{\Phi^{-1}[F_i(r)], \Phi^{-1}[F_j(s)]\} \big) - \mu_i \mu_j \right).   (7)

Given a target Pearson correlation ρ(Yi, Yj) for each pair i ≠ j, the necessary correlation ρ(Zi, Zj) is found by numerically solving (7) for δ. The correlation matrix ΣZ is obtained by solving (7) for each unique combination {Fi, Fj, ρ(Yi, Yj)}.

A similar method achieves specified Spearman correlation. Denote the target (unrescaled) Spearman correlations by ρS(Yi, Yj). Using the expression in (2) and supposing Yi′ ~ Fi and Yj′ ~ Fj with Yi′ and Yj′ independent of each other and of Yi and Yj, ρS(Yi, Yj) can be written as

\rho_S(Y_i, Y_j) = 3[P(Y_i < Y_i', Y_j < Y_j') + P(Y_i > Y_i', Y_j > Y_j')
                   - P(Y_i < Y_i', Y_j > Y_j') - P(Y_i > Y_i', Y_j < Y_j')]   (8)

= 3 \sum_{r=0}^{\infty} \sum_{s=0}^{\infty} f_i(r) f_j(s) [P(Y_i \le r-1, Y_j \le s-1) + P(Y_i \ge r+1, Y_j \ge s+1)
   - P(Y_i \le r-1, Y_j \ge s+1) - P(Y_i \ge r+1, Y_j \le s-1)].   (9)

Using (6), the right-hand side of equation (9) can be written as

\rho_S(Y_i, Y_j) = 3 \sum_{r=0}^{\infty} \sum_{s=0}^{\infty} f_i(r) f_j(s) \big( \Phi_\delta\{\Phi^{-1}[F_i(r-1)], \Phi^{-1}[F_j(s-1)]\}
   + \Phi_\delta\{\Phi^{-1}[1 - F_i(r)], \Phi^{-1}[1 - F_j(s)]\}
   - \Phi_{-\delta}\{\Phi^{-1}[F_i(r-1)], \Phi^{-1}[1 - F_j(s)]\}
   - \Phi_{-\delta}\{\Phi^{-1}[1 - F_i(r)], \Phi^{-1}[F_j(s-1)]\} \big).   (10)

Again, the correlation matrix ΣZ is obtained by solving (10) for each unique combination {Fi, Fj, ρS(Yi, Yj)}.

The simulation algorithm requires that ΣZ be positive definite, or that it have a positive definite submatrix with the remaining off-diagonal elements equal to ±1 (see Section 4 for details).
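To show how (7) might be solved in practice, here is a hedged R sketch for two Poisson marginals. It uses pmvnorm() from the mvtnorm package for Φδ and uniroot() for the root search; the function and argument names are our own, not the authors' code, the truncation point K is chosen informally rather than by the bound of Section 3.5, and the double loop is deliberately naive where the authors' implementation is vectorised.

```r
library(mvtnorm)   # pmvnorm() evaluates the bivariate normal CDF

# Pearson correlation (7) implied by copula parameter delta, for two
# Poisson marginals, truncating both infinite sums at K terms.
rho_from_delta <- function(delta, lam_i, lam_j, K = 60) {
  R  <- matrix(c(1, delta, delta, 1), 2, 2)
  Fi <- ppois(0:(K - 1), lam_i)
  Fj <- ppois(0:(K - 1), lam_j)
  total <- 0
  for (r in 1:K) {
    for (s in 1:K) {
      joint <- as.numeric(pmvnorm(upper = c(qnorm(Fi[r]), qnorm(Fj[s])),
                                  corr = R))
      total <- total + (1 - Fi[r] - Fj[s] + joint)  # P(Yi > r-1, Yj > s-1)
    }
  }
  (total - lam_i * lam_j) / sqrt(lam_i * lam_j)     # Poisson: sigma^2 = lambda
}

# Solve (7) for delta given a target correlation between the counts.
delta_for <- function(target, lam_i, lam_j) {
  uniroot(function(d) rho_from_delta(d, lam_i, lam_j) - target,
          interval = c(-0.99, 0.99))$root
}

delta_for(0.5, lam_i = 2, lam_j = 7)   # delta needed for rho(Yi, Yj) = 0.5
```

Solving (10) follows the same pattern with the four Φ±δ terms weighted by fi(r) fj(s).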

3.3. Simulating binary variates

For completeness, we summarise the method for the special case of Bernoulli marginals. This algorithm was developed by Emrich and Piedmonte [7]. A little algebra verifies that for Yi ~ Bernoulli(µi), ρRS(Yi, Yj) = ρ(Yi, Yj). To simulate with given ρ(Yi, Yj), the correlation δ = ρ(Zi, Zj) must be the solution to

\Phi_\delta[\Phi^{-1}(\mu_i), \Phi^{-1}(\mu_j)] = \rho(Y_i, Y_j)[\mu_i(1 - \mu_i)\mu_j(1 - \mu_j)]^{1/2} + \mu_i \mu_j,

and

Y_i = F_i^{-1}[\Phi(Z_i)] = \begin{cases} 1 & \text{if } \Phi(Z_i) > 1 - \mu_i \\ 0 & \text{if } \Phi(Z_i) \le 1 - \mu_i. \end{cases}

3.4. Simulating continuous non-normal random variables with specified Spearman correlation

If the marginals Fi are continuous, then each Fi^{-1} is a strictly increasing function on (0, 1), so ρS(Yi, Yj) = ρS(Ui, Uj) = ρS(Zi, Zj). The elements ρ(Zi, Zj) of ΣZ needed to yield a target ρS(Yi, Yj) are determined by the relation

\rho_S(Z_i, Z_j) = \frac{6}{\pi} \arcsin[\rho(Z_i, Z_j)/2],

given by Kruskal [10].

Achieving a target Pearson correlation in the continuous case entails approximating E(Y_i Y_j) = \int\!\!\int y_1 y_2 \, d\Phi_\delta\{\Phi^{-1}[F_1(y_1)], \Phi^{-1}[F_2(y_2)]\}. The numerical approximation method will vary depending on the marginal distributions. Since our focus is discrete marginals, we do not pursue this problem here.

3.5. Computing

The algorithm described in Section 3.2 is computationally intensive. The difficulty is that equations (7) and (10) must be solved numerically, and they must be solved multiple times in order to obtain the correlation matrix ΣZ. We have implemented the algorithm in R [15], which costs nothing but is much slower than a compiled language like C or Fortran. To minimise computing effort, our code avoids loops and makes use of vectorised functions. We also solve (7) or (10) only for unique combinations of {Fi, Fj, ρ(Yi, Yj)} or {Fi, Fj, ρS(Yi, Yj)}. Because ΣZ is symmetric with 1's along the diagonal, it is necessary to solve (7) or (10) at most n(n − 1)/2 times. Each unique combination of marginal distributions and correlation does not depend on any other combination, so solving (7) or (10) for δ can easily be done in parallel, which would reduce computing time.

In Section 5, we give the computing time required to solve (7) or (10) for each of the three examples described.

To implement the algorithm, the infinite sums in (7) and (10) must be approximated by finite sums. Appendix C gives a bound on the error in approximating (10). Given a tolerance ε for approximating (10) by a finite sum, set the upper limit for the index r to

K_i = \lceil F_i^{-1}[(1 - \varepsilon/6)^{1/2}] \rceil,   (11)

where ⌈x⌉ denotes the smallest integer ≥ x. Replacing i with j in (11) gives the upper limit for s. Plugging Ki and Kj into the bound given by Lemma C.1 implies that the absolute difference between (10) and the approximation is no more than ε.

Shin and Pasupathy [8] bound the error in approximating (7) when the marginal distributions are Poisson, but their bound employs a single upper limit K for both sums. For the examples in Section 5, we found Ki = 4⌈Fi^{-1}(0.9975)⌉ sufficient. We have been unable to find an error bound for approximating (7) for arbitrary marginal distributions.
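To make (11) concrete: for a Poisson marginal the truncation point can be computed directly in R, since qpois() is Fi^{-1}. A small sketch with an arbitrary example tolerance:

```r
# Truncation point (11) for the sums in (10), for a Poisson marginal.
# qpois() already returns an integer, so ceiling() is a no-op here,
# but it is kept to match the general formula for arbitrary F_i.
eps    <- 1e-4                                   # example tolerance
lambda <- 5
K_i    <- ceiling(qpois(sqrt(1 - eps / 6), lambda))
K_i    # upper limit for the index r; replace lambda to get K_j for s
```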

4. Limits on dependence

For any bivariate distribution function with marginals F1 and F2, the pointwise upper bound is M(y1, y2) = min[F1(y1), F2(y2)] and the pointwise lower bound is W(y1, y2) = max[F1(y1) + F2(y2) − 1, 0]. These are the Fréchet-Hoeffding bounds [1]. Furthermore, M and W define upper and lower limits for the Pearson correlation (1); that is, if we let ρ(M) and ρ(W) denote the Pearson correlation between random variables with joint distribution M and W respectively, then for any (Y1, Y2) with marginals F1 and F2, ρ(Y1, Y2) ∈ [ρ(W), ρ(M)] [16]. Corollary 3.2 of [17] establishes that Spearman's ρ similarly falls between bounds determined by M and W.

Chaganty and Joe [18] discuss the consequences of these bounds on correlation matrices for vectors of Bernoulli variates. They conduct a simulation study to compare methods of generating vectors of correlated Bernoulli data, and observe that Emrich and Piedmonte's method [7] generally achieves a wider range of correlations than other methods. This observation illustrates the result we prove in Theorem 4.1. Madsen and Dalthorp [5] give an expression for the maximum Pearson correlation between count-valued random variables and show that the simulation method of Park and Shin [4] imposes more restrictive limits.

The bivariate Gaussian copula achieves the Fréchet-Hoeffding bounds M and W by setting δ = 1 and δ = −1 respectively. Thus our simulation method is capable of simulating (Yi, Yj) with maximum or minimum ρ or ρS. Note however that setting an off-diagonal entry of ΣZ to ±1 will destroy the positive-definiteness of ΣZ. If maximal or minimal ρ or ρS is desired between Yi and Yj, one would simulate the random vector [Y1, ..., Y_{j−1}, Y_{j+1}, ..., Yn] using the method described in Section 3, then set Yj = Fj^{-1}[Φ(Zi)] to achieve ρ(Yi, Yj) = ρ(M), or set Yj = Fj^{-1}[Φ(−Zi)] to achieve ρ(Yi, Yj) = ρ(W). The same procedure achieves ρS(M) or ρS(W).

For n = 2 and given marginal distributions F1 and F2, the simulation method of Section 3.2 can achieve not only the maximum and minimum ρ but, as the following theorem shows, any value in between.
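The boundary construction just described is easy to check numerically. A sketch (ours) with two illustrative Poisson marginals, where Yj is derived from Zi or −Zi rather than from its own normal variate:

```r
# Achieving the Fréchet-Hoeffding bounds between Y_i and Y_j:
# simulate the vector without Y_j, then derive Y_j from Z_i directly.
lam_i <- 2; lam_j <- 7               # illustrative Poisson means
Z_i   <- rnorm(100000)               # Z_i from the simulated normal vector

Y_i   <- qpois(pnorm(Z_i),  lam_i)
Y_max <- qpois(pnorm(Z_i),  lam_j)   # comonotone pair: rho(M)
Y_min <- qpois(pnorm(-Z_i), lam_j)   # countermonotone pair: rho(W)

cor(Y_i, Y_max)   # approximates the upper bound rho(M)
cor(Y_i, Y_min)   # approximates the lower bound rho(W)
```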
